Invalidate prior A/B results + add proper experiment harness

Prior cross-machine comparison (commit 1e86285) was invalid: dash0
baseline used warm instances with residual KV cache, inflating TTFT
by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start
requests; WARM TTFT p90=3.3s vs fresh=0.26s.

Fair same-machine comparison (both fresh restart on dash0):
  Baseline:    TTFT50=1.075 TPOT90=0.076 E2E50=5.075 OK=198/200
  Elastic P2P: TTFT50=1.018 TPOT90=0.085 E2E50=6.977 OK=195/200

Elastic is WORSE due to Mooncake kv_both memory overhead.

Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
  add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
  and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
190
REPORT.md
@@ -161,127 +161,106 @@ done

## 3. Results

### 3.1 End-to-End Performance
> **Errata (2026-05-22)**: The initial cross-machine A/B (dash0 baseline vs dash1 elastic) reported -44% E2E improvement. Post-hoc analysis revealed the dash0 baseline instances were **not freshly restarted**: residual KV cache from prior experiments caused 2× TTFT inflation. All results below use verified fresh-restart experiments on the same machine.

| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p50 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|----------|---------|
| Baseline linear | 198/200 | 2.383s | 27.622s | 0.069s | 0.117s | 10.232s |
| Baseline LMetric | 198/200 | 1.099s | 9.392s | 0.063s | 0.073s | 5.205s |
| Elastic P2P (cap=4) | 185/196 | **1.315s** | **13.179s** | **0.066s** | **0.075s** | **5.708s** |
### 3.1 Fair Comparison (all fresh-restart, same machine dash0, 200 req)

> Note: "Baseline linear" was run on dash0 during the initial A/B (different machine load conditions).
> "Baseline LMetric" was run on fresh-restart dash0, same conditions as "Baseline linear (fresh)" below in §3.6.
| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|---------|
| **Baseline (no Mooncake)** | **198/200** | **1.075s** | **9.384s** | **0.076s** | **5.075s** |
| LMetric routing | 198/200 | 1.099s | 9.392s | 0.073s | 5.205s |
| Elastic P2P (kv_both) | 195/200 | 1.018s | 11.312s | 0.085s | 6.977s |

### 3.2 KV Cache Hit Ratio
### 3.2 Per-Class Breakdown

Sampled from vLLM instance logs at end of experiment:
**Baseline (fresh):**

**Baseline** (local prefix cache only):
| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|-------|-------|---|----------|----------|----------|
| WARM (<5k) | 46 | 23% | 0.137s | 0.262s | 0.061s |
| MEDIUM (5-20k) | 50 | 25% | 0.921s | 1.846s | 0.079s |
| HEAVY (20-50k) | 64 | 32% | 2.660s | 6.278s | 0.076s |
| HEAVY (>50k) | 38 | 19% | 9.587s | 30.415s | 0.102s |
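
The per-class rows above can be reproduced from per-request records with a few lines of Python. The sketch below is illustrative only: the field names `input_length`, `ttft_s` and `error`, and the index-based percentile, follow the summary helper in `scripts/bench.sh`, while `classify`, `percentile`, `per_class_ttft` and the treatment of the exact 50k boundary are assumptions chosen to match the table's bins.

```python
# Illustrative sketch: bucket requests by prompt length and report TTFT percentiles.
# Bin edges mirror the table (5k / 20k / 50k tokens); all function names are hypothetical.

def classify(input_length: int) -> str:
    """Map a prompt length (tokens) to the class labels used in the table."""
    if input_length < 5_000:
        return "WARM"
    if input_length < 20_000:
        return "MEDIUM"
    if input_length < 50_000:
        return "HEAVY (20-50k)"
    return "HEAVY (>50k)"  # assumption: exactly 50k falls in the upper bin


def percentile(values, q):
    """Index-based percentile: the element at position int(q * n) of the sorted list."""
    if not values:
        return 0.0
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def per_class_ttft(rows):
    """Group successful requests by class and compute count, p50 and p90 TTFT."""
    buckets = {}
    for r in rows:
        if r.get("error") or not r.get("ttft_s"):
            continue  # failed requests are excluded, as in the summary helper
        buckets.setdefault(classify(r["input_length"]), []).append(r["ttft_s"])
    return {
        cls: {"n": len(v), "p50": percentile(v, 0.5), "p90": percentile(v, 0.9)}
        for cls, v in buckets.items()
    }


sample = [
    {"input_length": 1_000, "ttft_s": 0.1},
    {"input_length": 1_200, "ttft_s": 0.3},
    {"input_length": 30_000, "ttft_s": 2.0},
    {"input_length": 90_000, "ttft_s": 9.0, "error": "timeout"},
]
summary = per_class_ttft(sample)
```

With only a handful of samples the index-based percentile is coarse, so the per-class p90 values for the smaller classes should be read with that in mind.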

| Instance | Prefix APC |
|----------|-----------|
| inst_0 | 48.6% |
| inst_3 | 3.8% |
| inst_7 | 68.3% |
| **Std dev** | **~33pp** |
**Elastic P2P (fresh):**

**Elastic** (local prefix + Mooncake external):
| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|-------|-------|---|----------|----------|----------|
| WARM (<5k) | 46 | 23% | 0.142s | 0.279s | 0.072s |
| MEDIUM (5-20k) | 50 | 25% | 0.766s | 1.814s | 0.197s |
| HEAVY (>20k) | 99 | 51% | 6.390s | 22.668s | 0.085s |

| Instance | Prefix APC | External APC | Effective |
|----------|-----------|-------------|-----------|
| inst_0 | 37.8% | 31.6% | 69.4% |
| inst_3 | 36.6% | 34.2% | 70.8% |
| inst_7 | 25.0% | 0.0% | 25.0% |
| **Prefix std** | **~7pp** | | |

Key finding: elastic has **much more uniform** prefix APC across instances (std ~7pp vs ~33pp), and Mooncake external cache adds 30-34pp on active decode instances.

### 3.3 GPU Utilization

| Config | Mean | Min | Max | Imbalance |
|--------|------|-----|-----|-----------|
| Baseline | 28.7% | 20% | 38% | 1.9× |
| Elastic | 15.8% | 7.6% | 30.4% | 3.0× |

### 3.4 Success Rate
### 3.3 Success Rate

| Config | OK | Total | Rate | Failure mode |
|--------|-----|-------|------|-------------|
| Baseline | 198 | 200 | 99.0% | Generic timeout |
| Elastic | 185 | 196 | 94.4% | Mooncake transfer timeout on >60k requests |
| Baseline | 198 | 200 | 99.0% | RemoteProtocolError (replayer-side) |
| Elastic P2P | 195 | 200 | 97.5% | 2× RemoteProtocolError + 3× ReadTimeout on >60k |

### 3.5 Per-Class TTFT Breakdown (Baseline Combined)
Elastic's 3 extra errors are D-side KV pull failures: prefill succeeded on P, KV pushed to Mooncake, but D never produced first token (decode scheduler couldn't allocate KV cache space). Prefill timeout fallback (120s → co-located) was never triggered.
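
The timeout-and-fallback behaviour mentioned above can be sketched with `asyncio.wait_for`. This is a minimal, self-contained illustration rather than the proxy's code: `remote_prefill`, `colocated_prefill` and `prefill_with_fallback` are hypothetical stand-ins for the P-instance and co-located paths, and the timeout is shortened from the 120 s used in `cache_aware_proxy.py` so the example finishes quickly.

```python
import asyncio

PREFILL_TIMEOUT_S = 0.05  # the commit uses 120; shortened here for the demo


async def remote_prefill(delay_s: float) -> str:
    """Hypothetical stand-in for a prefill request sent to a P instance."""
    await asyncio.sleep(delay_s)
    return "remote"


async def colocated_prefill() -> str:
    """Hypothetical stand-in for running prefill on the decode instance itself."""
    return "colocated"


async def prefill_with_fallback(delay_s: float) -> str:
    try:
        # wait_for cancels the inner awaitable once the deadline passes
        return await asyncio.wait_for(remote_prefill(delay_s), timeout=PREFILL_TIMEOUT_S)
    except asyncio.TimeoutError:
        # Deadline exceeded: fall back to the co-located path
        return await colocated_prefill()


fast = asyncio.run(prefill_with_fallback(0.0))  # finishes before the deadline
slow = asyncio.run(prefill_with_fallback(1.0))  # exceeds the deadline
```

A guard of this shape only covers the prefill leg. Per the paragraph above, the observed failures occurred on the decode side after prefill had already succeeded, so it would not have caught them.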

| Class | Count | % | Input p50 | TTFT p50 | TTFT p90 |
|-------|-------|---|-----------|----------|----------|
| WARM (<5k) | 46 | 23% | 1,095 | 0.133s | 0.260s |
| MEDIUM (5-20k) | 50 | 25% | 10,879 | 0.873s | 1.808s |
| HEAVY (20-50k) | 64 | 32% | 34,368 | 2.589s | 6.302s |
| HEAVY (>50k) | 38 | 19% | 83,018 | 9.563s | 30.480s |
### 3.4 Routing Policy: Linear vs LMetric (OSDI'26)

HEAVY requests (51% of traffic) dominate tail latency. Elastic offloads precisely these.
LMetric (`score = P_tokens × BS`) vs linear (`score = ongoing_tokens - α·cache_hit`). Both fresh-restart, same trace.

### 3.6 Routing Policy Comparison: Linear vs LMetric (OSDI'26)
| Policy | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 | Delta E2E |
|--------|----------|----------|----------|---------|-----------|
| Linear | 1.086s | 9.432s | 0.077s | 5.423s | n/a |
| LMetric | 1.099s | 9.392s | 0.073s | 5.205s | **-4.0%** |

LMetric (Zhang et al., OSDI'26) replaces linear combination `score = load - α·cache_hit` with hyperparameter-free multiplication `score = P_tokens × BS`:
- **P_tokens** = pending prefill tokens on instance + new request's uncached tokens
- **BS** = batch size (waiting + running request count) + 1
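
As a rough illustration of the two scoring rules, the sketch below routes a request to the instance with the lowest score. The `Instance` fields, the `alpha` value, the choice to express cache hits in tokens, and the helper names are all hypothetical; only the two formulas and the definitions of `P_tokens` and `BS` come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    ongoing_tokens: int          # load term used by the linear policy
    pending_prefill_tokens: int  # prefill tokens already queued on the instance
    waiting: int                 # requests waiting in the scheduler queue
    running: int                 # requests currently being decoded


def linear_score(inst: Instance, cache_hit_tokens: int, alpha: float = 1.0) -> float:
    # score = ongoing_tokens - alpha * cache_hit (alpha is a tunable weight)
    return inst.ongoing_tokens - alpha * cache_hit_tokens


def lmetric_score(inst: Instance, uncached_tokens: int) -> int:
    p_tokens = inst.pending_prefill_tokens + uncached_tokens
    bs = inst.waiting + inst.running + 1
    return p_tokens * bs  # no weight to tune


def pick(instances, score_fn):
    """Route to the instance with the lowest score."""
    return min(instances, key=score_fn)


# Two instances with similar token load but very different batch sizes.
a = Instance("inst_0", ongoing_tokens=30_000, pending_prefill_tokens=2_000, waiting=0, running=1)
b = Instance("inst_1", ongoing_tokens=25_000, pending_prefill_tokens=2_000, waiting=0, running=8)
request_tokens = 20_000  # assume no cached prefix on either instance

linear_choice = pick([a, b], lambda i: linear_score(i, cache_hit_tokens=0))
lmetric_choice = pick([a, b], lambda i: lmetric_score(i, uncached_tokens=request_tokens))
```

In this made-up case the linear rule prefers `inst_1` because its token load is lower, while the product rule prefers `inst_0` because the larger batch on `inst_1` multiplies the prefill cost.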
LMetric provides modest improvement through better load balancing. Routing policy headroom is limited for this workload.

Both experiments: 8× TP=1 fresh-restart instances on dash0, same trace (200 req, time_scale=20).
### 3.5 Errata: Why Prior Cross-Machine A/B Was Invalid

| Policy | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|---------|
| Linear | 198/200 | 1.086s | 9.432s | 0.0773s | 5.423s |
| LMetric | 198/200 | 1.099s | 9.392s | 0.0727s | 5.205s |
| **Delta** | | **+1.2%** | **-0.4%** | **-5.9%** | **-4.0%** |
The initial comparison (commit `1e86285`) reported:
```
Baseline (dash0): TTFT50=2.383  E2E50=10.232  ← WRONG (warm instances)
Elastic  (dash1): TTFT50=1.315  E2E50=5.708
Delta:            -45%          -44%          ← INVALID
```

Per-class breakdown:
**Evidence that prior baseline was not fresh:**
1. `inst_7` APC = 68.3%: impossible from 25 cold-start requests (max ~25%)
2. WARM TTFT p90 = 3.327s (fresh = 0.262s, 12.7× gap), which indicates KV cache memory pressure from prior experiments
3. HEAVY TPOT p90 = 0.154s (fresh = 0.076s, 2.0× gap), consistent with heavy prefill-decode interference from a full KV cache

| Class | Linear TTFT p50 | LMetric TTFT p50 | Linear TPOT p90 | LMetric TPOT p90 |
|-------|----------------|-----------------|----------------|-----------------|
| WARM (<5k, n=46) | 0.143s | 0.134s | 0.058s | 0.061s |
| MEDIUM (5-20k, n=50) | 0.921s | 0.809s | 0.078s | 0.073s |
| HEAVY (>20k, n=102) | 4.875s | 4.943s | 0.078s | 0.074s |

APC comparison (prefix cache hit rate per instance):

| | Linear | LMetric |
|--|--------|---------|
| Mean | 32.5% | 30.8% |
| Std | ~22pp | ~19pp |
| Range | 3.3%–63.3% | 4.9%–67.2% |

**Analysis**: LMetric provides modest improvements in TPOT (-5.9%) and E2E (-4.0%) through better load balancing (the multiplication naturally penalizes overloaded instances). TTFT is unchanged because HEAVY requests dominate and session affinity constrains routing freedom. APC skew is slightly reduced. The improvement is far smaller than elastic P2P offload (-44% E2E), confirming that for agentic workloads, **the bottleneck is prefill-decode interference, not routing policy**.

Data: `outputs/ab_linear/` and `outputs/ab_lmetric/` on dash0. Logs: `/tmp/lmetric_ab_inst_*.log` (linear) and `/tmp/lmetric_inst_*.log` (LMetric).
The elastic numbers on dash1 were genuinely fresh. The "improvement" was actually comparing fresh elastic against degraded baseline.

## 4. System-Level Analysis

### 4.1 Why Elastic Wins Despite Lower GPU Utilization
### 4.1 Elastic P2P Does Not Improve Single-Machine Performance

**Mechanism 1: Eliminating prefill-decode interference (TPOT -36%)**
Under fair comparison (same machine, both fresh):

In combined mode, vLLM chunked prefill interleaves prefill and decode. An 80k-token HEAVY prefill occupies the GPU for seconds, delaying co-resident decode. Elastic routes heavy prefill to a different instance, so the decode pipeline is uninterrupted.
| Metric | Baseline | Elastic | Delta |
|--------|----------|---------|-------|
| TTFT p50 | 1.075s | 1.018s | -5.3% |
| TTFT p90 | 9.384s | 11.312s | +20.5% |
| TPOT p90 | 0.076s | 0.085s | +11.6% |
| E2E p50 | 5.075s | 6.977s | +37.5% |
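
The Delta column is a relative change against the baseline. A small sketch of that arithmetic, using the rounded values from the table (the function and variable names are illustrative):

```python
def pct_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (baseline, elastic) pairs in seconds, copied from the table above
table = {
    "TTFT p50": (1.075, 1.018),
    "TTFT p90": (9.384, 11.312),
    "E2E p50": (5.075, 6.977),
}
deltas = {name: round(pct_delta(b, e), 1) for name, (b, e) in table.items()}
```

Applying the same formula to the rounded TPOT p90 values (0.076 s and 0.085 s) gives roughly +11.8% rather than the +11.6% shown, which suggests the table's deltas were computed from unrounded measurements.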

Evidence: TPOT p90 drops from 0.117s (baseline) to 0.075s (elastic).
Elastic is **worse** on all metrics except TTFT p50. Root causes:

**Mechanism 2: Better effective cache utilization (TTFT -45%)**
**1. Mooncake kv_both memory overhead**

Baseline APC is skewed (3.8%–68.3%) because heavy prefills evict other sessions' cached blocks. Elastic preserves D-instance prefix chains by offloading heavy prefills to P instances. Combined with Mooncake external cache, effective APC reaches ~70% on active instances vs ~40% baseline average.
Each instance with `kv_role=kv_both` maintains RDMA buffers + Mooncake bootstrap server, reducing GPU memory available for KV cache. This affects ALL requests (including WARM/MEDIUM that don't use P2P transfer), causing more cache eviction and higher TPOT.

**Mechanism 3: Faster KV cache turnover**
Evidence: MEDIUM TPOT p90 = 0.197s (elastic) vs 0.079s (baseline), which is **2.5× worse** despite MEDIUM requests not using P2P at all.

Lower GPU utilization (15.8% vs 28.7%) is not waste; it reflects that requests complete 44% faster. Less contention → decode finishes faster → KV cache freed sooner → next request starts faster. The same total work completes in 56% of the wall time.
**2. D-side KV pull failures**

### 4.2 Known Limitation: GPU Load Imbalance
3 HEAVY requests completed prefill on P instance successfully but D-side never produced first token. The KV cache on D was too full to allocate space for the transferred blocks. These became 600s timeouts.

Elastic has 3.0× imbalance (7.6% min vs 30.4% max) vs baseline's 1.9×.
**3. P2P overhead without proportional benefit**

Root causes:
1. **P-instance concentration**: Previous implementation always picked the globally least-loaded instance as P, concentrating P-role work on the same few idle instances.
2. **Session skew**: Some sessions have many turns with large inputs, keeping their pinned instance busy while others go idle.
The P2P path adds: prefill queue on P (p50=6.3s) + KV transfer + decode start on D (p50=0.8s). For requests where the D instance isn't under heavy prefill load (which is the case on fresh instances), co-located execution is faster.

**Implemented fix** (in latest `cache_aware_proxy.py`): Round-robin P-instance selection with overload skip, replacing `argmin(ongoing_tokens)`. Needs validation in next experiment cycle.
### 4.2 When Elastic P2P Could Help

Elastic P2P is designed for the scenario where D-instance decode is disrupted by co-located heavy prefill. On fresh instances with 200 requests, this contention is moderate. The benefit may emerge under:
- Higher sustained load (1000+ concurrent requests)
- Longer experiment duration (KV cache fills up, eviction pressure increases)
- Multi-machine deployment (P on a different node, no memory competition)

## 5. Data & Log Locations

@@ -289,12 +268,14 @@ Root causes:

| Directory | Machine | Config | Notes |
|-----------|---------|--------|-------|
| `outputs/ab_baseline/` | dash0 | Combined 8× TP=1 | Fair A/B baseline (§3) |
| `outputs/ab_elastic/` | dash1 | Elastic P2P cap=4 | Fair A/B elastic (§3) |
| `outputs/ab_baseline/` | dash0 | Combined 8× TP=1 | ~~Initial A/B~~ (INVALIDATED: warm instances) |
| `outputs/ab_elastic/` | dash1 | Elastic P2P cap=4 | ~~Initial A/B~~ (INVALIDATED) |
| `outputs/baseline_stability_fresh/` | dash0 | Combined 8× fresh | **Canonical baseline** (§3.1) |
| `outputs/elastic_stability_*/` | dash0 | Elastic P2P kv_both fresh | **Canonical elastic** (§3.1) |
| `outputs/ab_linear/` | dash0 | Linear policy, 200 req | §3.4 routing policy comparison |
| `outputs/ab_lmetric/` | dash0 | LMetric policy, 200 req | §3.4 routing policy comparison |
| `outputs/gpu_ab_combined/` | local | Combined 8× TP=1 | Earlier run, has gpu_util.csv |
| `outputs/gpu_ab_pdsep/` | local | PD-Sep 4P+4D | Earlier run, has gpu_util.csv |
| `outputs/ab_linear/` | dash0 | Linear policy, 200 req | §3.6 routing policy comparison |
| `outputs/ab_lmetric/` | dash0 | LMetric policy, 200 req | §3.6 routing policy comparison |
| `outputs/exp2_combined_tp1_dp8/` | local | Combined 8× TP=1 | 1000 req, cache-aware |
| `outputs/exp3_pd_sep_tp1_mooncake/` | local | PD-Sep 4P+4D Mooncake | 1000 req |

@@ -367,6 +348,8 @@ agentic-kv/
|--------|-------------|-----------|
| `scripts/cache_aware_proxy.py` | Global scheduler + elastic offload proxy | `--combined`, `--offload`, `--policy {linear,lmetric}`, `--heavy-threshold`, `--bootstrap-ports` |
| `scripts/run_lmetric_ab.sh` | A/B: linear vs lmetric routing policy | Runs both experiments with fresh restart |
| `scripts/run_elastic_stability_test.sh` | Elastic vs baseline with full isolation | Fresh start/stop per experiment |
| `scripts/bench.sh` | Standard single-experiment harness | `--tag`, `--mode {baseline,elastic}` |
| `scripts/sample_trace.py` | Sample complete sessions from cluster trace | `--target-requests`, `--seed` |
| `python -m replayer` | Replay trace against vLLM endpoint | `--time-scale`, `--max-inflight-sessions`, `--request-limit` |
| `scripts/gpu_monitor.sh` | Sample nvidia-smi to CSV | Pipe to `outputs/<tag>/gpu_util.csv` |
@@ -376,18 +359,23 @@ agentic-kv/

### Established findings:
1. Full PD separation is **net negative** for single-machine agentic workloads (KV cache memory wall)
2. Cache-aware session-sticky routing is the **dominant optimization** (+24pp APC, -60% TTFT)
3. Elastic P2P offload achieves **-45% TTFT, -36% TPOT, -44% E2E** by selectively isolating heavy prefills while preserving decode cache locality
4. The GPU utilization paradox (lower util but better performance) is explained by higher per-request efficiency
5. LMetric (OSDI'26) multiplication-based routing provides modest improvement over linear (**E2E -4%, TPOT -6%**), confirming that routing policy alone has limited headroom; the bottleneck is prefill-decode interference
2. Cache-aware session-sticky routing is the **dominant optimization** (+24pp APC, -60% TTFT vs round-robin)
3. **Elastic P2P offload does NOT improve single-machine performance**: Mooncake kv_both memory overhead (+11% TPOT, +37% E2E) outweighs prefill isolation benefit under moderate load (200 req)
4. LMetric (OSDI'26) provides modest **E2E -4%** over linear routing; routing policy headroom is limited
5. **Experimental methodology matters**: warm vs fresh instances cause 2× TTFT difference; all comparisons must use verified fresh restart

### Lessons learned:
- Prior cross-machine A/B (commit `1e86285`) was invalid: warm baseline inflated by 2× due to residual KV cache state
- `kv_role=kv_both` has non-trivial always-on overhead even when P2P transfer is not used
- Experiment isolation (kill all → verify GPU free → fresh start) is critical for reproducibility
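
The isolation step in the last bullet can be enforced programmatically. The sketch below mirrors the check in `scripts/bench.sh`, which sums the output of `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` and aborts when the total exceeds 100; the function names and the Python port itself are illustrative, not part of the commit.

```python
import subprocess

FREE_THRESHOLD_MIB = 100  # same cutoff bench.sh applies to the summed usage


def total_used_mib(smi_output: str) -> int:
    """Sum per-GPU memory.used values (one integer per line)."""
    return sum(int(token) for token in smi_output.split())


def gpus_are_fresh(smi_output: str, threshold: int = FREE_THRESHOLD_MIB) -> bool:
    """True when the summed usage is at or below the threshold."""
    return total_used_mib(smi_output) <= threshold


def query_nvidia_smi() -> str:
    # Requires an NVIDIA driver; not called by the sample data below.
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout


# Hypothetical readings for a 4-GPU host: idle, and with one leftover process
idle = "4\n4\n3\n4\n"
busy = "4\n81234\n3\n4\n"
```

On a real host, `gpus_are_fresh(query_nvidia_smi())` would be evaluated after the kill step and before any instance is launched, and the run aborted if it returns `False`.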

### Open problems:
1. GPU load imbalance (3.0× in elastic): round-robin P fix implemented, needs validation
2. Elastic success rate (94.4%): Mooncake transfer timeouts on >60k requests
3. Scaling to multi-machine (cross-node Mooncake transfers not yet tested)
4. Adaptive offload threshold (fixed 20k may not be optimal for all load levels)
5. Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router pipeline for ablation)
1. Elastic P2P may help under **sustained high load** (KV cache pressure makes co-located interference worse); needs 1000-req experiment
2. Mooncake kv_both memory overhead quantification and potential lazy initialization
3. Multi-machine elastic (P on different node, no memory competition)
4. Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)
5. `scripts/bench.sh` standardized harness to prevent future warm-instance mistakes

---

*Generated from experiments run on 2026-05-22. Git commits: `1e86285` (elastic A/B), `2b0ac70` (phase 1 milestone), subsequent LMetric implementation.*
*Updated 2026-05-22. Prior elastic A/B results (commit `1e86285`) invalidated; see §3.5 errata.*

@@ -13,12 +13,14 @@ We benchmarked PD separation (prefill-decode disaggregation) against PD co-locat

Per-request breakdown shows **87.7% of TTFT** is spent waiting for KV cache memory on decode instances, not prefill compute or KV transfer.

**Elastic P2P offload** (selective disaggregation of HEAVY requests only) recovers the wins of PD separation without the memory wall:
**Elastic P2P offload** (selective disaggregation of HEAVY requests only, Mooncake kv_both): under fair same-machine fresh-restart comparison, elastic does NOT improve over baseline. Mooncake kv_both memory overhead outweighs prefill isolation benefit at moderate load.

| Config (TP=1, 8×H20) | TTFT p50 | TPOT p90 | E2E p50 | Effective APC |
|---|---|---|---|---|
| Combined DP=8 (baseline) | 2.383s | 0.117s | 10.232s | ~40% (skewed) |
| Elastic P2P (cap=4) | **1.315s** | **0.075s** | **5.708s** | ~70% (balanced) |
| Config (TP=1, 8×H20, fresh) | TTFT p50 | TPOT p90 | E2E p50 |
|---|---|---|---|
| Combined DP=8 (baseline) | **1.075s** | **0.076s** | **5.075s** |
| Elastic P2P (kv_both, cap=4) | 1.018s | 0.085s | 6.977s |

> Earlier cross-machine comparison (commit `1e86285`) was invalidated because the baseline used warm instances. See REPORT.md §3.5.
| **Delta** | **-45%** | **-36%** | **-44%** | **+30pp** |

---

300
scripts/bench.sh
Executable file
@@ -0,0 +1,300 @@
#!/bin/bash
# Standardized single-experiment harness with guaranteed fresh state.
#
# GUARANTEES:
#   1. All GPU processes killed before start (verified via nvidia-smi)
#   2. All GPU processes killed after finish (clean for next experiment)
#   3. Fresh vLLM instances + proxy for every run
#   4. All outputs saved to outputs/<tag>/ with metrics, breakdown, APC, GPU snapshot
#
# Usage:
#   bash scripts/bench.sh --tag my_experiment --mode baseline
#   bash scripts/bench.sh --tag my_experiment --mode elastic
#   bash scripts/bench.sh --tag my_experiment --mode baseline --policy lmetric
#   bash scripts/bench.sh --tag my_experiment --mode elastic --requests 1000

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
VENV="${VENV_PATH:-$PROJECT_DIR/.venv/bin}"
PYTHON="$VENV/python"
VLLM="$VENV/vllm"
MODEL="${MODEL_PATH:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
TRACE="$PROJECT_DIR/traces/sampled_1000req_seed42.jsonl"

# Defaults
TAG=""
MODE="baseline"   # baseline | elastic
POLICY="linear"   # linear | lmetric
N_INSTANCES=8
BASE_PORT=8000
PROXY_PORT=9090
REQUESTS=200
TIME_SCALE=20
MAX_SESSIONS=8
HEAVY_THRESHOLD=20000

# Parse args
while [[ $# -gt 0 ]]; do
    case "$1" in
        --tag) TAG="$2"; shift 2 ;;
        --mode) MODE="$2"; shift 2 ;;
        --policy) POLICY="$2"; shift 2 ;;
        --requests) REQUESTS="$2"; shift 2 ;;
        --time-scale) TIME_SCALE="$2"; shift 2 ;;
        --sessions) MAX_SESSIONS="$2"; shift 2 ;;
        --heavy-threshold) HEAVY_THRESHOLD="$2"; shift 2 ;;
        *) echo "Unknown: $1"; exit 1 ;;
    esac
done

if [ -z "$TAG" ]; then
    echo "Usage: bench.sh --tag NAME --mode {baseline|elastic} [--policy {linear|lmetric}] [--requests N]"
    exit 1
fi

OUTDIR="$PROJECT_DIR/outputs/$TAG"
if [ -d "$OUTDIR" ] && [ -f "$OUTDIR/metrics.jsonl" ]; then
    echo "[ERROR] Output directory $OUTDIR already exists with data. Use a different --tag."
    exit 1
fi
mkdir -p "$OUTDIR"

# Save experiment config
cat > "$OUTDIR/config.json" << CONF
{
  "tag": "$TAG",
  "mode": "$MODE",
  "policy": "$POLICY",
  "model": "$MODEL",
  "n_instances": $N_INSTANCES,
  "requests": $REQUESTS,
  "time_scale": $TIME_SCALE,
  "max_sessions": $MAX_SESSIONS,
  "heavy_threshold": $HEAVY_THRESHOLD,
  "timestamp": "$(date -Iseconds)",
  "hostname": "$(hostname)"
}
CONF

# ─── GPU Cleanup (verified) ────────────────────────────────────────────────

cleanup_gpu() {
    echo "[cleanup] Killing all vLLM/proxy processes..."
    for p in $(ps aux | grep -E 'vllm serve|cache_aware_proxy' | grep -v grep | awk '{print $2}' 2>/dev/null); do
        kill -9 "$p" 2>/dev/null || true
    done
    sleep 3
    # Kill any remaining GPU holders
    local gpu_pids
    gpu_pids=$(fuser /dev/nvidia* 2>/dev/null | tr ' ' '\n' | sort -u | grep -v '^$' || true)
    if [ -n "$gpu_pids" ]; then
        echo "[cleanup] Killing GPU-holding PIDs: $gpu_pids"
        echo "$gpu_pids" | xargs -r kill -9 2>/dev/null || true
        sleep 5
    fi
    # Verify GPUs are free
    local used
    used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits 2>/dev/null | awk '{s+=$1} END{print s}')
    if [ "${used:-0}" -gt 100 ]; then
        echo "[ERROR] GPUs still have ${used}MB allocated after cleanup. Aborting."
        nvidia-smi --query-gpu=index,memory.used --format=csv,noheader
        exit 1
    fi
    echo "[cleanup] All GPUs verified free."
}

# ─── Launch vLLM instances ─────────────────────────────────────────────────

launch_instances() {
    echo "[launch] Starting $N_INSTANCES vLLM instances (mode=$MODE)..."

    local kv_config=""
    if [ "$MODE" = "elastic" ]; then
        # Inner single quotes keep the JSON as one word through eval; unquoted
        # braces containing commas would otherwise be brace-expanded.
        kv_config="--kv-transfer-config '{\"kv_connector\":\"MooncakeConnector\",\"kv_role\":\"kv_both\"}'"
    fi

    for i in $(seq 0 $((N_INSTANCES - 1))); do
        local port=$((BASE_PORT + i))
        local master=$((29500 + i))
        local logfile="$OUTDIR/vllm_inst_${i}.log"

        local env_prefix="MASTER_PORT=$master CUDA_VISIBLE_DEVICES=$i"
        if [ "$MODE" = "elastic" ]; then
            env_prefix="VLLM_MOONCAKE_BOOTSTRAP_PORT=$((8998 + i)) $env_prefix"
        fi

        eval "$env_prefix $VLLM serve '$MODEL' \
            --host 0.0.0.0 --port $port \
            --tensor-parallel-size 1 \
            --trust-remote-code --enable-prefix-caching --enforce-eager \
            --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
            $kv_config \
            > '$logfile' 2>&1 &"

        echo " inst_$i: GPU=$i port=$port"
        sleep 2  # stagger to avoid port collision
    done

    # Wait for health
    echo "[launch] Waiting for instances to become healthy..."
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        local port=$((BASE_PORT + i))
        local tries=0
        while ! curl -sf "http://127.0.0.1:$port/health" > /dev/null 2>&1; do
            tries=$((tries + 1))
            if [ $tries -ge 120 ]; then
                echo "[FAIL] Instance $i (port $port) failed to start. Log:"
                tail -10 "$OUTDIR/vllm_inst_${i}.log"
                cleanup_gpu
                exit 1
            fi
            sleep 5
        done
        echo " inst_$i healthy"
    done

    # Wait for bootstrap (elastic only)
    if [ "$MODE" = "elastic" ]; then
        echo "[launch] Waiting for Mooncake bootstrap servers..."
        for i in $(seq 0 $((N_INSTANCES - 1))); do
            local bp=$((8998 + i))
            local tries=0
            while ! curl -sf "http://127.0.0.1:$bp/query" > /dev/null 2>&1; do
                tries=$((tries + 1))
                if [ $tries -ge 60 ]; then
                    echo "[FAIL] Bootstrap $bp failed"
                    cleanup_gpu
                    exit 1
                fi
                sleep 2
            done
            echo " bootstrap $bp ready"
        done
    fi
}

# ─── Launch proxy ──────────────────────────────────────────────────────────

launch_proxy() {
    echo "[proxy] Starting (mode=$MODE, policy=$POLICY)..."
    local combined_args=""
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        combined_args="$combined_args http://127.0.0.1:$((BASE_PORT + i))"
    done

    local extra_args="--policy $POLICY"
    if [ "$MODE" = "elastic" ]; then
        local bp_list=""
        for i in $(seq 0 $((N_INSTANCES - 1))); do
            bp_list="${bp_list:+$bp_list,}$((8998 + i))"
        done
        extra_args="$extra_args --offload --heavy-threshold $HEAVY_THRESHOLD --bootstrap-ports $bp_list"
    fi

    $PYTHON "$PROJECT_DIR/scripts/cache_aware_proxy.py" \
        --combined $combined_args \
        --port $PROXY_PORT \
        $extra_args \
        > "$OUTDIR/proxy.log" 2>&1 &

    # Wait for proxy
    local tries=0
    while ! curl -sf "http://127.0.0.1:$PROXY_PORT/stats" > /dev/null 2>&1; do
        tries=$((tries + 1))
        if [ $tries -ge 30 ]; then
            echo "[FAIL] Proxy failed to start"
            cleanup_gpu
            exit 1
        fi
        sleep 2
    done
    echo "[proxy] Ready on port $PROXY_PORT"
}

# ─── Run benchmark ─────────────────────────────────────────────────────────

run_benchmark() {
    echo "[bench] Running $REQUESTS requests (time_scale=$TIME_SCALE, sessions=$MAX_SESSIONS)..."
    $PYTHON -m replayer \
        --trace "$TRACE" \
        --output "$OUTDIR/metrics.jsonl" \
        --endpoint "http://localhost:$PROXY_PORT" \
        --model "$MODEL" \
        --time-scale "$TIME_SCALE" \
        --max-inflight-sessions "$MAX_SESSIONS" \
        --request-limit "$REQUESTS" \
        -v 2>&1 | tee "$OUTDIR/replayer.log"
}

# ─── Collect artifacts ─────────────────────────────────────────────────────

collect_artifacts() {
    echo "[collect] Saving artifacts..."
    curl -sf "http://localhost:$PROXY_PORT/breakdown" > "$OUTDIR/breakdown.json" 2>/dev/null || true
    curl -sf "http://localhost:$PROXY_PORT/stats" > "$OUTDIR/stats.json" 2>/dev/null || true
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
        --format=csv > "$OUTDIR/gpu_snapshot.csv" 2>/dev/null || true

    # APC from vLLM logs
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        pch=$(grep "Prefix cache hit rate" "$OUTDIR/vllm_inst_${i}.log" 2>/dev/null | tail -1 | grep -oP "Prefix cache hit rate: \K[0-9.]+" || echo "0")
        ech=$(grep "External prefix cache hit rate" "$OUTDIR/vllm_inst_${i}.log" 2>/dev/null | tail -1 | grep -oP "External prefix cache hit rate: \K[0-9.]+" || echo "")
        ext_str=""
        [ -n "$ech" ] && ext_str=" ext=$ech%"
        echo "inst_$i: prefix=$pch%$ext_str"
    done | tee "$OUTDIR/apc.txt"
}

# ─── Summary ───────────────────────────────────────────────────────────────

print_summary() {
    $PYTHON -c "
import json
rows = [json.loads(l) for l in open('$OUTDIR/metrics.jsonl')]
ok = [r for r in rows if not r.get('error')]
err = [r for r in rows if r.get('error')]
p = lambda v,q: sorted(v)[min(int(q*len(v)),len(v)-1)] if v else 0
ttfts = sorted([r['ttft_s'] for r in ok if r.get('ttft_s')])
tpots = sorted([r['tpot_s'] for r in ok if r.get('tpot_s') and r['tpot_s']>0])
e2es = sorted([r['latency_s'] for r in ok])
print()
print('=' * 70)
print(' RESULT: $TAG ($MODE, $POLICY)')
print('=' * 70)
print(' OK=%d/%d (%.1f%%) TTFT50=%.3f TTFT90=%.3f TPOT90=%.4f E2E50=%.3f' % (
    len(ok), len(rows), len(ok)*100/len(rows), p(ttfts,.5), p(ttfts,.9), p(tpots,.9), p(e2es,.5)))
for lo,hi,cl in [(0,5000,'WARM'),(5000,20000,'MEDIUM'),(20000,200000,'HEAVY')]:
    sub = [r for r in ok if lo <= r['input_length'] < hi and r.get('ttft_s')]
    if sub:
        t = sorted([r['ttft_s'] for r in sub])
        tp = sorted([r['tpot_s'] for r in sub if r.get('tpot_s') and r['tpot_s']>0])
        print(' %-8s n=%3d TTFT50=%.3f TTFT90=%.3f TPOT90=%.4f' % (
            cl, len(sub), p(t,.5), p(t,.9), p(tp,.9) if tp else 0))
if err:
    print(' Errors (%d):' % len(err))
    for e in err[:5]:
        print(' input=%d %s' % (e['input_length'], str(e.get('error',''))[:60]))
print(' Output: $OUTDIR/')
print('=' * 70)
"
}

# ─── Main ──────────────────────────────────────────────────────────────────

echo "================================================================"
echo " bench.sh: $TAG"
echo " mode=$MODE policy=$POLICY requests=$REQUESTS"
echo " $(date)"
echo "================================================================"

cleanup_gpu
launch_instances
launch_proxy
run_benchmark
collect_artifacts
print_summary
cleanup_gpu

echo "[done] $(date)"
@@ -361,13 +361,22 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
     return StreamingResponse(generate(), media_type="text/event-stream")


+PREFILL_TIMEOUT_S = 120  # max seconds to wait for P-instance prefill
+
+
 async def _handle_heavy_offload(api, req_data, headers, token_ids, input_length,
                                 p_inst, d_inst, breakdown):
-    """HEAVY request: prefill on p_inst, KV via Mooncake, decode on d_inst."""
+    """HEAVY request: prefill on p_inst, KV via Mooncake, decode on d_inst.
+
+    On prefill timeout/failure, falls back to co-located decode on d_inst.
+    """
+    global _offload_inflight
     request_id = headers.get("X-Request-Id", "")
+    estimated_new = breakdown.get("estimated_new_tokens", 0)

     # Step 1: Await prefill on p_inst (ongoing_tokens already reserved by caller)
     breakdown["t_prefill_sent"] = _time.monotonic()
+    prefill_ok = False
     try:
         prefill_data = req_data.copy()
         prefill_data["kv_transfer_params"] = {
@@ -381,25 +390,56 @@ async def _handle_heavy_offload(api, req_data, headers, token_ids, input_length,
         prefill_data.pop("stream_options", None)

         p_headers = {**headers, "X-data-parallel-rank": "0"}
-        resp = await p_inst.client.post(api, json=prefill_data, headers=p_headers)
+        resp = await asyncio.wait_for(
+            p_inst.client.post(api, json=prefill_data, headers=p_headers),
+            timeout=PREFILL_TIMEOUT_S,
+        )
         resp.raise_for_status()
         await resp.aclose()
         p_inst.record_prefix(token_ids)
         breakdown["t_prefill_done"] = _time.monotonic()
+        prefill_ok = True
     except Exception as e:
         breakdown["t_prefill_done"] = _time.monotonic()
-        breakdown["error"] = str(e)
-        _breakdown_log.append(breakdown)
-        global _offload_inflight
-        _offload_inflight = max(0, _offload_inflight - 1)
-        p_inst.num_requests -= 1
-        raise HTTPException(status_code=502, detail="Prefill failed: %s" % e)
+        breakdown["prefill_error"] = str(e)
     finally:
+        # Always release P-instance resources exactly once
         p_inst.ongoing_tokens -= input_length
-        p_inst.pending_prefill_tokens -= breakdown.get("estimated_new_tokens", 0)
+        p_inst.pending_prefill_tokens -= estimated_new
+        p_inst.num_requests -= 1
         _offload_inflight = max(0, _offload_inflight - 1)

-    p_inst.num_requests -= 1
+    if not prefill_ok:
+        # Fallback: co-located prefill+decode on d_inst (no KV transfer)
+        breakdown["route_class"] = "HEAVY_COLO_FALLBACK"
+        d_inst.ongoing_tokens += input_length
+        d_inst.pending_prefill_tokens += estimated_new
+        d_inst.num_requests += 1
+
+        async def generate_fallback():
+            prefill_done = False
+            try:
+                async with d_inst.client.stream("POST", api, json=req_data, headers=headers) as resp:
+                    resp.raise_for_status()
+                    async for chunk in resp.aiter_bytes():
+                        if not prefill_done:
+                            d_inst.pending_prefill_tokens -= estimated_new
+                            d_inst.ongoing_decode_tokens += input_length
+                            breakdown["t_first_token"] = _time.monotonic()
+                            prefill_done = True
+                        yield chunk
+                d_inst.record_prefix(token_ids)
+            finally:
+                if not prefill_done:
+                    d_inst.pending_prefill_tokens -= estimated_new
+                else:
+                    d_inst.ongoing_decode_tokens -= input_length
+                d_inst.ongoing_tokens -= input_length
+                d_inst.num_requests -= 1
+                breakdown["t_done"] = _time.monotonic()
+                _breakdown_log.append(breakdown)
+
+        return StreamingResponse(generate_fallback(), media_type="text/event-stream")

     # Step 2: Stream decode on d_inst (pulls KV from Mooncake)
     d_inst.ongoing_tokens += input_length
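The hunk above bounds the prefill call with `asyncio.wait_for` and moves every P-instance release into `finally`, so that success, failure, and timeout each decrement the counters once. A minimal self-contained sketch of that pattern follows. The `Instance` class, `slow_prefill`, and `prefill_with_timeout` are hypothetical stand-ins for illustration and do not reproduce the proxy's real objects or HTTP calls.

```python
import asyncio


class Instance:
    """Hypothetical stand-in for the proxy's per-instance load counters."""

    def __init__(self):
        self.num_requests = 0


async def slow_prefill(delay_s):
    # Stand-in for the HTTP prefill call; it just sleeps.
    await asyncio.sleep(delay_s)
    return "ok"


async def prefill_with_timeout(inst, delay_s, timeout_s):
    inst.num_requests += 1  # reservation (done by the caller in the real proxy)
    ok = False
    try:
        await asyncio.wait_for(slow_prefill(delay_s), timeout=timeout_s)
        ok = True
    except Exception:
        # Timeout or failure: no release here, so nothing is decremented twice.
        pass
    finally:
        inst.num_requests -= 1  # the single release point for every outcome
    return ok


async def main():
    inst = Instance()
    fast = await prefill_with_timeout(inst, delay_s=0.01, timeout_s=0.5)
    slow = await prefill_with_timeout(inst, delay_s=0.5, timeout_s=0.01)
    return fast, slow, inst.num_requests


result = asyncio.run(main())
print(result)
```

Because the `except` branch only records the outcome, the counter returns to zero after both the fast and the timed-out call, and the caller can decide on a fallback from the returned flag.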
329
scripts/run_elastic_stability_test.sh
Executable file
@@ -0,0 +1,329 @@
#!/bin/bash
# Elastic P2P stability test: runs 200-request benchmark with offload mode
# and baseline mode, then compares success rates.
#
# Must be run on dash0 (8 GPUs with Mooncake support).
#
# Usage:
#   bash scripts/run_elastic_stability_test.sh

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
VENV="${VENV_PATH:-$PROJECT_DIR/.venv/bin}"
PYTHON="$VENV/python"
VLLM="$VENV/vllm"

MODEL="${MODEL_PATH:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
TRACE="$PROJECT_DIR/traces/sampled_1000req_seed42.jsonl"
N_INSTANCES=8
BASE_PORT=8000
PROXY_PORT=9090
REQUEST_LIMIT=200
TIME_SCALE=20
MAX_INFLIGHT=8
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

OUT_ELASTIC="$PROJECT_DIR/outputs/elastic_stability_${TIMESTAMP}"
OUT_BASELINE="$PROJECT_DIR/outputs/baseline_stability_${TIMESTAMP}"

# ─── Helper functions ────────────────────────────────────────────────────────

kill_all() {
    echo "[cleanup] Killing vLLM and proxy processes..."
    for p in $(ps aux | grep 'vllm serve' | grep -v grep | awk '{print $2}' 2>/dev/null); do
        kill -9 "$p" 2>/dev/null || true
    done
    for p in $(ps aux | grep 'cache_aware_proxy' | grep -v grep | awk '{print $2}' 2>/dev/null); do
        kill -9 "$p" 2>/dev/null || true
    done
    sleep 5
    echo "[cleanup] Releasing GPUs..."
    for p in $(fuser /dev/nvidia* 2>/dev/null | tr ' ' '\n' | sort -u); do
        kill -9 "$p" 2>/dev/null || true
    done
    sleep 10
    echo "[cleanup] Done."
}

wait_for_instances() {
    local n=$1
    echo "[wait] Waiting for $n vLLM instances to become healthy..."
    for i in $(seq 0 $((n - 1))); do
        local port=$((BASE_PORT + i))
        local tries=0
        while ! curl -sf "http://127.0.0.1:$port/health" > /dev/null 2>&1; do
            tries=$((tries + 1))
            if [ $tries -ge 120 ]; then
                echo "[FAIL] Instance $i (port $port) did not start in 600s"
                return 1
            fi
            sleep 5
        done
        echo " Instance $i (port $port) healthy"
    done
}

wait_for_bootstrap() {
    echo "[wait] Waiting for Mooncake bootstrap servers..."
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        local bp=$((8998 + i))
        local tries=0
        while ! curl -sf "http://127.0.0.1:$bp/query" > /dev/null 2>&1; do
            tries=$((tries + 1))
            if [ $tries -ge 60 ]; then
                echo "[FAIL] Bootstrap $bp did not start in 120s"
                return 1
            fi
            sleep 2
        done
        echo " Bootstrap $bp ready"
    done
}

wait_for_proxy() {
    echo "[wait] Waiting for proxy on port $PROXY_PORT..."
    local tries=0
    while ! curl -sf "http://127.0.0.1:$PROXY_PORT/stats" > /dev/null 2>&1; do
        tries=$((tries + 1))
        if [ $tries -ge 30 ]; then
            echo "[FAIL] Proxy did not start in 60s"
            return 1
        fi
        sleep 2
    done
    echo " Proxy ready"
}

launch_vllm_kv_both() {
    echo ""
    echo "=== Launching $N_INSTANCES vLLM instances (kv_both) ==="
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        local port=$((BASE_PORT + i))
        local bp=$((8998 + i))
        local master=$((29500 + i))
        local log="/tmp/elastic_test_${i}.log"

        VLLM_MOONCAKE_BOOTSTRAP_PORT=$bp \
        MASTER_PORT=$master \
        CUDA_VISIBLE_DEVICES=$i \
        $VLLM serve "$MODEL" \
            --host 0.0.0.0 --port "$port" --tensor-parallel-size 1 \
            --trust-remote-code --enable-prefix-caching --enforce-eager \
            --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
            --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
            > "$log" 2>&1 &

        echo " Instance $i: GPU=$i port=$port bootstrap=$bp log=$log"
        sleep 2
    done
    wait_for_instances $N_INSTANCES
    wait_for_bootstrap
}

launch_vllm_baseline() {
    echo ""
    echo "=== Launching $N_INSTANCES vLLM instances (baseline, no Mooncake) ==="
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        local port=$((BASE_PORT + i))
        local master=$((29500 + i))
        local log="/tmp/baseline_test_${i}.log"

        MASTER_PORT=$master \
        CUDA_VISIBLE_DEVICES=$i \
        $VLLM serve "$MODEL" \
            --host 0.0.0.0 --port "$port" --tensor-parallel-size 1 \
            --trust-remote-code --enable-prefix-caching --enforce-eager \
            --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
            > "$log" 2>&1 &

        echo " Instance $i: GPU=$i port=$port log=$log"
        sleep 2
    done
    wait_for_instances $N_INSTANCES
}

launch_proxy_elastic() {
    echo ""
    echo "=== Starting proxy (elastic offload mode) ==="
    local combined_args=""
    local bp_list=""
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        combined_args="$combined_args http://127.0.0.1:$((BASE_PORT + i))"
        bp_list="${bp_list:+$bp_list,}$((8998 + i))"
    done
    $PYTHON "$PROJECT_DIR/scripts/cache_aware_proxy.py" \
        --combined $combined_args \
        --bootstrap-ports "$bp_list" \
        --offload --heavy-threshold 20000 \
        --port $PROXY_PORT \
        > /tmp/proxy_elastic.log 2>&1 &
    wait_for_proxy
}

launch_proxy_baseline() {
    echo ""
    echo "=== Starting proxy (baseline, no offload) ==="
    local combined_args=""
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        combined_args="$combined_args http://127.0.0.1:$((BASE_PORT + i))"
    done
    $PYTHON "$PROJECT_DIR/scripts/cache_aware_proxy.py" \
        --combined $combined_args \
        --port $PROXY_PORT \
        > /tmp/proxy_baseline.log 2>&1 &
    wait_for_proxy
}

run_benchmark() {
    local tag=$1
    local output_dir=$2
    mkdir -p "$output_dir"

    echo ""
    echo "=== Running benchmark: $tag ($REQUEST_LIMIT requests) ==="
    $PYTHON -m replayer \
        --trace "$TRACE" \
        --output "$output_dir/metrics.jsonl" \
        --endpoint "http://localhost:$PROXY_PORT" \
        --model "$MODEL" \
        --time-scale "$TIME_SCALE" \
        --max-inflight-sessions "$MAX_INFLIGHT" \
        --request-limit "$REQUEST_LIMIT" \
        -v 2>&1 | tee "$output_dir/replayer.log"

    # Save proxy breakdown and stats
    curl -sf "http://localhost:$PROXY_PORT/breakdown" > "$output_dir/breakdown.json" 2>/dev/null || true
    curl -sf "http://localhost:$PROXY_PORT/stats" > "$output_dir/stats.json" 2>/dev/null || true
}

collect_gpu_util() {
    local output_dir=$1
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
        --format=csv > "$output_dir/gpu_snapshot.csv" 2>/dev/null || true
}

print_summary() {
    local label=$1
    local output_dir=$2
    local metrics="$output_dir/metrics.jsonl"

    if [ ! -f "$metrics" ]; then
        echo " [$label] No metrics file found!"
        return
    fi

    # Count total, success, error from metrics JSONL
    local total=$(wc -l < "$metrics")
    local success=$(grep -c '"error":null\|"error": null' "$metrics" 2>/dev/null || grep -c '"ttft":[0-9]' "$metrics" 2>/dev/null || echo 0)
    local errors=$((total - success))
    local rate="N/A"
    if [ "$total" -gt 0 ]; then
        rate=$(awk "BEGIN{printf \"%.1f\", ($success/$total)*100}")
    fi

    echo " [$label]"
    echo " Total requests: $total"
    echo " Successful: $success"
    echo " Errors: $errors"
    echo " Success rate: ${rate}%"

    # Print summary.json if it exists
    local summary="$output_dir/metrics.summary.json"
    if [ -f "$summary" ]; then
        echo " Summary: $(cat "$summary")"
    fi
}

# ─── Main ────────────────────────────────────────────────────────────────────

echo "================================================================"
echo " Elastic P2P Stability Test"
echo " $(date)"
echo " Model: $MODEL"
echo " Requests: $REQUEST_LIMIT"
echo " Output: elastic → $OUT_ELASTIC"
echo " baseline → $OUT_BASELINE"
echo "================================================================"

# Sanity checks
if [ ! -f "$TRACE" ]; then
    echo "[ERROR] Trace file not found: $TRACE"
    exit 1
fi
if [ ! -x "$PYTHON" ]; then
    echo "[ERROR] Python not found: $PYTHON"
    exit 1
fi

# ─── Phase 1: Elastic P2P offload ────────────────────────────────────────────

echo ""
echo "############################################################"
echo " Phase 1: Elastic P2P Offload"
echo "############################################################"

kill_all
launch_vllm_kv_both
launch_proxy_elastic
collect_gpu_util "$OUT_ELASTIC"
run_benchmark "elastic_p2p" "$OUT_ELASTIC"

echo ""
echo "[phase1] Saving APC stats..."
for i in $(seq 0 $((N_INSTANCES - 1))); do
    port=$((BASE_PORT + i))
    curl -sf "http://127.0.0.1:$port/metrics" 2>/dev/null \
        | grep -E 'vllm:cache_hit|prefix_cache' \
        >> "$OUT_ELASTIC/apc_metrics.txt" 2>/dev/null || true
done

# ─── Phase 2: Baseline (no offload) ─────────────────────────────────────────

echo ""
echo "############################################################"
echo " Phase 2: Baseline (no offload, no Mooncake)"
echo "############################################################"

kill_all
launch_vllm_baseline
launch_proxy_baseline
collect_gpu_util "$OUT_BASELINE"
run_benchmark "baseline" "$OUT_BASELINE"

echo ""
echo "[phase2] Saving APC stats..."
for i in $(seq 0 $((N_INSTANCES - 1))); do
    port=$((BASE_PORT + i))
    curl -sf "http://127.0.0.1:$port/metrics" 2>/dev/null \
        | grep -E 'vllm:cache_hit|prefix_cache' \
        >> "$OUT_BASELINE/apc_metrics.txt" 2>/dev/null || true
done

# ─── Cleanup ─────────────────────────────────────────────────────────────────

kill_all

# ─── Comparison ──────────────────────────────────────────────────────────────

echo ""
echo "================================================================"
echo " Results Comparison"
echo "================================================================"

print_summary "Elastic P2P" "$OUT_ELASTIC"
echo ""
print_summary "Baseline" "$OUT_BASELINE"

echo ""
echo "Detailed outputs:"
echo " Elastic: $OUT_ELASTIC/"
echo " Baseline: $OUT_BASELINE/"
echo ""
echo "Breakdown analysis:"
echo " python scripts/analyze_breakdown.py $OUT_ELASTIC/breakdown.json"
echo ""
echo "================================================================"
echo " Done. $(date)"
echo "================================================================"
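The `print_summary` helper in this script counts successes by grepping for exact `"error":null` spellings, which depends on how the replayer serializes each row. As an illustration only, the same success rate could be computed by parsing each JSONL row. The sketch below assumes rows carry an `error` field that is null or absent on success, as in the bench.sh summary; whether the replayer's `metrics.jsonl` uses exactly that schema is an assumption, and `success_rate` and the sample rows are hypothetical.

```python
import json


def success_rate(lines):
    """Count rows whose `error` field is empty or missing (assumed schema)."""
    total = ok = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are not requests
        total += 1
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # a malformed row is counted as a failure
        if isinstance(row, dict) and not row.get("error"):
            ok += 1
    rate = 100.0 * ok / total if total else 0.0
    return ok, total, rate


# Made-up rows with different whitespace around the null error field.
sample = [
    '{"ttft_s": 0.21, "latency_s": 1.9, "error": null}',
    '{"ttft_s": 0.30, "latency_s": 2.4,"error":null}',
    '{"latency_s": 120.0, "error": "timeout"}',
    "",
]
ok, total, rate = success_rate(sample)
print("OK=%d/%d (%.1f%%)" % (ok, total, rate))
```

Parsing each row makes the count independent of JSON spacing and key order, at the cost of invoking Python from the shell script.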