Invalidate prior A/B results + add proper experiment harness

Prior cross-machine comparison (commit 1e86285) was invalid: dash0
baseline used warm instances with residual KV cache, inflating TTFT
by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start
requests; WARM TTFT p90=3.3s vs fresh=0.26s.

Fair same-machine comparison (both fresh restart on dash0):
  Baseline:    TTFT50=1.075  TPOT90=0.076  E2E50=5.075  OK=198/200
  Elastic P2P: TTFT50=1.018  TPOT90=0.085  E2E50=6.977  OK=195/200
Elastic is WORSE on TPOT90 and E2E50 (TTFT50 is ~5% better), which we
attribute to Mooncake kv_both memory overhead.

Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
  add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
  and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:54:21 +08:00
parent e4fa56cb1e
commit fc92410ec9
5 changed files with 775 additions and 116 deletions

190
REPORT.md

@@ -161,127 +161,106 @@ done
## 3. Results
### 3.1 End-to-End Performance
> **Errata (2026-05-22)**: The initial cross-machine A/B (dash0 baseline vs dash1 elastic) reported a 44% E2E improvement (E2E p50 -44%). Post-hoc analysis showed that the dash0 baseline instances were **not freshly restarted**: residual KV cache from prior experiments inflated baseline TTFT by about 2×. All results below come from verified fresh-restart experiments on the same machine.
| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p50 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|----------|---------|
| Baseline linear | 198/200 | 2.383s | 27.622s | 0.069s | 0.117s | 10.232s |
| Baseline LMetric | 198/200 | 1.099s | 9.392s | 0.063s | 0.073s | 5.205s |
| Elastic P2P (cap=4) | 185/196 | **1.315s** | **13.179s** | **0.066s** | **0.075s** | **5.708s** |
### 3.1 Fair Comparison (all fresh-restart, same machine dash0, 200 req)
> Note: "Baseline linear" was run on dash0 during the initial A/B (different machine load conditions).
> "Baseline LMetric" was run on fresh-restart dash0, same conditions as "Baseline linear (fresh)" below in §3.6.
| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|---------|
| **Baseline (no Mooncake)** | **198/200** | **1.075s** | **9.384s** | **0.076s** | **5.075s** |
| LMetric routing | 198/200 | 1.099s | 9.392s | 0.073s | 5.205s |
| Elastic P2P (kv_both) | 195/200 | 1.018s | 11.312s | 0.085s | 6.977s |
### 3.2 KV Cache Hit Ratio
### 3.2 Per-Class Breakdown
Sampled from vLLM instance logs at end of experiment:
**Baseline (fresh):**
**Baseline** (local prefix cache only):
| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|-------|-------|---|----------|----------|----------|
| WARM (<5k) | 46 | 23% | 0.137s | 0.262s | 0.061s |
| MEDIUM (5-20k) | 50 | 25% | 0.921s | 1.846s | 0.079s |
| HEAVY (20-50k) | 64 | 32% | 2.660s | 6.278s | 0.076s |
| HEAVY (>50k) | 38 | 19% | 9.587s | 30.415s | 0.102s |
| Instance | Prefix APC |
|----------|-----------|
| inst_0 | 48.6% |
| inst_3 | 3.8% |
| inst_7 | 68.3% |
| **Std dev** | **~33pp** |
**Elastic P2P (fresh):**
**Elastic** (local prefix + Mooncake external):
| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|-------|-------|---|----------|----------|----------|
| WARM (<5k) | 46 | 23% | 0.142s | 0.279s | 0.072s |
| MEDIUM (5-20k) | 50 | 25% | 0.766s | 1.814s | 0.197s |
| HEAVY (>20k) | 99 | 51% | 6.390s | 22.668s | 0.085s |
| Instance | Prefix APC | External APC | Effective |
|----------|-----------|-------------|-----------|
| inst_0 | 37.8% | 31.6% | 69.4% |
| inst_3 | 36.6% | 34.2% | 70.8% |
| inst_7 | 25.0% | 0.0% | 25.0% |
| **Prefix std** | **~7pp** | | |
Key finding: elastic has **much more uniform** prefix APC across instances (std ~7pp vs ~33pp), and Mooncake external cache adds 30-34pp on active decode instances.
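The spread figures quoted above can be reproduced from the three sampled instances in each table. This is a minimal sketch, assuming the `~33pp` and `~7pp` values are sample (n-1) standard deviations and that "Effective" is simply prefix plus external APC; the variable and function names are illustrative only.

```python
import statistics

# Prefix APC (%) per sampled instance, copied from the two tables above.
baseline_prefix = {"inst_0": 48.6, "inst_3": 3.8, "inst_7": 68.3}
elastic_prefix = {"inst_0": 37.8, "inst_3": 36.6, "inst_7": 25.0}
# External (Mooncake) APC (%) for the elastic run.
elastic_external = {"inst_0": 31.6, "inst_3": 34.2, "inst_7": 0.0}


def spread_pp(apc_by_instance):
    """Sample standard deviation of per-instance APC, in percentage points."""
    return statistics.stdev(list(apc_by_instance.values()))


# Assumed definition: effective APC = local prefix hits + external hits.
effective = {
    name: round(elastic_prefix[name] + elastic_external[name], 1)
    for name in elastic_prefix
}

print("baseline prefix std: %.1fpp" % spread_pp(baseline_prefix))
print("elastic prefix std:  %.1fpp" % spread_pp(elastic_prefix))
print("elastic effective:  ", effective)
```

With only three instances sampled per run, the standard deviation is a rough indicator rather than a precise estimate.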
### 3.3 GPU Utilization
| Config | Mean | Min | Max | Imbalance |
|--------|------|-----|-----|-----------|
| Baseline | 28.7% | 20% | 38% | 1.9× |
| Elastic | 15.8% | 7.6% | 30.4% | 3.0× |
### 3.4 Success Rate
### 3.3 Success Rate
| Config | OK | Total | Rate | Failure mode |
|--------|-----|-------|------|-------------|
| Baseline | 198 | 200 | 99.0% | Generic timeout |
| Elastic | 185 | 196 | 94.4% | Mooncake transfer timeout on >60k requests |
| Baseline | 198 | 200 | 99.0% | RemoteProtocolError (replayer-side) |
| Elastic P2P | 195 | 200 | 97.5% | 2× RemoteProtocolError + 3× ReadTimeout on >60k |
### 3.5 Per-Class TTFT Breakdown (Baseline Combined)
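Elastic's 3 extra errors are D-side KV pull failures: prefill succeeded on P and the KV was pushed to Mooncake, but D never produced a first token (the decode scheduler couldn't allocate KV cache space). The prefill timeout fallback (120s → co-located) was never triggered because prefill itself succeeded; the fallback only covers P-side timeouts and failures, not D-side stalls.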
| Class | Count | % | Input p50 | TTFT p50 | TTFT p90 |
|-------|-------|---|-----------|----------|----------|
| WARM (<5k) | 46 | 23% | 1,095 | 0.133s | 0.260s |
| MEDIUM (5-20k) | 50 | 25% | 10,879 | 0.873s | 1.808s |
| HEAVY (20-50k) | 64 | 32% | 34,368 | 2.589s | 6.302s |
| HEAVY (>50k) | 38 | 19% | 83,018 | 9.563s | 30.480s |
### 3.4 Routing Policy: Linear vs LMetric (OSDI'26)
HEAVY requests (51% of traffic) dominate tail latency. Elastic offloads precisely these.
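The class labels in these tables are input-length buckets. A minimal sketch of the bucketing, assuming half-open ranges as in the summary code of `scripts/bench.sh` (so exactly 5,000 tokens counts as MEDIUM); `CLASS_EDGES` and `classify` are illustrative names:

```python
# Bucket edges in tokens; None means "no upper bound".
CLASS_EDGES = [
    (0, 5_000, "WARM"),
    (5_000, 20_000, "MEDIUM"),
    (20_000, None, "HEAVY"),
]


def classify(input_length):
    """Return the traffic class for a request with this many input tokens."""
    for lo, hi, label in CLASS_EDGES:
        if input_length >= lo and (hi is None or input_length < hi):
            return label
    raise ValueError("input_length must be non-negative")


# Median input lengths per class, taken from the table above.
for tokens in (1_095, 10_879, 34_368, 83_018):
    print(tokens, classify(tokens))
```

The two HEAVY rows (20-50k and >50k) together account for 64 + 38 = 102 requests, which is the HEAVY count used in the combined per-class tables.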
LMetric (`score = P_tokens × BS`) vs linear (`score = ongoing_tokens - α·cache_hit`). Both fresh-restart, same trace.
### 3.6 Routing Policy Comparison: Linear vs LMetric (OSDI'26)
| Policy | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 | Delta E2E |
|--------|----------|----------|----------|---------|-----------|
| Linear | 1.086s | 9.432s | 0.077s | 5.423s | — |
| LMetric | 1.099s | 9.392s | 0.073s | 5.205s | **-4.0%** |
LMetric (Zhang et al., OSDI'26) replaces linear combination `score = load - α·cache_hit` with hyperparameter-free multiplication `score = P_tokens × BS`:
- **P_tokens** = pending prefill tokens on instance + new request's uncached tokens
- **BS** = batch size (waiting + running request count) + 1
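The two scoring rules can be compared side by side in a short sketch. It assumes the router picks the instance with the lowest score; the `InstanceState` fields, the `alpha` default and the example numbers are illustrative and not taken from `cache_aware_proxy.py`.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical snapshot of one instance as seen by the router."""
    name: str
    ongoing_tokens: int
    pending_prefill_tokens: int
    waiting: int
    running: int


def linear_score(inst, cache_hit_tokens, alpha=1.0):
    # score = ongoing_tokens - alpha * cache_hit
    return inst.ongoing_tokens - alpha * cache_hit_tokens


def lmetric_score(inst, uncached_tokens):
    # P_tokens = pending prefill tokens + the new request's uncached tokens
    p_tokens = inst.pending_prefill_tokens + uncached_tokens
    # BS = waiting + running request count, plus one for the new request
    batch_size = inst.waiting + inst.running + 1
    return p_tokens * batch_size


request_tokens = 10_000
a = InstanceState("a", ongoing_tokens=50_000, pending_prefill_tokens=0,
                  waiting=0, running=6)
b = InstanceState("b", ongoing_tokens=30_000, pending_prefill_tokens=30_000,
                  waiting=2, running=1)
cache_hit = {"a": 8_000, "b": 0}  # cached prefix tokens per instance

by_linear = min((a, b), key=lambda i: linear_score(i, cache_hit[i.name]))
by_lmetric = min(
    (a, b), key=lambda i: lmetric_score(i, request_tokens - cache_hit[i.name])
)
print("linear picks:", by_linear.name, "| lmetric picks:", by_lmetric.name)
```

In this made-up case the linear rule prefers `b` because it holds fewer ongoing tokens, while the product rule prefers `a` because `b` already has a large prefill backlog.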
LMetric provides modest improvement through better load balancing. Routing policy headroom is limited for this workload.
Both experiments: 8× TP=1 fresh-restart instances on dash0, same trace (200 req, time_scale=20).
### 3.5 Errata: Why Prior Cross-Machine A/B Was Invalid
| Policy | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|---------|
| Linear | 198/200 | 1.086s | 9.432s | 0.0773s | 5.423s |
| LMetric | 198/200 | 1.099s | 9.392s | 0.0727s | 5.205s |
| **Delta** | | **+1.2%** | **-0.4%** | **-5.9%** | **-4.0%** |
The initial comparison (commit `1e86285`) reported:
```
Baseline (dash0): TTFT50=2.383 E2E50=10.232 ← WRONG (warm instances)
Elastic (dash1): TTFT50=1.315 E2E50=5.708
Delta: -45% -44% ← INVALID
```
Per-class breakdown:
**Evidence that prior baseline was not fresh:**
1. `inst_7` APC = 68.3% — impossible from 25 cold-start requests (max ~25%)
2. WARM TTFT p90 = 3.327s (fresh = 0.262s, 12.7× gap) — indicates KV cache memory pressure from prior experiments
3. HEAVY TPOT p90 = 0.154s (fresh = 0.076s, 2.0× gap) — heavy prefill-decode interference from full KV cache
| Class | Linear TTFT p50 | LMetric TTFT p50 | Linear TPOT p90 | LMetric TPOT p90 |
|-------|----------------|-----------------|----------------|-----------------|
| WARM (<5k, n=46) | 0.143s | 0.134s | 0.058s | 0.061s |
| MEDIUM (5-20k, n=50) | 0.921s | 0.809s | 0.078s | 0.073s |
| HEAVY (>20k, n=102) | 4.875s | 4.943s | 0.078s | 0.074s |
APC comparison (prefix cache hit rate per instance):
| | Linear | LMetric |
|--|--------|---------|
| Mean | 32.5% | 30.8% |
| Std | ~22pp | ~19pp |
| Range | 3.3%–63.3% | 4.9%–67.2% |
**Analysis**: LMetric provides modest improvements in TPOT (-5.9%) and E2E (-4.0%) through better load balancing (the multiplication naturally penalizes overloaded instances). TTFT is unchanged because HEAVY requests dominate and session affinity constrains routing freedom. APC skew is slightly reduced. The improvement is far smaller than elastic P2P offload (-44% E2E), confirming that for agentic workloads, **the bottleneck is prefill-decode interference, not routing policy**.
Data: `outputs/ab_linear/` and `outputs/ab_lmetric/` on dash0. Logs: `/tmp/lmetric_ab_inst_*.log` (linear) and `/tmp/lmetric_inst_*.log` (LMetric).
The elastic numbers on dash1 were genuinely fresh. The "improvement" was actually comparing fresh elastic against degraded baseline.
## 4. System-Level Analysis
### 4.1 Why Elastic Wins Despite Lower GPU Utilization
### 4.1 Elastic P2P Does Not Improve Single-Machine Performance
**Mechanism 1: Eliminating prefill-decode interference (TPOT -36%)**
Under fair comparison (same machine, both fresh):
In combined mode, vLLM chunked prefill interleaves prefill and decode. An 80k-token HEAVY prefill occupies the GPU for seconds, delaying co-resident decode. Elastic routes heavy prefill to a different instance, so the decode pipeline is uninterrupted.
| Metric | Baseline | Elastic | Delta |
|--------|----------|---------|-------|
| TTFT p50 | 1.075s | 1.018s | -5.3% |
| TTFT p90 | 9.384s | 11.312s | +20.5% |
| TPOT p90 | 0.076s | 0.085s | +11.6% |
| E2E p50 | 5.075s | 6.977s | +37.5% |
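The Delta column is the relative change of elastic against baseline. The sketch below recomputes it from the rounded values printed in the table; `pct_delta` is an illustrative helper. Because the inputs are rounded to three decimals, a recomputed value can differ in the last digit from a figure derived from unrounded measurements.

```python
def pct_delta(baseline, candidate):
    """Relative change of candidate vs baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (baseline, elastic) in seconds, copied from the table above.
rows = {
    "TTFT p50": (1.075, 1.018),
    "TTFT p90": (9.384, 11.312),
    "TPOT p90": (0.076, 0.085),
    "E2E p50": (5.075, 6.977),
}

for metric, (base, elastic) in rows.items():
    print("%-9s %+.1f%%" % (metric, pct_delta(base, elastic)))
```

For latency metrics a positive delta means elastic is slower.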
Evidence: TPOT p90 drops from 0.117s (baseline) to 0.075s (elastic).
Elastic is **worse** on all metrics except TTFT p50. Root causes:
**Mechanism 2: Better effective cache utilization (TTFT -45%)**
**1. Mooncake kv_both memory overhead**
Baseline APC is skewed (3.8%–68.3%) because heavy prefills evict other sessions' cached blocks. Elastic preserves D-instance prefix chains by offloading heavy prefills to P instances. Combined with Mooncake external cache, effective APC reaches ~70% on active instances vs ~40% baseline average.
Each instance with `kv_role=kv_both` maintains RDMA buffers + Mooncake bootstrap server, reducing GPU memory available for KV cache. This affects ALL requests (including WARM/MEDIUM that don't use P2P transfer), causing more cache eviction and higher TPOT.
**Mechanism 3: Faster KV cache turnover**
Evidence: MEDIUM TPOT p90 = 0.197s (elastic) vs 0.079s (baseline) — **2.5× worse** despite MEDIUM requests not using P2P at all.
Lower GPU utilization (15.8% vs 28.7%) is not waste — it reflects that requests complete 44% faster. Less contention → decode finishes faster → KV cache freed sooner → next request starts faster. The same total work completes in 56% of the wall time.
**2. D-side KV pull failures**
### 4.2 Known Limitation: GPU Load Imbalance
3 HEAVY requests completed prefill on P instance successfully but D-side never produced first token. The KV cache on D was too full to allocate space for the transferred blocks. These became 600s timeouts.
Elastic has 3.0× imbalance (7.6% min vs 30.4% max) vs baseline's 1.9×.
**3. P2P overhead without proportional benefit**
Root causes:
1. **P-instance concentration**: Previous implementation always picked the globally least-loaded instance as P, concentrating P-role work on the same few idle instances.
2. **Session skew**: Some sessions have many turns with large inputs, keeping their pinned instance busy while others go idle.
The P2P path adds: prefill queue on P (p50=6.3s) + KV transfer + decode start on D (p50=0.8s). For requests where the D instance isn't under heavy prefill load (which is the case on fresh instances), co-located execution is faster.
**Implemented fix** (in latest `cache_aware_proxy.py`): Round-robin P-instance selection with overload skip, replacing `argmin(ongoing_tokens)`. Needs validation in next experiment cycle.
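Round-robin selection with overload skip could look roughly like the sketch below. It is not the code in `cache_aware_proxy.py`: `Inst`, `RoundRobinPicker`, the `overload_tokens` threshold and the `exclude` parameter are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    """Hypothetical instance record; only the load counter matters here."""
    name: str
    ongoing_tokens: int = 0


class RoundRobinPicker:
    """Rotate P-role work across instances, skipping overloaded ones."""

    def __init__(self, instances, overload_tokens):
        self.instances = list(instances)
        self.overload_tokens = overload_tokens  # assumed skip threshold
        self._next = 0

    def pick(self, exclude=None):
        n = len(self.instances)
        for step in range(n):
            idx = (self._next + step) % n
            inst = self.instances[idx]
            if inst is exclude or inst.ongoing_tokens >= self.overload_tokens:
                continue  # skip the D instance and anything overloaded
            self._next = (idx + 1) % n  # resume after the chosen instance
            return inst
        return None  # nothing eligible; the caller would stay co-located


picker = RoundRobinPicker(
    [Inst("inst_0"), Inst("inst_1", ongoing_tokens=100_000), Inst("inst_2")],
    overload_tokens=50_000,
)
print([picker.pick().name for _ in range(3)])
```

Unlike `argmin(ongoing_tokens)`, the cursor advances after every pick, so consecutive heavy prefills land on different instances even when one instance stays the least loaded.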
### 4.2 When Elastic P2P Could Help
Elastic P2P is designed for the scenario where D-instance decode is disrupted by co-located heavy prefill. On fresh instances with 200 requests, this contention is moderate. The benefit may emerge under:
- Higher sustained load (1000+ concurrent requests)
- Longer experiment duration (KV cache fills up, eviction pressure increases)
- Multi-machine deployment (P on a different node, no memory competition)
## 5. Data & Log Locations
@@ -289,12 +268,14 @@ Root causes:
| Directory | Machine | Config | Notes |
|-----------|---------|--------|-------|
| `outputs/ab_baseline/` | dash0 | Combined 8× TP=1 | Fair A/B baseline (§3) |
| `outputs/ab_elastic/` | dash1 | Elastic P2P cap=4 | Fair A/B elastic (§3) |
| `outputs/ab_baseline/` | dash0 | Combined 8× TP=1 | ~~Initial A/B~~ (INVALIDATED: warm instances) |
| `outputs/ab_elastic/` | dash0 | Elastic P2P cap=4 | ~~Initial A/B~~ (INVALIDATED) |
| `outputs/baseline_stability_fresh/` | dash0 | Combined 8× fresh | **Canonical baseline** (§3.1) |
| `outputs/elastic_stability_*/` | dash0 | Elastic P2P kv_both fresh | **Canonical elastic** (§3.1) |
| `outputs/ab_linear/` | dash0 | Linear policy, 200 req | §3.4 routing policy comparison |
| `outputs/ab_lmetric/` | dash0 | LMetric policy, 200 req | §3.4 routing policy comparison |
| `outputs/gpu_ab_combined/` | local | Combined 8× TP=1 | Earlier run, has gpu_util.csv |
| `outputs/gpu_ab_pdsep/` | local | PD-Sep 4P+4D | Earlier run, has gpu_util.csv |
| `outputs/ab_linear/` | dash0 | Linear policy, 200 req | §3.6 routing policy comparison |
| `outputs/ab_lmetric/` | dash0 | LMetric policy, 200 req | §3.6 routing policy comparison |
| `outputs/exp2_combined_tp1_dp8/` | local | Combined 8× TP=1 | 1000 req, cache-aware |
| `outputs/exp3_pd_sep_tp1_mooncake/` | local | PD-Sep 4P+4D Mooncake | 1000 req |
@@ -367,6 +348,8 @@ agentic-kv/
|--------|-------------|-----------|
| `scripts/cache_aware_proxy.py` | Global scheduler + elastic offload proxy | `--combined`, `--offload`, `--policy {linear,lmetric}`, `--heavy-threshold`, `--bootstrap-ports` |
| `scripts/run_lmetric_ab.sh` | A/B: linear vs lmetric routing policy | Runs both experiments with fresh restart |
| `scripts/run_elastic_stability_test.sh` | Elastic vs baseline with full isolation | Fresh start/stop per experiment |
| `scripts/bench.sh` | Standard single-experiment harness | `--tag`, `--mode {baseline,elastic}` |
| `scripts/sample_trace.py` | Sample complete sessions from cluster trace | `--target-requests`, `--seed` |
| `python -m replayer` | Replay trace against vLLM endpoint | `--time-scale`, `--max-inflight-sessions`, `--request-limit` |
| `scripts/gpu_monitor.sh` | Sample nvidia-smi to CSV | Pipe to `outputs/<tag>/gpu_util.csv` |
@@ -376,18 +359,23 @@ agentic-kv/
### Established findings:
1. Full PD separation is **net negative** for single-machine agentic workloads (KV cache memory wall)
2. Cache-aware session-sticky routing is the **dominant optimization** (+24pp APC, -60% TTFT)
3. Elastic P2P offload achieves **-45% TTFT, -36% TPOT, -44% E2E** by selectively isolating heavy prefills while preserving decode cache locality
4. The GPU utilization paradox (lower util but better performance) is explained by higher per-request efficiency
5. LMetric (OSDI'26) multiplication-based routing provides modest improvement over linear (**E2E -4%, TPOT -6%**), confirming that routing policy alone has limited headroom — the bottleneck is prefill-decode interference
2. Cache-aware session-sticky routing is the **dominant optimization** (+24pp APC, -60% TTFT vs round-robin)
3. **Elastic P2P offload does NOT improve single-machine performance** — Mooncake kv_both memory overhead (+11% TPOT, +37% E2E) outweighs prefill isolation benefit under moderate load (200 req)
4. LMetric (OSDI'26) provides modest **E2E -4%** over linear routing; routing policy headroom is limited
5. **Experimental methodology matters**: warm vs fresh instances cause 2× TTFT difference; all comparisons must use verified fresh restart
### Lessons learned:
- Prior cross-machine A/B (commit `1e86285`) was invalid — warm baseline inflated by 2× due to residual KV cache state
- `kv_role=kv_both` has non-trivial always-on overhead even when P2P transfer is not used
- Experiment isolation (kill all → verify GPU free → fresh start) is critical for reproducibility
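The "verify GPU free" step can be expressed as a small check. This sketch mirrors the `nvidia-smi` query and the 100 MB threshold used in `scripts/bench.sh`; the function names and the injectable `run` parameter are illustrative additions so the parsing can be exercised without a GPU.

```python
import subprocess

# Same query as bench.sh: one integer (MiB used) per GPU, no header or units.
QUERY = ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]


def total_used_mib(smi_output):
    """Sum per-GPU memory usage from the query output above."""
    return sum(int(line) for line in smi_output.splitlines() if line.strip())


def gpus_are_free(max_used_mib=100, run=subprocess.run):
    """True if total GPU memory in use is at or below the threshold."""
    out = run(QUERY, capture_output=True, text=True, check=True).stdout
    return total_used_mib(out) <= max_used_mib


# Parsing example with made-up output from an idle 4-GPU machine.
print(total_used_mib("4\n4\n3\n4\n"))
```

A harness would call `gpus_are_free()` after killing old processes and refuse to start the experiment if it returns `False`.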
### Open problems:
1. GPU load imbalance (3.0× in elastic) — round-robin P fix implemented, needs validation
2. Elastic success rate (94.4%) — Mooncake transfer timeouts on >60k requests
3. Scaling to multi-machine (cross-node Mooncake transfers not yet tested)
4. Adaptive offload threshold (fixed 20k may not be optimal for all load levels)
5. Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router pipeline for ablation)
1. Elastic P2P may help under **sustained high load** (KV cache pressure makes co-located interference worse) — needs 1000-req experiment
2. Mooncake kv_both memory overhead quantification and potential lazy initialization
3. Multi-machine elastic (P on different node, no memory competition)
4. Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)
5. Migrate all future experiments to the `scripts/bench.sh` standardized harness to prevent warm-instance mistakes
---
*Generated from experiments run on 2026-05-22. Git commits: `1e86285` (elastic A/B), `2b0ac70` (phase 1 milestone), subsequent LMetric implementation.*
*Updated 2026-05-22. Prior elastic A/B results (commit `1e86285`) invalidated — see §3.5 errata.*


@@ -13,12 +13,14 @@ We benchmarked PD separation (prefill-decode disaggregation) against PD co-locat
Per-request breakdown shows **87.7% of TTFT** is spent waiting for KV cache memory on decode instances, not prefill compute or KV transfer.
**Elastic P2P offload** (selective disaggregation of HEAVY requests only) recovers the wins of PD separation without the memory wall:
**Elastic P2P offload** (selective disaggregation of HEAVY requests only, Mooncake kv_both): under fair same-machine fresh-restart comparison, elastic does NOT improve over baseline. Mooncake kv_both memory overhead outweighs prefill isolation benefit at moderate load.
| Config (TP=1, 8×H20) | TTFT p50 | TPOT p90 | E2E p50 | Effective APC |
|---|---|---|---|---|
| Combined DP=8 (baseline) | 2.383s | 0.117s | 10.232s | ~40% (skewed) |
| Elastic P2P (cap=4) | **1.315s** | **0.075s** | **5.708s** | ~70% (balanced) |
| Config (TP=1, 8×H20, fresh) | TTFT p50 | TPOT p90 | E2E p50 |
|---|---|---|---|
| Combined DP=8 (baseline) | **1.075s** | **0.076s** | **5.075s** |
| Elastic P2P (kv_both, cap=4) | 1.018s | 0.085s | 6.977s |
> Earlier cross-machine comparison (commit `1e86285`) was invalidated — baseline used warm instances. See REPORT.md §3.5.
| **Delta** | **-45%** | **-36%** | **-44%** | **+30pp** |
---

300
scripts/bench.sh Executable file

@@ -0,0 +1,300 @@
#!/bin/bash
# Standardized single-experiment harness with guaranteed fresh state.
#
# GUARANTEES:
# 1. All GPU processes killed before start (verified via nvidia-smi)
# 2. All GPU processes killed after finish (clean for next experiment)
# 3. Fresh vLLM instances + proxy for every run
# 4. All outputs saved to outputs/<tag>/ with metrics, breakdown, APC, GPU snapshot
#
# Usage:
# bash scripts/bench.sh --tag my_experiment --mode baseline
# bash scripts/bench.sh --tag my_experiment --mode elastic
# bash scripts/bench.sh --tag my_experiment --mode baseline --policy lmetric
# bash scripts/bench.sh --tag my_experiment --mode elastic --requests 1000
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
VENV="${VENV_PATH:-$PROJECT_DIR/.venv/bin}"
PYTHON="$VENV/python"
VLLM="$VENV/vllm"
MODEL="${MODEL_PATH:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
TRACE="$PROJECT_DIR/traces/sampled_1000req_seed42.jsonl"
# Defaults
TAG=""
MODE="baseline" # baseline | elastic
POLICY="linear" # linear | lmetric
N_INSTANCES=8
BASE_PORT=8000
PROXY_PORT=9090
REQUESTS=200
TIME_SCALE=20
MAX_SESSIONS=8
HEAVY_THRESHOLD=20000
# Parse args
while [[ $# -gt 0 ]]; do
case "$1" in
--tag) TAG="$2"; shift 2 ;;
--mode) MODE="$2"; shift 2 ;;
--policy) POLICY="$2"; shift 2 ;;
--requests) REQUESTS="$2"; shift 2 ;;
--time-scale) TIME_SCALE="$2"; shift 2 ;;
--sessions) MAX_SESSIONS="$2"; shift 2 ;;
--heavy-threshold) HEAVY_THRESHOLD="$2"; shift 2 ;;
*) echo "Unknown: $1"; exit 1 ;;
esac
done
if [ -z "$TAG" ]; then
echo "Usage: bench.sh --tag NAME --mode {baseline|elastic} [--policy {linear|lmetric}] [--requests N]"
exit 1
fi
OUTDIR="$PROJECT_DIR/outputs/$TAG"
if [ -d "$OUTDIR" ] && [ -f "$OUTDIR/metrics.jsonl" ]; then
echo "[ERROR] Output directory $OUTDIR already exists with data. Use a different --tag."
exit 1
fi
mkdir -p "$OUTDIR"
# Save experiment config
cat > "$OUTDIR/config.json" << CONF
{
"tag": "$TAG",
"mode": "$MODE",
"policy": "$POLICY",
"model": "$MODEL",
"n_instances": $N_INSTANCES,
"requests": $REQUESTS,
"time_scale": $TIME_SCALE,
"max_sessions": $MAX_SESSIONS,
"heavy_threshold": $HEAVY_THRESHOLD,
"timestamp": "$(date -Iseconds)",
"hostname": "$(hostname)"
}
CONF
# ─── GPU Cleanup (verified) ────────────────────────────────────────────────
cleanup_gpu() {
echo "[cleanup] Killing all vLLM/proxy processes..."
for p in $(ps aux | grep -E 'vllm serve|cache_aware_proxy' | grep -v grep | awk '{print $2}' 2>/dev/null); do
kill -9 "$p" 2>/dev/null || true
done
sleep 3
# Kill any remaining GPU holders
local gpu_pids
gpu_pids=$(fuser /dev/nvidia* 2>/dev/null | tr ' ' '\n' | sort -u | grep -v '^$' || true)
if [ -n "$gpu_pids" ]; then
echo "[cleanup] Killing GPU-holding PIDs: $gpu_pids"
echo "$gpu_pids" | xargs -r kill -9 2>/dev/null || true
sleep 5
fi
# Verify GPUs are free
local used
used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits 2>/dev/null | awk '{s+=$1} END{print s}')
if [ "${used:-0}" -gt 100 ]; then
echo "[ERROR] GPUs still have ${used}MB allocated after cleanup. Aborting."
nvidia-smi --query-gpu=index,memory.used --format=csv,noheader
exit 1
fi
echo "[cleanup] All GPUs verified free."
}
# ─── Launch vLLM instances ─────────────────────────────────────────────────
launch_instances() {
echo "[launch] Starting $N_INSTANCES vLLM instances (mode=$MODE)..."
local kv_config=""
if [ "$MODE" = "elastic" ]; then
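# Single-quote the JSON inside the value: the launch command below goes through
# eval, which would otherwise brace-expand {...,...} and strip the double quotes.
kv_config="--kv-transfer-config '{\"kv_connector\":\"MooncakeConnector\",\"kv_role\":\"kv_both\"}'"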
fi
for i in $(seq 0 $((N_INSTANCES - 1))); do
local port=$((BASE_PORT + i))
local master=$((29500 + i))
local logfile="$OUTDIR/vllm_inst_${i}.log"
local env_prefix="MASTER_PORT=$master CUDA_VISIBLE_DEVICES=$i"
if [ "$MODE" = "elastic" ]; then
env_prefix="VLLM_MOONCAKE_BOOTSTRAP_PORT=$((8998 + i)) $env_prefix"
fi
eval "$env_prefix $VLLM serve '$MODEL' \
--host 0.0.0.0 --port $port \
--tensor-parallel-size 1 \
--trust-remote-code --enable-prefix-caching --enforce-eager \
--dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
$kv_config \
> '$logfile' 2>&1 &"
echo " inst_$i: GPU=$i port=$port"
sleep 2 # stagger to avoid port collision
done
# Wait for health
echo "[launch] Waiting for instances to become healthy..."
for i in $(seq 0 $((N_INSTANCES - 1))); do
local port=$((BASE_PORT + i))
local tries=0
while ! curl -sf "http://127.0.0.1:$port/health" > /dev/null 2>&1; do
tries=$((tries + 1))
if [ $tries -ge 120 ]; then
echo "[FAIL] Instance $i (port $port) failed to start. Log:"
tail -10 "$OUTDIR/vllm_inst_${i}.log"
cleanup_gpu
exit 1
fi
sleep 5
done
echo " inst_$i healthy"
done
# Wait for bootstrap (elastic only)
if [ "$MODE" = "elastic" ]; then
echo "[launch] Waiting for Mooncake bootstrap servers..."
for i in $(seq 0 $((N_INSTANCES - 1))); do
local bp=$((8998 + i))
local tries=0
while ! curl -sf "http://127.0.0.1:$bp/query" > /dev/null 2>&1; do
tries=$((tries + 1))
if [ $tries -ge 60 ]; then
echo "[FAIL] Bootstrap $bp failed"
cleanup_gpu
exit 1
fi
sleep 2
done
echo " bootstrap $bp ready"
done
fi
}
# ─── Launch proxy ──────────────────────────────────────────────────────────
launch_proxy() {
echo "[proxy] Starting (mode=$MODE, policy=$POLICY)..."
local combined_args=""
for i in $(seq 0 $((N_INSTANCES - 1))); do
combined_args="$combined_args http://127.0.0.1:$((BASE_PORT + i))"
done
local extra_args="--policy $POLICY"
if [ "$MODE" = "elastic" ]; then
local bp_list=""
for i in $(seq 0 $((N_INSTANCES - 1))); do
bp_list="${bp_list:+$bp_list,}$((8998 + i))"
done
extra_args="$extra_args --offload --heavy-threshold $HEAVY_THRESHOLD --bootstrap-ports $bp_list"
fi
$PYTHON "$PROJECT_DIR/scripts/cache_aware_proxy.py" \
--combined $combined_args \
--port $PROXY_PORT \
$extra_args \
> "$OUTDIR/proxy.log" 2>&1 &
# Wait for proxy
local tries=0
while ! curl -sf "http://127.0.0.1:$PROXY_PORT/stats" > /dev/null 2>&1; do
tries=$((tries + 1))
if [ $tries -ge 30 ]; then
echo "[FAIL] Proxy failed to start"
cleanup_gpu
exit 1
fi
sleep 2
done
echo "[proxy] Ready on port $PROXY_PORT"
}
# ─── Run benchmark ─────────────────────────────────────────────────────────
run_benchmark() {
echo "[bench] Running $REQUESTS requests (time_scale=$TIME_SCALE, sessions=$MAX_SESSIONS)..."
$PYTHON -m replayer \
--trace "$TRACE" \
--output "$OUTDIR/metrics.jsonl" \
--endpoint "http://localhost:$PROXY_PORT" \
--model "$MODEL" \
--time-scale "$TIME_SCALE" \
--max-inflight-sessions "$MAX_SESSIONS" \
--request-limit "$REQUESTS" \
-v 2>&1 | tee "$OUTDIR/replayer.log"
}
# ─── Collect artifacts ─────────────────────────────────────────────────────
collect_artifacts() {
echo "[collect] Saving artifacts..."
curl -sf "http://localhost:$PROXY_PORT/breakdown" > "$OUTDIR/breakdown.json" 2>/dev/null || true
curl -sf "http://localhost:$PROXY_PORT/stats" > "$OUTDIR/stats.json" 2>/dev/null || true
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
--format=csv > "$OUTDIR/gpu_snapshot.csv" 2>/dev/null || true
# APC from vLLM logs
for i in $(seq 0 $((N_INSTANCES - 1))); do
pch=$(grep "Prefix cache hit rate" "$OUTDIR/vllm_inst_${i}.log" 2>/dev/null | tail -1 | grep -oP "Prefix cache hit rate: \K[0-9.]+" || echo "0")
ech=$(grep "External prefix cache hit rate" "$OUTDIR/vllm_inst_${i}.log" 2>/dev/null | tail -1 | grep -oP "External prefix cache hit rate: \K[0-9.]+" || echo "")
ext_str=""
[ -n "$ech" ] && ext_str=" ext=$ech%"
echo "inst_$i: prefix=$pch%$ext_str"
done | tee "$OUTDIR/apc.txt"
}
# ─── Summary ───────────────────────────────────────────────────────────────
print_summary() {
$PYTHON -c "
import json
rows = [json.loads(l) for l in open('$OUTDIR/metrics.jsonl')]
ok = [r for r in rows if not r.get('error')]
err = [r for r in rows if r.get('error')]
p = lambda v,q: sorted(v)[min(int(q*len(v)),len(v)-1)] if v else 0
ttfts = sorted([r['ttft_s'] for r in ok if r.get('ttft_s')])
tpots = sorted([r['tpot_s'] for r in ok if r.get('tpot_s') and r['tpot_s']>0])
e2es = sorted([r['latency_s'] for r in ok])
print()
print('=' * 70)
print(' RESULT: $TAG ($MODE, $POLICY)')
print('=' * 70)
print(' OK=%d/%d (%.1f%%) TTFT50=%.3f TTFT90=%.3f TPOT90=%.4f E2E50=%.3f' % (
    len(ok), len(rows), len(ok)*100/len(rows), p(ttfts,.5), p(ttfts,.9), p(tpots,.9), p(e2es,.5)))
for lo,hi,cl in [(0,5000,'WARM'),(5000,20000,'MEDIUM'),(20000,200000,'HEAVY')]:
    sub = [r for r in ok if lo <= r['input_length'] < hi and r.get('ttft_s')]
    if sub:
        t = sorted([r['ttft_s'] for r in sub])
        tp = sorted([r['tpot_s'] for r in sub if r.get('tpot_s') and r['tpot_s']>0])
        print(' %-8s n=%3d TTFT50=%.3f TTFT90=%.3f TPOT90=%.4f' % (
            cl, len(sub), p(t,.5), p(t,.9), p(tp,.9) if tp else 0))
if err:
    print(' Errors (%d):' % len(err))
    for e in err[:5]:
        print(' input=%d %s' % (e['input_length'], str(e.get('error',''))[:60]))
print(' Output: $OUTDIR/')
print('=' * 70)
"
}
# ─── Main ──────────────────────────────────────────────────────────────────
echo "================================================================"
echo " bench.sh: $TAG"
echo " mode=$MODE policy=$POLICY requests=$REQUESTS"
echo " $(date)"
echo "================================================================"
cleanup_gpu
launch_instances
launch_proxy
run_benchmark
collect_artifacts
print_summary
cleanup_gpu
echo "[done] $(date)"


@@ -361,13 +361,22 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
return StreamingResponse(generate(), media_type="text/event-stream")
PREFILL_TIMEOUT_S = 120 # max seconds to wait for P-instance prefill
async def _handle_heavy_offload(api, req_data, headers, token_ids, input_length,
p_inst, d_inst, breakdown):
"""HEAVY request: prefill on p_inst, KV via Mooncake, decode on d_inst."""
"""HEAVY request: prefill on p_inst, KV via Mooncake, decode on d_inst.
On prefill timeout/failure, falls back to co-located decode on d_inst.
"""
global _offload_inflight
request_id = headers.get("X-Request-Id", "")
estimated_new = breakdown.get("estimated_new_tokens", 0)
# Step 1: Await prefill on p_inst (ongoing_tokens already reserved by caller)
breakdown["t_prefill_sent"] = _time.monotonic()
prefill_ok = False
try:
prefill_data = req_data.copy()
prefill_data["kv_transfer_params"] = {
@@ -381,25 +390,56 @@ async def _handle_heavy_offload(api, req_data, headers, token_ids, input_length,
prefill_data.pop("stream_options", None)
p_headers = {**headers, "X-data-parallel-rank": "0"}
resp = await p_inst.client.post(api, json=prefill_data, headers=p_headers)
resp = await asyncio.wait_for(
p_inst.client.post(api, json=prefill_data, headers=p_headers),
timeout=PREFILL_TIMEOUT_S,
)
resp.raise_for_status()
await resp.aclose()
p_inst.record_prefix(token_ids)
breakdown["t_prefill_done"] = _time.monotonic()
prefill_ok = True
except Exception as e:
breakdown["t_prefill_done"] = _time.monotonic()
breakdown["error"] = str(e)
_breakdown_log.append(breakdown)
global _offload_inflight
_offload_inflight = max(0, _offload_inflight - 1)
p_inst.num_requests -= 1
raise HTTPException(status_code=502, detail="Prefill failed: %s" % e)
breakdown["prefill_error"] = str(e)
finally:
# Always release P-instance resources exactly once
p_inst.ongoing_tokens -= input_length
p_inst.pending_prefill_tokens -= breakdown.get("estimated_new_tokens", 0)
p_inst.pending_prefill_tokens -= estimated_new
p_inst.num_requests -= 1
_offload_inflight = max(0, _offload_inflight - 1)
p_inst.num_requests -= 1
    if not prefill_ok:
        # Fallback: co-located prefill+decode on d_inst (no KV transfer)
        breakdown["route_class"] = "HEAVY_COLO_FALLBACK"
        d_inst.ongoing_tokens += input_length
        d_inst.pending_prefill_tokens += estimated_new
        d_inst.num_requests += 1
        async def generate_fallback():
            prefill_done = False
            try:
                async with d_inst.client.stream("POST", api, json=req_data, headers=headers) as resp:
                    resp.raise_for_status()
                    async for chunk in resp.aiter_bytes():
                        if not prefill_done:
                            d_inst.pending_prefill_tokens -= estimated_new
                            d_inst.ongoing_decode_tokens += input_length
                            breakdown["t_first_token"] = _time.monotonic()
                            prefill_done = True
                        yield chunk
                d_inst.record_prefix(token_ids)
            finally:
                if not prefill_done:
                    d_inst.pending_prefill_tokens -= estimated_new
                else:
                    d_inst.ongoing_decode_tokens -= input_length
                d_inst.ongoing_tokens -= input_length
                d_inst.num_requests -= 1
                breakdown["t_done"] = _time.monotonic()
                _breakdown_log.append(breakdown)
        return StreamingResponse(generate_fallback(), media_type="text/event-stream")
# Step 2: Stream decode on d_inst (pulls KV from Mooncake)
d_inst.ongoing_tokens += input_length


@@ -0,0 +1,329 @@
#!/bin/bash
# Elastic P2P stability test: runs 200-request benchmark with offload mode
# and baseline mode, then compares success rates.
#
# Must be run on dash0 (8 GPUs with Mooncake support).
#
# Usage:
# bash scripts/run_elastic_stability_test.sh
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
VENV="${VENV_PATH:-$PROJECT_DIR/.venv/bin}"
PYTHON="$VENV/python"
VLLM="$VENV/vllm"
MODEL="${MODEL_PATH:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
TRACE="$PROJECT_DIR/traces/sampled_1000req_seed42.jsonl"
N_INSTANCES=8
BASE_PORT=8000
PROXY_PORT=9090
REQUEST_LIMIT=200
TIME_SCALE=20
MAX_INFLIGHT=8
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
OUT_ELASTIC="$PROJECT_DIR/outputs/elastic_stability_${TIMESTAMP}"
OUT_BASELINE="$PROJECT_DIR/outputs/baseline_stability_${TIMESTAMP}"
# ─── Helper functions ────────────────────────────────────────────────────────
kill_all() {
    echo "[cleanup] Killing vLLM and proxy processes..."
    for p in $(ps aux | grep 'vllm serve' | grep -v grep | awk '{print $2}' 2>/dev/null); do
        kill -9 "$p" 2>/dev/null || true
    done
    for p in $(ps aux | grep 'cache_aware_proxy' | grep -v grep | awk '{print $2}' 2>/dev/null); do
        kill -9 "$p" 2>/dev/null || true
    done
    sleep 5
    echo "[cleanup] Releasing GPUs..."
    for p in $(fuser /dev/nvidia* 2>/dev/null | tr ' ' '\n' | sort -u); do
        kill -9 "$p" 2>/dev/null || true
    done
    sleep 10
    echo "[cleanup] Done."
}
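# Sketch (not part of the original script): a fresh-state check that could be
# called right after kill_all, in the spirit of the nvidia-smi verification
# that bench.sh performs. Assumptions: the helper name assert_gpus_idle and
# the NVIDIA_SMI override variable are hypothetical; the query flags are the
# standard nvidia-smi compute-apps query.
assert_gpus_idle() {
    local smi="${NVIDIA_SMI:-nvidia-smi}"
    if ! command -v "$smi" > /dev/null 2>&1; then
        echo "[FAIL] $smi not found; cannot verify GPU state"
        return 1
    fi
    # One PID per line for every process that still holds a compute context.
    local pids
    pids=$("$smi" --query-compute-apps=pid --format=csv,noheader 2>/dev/null \
        | tr -d ' ' | grep -v '^$' || true)
    if [ -n "$pids" ]; then
        echo "[FAIL] GPUs still in use by PIDs: $(echo "$pids" | tr '\n' ' ')"
        return 1
    fi
    echo "[ok] No compute processes on any GPU"
}
# Possible usage before each phase: kill_all && assert_gpus_idle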
wait_for_instances() {
    local n=$1
    echo "[wait] Waiting for $n vLLM instances to become healthy..."
    for i in $(seq 0 $((n - 1))); do
        local port=$((BASE_PORT + i))
        local tries=0
        while ! curl -sf "http://127.0.0.1:$port/health" > /dev/null 2>&1; do
            tries=$((tries + 1))
            if [ $tries -ge 120 ]; then
                echo "[FAIL] Instance $i (port $port) did not start in 600s"
                return 1
            fi
            sleep 5
        done
        echo " Instance $i (port $port) healthy"
    done
}
wait_for_bootstrap() {
    echo "[wait] Waiting for Mooncake bootstrap servers..."
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        local bp=$((8998 + i))
        local tries=0
        while ! curl -sf "http://127.0.0.1:$bp/query" > /dev/null 2>&1; do
            tries=$((tries + 1))
            if [ $tries -ge 60 ]; then
                echo "[FAIL] Bootstrap $bp did not start in 120s"
                return 1
            fi
            sleep 2
        done
        echo " Bootstrap $bp ready"
    done
}
wait_for_proxy() {
    echo "[wait] Waiting for proxy on port $PROXY_PORT..."
    local tries=0
    while ! curl -sf "http://127.0.0.1:$PROXY_PORT/stats" > /dev/null 2>&1; do
        tries=$((tries + 1))
        if [ $tries -ge 30 ]; then
            echo "[FAIL] Proxy did not start in 60s"
            return 1
        fi
        sleep 2
    done
    echo " Proxy ready"
}
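# Sketch (not part of the original script): the three wait_for_* helpers
# repeat the same poll loop with different URLs, retry counts, and intervals.
# A shared helper along these lines could replace the duplicated loops.
# Assumption: the name wait_for_url and its argument order are hypothetical.
wait_for_url() {
    local url=$1        # endpoint to poll
    local max_tries=$2  # number of attempts before giving up
    local interval=$3   # seconds to sleep between attempts
    local tries=0
    while ! curl -sf "$url" > /dev/null 2>&1; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            echo "[FAIL] $url not ready after $((max_tries * interval))s"
            return 1
        fi
        sleep "$interval"
    done
}
# Possible usage, mirroring wait_for_proxy:
#   wait_for_url "http://127.0.0.1:$PROXY_PORT/stats" 30 2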
launch_vllm_kv_both() {
    echo ""
    echo "=== Launching $N_INSTANCES vLLM instances (kv_both) ==="
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        local port=$((BASE_PORT + i))
        local bp=$((8998 + i))
        local master=$((29500 + i))
        local log="/tmp/elastic_test_${i}.log"
        VLLM_MOONCAKE_BOOTSTRAP_PORT=$bp \
        MASTER_PORT=$master \
        CUDA_VISIBLE_DEVICES=$i \
        $VLLM serve "$MODEL" \
            --host 0.0.0.0 --port "$port" --tensor-parallel-size 1 \
            --trust-remote-code --enable-prefix-caching --enforce-eager \
            --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
            --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
            > "$log" 2>&1 &
        echo " Instance $i: GPU=$i port=$port bootstrap=$bp log=$log"
        sleep 2
    done
    wait_for_instances $N_INSTANCES
    wait_for_bootstrap
}
launch_vllm_baseline() {
    echo ""
    echo "=== Launching $N_INSTANCES vLLM instances (baseline, no Mooncake) ==="
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        local port=$((BASE_PORT + i))
        local master=$((29500 + i))
        local log="/tmp/baseline_test_${i}.log"
        MASTER_PORT=$master \
        CUDA_VISIBLE_DEVICES=$i \
        $VLLM serve "$MODEL" \
            --host 0.0.0.0 --port "$port" --tensor-parallel-size 1 \
            --trust-remote-code --enable-prefix-caching --enforce-eager \
            --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
            > "$log" 2>&1 &
        echo " Instance $i: GPU=$i port=$port log=$log"
        sleep 2
    done
    wait_for_instances $N_INSTANCES
}
launch_proxy_elastic() {
    echo ""
    echo "=== Starting proxy (elastic offload mode) ==="
    local combined_args=""
    local bp_list=""
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        combined_args="$combined_args http://127.0.0.1:$((BASE_PORT + i))"
        bp_list="${bp_list:+$bp_list,}$((8998 + i))"
    done
    $PYTHON "$PROJECT_DIR/scripts/cache_aware_proxy.py" \
        --combined $combined_args \
        --bootstrap-ports "$bp_list" \
        --offload --heavy-threshold 20000 \
        --port $PROXY_PORT \
        > /tmp/proxy_elastic.log 2>&1 &
    wait_for_proxy
}
launch_proxy_baseline() {
    echo ""
    echo "=== Starting proxy (baseline, no offload) ==="
    local combined_args=""
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        combined_args="$combined_args http://127.0.0.1:$((BASE_PORT + i))"
    done
    $PYTHON "$PROJECT_DIR/scripts/cache_aware_proxy.py" \
        --combined $combined_args \
        --port $PROXY_PORT \
        > /tmp/proxy_baseline.log 2>&1 &
    wait_for_proxy
}
run_benchmark() {
    local tag=$1
    local output_dir=$2
    mkdir -p "$output_dir"
    echo ""
    echo "=== Running benchmark: $tag ($REQUEST_LIMIT requests) ==="
    $PYTHON -m replayer \
        --trace "$TRACE" \
        --output "$output_dir/metrics.jsonl" \
        --endpoint "http://localhost:$PROXY_PORT" \
        --model "$MODEL" \
        --time-scale "$TIME_SCALE" \
        --max-inflight-sessions "$MAX_INFLIGHT" \
        --request-limit "$REQUEST_LIMIT" \
        -v 2>&1 | tee "$output_dir/replayer.log"
    # Save proxy breakdown and stats
    curl -sf "http://localhost:$PROXY_PORT/breakdown" > "$output_dir/breakdown.json" 2>/dev/null || true
    curl -sf "http://localhost:$PROXY_PORT/stats" > "$output_dir/stats.json" 2>/dev/null || true
}
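collect_gpu_util() {
    local output_dir=$1
    # This runs before run_benchmark creates the directory, so create it here;
    # otherwise the redirect fails and the snapshot is silently dropped.
    mkdir -p "$output_dir"
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
        --format=csv > "$output_dir/gpu_snapshot.csv" 2>/dev/null || true
}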
print_summary() {
    local label=$1
    local output_dir=$2
    local metrics="$output_dir/metrics.jsonl"
    if [ ! -f "$metrics" ]; then
        echo " [$label] No metrics file found!"
        return
    fi
    # Count total, success, error from metrics JSONL.
    # grep -c prints "0" and also exits non-zero when nothing matches, so
    # chaining greps with || would concatenate several counts into one value.
    # Run the fallback pattern explicitly instead.
    local total success errors
    total=$(wc -l < "$metrics")
    success=$(grep -c '"error":null\|"error": null' "$metrics" 2>/dev/null || true)
    if [ "${success:-0}" -eq 0 ]; then
        success=$(grep -c '"ttft":[0-9]' "$metrics" 2>/dev/null || true)
    fi
    success=${success:-0}
    errors=$((total - success))
    local rate="N/A"
    if [ "$total" -gt 0 ]; then
        rate=$(awk "BEGIN{printf \"%.1f\", ($success/$total)*100}")
    fi
    echo " [$label]"
    echo " Total requests: $total"
    echo " Successful: $success"
    echo " Errors: $errors"
    echo " Success rate: ${rate}%"
    # Print summary.json if it exists
    local summary="$output_dir/metrics.summary.json"
    if [ -f "$summary" ]; then
        echo " Summary: $(cat "$summary")"
    fi
}
# ─── Main ────────────────────────────────────────────────────────────────────
echo "================================================================"
echo " Elastic P2P Stability Test"
echo " $(date)"
echo " Model: $MODEL"
echo " Requests: $REQUEST_LIMIT"
echo " Output: elastic → $OUT_ELASTIC"
echo " baseline → $OUT_BASELINE"
echo "================================================================"
# Sanity checks
if [ ! -f "$TRACE" ]; then
    echo "[ERROR] Trace file not found: $TRACE"
    exit 1
fi
if [ ! -x "$PYTHON" ]; then
    echo "[ERROR] Python not found: $PYTHON"
    exit 1
fi
# ─── Phase 1: Elastic P2P offload ────────────────────────────────────────────
echo ""
echo "############################################################"
echo " Phase 1: Elastic P2P Offload"
echo "############################################################"
kill_all
launch_vllm_kv_both
launch_proxy_elastic
collect_gpu_util "$OUT_ELASTIC"
run_benchmark "elastic_p2p" "$OUT_ELASTIC"
echo ""
echo "[phase1] Saving APC stats..."
for i in $(seq 0 $((N_INSTANCES - 1))); do
    port=$((BASE_PORT + i))
    curl -sf "http://127.0.0.1:$port/metrics" 2>/dev/null \
        | grep -E 'vllm:cache_hit|prefix_cache' \
        >> "$OUT_ELASTIC/apc_metrics.txt" 2>/dev/null || true
done
# ─── Phase 2: Baseline (no offload) ─────────────────────────────────────────
echo ""
echo "############################################################"
echo " Phase 2: Baseline (no offload, no Mooncake)"
echo "############################################################"
kill_all
launch_vllm_baseline
launch_proxy_baseline
collect_gpu_util "$OUT_BASELINE"
run_benchmark "baseline" "$OUT_BASELINE"
echo ""
echo "[phase2] Saving APC stats..."
for i in $(seq 0 $((N_INSTANCES - 1))); do
    port=$((BASE_PORT + i))
    curl -sf "http://127.0.0.1:$port/metrics" 2>/dev/null \
        | grep -E 'vllm:cache_hit|prefix_cache' \
        >> "$OUT_BASELINE/apc_metrics.txt" 2>/dev/null || true
done
# ─── Cleanup ─────────────────────────────────────────────────────────────────
kill_all
# ─── Comparison ──────────────────────────────────────────────────────────────
echo ""
echo "================================================================"
echo " Results Comparison"
echo "================================================================"
print_summary "Elastic P2P" "$OUT_ELASTIC"
echo ""
print_summary "Baseline" "$OUT_BASELINE"
echo ""
echo "Detailed outputs:"
echo " Elastic: $OUT_ELASTIC/"
echo " Baseline: $OUT_BASELINE/"
echo ""
echo "Breakdown analysis:"
echo " python scripts/analyze_breakdown.py $OUT_ELASTIC/breakdown.json"
echo ""
echo "================================================================"
echo " Done. $(date)"
echo "================================================================"