Prior cross-machine comparison (commit 1e86285) was invalid: dash0
baseline used warm instances with residual KV cache, inflating TTFT
by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start
requests; WARM TTFT p90=3.3s vs fresh=0.26s.
Fair same-machine comparison (both fresh restart on dash0):
Baseline: TTFT50=1.075 TPOT90=0.076 E2E50=5.075 OK=198/200
Elastic P2P: TTFT50=1.018 TPOT90=0.085 E2E50=6.977 OK=195/200
Elastic is WORSE due to Mooncake kv_both memory overhead.
Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Milestone Report: Elastic P2P vs PD-Combined Baseline
Date: 2026-05-22
Author: Gahow Wang
Status: Phase 1 complete (baseline + elastic validated, system-level analysis done)
1. Research Question
For agentic LLM workloads (long input, short output, high KV cache reuse), is prefill-decode disaggregation beneficial? If full PD separation hurts (proven in §3), can selective disaggregation of only heavy requests improve serving latency while preserving KV cache locality?
2. Experimental Setup
2.1 Hardware
| Resource | Spec |
|---|---|
| Machine | dash0 / dash1 (identical config) |
| GPU | 8× NVIDIA H20 96GB HBM, NVLink |
| Network | 4× ConnectX-7 200Gbps RDMA |
| Storage | cpfs shared storage across machines |
2.2 Software
| Component | Version | Notes |
|---|---|---|
| vLLM | 0.18.1 (source in third_party/vllm/) | Patched scheduler assert (see patches/) |
| Mooncake | 0.3.10 | RDMA-based KV transfer between instances |
| Python | 3.x managed by uv | .venv/ at project root |
| Model | Qwen3-Coder-30B-A3B-Instruct | MoE 128 experts top-8, 3B active params |
| Model path | ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct | Same on dash0 and dash1 |
2.3 Workload Trace
| Property | Value |
|---|---|
| Source | GLM-5.1 Agentic Coder, production cluster, 2h window |
| Raw trace | ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl on dash0 |
| Total requests | 2,114,220 |
| Avg input tokens | 33,600 (p50=20k, p90=88k) |
| Avg output tokens | 445 (p50=80) |
| I/O ratio | 75.6× aggregate |
| Prefill token share | 98% |
| KV reuse (intra-session) | 91% of reusable blocks |
| Theoretical max APC | 71% (infinite cache, single instance) |
Sampled trace for benchmarks: traces/sampled_1000req_seed42.jsonl (1000 requests, seed=42, preserving session structure). For 200-request ablations: replayer --request-limit 200.
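A minimal sketch of what "preserving session structure" can mean when sampling: whole sessions are picked in a seeded random order until the request target is reached, so no multi-turn session is split. This is not the code in scripts/sample_trace.py; the function name and the `session_id` field are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_sessions(requests, target_requests, seed, session_key="session_id"):
    """Pick whole sessions in seeded random order until the target is reached."""
    by_session = defaultdict(list)
    for req in requests:
        by_session[req[session_key]].append(req)

    session_ids = sorted(by_session)          # deterministic starting order
    random.Random(seed).shuffle(session_ids)  # reproducible for a given seed

    sampled = []
    for sid in session_ids:
        if len(sampled) >= target_requests:
            break
        sampled.extend(by_session[sid])       # never split a session
    return sampled
```

Because sessions are kept whole, the result can overshoot the target by up to one session's worth of requests.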
2.4 Two Configurations Compared
Baseline: PD-Combined (8× TP=1 DP=8)
8 independent vLLM instances, 1 GPU each, no Mooncake.
All instances do both prefill and decode.
Global scheduler (cache_aware_proxy.py --combined) handles:
- Session-sticky routing (multi-turn → same instance)
- Load-aware override (if pinned instance > 2× avg load, redirect)
- Cache-hit scoring (prefer instance with matching prefix blocks)
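The three routing rules above can be sketched as a single decision function. This is an illustration under assumptions, not the logic in cache_aware_proxy.py: the load measure (ongoing tokens), the `alpha` weight, and the name `pick_instance` are hypothetical.

```python
def pick_instance(loads, cache_hits, pinned=None, alpha=1.0, overload_factor=2.0):
    """Return the index of the instance that should serve a request.

    loads[i]      -- ongoing tokens on instance i (assumed load measure)
    cache_hits[i] -- prefix tokens already cached on instance i for this request
    pinned        -- instance the session is stuck to, or None for a new session
    """
    avg_load = sum(loads) / len(loads)
    # Session-sticky: keep a multi-turn session where its KV cache lives,
    # unless the pinned instance carries more than 2x the average load.
    if pinned is not None and loads[pinned] <= overload_factor * avg_load:
        return pinned
    # Otherwise score every instance: lower load and more cached prefix win.
    scores = [load - alpha * hit for load, hit in zip(loads, cache_hits)]
    return min(range(len(loads)), key=scores.__getitem__)
```

For example, `pick_instance([10, 900, 20], [0, 500, 0], pinned=1)` abandons the pinned instance because 900 exceeds twice the average load of 310, and the scoring step then selects instance 0.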
Launch:
# On dash0:
for i in $(seq 0 7); do
MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
--port $((8000+i)) --tp 1 \
--enable-prefix-caching --enforce-eager \
--gpu-memory-utilization 0.9 --max-model-len 200000 \
> /tmp/ab_base_$i.log 2>&1 &
done
python scripts/cache_aware_proxy.py \
--combined http://127.0.0.1:800{0..7} --port 9090
Elastic P2P Offload (8× TP=1 kv_both + selective offload)
8 independent vLLM instances, 1 GPU each, all kv_role=kv_both (Mooncake).
Same global scheduler, plus elastic offload logic:
- Proxy classifies each request: WARM (<5k new), MEDIUM (5-20k), HEAVY (>20k)
- WARM/MEDIUM: co-located on session-sticky instance (no KV transfer)
- HEAVY: prefill on a different instance (P), KV via Mooncake RDMA,
decode on session-sticky instance (D)
- Cap: max 4 concurrent offloads (MAX_OFFLOAD_INFLIGHT)
- P instance selection: round-robin with overload skip
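The classification and offload rules above can be sketched as follows. The thresholds and the cap of 4 come from this section; the class name `OffloadPlanner`, its methods, and the way overload is passed in are assumptions for illustration, not the proxy's real interface.

```python
import itertools

MEDIUM_THRESHOLD = 5_000    # below this many new tokens: WARM
HEAVY_THRESHOLD = 20_000    # above this many new tokens: HEAVY
MAX_OFFLOAD_INFLIGHT = 4


def classify(new_tokens):
    if new_tokens < MEDIUM_THRESHOLD:
        return "WARM"
    if new_tokens <= HEAVY_THRESHOLD:
        return "MEDIUM"
    return "HEAVY"


class OffloadPlanner:
    def __init__(self, num_instances):
        self.num_instances = num_instances
        self.inflight = 0
        self._rr = itertools.cycle(range(num_instances))

    def plan(self, new_tokens, decode_instance, overloaded=()):
        """Return (prefill_instance, decode_instance) for one request."""
        if classify(new_tokens) != "HEAVY" or self.inflight >= MAX_OFFLOAD_INFLIGHT:
            return decode_instance, decode_instance      # co-located
        for _ in range(self.num_instances):              # round-robin, skip overloaded
            candidate = next(self._rr)
            if candidate != decode_instance and candidate not in overloaded:
                self.inflight += 1
                return candidate, decode_instance
        return decode_instance, decode_instance          # no usable P instance

    def done(self):
        """Call exactly once per offloaded request when it finishes."""
        self.inflight = max(0, self.inflight - 1)
```

Keeping the increment in `plan` and the single decrement in `done` is one way to avoid the kind of double-decrement on the in-flight counter that the commit message mentions fixing.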
Launch:
# On dash1 (or use scripts/launch_elastic_p2p.sh):
for i in $(seq 0 7); do
VLLM_MOONCAKE_BOOTSTRAP_PORT=$((8998+i)) \
MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
--port $((8000+i)) --tp 1 \
--enable-prefix-caching --enforce-eager \
--gpu-memory-utilization 0.9 --max-model-len 200000 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
> /tmp/ab_elastic_$i.log 2>&1 &
sleep 2 # stagger to avoid NCCL port collision
done
# Wait for bootstrap servers
for bp in $(seq 8998 9005); do
until curl -s localhost:$bp/query > /dev/null 2>&1; do sleep 2; done
done
python scripts/cache_aware_proxy.py \
--combined http://127.0.0.1:800{0..7} \
--bootstrap-ports 8998,8999,9000,9001,9002,9003,9004,9005 \
--offload --heavy-threshold 20000 --port 9090
2.5 Benchmark Parameters
| Parameter | Value |
|---|---|
| Requests | 200 (from sampled 1000-req trace, --request-limit 200) |
| Time scale | 20× (compress 2h trace into ~6min) |
| Max inflight sessions | 8 |
| Request timeout | 600s |
| vLLM flags | --enforce-eager --enable-prefix-caching --max-model-len 200000 |
| GPU memory util | 0.9 |
| Fresh restart | Both configs started from cold (no warm cache) |
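Since the fresh-restart condition turned out to be decisive, a pre-flight check is worth automating. The sketch below parses `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` output and refuses to proceed if any GPU still holds memory. It is not the check in scripts/bench.sh; the 100 MiB threshold and the function names are assumptions.

```python
import subprocess


def gpu_memory_used_mib(smi_output):
    """Parse one integer (MiB used) per GPU from nvidia-smi CSV output."""
    return [int(line) for line in smi_output.splitlines() if line.strip()]


def assert_fresh(smi_output, max_used_mib=100):
    """Raise if any GPU uses more than max_used_mib (assumed idle threshold)."""
    used = gpu_memory_used_mib(smi_output)
    busy = [(idx, mib) for idx, mib in enumerate(used) if mib > max_used_mib]
    if busy:
        raise RuntimeError(f"GPUs not idle (index, MiB used): {busy}")
    return used


def query_nvidia_smi():
    """Run nvidia-smi and return its raw stdout."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
```

On a GPU host, `assert_fresh(query_nvidia_smi())` would run before launching any vLLM instance. Idle GPU memory only shows that old processes are gone; the instances must still be started fresh for the prefix cache to be empty.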
2.6 Reproducing the Benchmark
# Activate environment
cd ~/agentic-kv && source .venv/bin/activate
# Ensure sampled trace exists
python scripts/sample_trace.py \
--input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
--output traces/sampled_1000req_seed42.jsonl \
--target-requests 1000 --seed 42
# Start GPU monitoring (in a separate terminal)
bash scripts/gpu_monitor.sh > outputs/<tag>/gpu_util.csv &
# Run replayer against proxy
python -m replayer \
--trace traces/sampled_1000req_seed42.jsonl \
--output outputs/<tag>/metrics.jsonl \
--endpoint http://localhost:9090 \
--time-scale 20 --max-inflight-sessions 8 \
--request-limit 200 -v
# Collect proxy breakdown (elastic only)
curl -s http://localhost:9090/breakdown > outputs/<tag>/breakdown.json
# Collect APC from vLLM logs
for i in $(seq 0 7); do
grep "Prefix cache hit rate\|External prefix cache hit rate" /tmp/<prefix>_$i.log | tail -2
done
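For post-processing, the same extraction can be done in Python. The sketch assumes each metric appears as `<name>: <number>%`, which is inferred from the grep pattern above rather than from a documented log format; adjust the regex if the actual lines differ.

```python
import re

# Assumed shape: "... Prefix cache hit rate: 41.2%, External prefix cache hit rate: 3.0%"
_RATE = re.compile(r"(External prefix|Prefix) cache hit rate: ([0-9.]+)%")


def last_hit_rates(log_text):
    """Return the most recent local and external hit rates found in a log."""
    rates = {}
    for kind, value in _RATE.findall(log_text):
        key = "external" if kind.startswith("External") else "local"
        rates[key] = float(value)   # later lines overwrite earlier ones
    return rates
```

Calling `last_hit_rates(open(path).read())` on each instance log gives one dictionary per instance, with an `external` entry only when the log reports one.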
3. Results
Errata (2026-05-22): The initial cross-machine A/B (dash0 baseline vs dash1 elastic) reported a 44% reduction in E2E p50 for elastic. Post-hoc analysis showed that the dash0 baseline instances had not been freshly restarted: residual KV cache from prior experiments inflated baseline TTFT by about 2×. All results below come from verified fresh-restart experiments on the same machine.
3.1 Fair Comparison (all fresh-restart, same machine dash0, 200 req)
| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
|---|---|---|---|---|---|
| Baseline (no Mooncake) | 198/200 | 1.075s | 9.384s | 0.076s | 5.075s |
| LMetric routing | 198/200 | 1.099s | 9.392s | 0.073s | 5.205s |
| Elastic P2P (kv_both) | 195/200 | 1.018s | 11.312s | 0.085s | 6.977s |
3.2 Per-Class Breakdown
Baseline (fresh):
| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|---|---|---|---|---|---|
| WARM (<5k) | 46 | 23% | 0.137s | 0.262s | 0.061s |
| MEDIUM (5-20k) | 50 | 25% | 0.921s | 1.846s | 0.079s |
| HEAVY (20-50k) | 64 | 32% | 2.660s | 6.278s | 0.076s |
| HEAVY (>50k) | 38 | 19% | 9.587s | 30.415s | 0.102s |
Elastic P2P (fresh):
| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|---|---|---|---|---|---|
| WARM (<5k) | 46 | 23% | 0.142s | 0.279s | 0.072s |
| MEDIUM (5-20k) | 50 | 25% | 0.766s | 1.814s | 0.197s |
| HEAVY (>20k) | 99 | 51% | 6.390s | 22.668s | 0.085s |
3.3 Success Rate
| Config | OK | Total | Rate | Failure mode |
|---|---|---|---|---|
| Baseline | 198 | 200 | 99.0% | RemoteProtocolError (replayer-side) |
| Elastic P2P | 195 | 200 | 97.5% | 2× RemoteProtocolError + 3× ReadTimeout on >60k |
Elastic's 3 extra errors are D-side KV pull failures: prefill succeeded on P, KV pushed to Mooncake, but D never produced first token (decode scheduler couldn't allocate KV cache space). Prefill timeout fallback (120s → co-located) was never triggered.
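A timeout-with-fallback of the kind described can be sketched with `asyncio.wait_for`. The two coroutine arguments are hypothetical stand-ins for the proxy's remote-prefill and co-located request paths; only the 120s value comes from this report.

```python
import asyncio

PREFILL_TIMEOUT_S = 120.0


async def serve_heavy(remote_prefill, colocated, timeout=PREFILL_TIMEOUT_S):
    """Try remote prefill; on timeout, rerun the request co-located.

    remote_prefill and colocated are hypothetical zero-argument coroutine
    functions standing in for the proxy's real request paths.
    """
    try:
        return await asyncio.wait_for(remote_prefill(), timeout=timeout)
    except asyncio.TimeoutError:
        return await colocated()
```

A guard written this way covers only the prefill step. If prefill finishes in time and the request then stalls on the decode side, the fallback never fires, which would be consistent with the behavior reported above.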
3.4 Routing Policy: Linear vs LMetric (OSDI'26)
LMetric (score = P_tokens × BS) vs linear (score = ongoing_tokens - α·cache_hit). Both fresh-restart, same trace.
| Policy | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 | Delta E2E |
|---|---|---|---|---|---|
| Linear | 1.086s | 9.432s | 0.077s | 5.423s | — |
| LMetric | 1.099s | 9.392s | 0.073s | 5.205s | -4.0% |
LMetric provides modest improvement through better load balancing. Routing policy headroom is limited for this workload.
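The two scoring formulas can be written side by side. Reading P_tokens as pending prefill tokens and BS as the current batch size is an assumption, as is the convention that the lowest score wins; the function names are illustrative.

```python
def linear_score(ongoing_tokens, cache_hit, alpha=1.0):
    """score = ongoing_tokens - alpha * cache_hit"""
    return ongoing_tokens - alpha * cache_hit


def lmetric_score(p_tokens, batch_size):
    """score = P_tokens * BS"""
    return p_tokens * batch_size


def route(instances, score):
    """Pick the instance name with the lowest score (assumed convention)."""
    return min(instances, key=lambda name: score(**instances[name]))


linear_view = {
    "inst_0": {"ongoing_tokens": 40_000, "cache_hit": 30_000},
    "inst_1": {"ongoing_tokens": 15_000, "cache_hit": 0},
}
lmetric_view = {
    "inst_0": {"p_tokens": 10_000, "batch_size": 2},
    "inst_1": {"p_tokens": 15_000, "batch_size": 3},
}
print(route(linear_view, linear_score), route(lmetric_view, lmetric_score))
```

The linear score trades load against cached prefix additively, while the product form penalizes instances that are both prefill-heavy and running large batches.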
3.5 Errata: Why Prior Cross-Machine A/B Was Invalid
The initial comparison (commit 1e86285) reported:
Baseline (dash0): TTFT50=2.383 E2E50=10.232 ← WRONG (warm instances)
Elastic (dash1): TTFT50=1.315 E2E50=5.708
Delta: -45% -44% ← INVALID
Evidence that prior baseline was not fresh:
- inst_7 APC = 68.3%: impossible from 25 cold-start requests (max ~25%)
- WARM TTFT p90 = 3.327s (fresh = 0.262s, 12.7× gap): indicates KV cache memory pressure from prior experiments
- HEAVY TPOT p90 = 0.154s (fresh = 0.076s, 2.0× gap): heavy prefill-decode interference from full KV cache
The elastic numbers on dash1 were genuinely fresh. The "improvement" was actually comparing fresh elastic against degraded baseline.
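The size of the artifact is easy to reproduce from the reported medians. The helper below is a plain relative-change calculation; its name is arbitrary.

```python
def pct_delta(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


invalid_e2e = pct_delta(10.232, 5.708)   # warm dash0 baseline vs fresh dash1 elastic
fair_e2e = pct_delta(5.075, 6.977)       # both fresh on dash0 (§3.1)
print(f"invalid E2E p50 delta: {invalid_e2e:+.1f}%")   # -44.2%
print(f"fair    E2E p50 delta: {fair_e2e:+.1f}%")      # +37.5%
```

With the same elastic code, the sign of the E2E result flips once the baseline is measured from a fresh start.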
4. System-Level Analysis
4.1 Elastic P2P Does Not Improve Single-Machine Performance
Under fair comparison (same machine, both fresh):
| Metric | Baseline | Elastic | Delta |
|---|---|---|---|
| TTFT p50 | 1.075s | 1.018s | -5.3% |
| TTFT p90 | 9.384s | 11.312s | +20.5% |
| TPOT p90 | 0.076s | 0.085s | +11.6% |
| E2E p50 | 5.075s | 6.977s | +37.5% |
Elastic is worse on all metrics except TTFT p50. Root causes:
1. Mooncake kv_both memory overhead
Each instance with kv_role=kv_both maintains RDMA buffers + Mooncake bootstrap server, reducing GPU memory available for KV cache. This affects ALL requests (including WARM/MEDIUM that don't use P2P transfer), causing more cache eviction and higher TPOT.
Evidence: MEDIUM TPOT p90 = 0.197s (elastic) vs 0.079s (baseline) — 2.5× worse despite MEDIUM requests not using P2P at all.
2. D-side KV pull failures
3 HEAVY requests completed prefill on P instance successfully but D-side never produced first token. The KV cache on D was too full to allocate space for the transferred blocks. These became 600s timeouts.
3. P2P overhead without proportional benefit
The P2P path adds: prefill queue on P (p50=6.3s) + KV transfer + decode start on D (p50=0.8s). For requests where the D instance isn't under heavy prefill load (which is the case on fresh instances), co-located execution is faster.
4.2 When Elastic P2P Could Help
Elastic P2P is designed for the scenario where D-instance decode is disrupted by co-located heavy prefill. On fresh instances with 200 requests, this contention is moderate. The benefit may emerge under:
- Higher sustained load (1000+ concurrent requests)
- Longer experiment duration (KV cache fills up, eviction pressure increases)
- Multi-machine deployment (P on a different node, no memory competition)
5. Data & Log Locations
5.1 Experiment Outputs (on respective machines)
| Directory | Machine | Config | Notes |
|---|---|---|---|
| outputs/ab_baseline/ | dash0 | Combined 8× TP=1 | |
| outputs/ab_elastic/ | dash0 | Elastic P2P cap=4 | |
| outputs/baseline_stability_fresh/ | dash0 | Combined 8× fresh | Canonical baseline (§3.1) |
| outputs/elastic_stability_*/ | dash0 | Elastic P2P kv_both fresh | Canonical elastic (§3.1) |
| outputs/ab_linear/ | dash0 | Linear policy, 200 req | §3.4 routing policy comparison |
| outputs/ab_lmetric/ | dash0 | LMetric policy, 200 req | §3.4 routing policy comparison |
| outputs/gpu_ab_combined/ | local | Combined 8× TP=1 | Earlier run, has gpu_util.csv |
| outputs/gpu_ab_pdsep/ | local | PD-Sep 4P+4D | Earlier run, has gpu_util.csv |
| outputs/exp2_combined_tp1_dp8/ | local | Combined 8× TP=1 | 1000 req, cache-aware |
| outputs/exp3_pd_sep_tp1_mooncake/ | local | PD-Sep 4P+4D Mooncake | 1000 req |
5.2 vLLM Instance Logs
| Path pattern | Machine | Config |
|---|---|---|
| /tmp/ab_base_$i.log | dash0 | Baseline instances 0-7 |
| /tmp/ab_elastic_$i.log | dash1 | Elastic instances 0-7 |
| /tmp/lmetric_ab_inst_$i.log | dash0 | Linear policy instances 0-7 (§3.4) |
| /tmp/lmetric_inst_$i.log | dash0 | LMetric policy instances 0-7 (§3.4) |
Logs contain Prefix cache hit rate and External prefix cache hit rate lines for APC extraction.
5.3 Trace Data
| Path | Machine | Description |
|---|---|---|
| ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl | dash0 | Full 2h production trace (2.1M requests) |
| traces/sampled_1000req_seed42.jsonl | all | Sampled 1000 requests (gitignored, regenerate with sample_trace.py) |
5.4 Analysis Documents
| File | Content |
|---|---|
| analysis/pd_separation_analysis.md | Main report: PD-Sep vs Combined + Elastic P2P (§5) |
| analysis/elastic_offload_design.md | Elastic P2P design rationale |
| analysis/kv_lifecycle_design.md | KV cache eviction policy analysis |
| analysis/adaptive_prefill_offload_design.md | Initial adaptive offload design (superseded by elastic) |
6. Repository Structure
agentic-kv/
├── analysis/ # Research reports and design docs
│ ├── pd_separation_analysis.md # Main comprehensive report
│ ├── elastic_offload_design.md # Elastic P2P design
│ ├── kv_lifecycle_design.md # Cache eviction analysis
│ └── ...
├── replayer/ # Trace replay framework
│ ├── __main__.py # CLI entry: python -m replayer
│ ├── replay.py # Async replayer (session-aware, SSE streaming)
│ ├── trace.py # TraceRequest dataclass, session/hash_id handling
│ └── metrics.py # RequestMetrics, crash-safe JSONL sink
├── scripts/
│ ├── cache_aware_proxy.py # Global scheduler (combined + PD-sep + elastic offload)
│ ├── sample_trace.py # Cluster-to-machine trace sampler
│ ├── launch_vllm.sh # Launch combined TP=8
│ ├── launch_pd_mooncake.sh # Launch PD-Sep with Mooncake
│ ├── launch_elastic_p2p.sh # Launch elastic P2P (8× kv_both + offload proxy)
│ ├── run_experiments.sh # Full experiment matrix (combined/PD-sep)
│ ├── run_benchmark.sh # Single benchmark run
│ ├── gpu_monitor.sh # GPU utilization sampler (5s CSV)
│ ├── compute_roofline.py # Prefill/decode roofline analysis
│ ├── analyze_*.py # Various analysis scripts
│ └── compare_*.py # Experiment comparison scripts
├── patches/
│ ├── 0001-fix-kv-transfer-abort-race.patch
│ └── README.md
├── third_party/vllm/ # vLLM 0.18.1 source (with patch applied)
├── outputs/ # Experiment results (gitignored)
├── traces/ # Sampled traces (gitignored)
├── TODO.md # Original research goals
└── REPORT.md # This milestone report
7. Key Scripts Reference
| Script | What it does | Key flags |
|---|---|---|
| scripts/cache_aware_proxy.py | Global scheduler + elastic offload proxy | --combined, --offload, --policy {linear,lmetric}, --heavy-threshold, --bootstrap-ports |
| scripts/run_lmetric_ab.sh | A/B: linear vs lmetric routing policy | Runs both experiments with fresh restart |
| scripts/run_elastic_stability_test.sh | Elastic vs baseline with full isolation | Fresh start/stop per experiment |
| scripts/bench.sh | Standard single-experiment harness | --tag, --mode {baseline,elastic} |
| scripts/sample_trace.py | Sample complete sessions from cluster trace | --target-requests, --seed |
| python -m replayer | Replay trace against vLLM endpoint | --time-scale, --max-inflight-sessions, --request-limit |
| scripts/gpu_monitor.sh | Sample nvidia-smi to CSV | Pipe to outputs/<tag>/gpu_util.csv |
| scripts/launch_elastic_p2p.sh | Launch all 8 kv_both instances + offload proxy | HEAVY_THRESHOLD, MAX_OFFLOAD env vars |
8. Conclusions & Next Steps
Established findings:
- Full PD separation is net negative for single-machine agentic workloads (KV cache memory wall)
- Cache-aware session-sticky routing is the dominant optimization (+24pp APC, -60% TTFT vs round-robin)
- Elastic P2P offload does NOT improve single-machine performance — Mooncake kv_both memory overhead (+11% TPOT, +37% E2E) outweighs prefill isolation benefit under moderate load (200 req)
- LMetric (OSDI'26) provides modest E2E -4% over linear routing; routing policy headroom is limited
- Experimental methodology matters: warm vs fresh instances cause 2× TTFT difference; all comparisons must use verified fresh restart
Lessons learned:
- Prior cross-machine A/B (commit 1e86285) was invalid: warm baseline inflated by 2× due to residual KV cache state
- kv_role=kv_both has non-trivial always-on overhead even when P2P transfer is not used
- Experiment isolation (kill all → verify GPU free → fresh start) is critical for reproducibility
Open problems:
- Elastic P2P may help under sustained high load (KV cache pressure makes co-located interference worse) — needs 1000-req experiment
- Mooncake kv_both memory overhead quantification and potential lazy initialization
- Multi-machine elastic (P on different node, no memory competition)
- Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)
- scripts/bench.sh standardized harness to prevent future warm-instance mistakes
Updated 2026-05-22. Prior elastic A/B results (commit 1e86285) invalidated — see §3.5 errata.