Calls out that §3.1 (old random sampler, time-scale compression, 1 req/GPU cap) and the early elastic v3 warm-vs-fresh runs are no longer current, and that the "--max-inflight-sessions 64+" next-step text refers to a flag that was removed and must be restored per FIXES.md §B2 before those numbers can be reproduced. Points readers at §3.6/§3.7 as authoritative.
31 KiB
Milestone Report: Elastic P2P vs PD-Combined Baseline
Date: 2026-05-22 Author: Gahow Wang Status: Phase 1 complete — baseline + elastic validated, system-level analysis done
1. Research Question
For agentic LLM workloads (long input, short output, high KV cache reuse), is prefill-decode disaggregation beneficial? If full PD separation hurts (proven in §3), can selective disaggregation of only heavy requests improve serving latency while preserving KV cache locality?
1.1 Errata / Superseded sections
This report has been revised several times as the methodology matured. The sections below are kept for historical context but their numerical conclusions have been superseded — do not cite them in isolation.
- §3.1 (initial PD-sep vs PD-combined): ran with the old random sampler +
--time-scalecompression +--max-inflight-sessions 8. Cross-session KV reuse dropped from 52% → 16%, and per-GPU concurrency was capped at 1 req/GPU. Superseded by §3.6.- Earlier "elastic v3" warm-vs-fresh runs: baselines were not restarted between trials, leaving residual KV cache that inflated baseline TTFT ~2×. Superseded by the cold-start results in §3.6/§3.7.
- Any reference to running
--max-inflight-sessions 64+: that flag was removed when replay moved to trace-driven dispatch. The next-step experiment requires restoring the flag first (seeFIXES.md§B2 route A) before any production-concurrency numbers can be produced.The authoritative results are in §3.6 and §3.7.
2. Experimental Setup
2.1 Hardware
| Resource | Spec |
|---|---|
| Machine | dash0 / dash1 (identical config) |
| GPU | 8× NVIDIA H20 96GB HBM, NVLink |
| Network | 4× ConnectX-7 200Gbps RDMA |
| Storage | cpfs shared storage across machines |
2.2 Software
| Component | Version | Notes |
|---|---|---|
| vLLM | 0.18.1 (source in third_party/vllm/) |
Patched scheduler assert (see patches/) |
| Mooncake | 0.3.10 | RDMA-based KV transfer between instances |
| Python | 3.x managed by uv |
.venv/ at project root |
| Model | Qwen3-Coder-30B-A3B-Instruct |
MoE 128 experts top-8, 3B active params |
| Model path | ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct |
Same on dash0 and dash1 |
2.3 Workload Trace
| Property | Value |
|---|---|
| Source | GLM-5.1 Agentic Coder, production cluster, 2h window |
| Raw trace | ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl on dash0 |
| Total requests | 2,114,220 |
| Avg input tokens | 33,600 (p50=20k, p90=88k) |
| Avg output tokens | 445 (p50=80) |
| I/O ratio | 75.6× aggregate |
| Prefill token share | 98% |
| KV reuse breakdown | 62% intra-session, 38% cross-session (token-level) |
| Theoretical max APC | 67% (infinite cache, single instance, prefix-only) |
Sampled trace for benchmarks: traces/w600_r0.0015_st30.jsonl (1214 requests, 688 sessions, 70% multi-turn). Generated with window+thin sampling:
python scripts/sample_trace.py \
--input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
--output traces/w600_r0.0015_st30.jsonl \
--sample-ratio 0.0015 --max-single-turn-ratio 0.3 \
--window-seconds 600 --seed 42
| Trace property | Value |
|---|---|
| Sessions | 688 (70% multi-turn, avg 4.9 turns) |
| Requests | 1214 (use --request-limit 850 for daily, full for validation) |
| Avg input tokens | 48,776 |
| Trace span | 2912s (48.5 min); dense segment 0-990s (850 req) |
| Peak QPS | 1.6 req/s (in dense segment) |
| Hash block sharing | 48.3% (vs 52% full trace) |
| Theoretical APC | 80% (full), 76% (first 850 req) |
Sampling methodology (2026-05-23): Prior traces used random session sampling +
--time-scalecompression +--max-inflight-sessionssemaphore, which (a) destroyed cross-session hash block sharing (52% → 16%), (b) artificially limited concurrency to 1 req/GPU, and (c) masked prefill-decode interference. The new approach uses contiguous time-window sampling with session thinning (--max-single-turn-ratio 0.3) to preserve KV reuse patterns, and trace-driven replay with no artificial concurrency limits.
2.4 Two Configurations Compared
Baseline: PD-Combined (8× TP=1 DP=8)
8 independent vLLM instances, 1 GPU each, no Mooncake.
All instances do both prefill and decode.
Global scheduler (cache_aware_proxy.py --combined) handles:
- Session-sticky routing (multi-turn → same instance)
- Load-aware override (if pinned instance > 2× avg load, redirect)
- Cache-hit scoring (prefer instance with matching prefix blocks)
Launch:
# On dash0:
for i in $(seq 0 7); do
MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
--port $((8000+i)) --tp 1 \
--enable-prefix-caching --enforce-eager \
--gpu-memory-utilization 0.9 --max-model-len 200000 \
> /tmp/ab_base_$i.log 2>&1 &
done
python scripts/cache_aware_proxy.py \
--combined http://127.0.0.1:800{0..7} --port 9090
Elastic P2P Offload (8× TP=1 kv_both + selective offload)
8 independent vLLM instances, 1 GPU each, all kv_role=kv_both (Mooncake).
Same global scheduler, plus elastic offload logic:
- Proxy classifies each request: WARM (<5k new), MEDIUM (5-20k), HEAVY (>20k)
- WARM/MEDIUM: co-located on session-sticky instance (no KV transfer)
- HEAVY: prefill on a different instance (P), KV via Mooncake RDMA,
decode on session-sticky instance (D)
- Cap: max 4 concurrent offloads (MAX_OFFLOAD_INFLIGHT)
- P instance selection: round-robin with overload skip
Launch:
# On dash1 (or use scripts/launch_elastic_p2p.sh):
for i in $(seq 0 7); do
VLLM_MOONCAKE_BOOTSTRAP_PORT=$((8998+i)) \
MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
--port $((8000+i)) --tp 1 \
--enable-prefix-caching --enforce-eager \
--gpu-memory-utilization 0.9 --max-model-len 200000 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
> /tmp/ab_elastic_$i.log 2>&1 &
sleep 2 # stagger to avoid NCCL port collision
done
# Wait for bootstrap servers
for bp in $(seq 8998 9005); do
until curl -s localhost:$bp/query > /dev/null 2>&1; do sleep 2; done
done
python scripts/cache_aware_proxy.py \
--combined http://127.0.0.1:800{0..7} \
--bootstrap-ports 8998,8999,9000,9001,9002,9003,9004,9005 \
--offload --heavy-threshold 20000 --port 9090
2.5 Benchmark Parameters
| Parameter | Value |
|---|---|
| Trace | traces/w600_r0.0015_st30.jsonl (window+thin, 70% multi-turn) |
| Daily iteration | --request-limit 850 (~13 min, APC≈76%) |
| Full validation | All 1214 requests (~48 min, APC≈80%) |
| Replay mode | Trace-driven (no session limit, no time compression) |
| Request timeout | 600s |
| vLLM flags | --enforce-eager --enable-prefix-caching --max-model-len 200000 |
| GPU memory util | 0.9 |
| Fresh restart | Both configs started from cold (no warm cache) |
2.6 Reproducing the Benchmark
# Activate environment
cd ~/agentic-kv && source .venv/bin/activate
# Ensure sampled trace exists
python scripts/sample_trace.py \
--input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
--output traces/sampled_1000req_seed42.jsonl \
--target-requests 1000 --seed 42
# Run benchmark (daily iteration)
bash scripts/bench.sh --tag my_experiment --mode baseline --policy linear \
--trace traces/w600_r0.0015_st30.jsonl --requests 850
# Run benchmark (full validation)
bash scripts/bench.sh --tag my_experiment_full --mode baseline --policy linear \
--trace traces/w600_r0.0015_st30.jsonl
3. Results
Errata (2026-05-22): The initial cross-machine A/B (dash0 baseline vs dash1 elastic) reported -44% E2E improvement. Post-hoc analysis revealed the dash0 baseline instances were not freshly restarted — residual KV cache from prior experiments caused 2× TTFT inflation.
Errata (2026-05-23): §3.1 results used artificial concurrency limits (
--max-inflight-sessions 8, 1 req/GPU) and random session sampling that destroyed cross-session KV sharing (52% → 16%). See §3.6 for production-realistic results with corrected methodology.
3.1 Legacy Comparison (artificial 1 req/GPU, 200 req)
| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
|---|---|---|---|---|---|
| Baseline (no Mooncake) | 198/200 | 1.075s | 9.384s | 0.076s | 5.075s |
| LMetric routing | 198/200 | 1.099s | 9.392s | 0.073s | 5.205s |
| Elastic P2P (kv_both) | 195/200 | 1.018s | 11.312s | 0.085s | 6.977s |
3.2 Per-Class Breakdown
Baseline (fresh):
| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|---|---|---|---|---|---|
| WARM (<5k) | 46 | 23% | 0.137s | 0.262s | 0.061s |
| MEDIUM (5-20k) | 50 | 25% | 0.921s | 1.846s | 0.079s |
| HEAVY (20-50k) | 64 | 32% | 2.660s | 6.278s | 0.076s |
| HEAVY (>50k) | 38 | 19% | 9.587s | 30.415s | 0.102s |
Elastic P2P (fresh):
| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|---|---|---|---|---|---|
| WARM (<5k) | 46 | 23% | 0.142s | 0.279s | 0.072s |
| MEDIUM (5-20k) | 50 | 25% | 0.766s | 1.814s | 0.197s |
| HEAVY (>20k) | 99 | 51% | 6.390s | 22.668s | 0.085s |
3.3 Success Rate
| Config | OK | Total | Rate | Failure mode |
|---|---|---|---|---|
| Baseline | 198 | 200 | 99.0% | RemoteProtocolError (replayer-side) |
| Elastic P2P | 195 | 200 | 97.5% | 2× RemoteProtocolError + 3× ReadTimeout on >60k |
Elastic's 3 extra errors are D-side KV pull failures: prefill succeeded on P, KV pushed to Mooncake, but D never produced first token (decode scheduler couldn't allocate KV cache space). Prefill timeout fallback (120s → co-located) was never triggered.
3.4 Routing Policy: Linear vs LMetric (OSDI'26)
LMetric (score = P_tokens × BS, pure per-request, no session affinity) vs Linear (score = ongoing_tokens - α·cache_hit, session-sticky). Both fresh-restart, same trace.
Errata (2026-05-23): Prior LMetric implementation incorrectly shared session-sticky logic with Linear. Fixed to pure per-request routing per OSDI'26 spec:
score = (pending_prefill + new_tokens) × num_requests, no affinity, no overload override. Results below use corrected implementation.
| Policy | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 | Delta E2E |
|---|---|---|---|---|---|
| Linear (session-sticky) | 1.073s | 9.347s | 0.073s | 5.119s | — |
| LMetric (no affinity) | 1.081s | 9.408s | 0.072s | 5.102s | -0.3% |
Key finding: LMetric without explicit session affinity matches Linear with session affinity on all metrics (< 2% difference). The cache-hit term in LMetric's scoring (new_tokens = input - cache_hit) creates implicit soft affinity — instances that already cached a session's prefix get lower P_tokens, naturally attracting subsequent turns. Explicit session-sticky routing is not required; cache-aware load balancing captures it automatically.
APC distribution (LMetric, no affinity): inst_0=60.6%, inst_1=58.3%, inst_2=43.2%, inst_3=28.9%, inst_4=16.6%, inst_5=24.0%, inst_6=13.9%, inst_7=0.0%. Non-uniform but comparable aggregate to Linear's explicit affinity.
3.5 Errata: Why Prior Cross-Machine A/B Was Invalid
The initial comparison (commit 1e86285) reported:
Baseline (dash0): TTFT50=2.383 E2E50=10.232 ← WRONG (warm instances)
Elastic (dash1): TTFT50=1.315 E2E50=5.708
Delta: -45% -44% ← INVALID
Evidence that prior baseline was not fresh:
inst_7APC = 68.3% — impossible from 25 cold-start requests (max ~25%)- WARM TTFT p90 = 3.327s (fresh = 0.262s, 12.7× gap) — indicates KV cache memory pressure from prior experiments
- HEAVY TPOT p90 = 0.154s (fresh = 0.076s, 2.0× gap) — heavy prefill-decode interference from full KV cache
The elastic numbers on dash1 were genuinely fresh. The "improvement" was actually comparing fresh elastic against degraded baseline.
3.6 Production-Realistic Baseline (trace-driven, corrected methodology)
Corrected sampling (window+thin, 70% multi-turn, block sharing 48%) and trace-driven replay (no session limit, no time compression). See §2.3 for trace details.
Linear policy, 912 requests (dense segment), peak QPS ≈ 1.6:
| Metric | Legacy (§3.1, 1 req/GPU) | New (trace-driven) | Delta |
|---|---|---|---|
| TTFT mean | 1.07s | 4.54s | +4.2× |
| TTFT p50 | 1.08s | 0.94s | -13% |
| TTFT p90 | 9.38s | 14.12s | +51% |
| TPOT p50 | 0.038s | 0.070s | +84% |
| TPOT p90 | 0.073s | 0.175s | +139% |
| APC (mean) | ~44% | 67.5% | +23pp |
| Errors | 2/200 (1.0%) | 0/912 (0%) | better |
| E2E p50 | 5.08s | 6.98s | +37% |
Key differences from legacy methodology:
-
APC 67.5% vs 44%: Window+thin sampling preserves cross-session block sharing (48% vs 16% in legacy random sampling), yielding production-realistic cache hit rates. Per-instance APC ranges 46–84%.
-
TPOT +139% at p90: With trace-driven replay, multiple concurrent requests per GPU create real prefill-decode interference. The legacy 1 req/GPU setup showed TPOT p90=0.073s (no interference), but production-realistic load shows TPOT p90=0.175s. This validates that prefill-decode interference is a real problem at production concurrency.
-
TTFT p50 improved (-13%) but mean degraded (+4.2×): Higher APC means cached requests get very fast TTFT (p50=0.94s). But concurrent heavy prefills cause queuing for non-cached requests, inflating the mean and p90.
-
Per-instance APC imbalance (46–84%): Routing quality directly determines per-instance APC. The 38pp gap between worst and best instance suggests routing optimization is still the highest-leverage improvement.
Output: outputs/baseline_r0015_st30/ on dash0.
3.7 Elastic PS vs Baseline (production-realistic trace)
850 requests, w600_r0.0015_st30.jsonl, peak QPS≈1.6. Baseline on dash0, elastic on dash1.
| Metric | Baseline | Elastic PS | Delta |
|---|---|---|---|
| TTFT mean | 4.35s | 4.01s | -7.8% |
| TTFT p50 | 0.94s | 0.93s | -1% |
| TPOT p50 | 0.070 | 0.071 | +2% |
| TPOT p90 | 0.162 | 0.157 | -3.1% |
| E2E p50 | 6.38s | 6.44s | +0.9% |
| APC mean | 60.7% | 59.9% | -0.8pp |
| Errors | 0/850 | 4/832 | 4 ReadTimeout |
Elastic PS is near-neutral. Root cause analysis:
Problem 1: Offload gate too restrictive — only 17/118 HEAVY requests (14%) were offloaded. 75% of HEAVY requests had cache_ratio=0% (cold Turn 1), failing the cache_ratio >= 0.3 gate. The gate was designed to avoid offloading cold requests (full prefill on P is slower than co-located), but this means 86% of HEAVY prefills still interfere with decode.
Problem 2: Offloaded requests are slower (+50.6%) — HEAVY_OFFLOAD TTFT=19.94s vs HEAVY_COLO=13.25s. Breakdown:
- Prefill on P: 14.72s (P also queued, no faster than co-located)
- KV transfer + decode start on D: 5.71s (pure overhead)
Interference is real but unaddressed: 89% of WARM/MEDIUM requests ran concurrently with 1+ HEAVY prefills (up to 60 concurrent). Elastic PS only offloaded 17/118 HEAVY requests — insufficient to reduce interference.
Conclusion: The offload gate (cache_ratio >= 0.3) is correct in principle (cold offload IS slower), but leaves the core problem unsolved. Reducing prefill-decode interference requires either:
- Offloading ALL heavy prefills (accepting higher TTFT for offloaded requests in exchange for lower TPOT for all)
- Chunked prefill scheduling that yields to decode (vLLM-side optimization)
- Dedicated prefill GPUs (full PD separation) if KV memory wall can be solved
Output: outputs/eval_baseline_linear/ on dash0, outputs/eval_elastic_linear/ on dash1.
4. System-Level Analysis
4.1 Elastic P2P Does Not Improve Single-Machine Performance
Under fair comparison (same machine, both fresh):
| Metric | Baseline | Elastic | Delta |
|---|---|---|---|
| TTFT p50 | 1.075s | 1.018s | -5.3% |
| TTFT p90 | 9.384s | 11.312s | +20.5% |
| TPOT p90 | 0.076s | 0.085s | +11.6% |
| E2E p50 | 5.075s | 6.977s | +37.5% |
Elastic is worse on all metrics except TTFT p50. Root causes:
1. Mooncake kv_both memory overhead
Each instance with kv_role=kv_both maintains RDMA buffers + Mooncake bootstrap server, reducing GPU memory available for KV cache. This affects ALL requests (including WARM/MEDIUM that don't use P2P transfer), causing more cache eviction and higher TPOT.
Evidence: MEDIUM TPOT p90 = 0.197s (elastic) vs 0.079s (baseline) — 2.5× worse despite MEDIUM requests not using P2P at all.
2. D-side KV pull failures
3 HEAVY requests completed prefill on P instance successfully but D-side never produced first token. The KV cache on D was too full to allocate space for the transferred blocks. These became 600s timeouts.
3. P2P overhead without proportional benefit
The P2P path adds: prefill queue on P (p50=6.3s) + KV transfer + decode start on D (p50=0.8s). For requests where the D instance isn't under heavy prefill load (which is the case on fresh instances), co-located execution is faster.
4.2 When Elastic P2P Could Help
Elastic P2P is designed for the scenario where D-instance decode is disrupted by co-located heavy prefill. On fresh instances with 200 requests, this contention is moderate. The benefit may emerge under:
- Higher sustained load (1000+ concurrent requests)
- Longer experiment duration (KV cache fills up, eviction pressure increases)
- Multi-machine deployment (P on a different node, no memory competition)
5. Data & Log Locations
5.1 Experiment Outputs (on respective machines)
| Directory | Machine | Config | Notes |
|---|---|---|---|
outputs/ab_baseline/ |
dash0 | Combined 8× TP=1 | |
outputs/ab_elastic/ |
dash0 | Elastic P2P cap=4 | |
outputs/baseline_stability_fresh/ |
dash0 | Combined 8× fresh | Canonical baseline (§3.1) |
outputs/elastic_stability_*/ |
dash0 | Elastic P2P kv_both fresh | Canonical elastic (§3.1) |
outputs/ab_linear/ |
dash0 | Linear policy, 200 req | §3.4 routing policy comparison |
outputs/ab_lmetric/ |
dash0 | LMetric policy, 200 req | §3.4 routing policy comparison |
outputs/gpu_ab_combined/ |
local | Combined 8× TP=1 | Earlier run, has gpu_util.csv |
outputs/gpu_ab_pdsep/ |
local | PD-Sep 4P+4D | Earlier run, has gpu_util.csv |
outputs/exp2_combined_tp1_dp8/ |
local | Combined 8× TP=1 | 1000 req, cache-aware |
outputs/exp3_pd_sep_tp1_mooncake/ |
local | PD-Sep 4P+4D Mooncake | 1000 req |
5.2 vLLM Instance Logs
| Path pattern | Machine | Config |
|---|---|---|
/tmp/ab_base_$i.log |
dash0 | Baseline instances 0-7 |
/tmp/ab_elastic_$i.log |
dash1 | Elastic instances 0-7 |
/tmp/lmetric_ab_inst_$i.log |
dash0 | Linear policy instances 0-7 (§3.6) |
/tmp/lmetric_inst_$i.log |
dash0 | LMetric policy instances 0-7 (§3.6) |
Logs contain Prefix cache hit rate and External prefix cache hit rate lines for APC extraction.
5.3 Trace Data
| Path | Machine | Description |
|---|---|---|
~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl |
dash0 | Full 2h production trace (2.1M requests) |
traces/sampled_1000req_seed42.jsonl |
all | Sampled 1000 requests (gitignored, regenerate with sample_trace.py) |
5.4 Analysis Documents
| File | Content |
|---|---|
analysis/pd_separation_analysis.md |
Main report: PD-Sep vs Combined + Elastic P2P (§5) |
analysis/elastic_offload_design.md |
Elastic P2P design rationale |
analysis/kv_lifecycle_design.md |
KV cache eviction policy analysis |
analysis/adaptive_prefill_offload_design.md |
Initial adaptive offload design (superseded by elastic) |
6. Repository Structure
agentic-kv/
├── analysis/ # Research reports and design docs
│ ├── pd_separation_analysis.md # Main comprehensive report
│ ├── elastic_offload_design.md # Elastic P2P design
│ ├── kv_lifecycle_design.md # Cache eviction analysis
│ └── ...
├── replayer/ # Trace replay framework
│ ├── __main__.py # CLI entry: python -m replayer
│ ├── replay.py # Async replayer (session-aware, SSE streaming)
│ ├── trace.py # TraceRequest dataclass, session/hash_id handling
│ └── metrics.py # RequestMetrics, crash-safe JSONL sink
├── scripts/
│ ├── cache_aware_proxy.py # Global scheduler (combined + PD-sep + elastic offload)
│ ├── sample_trace.py # Cluster-to-machine trace sampler
│ ├── launch_vllm.sh # Launch combined TP=8
│ ├── launch_pd_mooncake.sh # Launch PD-Sep with Mooncake
│ ├── launch_elastic_p2p.sh # Launch elastic P2P (8× kv_both + offload proxy)
│ ├── run_experiments.sh # Full experiment matrix (combined/PD-sep)
│ ├── run_benchmark.sh # Single benchmark run
│ ├── gpu_monitor.sh # GPU utilization sampler (5s CSV)
│ ├── compute_roofline.py # Prefill/decode roofline analysis
│ ├── analyze_*.py # Various analysis scripts
│ └── compare_*.py # Experiment comparison scripts
├── patches/
│ ├── 0001-fix-kv-transfer-abort-race.patch
│ └── README.md
├── third_party/vllm/ # vLLM 0.18.1 source (with patch applied)
├── outputs/ # Experiment results (gitignored)
├── traces/ # Sampled traces (gitignored)
├── TODO.md # Original research goals
└── REPORT.md # This milestone report
7. Key Scripts Reference
| Script | What it does | Key flags |
|---|---|---|
scripts/cache_aware_proxy.py |
Global scheduler + elastic offload proxy | --combined, --offload, --policy {linear,lmetric}, --heavy-threshold, --bootstrap-ports |
scripts/run_lmetric_ab.sh |
A/B: linear vs lmetric routing policy | Runs both experiments with fresh restart |
scripts/run_elastic_stability_test.sh |
Elastic vs baseline with full isolation | Fresh start/stop per experiment |
scripts/bench.sh |
Standard single-experiment harness | --tag, --mode {baseline,elastic} |
scripts/sample_trace.py |
Sample complete sessions from cluster trace | --target-requests, --seed |
python -m replayer |
Replay trace against vLLM endpoint | --time-scale, --max-inflight-sessions, --request-limit |
scripts/gpu_monitor.sh |
Sample nvidia-smi to CSV | Pipe to outputs/<tag>/gpu_util.csv |
scripts/launch_elastic_p2p.sh |
Launch all 8 kv_both instances + offload proxy | HEAVY_THRESHOLD, MAX_OFFLOAD env vars |
8. GPU Load Imbalance & Elastic Prefill Service Analysis
8.1 Load Imbalance Characterization
Session-sticky routing creates token load imbalance across instances. The severity depends on scale:
| Scale | Imbalance | Top 5 sessions | Cause |
|---|---|---|---|
| 200 req (143 sessions) | 8.6× tokens | 49% of all tokens | Small sample, few dominant sessions |
| 1000 req (668 sessions) | 1.24× tokens | 29% of all tokens | More sessions → natural averaging |
At 1000 requests, the heaviest instance has 4.5M tokens vs lightest 3.6M (1.24×). Despite this, TPOT is uniform across all instances (0.070–0.077), confirming that prefill-decode interference is minimal at ≤1 session/GPU. The imbalance manifests in TTFT only: heaviest 2 instances TTFT p50 = 1.42s vs lightest 2 at 0.83s (1.7× gap).
8.2 Session Accumulation Pattern
Agentic workloads produce long-lived sessions with growing context:
| Session | Turns | Total Tokens | Context Growth |
|---|---|---|---|
| 1569319 | 36 | 2.32M | 27k → 98k (+2.0k/turn) |
| 1206593 | 36 | 2.31M | 15k → 106k (+2.6k/turn) |
| 178176 | 25 | 1.93M | 36k → 95k (+2.5k/turn) |
Top 5 sessions = 29% of all tokens. With session-sticky, these lock their instances, creating persistent load hotspots.
8.3 Benchmark Concurrency vs Production Reality
Critical caveat: All prior experiments used
--max-inflight-sessions 8(1 session/GPU). This is 10–15× below production concurrency and masks the interference that elastic PS is designed to solve.
| Our Benchmark | Production Estimate | |
|---|---|---|
| Concurrent requests/GPU | 1–2 | 8–15 |
| KV cache usage/GPU | 26–28% (single req) | 80–100% |
| Prefill-decode interference | Minimal | Significant |
KV cache capacity: 281,888 tokens/GPU (25.8 GiB). A single 75k-token request consumes 27% of KV cache. At production concurrency (~15 req/GPU), KV cache is near-full, triggering eviction, cache misses, and prefill queuing — none of which appear in our 1-req/GPU benchmark.
Measured interference scaling:
| Concurrency | TPOT p90 | vs 8-session |
|---|---|---|
| 8 sessions (1/GPU) | 0.075s | baseline |
| 16 sessions (2/GPU) | 0.103s | +38% |
| Production (~15/GPU) | not tested | expected >>+45% |
8.4 Elastic PS: Two Capabilities Re-Evaluated
Capability 1: Reduce prefill-decode interference (lower TPOT)
At 1 req/GPU (our benchmark): no interference, no benefit. But this is an artifact of unrealistically low concurrency. At ≥2 req/GPU, chunked prefill interrupts decode steps, causing TPOT +38–45%. At production concurrency (~15/GPU), multiple HEAVY prefills sharing a GPU with decode requests would cause severe interference. Elastic PS's ability to isolate heavy prefills on separate GPUs directly addresses this.
Capability 2: Session migration for load balancing
Elastic PS enables mid-session migration: prefill on original instance (cache hit), KV transfer to a different instance for decode + future turns. This provides two benefits:
-
Break session lock-in: Agentic sessions grow +2k tokens/turn over 30+ turns. With session-sticky, a 36-turn session (2.3M tokens total) locks its GPU, creating a hotspot. Elastic PS lets the session migrate to a less-loaded GPU while preserving cache on the original (PS does fast cached prefill, new GPU decodes).
-
Rebalance without cache loss: Unlike breaking affinity (which destroys cache), elastic PS migration preserves the prefix cache on the original instance — the PS re-uses it for fast prefill, then transfers only the new KV to the destination.
Simulation of migration strategies (1000 req, at current low concurrency):
| Strategy | Imbalance | Migrations | KV Transfer Overhead |
|---|---|---|---|
| No migration | 1.24× | 0 | 0s |
| Every 10 turns | 1.21× | 10 | 15s |
| Every 5 turns | 1.20× | 20 | 30s |
At 1 req/GPU, migration benefit is marginal (imbalance is only 1.24×). At production concurrency where imbalance combines with KV cache pressure and interference, the benefit would be substantially larger.
Capability 3: Soft affinity from cache-hit scoring
The corrected LMetric experiment (§3.4) reveals that explicit session affinity is unnecessary. Cache-hit scoring (new_tokens = input − cached) creates implicit soft affinity — instances with cached prefixes score lower, naturally attracting subsequent turns. This matches hard session-sticky on all metrics (< 2% difference) while providing more flexible load balancing.
8.5 Elastic PS Verdict
| Aspect | At 1 req/GPU (tested) | At production load (expected) |
|---|---|---|
| TPOT reduction | 0% (no interference) | Significant (interference scales with concurrency) |
| Session migration | Marginal (1.24× → 1.20×) | Larger benefit (KV pressure + interference amplify imbalance) |
| Cache preservation | N/A | Key advantage vs breaking affinity |
At our benchmark concurrency (1 req/GPU), elastic PS is not justified — Mooncake overhead exceeds the non-existent interference benefit. But our benchmark is 10–15× below production load. The real question is whether elastic PS helps at production-realistic concurrency (64–128 concurrent sessions, 8–15 req/GPU), where:
- Prefill-decode interference is measurable (already +38% TPOT at just 2/GPU)
- KV cache pressure causes eviction and queue delays
- Session accumulation creates compounding hotspots
- Heavy prefills (50–100k tokens) block decode for seconds
Next step: run --max-inflight-sessions 64 benchmark to test elastic PS at production-realistic concurrency.
9. Conclusions & Next Steps
Established findings:
- Full PD separation is net negative for single-machine agentic workloads (KV cache memory wall)
- Cache-aware routing is the dominant optimization (+24pp APC, -60% TTFT vs round-robin)
- Explicit session affinity is unnecessary — cache-hit scoring creates implicit soft affinity that matches hard session-sticky (< 2% difference)
- At low concurrency (1 req/GPU), elastic P2P offload adds overhead without benefit
- Our benchmark concurrency is 10–15× below production:
--max-inflight-sessions 8yields 1 req/GPU, masking prefill-decode interference that appears at ≥2 req/GPU (+38% TPOT) and would dominate at production load (~15 req/GPU) - Experimental methodology matters: warm vs fresh instances cause 2× TTFT difference
Lessons learned:
- Prior cross-machine A/B (commit
1e86285) was invalid — warm baseline inflated by 2× - Prior LMetric implementation was invalid — incorrectly shared session-sticky logic with Linear
kv_role=kv_bothhas non-trivial always-on overhead even when P2P transfer is not used- Experiment isolation (kill all → verify GPU free → fresh start) is critical for reproducibility
- Benchmark concurrency must match production — sub-production concurrency hides interference effects that drive system design decisions
Open problems (priority ordered):
- Production-concurrency benchmark (
--max-inflight-sessions 64+): Validate whether prefill-decode interference at 8–15 req/GPU makes elastic PS net-positive - Multi-machine elastic: P on a different node eliminates GPU memory competition — the main cost that makes single-machine elastic net negative
- Layerwise KV transfer: Mooncake's block-level transfer after full prefill is the bottleneck. Layerwise pipelining could reduce transfer latency by overlapping with computation
- Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)
Updated 2026-05-23. LMetric corrected (§3.4 errata). GPU imbalance analysis added (§8). Benchmark concurrency gap identified — production-load experiments needed.