Bug 1+5: D instance had no accounting during prefill phase (7-11s window).
Router saw D as idle, routing extra traffic that caused KV allocation failures.
Fix: reserve D's ongoing_tokens+num_requests at offload decision time.
Bug 7: No cap on concurrent offloads despite REPORT claiming MAX_OFFLOAD=4.
Fix: add MAX_OFFLOAD_INFLIGHT=4 check before offloading.
Bug 6: Session affinity migrated to D but proxy cache estimator wasn't
updated for D. Future turns scored D as cache-cold.
Fix: call d_inst.record_prefix(token_ids) after successful decode.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Random session sampling destroys cross-session hash block sharing
(52% -> 16%) because sessions sharing system prompts get scattered.
New approach: take a contiguous time window from the trace (preserving
temporal locality of shared-prefix sessions), then thin within the
window to hit target QPS. This preserves both intra-session reuse
(62% of reusable tokens) and cross-session sharing (38%).
Results (block sharing rate):
Old random r=0.002: 16.0% -> Window+thin: 29.7%
Old random r=0.016: 19.5% -> Window+thin: 42.7%
Full trace baseline: 52%
Also corrected the "91% intra-session" claim: actual split is
62% intra / 38% cross (token-level), making cross-session sharing
preservation critical for valid APC benchmarks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The replayer was artificially limiting concurrency with --max-inflight-sessions
(semaphore) and --time-scale (time compression), producing unrealistically low
1 req/GPU load that masked prefill-decode interference.
Replayer changes:
- Remove session_sem and time_scale entirely
- Each request dispatched at its trace timestamp exactly
- Sessions still sequential (turn N+1 waits for turn N completion)
- If turn completes late, next turn fires immediately
Sampler changes:
- Add --sample-ratio for GPU-proportional session sampling
- Keep --target-requests for backwards compat
- No time compression (preserve original arrival pattern)
bench.sh: remove --time-scale and --max-inflight-sessions args
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LMetric was incorrectly sharing session-sticky logic with Linear policy.
Fixed to pure per-request routing: score = P_tokens × BS where
P = pending_prefill + (input - cache_hit), BS = num_requests.
Experiment result (200 req, fresh restart): Linear vs corrected LMetric
show <2% difference on all metrics — LMetric's cache-hit estimation
provides implicit soft affinity that preserves locality without explicit
session stickiness.
Also fix bench.sh missing cd (replayer module not found from non-project
cwd) and rewrite run_lmetric_ab.sh as thin wrapper around bench.sh to
eliminate duplicated launch/cleanup logic that broke under set -euo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU
imbalance (~3.5-4x across all settings). Root cause: imbalance is from
workload skew at session placement (turn 1), not from routing at turn 2+.
H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x),
and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded
requests have bimodal transfer times (0.6s or 18-31s) that negate the
routing benefit.
Updated elastic_hypotheses.md with H7 results and next directions:
higher load experiments where contention amplifies routing differences.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same-condition comparison (both fresh restart, same trace, same params):
Baseline (combined): TTFT=2.383/27.622 TPOT90=0.117 E2E=10.232
Elastic P2P (cap=4): TTFT=1.315/13.179 TPOT90=0.075 E2E=5.708
Delta: -45% / -52% -36% -44%
Key finding: TPOT p90 dropped 36% — confirming heavy prefill DOES
disrupt decode in combined mode, and elastic offload effectively
isolates it. Previous comparisons missed this because baselines
were run under different conditions (stale instances, different time_scale).
GPU util: elastic uses less GPU (15.8% vs 28.7%) but achieves better
latency — higher efficiency through better cache distribution.
APC: elastic has more balanced per-instance APC (36-38% prefix + 30-35%
external) vs baseline's skewed distribution (3.8% - 68.3%).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixed offload decision: removed p>=d gate (was blocking all offloads),
added MAX_OFFLOAD_INFLIGHT=4 cap and p_saturated threshold.
Result (200 req, fresh restart):
Baseline: 99% success, TTFT=1.080/9.410, TPOT90=0.076, E2E=5.306
Elastic: 96% success, TTFT=0.946/15.843, TPOT90=0.077, E2E=5.717
Architectural tradeoff confirmed:
- Median (p50) improves: D instances not disrupted by heavy prefill
- Tail (p90) worsens: offloaded HEAVY requests pay KV transfer cost
- TPOT unchanged: decode isolation is not the bottleneck
To improve p90: need layerwise pipelined KV transfer (overlap with prefill
compute) or smarter offload gating that avoids offloading the very largest
requests (which have the longest prefill time and generate the most KV).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Design: offload HEAVY prefill only when P instance is less loaded than D
AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D
for future KV reuse. External KV correctly registered in prefix cache.
Result (67/200 processed, 75% success):
TTFT p50: 0.551s (-49% vs baseline 1.080s)
TTFT p90: 4.135s (vs baseline 9.410s, -56%)
TPOT p90: 0.074s (same as baseline)
E2E p50: 2.938s (-45% vs baseline 5.306s)
25% error rate from ReadTimeout on very large HEAVY requests queuing on P.
Needs stricter elastic gate or higher timeout. But successful requests
show significant improvement over both baseline and previous P2P.
Also: added external_prefix_cache metrics tracking to replayer summary.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Investigation confirms vLLM Mooncake connector DOES correctly register
externally-received KV blocks in the prefix cache. No bug exists.
Evidence from vLLM logs (per-instance):
inst_1: prefix_cache=14.7%, external_cache=72.1% <- high external hit
inst_4: prefix_cache=52.4%, external_cache=59.0%
The 0.5% aggregate APC from /metrics was a measurement artifact:
inst_0 received 718M query tokens (cold-start prefills) with 0% hit,
diluting the aggregate. D-instances have 20-72% external cache hit.
The /metrics endpoint's prefix_cache_hits_total counter does not include
external hits. The vLLM log's "External prefix cache hit rate" is the
correct metric for Mooncake-transferred KV reuse.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixed race condition in P instance selection (all going to inst_0).
P2P design: HEAVY requests prefill on least-loaded OTHER instance,
KV transfer via Mooncake, decode on session-sticky instance.
Result (200 req, fresh restart, vs baseline):
TTFT p50: 1.080 -> 0.939 (-13%) <- median improves (decode not disrupted)
TTFT p90: 9.410 -> 14.987 (+59%) <- tail worsens (KV transfer on large req)
TPOT p90: 0.076 -> 0.075 (-1%) <- unchanged (not the bottleneck)
E2E p50: 5.306 -> 5.565 (+5%) <- slightly worse overall
The P2P offload helps the common case (WARM/MEDIUM get lower TTFT because
their instance isn't blocked by a heavy prefill) but hurts HEAVY requests
(extra KV transfer latency). This is a median-vs-tail tradeoff.
For SLOs targeting p50: P2P offload helps.
For SLOs targeting p90/p99: baseline combined is better.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Evidence-backed analysis with per-request matched comparison:
1. KV CACHE MEMORY WALL (Evidence 3)
Combined: 12% KV cache per instance (comfortable)
PD-Sep 6P+2D: 48-97% on decode instances (saturation -> 100s waits)
2. KV TRANSFER OVERHEAD (Evidence 4, matched requests)
Mean 1.79s extra TTFT per request, 3.3x slower overall
Small requests (<5k) hit 8.0x ratio (transfer dominates prefill)
Large requests (>50k) hit 1.3x ratio (prefill dominates)
3. SESSION AFFINITY BROKEN (Evidence 5)
Combined: turn N+1 hits same GPU -> 80% multi-turn APC
PD-Sep: turn N+1 prefill on P has NO prior KV (sent to D) -> 0% APC on P
Must re-prefill + re-transfer on every turn
4. GPU UNDERUTILIZATION (Evidence 2)
PD-Sep: 12-17% GPU util (decode is memory-bound, wastes GPU compute)
Combined: 28-54% GPU util (flexible P+D on same GPU)
Root cause: agentic workloads break PD-Sep's assumptions (short input,
no prefix sharing, compute-heavy prefill) with long context, 91%
intra-session KV reuse, and lightweight MoE compute.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TP=2 DP=4 with hybrid routing achieves TTFT p50=0.611s (-43% vs TP=1),
the best TTFT across all tested configurations. But TPOT p90=0.109s
(+51% vs TP=1) due to cross-GPU all-reduce in decode.
Full comparison across 7 configurations shows two Pareto-optimal points:
TP=1 DP=8 hybrid: best TPOT (0.072s), good TTFT (1.064s)
TP=2 DP=4 hybrid: best TTFT (0.611s), acceptable TPOT (0.109s)
The choice depends on SLO:
TTFT-sensitive (interactive) -> TP=2 DP=4
TPOT-sensitive (streaming) -> TP=1 DP=8
All PD-Sep configurations are strictly dominated by one of these two.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implemented --offload mode: HEAVY requests (>20k new tokens) get P on
least-loaded instance, KV via Mooncake RDMA, D on session-sticky instance.
WARM/MEDIUM stay co-located (no KV transfer). All 8 instances run kv_both.
Result (200 req, same instances, fresh restart):
Baseline (no offload): TTFT=1.073 TPOT90=0.074 E2E=5.086
Offload HEAVY: TTFT=1.462 TPOT90=0.077 E2E=6.847
Delta: +36% +4% +35%
Conclusion: even selective KV transfer (only 44% of requests) adds more
overhead than the isolation benefit provides. On single-machine 8 GPU,
PD-combined with hybrid routing is strictly optimal. No form of KV
transfer — full PD-sep, selective offload, or otherwise — improves
over co-located serving for this workload.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Balanced session-sticky routing improves APC from 44.7% to 48.7% (+4pp,
close to simulated 49.2%) but TTFT worsens by 30% and E2E by 23%.
Root cause: session-sticky creates load hotspots — some instances get
multiple heavy concurrent sessions, causing queue delays, despite higher
per-instance APC.
Key finding: APC optimization and latency optimization are in tension.
- Cache affinity (sticky) -> higher APC, worse load balance -> worse latency
- Load-based routing (old) -> lower APC, better load balance -> better latency
The optimal design must balance both dimensions, not optimize one at
the expense of the other.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Routing fix: new sessions placed by cumulative token load (greedy bin
packing) with cache-hit tiebreak. Session affinity for turn 2+.
Replayer now sends X-Session-Id header for proper session tracking.
Agentic workload core patterns (GLM-5.1 trace):
- 91% of reusable KV is intra-session (not cross-session)
- Session-sticky routing is THE critical optimization
- 36% warm requests (1.3k new tokens), 64% cold (17k+)
- After cache: effective prefill/decode ratio drops from 61.5x to 28.7x
- Cross-session sharing (system prompt) is only 4.8% of tokens
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
With balanced session-sticky routing:
LRU APC = 49.2% (only 1.8pp below infinite 51.0%)
LFU APC = 43.5% (worse than LRU!)
SessionProtLRU = 49.0% (no improvement)
The previous 10.1pp gap was from routing imbalance (all traffic to inst_0),
not from cache eviction policy. Balanced routing recovers 5.9pp of the gap.
Multi-turn sessions get 80.1% APC with simple LRU + session-sticky routing
because inter-turn gap is only 2 requests (LRU naturally keeps it warm).
Conclusion: fix routing balance, not cache policy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause of 10.1pp APC gap: multi-turn sessions' KV evicted between
turns by cold-start prefills (66% of loss). Inter-turn gap is only 2
requests p50, but LRU cache (550 blocks) can't protect 93 blocks/session
across 14-21 concurrent sessions.
Three approaches designed:
A. Session-sticky routing with KV reservation (proxy-only, no vLLM change)
B. Two-tier KV cache: GPU + DRAM offload via Mooncake
C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch)
Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds,
then implement Approach A (lowest effort, immediate benchmark).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added --heavy-threshold to cache_aware_proxy.py. HEAVY requests (new
tokens >= threshold) route to instance with least decode load; WARM/MEDIUM
route by cache-hit + token-level LB as before.
Result: no significant difference vs baseline on single-machine combined mode.
TTFT: +1.2%, TPOT: -1.5%, E2E: -0.3% (all within noise)
Per-class TTFT breakdown shows the optimization target:
WARM (75 req): p50=0.198s (cache hit, nearly free)
MEDIUM (72 req): p50=1.356s
HEAVY (54 req): p50=7.124s (36x slower than WARM)
Conclusion: single-machine combined mode already distributes load well
enough that adaptive routing adds no benefit. True isolation of HEAVY
prefills requires cross-machine offload (v2 with Mooncake or multi-node).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Breakdown profiling at proxy level captures:
t_proxy_recv → t_prefill_sent → t_prefill_done → t_decode_sent → t_first_token
Key finding: 87.7% of TTFT is spent in kv+decode phase, NOT prefill.
Root cause: decode instance KV cache memory saturation (97.1% usage).
With 6P+2D config, 2 decode GPUs have only ~56GB total KV cache.
Large agentic requests (avg 33.6k tokens) fill this quickly.
Small requests (49 tokens, prefill=0.044s) wait 114s for KV cache
to be freed by large requests completing decode.
vLLM log confirms: Running=0, Waiting=6, KV cache=97.1%
GPU is idle but requests queue for KV cache memory, not compute.
This is the fundamental bottleneck of single-machine PD separation
for long-context agentic workloads: concentrating decode onto fewer
GPUs creates a KV cache memory wall.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added --fire-and-forget flag to cache_aware_proxy.py for async prefill dispatch.
Results on 6P+2D config:
Await: TTFT=1.48s TPOT=0.066s E2E=5.95s 94% success
FnF: TTFT=5.32s TPOT=0.037s E2E=11.9s 85% success
Fire-and-forget improves TPOT by 44% (pipeline overlap) but degrades
TTFT by 260% (decode internally waits for KV, less efficiently than
proxy-level await) and increases errors from KV race conditions.
Full 4-way ablation summary in analyze_ablations.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6P+2D gives more GPUs to prefill, fewer to decode:
- Decode util: 7.8% (4D) -> 19.0% (2D), less waste
- TTFT: 1.99s (4P) -> 1.48s (6P), -26% from less prefill queuing
- But Combined (30.5% util, TTFT 1.01s) still best overall
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Systematic study of prefill-decode disaggregation for agentic LLM workloads
using production GLM-5.1 coder trace (2.1M requests, 71B input tokens).
Key findings:
- Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7%
without PD separation, matching PD-Sep's decode isolation benefit
- PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain
when using the same cache-aware scheduler
- Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x
vs decode AI <2), but absolute FLOPs drop 71% from cache hits
- For agentic MoE workloads, cache-aware routing > PD separation
Infrastructure:
- Trace sampler preserving session structure + hash_ids for prefix sharing
- Async trace replayer with streaming TTFT/TPOT/E2E measurement
- Unified cache-aware + token-level load-balanced global scheduler proxy
supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes
- vLLM 0.18.1 scheduler patch for KV transfer abort race condition
- Roofline analysis tool for prefill/decode compute characterization
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>