Gahow Wang baf7ffb08c 16-session contention: TPOT +45% from prefill-decode interference

Key finding: at 16 concurrent sessions (2 per GPU), TPOT p90 degrades
from 0.073 to 0.106 (+45%), with MEDIUM TPOT at 0.197 (+149%).
This is the first time we've reproduced real prefill-decode interference
in controlled experiments.

Elastic RDMA at 16 sessions doesn't help: only 13/500 offloaded (cache-gate
correct for cold turn-1), kv_both adds ~16% TPOT overhead at high concurrency.

Load scaling: 1000req_ts20, 200req_ts10, 200req_ts5, 500req_ts10 all show
~30% GPU util at 8 sessions. The bottleneck is max_inflight_sessions, not
arrival rate.

Updated elastic_hypotheses.md with H8, H9, and comprehensive final analysis.
The real bottleneck is vLLM's chunked prefill scheduling, not routing or
PD disaggregation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-23 05:51:47 +08:00


Elastic Prefill Service: Hypotheses and Validation Log

Date: 2026-05-23

Context: Investigating whether elastic PD disaggregation can improve agentic LLM serving vs pure co-located baseline.

Baseline Reference (8C plain, fresh restart, 200 req)

OK=198/200  TTFT50=1.075  TTFT90=9.384  TPOT90=0.0761  E2E50=5.075
WARM:   TTFT50=0.137  TPOT90=0.061
MEDIUM: TTFT50=0.921  TPOT90=0.079
HEAVY:  TTFT50=4.945  TPOT90=0.076

H1: Mooncake kv_both has significant runtime overhead

Claim: Enabling kv_both mode degrades TPOT even without KV transfer (RDMA threads, ZMQ sockets compete for CPU).

Prior evidence: Earlier elastic P2P experiment showed MEDIUM TPOT 0.079→0.197 (+150%). Attributed to kv_both overhead.

Experiment: Phase 0A (7C kv_both, no offload) vs Phase 0B (7C plain)

Result: TPOT90 = 0.0738 (kv_both) vs 0.0729 (plain) → +1.3%, within noise

Verdict: REJECTED. kv_both shows no measurable runtime overhead in this setup (+1.3%, within noise). The earlier 150% TPOT degradation came from offload-induced interference, not from kv_both itself.

H2: Dedicated Prefill Service (PS) without KV pull improves HEAVY TTFT

Claim: A dedicated PS instance (no sessions) does HEAVY prefill without disrupting C's decode. PS does full cold prefill (no cache), D (session-sticky C) pulls KV and decodes.

Experiment: PS V1 — 1PS + 7C kv_both, always offload HEAVY to PS

Result:

ps_always: OK=195/200, HEAVY TTFT p50=~7.8s (baseline 5.0s, +56%), cascading timeouts
ps_cost: 0 offloads (cost model correctly identifies PS is more expensive)
ps_flexd: OK=172/186 (92.5%), HEAVY TTFT p50=7.8s, 12 ReadTimeout

Root cause: PS has no KV cache for the session → full cold prefill is SLOWER than C's cached prefill. Cost model: full_input/8333 > (input-cached)/8333 + interference held for every request, which is why ps_cost made 0 offloads.

Verdict: REJECTED. PS without KV pull cannot beat cached co-located prefill. The cold prefill overhead + KV transfer time exceeds the interference savings.
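The cost comparison behind this verdict can be sketched as below. This is an illustration, not the router's actual code: the function names are hypothetical, 8333 is assumed to be a prefill throughput in tokens/s, the KV transfer term is taken from the verdict sentence, and the interference and transfer values in the usage note are made-up inputs.

```python
PREFILL_TOKENS_PER_S = 8333  # constant from the cost model; unit assumed to be tokens/s


def ps_cost(input_tokens: int, kv_transfer_s: float) -> float:
    """Estimated cost of offloading to PS: full cold prefill plus KV transfer to D."""
    return input_tokens / PREFILL_TOKENS_PER_S + kv_transfer_s


def colocated_cost(input_tokens: int, cached_tokens: int, interference_s: float) -> float:
    """Estimated cost on the session-sticky C: uncached tokens plus an interference penalty."""
    return (input_tokens - cached_tokens) / PREFILL_TOKENS_PER_S + interference_s


def should_offload_to_ps(input_tokens: int, cached_tokens: int,
                         interference_s: float, kv_transfer_s: float) -> bool:
    """Offload only when the PS path is strictly cheaper than staying co-located."""
    return ps_cost(input_tokens, kv_transfer_s) < colocated_cost(
        input_tokens, cached_tokens, interference_s
    )
```

For example, `should_offload_to_ps(40_000, 30_000, interference_s=1.0, kv_transfer_s=1.1)` returns `False`: the cached tokens alone save more prefill time than the assumed interference costs, so the PS path loses.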

H3: C_s cached prefill + flexible D decode (V2) improves E2E

Claim: C_s (session-sticky, has cache) does fast prefill (max_tokens=1), D (least-loaded C) pulls KV via Mooncake and does decode. Benefits: (1) C_s prefill is fast due to cache, (2) D is least-loaded so decode starts quickly, (3) session migrates to D for better load balance.

Experiment: V2 — 8C kv_both, HEAVY offloaded (C_s prefill → flexible D decode)

Result:

OK=179/185 (96.8%)  TTFT50=0.762 (-29%)  E2E50=4.628 (-9%)  TPOT90=0.0746 (=)
HEAVY: TTFT50=4.794 (≈baseline)  TTFT90=20.4 (+117%)
Routes: 63 HEAVY_OFFLOAD, 51 MEDIUM, 69 WARM
Cache hit on offloaded: mean=3%, median=0% (92% are turn-1 cold)
Prefill: p50=5.0s  D KV pull: p50=1.1s p90=6.7s

Partial validation: E2E p50 improved 9% and TTFT p50 improved 29%, but HEAVY TTFT p90 degraded about 2x and errors rose to 6 (vs 2 in the baseline).

Key finding: 92% of HEAVY requests are turn-1 (zero cache on C_s). C_s does COLD prefill anyway → offload adds pure RDMA overhead (~1.1s) with no cache benefit.

Verdict: PARTIALLY VALIDATED. The architecture works for MEDIUM and WARM (better load balance). But blindly offloading all HEAVY hurts because most are cold.

H4: Only offload HEAVY with high cache hit (cold HEAVY should stay co-located)

Claim: Turn-1 HEAVY requests have zero cache → co-located is faster (no RDMA overhead). Only turn-2+ HEAVY with significant cache hit (>50%) should be offloaded, because:

C_s's prefill is fast (only new tokens computed)
D gets the KV via RDMA (~1.1s, small vs the savings from not waiting for C_s's decode queue)
C_s's decode is not disrupted

Counterintuition: This challenges the conventional PD-sep assumption that "all heavy prefill should be disaggregated." For agentic workloads with high cache reuse (70%+), most of the "heavy" prefix is already cached — the actual compute is MEDIUM-level.

Experiment: TODO — V2 with cache_hit > 50% * input_length gate

Expected:

Turn-1 cold HEAVY stays co-located (no RDMA overhead, same TTFT as baseline)
Turn-2+ cached HEAVY gets offloaded (C_s fast prefill + D least-loaded decode)
Overall: HEAVY TTFT ≈ baseline, HEAVY TPOT improved (D less loaded), fewer errors
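A minimal sketch of the proposed gate, assuming the router knows how many input tokens are already cached on C_s. The function name and signature are hypothetical; only the 50% threshold comes from the experiment description.

```python
def should_offload_heavy(input_tokens: int, cached_tokens: int,
                         min_cache_ratio: float = 0.5) -> bool:
    """Offload a HEAVY request only if more than min_cache_ratio of its input is cached."""
    if input_tokens <= 0:
        return False
    return cached_tokens / input_tokens > min_cache_ratio
```

A turn-1 request (`cached_tokens=0`) always stays co-located under this rule, which is the intended behavior for cold HEAVY.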

H5: RDMA KV transfer overhead (1.1s p50) is too high — should be pipelined

Claim: The 1.1s p50 KV transfer time for HEAVY requests (~40k tokens) seems excessive. At 200Gbps RDMA (25 GB/s), 40k tokens × 96KB/token = 3.75GB → should take ~0.15s. The 7x gap suggests block-by-block transfer without pipelining.
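The back-of-envelope estimate can be reproduced as follows. Reading "40k tokens" as 40 × 1024 tokens and "96KB" as 96 KiB is an assumption, chosen because it yields exactly the 3.75 figure quoted in the claim.

```python
KIB = 1024
GIB = 1024 ** 3

tokens = 40 * 1024               # "40k tokens", read as 40 Ki tokens (assumption)
kv_bytes_per_token = 96 * KIB    # 96 KB/token from the claim, read as KiB (assumption)
link_bytes_per_s = 25e9          # 200 Gbps = 25 GB/s (decimal)
observed_s = 1.1                 # measured p50 KV pull time

total_bytes = tokens * kv_bytes_per_token
ideal_s = total_bytes / link_bytes_per_s

print(f"KV size: {total_bytes / GIB:.2f} GiB")
print(f"ideal transfer time: {ideal_s:.3f} s")
print(f"observed / ideal: {observed_s / ideal_s:.1f}x")
```

With these readings the ideal transfer takes about 0.16 s, so the observed 1.1 s is roughly 7x slower than line rate, consistent with the claim.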

Questions to investigate:

Does Mooncake do layerwise KV transfer? (transfer layer N while computing layer N+1)
Is the 1.1s from RDMA setup overhead, block scatter, or actual bandwidth?
Does vLLM's chunked prefill interact with the transfer (blocks only available after each chunk)?

From Mooncake code: MooncakeConnector does not do layerwise saving (comment in code). All blocks are saved/loaded after the FULL prefill completes. This means:

Prefill must complete entirely before ANY KV transfer starts
D cannot start decode until ALL blocks arrive
No overlap between prefill compute and KV transfer

Potential optimization: Layerwise transfer would allow D to start pulling layer 0's KV while C_s is still computing layer 47's KV. This could reduce the effective transfer latency to near zero (hidden behind compute).
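A toy latency model of this optimization, assuming 48 equally sized layers (inferred from the "layer 47" mention), a single transfer link, and the p50 figures from the V2 run (5.0s prefill, 1.1s KV pull). It ignores chunked prefill and per-transfer setup cost, so it is an upper bound on the benefit, not a prediction.

```python
def sequential_latency(prefill_s: float, transfer_s: float) -> float:
    """Current behavior: transfer starts only after the full prefill completes."""
    return prefill_s + transfer_s


def layerwise_latency(prefill_s: float, transfer_s: float, num_layers: int) -> float:
    """Pipelined behavior: each layer's KV is sent as soon as it is computed."""
    compute_per_layer = prefill_s / num_layers
    transfer_per_layer = transfer_s / num_layers
    compute_done = 0.0
    transfer_done = 0.0
    for _ in range(num_layers):
        compute_done += compute_per_layer
        # A layer can be sent once it is computed and the link is free.
        transfer_done = max(compute_done, transfer_done) + transfer_per_layer
    return transfer_done


seq = sequential_latency(5.0, 1.1)
pipe = layerwise_latency(5.0, 1.1, num_layers=48)
print(f"sequential: {seq:.2f}s  layerwise: {pipe:.2f}s  exposed transfer: {pipe - 5.0:.3f}s")
```

Because per-layer transfer is faster than per-layer compute here, only the last layer's transfer remains exposed (about 1.1s / 48), which is what "hidden behind compute" means in this model.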

Experiment: TODO — Profile actual RDMA transfer time vs setup overhead. Check if start_load_kv() and wait_for_layer_load() APIs support layerwise loading (they exist in the interface but Mooncake doesn't implement them).

H6: Session migration breaks KV cache locality for future turns

Claim: When a HEAVY request is offloaded from C_s to D, session affinity moves to D. D has no prior cache for this session; it holds only the KV for the current turn, transferred via RDMA. Future turns go to D and should hit that transferred KV, but the RDMA-transferred blocks might not be properly registered in D's prefix cache.

Questions:

Does vLLM's prefix cache recognize RDMA-transferred blocks as cacheable?
If yes, future turns on D should have similar APC to staying on C_s.
If no, every turn after migration is a cold start on D.

From vLLM metrics: external_prefix_cache_hits_total counts cross-instance cache hits. If this is > 0 on D after migration, the transferred blocks ARE cacheable.

Experiment: TODO — Track per-instance APC before and after session migration. Check if D's APC for migrated sessions matches expectations.

Summary of Current Understanding

                    Turn 1 (cold)           Turn 2+ (cached)
                    ─────────────           ────────────────
Co-located:         ✅ Best (no overhead)   ⚠️ HEAVY disrupts decode
Offload (V2):       ❌ Adds RDMA overhead   ✅ C_s fast prefill + D load balance

The optimal strategy is hybrid: co-locate cold turn-1, offload cached turn-2+.

This is the key insight for the paper: the offload decision should be cache-aware, not size-based. An 80k-token request with 90% cache hit is effectively an 8k-token prefill: MEDIUM, not HEAVY. The "heaviness" that matters for PD disaggregation is new_tokens_to_compute, not total_input_length.
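A sketch of classifying requests by uncached tokens rather than total length. The WARM/MEDIUM/HEAVY thresholds below are hypothetical placeholders (this log does not state the real boundaries); only the 80k-token, 90%-cached example comes from the text.

```python
def new_tokens_to_compute(total_input_length: int, cached_tokens: int) -> int:
    """Tokens that actually need prefill compute after prefix-cache hits."""
    return max(total_input_length - cached_tokens, 0)


def effective_class(total_input_length: int, cached_tokens: int,
                    medium_min: int = 2_000, heavy_min: int = 20_000) -> str:
    """Classify by uncached tokens; both thresholds are illustrative assumptions."""
    n = new_tokens_to_compute(total_input_length, cached_tokens)
    if n >= heavy_min:
        return "HEAVY"
    if n >= medium_min:
        return "MEDIUM"
    return "WARM"


# 80k-token request with 90% cache hit: 8k new tokens.
print(new_tokens_to_compute(80_000, 72_000), effective_class(80_000, 72_000))
```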

H7: OVERLOAD_FACTOR tuning improves GPU balance

Claim: Lowering OVERLOAD_FACTOR (from 2.0 to 1.5/1.3/1.0) breaks session affinity earlier, improving GPU utilization balance.

Experiment: 4 baseline runs (no Mooncake) with OF=2.0, 1.5, 1.3, 1.0. 200 req each, fresh restart.

Result:

OF=2.0: imbalance=3.71x  TTFT50=1.077  E2E50=5.093
OF=1.5: imbalance=3.45x  TTFT50=1.068  E2E50=5.480
OF=1.3: imbalance=3.96x  TTFT50=1.073  E2E50=5.144
OF=1.0: imbalance=3.47x  TTFT50=1.085  E2E50=5.496

All within noise. APC unchanged (~30%).

Verdict: REJECTED. The imbalance is driven by workload skew (some sessions are inherently heavier), not by sticky routing. The OVERLOAD_FACTOR threshold rarely fires because per-instance load fluctuates too quickly. The hot GPU just rotates to different instances across runs.

Key learning: The root cause of GPU imbalance is at session placement time (turn 1), not at affinity-breaking time (turn 2+). Turn-1 placement uses ongoing_tokens scoring, which is a snapshot that doesn't account for cumulative or future load.

H4 Validated: Cache-gate improves GPU balance but RDMA kills TTFT

Experiment: H4 cache-gate (8C kv_both, offload only when cache_ratio >= 0.3) with GPU profiling.

Result:

                   Baseline        H4 cache-gate
GPU Imbalance:     3.97x           2.04x          ← 2x better balance
GPU Std:           14.9%           6.7%           ← less variance
GPU Max:           63.3%           35.3%          ← no extreme hotspot
HEAVY_COLO TTFT:   7.02s           6.28s          ← -10.5% from better balance!
HEAVY_OFFLOAD TTFT: N/A            11.45s         ← RDMA penalty
OK/N:              198/200         198/200        ← same reliability

Key finding: The 10.5% HEAVY_COLO improvement proves GPU balance → better latency. But the 7 RDMA-offloaded requests (TTFT=11.45s) pull down the aggregate. RDMA transfer is bimodal: 3/7 fast (0.6-1.2s), 3/7 slow (18-31s).
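For reference, the three balance metrics in the table can be computed from per-GPU utilization samples as sketched below. The definitions are assumptions (imbalance as max/min utilization, Std as population standard deviation); the log does not spell them out, and the input values are made up for illustration.

```python
from statistics import pstdev


def gpu_balance(utils_pct: list[float]) -> dict[str, float]:
    """Summarize per-GPU utilization (percent) into imbalance, std, and max."""
    lowest = min(utils_pct)
    return {
        # Assumed definition: ratio of the hottest to the coldest GPU.
        "imbalance": max(utils_pct) / lowest if lowest > 0 else float("inf"),
        "std": pstdev(utils_pct),
        "max": max(utils_pct),
    }


# Hypothetical utilization samples, not measured data.
print(gpu_balance([60.0, 30.0, 20.0, 30.0]))
```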

Current Understanding (updated)

PD-Sep: net negative (memory wall) ← proven
LMetric: ≈ baseline for agentic (session affinity limits routing freedom) ← proven
Elastic P2P (RDMA): net negative on single machine (Mooncake lacks layerwise transfer → RDMA is pure overhead) ← proven
OVERLOAD_FACTOR tuning: no effect (imbalance from workload skew, not routing) ← proven
GPU balance improvement → HEAVY TTFT -10.5%: validated (H4 HEAVY_COLO data)
Load caveat: at time_scale=20 with 200 req the system is only 30% loaded, so no bottleneck is exposed yet. Higher load may reveal more optimization opportunities.

H8: Higher concurrency reveals prefill-decode interference

Claim: At 8 sessions / 8 GPUs, the system is underloaded (30% GPU util). Increasing to 16 sessions should reveal prefill-decode interference.

Experiments:

8 sessions, ts=20, 1000 req: TPOT90=0.073, GPU=30%, imbal=1.5x
16 sessions, ts=10, 500 req: TPOT90=0.106, GPU=~25%, imbal=~3.5x
32 sessions, ts=10, 500 req: (not run yet)

Result:

                8 sessions      16 sessions     Delta
TPOT p90:       0.0729          0.1058          +45%!
WARM TPOT90:    0.0640          0.1301          +103%!
MEDIUM TPOT90:  0.0750          0.1970          +149%!
HEAVY TTFT50:   (varies)        3.399           —
E2E p50:        4.516           5.830           +29%

Verdict: VALIDATED. 16 sessions creates real prefill-decode interference. MEDIUM TPOT degrades 2.5x because HEAVY prefills (via chunked prefill) block decode steps on the same GPU. This is the scenario where PD disaggregation should theoretically help.

H9: Elastic RDMA offload at 16 sessions reduces interference

Claim: At 16 sessions where interference is severe, elastic V2 (C_s prefill + flexible D decode via RDMA) should reduce TPOT by isolating heavy prefill from decode.

Experiment: 16 sessions, 500 req, elastic (kv_both + H4 cache-gate)

Result:

                Baseline 16s    Elastic 16s     Delta
TPOT p90:       0.1058          0.1231          +16% (WORSE)
MEDIUM TPOT90:  0.1970          0.2056          +4% (same)
TTFT p50:       0.828           0.937           +13% (WORSE)
E2E p50:        5.830           6.528           +12% (WORSE)
OK/N:           498/500         498/500         same
Offloaded:      —               13/500 (2.6%)   too few to matter
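The Delta column follows from the two measured columns; a small helper to recompute it (the helper name is ours, the numbers are from the table):

```python
def pct_delta(baseline: float, value: float) -> float:
    """Relative change of value vs baseline, in percent."""
    return (value - baseline) / baseline * 100.0


rows = {
    "TPOT p90": (0.1058, 0.1231),
    "MEDIUM TPOT90": (0.1970, 0.2056),
    "TTFT p50": (0.828, 0.937),
    "E2E p50": (5.830, 6.528),
}
for name, (base, elastic) in rows.items():
    print(f"{name}: {pct_delta(base, elastic):+.0f}%")
```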

Verdict: REJECTED. Elastic at 16 sessions is WORSE, not better. Root causes:

Cache-gate correctly blocks 89% of HEAVY (cold turn-1, cache_ratio=0) → only 13 offloads
kv_both runtime overhead at high concurrency adds ~16% TPOT vs plain baseline
The 13 offloaded requests have TTFT p50=17.5s (RDMA overhead), much worse than colocated 3.5s

Key learning: The RDMA transfer approach cannot solve prefill-decode interference because:

Most HEAVY are cold (no cache to benefit from offload)
Mooncake lacks layerwise transfer (RDMA is pure sequential overhead after prefill)
kv_both has non-zero overhead at high concurrency (unlike Phase 0 at low concurrency, where it was within noise)

Current Understanding (final)

What DOESN'T work for agentic workloads:

PD-Sep: net negative — KV cache memory wall on decode instances
LMetric (OSDI'26): ≈ linear routing — session affinity limits routing freedom
Elastic P2P RDMA offload: net negative — Mooncake transfer overhead, no layerwise pipeline
OVERLOAD_FACTOR tuning: no effect — imbalance from workload skew, not routing
Dedicated Prefill Service (PS): cannot win the cost comparison; without KV pull, PS is always slower than cached C
Cache-gate offload (H4): correct but only 10-12% of HEAVY have cache → limited activation

What DOES work:

Cache-aware session-sticky routing: +24pp APC, -60% TTFT vs round-robin (the dominant optimization)
GPU balance from offload routing: HEAVY_COLO -10.5% TTFT when imbalance reduced (H4 data)

The real bottleneck:

At production-level concurrency (>1 session/GPU), the dominant bottleneck is chunked prefill interference: large HEAVY prefill chunks block decode steps on the same GPU, causing TPOT to degrade 45-149%.

Neither routing nor RDMA-based PD disaggregation solves this. The root cause is vLLM's scheduler design:

Chunked prefill chunk size (max_num_batched_tokens, default 8192) is fixed
Large prefill chunks monopolize the GPU for tens of ms, stalling decode
Reducing chunk size would improve decode responsiveness but increase prefill overhead

Next direction: Adaptive chunked prefill scheduling

Instead of fixed chunk size, dynamically adjust based on decode pressure:

When decode queue is deep: smaller chunks → more decode slots → better TPOT
When decode queue is empty: larger chunks → faster prefill → better TTFT
This is a vLLM scheduler modification, not a routing change
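One possible shape for such a policy, as a standalone sketch. This is not vLLM code and does not use any vLLM API: the function name, the linear interpolation, and the `min_chunk` and `saturation` defaults are all assumptions; only the 8192 upper bound comes from the text above.

```python
def choose_prefill_chunk_tokens(num_decoding_requests: int,
                                min_chunk: int = 512,
                                max_chunk: int = 8192,
                                saturation: int = 32) -> int:
    """Shrink the prefill chunk budget linearly as decode pressure grows.

    num_decoding_requests: requests currently in decode on this GPU.
    saturation: decode depth at which the chunk size bottoms out at min_chunk.
    """
    clamped = min(max(num_decoding_requests, 0), saturation)
    pressure = clamped / saturation  # 0.0 (idle) .. 1.0 (saturated)
    return int(max_chunk - pressure * (max_chunk - min_chunk))


for depth in (0, 8, 16, 32):
    print(depth, choose_prefill_chunk_tokens(depth))
```

An empty decode queue keeps the full 8192-token chunk for fast prefill, while a deep queue falls back to small chunks so decode steps are interleaved more often. Whether a linear ramp is the right curve is an open question for the experiment.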
