Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired after cc6e562 / 4c583f2; point readers to docs/migration-policy-design.md.
- docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison".
- analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity.
- analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction.
- analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Elastic Prefill Service: Hypotheses and Validation Log

**Date**: 2026-05-23
**Context**: Investigating whether elastic PD disaggregation can improve agentic LLM serving vs pure co-located baseline.

## Baseline Reference (8C plain, fresh restart, 200 req)

```
OK=198/200 TTFT50=1.075 TTFT90=9.384 TPOT90=0.0761 E2E50=5.075
  WARM:   TTFT50=0.137 TPOT90=0.061
  MEDIUM: TTFT50=0.921 TPOT90=0.079
  HEAVY:  TTFT50=4.945 TPOT90=0.076
```

---

## H1: Mooncake kv_both has significant runtime overhead

**Claim**: Enabling kv_both mode degrades TPOT even without KV transfer (RDMA threads and ZMQ sockets compete for CPU).

**Prior evidence**: Earlier elastic P2P experiment showed MEDIUM TPOT 0.079→0.197 (+150%). Attributed to kv_both overhead.

**Experiment**: Phase 0A (7C kv_both, no offload) vs Phase 0B (7C plain)

**Result**: TPOT90 = 0.0738 (kv_both) vs 0.0729 (plain) → **+1.2%, within noise**

**Verdict**: **REJECTED**. kv_both adds no measurable runtime overhead at this load. The earlier 150% TPOT degradation came from offload-induced interference, not from kv_both itself. (H9 later measures a non-zero kv_both cost at 16 sessions.)

---

## H2: Dedicated Prefill Service (PS) without KV pull improves HEAVY TTFT

**Claim**: A dedicated PS instance (holding no sessions) runs HEAVY prefill without disrupting decode on the C instances. PS does a full cold prefill (no cache); D (the session-sticky C) then pulls the KV and decodes.

**Experiment**: PS V1: 1PS + 7C kv_both, always offload HEAVY to PS

**Result**:
- `ps_always`: OK=195/200, HEAVY TTFT p50=~7.8s (baseline 5.0s, **+56%**), cascading timeouts
- `ps_cost`: 0 offloads (cost model correctly identifies PS is more expensive)
- `ps_flexd`: OK=172/186 (92.5%), HEAVY TTFT p50=7.8s, 12 ReadTimeout

**Root cause**: PS has no KV cache for the session, so its full cold prefill is slower than C's cached prefill. The cost-model inequality `full_input/8333 > (input-cached)/8333 + interference` held for every request in this run, which is why `ps_cost` made 0 offloads.

**Verdict**: **REJECTED**. PS without KV pull cannot beat cached co-located prefill. The cold-prefill overhead plus KV transfer time exceeds the interference savings.

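The cost comparison in the root cause can be sketched as below. This is a minimal illustration, not the router's actual code: it assumes the `8333` constant is prefill throughput in tokens per second (the log does not say), and the function names and the `interference_s` parameter are hypothetical. With a 70% cache hit the co-located path wins easily; PS only wins when almost nothing is cached and interference is non-trivial.

```python
# Assumption: 8333 in the log's inequality is prefill throughput (tokens/s).
PREFILL_TOKENS_PER_S = 8333.0


def ps_cost_s(input_tokens: int) -> float:
    """Cold prefill on PS: every input token is recomputed."""
    return input_tokens / PREFILL_TOKENS_PER_S


def colocated_cost_s(input_tokens: int, cached_tokens: int,
                     interference_s: float) -> float:
    """Cached prefill on C: only uncached tokens are computed, plus a
    (hypothetical) penalty for disturbing decode on the same GPU."""
    new_tokens = max(input_tokens - cached_tokens, 0)
    return new_tokens / PREFILL_TOKENS_PER_S + interference_s


def should_offload_to_ps(input_tokens: int, cached_tokens: int,
                         interference_s: float) -> bool:
    """Offload only when PS is strictly cheaper than staying co-located."""
    return ps_cost_s(input_tokens) < colocated_cost_s(
        input_tokens, cached_tokens, interference_s)


if __name__ == "__main__":
    # 40k-token request, 70% cached on C, 1.0 s assumed interference.
    print(should_offload_to_ps(40_000, 28_000, 1.0))
    # Same request fully cold on C, 0.5 s assumed interference.
    print(should_offload_to_ps(40_000, 0, 0.5))
```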
---
## H3: C_s cached prefill + flexible D decode (V2) improves E2E

**Claim**: C_s (session-sticky, has cache) does fast prefill (max_tokens=1), D (least-loaded C) pulls KV via Mooncake and does decode. Benefits: (1) C_s prefill is fast due to cache, (2) D is least-loaded so decode starts quickly, (3) session migrates to D for better load balance.

**Experiment**: V2: 8C kv_both, HEAVY offloaded (C_s prefill → flexible D decode)

**Result**:
```
OK=179/185 (96.8%) TTFT50=0.762 (-29%) E2E50=4.628 (-9%) TPOT90=0.0746 (=)
  HEAVY: TTFT50=4.794 (≈baseline) TTFT90=20.4 (+117%)
  Routes: 63 HEAVY_OFFLOAD, 51 MEDIUM, 69 WARM
  Cache hit on offloaded: mean=3%, median=0% (92% are turn-1 cold)
  Prefill: p50=5.0s   D KV pull: p50=1.1s p90=6.7s
```

**Partial validation**: E2E p50 improved 9% and TTFT p50 improved 29%. But HEAVY TTFT p90 degraded by about 2x, and errors rose to 6 (vs 2 in the baseline).

**Key finding**: 92% of HEAVY requests are turn-1 (zero cache on C_s). C_s does a COLD prefill anyway → offload adds pure RDMA overhead (~1.1s) with no cache benefit.

**Verdict**: **PARTIALLY VALIDATED**. The architecture works for MEDIUM and WARM (better load balance). But blindly offloading all HEAVY hurts because most are cold.

---

## H4: Only offload HEAVY with high cache hit (cold HEAVY should stay co-located)

**Claim**: Turn-1 HEAVY requests have zero cache → co-located is faster (no RDMA overhead). Only turn-2+ HEAVY with significant cache hit (>50%) should be offloaded, because:
- C_s's prefill is fast (only new tokens computed)
- D gets the KV via RDMA (~1.1s, small vs the savings from not waiting for C_s's decode queue)
- C_s's decode is not disrupted

**Counterintuition**: This challenges the conventional PD-sep assumption that "all heavy prefill should be disaggregated." For agentic workloads with high cache reuse (70%+), most of the "heavy" prefix is already cached, so the actual compute is MEDIUM-level.

**Experiment**: TODO: V2 with a `cache_hit > 0.5 * input_length` gate

**Expected**:
- Turn-1 cold HEAVY stays co-located (no RDMA overhead, same TTFT as baseline)
- Turn-2+ cached HEAVY gets offloaded (C_s fast prefill + D least-loaded decode)
- Overall: HEAVY TTFT ≈ baseline, HEAVY TPOT improved (D less loaded), fewer errors

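The proposed gate can be sketched as follows. This is an illustrative sketch only: the `Request` type, `cache_ratio`, and `route_heavy` are hypothetical names, and the strict `>` comparison mirrors the `cache_hit > 0.5 * input_length` condition in the experiment line above. A cold turn-1 request stays co-located; a request with 70% of its input cached is offloaded.

```python
from dataclasses import dataclass


@dataclass
class Request:
    input_tokens: int    # total prompt length
    cached_tokens: int   # prefix tokens already cached on C_s


def cache_ratio(req: Request) -> float:
    """Fraction of the prompt already cached on the sticky instance."""
    if req.input_tokens <= 0:
        return 0.0
    return req.cached_tokens / req.input_tokens


def route_heavy(req: Request, threshold: float = 0.5) -> str:
    """Offload only when the cached share strictly exceeds the threshold."""
    return "offload" if cache_ratio(req) > threshold else "colocate"


if __name__ == "__main__":
    print(route_heavy(Request(40_000, 0)))        # turn-1, cold
    print(route_heavy(Request(40_000, 28_000)))   # turn-2+, 70% cached
```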
---
## H5: RDMA KV transfer overhead (1.1s p50) is too high; should be pipelined

**Claim**: The 1.1s p50 KV transfer time for HEAVY requests (~40k tokens) seems excessive. At 200Gbps RDMA (25 GB/s), 40k tokens × 96KB/token = 3.75GB → should take ~0.15s. The 7x gap suggests block-by-block transfer without pipelining.

**Questions to investigate**:
1. Does Mooncake do layerwise KV transfer? (transfer layer N while computing layer N+1)
2. Is the 1.1s from RDMA setup overhead, block scatter, or actual bandwidth?
3. Does vLLM's chunked prefill interact with the transfer (blocks only available after each chunk)?

**From Mooncake code**: `MooncakeConnector does not do layerwise saving` (comment in code). All blocks are saved/loaded after the FULL prefill completes. This means:
- Prefill must complete entirely before ANY KV transfer starts
- D cannot start decode until ALL blocks arrive
- No overlap between prefill compute and KV transfer

**Potential optimization**: Layerwise transfer would allow D to start pulling layer 0's KV while C_s is still computing layer 47's KV. This could reduce the effective transfer latency to near zero (hidden behind compute).

**Experiment**: TODO: Profile actual RDMA transfer time vs setup overhead. Check if `start_load_kv()` and `wait_for_layer_load()` APIs support layerwise loading (they exist in the interface but Mooncake doesn't implement them).

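The back-of-envelope estimate in the claim can be reproduced with a few lines. This sketch uses decimal units (40,000 tokens × 96,000 bytes), which gives about 3.8 GB, close to the 3.75 GB quoted above; the helper name is hypothetical and the calculation ignores protocol overhead and connection setup. It yields roughly 0.15 s of ideal wire time, so the measured 1.1 s p50 is about 7x slower.

```python
def expected_transfer_s(tokens: int, kv_bytes_per_token: int,
                        link_gbps: float) -> float:
    """Ideal transfer time if the link were fully utilized."""
    total_bytes = tokens * kv_bytes_per_token
    bytes_per_s = link_gbps * 1e9 / 8   # 200 Gbps -> 25 GB/s
    return total_bytes / bytes_per_s


if __name__ == "__main__":
    ideal = expected_transfer_s(40_000, 96_000, 200)
    measured_p50 = 1.1  # seconds, from the V2 run
    print(f"ideal={ideal:.3f}s gap={measured_p50 / ideal:.1f}x")
```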
---
## H6: Session migration breaks KV cache locality for future turns

**Claim**: When a HEAVY request is offloaded from C_s to D, session affinity moves to D. D starts with no cache for this session except the KV from the current turn (transferred via RDMA), so future turns routed to D can reuse only that turn's KV. The risk is that the RDMA-transferred KV is not properly registered in D's prefix cache.

**Questions**:
- Does vLLM's prefix cache recognize RDMA-transferred blocks as cacheable?
- If yes, future turns on D should have similar APC to staying on C_s.
- If no, every turn after migration is a cold start on D.

**From vLLM metrics**: `external_prefix_cache_hits_total` counts cross-instance cache hits. If this is > 0 on D after migration, the transferred blocks ARE cacheable.

**Experiment**: TODO: Track per-instance APC before and after session migration. Check if D's APC for migrated sessions matches expectations.

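One way to run the metric check is to snapshot D's metrics text before and after a migration and compare the counter. The sketch below assumes the endpoint returns Prometheus text exposition format without trailing timestamps, and it matches on the metric-name suffix so that a possible name prefix does not matter; the helper names are hypothetical and fetching the text over HTTP is left out.

```python
def counter_total(metrics_text: str, name: str) -> float:
    """Sum all samples whose metric name ends with `name` (any labels)."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP/TYPE lines
            continue
        metric, _, value = line.rpartition(" ")
        base = metric.split("{", 1)[0]          # drop the label set
        if base.endswith(name):
            total += float(value)
    return total


def hits_gained(before_text: str, after_text: str,
                name: str = "external_prefix_cache_hits_total") -> float:
    """Counter increase between two snapshots of the same instance."""
    return counter_total(after_text, name) - counter_total(before_text, name)
```

A positive `hits_gained` on D after a migrated session's next turn would indicate the transferred blocks were reused.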
---
## Summary of Current Understanding

```
                 Turn 1 (cold)            Turn 2+ (cached)
                 ─────────────            ────────────────
Co-located:      ✅ Best (no overhead)     ⚠️ HEAVY disrupts decode
Offload (V2):    ❌ Adds RDMA overhead     ✅ C_s fast prefill + D load balance
```

The optimal strategy is **hybrid**: co-locate cold turn-1, offload cached turn-2+.

This is the key insight for the paper: **the offload decision should be cache-aware, not size-based**. An 80k-token request with a 90% cache hit is effectively an 8k-token prefill: MEDIUM, not HEAVY. The "heaviness" that matters for PD disaggregation is `new_tokens_to_compute`, not `total_input_length`.

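The difference between the two notions of heaviness can be made concrete with a small sketch. The function names are hypothetical and the token counts are the illustrative ones from the paragraph above, not measurements: ranking by uncached tokens puts a cold 20k-token request ahead of an 80k-token request that is 90% cached, the opposite of a size-based ranking.

```python
from typing import List, Tuple


def new_tokens_to_compute(total_input_tokens: int, cached_tokens: int) -> int:
    """Prefill work that actually hits the GPU: the uncached suffix."""
    return max(total_input_tokens - cached_tokens, 0)


def rank_by_prefill_work(requests: List[Tuple[str, int, int]]) -> List[str]:
    """Order request ids from heaviest to lightest by uncached tokens.

    Each request is (id, total_input_tokens, cached_tokens).
    """
    ordered = sorted(requests,
                     key=lambda r: new_tokens_to_compute(r[1], r[2]),
                     reverse=True)
    return [r[0] for r in ordered]


if __name__ == "__main__":
    reqs = [("big_cached", 80_000, 72_000), ("small_cold", 20_000, 0)]
    print(new_tokens_to_compute(80_000, 72_000))
    print(rank_by_prefill_work(reqs))
```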
---
## H7: OVERLOAD_FACTOR tuning improves GPU balance

**Claim**: Lowering OVERLOAD_FACTOR (from 2.0 to 1.5/1.3/1.0) breaks session affinity earlier, improving GPU utilization balance.

**Experiment**: 4 baseline runs (no Mooncake) with OF=2.0, 1.5, 1.3, 1.0. 200 req each, fresh restart.

**Result**:
```
OF=2.0: imbalance=3.71x TTFT50=1.077 E2E50=5.093
OF=1.5: imbalance=3.45x TTFT50=1.068 E2E50=5.480
OF=1.3: imbalance=3.96x TTFT50=1.073 E2E50=5.144
OF=1.0: imbalance=3.47x TTFT50=1.085 E2E50=5.496
```

All within noise. APC unchanged (~30%).

**Verdict**: **REJECTED**. The imbalance is driven by workload skew (some sessions are inherently heavier), not by sticky routing. The OVERLOAD_FACTOR threshold rarely fires because per-instance load fluctuates too quickly. The hot GPU just rotates to different instances across runs.

**Key learning**: The root cause of GPU imbalance is at **session placement time (turn 1)**, not at affinity-breaking time (turn 2+). Turn-1 placement uses `ongoing_tokens` scoring, which is a snapshot that doesn't account for cumulative or future load.

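For readers unfamiliar with the knob, the sketch below shows one plausible shape of an overload-factor check. It is an assumption, not the router's actual rule: here the sticky instance is abandoned when its `ongoing_tokens` exceeds `overload_factor` times the mean across instances, and new sessions go to the least-loaded instance. The instance names and loads are made up.

```python
from typing import Dict, Optional


def pick_instance(ongoing_tokens: Dict[str, int],
                  sticky: Optional[str],
                  overload_factor: float = 2.0) -> str:
    """Keep session affinity unless the sticky instance is overloaded."""
    least_loaded = min(ongoing_tokens, key=ongoing_tokens.get)
    if sticky is None or sticky not in ongoing_tokens:
        return least_loaded                     # turn-1 placement
    mean_load = sum(ongoing_tokens.values()) / len(ongoing_tokens)
    if ongoing_tokens[sticky] > overload_factor * mean_load:
        return least_loaded                     # affinity broken
    return sticky


if __name__ == "__main__":
    loads = {"c0": 9000, "c1": 1000, "c2": 2000}   # mean = 4000
    print(pick_instance(loads, "c0", overload_factor=2.0))
    print(pick_instance(loads, "c0", overload_factor=2.5))
```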
---
## H4 Validated: Cache-gate improves GPU balance but RDMA kills TTFT

**Experiment**: H4 cache-gate (8C kv_both, offload only when cache_ratio >= 0.3) with GPU profiling.

**Result**:
```
                     Baseline   H4 cache-gate
GPU Imbalance:       3.97x      2.04x           ← 2x better balance
GPU Std:             14.9%      6.7%            ← less variance
GPU Max:             63.3%      35.3%           ← no extreme hotspot
HEAVY_COLO TTFT:     7.02s      6.28s           ← -10.5% from better balance!
HEAVY_OFFLOAD TTFT:  N/A        11.45s          ← RDMA penalty
OK/N:                198/200    198/200         ← same reliability
```

**Key finding**: The 10.5% HEAVY_COLO improvement proves GPU balance → better latency. But the 7 RDMA-offloaded requests (TTFT=11.45s) pull down the aggregate. RDMA transfer is bimodal: 3/7 fast (0.6-1.2s), 3/7 slow (18-31s).

---

## Current Understanding (updated)

1. **PD-Sep**: net negative (memory wall) ← proven
2. **LMetric**: ≈ baseline for agentic (its `P_tokens` score already includes `new_uncached_tokens`, so it converges to placements similar to sticky affinity) ← proven
3. **Elastic P2P (RDMA)**: net negative on single machine (Mooncake lacks layerwise transfer → RDMA is pure overhead) ← proven
4. **OVERLOAD_FACTOR tuning**: no effect (imbalance from workload skew, not routing) ← proven
5. **GPU balance improvement → HEAVY TTFT -10.5%**: validated (H4 HEAVY_COLO data)
6. **The test load is the limiting factor**: at time_scale=20 with 200 req, the system is only 30% loaded. Higher load may reveal more optimization opportunities.

---

## H8: Higher concurrency reveals prefill-decode interference

**Claim**: At 8 sessions / 8 GPUs, the system is underloaded (30% GPU util). Increasing to 16 sessions should reveal prefill-decode interference.

**Experiments**:
- 8 sessions, ts=20, 1000 req: TPOT90=0.073, GPU=30%, imbal=1.5x
- 16 sessions, ts=10, 500 req: TPOT90=0.106, GPU=~25%, imbal=~3.5x
- 32 sessions, ts=10, 500 req: (not run yet)

**Result**:
```
                 8 sessions   16 sessions   Delta
TPOT p90:        0.0729       0.1058        +45%!
WARM TPOT90:     0.0640       0.1301        +103%!
MEDIUM TPOT90:   0.0750       0.1970        +149%!
HEAVY TTFT50:    (varies)     3.399         N/A
E2E p50:         4.516        5.830         +29%
```

**Verdict**: **VALIDATED**. 16 sessions creates real prefill-decode interference. MEDIUM TPOT degrades 2.5x because HEAVY prefills (via chunked prefill) block decode steps on the same GPU. This is the scenario where PD disaggregation should theoretically help.

---

## H9: Elastic RDMA offload at 16 sessions reduces interference

**Claim**: At 16 sessions where interference is severe, elastic V2 (C_s prefill + flexible D decode via RDMA) should reduce TPOT by isolating heavy prefill from decode.

**Experiment**: 16 sessions, 500 req, elastic (kv_both + H4 cache-gate)

**Result**:
```
                 Baseline 16s   Elastic 16s     Delta
TPOT p90:        0.1058         0.1231          +16% (WORSE)
MEDIUM TPOT90:   0.1970         0.2056          +4% (same)
TTFT p50:        0.828          0.937           +13% (WORSE)
E2E p50:         5.830          6.528           +12% (WORSE)
OK/N:            498/500        498/500         same
Offloaded:       N/A            13/500 (2.6%)   too few to matter
```

**Verdict**: **REJECTED**. Elastic at 16 sessions is WORSE, not better. Root causes:
1. Cache-gate correctly blocks 89% of HEAVY (cold turn-1, cache_ratio=0) → only 13 offloads
2. kv_both runtime overhead at high concurrency adds ~16% TPOT vs plain baseline
3. The 13 offloaded requests have TTFT p50=17.5s (RDMA overhead), much worse than the co-located 3.5s

**Key learning**: The RDMA transfer approach cannot solve prefill-decode interference because:
- Most HEAVY are cold (no cache to benefit from offload)
- Mooncake lacks layerwise transfer (RDMA is pure sequential overhead after prefill)
- kv_both has non-zero overhead at high concurrency (unlike the Phase 0 result at low concurrency in H1)

---

## Current Understanding (final)

### What DOESN'T work for agentic workloads:

1. **PD-Sep**: net negative, due to the KV cache memory wall on decode instances
2. **LMetric (OSDI'26)**: ≈ linear routing. `P_tokens` already includes
   `new_uncached_tokens`, so cache-hit scoring gives LMetric an implicit
   soft affinity that converges to placements similar to those of explicit
   sticky affinity (see `analysis/research_findings.md` §2.2 for the
   corrected framing)
3. **Elastic P2P RDMA offload**: net negative (Mooncake transfer overhead, no layerwise pipeline)
4. **OVERLOAD_FACTOR tuning**: no effect (imbalance comes from workload skew, not routing)
5. **Dedicated Prefill Service (PS)**: cannot win the cost comparison; without KV pull, PS is always slower than cached C
6. **Cache-gate offload (H4)**: correct, but only 10-12% of HEAVY have cache → limited activation

### What DOES work:

1. **Cache-aware session-sticky routing**: +24pp APC, -60% TTFT vs round-robin (the dominant optimization)
2. **GPU balance from offload routing**: HEAVY_COLO -10.5% TTFT when imbalance reduced (H4 data)

### The real bottleneck:

At production-level concurrency (>1 session/GPU), the dominant bottleneck is **chunked prefill interference**: large HEAVY prefill chunks block decode steps on the same GPU, causing TPOT to degrade 45-149%.

Neither routing nor RDMA-based PD disaggregation solves this. The root cause is vLLM's scheduler design:
- Chunked prefill chunk size (`max_num_batched_tokens`, default 8192) is fixed
- Large prefill chunks monopolize the GPU for tens of ms, stalling decode
- Reducing chunk size would improve decode responsiveness but increase prefill overhead

### Next direction: Adaptive chunked prefill scheduling

Instead of a fixed chunk size, dynamically adjust it based on decode pressure:
- When decode queue is deep: smaller chunks → more decode slots → better TPOT
- When decode queue is empty: larger chunks → faster prefill → better TTFT
- This is a vLLM scheduler modification, not a routing change

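A minimal sketch of the adaptive policy is shown below. It is not a vLLM API: `pick_chunk_tokens`, `min_chunk`, and `saturation_depth` are hypothetical, and the linear interpolation is just one simple choice. Only the 8192 upper bound comes from the `max_num_batched_tokens` value mentioned above. An empty decode queue gets the full chunk, and the chunk shrinks linearly to `min_chunk` as the queue approaches `saturation_depth`.

```python
def pick_chunk_tokens(decode_queue_depth: int,
                      min_chunk: int = 512,
                      max_chunk: int = 8192,
                      saturation_depth: int = 32) -> int:
    """Choose a prefill chunk size from the current decode pressure."""
    if decode_queue_depth <= 0:
        return max_chunk                       # nothing to stall: go fast
    # Pressure in [0, 1]: 1.0 once the queue reaches saturation_depth.
    pressure = min(decode_queue_depth / saturation_depth, 1.0)
    chunk = max_chunk - pressure * (max_chunk - min_chunk)
    return max(min_chunk, int(chunk))


if __name__ == "__main__":
    for depth in (0, 8, 16, 32, 64):
        print(depth, pick_chunk_tokens(depth))
```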
---
## Current routing direction (cross-reference)

The hypotheses above produced the following results, which informed
the current `--policy unified` implementation:

- H2 / H7 / H9 (negative): PD-sep offload, OVERLOAD_FACTOR tuning, and
  elastic RDMA at high concurrency all regressed or stayed within noise.
- H3 / H4 / H6 (partial): cache-gated offload exists but only ~10-12% of
  HEAVY requests have cache, and the offloaded subset pays an RDMA penalty.

The active algorithm (commit `255c8e6`) is **hybrid LMetric + high-cache
affinity** in baseline mode (no Mooncake). The retired migration variants
are catalogued in `docs/migration-policy-design.md` (Approach A and the
revert chain `cc6e562` / `4c583f2`). H7's rejection (OVERLOAD_FACTOR within
noise) is why the active default stays at `overload_factor=2.0`.

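The shape of a hybrid policy of this kind can be sketched as follows. This is an illustration under assumptions, not the code at commit `255c8e6`: the score `ongoing_tokens + new_uncached_tokens` is a simplified stand-in for LMetric's `P_tokens`, the 0.5 gate is taken from the `cache_ratio>0.5` wording in the commit description, and the instance-name tie-breaker and all identifiers are hypothetical. Note that even below the gate, the uncached-token term still favors the instance holding part of the prefix, which is the implicit soft affinity described earlier.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class InstanceState:
    ongoing_tokens: int   # tokens currently queued or running
    cached_tokens: int    # prefix tokens of THIS request cached here


def choose_instance(input_tokens: int,
                    instances: Dict[str, InstanceState],
                    sticky: Optional[str],
                    gate: float = 0.5) -> str:
    # 1) High-cache affinity gate: stay sticky when most of the prompt
    #    is already cached on the session's instance.
    if sticky in instances and input_tokens > 0:
        if instances[sticky].cached_tokens / input_tokens > gate:
            return sticky

    # 2) Otherwise pick the lowest score; ties fall back to the name.
    def score(name: str):
        st = instances[name]
        new_uncached = max(input_tokens - st.cached_tokens, 0)
        return (st.ongoing_tokens + new_uncached, name)

    return min(instances, key=score)


if __name__ == "__main__":
    fleet = {"c0": InstanceState(50_000, 9_000), "c1": InstanceState(1_000, 0)}
    print(choose_instance(10_000, fleet, sticky="c0"))
```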