agentic-kvc/analysis/elastic_hypotheses.md
Gahow Wang 6a27f75337 Docs: reconcile routing docs with current hybrid direction
Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting
  the "Final Design" framing was retired after cc6e562 / 4c583f2;
  point readers to docs/migration-policy-design.md.

- docs/migration-policy-design.md: rewrite. Opens with the current
  hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
  tie-breaker), then a "What Was Retired" commit table, then the old
  Approach A numbers preserved as "Historical Baseline-Mode Comparison".

- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
  LMetric isn't "neutralized by affinity constraints" (pure --policy
  lmetric has no affinity at all); it converges to similar placements
  because P_tokens includes new_uncached_tokens, giving it implicit
  soft affinity.

- analysis/elastic_hypotheses.md: same LMetric correction in the
  "DOESN'T work" summary, plus a footer cross-referencing the current
  routing direction.

- analysis/unified_routing_fix_review.md: track this file (was
  untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:47:14 +08:00

# Elastic Prefill Service: Hypotheses and Validation Log
**Date**: 2026-05-23
**Context**: Investigating whether elastic PD disaggregation can improve agentic LLM serving over a pure co-located baseline.
## Baseline Reference (8C plain, fresh restart, 200 req)
```
OK=198/200 TTFT50=1.075 TTFT90=9.384 TPOT90=0.0761 E2E50=5.075
WARM: TTFT50=0.137 TPOT90=0.061
MEDIUM: TTFT50=0.921 TPOT90=0.079
HEAVY: TTFT50=4.945 TPOT90=0.076
```
---
## H1: Mooncake kv_both has significant runtime overhead
**Claim**: Enabling kv_both mode degrades TPOT even without KV transfer (RDMA threads, ZMQ sockets compete for CPU).
**Prior evidence**: Earlier elastic P2P experiment showed MEDIUM TPOT 0.079→0.197 (+150%). Attributed to kv_both overhead.
**Experiment**: Phase 0A (7C kv_both, no offload) vs Phase 0B (7C plain)
**Result**: TPOT90 = 0.0738 (kv_both) vs 0.0729 (plain) → **+1.3%, within noise**
**Verdict**: **REJECTED**. kv_both has no measurable runtime overhead at this load (+1.3% is within noise). The earlier 150% TPOT degradation was from offload-induced interference, not kv_both itself.
---
## H2: Dedicated Prefill Service (PS) without KV pull improves HEAVY TTFT
**Claim**: A dedicated PS instance (no sessions) does HEAVY prefill without disrupting C's decode. PS does full cold prefill (no cache), D (session-sticky C) pulls KV and decodes.
**Experiment**: PS V1 — 1PS + 7C kv_both, always offload HEAVY to PS
**Result**:
- `ps_always`: OK=195/200, HEAVY TTFT p50=~7.8s (baseline 5.0s, **+56%**), cascading timeouts
- `ps_cost`: 0 offloads (cost model correctly identifies PS is more expensive)
- `ps_flexd`: OK=172/186 (92.5%), HEAVY TTFT p50=7.8s, 12 ReadTimeout
**Root cause**: PS has no KV cache for the session → full cold prefill is SLOWER than C's cached prefill. Cost model: `full_input/8333 > (input-cached)/8333 + interference` held for every request in this run (hence 0 offloads under `ps_cost`).
**Verdict**: **REJECTED**. PS without KV pull cannot beat cached co-located prefill. The cold prefill overhead + KV transfer time exceeds the interference savings.
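The cost comparison above can be sketched as follows. This is a minimal illustration, assuming the 8333 tokens/s prefill rate from the formula in this log; the function names and the example interference values are hypothetical and are not taken from the router code.

```python
PREFILL_TOKENS_PER_S = 8333  # prefill rate used in the log's cost model


def ps_cost_s(full_input: int) -> float:
    """PS holds no session cache, so it prefills every input token."""
    return full_input / PREFILL_TOKENS_PER_S


def colocated_cost_s(full_input: int, cached: int, interference_s: float) -> float:
    """Session-sticky C prefills only uncached tokens, plus an interference penalty."""
    return (full_input - cached) / PREFILL_TOKENS_PER_S + interference_s


def should_offload_to_ps(full_input: int, cached: int, interference_s: float) -> bool:
    """Offload only when the cold PS prefill is strictly cheaper."""
    return ps_cost_s(full_input) < colocated_cost_s(full_input, cached, interference_s)


# Hypothetical request: 40k input tokens, 28k already cached on C.
print(should_offload_to_ps(40_000, 28_000, interference_s=0.5))  # PS loses
print(should_offload_to_ps(40_000, 28_000, interference_s=4.0))  # PS wins
```

Algebraically, the offload condition reduces to `interference > cached / 8333`: the more of the prefix C already holds, the larger the interference penalty must be before a cold PS prefill pays off.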
---
## H3: C_s cached prefill + flexible D decode (V2) improves E2E
**Claim**: C_s (session-sticky, has cache) does fast prefill (max_tokens=1), D (least-loaded C) pulls KV via Mooncake and does decode. Benefits: (1) C_s prefill is fast due to cache, (2) D is least-loaded so decode starts quickly, (3) session migrates to D for better load balance.
**Experiment**: V2 — 8C kv_both, HEAVY offloaded (C_s prefill → flexible D decode)
**Result**:
```
OK=179/185 (96.8%) TTFT50=0.762 (-29%) E2E50=4.628 (-9%) TPOT90=0.0746 (=)
HEAVY: TTFT50=4.794 (≈baseline) TTFT90=20.4 (+117%)
Routes: 63 HEAVY_OFFLOAD, 51 MEDIUM, 69 WARM
Cache hit on offloaded: mean=3%, median=0% (92% are turn-1 cold)
Prefill: p50=5.0s D KV pull: p50=1.1s p90=6.7s
```
**Partial validation**: E2E p50 improved 9% and TTFT p50 improved 29%, but HEAVY TTFT p90 roughly doubled (+117%) and errors rose from 2 to 6.
**Key finding**: 92% of HEAVY requests are turn-1 (zero cache on C_s). C_s does COLD prefill anyway → offload adds pure RDMA overhead (~1.1s) with no cache benefit.
**Verdict**: **PARTIALLY VALIDATED**. The architecture works for MEDIUM and WARM (better load balance). But blindly offloading all HEAVY hurts because most are cold.
---
## H4: Only offload HEAVY with high cache hit (cold HEAVY should stay co-located)
**Claim**: Turn-1 HEAVY requests have zero cache → co-located is faster (no RDMA overhead). Only turn-2+ HEAVY with significant cache hit (>50%) should be offloaded, because:
- C_s's prefill is fast (only new tokens computed)
- D gets the KV via RDMA (~1.1s, small vs the savings from not waiting for C_s's decode queue)
- C_s's decode is not disrupted
**Counterintuition**: This challenges the conventional PD-sep assumption that "all heavy prefill should be disaggregated." For agentic workloads with high cache reuse (70%+), most of the "heavy" prefix is already cached — the actual compute is MEDIUM-level.
**Experiment**: TODO — V2 with `cache_hit > 50% * input_length` gate
**Expected**:
- Turn-1 cold HEAVY stays co-located (no RDMA overhead, same TTFT as baseline)
- Turn-2+ cached HEAVY gets offloaded (C_s fast prefill + D least-loaded decode)
- Overall: HEAVY TTFT ≈ baseline, HEAVY TPOT improved (D less loaded), fewer errors
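The proposed gate can be sketched as a single predicate. This is a minimal sketch under the 50% threshold stated in the claim; `route_heavy` and its return labels are hypothetical names, not the experiment's actual code.

```python
def route_heavy(input_length: int, cached_tokens: int, gate: float = 0.5) -> str:
    """Decide where a HEAVY request runs, based on its cache hit ratio.

    Returns "offload" (C_s prefill + flexible D decode) only when more than
    `gate` of the input is already cached on C_s; otherwise "colocate".
    """
    if input_length <= 0:
        raise ValueError("input_length must be positive")
    cache_ratio = cached_tokens / input_length
    return "offload" if cache_ratio > gate else "colocate"


print(route_heavy(40_000, 0))       # turn-1 cold request stays co-located
print(route_heavy(40_000, 30_000))  # 75% cached request is offloaded
```

The comparison is strict, so a request with exactly 50% of its input cached stays co-located.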
---
## H5: RDMA KV transfer overhead (1.1s p50) is too high — should be pipelined
**Claim**: The 1.1s p50 KV transfer time for HEAVY requests (~40k tokens) seems excessive. At 200Gbps RDMA (25 GB/s), 40k tokens × 96KB/token = 3.75GB → should take ~0.15s. The 7x gap suggests block-by-block transfer without pipelining.
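The back-of-envelope estimate in the claim can be reproduced as below. The sketch uses decimal units throughout (1 KB = 1000 bytes), which gives 3.84 GB rather than the 3.75 GB quoted above; the difference comes only from the unit convention and does not change the roughly 7x conclusion.

```python
TOKENS = 40_000               # typical HEAVY request size from the claim
KV_BYTES_PER_TOKEN = 96_000   # 96 KB/token, decimal units
LINK_BYTES_PER_S = 25e9       # 200 Gbps is about 25 GB/s
OBSERVED_P50_S = 1.1          # measured KV pull p50 from the V2 run

kv_bytes = TOKENS * KV_BYTES_PER_TOKEN
ideal_s = kv_bytes / LINK_BYTES_PER_S   # transfer time at full line rate
gap = OBSERVED_P50_S / ideal_s          # how far the observation is from ideal

print(f"KV size {kv_bytes / 1e9:.2f} GB, ideal {ideal_s:.3f}s, gap {gap:.1f}x")
```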
**Questions to investigate**:
1. Does Mooncake do layerwise KV transfer? (transfer layer N while computing layer N+1)
2. Is the 1.1s from RDMA setup overhead, block scatter, or actual bandwidth?
3. Does vLLM's chunked prefill interact with the transfer (blocks only available after each chunk)?
**From Mooncake code**: `MooncakeConnector does not do layerwise saving` (comment in code). All blocks are saved/loaded after the FULL prefill completes. This means:
- Prefill must complete entirely before ANY KV transfer starts
- D cannot start decode until ALL blocks arrive
- No overlap between prefill compute and KV transfer
**Potential optimization**: Layerwise transfer would allow D to start pulling layer 0's KV while C_s is still computing layer 47's KV. This could reduce the effective transfer latency to near zero (hidden behind compute).
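To make the potential gain concrete, here is a toy two-stage pipeline model. It is a sketch under strong assumptions: prefill and transfer time are split evenly across layers, the 48-layer count is inferred from the "layer 47" mention above, and the 5.0s / 1.1s inputs are the V2 p50 values. It does not model Mooncake or vLLM internals such as chunked prefill.

```python
def sequential_latency(prefill_s: float, transfer_s: float) -> float:
    """Current behaviour: transfer starts only after the full prefill."""
    return prefill_s + transfer_s


def layerwise_latency(prefill_s: float, transfer_s: float, num_layers: int) -> float:
    """Layer i's KV is sent as soon as layer i has been computed."""
    compute = prefill_s / num_layers   # per-layer compute time
    send = transfer_s / num_layers     # per-layer transfer time
    finish_send = 0.0
    for layer in range(num_layers):
        ready = (layer + 1) * compute  # when this layer's KV exists
        # The link is busy until finish_send; start at whichever is later.
        finish_send = max(finish_send, ready) + send
    return finish_send


seq = sequential_latency(5.0, 1.1)
pipe = layerwise_latency(5.0, 1.1, num_layers=48)
print(f"sequential {seq:.2f}s, layerwise {pipe:.2f}s, exposed transfer {pipe - 5.0:.3f}s")
```

Under these assumptions the only transfer time left on the critical path is the last layer's share (1.1s / 48), which is the "near zero" effect described above. When per-layer transfer is slower than per-layer compute, the link becomes the bottleneck and the benefit shrinks.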
**Experiment**: TODO — Profile actual RDMA transfer time vs setup overhead. Check if `start_load_kv()` and `wait_for_layer_load()` APIs support layerwise loading (they exist in the interface but Mooncake doesn't implement them).
---
## H6: Session migration breaks KV cache locality for future turns
**Claim**: When a HEAVY request is offloaded from C_s to D, session affinity moves to D. D has no prior cache for this session; it holds only the current turn's KV, transferred via RDMA. Future turns go to D and can reuse that KV only if the RDMA-transferred blocks are properly registered in D's prefix cache, which is unverified.
**Questions**:
- Does vLLM's prefix cache recognize RDMA-transferred blocks as cacheable?
- If yes, future turns on D should have similar APC to staying on C_s.
- If no, every turn after migration is a cold start on D.
**From vLLM metrics**: `external_prefix_cache_hits_total` counts cross-instance cache hits. If this is > 0 on D after migration, the transferred blocks ARE cacheable.
**Experiment**: TODO — Track per-instance APC before and after session migration. Check if D's APC for migrated sessions matches expectations.
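One way to run this check is to snapshot D's metrics text before and after a migration and compare the counter. The sketch below parses generic Prometheus text-exposition output; it assumes samples carry no trailing timestamp and that the counter may appear with a namespace prefix. The helper names and the sample strings are hypothetical, not real server output.

```python
def counter_total(metrics_text: str, name: str) -> float:
    """Sum all series of one counter in Prometheus text-exposition output."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric = series.split("{", 1)[0]  # drop the label set, if any
        if metric == name or metric.endswith(":" + name):
            total += float(value)
    return total


def migrated_blocks_cacheable(before: str, after: str) -> bool:
    """True if the external prefix cache hit counter grew between snapshots."""
    name = "external_prefix_cache_hits_total"
    return counter_total(after, name) > counter_total(before, name)


# Synthetic snapshots for illustration only.
before = 'external_prefix_cache_hits_total{instance="d"} 0.0\n'
after = 'external_prefix_cache_hits_total{instance="d"} 512.0\n'
print(migrated_blocks_cacheable(before, after))
```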
---
## Summary of Current Understanding
```
               Turn 1 (cold)            Turn 2+ (cached)
               ─────────────            ────────────────
Co-located:    ✅ Best (no overhead)    ⚠️ HEAVY disrupts decode
Offload (V2):  ❌ Adds RDMA overhead    ✅ C_s fast prefill + D load balance
```
The optimal strategy is **hybrid**: co-locate cold turn-1, offload cached turn-2+.
This is the key insight for the paper: **the offload decision should be cache-aware, not size-based**. An 80k-token request with 90% cache hit is effectively an 8k-token prefill: MEDIUM, not HEAVY. The "heaviness" that matters for PD disaggregation is `new_tokens_to_compute`, not `total_input_length`.
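The cache-aware notion of heaviness can be sketched as below. The class thresholds are hypothetical placeholders chosen only so that 8k tokens lands in MEDIUM and 80k in HEAVY; the log does not state the actual WARM/MEDIUM/HEAVY boundaries.

```python
# Hypothetical class boundaries, in tokens of actual prefill compute.
MEDIUM_MIN_TOKENS = 2_000
HEAVY_MIN_TOKENS = 20_000


def new_tokens_to_compute(total_input_length: int, cache_hit_ratio: float) -> int:
    """Tokens that still need prefill after prefix-cache hits are removed."""
    return round(total_input_length * (1.0 - cache_hit_ratio))


def effective_class(total_input_length: int, cache_hit_ratio: float) -> str:
    """Classify by compute actually required, not by raw input length."""
    new_tokens = new_tokens_to_compute(total_input_length, cache_hit_ratio)
    if new_tokens >= HEAVY_MIN_TOKENS:
        return "HEAVY"
    if new_tokens >= MEDIUM_MIN_TOKENS:
        return "MEDIUM"
    return "WARM"


print(new_tokens_to_compute(80_000, 0.9), effective_class(80_000, 0.9))
print(new_tokens_to_compute(80_000, 0.0), effective_class(80_000, 0.0))
```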
---
## H7: OVERLOAD_FACTOR tuning improves GPU balance
**Claim**: Lowering OVERLOAD_FACTOR (from 2.0 to 1.5/1.3/1.0) breaks session affinity earlier, improving GPU utilization balance.
**Experiment**: 4 baseline runs (no Mooncake) with OF=2.0, 1.5, 1.3, 1.0. 200 req each, fresh restart.
**Result**:
```
OF=2.0: imbalance=3.71x TTFT50=1.077 E2E50=5.093
OF=1.5: imbalance=3.45x TTFT50=1.068 E2E50=5.480
OF=1.3: imbalance=3.96x TTFT50=1.073 E2E50=5.144
OF=1.0: imbalance=3.47x TTFT50=1.085 E2E50=5.496
```
All within noise. APC unchanged (~30%).
**Verdict**: **REJECTED**. The imbalance is driven by workload skew (some sessions are inherently heavier), not by sticky routing. The OVERLOAD_FACTOR threshold rarely fires because per-instance load fluctuates too quickly. The hot GPU just rotates to different instances across runs.
**Key learning**: The root cause of GPU imbalance is at **session placement time (turn 1)**, not at affinity-breaking time (turn 2+). Turn-1 placement uses `ongoing_tokens` scoring, which is a snapshot that doesn't account for cumulative or future load.
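For orientation, the two decision points discussed here can be sketched as one function. This is an assumption-heavy illustration: the log does not specify what OVERLOAD_FACTOR is compared against, so the sketch assumes "sticky instance load exceeds `overload_factor` times the mean load". `pick_instance` and the instance names are hypothetical.

```python
from typing import Dict, Optional


def pick_instance(
    ongoing_tokens: Dict[str, int],
    sticky: Optional[str],
    overload_factor: float = 2.0,
) -> str:
    """Return the instance for a request given a load snapshot."""
    least_loaded = min(ongoing_tokens, key=ongoing_tokens.get)
    if sticky is None:
        # Turn 1: placement from a point-in-time snapshot only.
        return least_loaded
    mean_load = sum(ongoing_tokens.values()) / len(ongoing_tokens)
    if ongoing_tokens[sticky] > overload_factor * mean_load:
        # Turn 2+: break affinity only when the sticky instance is overloaded.
        return least_loaded
    return sticky


loads = {"c0": 6_000, "c1": 1_000, "c2": 2_000}  # mean load is 3000
print(pick_instance(loads, sticky="c0", overload_factor=2.0))  # stays on c0
print(pick_instance(loads, sticky="c0", overload_factor=1.5))  # moves to c1
```

In this model, lowering the factor only changes the turn-2+ branch; the turn-1 branch, where the log locates the root cause, is untouched by the tuning.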
---
## H4 Validated: Cache-gate improves GPU balance but RDMA kills TTFT
**Experiment**: H4 cache-gate (8C kv_both, offload only when cache_ratio >= 0.3) with GPU profiling.
**Result**:
```
                     Baseline   H4 cache-gate
GPU Imbalance:       3.97x      2.04x          ← 2x better balance
GPU Std:             14.9%      6.7%           ← less variance
GPU Max:             63.3%      35.3%          ← no extreme hotspot
HEAVY_COLO TTFT:     7.02s      6.28s          ← -10.5% from better balance!
HEAVY_OFFLOAD TTFT:  N/A        11.45s         ← RDMA penalty
OK/N:                198/200    198/200        ← same reliability
```
**Key finding**: The 10.5% HEAVY_COLO improvement indicates that better GPU balance translates into lower latency. But the 7 RDMA-offloaded requests (TTFT=11.45s) drag the aggregate TTFT back up. RDMA transfer is bimodal: 3/7 fast (0.6-1.2s), 3/7 slow (18-31s).
---
## Current Understanding (updated)
1. **PD-Sep**: net negative (memory wall) ← proven
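2. **LMetric**: ≈ baseline for agentic (`P_tokens` includes `new_uncached_tokens`, so it has an implicit soft affinity and converges to placements similar to sticky routing; pure `--policy lmetric` has no explicit affinity) ← proven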
3. **Elastic P2P (RDMA)**: net negative on single machine (Mooncake lacks layerwise transfer → RDMA is pure overhead) ← proven
4. **OVERLOAD_FACTOR tuning**: no effect (imbalance from workload skew, not routing) ← proven
5. **GPU balance improvement → HEAVY TTFT -10.5%**: validated (H4 HEAVY_COLO data)
6. **Underload at time_scale=20 with 200 req**: the system is only 30% loaded. Higher load may reveal more optimization opportunities.
---
## H8: Higher concurrency reveals prefill-decode interference
**Claim**: At 8 sessions / 8 GPUs, the system is underloaded (30% GPU util). Increasing to 16 sessions should reveal prefill-decode interference.
**Experiments**:
- 8 sessions, ts=20, 1000 req: TPOT90=0.073, GPU=30%, imbal=1.5x
- 16 sessions, ts=10, 500 req: TPOT90=0.106, GPU=~25%, imbal=~3.5x
- 32 sessions, ts=10, 500 req: (not run yet)
**Result**:
```
                8 sessions   16 sessions   Delta
TPOT p90:       0.0729       0.1058        +45%!
WARM TPOT90:    0.0640       0.1301        +103%!
MEDIUM TPOT90:  0.0750       0.1970        +149%!
HEAVY TTFT50:   (varies)     3.399         N/A
E2E p50:        4.516        5.830         +29%
```
**Verdict**: **VALIDATED**. 16 sessions creates real prefill-decode interference. MEDIUM TPOT degrades 2.5x because HEAVY prefills (via chunked prefill) block decode steps on the same GPU. This is the scenario where PD disaggregation should theoretically help.
---
## H9: Elastic RDMA offload at 16 sessions reduces interference
**Claim**: At 16 sessions where interference is severe, elastic V2 (C_s prefill + flexible D decode via RDMA) should reduce TPOT by isolating heavy prefill from decode.
**Experiment**: 16 sessions, 500 req, elastic (kv_both + H4 cache-gate)
**Result**:
```
                Baseline 16s   Elastic 16s     Delta
TPOT p90:       0.1058         0.1231          +16% (WORSE)
MEDIUM TPOT90:  0.1970         0.2056          +4% (same)
TTFT p50:       0.828          0.937           +13% (WORSE)
E2E p50:        5.830          6.528           +12% (WORSE)
OK/N:           498/500        498/500         same
Offloaded:      N/A            13/500 (2.6%)   too few to matter
```
**Verdict**: **REJECTED**. Elastic at 16 sessions is WORSE, not better. Root causes:
1. Cache-gate correctly blocks 89% of HEAVY (cold turn-1, cache_ratio=0) → only 13 offloads
2. kv_both runtime overhead at high concurrency adds ~16% TPOT vs plain baseline
3. The 13 offloaded requests have TTFT p50=17.5s (RDMA overhead), much worse than colocated 3.5s
**Key learning**: The RDMA transfer approach cannot solve prefill-decode interference because:
- Most HEAVY are cold (no cache to benefit from offload)
- Mooncake lacks layerwise transfer (RDMA is pure sequential overhead after prefill)
- kv_both has non-zero overhead at high concurrency (unlike Phase 0 at low concurrency, where the difference was within noise)
---
## Current Understanding (final)
### What DOESN'T work for agentic workloads:
1. **PD-Sep**: net negative — KV cache memory wall on decode instances
2. **LMetric (OSDI'26)**: ≈ linear routing — `P_tokens` already includes
`new_uncached_tokens`, so cache-hit scoring gives LMetric an implicit
soft affinity that converges to similar placements as explicit sticky
affinity (see `analysis/research_findings.md` §2.2 for the corrected
framing)
3. **Elastic P2P RDMA offload**: net negative — Mooncake transfer overhead, no layerwise pipeline
4. **OVERLOAD_FACTOR tuning**: no effect — imbalance from workload skew, not routing
5. **Dedicated Prefill Service (PS)**: cannot win the cost comparison without KV pull; PS is always slower than cached C
6. **Cache-gate offload (H4)**: correct but only 10-12% of HEAVY have cache → limited activation
### What DOES work:
1. **Cache-aware session-sticky routing**: +24pp APC, -60% TTFT vs round-robin (the dominant optimization)
2. **GPU balance from offload routing**: HEAVY_COLO -10.5% TTFT when imbalance reduced (H4 data)
### The real bottleneck:
At production-level concurrency (>1 session/GPU), the dominant bottleneck is **chunked prefill interference**: large HEAVY prefill chunks block decode steps on the same GPU, causing TPOT to degrade 45-149%.
Neither routing nor RDMA-based PD disaggregation solves this. The root cause is vLLM's scheduler design:
- Chunked prefill chunk size (`max_num_batched_tokens`, default 8192) is fixed
- Large prefill chunks monopolize the GPU for tens of ms, stalling decode
- Reducing chunk size would improve decode responsiveness but increase prefill overhead
### Next direction: Adaptive chunked prefill scheduling
Instead of fixed chunk size, dynamically adjust based on decode pressure:
- When decode queue is deep: smaller chunks → more decode slots → better TPOT
- When decode queue is empty: larger chunks → faster prefill → better TTFT
- This is a vLLM scheduler modification, not a routing change
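The policy outlined above could look like the following sketch. Everything here is hypothetical: the function, the linear interpolation, and the `min_chunk` / `depth_at_min` parameters are illustrative choices, and nothing in this block is an existing vLLM API. Only the 8192-token upper bound comes from the chunk size cited above.

```python
def adaptive_chunk_tokens(
    decode_queue_depth: int,
    min_chunk: int = 1024,    # hypothetical floor under heavy decode pressure
    max_chunk: int = 8192,    # chunk size cited above as the fixed default
    depth_at_min: int = 16,   # hypothetical queue depth that forces the floor
) -> int:
    """Pick a prefill chunk size from the current decode queue depth.

    Empty decode queue -> largest chunk (better TTFT).
    Deep decode queue  -> smallest chunk (better TPOT).
    """
    if decode_queue_depth <= 0:
        return max_chunk
    pressure = min(decode_queue_depth / depth_at_min, 1.0)
    chunk = max_chunk - pressure * (max_chunk - min_chunk)
    return int(round(chunk))


for depth in (0, 8, 16, 32):
    print(depth, adaptive_chunk_tokens(depth))
```

Linear interpolation is only one option; a step function or a controller driven by measured TPOT would fit the same description.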
---
## Current routing direction (cross-reference)
The hypotheses above produced the following results, which informed
the current `--policy unified` implementation:
- H2 / H7 / H9 (negative): PD-sep offload, OVERLOAD_FACTOR tuning, and
elastic RDMA at high concurrency all regressed or stayed within noise.
- H3 / H4 / H6 (partial): cache-gated offload exists but only ~10-12% of
HEAVY requests have cache, and the offloaded subset pays RDMA penalty.
The active algorithm (commit `255c8e6`) is **hybrid LMetric + high-cache
affinity** in baseline mode (no Mooncake). The retired migration variants
are catalogued in `docs/migration-policy-design.md` (Approach A and the
revert chain `cc6e562` / `4c583f2`). H7's rejection (OVERLOAD_FACTOR within
noise) is why the active default stays at `overload_factor=2.0`.
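A rough sketch of the hybrid shape described above (LMetric-style base score, a high-cache affinity gate at `cache_ratio > 0.5`, and a tie-breaker) is given below. It is an illustration under assumptions, not the code at `255c8e6`: the score formula `pending_tokens + new_uncached_tokens`, the name-based tie-breaker, and all identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    name: str
    pending_tokens: int  # load already queued on this instance
    cached_tokens: int   # tokens of this request's prefix cached here


def route(instances: List[Instance], input_tokens: int,
          sticky: Optional[str] = None, gate: float = 0.5) -> str:
    by_name = {inst.name: inst for inst in instances}
    # Affinity gate: keep the session where most of its prefix is cached.
    if sticky in by_name and by_name[sticky].cached_tokens / input_tokens > gate:
        return sticky

    def score(inst: Instance):
        new_uncached = input_tokens - inst.cached_tokens
        # Lower is better; the name acts as a deterministic tie-breaker.
        return (inst.pending_tokens + new_uncached, inst.name)

    return min(instances, key=score).name


fleet = [Instance("c0", 5_000, 10_000), Instance("c1", 1_000, 0)]
# Gate not met (25% cached), yet c0 still wins on score: implicit soft affinity.
print(route(fleet, input_tokens=40_000, sticky="c0"))
```

The example shows why a score that counts uncached tokens behaves like soft affinity even when the explicit gate does not fire.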