Add elastic hypotheses tracking doc with H1-H6 analysis

Tracks all hypotheses tested during elastic PD disaggregation research:

- H1 (kv_both overhead): REJECTED, zero overhead at idle
- H2 (PS cold prefill): REJECTED, PS slower than cached C
- H3 (C_s+flexD): PARTIALLY VALIDATED, E2E -9% but HEAVY p90 +117%
- H4 (cache-aware offload): TODO, only offload high-cache-hit HEAVY
- H5 (RDMA overhead): TODO, Mooncake lacks layerwise transfer
- H6 (session migration): TODO, verify D's APC after migration

Key insight: offload decision should be cache-aware (new_tokens), not size-based (total_input). 80k request with 90% cache = 8k prefill.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 01:17:12 +08:00
parent fc92410ec9
commit 098d86385a
1 changed files with 134 additions and 0 deletions
--- a/analysis/elastic_hypotheses.md
+++ b/analysis/elastic_hypotheses.md
@@ -0,0 +1,134 @@
+# Elastic Prefill Service: Hypotheses and Validation Log
+
+**Date**: 2026-05-23
+**Context**: Investigating whether elastic PD disaggregation can improve agentic LLM serving vs pure co-located baseline.
+
+## Baseline Reference (8C plain, fresh restart, 200 req)
+```
+OK=198/200  TTFT50=1.075  TTFT90=9.384  TPOT90=0.0761  E2E50=5.075
+WARM:   TTFT50=0.137  TPOT90=0.061
+MEDIUM: TTFT50=0.921  TPOT90=0.079
+HEAVY:  TTFT50=4.945  TPOT90=0.076
+```
+
+---
+
+## H1: Mooncake kv_both has significant runtime overhead
+
+**Claim**: Enabling kv_both mode degrades TPOT even without KV transfer (RDMA threads, ZMQ sockets compete for CPU).
+
+**Prior evidence**: Earlier elastic P2P experiment showed MEDIUM TPOT 0.079→0.197 (+150%). Attributed to kv_both overhead.
+
+**Experiment**: Phase 0A (7C kv_both, no offload) vs Phase 0B (7C plain)
+
+**Result**: TPOT90 = 0.0738 (kv_both) vs 0.0729 (plain) → **+1.3%, within noise**
+
+**Verdict**: **REJECTED**. kv_both adds no measurable runtime overhead when no KV is transferred (the +1.3% TPOT90 difference is within noise). The earlier 150% TPOT degradation came from offload-induced interference, not from kv_both itself.
+
+---
+
+## H2: Dedicated Prefill Service (PS) without KV pull improves HEAVY TTFT
+
+**Claim**: A dedicated PS instance (no sessions) does HEAVY prefill without disrupting C's decode. PS does full cold prefill (no cache), D (session-sticky C) pulls KV and decodes.
+
+**Experiment**: PS V1 — 1PS + 7C kv_both, always offload HEAVY to PS
+
+**Result**: 
+- `ps_always`: OK=195/200, HEAVY TTFT p50=~7.8s (baseline 5.0s, **+56%**), cascading timeouts
+- `ps_cost`: 0 offloads (cost model correctly identifies PS is more expensive)
+- `ps_flexd`: OK=172/186 (92.5%), HEAVY TTFT p50=7.8s, 12 ReadTimeout
+
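+**Root cause**: PS has no KV cache for the session, so its full cold prefill is SLOWER than C's cached prefill. The cost model offloads only when PS is cheaper, but `full_input/8333 > (input-cached)/8333 + interference` held for every request in the `ps_cost` run. The inequality reduces to `cached/8333 > interference`: the prefill time saved by C's cache outweighed the interference that offloading would have removed.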
+
+**Verdict**: **REJECTED**. PS without KV pull cannot beat cached co-located prefill. The cold prefill overhead + KV transfer time exceeds the interference savings.
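
The cost comparison from the root-cause paragraph can be sketched as below. This is a minimal illustration, not the router's actual code: the function names are hypothetical, `8333` is assumed to be a prefill rate in tokens/s, and `interference_s` stands for the unspecified interference term. The KV transfer time mentioned in the verdict is left out, as it is in the inequality.

```python
# Hypothetical sketch of the H2 cost model; names are illustrative only.
PREFILL_TOKENS_PER_S = 8333.0  # constant from the log; assumed to be tokens/s


def colocated_cost_s(input_tokens: int, cached_tokens: int, interference_s: float) -> float:
    """Cached prefill on C plus the decode interference it causes."""
    return (input_tokens - cached_tokens) / PREFILL_TOKENS_PER_S + interference_s


def ps_cost_s(input_tokens: int) -> float:
    """Cold prefill on PS: no cache, so every input token is computed."""
    return input_tokens / PREFILL_TOKENS_PER_S


def ps_is_cheaper(input_tokens: int, cached_tokens: int, interference_s: float) -> bool:
    """True when offloading to PS beats staying on C under this model."""
    return ps_cost_s(input_tokens) < colocated_cost_s(input_tokens, cached_tokens, interference_s)


def break_even_interference_s(cached_tokens: int) -> float:
    """Interference above which PS would win: cached / rate."""
    return cached_tokens / PREFILL_TOKENS_PER_S
```

Under these assumptions, a request with 30k cached tokens would need more than about 3.6 s of avoided interference before PS becomes the cheaper option.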
+
+---
+
+## H3: C_s cached prefill + flexible D decode (V2) improves E2E
+
+**Claim**: C_s (session-sticky, has cache) does fast prefill (max_tokens=1), D (least-loaded C) pulls KV via Mooncake and does decode. Benefits: (1) C_s prefill is fast due to cache, (2) D is least-loaded so decode starts quickly, (3) session migrates to D for better load balance.
+
+**Experiment**: V2 — 8C kv_both, HEAVY offloaded (C_s prefill → flexible D decode)
+
+**Result**:
+```
+OK=179/185 (96.8%)  TTFT50=0.762 (-29%)  E2E50=4.628 (-9%)  TPOT90=0.0746 (=)
+HEAVY: TTFT50=4.794 (≈baseline)  TTFT90=20.4 (+117%)
+Routes: 63 HEAVY_OFFLOAD, 51 MEDIUM, 69 WARM
+Cache hit on offloaded: mean=3%, median=0% (92% are turn-1 cold)
+Prefill: p50=5.0s  D KV pull: p50=1.1s p90=6.7s
+```
+
+**Partial validation**: E2E p50 improved 9% and TTFT p50 improved 29%. But HEAVY TTFT p90 more than doubled (+117%) and errors rose to 6 (vs 2 in the baseline).
+
+**Key finding**: 92% of HEAVY requests are turn-1 (zero cache on C_s). C_s does COLD prefill anyway → offload adds pure RDMA overhead (~1.1s) with no cache benefit.
+
+**Verdict**: **PARTIALLY VALIDATED**. The architecture works for MEDIUM and WARM (better load balance). But blindly offloading all HEAVY hurts because most are cold.
+
+---
+
+## H4: Only offload HEAVY with high cache hit (cold HEAVY should stay co-located)
+
+**Claim**: Turn-1 HEAVY requests have zero cache → co-located is faster (no RDMA overhead). Only turn-2+ HEAVY with significant cache hit (>50%) should be offloaded, because:
+- C_s's prefill is fast (only new tokens computed)
+- D gets the KV via RDMA (~1.1s, small vs the savings from not waiting for C_s's decode queue)
+- C_s's decode is not disrupted
+
+**Counterintuition**: This challenges the conventional PD-sep assumption that "all heavy prefill should be disaggregated." For agentic workloads with high cache reuse (70%+), most of the "heavy" prefix is already cached — the actual compute is MEDIUM-level.
+
+**Experiment**: TODO: V2 with a `cache_hit > 0.5 * input_length` gate (offload only when more than half of the input tokens are already cached on C_s)
+
+**Expected**: 
+- Turn-1 cold HEAVY stays co-located (no RDMA overhead, same TTFT as baseline)
+- Turn-2+ cached HEAVY gets offloaded (C_s fast prefill + D least-loaded decode)
+- Overall: HEAVY TTFT ≈ baseline, HEAVY TPOT improved (D less loaded), fewer errors
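
A minimal sketch of the proposed gate, assuming the router knows how many prefix tokens are already cached on C_s before it routes. The `Request` type, its field names, and `should_offload` are hypothetical and not part of the existing router. The gate is meant for requests already classified as HEAVY.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical view of a HEAVY request at routing time."""
    input_tokens: int
    cached_tokens: int  # prefix tokens already cached on C_s (assumed known)

    @property
    def new_tokens(self) -> int:
        """Tokens that still need prefill compute."""
        return self.input_tokens - self.cached_tokens

    @property
    def cache_hit_ratio(self) -> float:
        """Fraction of the input already cached; 0.0 for an empty input."""
        if self.input_tokens == 0:
            return 0.0
        return self.cached_tokens / self.input_tokens


def should_offload(req: Request, min_hit_ratio: float = 0.5) -> bool:
    """Offload only when the cache hit ratio strictly exceeds the threshold."""
    return req.cache_hit_ratio > min_hit_ratio
```

With the default threshold, a cold turn-1 request (`cached_tokens=0`) stays co-located, while an 80k-token request with 72k cached tokens is offloaded and needs only 8k tokens of prefill.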
+
+---
+
+## H5: RDMA KV transfer overhead (1.1s p50) is too high — should be pipelined
+
+**Claim**: The 1.1s p50 KV transfer time for HEAVY requests (~40k tokens) seems excessive. At 200Gbps RDMA (25 GB/s), 40k tokens × 96KB/token = 3.75GB → should take ~0.15s. The 7x gap suggests block-by-block transfer without pipelining.
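
The back-of-envelope estimate in the claim can be reproduced with the short calculation below. It assumes 96KB means 96 × 1024 bytes per token and that the link sustains its full 200 Gbps line rate with no protocol overhead; the function name is illustrative.

```python
def ideal_transfer_s(tokens: int, kv_bytes_per_token: int, link_gbps: float) -> float:
    """Lower bound on transfer time: total KV bytes divided by line rate."""
    total_bytes = tokens * kv_bytes_per_token
    link_bytes_per_s = link_gbps * 1e9 / 8  # 200 Gbps -> 25e9 bytes/s
    return total_bytes / link_bytes_per_s


tokens = 40_000                  # typical HEAVY request size from the claim
kv_bytes_per_token = 96 * 1024   # assumption: 96 KiB of KV per token
measured_p50_s = 1.1             # D KV pull p50 from the V2 run

ideal_s = ideal_transfer_s(tokens, kv_bytes_per_token, link_gbps=200)
gap = measured_p50_s / ideal_s   # how far the measurement is from line rate

print(f"ideal={ideal_s:.3f}s measured={measured_p50_s}s gap={gap:.1f}x")
```

Under these assumptions the ideal time comes out near 0.16 s and the gap near 7x, which matches the estimate in the claim.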
+
+**Questions to investigate**:
+1. Does Mooncake do layerwise KV transfer? (transfer layer N while computing layer N+1)
+2. Is the 1.1s from RDMA setup overhead, block scatter, or actual bandwidth?
+3. Does vLLM's chunked prefill interact with the transfer (blocks only available after each chunk)?
+
+**From Mooncake code**: `MooncakeConnector does not do layerwise saving` (comment in code). All blocks are saved/loaded after the FULL prefill completes. This means:
+- Prefill must complete entirely before ANY KV transfer starts
+- D cannot start decode until ALL blocks arrive
+- No overlap between prefill compute and KV transfer
+
+**Potential optimization**: Layerwise transfer would allow D to start pulling layer 0's KV while C_s is still computing layer 47's KV. This could reduce the effective transfer latency to near zero (hidden behind compute).
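
To gauge how much latency layerwise transfer could hide, here is a toy two-stage pipeline model. It is a sketch under stated assumptions, not Mooncake or vLLM code. It assumes 48 uniform layers (implied by "layer 47" above), compute and transfer cost split evenly across layers, no per-layer setup cost, and one transfer in flight at a time. The 5.0 s prefill and 1.1 s KV pull are the V2 p50 figures, used only as illustrative inputs.

```python
def sequential_latency_s(prefill_s: float, transfer_s: float) -> float:
    """Current behaviour: transfer starts only after the full prefill."""
    return prefill_s + transfer_s


def layerwise_latency_s(prefill_s: float, transfer_s: float, num_layers: int) -> float:
    """Layer i's KV is sent as soon as it is computed and the link is free."""
    compute_per_layer = prefill_s / num_layers
    transfer_per_layer = transfer_s / num_layers
    compute_done = 0.0   # time at which the current layer's KV is ready
    transfer_done = 0.0  # time at which the link finishes its last send
    for _ in range(num_layers):
        compute_done += compute_per_layer
        # A send cannot start before its layer is computed or the link is free.
        transfer_done = max(transfer_done, compute_done) + transfer_per_layer
    return transfer_done


seq = sequential_latency_s(5.0, 1.1)
pipe = layerwise_latency_s(5.0, 1.1, num_layers=48)
exposed = pipe - 5.0  # transfer time not hidden behind compute

print(f"sequential={seq:.2f}s layerwise={pipe:.3f}s exposed={exposed * 1000:.0f}ms")
```

In this model, when per-layer compute exceeds per-layer transfer, only the last layer's transfer stays on the critical path. Real gains would be smaller if setup overhead, rather than bandwidth, dominates the 1.1 s.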
+
+**Experiment**: TODO — Profile actual RDMA transfer time vs setup overhead. Check if `start_load_kv()` and `wait_for_layer_load()` APIs support layerwise loading (they exist in the interface but Mooncake doesn't implement them).
+
+---
+
+## H6: Session migration breaks KV cache locality for future turns
+
+**Claim**: When a HEAVY request is offloaded from C_s to D, session affinity moves to D. D has no prior cache for this session; its only KV for the session is the current turn's, transferred via RDMA. Future turns are routed to D and can reuse that KV only if the RDMA-transferred blocks are properly registered in D's prefix cache, which has not been verified.
+
+**Questions**:
+- Does vLLM's prefix cache recognize RDMA-transferred blocks as cacheable?
+- If yes, future turns on D should have similar APC to staying on C_s.
+- If no, every turn after migration is a cold start on D.
+
+**From vLLM metrics**: `external_prefix_cache_hits_total` counts cross-instance cache hits. If this is > 0 on D after migration, the transferred blocks ARE cacheable.
+
+**Experiment**: TODO — Track per-instance APC before and after session migration. Check if D's APC for migrated sessions matches expectations.
+
+---
+
+## Summary of Current Understanding
+
+```
+                    Turn 1 (cold)           Turn 2+ (cached)
+                    ─────────────           ────────────────
+Co-located:         ✅ Best (no overhead)   ⚠️ HEAVY disrupts decode
+Offload (V2):       ❌ Adds RDMA overhead   ✅ C_s fast prefill + D load balance
+```
+
+The optimal strategy is **hybrid**: co-locate cold turn-1, offload cached turn-2+.
+
+This is the key insight for the paper: **the offload decision should be cache-aware, not size-based**. An 80k-token request with a 90% cache hit is effectively an 8k-token prefill: MEDIUM, not HEAVY. The "heaviness" that matters for PD disaggregation is `new_tokens_to_compute`, not `total_input_length`.