Commit Graph

12 Commits

Author SHA1 Message Date
e45f00eb68 Cache policy simulation: routing quality dominates, not eviction policy
With balanced session-sticky routing:
  LRU APC = 49.2% (only 1.8pp below infinite 51.0%)
  LFU APC = 43.5% (worse than LRU!)
  SessionProtLRU = 49.0% (no improvement)

The previous 10.1pp gap was from routing imbalance (all traffic to inst_0),
not from cache eviction policy. Balanced routing recovers 5.9pp of the gap.

Multi-turn sessions get 80.1% APC with simple LRU + session-sticky routing
because inter-turn gap is only 2 requests (LRU naturally keeps it warm).

Conclusion: fix routing balance, not cache policy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 01:28:53 +08:00
10636b1ab1 KV cache lifecycle design + eviction loss analysis
Root cause of 10.1pp APC gap: multi-turn sessions' KV evicted between
turns by cold-start prefills (66% of loss). Inter-turn gap is only 2
requests p50, but LRU cache (550 blocks) can't protect 93 blocks/session
across 14-21 concurrent sessions.

Three approaches designed:
  A. Session-sticky routing with KV reservation (proxy-only, no vLLM change)
  B. Two-tier KV cache: GPU + DRAM offload via Mooncake
  C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch)

Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds,
then implement Approach A (lowest effort, immediate benchmark).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 01:27:22 +08:00
d11d9f5cb9 Adaptive prefill offload v1: implementation + experiment
Added --heavy-threshold to cache_aware_proxy.py. HEAVY requests (new
tokens >= threshold) route to instance with least decode load; WARM/MEDIUM
route by cache-hit + token-level LB as before.

Result: no significant difference vs baseline on single-machine combined mode.
  TTFT: +1.2%, TPOT: -1.5%, E2E: -0.3% (all within noise)

Per-class TTFT breakdown shows the optimization target:
  WARM (75 req):   p50=0.198s  (cache hit, nearly free)
  MEDIUM (72 req): p50=1.356s
  HEAVY (54 req):  p50=7.124s  (36x slower than WARM)

Conclusion: single-machine combined mode already distributes load well
enough that adaptive routing adds no benefit. True isolation of HEAVY
prefills requires cross-machine offload (v2 with Mooncake or multi-node).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 01:00:10 +08:00
d6e47d3742 Design doc: Adaptive Prefill Offload
All 8 GPUs stay PD-combined. Global scheduler classifies requests as
WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache.
Only HEAVY requests (20%, cold start >20k new tokens) get offloaded;
80% of requests are co-located with zero KV transfer.

This avoids the KV cache memory wall (no decode concentration) while
isolating heavy prefills from decode when needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:44:22 +08:00
445e491123 Add vLLM v0.18.1 source tree with KV transfer abort fix
third_party/vllm/ now tracked in git for direct patch management.
Based on vLLM v0.18.1 release with one patch applied:

  vllm/v1/core/sched/scheduler.py:
    Replace fatal assert with graceful skip when KV transfer callback
    arrives for an already-aborted request during PD disaggregated serving.

Future vLLM modifications should be made directly in third_party/vllm/
and committed normally. The patches/ directory is kept as documentation
of what changed from upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:30:38 +08:00
b6591950bc Add vLLM patches directory for version-controlled patch management
patches/0001-fix-kv-transfer-abort-race.patch:
  Fix scheduler assert crash when KV transfer callback arrives
  after request abort in PD-disaggregated serving.

patches/README.md:
  How to apply patches to source tree or installed package.
  Per-patch description with problem/fix/impact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:26:14 +08:00
efa70f05b5 Consolidate analysis into single report with appendix
Merged roofline_analysis.md into pd_separation_analysis.md.
Restructured as a self-contained research report:

1. TL;DR with key finding (KV cache memory wall)
2. Workload characterization (trace stats + cache reuse)
3. Experiment setup (hardware, software, configs, scripts)
4. Results (main comparison, GPU util, breakdown, ablations)
5. Analysis (DistServe assumptions, roofline, root cause)
6. Conclusions
7. Appendix: all experiment artifacts, data paths, reproducing steps

One document to read, with pointers to data for deeper analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:23:23 +08:00
ce616f46d1 Add per-request breakdown profiling, identify KV cache memory bottleneck
Breakdown profiling at proxy level captures:
  t_proxy_recv → t_prefill_sent → t_prefill_done → t_decode_sent → t_first_token

Key finding: 87.7% of TTFT is spent in kv+decode phase, NOT prefill.
Root cause: decode instance KV cache memory saturation (97.1% usage).

With 6P+2D config, 2 decode GPUs have only ~56GB total KV cache.
Large agentic requests (avg 33.6k tokens) fill this quickly.
Small requests (49 tokens, prefill=0.044s) wait 114s for KV cache
to be freed by large requests completing decode.

vLLM log confirms: Running=0, Waiting=6, KV cache=97.1%
GPU is idle but requests queue for KV cache memory, not compute.

This is the fundamental bottleneck of single-machine PD separation
for long-context agentic workloads: concentrating decode onto fewer
GPUs creates a KV cache memory wall.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:13:50 +08:00
c7afdc5074 Ablation 2: fire-and-forget vs await-prefill scheduling
Added --fire-and-forget flag to cache_aware_proxy.py for async prefill dispatch.

Results on 6P+2D config:
  Await:  TTFT=1.48s  TPOT=0.066s  E2E=5.95s  94% success
  FnF:    TTFT=5.32s  TPOT=0.037s  E2E=11.9s  85% success

Fire-and-forget improves TPOT by 44% (pipeline overlap) but degrades
TTFT by 260% (decode internally waits for KV, less efficiently than
proxy-level await) and increases errors from KV race conditions.

Full 4-way ablation summary in analyze_ablations.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 23:02:42 +08:00
9dee25907b Add P/D ratio ablation: 6P+2D vs 4P+4D vs Combined
6P+2D gives more GPUs to prefill, fewer to decode:
- Decode util: 7.8% (4D) -> 19.0% (2D), less waste
- TTFT: 1.99s (4P) -> 1.48s (6P), -26% from less prefill queuing
- But Combined (30.5% util, TTFT 1.01s) still best overall

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 22:42:20 +08:00
67149130be Add GPU utilization A/B test and fix cache-aware proxy bugs
- GPU monitor: 5s interval nvidia-smi sampling during benchmarks
- A/B test script: clean restart + monitor + benchmark for Combined vs PD-Sep
- Fixed proxy: await bootstrap init (race condition), normalized LB scoring
- Fixed port conflicts: proxy 9090 to avoid bootstrap 9000 clash

Key finding: PD-Sep GPU utilization is 40% of Combined (12.4% vs 30.5%)
- Decode GPUs: mean=7.8%, max=47% (memory-bound, compute wasted)
- Prefill GPUs: active only 17% of samples (bursty, idle between requests)
- Combined: 8 GPUs flexibly used, mean=30.5%, active=64%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 22:13:38 +08:00
05592e6adc Agentic workload PD separation analysis with trace-driven benchmarks
Systematic study of prefill-decode disaggregation for agentic LLM workloads
using production GLM-5.1 coder trace (2.1M requests, 71B input tokens).

Key findings:
- Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7%
  without PD separation, matching PD-Sep's decode isolation benefit
- PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain
  when using the same cache-aware scheduler
- Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x
  vs decode AI <2), but absolute FLOPs drop 71% from cache hits
- For agentic MoE workloads, cache-aware routing > PD separation

Infrastructure:
- Trace sampler preserving session structure + hash_ids for prefix sharing
- Async trace replayer with streaming TTFT/TPOT/E2E measurement
- Unified cache-aware + token-level load-balanced global scheduler proxy
  supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes
- vLLM 0.18.1 scheduler patch for KV transfer abort race condition
- Roofline analysis tool for prefill/decode compute characterization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 21:21:57 +08:00