Commit Graph

8 Commits

Author SHA1 Message Date
2b0ac70ee7 Phase 1 milestone: system-level analysis + reproducible report
- REPORT.md: self-contained milestone report covering baseline vs elastic
  setup, exact launch commands, benchmark params, results, log locations,
  and repo structure — sufficient for anyone to reproduce
- analysis/pd_separation_analysis.md §5: elastic P2P system-level breakdown
  (KV cache hit ratio, per-class TTFT, GPU util paradox explanation)
- scripts/cache_aware_proxy.py: round-robin P-instance selection replacing
  argmin(ongoing_tokens) to fix GPU load imbalance (3.0x → expected ~2x)
- scripts/launch_elastic_p2p.sh: one-command launch for elastic P2P config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 16:17:41 +08:00
1d2eeb4925 Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080)
Design: offload HEAVY prefill only when P instance is less loaded than D
AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D
for future KV reuse. External KV correctly registered in prefix cache.

Result (67/200 processed, 75% success):
  TTFT p50: 0.551s (-49% vs baseline 1.080s)
  TTFT p90: 4.135s (vs baseline 9.410s, -56%)
  TPOT p90: 0.074s (same as baseline)
  E2E  p50: 2.938s (-45% vs baseline 5.306s)

25% error rate from ReadTimeout on very large HEAVY requests queuing on P.
Needs stricter elastic gate or higher timeout. But successful requests
show significant improvement over both baseline and previous P2P.

Also: added external_prefix_cache metrics tracking to replayer summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 13:50:25 +08:00
a65ec42467 Update report: adaptive v2 confirms no KV transfer helps single-machine
All PD/offload schemes tested are worse than PD-combined + hybrid routing:
  Combined hybrid:    TTFT=0.737  TPOT90=0.072  APC=49.4%  (BEST)
  PD-Sep 4P+4D:       TTFT=1.994  TPOT90=0.075  APC=40.2%
  Adaptive v2 offload: TTFT=1.462  TPOT90=0.077  APC=~45%

Definitive: single-machine agentic serving = PD-combined + smart routing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 10:15:08 +08:00
795edc6c66 Overnight work report: routing optimization achieves +4.7pp APC
Summary of overnight autonomous session:
- Analyzed agentic workload patterns (91% KV reuse is intra-session)
- Simulated cache policies (LRU near-optimal, routing is the bottleneck)
- Implemented hybrid routing (session-sticky + load-aware override)
- Result: APC 44.7% -> 49.4% with zero latency regression

Key insight: routing quality > cache policy > PD separation for
single-machine agentic workloads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 02:54:48 +08:00
10636b1ab1 KV cache lifecycle design + eviction loss analysis
Root cause of 10.1pp APC gap: multi-turn sessions' KV evicted between
turns by cold-start prefills (66% of loss). Inter-turn gap is only 2
requests p50, but LRU cache (550 blocks) can't protect 93 blocks/session
across 14-21 concurrent sessions.

Three approaches designed:
  A. Session-sticky routing with KV reservation (proxy-only, no vLLM change)
  B. Two-tier KV cache: GPU + DRAM offload via Mooncake
  C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch)

Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds,
then implement Approach A (lowest effort, immediate benchmark).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 01:27:22 +08:00
d6e47d3742 Design doc: Adaptive Prefill Offload
All 8 GPUs stay PD-combined. Global scheduler classifies requests as
WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache.
Only HEAVY requests (20%, cold start >20k new tokens) get offloaded;
80% of requests are co-located with zero KV transfer.

This avoids the KV cache memory wall (no decode concentration) while
isolating heavy prefills from decode when needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:44:22 +08:00
efa70f05b5 Consolidate analysis into single report with appendix
Merged roofline_analysis.md into pd_separation_analysis.md.
Restructured as a self-contained research report:

1. TL;DR with key finding (KV cache memory wall)
2. Workload characterization (trace stats + cache reuse)
3. Experiment setup (hardware, software, configs, scripts)
4. Results (main comparison, GPU util, breakdown, ablations)
5. Analysis (DistServe assumptions, roofline, root cause)
6. Conclusions
7. Appendix: all experiment artifacts, data paths, reproducing steps

One document to read, with pointers to data for deeper analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:23:23 +08:00
05592e6adc Agentic workload PD separation analysis with trace-driven benchmarks
Systematic study of prefill-decode disaggregation for agentic LLM workloads
using production GLM-5.1 coder trace (2.1M requests, 71B input tokens).

Key findings:
- Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7%
  without PD separation, matching PD-Sep's decode isolation benefit
- PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain
  when using the same cache-aware scheduler
- Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x
  vs decode AI <2), but absolute FLOPs drop 71% from cache hits
- For agentic MoE workloads, cache-aware routing > PD separation

Infrastructure:
- Trace sampler preserving session structure + hash_ids for prefix sharing
- Async trace replayer with streaming TTFT/TPOT/E2E measurement
- Unified cache-aware + token-level load-balanced global scheduler proxy
  supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes
- vLLM 0.18.1 scheduler patch for KV transfer abort race condition
- Roofline analysis tool for prefill/decode compute characterization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 21:21:57 +08:00