agentic-kvc

253 Commits 1 Branch 0 Tags

Author	SHA1	Message	Date

Gahow Wang

9dee25907b

Add P/D ratio ablation: 6P+2D vs 4P+4D vs Combined

6P+2D gives more GPUs to prefill, fewer to decode:
- Decode util: 7.8% (4D) -> 19.0% (2D), less waste
- TTFT: 1.99s (4P) -> 1.48s (6P), -26% from less prefill queuing
- But Combined (30.5% util, TTFT 1.01s) still best overall
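
The percentages in this message can be rechecked from the quoted figures alone. The snippet below is a small arithmetic sketch, not part of the repository; the helper name `relative_change` and the variable names are illustrative.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change from `before` to `after` as a fraction."""
    return (after - before) / before


# Figures quoted in the commit message above.
ttft_4p, ttft_6p, ttft_combined = 1.99, 1.48, 1.01  # seconds
decode_util_4d, decode_util_2d = 7.8, 19.0          # percent

# TTFT improvement from moving two GPUs from decode to prefill.
ttft_delta = relative_change(ttft_4p, ttft_6p)
print(f"TTFT 4P -> 6P: {ttft_delta:+.1%}")            # -25.6%

# Remaining TTFT gap between the best PD split and the Combined setup.
gap = relative_change(ttft_combined, ttft_6p)
print(f"6P+2D TTFT vs Combined: {gap:+.1%}")          # +46.5%

# How much busier each decode GPU becomes when there are only two of them.
util_ratio = decode_util_2d / decode_util_4d
print(f"Decode util 2D / 4D: {util_ratio:.2f}x")      # 2.44x
```

The first line reproduces the reported -26% after rounding; the second shows why Combined is still called best overall, since 6P+2D remains roughly 47% slower on TTFT.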

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-21 22:42:20 +08:00

Gahow Wang

67149130be

Add GPU utilization A/B test and fix cache-aware proxy bugs

- GPU monitor: 5s interval nvidia-smi sampling during benchmarks
- A/B test script: clean restart + monitor + benchmark for Combined vs PD-Sep
- Fixed proxy: await bootstrap init (race condition), normalized LB scoring
- Fixed port conflicts: moved proxy to port 9090 to avoid a clash with bootstrap port 9000

Key finding: PD-Sep GPU utilization is 40% of Combined (12.4% vs 30.5%)
- Decode GPUs: mean=7.8%, max=47% (memory-bound, compute wasted)
- Prefill GPUs: active only 17% of samples (bursty, idle between requests)
- Combined: 8 GPUs flexibly used, mean=30.5%, active=64%
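
A monitor of the kind described above could look roughly like the following. This is a minimal sketch, not the repository's script: the function names are hypothetical, and treating a sample as "active" when utilization is above 0% is an assumption, since the commit does not state its threshold. The `nvidia-smi` query flags used are the standard CSV query options.

```python
import subprocess
import time
from statistics import mean

# Standard nvidia-smi CSV query: one "index, utilization" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_utilization(csv_text: str) -> dict[int, float]:
    """Parse 'index, util' lines into {gpu_index: utilization_percent}."""
    readings = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        readings[int(index)] = float(util)
    return readings


def sample_once() -> dict[int, float]:
    """Take one utilization reading from every visible GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(out.stdout)


def monitor(duration_s: float, interval_s: float = 5.0) -> list[dict[int, float]]:
    """Sample all GPUs every `interval_s` seconds for `duration_s` seconds."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(sample_once())
        time.sleep(interval_s)
    return samples


def summarize(samples: list[dict[int, float]], active_threshold: float = 0.0) -> dict:
    """Mean, max, and fraction of readings above `active_threshold` (assumed definition)."""
    values = [u for sample in samples for u in sample.values()]
    active = sum(u > active_threshold for u in values) / len(values)
    return {"mean": mean(values), "max": max(values), "active": active}


# Canned two-GPU output so the summary logic can be checked without a GPU.
canned = [parse_utilization("0, 0\n1, 40"), parse_utilization("0, 10\n1, 0")]
stats = summarize(canned)
print(stats)  # {'mean': 12.5, 'max': 40.0, 'active': 0.5}
```

On a GPU host, `summarize(monitor(duration_s=600))` would produce the same three statistics for a ten-minute benchmark window. The headline ratio follows directly from the quoted means: 12.4 / 30.5 is about 0.41, which matches the "40% of Combined" statement.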

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-21 22:13:38 +08:00

Gahow Wang

05592e6adc

Agentic workload PD separation analysis with trace-driven benchmarks

Systematic study of prefill-decode disaggregation for agentic LLM workloads
using production GLM-5.1 coder trace (2.1M requests, 71B input tokens).

Key findings:
- Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7%
  without PD separation, matching PD-Sep's decode isolation benefit
- PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain
  when using the same cache-aware scheduler
- Prefill remains compute-bound even at 95% KV cache reuse (arithmetic
  intensity, AI, >1000x vs decode AI <2), but absolute FLOPs drop 71% from cache hits
- For agentic MoE workloads, cache-aware routing > PD separation
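
The compute-bound versus memory-bound distinction in the findings above can be illustrated with a back-of-the-envelope roofline calculation. This is a simplified sketch for a single dense linear layer, not the repository's roofline tool and not a model of GLM-5.1. The layer width, the bf16 element size, and the hardware peak numbers are placeholder assumptions, so it will not reproduce the exact figures in the commit.

```python
def arithmetic_intensity(n_tokens: int, d_in: int, d_out: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one (n_tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * n_tokens * d_in * d_out                       # multiply + add
    weight_bytes = d_in * d_out * bytes_per_elem              # weights read once
    act_bytes = n_tokens * (d_in + d_out) * bytes_per_elem    # inputs + outputs
    return flops / (weight_bytes + act_bytes)


def classify(ai: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare AI against the hardware ridge point (peak FLOP/s per byte/s)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if ai > ridge else "memory-bound"


# Placeholder hardware numbers (assumptions, not measurements).
PEAK_FLOPS = 1.0e15      # FLOP/s
MEM_BANDWIDTH = 3.0e12   # bytes/s

# A 20,000-token prompt with 95% prefix reuse still prefills 1,000 new tokens.
prefill_ai = arithmetic_intensity(n_tokens=1000, d_in=4096, d_out=4096)
# A single decode step processes one token per sequence.
decode_ai = arithmetic_intensity(n_tokens=1, d_in=4096, d_out=4096)

print(f"prefill AI ~ {prefill_ai:.0f}: {classify(prefill_ai, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"decode  AI ~ {decode_ai:.2f}: {classify(decode_ai, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

Under these assumptions the weight read is amortized over every new prefill token, so intensity grows with the number of uncached tokens, while a single-token decode step stays near one FLOP per byte. That is the qualitative pattern the finding describes: cache hits reduce how much prefill work there is, not which resource limits it.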

Infrastructure:
- Trace sampler preserving session structure + hash_ids for prefix sharing
- Async trace replayer with streaming TTFT/TPOT/E2E measurement
- Unified cache-aware + token-level load-balanced global scheduler proxy
  supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes
- vLLM 0.18.1 scheduler patch for KV transfer abort race condition
- Roofline analysis tool for prefill/decode compute characterization
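
One way a cache-aware, token-level load-balanced router might score workers is sketched below. This is an illustrative toy, not the proxy in this repository: the `Worker` class, the `load_weight` parameter, and the scoring formula are all assumptions. It only shows the general idea of trading prefix-cache hits (matched on leading `hash_ids`) against normalized pending-token load.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    cached_blocks: set[int] = field(default_factory=set)  # block hashes seen so far
    pending_tokens: int = 0                               # token-level load estimate


def prefix_hit_blocks(hash_ids: list[int], cached: set[int]) -> int:
    """Count leading blocks already cached; a prefix match stops at the first miss."""
    hits = 0
    for block_hash in hash_ids:
        if block_hash not in cached:
            break
        hits += 1
    return hits


def pick_worker(workers: list[Worker], hash_ids: list[int],
                load_weight: float = 0.5) -> Worker:
    """Pick the worker with the best (cache hit ratio - weighted normalized load)."""
    max_load = max(w.pending_tokens for w in workers) or 1  # avoid divide-by-zero

    def score(w: Worker) -> float:
        hit_ratio = prefix_hit_blocks(hash_ids, w.cached_blocks) / max(len(hash_ids), 1)
        load_ratio = w.pending_tokens / max_load
        return hit_ratio - load_weight * load_ratio

    return max(workers, key=score)


def dispatch(workers: list[Worker], hash_ids: list[int], new_tokens: int) -> Worker:
    """Route one request and update the chosen worker's cache and load bookkeeping."""
    chosen = pick_worker(workers, hash_ids)
    chosen.cached_blocks.update(hash_ids)
    chosen.pending_tokens += new_tokens
    return chosen


workers = [Worker("a"), Worker("b")]
first = dispatch(workers, [1, 2, 3], new_tokens=300)      # cold start, goes to "a"
second = dispatch(workers, [1, 2, 3, 4], new_tokens=100)  # shares a prefix with "a"
third = dispatch(workers, [9, 10], new_tokens=100)        # no shared prefix, idle "b" wins
print(first.name, second.name, third.name)  # a a b
```

A real proxy would also decrement `pending_tokens` as requests finish and account for cache eviction; both are omitted here to keep the scoring logic visible.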

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-21 21:21:57 +08:00
