agentic-kvc

Go to file

Gahow Wang d6e47d3742 Design doc: Adaptive Prefill Offload

All 8 GPUs stay PD-combined. Global scheduler classifies requests as
WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache.
Only HEAVY requests (20%, cold start >20k new tokens) get offloaded;
80% of requests are co-located with zero KV transfer.

This avoids the KV cache memory wall (no decode concentration) while
isolating heavy prefills from decode when needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 00:44:22 +08:00

analysis

Design doc: Adaptive Prefill Offload

2026-05-22 00:44:22 +08:00

patches

Add vLLM patches directory for version-controlled patch management

2026-05-22 00:26:14 +08:00

replayer

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

scripts

Add per-request breakdown profiling, identify KV cache memory bottleneck