agentic-kvc

Go to file

Gahow Wang d11d9f5cb9 Adaptive prefill offload v1: implementation + experiment

Added --heavy-threshold to cache_aware_proxy.py. HEAVY requests (new
tokens >= threshold) route to instance with least decode load; WARM/MEDIUM
route by cache-hit + token-level LB as before.

Result: no significant difference vs baseline on single-machine combined mode.
  TTFT: +1.2%, TPOT: -1.5%, E2E: -0.3% (all within noise)

Per-class TTFT breakdown shows the optimization target:
  WARM (75 req):   p50=0.198s  (cache hit, nearly free)
  MEDIUM (72 req): p50=1.356s
  HEAVY (54 req):  p50=7.124s  (36x slower than WARM)

Conclusion: single-machine combined mode already distributes load well
enough that adaptive routing adds no benefit. True isolation of HEAVY
prefills requires cross-machine offload (v2 with Mooncake or multi-node).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 01:00:10 +08:00

analysis

Design doc: Adaptive Prefill Offload

2026-05-22 00:44:22 +08:00

patches

Add vLLM patches directory for version-controlled patch management

2026-05-22 00:26:14 +08:00

replayer

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

scripts

Adaptive prefill offload v1: implementation + experiment