agentic-kvc

Gahow Wang 2fee355626 Adaptive v2 (selective Mooncake offload): worse than baseline

Implemented --offload mode: HEAVY requests (>20k new tokens) get P on
least-loaded instance, KV via Mooncake RDMA, D on session-sticky instance.
WARM/MEDIUM stay co-located (no KV transfer). All 8 instances run kv_both.
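The routing rule above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `Instance`, `Plan`, `route`, the `load` metric, and the `sticky` session map are hypothetical names, and skipping the transfer when the least-loaded instance is already the session's home instance is an assumption. Only the 20k-token HEAVY threshold comes from the commit message.

```python
from dataclasses import dataclass

HEAVY_NEW_TOKENS = 20_000  # threshold stated in the commit message


@dataclass
class Instance:
    name: str
    load: int = 0  # hypothetical load metric (e.g. in-flight tokens)


@dataclass
class Plan:
    prefill: str       # instance that runs prefill (P)
    decode: str        # instance that runs decode (D)
    kv_transfer: bool  # True if KV must be moved between instances


def least_loaded(instances):
    """Return the name of the instance with the smallest load."""
    return min(instances, key=lambda inst: inst.load).name


def route(new_tokens, session_id, instances, sticky):
    """Pick prefill/decode instances for one request.

    `sticky` maps session_id -> home instance name; a new session is
    pinned to the currently least-loaded instance (assumption).
    """
    home = sticky.setdefault(session_id, least_loaded(instances))
    if new_tokens > HEAVY_NEW_TOKENS:
        # HEAVY: prefill on the least-loaded instance, decode at home.
        p = least_loaded(instances)
        # Assumption: no transfer is needed if P and D coincide.
        return Plan(prefill=p, decode=home, kv_transfer=(p != home))
    # WARM/MEDIUM: stay co-located on the sticky instance.
    return Plan(prefill=home, decode=home, kv_transfer=False)


if __name__ == "__main__":
    pool = [Instance("gpu0", 50), Instance("gpu1", 10), Instance("gpu2", 30)]
    sessions = {"s1": "gpu0"}
    print(route(25_000, "s1", pool, sessions))
    print(route(5_000, "s1", pool, sessions))
```

In this sketch a HEAVY request pays for one KV transfer whenever its prefill lands away from the session's home instance, which is the cost the results below measure.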

Result (200 req, same instances, fresh restart):
  Baseline (no offload):   TTFT=1.073  TPOT90=0.074  E2E=5.086
  Offload HEAVY:           TTFT=1.462  TPOT90=0.077  E2E=6.847
  Delta:                   +36%        +4%           +35%
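The Delta row can be reproduced from the two result rows as a relative change against the baseline. A small sketch; the dictionary layout and the `relative_delta_pct` helper are illustrative, and the numbers are copied from the results above:

```python
# Metrics copied from the result rows above.
baseline = {"TTFT": 1.073, "TPOT90": 0.074, "E2E": 5.086}
offload = {"TTFT": 1.462, "TPOT90": 0.077, "E2E": 6.847}


def relative_delta_pct(base, new):
    """Relative change of each metric versus baseline, in whole percent."""
    return {k: round((new[k] - base[k]) / base[k] * 100) for k in base}


if __name__ == "__main__":
    for metric, pct in relative_delta_pct(baseline, offload).items():
        print(f"{metric}: {pct:+d}%")
```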

Conclusion: even selective KV transfer (only 44% of requests) adds more
overhead than the isolation benefit recovers. On a single 8-GPU machine,
PD-combined with hybrid routing is the best configuration tested. No
form of KV transfer tried so far (full PD-sep, selective offload)
improves over co-located serving for this workload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 10:14:10 +08:00

analysis

Overnight work report: routing optimization achieves +4.7pp APC

2026-05-22 02:54:48 +08:00

patches

Add vLLM patches directory for version-controlled patch management

2026-05-22 00:26:14 +08:00

replayer

Balanced session-sticky routing + agentic workload pattern analysis

2026-05-22 01:50:27 +08:00

scripts

Adaptive v2 (selective Mooncake offload): worse than baseline