agentic-kvc

Files

Gahow Wang 1d2eeb4925 Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080)

Design: offload HEAVY prefill to the P instance only when P is less loaded
than D AND P is not overloaded (load < 1.5x avg). Session stickiness on D is
preserved for future KV reuse. External KV is correctly registered in the
prefix cache.
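
The gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in replay.py: the names `InstanceLoad` and `should_offload`, the load unit, and the meaning of "avg" (taken here as the mean load across instances) are all assumptions. Only the two conditions and the 1.5x factor come from the commit message.

```python
from dataclasses import dataclass

# From the commit message: P counts as overloaded at 1.5x the average load.
OVERLOAD_FACTOR = 1.5


@dataclass
class InstanceLoad:
    """Hypothetical load snapshot; the unit (e.g. pending tokens) is an assumption."""
    name: str
    load: float


def should_offload(is_heavy: bool, p: InstanceLoad, d: InstanceLoad, avg_load: float) -> bool:
    """Return True if a request's prefill should be offloaded from D to P."""
    if not is_heavy:
        # Only HEAVY prefill is a candidate; everything else stays on D.
        return False
    p_less_loaded = p.load < d.load
    p_not_overloaded = p.load < OVERLOAD_FACTOR * avg_load
    return p_less_loaded and p_not_overloaded


if __name__ == "__main__":
    p = InstanceLoad("P", load=10.0)
    d = InstanceLoad("D", load=20.0)
    print(should_offload(True, p, d, avg_load=10.0))
```

When the gate returns False the request runs its prefill on D, which is also what keeps the session sticky to D.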

Result (67/200 processed, 75% success):
  TTFT p50: 0.551s (-49% vs baseline 1.080s)
  TTFT p90: 4.135s (-56% vs baseline 9.410s)
  TPOT p90: 0.074s (same as baseline)
  E2E  p50: 2.938s (-45% vs baseline 5.306s)
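
The percentages above are relative changes against the baseline, which can be checked with a few lines. The helper name `relative_change_pct` is hypothetical; the input numbers are the ones reported in this commit message.

```python
def relative_change_pct(new: float, baseline: float) -> float:
    """Relative change of `new` versus `baseline`, in percent (negative = faster)."""
    return (new - baseline) / baseline * 100.0


# (measured, baseline) pairs in seconds, copied from the result list.
results = {
    "TTFT p50": (0.551, 1.080),
    "TTFT p90": (4.135, 9.410),
    "E2E  p50": (2.938, 5.306),
}

if __name__ == "__main__":
    for name, (new, base) in results.items():
        print(f"{name}: {relative_change_pct(new, base):+.0f}%")
```

TPOT p90 is left out because the message only states it matches the baseline, so its change is zero.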

The 25% error rate comes from ReadTimeout on very large HEAVY requests
queuing on P. This needs a stricter elastic gate or a higher timeout.
Successful requests still show a significant improvement over both the
baseline and the previous P2P.

Also: added external_prefix_cache metrics tracking to replayer summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 13:50:25 +08:00

__init__.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

__main__.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

metrics.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

replay.py

Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080)

2026-05-22 13:50:25 +08:00

trace.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00