agentic-kvc

Files

Gahow Wang 9cebdb6b9b Fix multi-turn replay fidelity: track realized output tokens across all components

The replayer and proxy built multi-turn prompts from the output tokens
recorded in the trace, but at replay time the model generates different
output tokens. Later turns therefore carried prefix tokens that did not
match what the server had produced, causing cache misses and invalid
experimental measurements.

- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids
  to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized
  prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks),
  track failed_recving_block_ids for error recovery

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-24 14:47:51 +08:00
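
The prefix correction described in the commit above can be illustrated with a small sketch. It assumes that each turn's trace prompt begins with the previous turn's prompt followed by the previous turn's trace output, and that forcing min_tokens=max_tokens makes the realized output the same length as the trace output. The function name `apply_realized_prefix_sketch` and its signature are hypothetical; the actual `_apply_realized_prefix` in replay.py may differ.

```python
from typing import List


def apply_realized_prefix_sketch(
    trace_prompt: List[int], realized_prefix: List[int]
) -> List[int]:
    """Hypothetical helper: rebuild the next turn's prompt.

    trace_prompt    -- next turn's prompt token ids as recorded in the trace
    realized_prefix -- previous prompt + the output tokens the model
                       actually generated during replay

    Assumption: the first len(realized_prefix) tokens of trace_prompt are the
    stale copy of that prefix, so they can be overwritten in place.
    """
    if len(realized_prefix) > len(trace_prompt):
        raise ValueError("realized prefix is longer than the next trace prompt")
    # Keep only the new tokens of this turn (everything after the stale prefix).
    new_turn_tokens = trace_prompt[len(realized_prefix):]
    return realized_prefix + new_turn_tokens


# Turn 1: prompt [1, 2, 3]; the trace recorded output [10, 11],
# but the replayed model actually produced [20, 21].
turn1_prompt = [1, 2, 3]
realized_output = [20, 21]
turn2_trace_prompt = [1, 2, 3, 10, 11, 4, 5]

turn2_prompt = apply_realized_prefix_sketch(
    turn2_trace_prompt, turn1_prompt + realized_output
)
```

With the corrected prompt, the prefix of turn 2 matches the tokens the server already holds in its KV cache, which is the condition a prefix-cache hit depends on.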

__init__.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

__main__.py

replayer: wire --max-inflight-sessions cap into replay loop (B2)

2026-05-23 21:04:09 +08:00

metrics.py

metrics: replace round-based percentile with linear interpolation (B5)

2026-05-23 21:00:24 +08:00

replay.py

Fix multi-turn replay fidelity: track realized output tokens across all components

2026-05-24 14:47:51 +08:00

trace.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00
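
The metrics.py entry above mentions replacing a round-based percentile with linear interpolation. A minimal sketch of that technique follows, assuming the common definition where the fractional rank is (n - 1) * q / 100 over the sorted samples. The function name `percentile_linear` is hypothetical and is not taken from metrics.py.

```python
from typing import Sequence


def percentile_linear(values: Sequence[float], q: float) -> float:
    """Hypothetical helper: q-th percentile (0..100) by linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be within [0, 100]")
    ordered = sorted(values)
    # Fractional position of the percentile within the sorted samples.
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)  # floor, since rank is non-negative
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    # Interpolate between the two neighbouring samples.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


median = percentile_linear([10, 20, 30, 40], 50)  # 25.0
```

A round-based percentile would snap the rank to a single sample (20 or 30 here), so tail latencies such as p99 can jump between runs with small sample counts; interpolation avoids that discontinuity.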