Fix multi-turn replay fidelity: track realized output tokens across all components

The replayer and proxy were building multi-turn prompts from trace tokens, but the model generates different output tokens. Subsequent turns had wrong prefix tokens, causing cache misses and invalid experimental measurements.

- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks), track failed_recving_block_ids for error recovery

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
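
Illustration only, not the code from this commit: the next-turn correction described above can be sketched roughly as follows. The function name apply_realized_prefix, its parameters, and the flat token-list layout (previous prompt, then previous output, then the new turn's tokens) are assumptions made for this sketch; the actual _apply_realized_prefix in replay.py may differ.

```python
from typing import List, Sequence


def apply_realized_prefix(
    next_prompt: Sequence[int],
    prev_prompt_len: int,
    trace_output_len: int,
    realized_output: Sequence[int],
) -> List[int]:
    """Swap the trace's previous-turn output tokens for the realized ones.

    Assumed layout of next_prompt (hypothetical, for illustration):
        [previous prompt tokens][previous output tokens][new turn tokens]
    """
    start = prev_prompt_len
    end = prev_prompt_len + trace_output_len
    if start < 0 or trace_output_len < 0 or end > len(next_prompt):
        raise ValueError("trace output span falls outside the next-turn prompt")
    # Keep the shared prompt prefix, splice in what the model actually
    # generated, then append the tokens that are new in this turn.
    return list(next_prompt[:start]) + list(realized_output) + list(next_prompt[end:])


# Turn 1 prompt was [1, 2, 3]; the trace recorded output [10, 11],
# but the model actually produced [20, 21].
corrected = apply_realized_prefix(
    next_prompt=[1, 2, 3, 10, 11, 4, 5],
    prev_prompt_len=3,
    trace_output_len=2,
    realized_output=[20, 21],
)
```

When min_tokens equals max_tokens, the realized output has the same length as the trace output, so the corrected prompt keeps the trace's length and only the token values in the output span change.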
2026-05-24 14:47:51 +08:00
parent cc4a9c91e7
commit 9cebdb6b9b
5 changed files with 312 additions and 77 deletions
--- a/scripts/launch_elastic_p2p.sh
+++ b/scripts/launch_elastic_p2p.sh
@@ -90,6 +90,7 @@ $PYTHON "$PROJECT_DIR/scripts/cache_aware_proxy.py" \
    --combined $combined_args \
    --bootstrap-ports "$bootstrap_ports" \
    --offload \
+    --policy unified \
    --heavy-threshold $HEAVY_THRESHOLD \
    --port $PROXY_PORT &
 sleep 5