gahow/agentic-kvc

Gahow Wang 9cebdb6b9b Fix multi-turn replay fidelity: track realized output tokens across all components

The replayer and proxy built multi-turn prompts from trace tokens, but the
model generates different output tokens at replay time. Subsequent turns
therefore carried the wrong prefix tokens, causing cache misses and
invalidating experimental measurements.

- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids
  to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized
  prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks),
  track failed_recving_block_ids for error recovery

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
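
The prefix correction described in the replay.py bullet above can be illustrated with a minimal sketch. It assumes prompts are flat lists of token ids and that the realized output has the same length as the trace output (which is what setting min_tokens equal to max_tokens is meant to guarantee). The function `apply_realized_prefix` and all variable names here are hypothetical and are not the repository's actual `_apply_realized_prefix`.

```python
from typing import List


def apply_realized_prefix(
    next_prompt: List[int],
    trace_prefix_len: int,
    realized_prefix: List[int],
) -> List[int]:
    """Swap the trace-derived prefix of the next turn for the realized one.

    Hypothetical helper for illustration only; the real implementation in
    replay.py may differ.
    """
    if trace_prefix_len > len(next_prompt):
        raise ValueError("trace prefix is longer than the next-turn prompt")
    # Everything after the trace prefix is the new input for this turn.
    new_suffix = next_prompt[trace_prefix_len:]
    return list(realized_prefix) + new_suffix


# Turn 1 as recorded in the trace: prompt [1, 2, 3], output [10, 11].
trace_turn1_prompt = [1, 2, 3]
trace_turn1_output = [10, 11]
# Turn 2 in the trace extends turn 1 with new input [4, 5].
trace_turn2_prompt = trace_turn1_prompt + trace_turn1_output + [4, 5]

# What the model produced at replay time (same length, different ids).
realized_output = [20, 21]
realized_prefix = trace_turn1_prompt + realized_output

corrected = apply_realized_prefix(
    trace_turn2_prompt,
    trace_prefix_len=len(trace_turn1_prompt) + len(trace_turn1_output),
    realized_prefix=realized_prefix,
)
print(corrected)  # [1, 2, 3, 20, 21, 4, 5]
```

Without the swap, turn 2 would be sent with the trace output `[10, 11]` in its prefix, which no longer matches the tokens the server cached after turn 1, so the prefix cache lookup would miss from that position onward.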

2026-05-24 14:47:51 +08:00

Add comprehensive research findings document

2026-05-23 07:16:31 +08:00

Add elastic PS evaluation plan for production-realistic trace

2026-05-23 15:56:05 +08:00

Add vLLM patches directory for version-controlled patch management

2026-05-22 00:26:14 +08:00

Fix multi-turn replay fidelity: track realized output tokens across all components

2026-05-24 14:47:51 +08:00

Fix multi-turn replay fidelity: track realized output tokens across all components

2026-05-24 14:47:51 +08:00

proxy: Settings dataclass + cache-ratio gate + P-pick offload penalty (B4, M2, M3, D5)

2026-05-23 21:11:17 +08:00

third_party/vllm

Fix multi-turn replay fidelity: track realized output tokens across all components

2026-05-24 14:47:51 +08:00

.gitignore

Phase 1 milestone: system-level analysis + reproducible report

2026-05-22 16:17:41 +08:00

FIXES.md

Add FIXES.md with prioritized repo cleanup checklist

2026-05-23 20:35:56 +08:00

pyproject.toml

tests: add minimal coverage for percentile + proxy routing (S1)

2026-05-23 21:07:14 +08:00

REPORT.md

Fix A+C: real cache sync + cached-prefill-on-C architecture

2026-05-24 11:22:38 +08:00

TODO.md

LMetric routing policy (OSDI'26) + A/B results vs linear baseline

2026-05-22 16:57:32 +08:00

uv.lock

Fix NONE_HASH import: use module ref instead of from-import (value binding bug)

2026-05-24 01:32:19 +08:00