agentic-kvc

Gahow Wang 020be9f444 proxy: real LRU for cached_blocks + shadow-state reconcile loop (M1, M5)

M1: cached_blocks was a plain set whose eviction "trimmed half via list
slicing". CPython does not guarantee set iteration order, so the trim
discarded an arbitrary half of the entries. That is nothing like vLLM's
LRU, and it is a known contributor to the router's cache_hit estimate
diverging from real APC. Replace with an OrderedDict-backed LRU:
move_to_end on hits, popitem(last=False) on overflow. Capacity is exposed
as the CACHE_CAPACITY_BLOCKS module constant (200000 by default).
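
The M1 change can be sketched roughly as follows. This is a minimal
illustration, not the proxy's actual code: the BlockLRU class, its touch
method, and the use of integer block hashes are assumptions made for the
example. Only OrderedDict, move_to_end, popitem(last=False), and the
CACHE_CAPACITY_BLOCKS default come from the commit message.

```python
from collections import OrderedDict

# Default capacity named in the commit message.
CACHE_CAPACITY_BLOCKS = 200_000


class BlockLRU:
    """Hypothetical LRU set of block hashes (class and method names are illustrative)."""

    def __init__(self, capacity=CACHE_CAPACITY_BLOCKS):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Keys are block hashes; values are unused. Insertion order doubles
        # as recency order: the oldest entry sits at the front.
        self._blocks = OrderedDict()

    def touch(self, block_hash):
        """Record an access and return True if the block was already cached."""
        if block_hash in self._blocks:
            # Hit: mark as most recently used.
            self._blocks.move_to_end(block_hash)
            return True
        # Miss: insert, then evict least recently used entries on overflow.
        self._blocks[block_hash] = None
        while len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)
        return False

    def __contains__(self, block_hash):
        return block_hash in self._blocks

    def __len__(self):
        return len(self._blocks)
```

Unlike slicing a set, eviction here always removes the entry that was
touched longest ago, so the shadow cache ages out blocks in a
deterministic, recency-based order.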

M5: streamed responses decrement load counters in their generator's
finally block. If a client disconnects before consuming the body, the
generator is never entered and the decrement is lost, causing
ongoing_tokens / num_requests / pending_prefill_tokens to drift
negative under load. Add a background reconcile_loop that runs every
60s and clamps those counters at zero as a safety net. It is started in
lifespan and cancelled on shutdown. It does not replace proper vLLM
exact-state syncing.
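
A rough sketch of the M5 safety net, assuming the counters live on a
per-backend object. The BackendLoad dataclass, the clamp_counters helper,
and the RECONCILE_INTERVAL_S constant are hypothetical names introduced
for illustration. Only reconcile_loop, the three counter names, and the
60s period come from the commit message.

```python
import asyncio
from dataclasses import dataclass

# 60s period as stated in the commit message; the constant name is assumed.
RECONCILE_INTERVAL_S = 60.0

COUNTER_NAMES = ("ongoing_tokens", "num_requests", "pending_prefill_tokens")


@dataclass
class BackendLoad:
    """Hypothetical per-backend shadow state tracked by the proxy."""
    ongoing_tokens: int = 0
    num_requests: int = 0
    pending_prefill_tokens: int = 0


def clamp_counters(load):
    """Reset any negative counter to zero; return True if anything changed."""
    changed = False
    for name in COUNTER_NAMES:
        if getattr(load, name) < 0:
            setattr(load, name, 0)
            changed = True
    return changed


async def reconcile_loop(backends, interval=RECONCILE_INTERVAL_S):
    """Periodically clamp counters until the task is cancelled."""
    while True:
        await asyncio.sleep(interval)
        for load in backends:
            clamp_counters(load)
```

In an ASGI lifespan handler, the loop would typically be started with
asyncio.create_task(reconcile_loop(backends)) before the application
begins serving, then cancelled and awaited during shutdown. Clamping only
hides drift below zero; it cannot recover the true value, which is why
exact-state syncing with vLLM is still needed.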

2026-05-23 21:00:35 +08:00

analysis

Add comprehensive research findings document

2026-05-23 07:16:31 +08:00

experiments

Add elastic PS evaluation plan for production-realistic trace

2026-05-23 15:56:05 +08:00

patches

Add vLLM patches directory for version-controlled patch management

2026-05-22 00:26:14 +08:00

replayer

metrics: replace round-based percentile with linear interpolation (B5)
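
A linearly interpolated percentile can be sketched as below. The function
name percentile_linear is hypothetical and the replayer's actual
implementation is not shown on this page. The rank formula (n - 1) * q / 100
is the same convention NumPy uses for its default "linear" method.

```python
import math


def percentile_linear(values, q):
    """Return the q-th percentile (0..100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    data = sorted(values)
    # Fractional rank into the sorted data.
    rank = (len(data) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    # Blend the two neighbouring samples by the fractional part of the rank.
    fraction = rank - lower
    return data[lower] + (data[upper] - data[lower]) * fraction
```

Rounding the rank to the nearest index instead snaps the result to an
existing sample, which makes tail percentiles jump between neighbouring
values when the sample count is small. Interpolation avoids that.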