gahow/agentic-kvc

Commit Graph

Author

SHA1

Message

Date

Gahow Wang

5b26c345f4

P2: all routing policies read real state via eff_ accessors + ablation harness

InstanceState.eff_{num_requests,pending_prefill,ongoing_decode,ongoing_tokens}
= max(shadow, real) when feed fresh (fixes 30s-stale under-count, keeps
in-flight RaceFix), plus real-only r_max_prefill_remaining / r_kv_used_frac.
Wired into load_only, lmetric, sticky, unified(_kv_both), unified_v3, and
snapshot logging. Feed off => identical to before. run_v3_trace.sh gains ES=1
toggle (always deploys enhanced proxy); run_ablation_es.sh runs each config
ES0-vs-ES1 to test whether real state changes policy performance/ranking.
All unit-tested without GPU.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 20:21:12 +08:00
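
The eff_ rule described in 5b26c345f4 can be sketched as follows. This is a minimal illustration, not the repository's code: the RealState class, the feed_fresh helper, the method-style accessors, and the 1-second freshness window are all assumptions made for the example. Only the rule itself (max(shadow, real) when the feed is fresh, shadow alone otherwise) comes from the commit message.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Assumed freshness window; the commit message does not state the real value.
FRESH_WINDOW_S = 1.0


@dataclass
class RealState:
    """Hypothetical container for one sample published by the engine."""
    num_requests: int
    pending_prefill: int
    ts: float  # time the sample was published


@dataclass
class InstanceState:
    # Shadow counters maintained by the router itself.
    num_requests: int = 0
    pending_prefill: int = 0
    real: Optional[RealState] = None  # None means the feed is off

    def feed_fresh(self, now: Optional[float] = None) -> bool:
        """True when a real sample exists and is recent enough to trust."""
        if self.real is None:
            return False
        now = time.monotonic() if now is None else now
        return (now - self.real.ts) <= FRESH_WINDOW_S

    def _eff(self, field: str, now: Optional[float]) -> int:
        shadow = getattr(self, field)
        if not self.feed_fresh(now):
            return shadow  # feed off or stale: same value as before
        # max() keeps in-flight requests the engine has not yet seen
        # (shadow > real) and fixes stale under-counts (real > shadow).
        return max(shadow, getattr(self.real, field))

    def eff_num_requests(self, now: Optional[float] = None) -> int:
        return self._eff("num_requests", now)

    def eff_pending_prefill(self, now: Optional[float] = None) -> int:
        return self._eff("pending_prefill", now)
```

Taking the maximum rather than replacing the shadow value means a request the router has dispatched but the engine has not yet reported is still counted.
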

Gahow Wang

be948d32b8

P2: real engine-state feed replaces stale shadow counters for migration targeting

vLLM scheduler publishes real state (running/waiting, KV free, and the
max-in-progress-prefill signal /metrics lacks) to a tmpfs/redis store ~20Hz;
router reads it and avoids GIL-stall (mid-large-prefill) + KV-capacity-wall
targets, using real load over 30s-stale shadow counters. Components:
engine_state.py (canonical+reader), instrument_engine_state.py (scheduler
patch, file/redis writer), migration_target.py (scorer), proxy wiring
(--engine-state-uri, off=unchanged). All unit-tested without GPU; not yet
run live. See P2_ENGINE_STATE.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 20:01:26 +08:00
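
The publish/read path described in be948d32b8 might look roughly like the sketch below for the file-backed (tmpfs) case. It is an illustration under stated assumptions, not the contents of engine_state.py or migration_target.py: the function names, the JSON layout, the 1-second maximum age, and both thresholds are hypothetical. The field names max_prefill_remaining and kv_used_frac are borrowed from the r_ names in the later commit.

```python
import json
import os
import tempfile
import time
from typing import Optional


def publish_state(path: str, state: dict) -> None:
    """Writer side: replace the state file atomically so that a reader
    never observes a half-written sample."""
    payload = dict(state, ts=time.time())
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def read_state(path: str, max_age_s: float = 1.0,
               now: Optional[float] = None) -> Optional[dict]:
    """Reader side: return the sample, or None if missing, corrupt, or stale."""
    try:
        with open(path) as f:
            payload = json.load(f)
    except (OSError, ValueError):
        return None
    now = time.time() if now is None else now
    if now - payload.get("ts", 0.0) > max_age_s:
        return None
    return payload


def is_safe_target(state: Optional[dict],
                   max_prefill_tokens: int = 4096,     # assumed threshold
                   max_kv_used_frac: float = 0.9) -> bool:  # assumed threshold
    """Reject instances that are mid-large-prefill or near the KV-capacity wall."""
    if state is None:
        return True  # no usable feed: do not change existing behavior
    if state["max_prefill_remaining"] > max_prefill_tokens:
        return False
    if state["kv_used_frac"] > max_kv_used_frac:
        return False
    return True
```

Writing to a temporary file and renaming it is one common way to let a high-frequency writer and a polling reader share a file without locks. The commit message does not say which mechanism the actual writer uses.
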

2 Commits