Files
agentic-pd-hybrid/docs
kzlin 4978c0d0cd profile(kvc): rewrite v5+profile report after critic audit + P0/P1 instrument
Hostile audit of the original report flagged three load-bearing errors:

1. held_tokens semantic was inverted. session_held_tokens() at
   session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len)
   per slot, i.e. slot-private (NOT in radix tree). So "other = cap - held -
   avail" actually CONTAINS the radix-tree protected prefix cache (likely the
   single biggest component for shared agentic prefixes), not just running
   batch + in-flight as the original report claimed.

2. Admission-race causal hypothesis for the 415 EXP2+profile errors is
   contradicted by the data: 414/415 errors have kv_transfer_blocks > 0 — they
   passed admission and died downstream ("generate stream ended before
   producing any token", raised by the client when a 200 response had an empty
   stream).

3. Polling deconfound was too quickly dismissed. Mode counts shift ~1:1
   (session-cap-fb -356 / kvcache-centric +406), and /server_info is not a
   passive read — it dispatches into the scheduler main loop and iterates
   every session slot.

Plus: per-D error% confounded by sticky session affinity (only 18 unique
sessions cause 415 errors, decode-3 had 0 errors only because no high-error
session landed there); decile 10 "recovery" was an equal-time binning
artifact (24.5% under equal-count); v5 vs v5+profile time gap was 21h not
6h; p50/p90 latency comparison is N=1.

Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction
with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4).

Action items split into P0 (verify, must do first) and P1 (instrument):

P0 — scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2
(no polling, identical config to the original v5 run) to test whether the
9-error baseline result is reproducible. If 3 runs give ~9 errors and
profile gives 415, polling is the leading suspect. Currently running
in background.

P1 — scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only
"pool_breakdown" dict to /server_info covering: radix_evictable_tokens,
radix_protected_tokens, slot_private_held_tokens, session_slot_count,
running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens},
prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these,
"unaccounted = cap - sum(known)" exposes true leakage. replay.py captures
all fields into the per-tick row; analyzer prints the decomposition and
gracefully handles old timeseries (prints "P1 instrument absent").

Mock-tested end-to-end. SGLang patch is read-only and does not affect
admission/scheduling. Old v5+profile data still analyzes correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:29:21 +08:00
..