v5 dropped errors but pushed session-cap fallback to 46-51%. Before adding
v6 mitigations we need to attribute that capacity loss to one of:
(a) active sessions — real footprint
(b) idle-evictable sessions — LRU not aggressive enough
(c) prefill backup blocks / in-flight / fragmentation — release timing
Without this it's all guessing. Plumb a 1Hz poller into replay that hits
each P/D worker's /server_info, captures session_cache + memory_usage, and
writes a per-worker time-series JSONL to <run_dir>/d-pool-timeseries.jsonl.
Off by default (--pool-poll-interval-s 0); v5+profile sweep enables it at
1.0s. Per-tick HTTP cost is ~8 parallel /server_info calls — negligible
relative to the 50min run.
Analyzer (scripts/analysis/analyze_pool_timeseries.py) decomposes each D's
capacity into active_held / idle_evictable / other (= cap-held-avail, the
backup-blocks bucket) / free, and reports session residency churn across
workers as a starvation/thrashing signal.
Mock-tested poller end-to-end (cancellation clean, file flushed, sessions
captured); analyzer validated against synthetic timeseries.
Next: run scripts/sweep_tp1_v5_optD_profile.sh on hardware (~90min), then
analyze results to pick a v6 direction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>