Critical:
- cache_aware_proxy: _handle_pd_sep leaked p_inst.num_requests (never
decremented) and never managed d_inst.num_requests; fix media_type
from application/json to text/event-stream for SSE stream
High:
- b3_sweep/b3_isolated_policy/b3_analyze: replace hardcoded
/home/admin/cpfs/wjh/ ROOT with script-relative $(dirname "$0")/..
- b3_analyze: replace hardcoded 8-port WORKER_MAP with dynamic
generation from BASE_PORT and N_INSTANCES
Medium:
- analyze_breakdown: warn on stderr when records are skipped (was silent)
- deploy_vllm_patches: fail-fast on SSH/SCP errors instead of
continuing with empty VENV_SITE
- pyproject.toml: declare fastapi and uvicorn as runtime dependencies
- launch_elastic_p2p: kill EngineCore and proxy in trap handler to
prevent GPU memory leaks on exit
Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is
net negative under agentic workloads" paper section: plot scripts for C1
(workload chars), C6 (roofline), C7 (routing-vs-PD-sep lever), the C6/C7
PDFs already rendered, and a README mapping candidate claims to required
figures plus open re-run items.
Removes --enforce-eager from bench.sh and all active launch scripts so
cuda graphs are captured -- the prior methodology suppressed one of
PD-sep's structural advantages (D-node fixed-shape decode). Legacy
scripts under scripts/legacy/ are intentionally untouched as historical
records.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The replayer and proxy were building multi-turn prompts from trace tokens,
but the model generates different output tokens. Subsequent turns had wrong
prefix tokens, causing cache misses and invalid experimental measurements.
- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids
to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized
prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks),
track failed_recving_block_ids for error recovery
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>