diff --git a/REPORT.md b/REPORT.md
index 8c96df4..25745da 100644
--- a/REPORT.md
+++ b/REPORT.md
@@ -10,6 +10,26 @@
 For agentic LLM workloads (long input, short output, high KV cache reuse), is prefill-decode disaggregation beneficial?
 If full PD separation hurts (proven in §3), can **selective** disaggregation of only heavy requests improve serving latency while preserving KV cache locality?
 
+## 1.1 Errata / Superseded sections
+
+> This report has been revised several times as the methodology matured.
+> The sections below are kept for historical context but their numerical
+> conclusions have been **superseded** — do not cite them in isolation.
+>
+> - **§3.1 (initial PD-sep vs PD-combined)**: ran with the old random
+>   sampler + `--time-scale` compression + `--max-inflight-sessions 8`.
+>   Cross-session KV reuse dropped from 52% → 16%, and per-GPU concurrency
+>   was capped at 1 req/GPU. Superseded by §3.6.
+> - **Earlier "elastic v3" warm-vs-fresh runs**: baselines were not
+>   restarted between trials, leaving residual KV cache that inflated
+>   baseline TTFT ~2×. Superseded by the cold-start results in §3.6/§3.7.
+> - **Any reference to running `--max-inflight-sessions 64+`**: that flag
+>   was removed when replay moved to trace-driven dispatch. The next-step
+>   experiment requires restoring the flag first (see `FIXES.md` §B2
+>   route A) before any production-concurrency numbers can be produced.
+>
+> The authoritative results are in **§3.6 and §3.7**.
+
 ## 2. Experimental Setup
 
 ### 2.1 Hardware