From 0958823cdbafa92ea7578cecf15bb2a4cb714870 Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Sat, 23 May 2026 20:58:38 +0800
Subject: [PATCH] =?UTF-8?q?REPORT:=20add=20=C2=A71.1=20errata=20flagging?=
 =?UTF-8?q?=20superseded=20sections=20(S3)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Calls out that §3.1 (old random sampler, time-scale compression,
1 req/GPU cap) and the early elastic v3 warm-vs-fresh runs are no
longer current, and that the "--max-inflight-sessions 64+" next-step
text refers to a flag that was removed and must be restored per
FIXES.md §B2 before those numbers can be reproduced. Points readers
at §3.6/§3.7 as authoritative.
---
 REPORT.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/REPORT.md b/REPORT.md
index 8c96df4..25745da 100644
--- a/REPORT.md
+++ b/REPORT.md
@@ -10,6 +10,26 @@
 For agentic LLM workloads (long input, short output, high KV cache reuse), is prefill-decode disaggregation beneficial?
 If full PD separation hurts (proven in §3), can **selective** disaggregation of only heavy requests improve serving latency while preserving KV cache locality?
 
+## 1.1 Errata / Superseded sections
+
+> This report has been revised several times as the methodology matured.
+> The sections below are kept for historical context but their numerical
+> conclusions have been **superseded** — do not cite them in isolation.
+>
+> - **§3.1 (initial PD-sep vs PD-combined)**: ran with the old random
+>   sampler + `--time-scale` compression + `--max-inflight-sessions 8`.
+>   Cross-session KV reuse dropped from 52% → 16%, and per-GPU concurrency
+>   was capped at 1 req/GPU. Superseded by §3.6.
+> - **Earlier "elastic v3" warm-vs-fresh runs**: baselines were not
+>   restarted between trials, leaving residual KV cache that inflated
+>   baseline TTFT ~2×. Superseded by the cold-start results in §3.6/§3.7.
+> - **Any reference to running `--max-inflight-sessions 64+`**: that flag
+>   was removed when replay moved to trace-driven dispatch. The next-step
+>   experiment requires restoring the flag first (see `FIXES.md` §B2
+>   route A) before any production-concurrency numbers can be produced.
+>
+> The authoritative results are in **§3.6 and §3.7**.
+
 ## 2. Experimental Setup
 
 ### 2.1 Hardware