REPORT: add §1.1 errata flagging superseded sections (S3)
Calls out that §3.1 (old random sampler, time-scale compression, 1 req/GPU cap) and the early elastic v3 warm-vs-fresh runs are no longer current. Also notes that the "--max-inflight-sessions 64+" next-step text refers to a flag that was removed; it must be restored per FIXES.md §B2 before those numbers can be produced. Points readers at §3.6/§3.7 as authoritative.
REPORT.md (20 lines changed)
@@ -10,6 +10,26 @@
For agentic LLM workloads (long input, short output, high KV cache reuse), is prefill-decode disaggregation beneficial? If full PD separation hurts (proven in §3), can **selective** disaggregation of only heavy requests improve serving latency while preserving KV cache locality?
## 1.1 Errata / Superseded sections
> This report has been revised several times as the methodology matured.
> The sections below are kept for historical context but their numerical
> conclusions have been **superseded**; do not cite them in isolation.
>
> - **§3.1 (initial PD-sep vs PD-combined)**: ran with the old random
>   sampler + `--time-scale` compression + `--max-inflight-sessions 8`.
>   Cross-session KV reuse dropped from 52% → 16%, and per-GPU concurrency
>   was capped at 1 req/GPU. Superseded by §3.6.
> - **Earlier "elastic v3" warm-vs-fresh runs**: baselines were not
>   restarted between trials, leaving residual KV cache that inflated
>   baseline TTFT ~2×. Superseded by the cold-start results in §3.6/§3.7.
> - **Any reference to running `--max-inflight-sessions 64+`**: that flag
>   was removed when replay moved to trace-driven dispatch. The next-step
>   experiment requires restoring the flag first (see `FIXES.md` §B2
>   route A) before any production-concurrency numbers can be produced.
>
> The authoritative results are in **§3.6 and §3.7**.
## 2. Experimental Setup
### 2.1 Hardware