agentic-kvc

Author SHA1 Message Date


Gahow Wang

da39ab6804

Correct PD-disagg cost/benefit framing across repo

The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong:
  I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
  of the new request (~50–200 ms)" — implicitly treating the benefit as
  per-request and bounded by that request's own decode. The correct
  accounting is per-prefill-event across all stalled streams:

      benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                          ≈ D × T_prefill

  which follows from the chunked-prefill math (with prompt length L and
  chunk size N, each of the L/N chunks slows D ongoing decode steps from
  ~10 ms to the chunk time t, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
   2k tok      | 0.14 s    |     8 ms   |   1.1 s     |    0.7 %
  33k tok      | 4.5  s    |  320 ms    |  36   s     |    0.9 %
 125k tok      | 57   s    |  1.9 s     | 456   s     |    0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.
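
A minimal sketch that re-derives the table above from the formula. The
T_prefill and T_transfer values are the ones quoted in this message. The
function and variable names are illustrative and are not taken from the
repo's scripts.

```python
D = 8  # number of concurrent decode streams (the D=8 case)

# (label, T_prefill in seconds, T_transfer in seconds), as quoted above
ROWS = [
    ("2k tok", 0.14, 0.008),
    ("33k tok", 4.5, 0.320),
    ("125k tok", 57.0, 1.9),
]


def benefit_per_prefill(d, t_prefill_s, tpot_baseline_s=None, tpot_during_s=None):
    """Aggregate decode stall (seconds) avoided per prefill event.

    With both TPOT values supplied this is the full expression
    D * T_prefill * (1 - TPOT_baseline / TPOT_during); otherwise it is
    the D * T_prefill approximation used in the table.
    """
    if tpot_baseline_s is None or tpot_during_s is None:
        return d * t_prefill_s
    return d * t_prefill_s * (1.0 - tpot_baseline_s / tpot_during_s)


def cost_over_benefit(t_transfer_s, benefit_s):
    """KV-transfer cost as a fraction of the phase-isolation benefit."""
    return t_transfer_s / benefit_s


for label, t_prefill, t_transfer in ROWS:
    benefit = benefit_per_prefill(D, t_prefill)
    ratio = cost_over_benefit(t_transfer, benefit)
    print(f"{label:>9} | benefit {benefit:7.2f} s | cost/benefit {ratio:.1%}")
```

Under the approximation, every row comes out below 1 %, which is where the
100×–250× range comes from.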

The actual dominant reason static PD-disagg fails in agentic workloads is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.
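
A rough sketch of the capacity arithmetic behind the paragraph above. The
11.5 GiB and 38 GiB figures come from this message. Comparing 4P+4D with 8
colocated instances that each have the same 38 GiB pool is an assumption
made for illustration, not a measured configuration.

```python
import math

P99_REQUEST_KV_GIB = 11.5  # p99 single-request KV size, from the message
D_POOL_GIB = 38.0          # per-D-instance KV pool, from the message


def p99_requests_per_pool(pool_gib=D_POOL_GIB, req_gib=P99_REQUEST_KV_GIB):
    """How many p99-sized requests fit in one decode instance's pool."""
    return math.floor(pool_gib / req_gib)


def decode_pool_gib(num_decode_instances, pool_gib=D_POOL_GIB):
    """Total KV pool available to decoding across instances."""
    return num_decode_instances * pool_gib


# Assumption: 8 colocated instances, each with the same pool size.
colocated_gib = decode_pool_gib(8)
# 4P+4D: only the 4 D instances hold decode-side KV.
pd_split_gib = decode_pool_gib(4)

print(p99_requests_per_pool())       # p99-sized requests per D instance
print(pd_split_gib / colocated_gib)  # fraction of decode KV capacity kept
```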

Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
  function; keep mb1_interference.png and update its title to note
  per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
  no more "max benefit = decode duration" claim); §3.2 implications
  section replaced with the corrected per-prefill-event table; explicit
  ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
  rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
  capacity argument (the real failure mode), MB1/MB2 demoted from
  "kill-shot for PD-disagg" to "supporting context inputs to a
  cost-benefit table that actually favors PD-disagg on this axis";
  §6 paper-claims list reordered to remove the wrong "PD-disagg loses
  on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 22:04:49 +08:00

Gahow Wang

029821c1b6

MB1: prefill-decode interference under chunked-prefill default; §3.2 headline

Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.

Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
  = prefill_ttft / (num_tokens_during_prefill / D). This is the average
  time each decode stream takes to produce one token during the burst.
  p50 of inter-token intervals is deceptive (chunked-prefill makes most
  intervals look normal); the burst-average gives the true cost.
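
The headline metric above can be sketched as follows. This is not the code
in driver.py or analyze_mb1.py; the function name, argument layout, and
the half-open window convention are assumptions made for illustration.

```python
def effective_tpot_during(streams, prefill_start, prefill_end):
    """Burst-average per-stream TPOT (seconds per token).

    streams: one list of token-arrival timestamps (seconds) per decode
             stream, so D = len(streams).
    prefill_start / prefill_end: the window of the injected prefill
             request; its length plays the role of prefill_ttft.
    """
    d = len(streams)
    prefill_ttft = prefill_end - prefill_start
    # Count decode tokens, across all streams, that arrived in the window.
    tokens_during = sum(
        1 for ts in streams for t in ts if prefill_start <= t < prefill_end
    )
    if tokens_during == 0:
        return float("inf")  # decode was fully stalled for the whole burst
    return prefill_ttft / (tokens_during / d)


# Two streams, a 1-second prefill window, three tokens inside it in total.
example = effective_tpot_during(
    [[0.5, 1.25, 1.75, 2.5], [0.6, 1.5, 2.4]],
    prefill_start=1.0,
    prefill_end=2.0,
)
print(example)
```

Unlike the median inter-token interval, this average counts the long gaps
in full, which is why it is used as the headline number.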

Results (D=8 row, the most agentic-realistic case):
  P (tokens) | prefill_ttft | per-stream TPOT during | penalty
       2048  |    143 ms    |      32 ms             |    4×
       8192  |    583 ms    |     114 ms             |   15×
      32768  |  4520 ms     |     388 ms             |   52×
      65536  | 15615 ms     |     757 ms             |   99×
     131072  | 56991 ms     |    1419 ms             |  183×

Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.
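
A small sketch of how the penalty column relates to the baseline. The
TPOT values are copied from the D=8 results above. The baseline is the
quoted "~7.7 ms", so the recomputed ratios can differ from the table by a
few percent. The names here are illustrative.

```python
BASELINE_TPOT_MS = 7.7  # approximate D=8 baseline, as quoted above

# P (tokens) -> per-stream TPOT during the prefill burst (ms), D=8 results
TPOT_DURING_MS = {
    2048: 32,
    8192: 114,
    32768: 388,
    65536: 757,
    131072: 1419,
}


def penalty(tpot_during_ms, baseline_ms=BASELINE_TPOT_MS):
    """Slowdown factor of an ongoing decode stream during the burst."""
    return tpot_during_ms / baseline_ms


for p_tokens, tpot in TPOT_DURING_MS.items():
    print(f"{p_tokens:>7} tokens: {penalty(tpot):6.1f}x slower")
```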

§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).

The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.

Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
  kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
  microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
  per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
  pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)

Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
  interleave decode more aggressively. Chunk-size sensitivity is
  flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 21:25:09 +08:00

2 Commits