Gahow Wang fafc44da79 MB5 PD reuse-centric ablation: tooling, data, Fig 1-3

Three-axis controlled ablation of PD-colo vs PD-disagg on synthetic regular
traces (closed-loop, controlled reuse via REPLAY_NO_REALIZED_PREFIX) on the
clean stack (e13391e gated off).

  Axis 1 (Fig 1) -- reuse 6%->94% at N=8, in8192/out256
  Axis 2 (Fig 2) -- shape in2048/out2048 -> in32768/out64 at N=8, reuse~70%
  Axis 3 (Fig 3) -- concurrency N=8/16/32/64 at reuse~71%, in8192/out256

Findings:
  * APC parity colo=PD at every reuse (5.5/22/44/66/77/82%) -- contamination
    fix validated.
  * PD edge erodes 1.57x->1.10x with reuse; prefill GPUs strand 26%->9%.
  * Shape: PD-best peaks mid-sweep (1.34x at in8192/out512); wrong PD ratio
    catastrophic at prefill extreme (in32768/out64 pd2 = 378/400, p99 432s).
  * Concurrency: PD wins N<=32 (1.23-1.29x), TIPS at N=64 -- pd2/pd4
    crater (APC 71%->1.4%, TPS -30%) while colo scales cleanly.

Infrastructure:
  * replayer: --max-inflight-sessions, --inter-turn-think, --no-realized-prefix
    (env-defaulted via REPLAY_MAX_INFLIGHT, REPLAY_INTER_TURN_THINK_S,
    REPLAY_NO_REALIZED_PREFIX).
  * mb5_run.sh: writes bench_config.json + gpu_util.csv + run_window.json +
    instance_apc.txt + metrics.jsonl for bench_report/fig_agg ingest.
  * fig_agg.py: per-arm GPU role split + producer-side APC; --json mode.
  * gpu_util_report.py: companion per-GPU util report from gpu_util.csv.
  * partial_summary.py: stats from in-flight replay_metrics.jsonl
    (works before metrics.summary.json exists).

Data: analysis/mb5_pd_ablation/fig{1,2,3}.json (24 + 20 + 16 rows).
Figures: figs/mb5_pd_ablation/fig{1_reuse,2_shape,3_concurrency}_axis.png.

2026-05-31 20:14:46 +08:00

analysis

MB5 PD reuse-centric ablation: tooling, data, Fig 1-3

2026-05-31 20:14:46 +08:00

docs

Docs: reconcile routing docs with current hybrid direction

2026-05-25 10:47:14 +08:00

experiments

Add elastic PS evaluation plan for production-realistic trace

2026-05-23 15:56:05 +08:00

figs

MB5 PD reuse-centric ablation: tooling, data, Fig 1-3

2026-05-31 20:14:46 +08:00

microbench

MB5 PD reuse-centric ablation: tooling, data, Fig 1-3

2026-05-31 20:14:46 +08:00

patches

Add vLLM patches directory for version-controlled patch management

2026-05-22 00:26:14 +08:00

replayer

MB5 PD reuse-centric ablation: tooling, data, Fig 1-3

2026-05-31 20:14:46 +08:00

scripts

bench harness: env-tunable vLLM health timeout + both-modes 5-policy driver

2026-05-30 20:59:02 +08:00

tests

unified_v2.1: relax gates + add unified_kv_both isolation control

2026-05-26 10:40:57 +08:00

third_party/vllm

Gate evict_sent_blocks behind VLLM_EVICT_SENT_BLOCKS

2026-05-29 18:18:59 +08:00

traces

trace: time_to_parent_chat annotation + thinktime trace variants

2026-05-30 20:58:49 +08:00

v2

v2 exp(d): expand figure to 6 panels (TTFT/E2E mean+p90, TPS, per-worker GPU util)

2026-05-30 21:10:27 +08:00

.gitignore

trace: time_to_parent_chat annotation + thinktime trace variants

2026-05-30 20:58:49 +08:00

FIXES.md

Add FIXES.md with prioritized repo cleanup checklist

2026-05-23 20:35:56 +08:00

MEETING.md

§2.3 reframe: dispatch coupling is regime-dependent, not binary chatbot/agentic

2026-05-27 16:51:38 +08:00

PAPER_OUTLINE.md

docs: reframe PAPER_OUTLINE to GPU-hit-first + embed v2 figures

2026-05-30 13:34:19 +08:00

pyproject.toml

Fix review bugs: PD-sep counter leaks, hardcoded paths, missing deps

2026-05-26 15:54:55 +08:00

README.md

Replayer think-time dispatch mode + benchmarking guidance

2026-05-30 16:28:36 +08:00

REPORT.md

Docs: reconcile routing docs with current hybrid direction

2026-05-25 10:47:14 +08:00

RESULTS_SUMMARY.md

PD-disagg docs: annotated corrections for e13391e contamination

2026-05-31 20:14:14 +08:00

TODO.md

LMetric routing policy (OSDI'26) + A/B results vs linear baseline

2026-05-22 16:57:32 +08:00

uv.lock

Fix review P2s: lockfile, model path convention, trap robustness

2026-05-26 16:05:43 +08:00

README.md

agentic-kv

Serving agentic LLM workloads by keeping the KV working set in GPU HBM (GPU-hit-first). Research outline: PAPER_OUTLINE.md. Evidence + experiments: v2/.

⚠️ Benchmarking methodology — read this first

Replay agentic traces with --dispatch-mode thinktime, not the default tracets. It is the faithful, more realistic load — and the dispatch mode materially changes the performance you measure.

The replayer offers two ways to time each turn:

| mode | turn-k dispatched at | what it models |
|---|---|---|
| `tracets` (default) | `max(prev_turn_finished, trace_ts)` | absolute production schedule |
| `thinktime` (use this) | `prev_turn_finished + time_to_parent_chat` | real closed-loop agent pacing |

Why it matters. tracets collapses the inter-turn think-time to ~0 whenever the system falls behind (it fires the next turn immediately because the trace timestamp is already in the past). That manufactures artificial request bursts — spiking instantaneous concurrency → KV-pool pressure → preemption → inflated tail latency and wasted throughput. thinktime keeps each turn's real gap (tool-exec + agent think), so the offered load is what a real agent produces.
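The two dispatch rules can be sketched in a few lines. This is a minimal illustration built only from the formulas in the table, not the replayer's actual code: the function name `dispatch_time` and its signature are assumptions, and all times are taken to be in seconds.

```python
from typing import Optional


def dispatch_time(
    mode: str,
    prev_turn_finished: float,
    trace_ts: float,
    time_to_parent_chat: Optional[float],
) -> float:
    """Return when turn k is dispatched (hypothetical helper, times in seconds)."""
    if mode == "thinktime" and time_to_parent_chat is not None:
        # Closed-loop pacing: keep the real gap after the parent turn finishes.
        return prev_turn_finished + time_to_parent_chat
    # tracets, or the fallback for traces that lack time_to_parent_chat.
    return max(prev_turn_finished, trace_ts)


# The system has fallen behind: the previous turn finished at t=120 s,
# but the trace scheduled the next turn at t=50 s with a real gap of 8 s.
behind_tracets = dispatch_time("tracets", 120.0, 50.0, 8.0)
behind_thinktime = dispatch_time("thinktime", 120.0, 50.0, 8.0)
print(behind_tracets, behind_thinktime)
```

In the fallen-behind case `tracets` fires the turn immediately at the moment the previous turn finished, while `thinktime` still waits out the 8 s gap. That difference is the source of the artificial bursts described above.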

Measured (w600 first-300s window, 8×H20, round-robin, 100% completion):

| metric (N=8) | `tracets` (Mode 1) | `thinktime` (Mode 2) | Δ |
|---|---|---|---|
| E2E p90 | 102.8 s | 73.5 s | −28% |
| E2E p99 | 245 s | 227 s | −7% |
| TTFT p90 | 56.1 s | 39.7 s | −29% |
| system TPS | 111.8 | 119.3 | +7% |
| wall-clock | 967 s | 787 s | −19% |
| TPOT p90 | 0.174 s | 0.188 s | ~flat |

So at realistic capacity (N=8), tracets makes tail latency look markedly worse than it is: with thinktime, E2E p90 and TTFT p90 are 28–29% lower (equivalently, tracets inflates them by about 40%). Tell-tale: scaling from 6 to 8 instances barely helped tracets (wall-clock 975→967 s; its bursts re-saturate regardless of capacity) but helped thinktime a lot (1125→787 s). Under heavy saturation (N=6) the two converge (E2E p90 ≈ 118–120 s), since there is no slack for bursts to harm. Decode (TPOT) is dispatch-independent everywhere.
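As a sanity check on the Δ column, the relative changes can be recomputed from the N=8 numbers reported above. This is a small worked-arithmetic sketch; the dictionary layout and the helper `pct_change` are illustrative assumptions, and only the measured values come from the table.

```python
# (tracets, thinktime) pairs copied from the N=8 table.
measured = {
    "E2E p90 (s)": (102.8, 73.5),
    "E2E p99 (s)": (245.0, 227.0),
    "TTFT p90 (s)": (56.1, 39.7),
    "system TPS": (111.8, 119.3),
    "wall-clock (s)": (967.0, 787.0),
    "TPOT p90 (s)": (0.174, 0.188),
}


def pct_change(base: float, new: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


# Change of thinktime relative to tracets (the table's delta column).
deltas = {name: pct_change(t, k) for name, (t, k) in measured.items()}

# The same gap seen from the other side: how much tracets inflates E2E p90.
tracets_inflation_p90 = pct_change(73.5, 102.8)

for name, delta in deltas.items():
    print(f"{name:>15}: {delta:+.1f}%")
print(f"tracets inflates E2E p90 by {tracets_inflation_p90:.1f}%")
```

The reduction from tracets to thinktime and the inflation from thinktime to tracets use different baselines, so a drop just under 30% corresponds to an inflation of roughly 40%.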

Recommendation: benchmark with --dispatch-mode thinktime; use tracets only as an explicit bursty stress case. Full ablation: v2/exp_c_dispatch_ablation/.

How to use it

# 1. annotate a trace with the real per-turn gap (one-time; scans the raw trace)
python scripts/add_time_to_parent.py traces/w600_r0.0015_st30.jsonl traces/w600_ttp.jsonl

# 2. replay closed-loop with faithful think-time
python -m replayer --trace traces/w600_ttp.jsonl --endpoint <eps> \
    --model <model> --dispatch-mode thinktime

time_to_parent_chat = this_turn.request_ready_time_ms − parent_turn.request_end_time_ms, computed from the raw trace and stored per request; turn-1 has none (fires at its trace arrival). Traces without the field fall back to tracets.
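The annotation step can be pictured as follows. This is a sketch under stated assumptions, not the logic of `scripts/add_time_to_parent.py`: the keys `turn_id` and `parent_turn_id` are hypothetical, and the gap is stored in milliseconds here because the source fields are in milliseconds (the unit the real script stores is not stated). Only `request_ready_time_ms`, `request_end_time_ms`, and `time_to_parent_chat` are names taken from the text above.

```python
def annotate_time_to_parent(turns: list[dict]) -> list[dict]:
    """Add time_to_parent_chat to every turn that has a parent in `turns`.

    `turn_id` and `parent_turn_id` are hypothetical keys used for illustration.
    """
    by_id = {t["turn_id"]: t for t in turns}
    annotated = []
    for turn in turns:
        record = dict(turn)  # copy so the input records are left untouched
        parent = by_id.get(turn.get("parent_turn_id"))
        if parent is not None:
            record["time_to_parent_chat"] = (
                turn["request_ready_time_ms"] - parent["request_end_time_ms"]
            )
        # Turn 1 has no parent, so it gets no field and fires at trace arrival.
        annotated.append(record)
    return annotated


session = [
    {"turn_id": "a", "parent_turn_id": None,
     "request_ready_time_ms": 1_000, "request_end_time_ms": 4_000},
    {"turn_id": "b", "parent_turn_id": "a",
     "request_ready_time_ms": 9_500, "request_end_time_ms": 12_000},
]
result = annotate_time_to_parent(session)
print(result[1]["time_to_parent_chat"])
```

The sketch does not clamp negative gaps or handle parents that fall outside the scanned window; whether the real script does either is not specified here.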

Project map

PAPER_OUTLINE.md — GPU-hit-first paper outline (the thesis).
v2/ — evidence experiments:
- exp_a_tier_latency/ — KV-hit cost by tier (GPU < CPU-local < remote-RDMA < miss).
- exp_b_capacity_knee/ — realized APC / latency knee vs GPU capacity.
- exp_c_dispatch_ablation/ — the replay-mode study above.
replayer/ — trace replayer (--dispatch-mode, closed-loop think-time).
scripts/add_time_to_parent.py — trace annotation for thinktime.
microbench/, analysis/ — PD-disagg, routing, workload characterization.
