Three-axis controlled ablation of PD-colo vs PD-disagg on synthetic regular
traces (closed-loop, controlled reuse via REPLAY_NO_REALIZED_PREFIX) on the
clean stack (e13391e gated off).
Axis 1 (Fig 1) -- reuse 6%->94% at N=8, in8192/out256
Axis 2 (Fig 2) -- shape in2048/out2048 -> in32768/out64 at N=8, reuse~70%
Axis 3 (Fig 3) -- concurrency N=8/16/32/64 at reuse~71%, in8192/out256
Findings:
* APC parity colo=PD at every reuse (5.5/22/44/66/77/82%) -- contamination
fix validated.
* PD edge erodes 1.57x->1.10x with reuse; prefill GPUs strand 26%->9%.
* Shape: PD-best peaks mid-sweep (1.34x at in8192/out512); wrong PD ratio
catastrophic at prefill extreme (in32768/out64 pd2 = 378/400, p99 432s).
* Concurrency: PD wins N<=32 (1.23-1.29x), TIPS at N=64 -- pd2/pd4
crater (APC 71%->1.4%, TPS -30%) while colo scales cleanly.
Infrastructure:
* replayer: --max-inflight-sessions, --inter-turn-think, --no-realized-prefix
(env-defaulted via REPLAY_MAX_INFLIGHT, REPLAY_INTER_TURN_THINK_S,
REPLAY_NO_REALIZED_PREFIX).
* mb5_run.sh: writes bench_config.json + gpu_util.csv + run_window.json +
instance_apc.txt + metrics.jsonl for bench_report/fig_agg ingest.
* fig_agg.py: per-arm GPU role split + producer-side APC; --json mode.
* gpu_util_report.py: companion per-GPU util report from gpu_util.csv.
* partial_summary.py: stats from in-flight replay_metrics.jsonl
(works before metrics.summary.json exists).
Data: analysis/mb5_pd_ablation/fig{1,2,3}.json (24 + 20 + 16 rows).
Figures: figs/mb5_pd_ablation/fig{1_reuse,2_shape,3_concurrency}_axis.png.
agentic-kv
Serving agentic LLM workloads by keeping the KV working set in GPU HBM
(GPU-hit-first). Research outline: PAPER_OUTLINE.md.
Evidence + experiments: v2/.
⚠️ Benchmarking methodology — read this first
Replay agentic traces with --dispatch-mode thinktime, not the default tracets. It is the faithful, more realistic load, and the dispatch mode materially changes the performance you measure.
The replayer offers two ways to time each turn:
| mode | turn-k dispatched at | what it models |
|---|---|---|
| tracets (default) | max(prev_turn_finished, trace_ts) | absolute production schedule |
| thinktime (use this) | prev_turn_finished + time_to_parent_chat | real closed-loop agent pacing |
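
The two rules in the table can be written as one small function. This is a minimal sketch, not the replayer's actual code: the function name `dispatch_time`, its signature, and the use of seconds on a single replay clock are assumptions made for illustration. The fallback to tracets when the field is missing follows the behavior described in this README.

```python
from typing import Optional


def dispatch_time(
    mode: str,
    prev_turn_finished: float,
    trace_ts: float,
    time_to_parent_chat: Optional[float] = None,
) -> float:
    """Return when turn k (k > 1) is sent, in seconds on the replay clock.

    Hypothetical helper: name, signature and units are illustrative only.
    """
    if mode == "thinktime" and time_to_parent_chat is not None:
        # Closed-loop pacing: keep the recorded gap after the parent turn ends.
        return prev_turn_finished + time_to_parent_chat
    # tracets, or a trace without the field: follow the absolute schedule,
    # but never dispatch before the previous turn has finished.
    return max(prev_turn_finished, trace_ts)


# The system is behind: the previous turn finished at t=50 s,
# but the trace scheduled this turn at t=30 s with a 12 s recorded gap.
print(dispatch_time("tracets", 50.0, 30.0, 12.0))    # 50.0 (fires immediately)
print(dispatch_time("thinktime", 50.0, 30.0, 12.0))  # 62.0 (gap preserved)
```

When the system keeps up (prev_turn_finished is earlier than trace_ts), tracets simply waits for the trace timestamp, so the two modes only diverge once the replay falls behind.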
Why it matters. tracets collapses the inter-turn think-time to ~0 whenever
the system falls behind (it fires the next turn immediately because the trace
timestamp is already in the past). That manufactures artificial request
bursts — spiking instantaneous concurrency → KV-pool pressure → preemption →
inflated tail latency and wasted throughput. thinktime keeps each turn's real
gap (tool-exec + agent think), so the offered load is what a real agent produces.
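
To make the collapse concrete, here is a toy single-session simulation. It is a sketch under stated assumptions: the function `replay_gaps`, the tuple layout, and the `slowdown` factor (how much slower the replayed system is than production) are all hypothetical and exist only to illustrate the effect described above.

```python
def replay_gaps(turns, mode, slowdown):
    """Return the idle gap the system sees before each turn after the first.

    turns: list of (trace_ts, service_time, think_time) in seconds, as
           recorded in production (think_time is unused for turn 1).
    slowdown: factor by which the replayed system is slower than production.
    """
    gaps = []
    finished = None
    for trace_ts, service, think in turns:
        if finished is None:
            start = trace_ts  # turn 1 fires at its trace arrival
        elif mode == "thinktime":
            start = finished + think  # keep the recorded gap
        else:  # tracets
            start = max(finished, trace_ts)  # absolute schedule
        if finished is not None:
            gaps.append(start - finished)
        finished = start + service * slowdown
    return gaps


# Production: each turn takes 10 s, followed by a 5 s tool/think gap.
session = [(0.0, 10.0, 0.0), (15.0, 10.0, 5.0), (30.0, 10.0, 5.0)]

print(replay_gaps(session, "tracets", 1.0))    # [5.0, 5.0]
print(replay_gaps(session, "tracets", 2.0))    # [0.0, 0.0]  gap collapsed
print(replay_gaps(session, "thinktime", 2.0))  # [5.0, 5.0]  gap preserved
```

With many sessions in flight, the zero-gap case means every lagging session re-submits the instant its previous turn returns, which is the burst mechanism described above.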
Measured (w600 first-300s window, 8×H20, round-robin, 100% completion):
| metric (N=8) | tracets (Mode 1) | thinktime (Mode 2) | Δ |
|---|---|---|---|
| E2E p90 | 102.8 s | 73.5 s | −28% |
| E2E p99 | 245 s | 227 s | −7% |
| TTFT p90 | 56.1 s | 39.7 s | −29% |
| system TPS | 111.8 | 119.3 | +7% |
| wall-clock | 967 s | 787 s | −19% |
| TPOT p90 | 0.174 s | 0.188 s | ~flat |
So under realistic capacity (N=8), tracets overstates p90 latency by roughly 40%
relative to thinktime (102.8 s vs 73.5 s E2E; equivalently, thinktime is ~28%
lower), while p99 shifts only ~7%. Tell-tale: scaling 6→8 instances barely helped
tracets (975→967 s — its bursts re-saturate regardless of capacity) but helped
thinktime a lot (1125→787 s). Under heavy saturation (N=6) the two converge
(E2E p90 ≈ 118–120 s), since there is no slack for bursts to harm. Decode (TPOT)
is dispatch-independent everywhere.
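
The Δ column of the N=8 table can be recomputed from the two measured columns. The snippet below is only an arithmetic check: the values are copied from the table, and the helper name `pct_change` is an illustrative choice, not part of the repo.

```python
def pct_change(base: float, new: float) -> float:
    """Relative change from base to new, in percent."""
    return (new - base) / base * 100.0


# (tracets, thinktime) pairs copied from the N=8 table.
measured = {
    "E2E p90 (s)": (102.8, 73.5),
    "E2E p99 (s)": (245.0, 227.0),
    "TTFT p90 (s)": (56.1, 39.7),
    "system TPS": (111.8, 119.3),
    "wall-clock (s)": (967.0, 787.0),
    "TPOT p90 (s)": (0.174, 0.188),
}

for name, (tracets, thinktime) in measured.items():
    print(f"{name:<15} {pct_change(tracets, thinktime):+6.1f}%")
```

Swapping the arguments gives the same comparison from the other side: `pct_change(73.5, 102.8)` is about +40%, meaning tracets reports an E2E p90 roughly 40% higher than thinktime.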
Recommendation: benchmark with --dispatch-mode thinktime; use tracets
only as an explicit bursty stress case. Full ablation:
v2/exp_c_dispatch_ablation/.
How to use it
```bash
# 1. annotate a trace with the real per-turn gap (one-time; scans the raw trace)
python scripts/add_time_to_parent.py traces/w600_r0.0015_st30.jsonl traces/w600_ttp.jsonl

# 2. replay closed-loop with faithful think-time
python -m replayer --trace traces/w600_ttp.jsonl --endpoint <eps> \
    --model <model> --dispatch-mode thinktime
```
time_to_parent_chat = this_turn.request_ready_time_ms − parent_turn.request_end_time_ms,
computed from the raw trace and stored per request; turn-1 has none (fires at its
trace arrival). Traces without the field fall back to tracets.
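
The formula above can be sketched as follows. This is not the code of scripts/add_time_to_parent.py: the `id` and `parent_id` keys used to link a turn to its parent are hypothetical, and keeping the result in milliseconds is an assumption. Only `request_ready_time_ms`, `request_end_time_ms`, and `time_to_parent_chat` are names taken from this README.

```python
def add_time_to_parent(turns):
    """Annotate each turn dict in place with time_to_parent_chat (ms).

    Assumes hypothetical "id" / "parent_id" keys link a turn to its parent.
    """
    by_id = {t["id"]: t for t in turns}
    for t in turns:
        parent = by_id.get(t.get("parent_id"))
        if parent is None:
            continue  # turn 1 has no parent, so it gets no field
        t["time_to_parent_chat"] = (
            t["request_ready_time_ms"] - parent["request_end_time_ms"]
        )
    return turns


turns = [
    {"id": "a", "parent_id": None,
     "request_ready_time_ms": 0, "request_end_time_ms": 4000},
    {"id": "b", "parent_id": "a",
     "request_ready_time_ms": 9000, "request_end_time_ms": 12000},
]
add_time_to_parent(turns)
print(turns[1]["time_to_parent_chat"])  # 5000
```

Leaving the field absent on turn 1 (rather than writing 0) keeps "no parent" distinguishable from "zero gap", which matches the rule that turn 1 fires at its trace arrival.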
Project map
* PAPER_OUTLINE.md: GPU-hit-first paper outline (the thesis).
* v2/: evidence experiments
  * exp_a_tier_latency/: KV-hit cost by tier (GPU < CPU-local < remote-RDMA < miss).
  * exp_b_capacity_knee/: realized APC / latency knee vs GPU capacity.
  * exp_c_dispatch_ablation/: the replay-mode study above.
* replayer/: trace replayer (--dispatch-mode, closed-loop think-time).
* scripts/add_time_to_parent.py: trace annotation for thinktime.
* microbench/, analysis/: PD-disagg, routing, workload characterization.