
Gahow Wang f784e49c07 Microbench: prefill-decode interference + PD transfer lifecycle

Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.

2026-05-26 00:57:06 +08:00


Microbenchmark TODO

Overview

Two microbenchmarks to establish the quantitative foundation for the elastic migration decision:

1. Interference Microbench: quantify TPOT degradation from prefill-decode co-execution
2. Transfer Lifecycle Microbench: profile the full PD-sep request lifecycle, especially RDMA transfer cost

Together they answer: "For a given request in a given runtime state, is offload cheaper than co-execution?"

Microbench 1: Prefill-Decode Interference

Design doc: microbench/interference_microbench_design.md

Produces: f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms

Tasks

| # | Task | Status | Notes |
|---|------|--------|-------|
| 1.1 | Implement microbench driver (async streaming client) | TODO | Python, httpx + asyncio SSE |
| 1.2 | Implement steady-state detector (32 tokens, variance check) | TODO | |
| 1.3 | Implement prefill injection + timestamp collection | TODO | |
| 1.4 | Validate on single config (D=4, P=8192, chunk=8192) | TODO | Sanity check: penalty > 0 |
| 1.5 | Run reduced sweep (chunk=8192, 7×7=49 configs, 5 reps) | TODO | ~2.5h on 1×H20 |
| 1.6 | Fit interference cost model | TODO | Linear/polynomial regression |
| 1.7 | Generate heatmap + break-even plot | TODO | |
| 1.8 | (Optional) Full sweep with 4 chunk sizes | TODO | ~10h |
| 1.9 | (Optional) Ablation: `--enforce-eager` vs CUDA graphs | TODO | |
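
Task 1.2 only specifies "32 tokens, variance check", so the following is a minimal sketch of one way the steady-state detector could work. It assumes the driver records inter-token latencies in milliseconds; the function name `is_steady_state`, the coefficient-of-variation criterion, and the 0.10 threshold are illustrative assumptions, not values from the design doc.

```python
from statistics import mean, pstdev


def is_steady_state(inter_token_ms, window=32, max_cv=0.10):
    """Return True when the last `window` inter-token latencies look stable.

    Stability is judged by the coefficient of variation (stddev / mean).
    Both the criterion and the 0.10 threshold are assumptions for this sketch.
    """
    if len(inter_token_ms) < window:
        return False  # not enough decoded tokens observed yet
    recent = inter_token_ms[-window:]
    avg = mean(recent)
    if avg <= 0:
        return False  # guard against degenerate timestamps
    return pstdev(recent) / avg <= max_cv
```

The driver would call this after each streamed token and inject the prefill only once it returns True, so that the pre-injection TPOT baseline is not polluted by warm-up.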

Dependencies

Single H20 GPU with Qwen3-Coder-30B-A3B loaded
No vLLM source modifications needed (pure client-side measurement)
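
Because the measurement is purely client-side, every injected prefill has to miss the server's prefix cache, otherwise repeated reps measure a cache hit instead of a cold prefill. The commit note describes fixing this with uuid+time_ns unique prompts; a minimal sketch of that idea follows. The helper name `make_cold_prompt` and the filler text are hypothetical, and the word count is only a rough proxy for the token count, which depends on the tokenizer.

```python
import time
import uuid


def make_cold_prompt(filler_words: int, filler: str = "lorem") -> str:
    """Build a prompt whose first token span is unique per call.

    The nonce goes at the very start so that no prefix is shared with
    any earlier prompt. `filler_words` approximates the prompt length;
    the real token count depends on the tokenizer (assumption).
    """
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    return nonce + " " + " ".join([filler] * filler_words)
```

Placing the nonce first matters: a unique suffix would still let the shared leading blocks hit the cache.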

Microbench 2: PD Transfer Lifecycle

Design doc: microbench/transfer_lifecycle_design.md

Produces: Per-phase latency breakdown + transfer bandwidth model

Tasks

| # | Task | Status | Notes |
|---|------|--------|-------|
| 2.1 | Write vLLM instrumentation patch (mooncake_connector timestamps) | TODO | ~100 lines, non-invasive logging |
| 2.2 | Write scheduler instrumentation patch (promote timestamps) | TODO | ~20 lines |
| 2.3 | Write proxy instrumentation (routing + dispatch timestamps) | TODO | Already partially in cache_aware_proxy.py |
| 2.4 | Implement cache-seeding script (warm D's prefix cache to target C) | TODO | Send C-token request to D in combined mode |
| 2.5 | Implement lifecycle driver (orchestrates P/D, collects all timestamps) | TODO | |
| 2.6 | Validate on single config (C=0, N=8192, O=1) | TODO | Check all phases sum to E2E |
| 2.7 | Also run same config on combined instance (overhead baseline) | TODO | |
| 2.8 | Run full sweep (6×6×4=144 configs, 5 reps) | TODO | ~2h + 30min cache seeding |
| 2.9 | Verify incremental transfer: bytes_transferred independent of C | TODO | Critical correctness check |
| 2.10 | Fit transfer bandwidth model: `t = α + β × bytes` | TODO | |
| 2.11 | Generate stacked bar charts + overhead comparison plots | TODO | |
| 2.12 | Compute break-even: when does transfer overhead exceed interference cost? | TODO | Combines results from both microbenchmarks |
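
The bandwidth model in task 2.10, `t = α + β × bytes`, is an ordinary least-squares line fit. The sketch below shows the closed-form solution using only the standard library; the function name `fit_transfer_model` and the choice of seconds and bytes as units are assumptions for illustration.

```python
def fit_transfer_model(bytes_sent, latency_s):
    """Least-squares fit of latency = alpha + beta * bytes.

    Returns (alpha, beta): alpha is the fixed per-transfer overhead in
    seconds, beta is the marginal cost in seconds per byte.
    """
    n = len(bytes_sent)
    if n < 2 or n != len(latency_s):
        raise ValueError("need at least two (bytes, latency) pairs")
    mean_x = sum(bytes_sent) / n
    mean_y = sum(latency_s) / n
    sxx = sum((x - mean_x) ** 2 for x in bytes_sent)
    if sxx == 0:
        raise ValueError("all transfers have the same size; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bytes_sent, latency_s))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta
```

With these units, `1 / beta` is the effective transfer bandwidth in bytes per second and `alpha` is the fixed setup cost per transfer.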

Dependencies

2× H20 GPUs (P + D) on same machine (shared clock)
vLLM source patch (tasks 2.1-2.3)
Mooncake configured for P/D mode
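
With P and D on a shared clock, per-phase durations can be compared directly against the end-to-end latency, which is the check task 2.6 calls for. A minimal sketch of that validation follows; the function name `check_phases_sum`, the phase names in the usage note, and the 5% tolerance are hypothetical.

```python
def check_phases_sum(phases_ms, e2e_ms, rel_tol=0.05):
    """Compare the sum of per-phase durations with the measured E2E latency.

    Returns (ok, gap_ms), where gap_ms = e2e_ms - sum(phases). A large
    positive gap suggests an uninstrumented phase; a negative gap suggests
    overlapping phases or clock skew. The 5% tolerance is an assumption.
    """
    total = sum(phases_ms.values())
    gap_ms = e2e_ms - total
    return abs(gap_ms) <= rel_tol * e2e_ms, gap_ms
```

For example, `check_phases_sum({"prefill": 100.0, "transfer": 20.0, "decode_startup": 5.0}, 128.0)` reports a 3 ms unexplained gap, which is within the tolerance.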

Combined Analysis (After Both Complete)

| # | Task | Status | Notes |
|---|------|--------|-------|
| 3.1 | Build unified offload decision model | TODO | `interference_cost(D,P,chunk) vs transfer_cost(N)` |
| 3.2 | Identify "offload wins" region in (D, N, C) space | TODO | The key deliverable |
| 3.3 | Estimate improvement from layerwise pipeline | TODO | `transfer_cost_layerwise = transfer_cost / num_layers` |
| 3.4 | Quantify maximum possible gain over LMetric | TODO | Upper bound: all requests in "offload wins" region use offload |
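
The comparison in task 3.1 and the layerwise estimate in task 3.3 reduce to a few lines once both cost models exist. The sketch below assumes the two costs have already been evaluated in milliseconds by the fitted models; the function names are hypothetical and this is not the contents of `decision_model.py`.

```python
def should_offload(interference_cost_ms, transfer_cost_ms):
    """Offload wins when moving the request costs less than co-executing it."""
    return transfer_cost_ms < interference_cost_ms


def layerwise_transfer_cost(transfer_cost_ms, num_layers):
    """Estimate from task 3.3: transfer_cost / num_layers.

    This treats the per-layer transfers as fully overlapped with compute,
    so only one layer's worth of transfer stays exposed.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    return transfer_cost_ms / num_layers
```

The "offload wins" region in task 3.2 is then the set of operating points where `should_offload` returns True.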

Execution Order

Week 1:
  [1.1-1.4] Implement + validate interference microbench
  [2.1-2.3] Write vLLM instrumentation patches

Week 2:
  [1.5-1.7] Run interference sweep + fit model
  [2.4-2.6] Implement + validate lifecycle microbench

Week 3:
  [2.7-2.12] Run lifecycle sweep + analysis + break-even
  [3.1-3.4] Combined analysis → offload decision model

Critical path: Task 2.1 (vLLM patch) gates all of Microbench 2.

Quick win: Microbench 1 needs zero vLLM modifications, so it can start immediately.

File Structure

microbench/
├── interference_microbench_design.md    # Design doc (done)
├── transfer_lifecycle_design.md         # Design doc (done)
├── TODO.md                              # This file
├── interference/
│   ├── driver.py                        # Microbench 1 client
│   ├── analyze.py                       # Fit model + plots
│   └── results/                         # Output JSON + CSV
├── lifecycle/
│   ├── driver.py                        # Microbench 2 orchestrator
│   ├── seed_cache.py                    # D-side cache warming
│   ├── analyze.py                       # Breakdown plots
│   └── results/                         # Output JSON + CSV
├── patches/
│   ├── 0001-connector-profiling.patch   # Mooncake timestamp logging
│   └── 0002-scheduler-profiling.patch   # Scheduler timestamp logging
└── combined/
    ├── decision_model.py                # Unified offload decision function
    └── plots/                           # Final analysis figures
