Two microbenchmarks quantifying the elastic offload decision:
1. Interference (corrected): cold prefill causes 14-214x TPOT p90
degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
Earlier run had a prefix-cache bug (deterministic prompts hit cache
after rep 0); fixed with uuid+time_ns unique prompts.
2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
measuring prefill→RDMA→decode startup overhead.
Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
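
The prefix-cache fix described in item 1 can be sketched as follows. This is a minimal illustration, not the driver's actual code: `make_unique_prompt` and its word-based filler are hypothetical, and a real driver would presumably size the prompt by tokenizer count rather than word count.

```python
import time
import uuid


def make_unique_prompt(num_words: int) -> str:
    """Build a prompt whose leading text differs on every call.

    The uuid + time_ns nonce is placed at the very start so that no leading
    block of the prompt can match an earlier request in the prefix cache.
    """
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    # Filler only pads the prompt; word count is a rough proxy for token count.
    filler = " ".join(f"w{i}" for i in range(num_words))
    return f"[{nonce}] {filler}"
```

Putting the nonce first matters because prefix caching matches from the start of the prompt; a nonce appended at the end would still let the shared leading portion hit the cache.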
Microbenchmark TODO
Overview
Two microbenchmarks to establish the quantitative foundation for the elastic migration decision:
- Interference Microbench — quantify TPOT degradation from prefill-decode co-execution
- Transfer Lifecycle Microbench — profile the full PD-sep request lifecycle, especially RDMA transfer cost
Together they answer: "For a given request in a given runtime state, is offload cheaper than co-execution?"
Microbench 1: Prefill-Decode Interference
Design doc: microbench/interference_microbench_design.md
Produces: f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms
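
One way to turn per-token arrival timestamps into a single TPOT penalty value is sketched below. The function names are hypothetical, the nearest-rank percentile is an assumption (the analysis script may prefer interpolation), and the penalty is expressed as a difference in milliseconds to match the `TPOT_penalty_ms` output above.

```python
import math
from typing import List, Sequence


def inter_token_gaps_ms(token_times_s: Sequence[float]) -> List[float]:
    """Gaps between consecutive token arrival times, in milliseconds."""
    return [(b - a) * 1000.0 for a, b in zip(token_times_s, token_times_s[1:])]


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]


def tpot_penalty_ms(
    baseline_times_s: Sequence[float],
    disturbed_times_s: Sequence[float],
    q: float = 90.0,
) -> float:
    """TPOT at percentile q during prefill injection minus the steady-state value."""
    baseline = percentile(inter_token_gaps_ms(baseline_times_s), q)
    disturbed = percentile(inter_token_gaps_ms(disturbed_times_s), q)
    return disturbed - baseline
```

Dividing the two percentiles instead of subtracting them would give a degradation ratio rather than an absolute penalty.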
Tasks
| # | Task | Status | Notes |
|---|---|---|---|
| 1.1 | Implement microbench driver (async streaming client) | TODO | Python, httpx + asyncio SSE |
| 1.2 | Implement steady-state detector (32 tokens, variance check) | TODO | |
| 1.3 | Implement prefill injection + timestamp collection | TODO | |
| 1.4 | Validate on single config (D=4, P=8192, chunk=8192) | TODO | Sanity check: penalty > 0 |
| 1.5 | Run reduced sweep (chunk=8192, 7×7=49 configs, 5 reps) | TODO | ~2.5h on 1×H20 |
| 1.6 | Fit interference cost model | TODO | Linear/polynomial regression |
| 1.7 | Generate heatmap + break-even plot | TODO | |
| 1.8 | (Optional) Full sweep with 4 chunk sizes | TODO | ~10h |
| 1.9 | (Optional) Ablation: --enforce-eager vs CUDA graphs | TODO | |
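
A possible shape for the steady-state detector in task 1.2 is sketched below. The 32-token window comes from the task description; the coefficient-of-variation test and the `max_cv` threshold are assumptions chosen for illustration, and the class name is hypothetical.

```python
import statistics
from collections import deque


class SteadyStateDetector:
    """Declares steady state once the last `window` inter-token gaps are stable."""

    def __init__(self, window: int = 32, max_cv: float = 0.1) -> None:
        self.window = window
        self.max_cv = max_cv  # assumed threshold on stddev / mean
        self._gaps = deque(maxlen=window)

    def add_gap(self, gap_ms: float) -> bool:
        """Record one inter-token gap; return True if decode looks steady."""
        self._gaps.append(gap_ms)
        if len(self._gaps) < self.window:
            return False  # not enough samples yet
        mean = statistics.fmean(self._gaps)
        if mean <= 0:
            return False
        cv = statistics.pstdev(self._gaps) / mean
        return cv <= self.max_cv
```

The driver would feed every observed gap into `add_gap` and inject the prefill request only after it first returns True, so that the baseline TPOT is not contaminated by warm-up tokens.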
Dependencies
- Single H20 GPU with Qwen3-Coder-30B-A3B loaded
- No vLLM source modifications needed (pure client-side measurement)
Microbench 2: PD Transfer Lifecycle
Design doc: microbench/transfer_lifecycle_design.md
Produces: Per-phase latency breakdown + transfer bandwidth model
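
The per-phase breakdown can be derived from ordered timestamps roughly as follows. The event names in `EVENT_ORDER` are hypothetical placeholders for whatever the instrumentation patches actually log, and the sketch assumes all timestamps come from one shared clock.

```python
from typing import Dict, List, Tuple

# Hypothetical event names, listed in the order they are expected to occur.
EVENT_ORDER = [
    "proxy_recv",
    "prefill_start",
    "prefill_end",
    "transfer_end",
    "first_token",
]


def phase_breakdown_ms(events_s: Dict[str, float]) -> List[Tuple[str, float]]:
    """Duration of each phase between consecutive events, in milliseconds."""
    phases = []
    for start, end in zip(EVENT_ORDER, EVENT_ORDER[1:]):
        delta_ms = (events_s[end] - events_s[start]) * 1000.0
        if delta_ms < 0:
            # Out-of-order timestamps usually mean clock skew or a logging bug.
            raise ValueError(f"negative duration for {start}->{end}")
        phases.append((f"{start}->{end}", delta_ms))
    return phases
```

Because the phases are consecutive differences, their sum equals the span from the first to the last event. Comparing that sum against an E2E latency measured independently at the client is what makes the check meaningful.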
Tasks
| # | Task | Status | Notes |
|---|---|---|---|
| 2.1 | Write vLLM instrumentation patch (mooncake_connector timestamps) | TODO | ~100 lines, non-invasive logging |
| 2.2 | Write scheduler instrumentation patch (promote timestamps) | TODO | ~20 lines |
| 2.3 | Write proxy instrumentation (routing + dispatch timestamps) | TODO | Already partially in cache_aware_proxy.py |
| 2.4 | Implement cache-seeding script (warm D's prefix cache to target C) | TODO | Send C-token request to D in combined mode |
| 2.5 | Implement lifecycle driver (orchestrates P/D, collects all timestamps) | TODO | |
| 2.6 | Validate on single config (C=0, N=8192, O=1) | TODO | Check all phases sum to E2E |
| 2.7 | Also run same config on combined instance (overhead baseline) | TODO | |
| 2.8 | Run full sweep (6×6×4=144 configs, 5 reps) | TODO | ~2h + 30min cache seeding |
| 2.9 | Verify incremental transfer: bytes_transferred independent of C | TODO | Critical correctness check |
| 2.10 | Fit transfer bandwidth model: t = α + β × bytes | TODO | |
| 2.11 | Generate stacked bar charts + overhead comparison plots | TODO | |
| 2.12 | Compute break-even: when does transfer overhead exceed interference cost? | TODO | Combines results from both microbenchmarks |
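
The linear transfer model in task 2.10 can be fitted with ordinary least squares. The sketch below uses only the standard library; the function name is hypothetical, and the analysis script may well use NumPy or SciPy instead.

```python
from typing import Sequence, Tuple


def fit_transfer_model(
    bytes_sent: Sequence[float], latency_s: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of t = alpha + beta * bytes; returns (alpha, beta)."""
    n = len(bytes_sent)
    if n < 2 or n != len(latency_s):
        raise ValueError("need at least two (bytes, latency) pairs of equal length")
    mean_x = sum(bytes_sent) / n
    mean_y = sum(latency_s) / n
    sxx = sum((x - mean_x) ** 2 for x in bytes_sent)
    if sxx == 0:
        raise ValueError("all transfer sizes are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bytes_sent, latency_s))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta
```

In this model α is the fixed per-transfer setup latency in seconds and 1/β is the effective bandwidth in bytes per second.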
Dependencies
- 2× H20 GPUs (P + D) on same machine (shared clock)
- vLLM source patch (tasks 2.1-2.3)
- Mooncake configured for P/D mode
Combined Analysis (After Both Complete)
| # | Task | Status | Notes |
|---|---|---|---|
| 3.1 | Build unified offload decision model | TODO | interference_cost(D,P,chunk) vs transfer_cost(N) |
| 3.2 | Identify "offload wins" region in (D, N, C) space | TODO | The key deliverable |
| 3.3 | Estimate improvement from layerwise pipeline | TODO | Best case: transfer_cost_layerwise ≈ transfer_cost / num_layers |
| 3.4 | Quantify maximum possible gain over LMetric | TODO | Upper bound: all requests in "offload wins" region use offload |
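
A minimal sketch of the comparison behind tasks 3.1 and 3.3 follows. The class, its fields, and all numeric values in the usage example are hypothetical; both costs must be expressed in the same unit, and how the TPOT penalty is aggregated across the decode batch is left open here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferModel:
    """Linear transfer model t = alpha + beta * bytes, in milliseconds."""

    alpha_ms: float
    beta_ms_per_byte: float
    kv_bytes_per_token: int  # assumed constant per token for a given model

    def cost_ms(self, new_tokens: int, num_layers: int = 1) -> float:
        """Bulk transfer cost; num_layers > 1 applies the layerwise estimate."""
        bulk = self.alpha_ms + self.beta_ms_per_byte * self.kv_bytes_per_token * new_tokens
        return bulk / num_layers


def should_offload(interference_cost_ms: float, transfer_cost_ms: float) -> bool:
    """Offload when moving the request is cheaper than co-executing it."""
    return transfer_cost_ms < interference_cost_ms
```

For example, with made-up parameters `TransferModel(2.0, 1e-6, 1000)`, `cost_ms(8192)` is about 10.2 ms, so a request whose predicted interference cost is 40 ms falls in the "offload wins" region while one at 5 ms does not.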
Execution Order
Week 1:
[1.1-1.4] Implement + validate interference microbench
[2.1-2.3] Write vLLM instrumentation patches
Week 2:
[1.5-1.7] Run interference sweep + fit model
[2.4-2.6] Implement + validate lifecycle microbench
Week 3:
[2.7-2.12] Run lifecycle sweep + analysis
[3.1-3.4] Combined analysis → offload decision model
Critical path: Task 2.1 (vLLM patch) gates all of Microbench 2.
Quick win: Microbench 1 needs zero vLLM modifications and can start immediately.
File Structure
microbench/
├── interference_microbench_design.md # Design doc (done)
├── transfer_lifecycle_design.md # Design doc (done)
├── TODO.md # This file
├── interference/
│ ├── driver.py # Microbench 1 client
│ ├── analyze.py # Fit model + plots
│ └── results/ # Output JSON + CSV
├── lifecycle/
│ ├── driver.py # Microbench 2 orchestrator
│ ├── seed_cache.py # D-side cache warming
│ ├── analyze.py # Breakdown plots
│ └── results/ # Output JSON + CSV
├── patches/
│ ├── 0001-connector-profiling.patch # Mooncake timestamp logging
│ └── 0002-scheduler-profiling.patch # Scheduler timestamp logging
└── combined/
    ├── decision_model.py # Unified offload decision function
    └── plots/ # Final analysis figures