# Microbenchmark TODO

## Overview

Two microbenchmarks to establish the quantitative foundation for the elastic migration decision:

1. **Interference Microbench**: quantify TPOT degradation from prefill-decode co-execution
2. **Transfer Lifecycle Microbench**: profile the full PD-sep request lifecycle, especially RDMA transfer cost

Together they answer: **"For a given request in a given runtime state, is offload cheaper than co-execution?"**

---

## Microbench 1: Prefill-Decode Interference

**Design doc**: `microbench/interference_microbench_design.md`

**Produces**: `f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms`

### Tasks

| # | Task | Status | Notes |
|---|------|--------|-------|
| 1.1 | Implement microbench driver (async streaming client) | TODO | Python, httpx + asyncio SSE |
| 1.2 | Implement steady-state detector (32 tokens, variance check) | TODO | |
| 1.3 | Implement prefill injection + timestamp collection | TODO | |
| 1.4 | Validate on single config (D=4, P=8192, chunk=8192) | TODO | Sanity check: penalty > 0 |
| 1.5 | Run reduced sweep (chunk=8192, 7×7=49 configs, 5 reps) | TODO | ~2.5h on 1×H20 |
| 1.6 | Fit interference cost model | TODO | Linear/polynomial regression |
| 1.7 | Generate heatmap + break-even plot | TODO | |
| 1.8 | (Optional) Full sweep with 4 chunk sizes | TODO | ~10h |
| 1.9 | (Optional) Ablation: `--enforce-eager` vs CUDA graphs | TODO | |

### Dependencies

- Single H20 GPU with Qwen3-Coder-30B-A3B loaded
- No vLLM source modifications needed (pure client-side measurement)

---

## Microbench 2: PD Transfer Lifecycle

**Design doc**: `microbench/transfer_lifecycle_design.md`

**Produces**: Per-phase latency breakdown + transfer bandwidth model

### Tasks

| # | Task | Status | Notes |
|---|------|--------|-------|
| 2.1 | Write vLLM instrumentation patch (mooncake_connector timestamps) | TODO | ~100 lines, non-invasive logging |
| 2.2 | Write scheduler instrumentation patch (promote timestamps) | TODO | ~20 lines |
| 2.3 | Write proxy instrumentation (routing + dispatch timestamps) | TODO | Already partially in cache_aware_proxy.py |
| 2.4 | Implement cache-seeding script (warm D's prefix cache to target C) | TODO | Send C-token request to D in combined mode |
| 2.5 | Implement lifecycle driver (orchestrates P/D, collects all timestamps) | TODO | |
| 2.6 | Validate on single config (C=0, N=8192, O=1) | TODO | Check all phases sum to E2E |
| 2.7 | Also run same config on combined instance (overhead baseline) | TODO | |
| 2.8 | Run full sweep (6×6×4=144 configs, 5 reps) | TODO | ~2h + 30min cache seeding |
| 2.9 | Verify incremental transfer: bytes_transferred independent of C | TODO | Critical correctness check |
| 2.10 | Fit transfer bandwidth model: `t = α + β × bytes` | TODO | |
| 2.11 | Generate stacked bar charts + overhead comparison plots | TODO | |
| 2.12 | Compute break-even: when does transfer overhead exceed interference cost? | TODO | Combines results from both microbenchmarks |

### Dependencies

- 2× H20 GPUs (P + D) on same machine (shared clock)
- vLLM source patch (tasks 2.1-2.3)
- Mooncake configured for P/D mode

---

## Combined Analysis (After Both Complete)

| # | Task | Status | Notes |
|---|------|--------|-------|
| 3.1 | Build unified offload decision model | TODO | `interference_cost(D,P,chunk) vs transfer_cost(N)` |
| 3.2 | Identify "offload wins" region in (D, N, C) space | TODO | The key deliverable |
| 3.3 | Estimate improvement from layerwise pipeline | TODO | `transfer_cost_layerwise = transfer_cost / num_layers` |
| 3.4 | Quantify maximum possible gain over LMetric | TODO | Upper bound: all requests in "offload wins" region use offload |

---

## Execution Order

```
Week 1:  [1.1-1.4]   Implement + validate interference microbench
         [2.1-2.3]   Write vLLM instrumentation patches
Week 2:  [1.5-1.7]   Run interference sweep + fit model
         [2.4-2.6]   Implement + validate lifecycle microbench
Week 3:  [2.7-2.11]  Run lifecycle sweep + analysis
         [3.1-3.4]   Combined analysis → offload decision model
```

**Critical path**: Task 2.1 (vLLM patch) gates all of Microbench 2.

**Quick win**: Microbench 1 needs zero vLLM modifications, so it can start immediately.

---

## File Structure

```
microbench/
├── interference_microbench_design.md   # Design doc (done)
├── transfer_lifecycle_design.md        # Design doc (done)
├── TODO.md                             # This file
├── interference/
│   ├── driver.py                       # Microbench 1 client
│   ├── analyze.py                      # Fit model + plots
│   └── results/                        # Output JSON + CSV
├── lifecycle/
│   ├── driver.py                       # Microbench 2 orchestrator
│   ├── seed_cache.py                   # D-side cache warming
│   ├── analyze.py                      # Breakdown plots
│   └── results/                        # Output JSON + CSV
├── patches/
│   ├── 0001-connector-profiling.patch  # Mooncake timestamp logging
│   └── 0002-scheduler-profiling.patch  # Scheduler timestamp logging
└── combined/
    ├── decision_model.py               # Unified offload decision function
    └── plots/                          # Final analysis figures
```
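
As a possible starting point for tasks 2.10 and 3.1, the sketch below fits `t = α + β × bytes` by ordinary least squares and compares the predicted transfer cost against an interference penalty. The function names, the millisecond units, and the "offload when transfer is strictly cheaper" rule are assumptions made for illustration; the design docs may specify something different.

```python
from typing import Sequence, Tuple


def fit_transfer_model(nbytes: Sequence[float],
                       t_ms: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of t = alpha + beta * bytes (task 2.10).

    Returns (alpha, beta): alpha is the fixed per-transfer overhead and
    beta the per-byte cost. Units (ms, ms/byte) are an assumption.
    """
    n = len(nbytes)
    if n < 2 or n != len(t_ms):
        raise ValueError("need at least two paired samples")
    mean_x = sum(nbytes) / n
    mean_y = sum(t_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in nbytes)
    if sxx == 0:
        raise ValueError("all samples have the same byte count")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(nbytes, t_ms))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def transfer_cost_ms(alpha: float, beta: float, nbytes: float) -> float:
    """Predicted transfer time for a payload of `nbytes`."""
    return alpha + beta * nbytes


def should_offload(interference_ms: float, transfer_ms: float) -> bool:
    """Hypothetical decision rule (task 3.1): offload only when the
    predicted transfer cost is below the predicted TPOT penalty."""
    return transfer_ms < interference_ms
```

The closed-form fit keeps `combined/decision_model.py` free of third-party dependencies; swapping in a regression library later would not change the call sites.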
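
For task 1.2, one way to read "32 tokens, variance check" is a sliding window over the most recent inter-token gaps that reports steady state once their coefficient of variation drops below a threshold. The class below is a minimal sketch under that reading; the class name, the use of 32 gaps rather than 32 tokens, and the 5% threshold are all assumptions, not values from the design doc.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Optional


class SteadyStateDetector:
    """Reports steady state once the last `window` inter-token gaps are stable."""

    def __init__(self, window: int = 32, max_cv: float = 0.05):
        self.window = window
        self.max_cv = max_cv  # hypothetical threshold on stddev / mean
        self.gaps = deque(maxlen=window)  # only the newest `window` gaps are kept
        self.last_ts: Optional[float] = None

    def observe(self, ts: float) -> bool:
        """Record one token arrival time (seconds); return True if steady."""
        if self.last_ts is not None:
            self.gaps.append(ts - self.last_ts)
        self.last_ts = ts
        if len(self.gaps) < self.window:
            return False  # not enough history yet
        mu = mean(self.gaps)
        return mu > 0 and pstdev(self.gaps) / mu <= self.max_cv
```

The driver would call `observe(time.perf_counter())` on each streamed token and inject the prefill request only after it returns `True`.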