Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   An earlier run had a prefix-cache bug (deterministic prompts hit the
   cache after rep 0); fixed by making every prompt unique with uuid + time_ns.
2. Transfer lifecycle: PD-sep TTFT breakdown via the Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.

Key finding: offload wins at all operating points with P≥2048;
transfer cost is 25-50% of interference cost even with bulk Mooncake.
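
The uuid + time_ns fix for the prefix-cache bug can be illustrated with a minimal sketch. The helper name `make_unique_prompt` and the bracketed nonce format are hypothetical, not taken from the actual driver; the only point is that the unique part comes first, since a prefix cache matches from the start of the prompt. Note that the nonce adds a few tokens to the prompt length.

```python
import time
import uuid


def make_unique_prompt(base_prompt: str) -> str:
    """Prepend a per-call nonce so repeated runs never share a leading prefix.

    Hypothetical helper: the name and the nonce layout are assumptions.
    """
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    # The nonce goes first on purpose: a unique leading segment means the
    # prompt diverges from every earlier prompt at its very first characters.
    return f"[{nonce}] {base_prompt}"


# Three repetitions of the "same" prompt are now three distinct strings.
prompts = [make_unique_prompt("Summarize the following text.") for _ in range(3)]
```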
# Microbenchmark TODO
|
||
|
||
## Overview
|
||
|
||
Two microbenchmarks to establish the quantitative foundation for the elastic migration decision:
|
||
|
||
1. **Interference Microbench** — quantify TPOT degradation from prefill-decode co-execution
|
||
2. **Transfer Lifecycle Microbench** — profile the full PD-sep request lifecycle, especially RDMA transfer cost
|
||
|
||
Together they answer: **"For a given request in a given runtime state, is offload cheaper than co-execution?"**
|
||
|
||
---
|
||
|
||
## Microbench 1: Prefill-Decode Interference

**Design doc**: `microbench/interference_microbench_design.md`

**Produces**: `f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms`

### Tasks

| # | Task | Status | Notes |
|---|------|--------|-------|
| 1.1 | Implement microbench driver (async streaming client) | TODO | Python, httpx + asyncio SSE |
| 1.2 | Implement steady-state detector (32 tokens, variance check) | TODO | |
| 1.3 | Implement prefill injection + timestamp collection | TODO | |
| 1.4 | Validate on single config (D=4, P=8192, chunk=8192) | TODO | Sanity check: penalty > 0 |
| 1.5 | Run reduced sweep (chunk=8192, 7×7=49 configs, 5 reps) | TODO | ~2.5h on 1×H20 |
| 1.6 | Fit interference cost model | TODO | Linear/polynomial regression |
| 1.7 | Generate heatmap + break-even plot | TODO | |
| 1.8 | (Optional) Full sweep with 4 chunk sizes | TODO | ~10h |
| 1.9 | (Optional) Ablation: `--enforce-eager` vs CUDA graphs | TODO | |

### Dependencies

- Single H20 GPU with Qwen3-Coder-30B-A3B loaded
- No vLLM source modifications needed (pure client-side measurement)
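
The steady-state detector (task 1.2) and the penalty computation behind the sanity check in task 1.4 could be sketched as follows. This is a sketch under stated assumptions, not the design doc's method: "32 tokens" is read as a sliding window of 32 inter-token gaps, the "variance check" is implemented as a coefficient-of-variation threshold, and the 10% threshold plus the names `SteadyStateDetector` and `tpot_penalty_ms` are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class SteadyStateDetector:
    """Declare steady state once the last `window` inter-token gaps are stable.

    Assumption: stability means std/mean of the gaps is at most `max_cv`.
    """

    def __init__(self, window: int = 32, max_cv: float = 0.10):
        self.max_cv = max_cv
        self.gaps = deque(maxlen=window)  # most recent inter-token gaps (s)
        self._last_ts = None

    def observe(self, ts: float) -> bool:
        """Record one token arrival time; return True when decode is steady."""
        if self._last_ts is not None:
            self.gaps.append(ts - self._last_ts)
        self._last_ts = ts
        if len(self.gaps) < self.gaps.maxlen:
            return False  # window not yet full
        avg = mean(self.gaps)
        return avg > 0 and pstdev(self.gaps) / avg <= self.max_cv


def tpot_penalty_ms(baseline_gaps_s, disturbed_gaps_s) -> float:
    """Mean TPOT increase (ms) of disturbed decode over the steady baseline."""
    return (mean(disturbed_gaps_s) - mean(baseline_gaps_s)) * 1e3


# Usage: feed arrival timestamps from the streaming client, inject the
# prefill once observe() returns True, then compare gaps before and after.
detector = SteadyStateDetector()
steady_flags = [detector.observe(i * 0.02) for i in range(33)]
```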

---

## Microbench 2: PD Transfer Lifecycle

**Design doc**: `microbench/transfer_lifecycle_design.md`

**Produces**: Per-phase latency breakdown + transfer bandwidth model

### Tasks

| # | Task | Status | Notes |
|---|------|--------|-------|
| 2.1 | Write vLLM instrumentation patch (mooncake_connector timestamps) | TODO | ~100 lines, non-invasive logging |
| 2.2 | Write scheduler instrumentation patch (promote timestamps) | TODO | ~20 lines |
| 2.3 | Write proxy instrumentation (routing + dispatch timestamps) | TODO | Already partially in cache_aware_proxy.py |
| 2.4 | Implement cache-seeding script (warm D's prefix cache to target C) | TODO | Send C-token request to D in combined mode |
| 2.5 | Implement lifecycle driver (orchestrates P/D, collects all timestamps) | TODO | |
| 2.6 | Validate on single config (C=0, N=8192, O=1) | TODO | Check all phases sum to E2E |
| 2.7 | Also run same config on combined instance (overhead baseline) | TODO | |
| 2.8 | Run full sweep (6×6×4=144 configs, 5 reps) | TODO | ~2h + 30min cache seeding |
| 2.9 | Verify incremental transfer: bytes_transferred independent of C | TODO | Critical correctness check |
| 2.10 | Fit transfer bandwidth model: `t = α + β × bytes` | TODO | |
| 2.11 | Generate stacked bar charts + overhead comparison plots | TODO | |
| 2.12 | Compute break-even: when does transfer overhead exceed interference cost? | TODO | Combines results from both microbenchmarks |

### Dependencies

- 2× H20 GPUs (P + D) on same machine (shared clock)
- vLLM source patch (tasks 2.1-2.3)
- Mooncake configured for P/D mode
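
For task 2.10, the model `t = α + β × bytes` can be fitted with ordinary least squares; α is then the fixed per-transfer overhead and 1/β the effective bandwidth. The sketch below uses the closed-form solution in plain Python. The function name `fit_transfer_model` is hypothetical, and the sample points are synthetic values generated from made-up coefficients purely to exercise the fit; they are not measurements.

```python
def fit_transfer_model(bytes_moved, latency_s):
    """Least-squares fit of latency = alpha + beta * bytes.

    Returns (alpha, beta): alpha in seconds, beta in seconds per byte.
    """
    n = len(bytes_moved)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(bytes_moved) / n
    mean_y = sum(latency_s) / n
    sxx = sum((x - mean_x) ** 2 for x in bytes_moved)
    if sxx == 0:
        raise ValueError("all samples have the same size; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bytes_moved, latency_s))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Synthetic check only: points generated from alpha = 2 ms, 20 GB/s.
TRUE_ALPHA, TRUE_BETA = 2e-3, 1 / 20e9
sizes = [64e6, 256e6, 512e6, 1e9]
times = [TRUE_ALPHA + TRUE_BETA * s for s in sizes]

alpha, beta = fit_transfer_model(sizes, times)
effective_bandwidth_gbps = 1 / beta / 1e9  # GB/s implied by the slope
```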

---

## Combined Analysis (After Both Complete)

| # | Task | Status | Notes |
|---|------|--------|-------|
| 3.1 | Build unified offload decision model | TODO | `interference_cost(D,P,chunk) vs transfer_cost(N)` |
| 3.2 | Identify "offload wins" region in (D, N, C) space | TODO | The key deliverable |
| 3.3 | Estimate improvement from layerwise pipeline | TODO | `transfer_cost_layerwise = transfer_cost / num_layers` |
| 3.4 | Quantify maximum possible gain over LMetric | TODO | Upper bound: all requests in "offload wins" region use offload |

---
## Execution Order

```
Week 1:
  [1.1-1.4] Implement + validate interference microbench
  [2.1-2.3] Write vLLM instrumentation patches

Week 2:
  [1.5-1.7] Run interference sweep + fit model
  [2.4-2.6] Implement + validate lifecycle microbench

Week 3:
  [2.7-2.11] Run lifecycle sweep + analysis
  [3.1-3.4] Combined analysis → offload decision model
```

**Critical path**: Task 2.1 (vLLM patch) gates all of Microbench 2.
**Quick win**: Microbench 1 needs zero vLLM modifications, so it can start immediately.

---
## File Structure

```
microbench/
├── interference_microbench_design.md    # Design doc (done)
├── transfer_lifecycle_design.md         # Design doc (done)
├── TODO.md                              # This file
├── interference/
│   ├── driver.py                        # Microbench 1 client
│   ├── analyze.py                       # Fit model + plots
│   └── results/                         # Output JSON + CSV
├── lifecycle/
│   ├── driver.py                        # Microbench 2 orchestrator
│   ├── seed_cache.py                    # D-side cache warming
│   ├── analyze.py                       # Breakdown plots
│   └── results/                         # Output JSON + CSV
├── patches/
│   ├── 0001-connector-profiling.patch   # Mooncake timestamp logging
│   └── 0002-scheduler-profiling.patch   # Scheduler timestamp logging
└── combined/
    ├── decision_model.py                # Unified offload decision function
    └── plots/                           # Final analysis figures
```
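
The unified decision function planned for `combined/decision_model.py` (task 3.1) might take a shape like the sketch below: it compares `interference_cost(D, P, chunk)` against `transfer_cost(N)` and offloads when the transfer is cheaper. Everything here is an assumption for illustration: the names `OffloadDecision` and `decide_offload`, the choice to use the same token count for P and N, and the toy cost functions, whose coefficients are placeholders rather than fitted results. In practice the two callables would be the models fitted in tasks 1.6 and 2.10.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class OffloadDecision:
    offload: bool
    interference_ms: float
    transfer_ms: float


def decide_offload(
    decode_batch: int,
    new_tokens: int,
    chunk_size: int,
    interference_cost_ms: Callable[[int, int, int], float],
    transfer_cost_ms: Callable[[int], float],
) -> OffloadDecision:
    """Offload when the modeled transfer cost is below the interference cost.

    Assumption: the request's new prefill tokens (P) and the tokens whose
    KV cache must be transferred (N) are the same quantity, `new_tokens`.
    """
    interference = interference_cost_ms(decode_batch, new_tokens, chunk_size)
    transfer = transfer_cost_ms(new_tokens)
    return OffloadDecision(transfer < interference, interference, transfer)


# Placeholder cost models with made-up coefficients (NOT measurements).
def toy_interference_ms(d: int, p: int, chunk: int) -> float:
    return 0.01 * p


def toy_transfer_ms(n: int) -> float:
    return 5.0 + 0.001 * n


long_req = decide_offload(4, 8192, 8192, toy_interference_ms, toy_transfer_ms)
short_req = decide_offload(4, 128, 8192, toy_interference_ms, toy_transfer_ms)
```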