agentic-kvc/microbench/TODO.md
Gahow Wang f784e49c07 Microbench: prefill-decode interference + PD transfer lifecycle
Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
2026-05-26 00:57:06 +08:00

# Microbenchmark TODO
## Overview
Two microbenchmarks to establish the quantitative foundation for the elastic migration decision:
1. **Interference Microbench** — quantify TPOT degradation from prefill-decode co-execution
2. **Transfer Lifecycle Microbench** — profile the full PD-sep request lifecycle, especially RDMA transfer cost

Together they answer: **"For a given request in a given runtime state, is offload cheaper than co-execution?"**

---
## Microbench 1: Prefill-Decode Interference
**Design doc**: `microbench/interference_microbench_design.md`
**Produces**: `f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms`
### Tasks
| # | Task | Status | Notes |
|---|------|--------|-------|
| 1.1 | Implement microbench driver (async streaming client) | TODO | Python, httpx + asyncio SSE |
| 1.2 | Implement steady-state detector (32 tokens, variance check) | TODO | |
| 1.3 | Implement prefill injection + timestamp collection | TODO | |
| 1.4 | Validate on single config (D=4, P=8192, chunk=8192) | TODO | Sanity check: penalty > 0 |
| 1.5 | Run reduced sweep (chunk=8192, 7×7=49 configs, 5 reps) | TODO | ~2.5h on 1×H20 |
| 1.6 | Fit interference cost model | TODO | Linear/polynomial regression |
| 1.7 | Generate heatmap + break-even plot | TODO | |
| 1.8 | (Optional) Full sweep with 4 chunk sizes | TODO | ~10h |
| 1.9 | (Optional) Ablation: `--enforce-eager` vs CUDA graphs | TODO | |
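
As a rough illustration of task 1.2, the steady-state check could look at the last 32 inter-token gaps and accept once their relative spread is small. This is a minimal sketch, not the driver's actual logic: the function name `is_steady_state` and the coefficient-of-variation threshold `max_cv` are assumptions; only the 32-token window and the idea of a variance check come from the task table.

```python
from statistics import mean, pstdev


def is_steady_state(token_times_s, window=32, max_cv=0.1):
    """Return True once the last `window` inter-token gaps look stable.

    token_times_s: increasing arrival times (seconds) of streamed tokens.
    window:        number of most recent gaps to inspect (32, per task 1.2).
    max_cv:        ASSUMED threshold on std/mean of the gaps; tune per setup.
    """
    if len(token_times_s) < window + 1:
        return False  # not enough tokens observed yet
    recent = token_times_s[-(window + 1):]
    gaps = [b - a for a, b in zip(recent, recent[1:])]
    avg = mean(gaps)
    if avg <= 0:
        return False  # timestamps are not strictly increasing on average
    return pstdev(gaps) / avg <= max_cv
```

A driver would call this after each received token and inject the prefill only once it returns `True`.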
### Dependencies
- Single H20 GPU with Qwen3-Coder-30B-A3B loaded
- No vLLM source modifications needed (pure client-side measurement)
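
Because the measurement is purely client-side, the TPOT penalty can be derived from token arrival timestamps alone. The sketch below is one possible definition, assuming the penalty is the p90 inter-token gap after the prefill injection minus the p90 gap before it. The helper names and the choice to treat every gap after `inject_time_s` as the interference window are assumptions; the design doc may bound the window differently.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def tpot_penalty_ms(token_times_s, inject_time_s, q=90):
    """Percentile-q inter-token gap after injection minus the one before it.

    token_times_s: arrival times (seconds) of one decode stream's tokens.
    inject_time_s: client-side time at which the prefill request was sent.
    """
    # Pair each gap (in ms) with the arrival time of the token that closes it.
    gaps = [(b, (b - a) * 1000.0)
            for a, b in zip(token_times_s, token_times_s[1:])]
    before = [g for t, g in gaps if t <= inject_time_s]
    after = [g for t, g in gaps if t > inject_time_s]
    return percentile(after, q) - percentile(before, q)
```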
---
## Microbench 2: PD Transfer Lifecycle
**Design doc**: `microbench/transfer_lifecycle_design.md`
**Produces**: Per-phase latency breakdown + transfer bandwidth model
### Tasks
| # | Task | Status | Notes |
|---|------|--------|-------|
| 2.1 | Write vLLM instrumentation patch (mooncake_connector timestamps) | TODO | ~100 lines, non-invasive logging |
| 2.2 | Write scheduler instrumentation patch (promote timestamps) | TODO | ~20 lines |
| 2.3 | Write proxy instrumentation (routing + dispatch timestamps) | TODO | Already partially in cache_aware_proxy.py |
| 2.4 | Implement cache-seeding script (warm D's prefix cache to target C) | TODO | Send C-token request to D in combined mode |
| 2.5 | Implement lifecycle driver (orchestrates P/D, collects all timestamps) | TODO | |
| 2.6 | Validate on single config (C=0, N=8192, O=1) | TODO | Check all phases sum to E2E |
| 2.7 | Also run same config on combined instance (overhead baseline) | TODO | |
| 2.8 | Run full sweep (6×6×4=144 configs, 5 reps) | TODO | ~2h + 30min cache seeding |
| 2.9 | Verify incremental transfer: bytes_transferred independent of C | TODO | Critical correctness check |
| 2.10 | Fit transfer bandwidth model: `t = α + β × bytes` | TODO | |
| 2.11 | Generate stacked bar charts + overhead comparison plots | TODO | |
| 2.12 | Compute break-even: when does transfer overhead exceed interference cost? | TODO | Combines results from both microbenchmarks |
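
The sanity check in task 2.6 (all phases sum to E2E) could be expressed as a small helper like the one below. This is a sketch under assumptions: the phase names in the usage note are hypothetical placeholders, and the 5% relative tolerance is an arbitrary starting point rather than a value from the design doc.

```python
def check_phase_sum(phases_ms, e2e_ms, rel_tol=0.05):
    """Compare the sum of per-phase latencies against measured E2E latency.

    phases_ms: dict mapping phase name -> latency in ms.
    e2e_ms:    end-to-end latency in ms from the client's point of view.
    rel_tol:   ASSUMED acceptable unexplained fraction of E2E.

    Returns (ok, residual_ms), where residual_ms = e2e_ms - sum(phases).
    A large positive residual suggests an uninstrumented phase; a negative
    one suggests overlapping phases or clock skew.
    """
    total = sum(phases_ms.values())
    residual = e2e_ms - total
    return abs(residual) <= rel_tol * e2e_ms, residual
```

For example, `check_phase_sum({"prefill": 400.0, "kv_transfer": 80.0, "decode_startup": 15.0}, 500.0)` leaves 5 ms unexplained, which is within the tolerance.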
### Dependencies
- 2× H20 GPUs (P + D) on same machine (shared clock)
- vLLM source patch (tasks 2.1-2.3)
- Mooncake configured for P/D mode
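
The bandwidth model in task 2.10, `t = α + β × bytes`, is an ordinary least-squares line fit, where α is the fixed per-transfer overhead and 1/β is the effective bandwidth. A minimal dependency-free sketch follows; the function names are hypothetical, and `analyze.py` may well use a library fit instead.

```python
def fit_transfer_model(bytes_transferred, transfer_time_ms):
    """Fit t = alpha + beta * bytes by ordinary least squares.

    Returns (alpha_ms, beta_ms_per_byte).
    """
    n = len(bytes_transferred)
    if n < 2 or n != len(transfer_time_ms):
        raise ValueError("need at least two (bytes, time) pairs")
    mean_x = sum(bytes_transferred) / n
    mean_y = sum(transfer_time_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in bytes_transferred)
    if sxx == 0:
        raise ValueError("all transfer sizes are identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(bytes_transferred, transfer_time_ms))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def effective_bandwidth_gb_s(beta_ms_per_byte):
    """Convert the fitted slope (ms per byte) into GB/s (1 GB = 1e9 bytes)."""
    return 1.0 / (beta_ms_per_byte * 1e-3) / 1e9
```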
---
## Combined Analysis (After Both Complete)
| # | Task | Status | Notes |
|---|------|--------|-------|
| 3.1 | Build unified offload decision model | TODO | `interference_cost(D,P,chunk) vs transfer_cost(N)` |
| 3.2 | Identify "offload wins" region in (D, N, C) space | TODO | The key deliverable |
| 3.3 | Estimate improvement from layerwise pipeline | TODO | Idealized bound: `transfer_cost_layerwise = transfer_cost / num_layers`, assuming per-layer transfers fully overlap with prefill compute |
| 3.4 | Quantify maximum possible gain over LMetric | TODO | Upper bound: all requests in "offload wins" region use offload |
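
The comparison in tasks 3.1 and 3.3 can be sketched as a few small functions. Everything here is a placeholder for the eventual `decision_model.py`: the function names, the `bytes_per_token` parameter, and the assumption that both costs are expressed in the same unit (ms added to the affected requests) are illustrative only. The interference cost would come from the fitted Microbench 1 model and α, β from the Microbench 2 fit.

```python
def transfer_cost_ms(n_tokens, bytes_per_token, alpha_ms, beta_ms_per_byte):
    """Bulk transfer cost from the fitted model t = alpha + beta * bytes.

    bytes_per_token is an ASSUMED per-token KV-cache footprint.
    """
    return alpha_ms + beta_ms_per_byte * n_tokens * bytes_per_token


def layerwise_transfer_cost_ms(bulk_cost_ms, num_layers):
    """Idealized layerwise estimate from task 3.3: only one layer's share
    of the transfer stays exposed when the rest overlaps with compute."""
    return bulk_cost_ms / num_layers


def should_offload(interference_cost_ms, transfer_ms):
    """Offload wins when moving the request costs less than co-executing it."""
    return transfer_ms < interference_cost_ms
```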
---
## Execution Order
```
Week 1:
  [1.1-1.4]  Implement + validate interference microbench
  [2.1-2.3]  Write vLLM instrumentation patches
Week 2:
  [1.5-1.7]  Run interference sweep + fit model
  [2.4-2.6]  Implement + validate lifecycle microbench
Week 3:
  [2.7-2.12] Run lifecycle sweep + analysis + break-even
  [3.1-3.4]  Combined analysis → offload decision model
```

**Critical path**: Task 2.1 (vLLM patch) gates all of Microbench 2.

**Quick win**: Microbench 1 needs zero vLLM modifications, so it can start immediately.

---
## File Structure
```
microbench/
├── interference_microbench_design.md   # Design doc (done)
├── transfer_lifecycle_design.md        # Design doc (done)
├── TODO.md                             # This file
├── interference/
│   ├── driver.py                       # Microbench 1 client
│   ├── analyze.py                      # Fit model + plots
│   └── results/                        # Output JSON + CSV
├── lifecycle/
│   ├── driver.py                       # Microbench 2 orchestrator
│   ├── seed_cache.py                   # D-side cache warming
│   ├── analyze.py                      # Breakdown plots
│   └── results/                        # Output JSON + CSV
├── patches/
│   ├── 0001-connector-profiling.patch  # Mooncake timestamp logging
│   └── 0002-scheduler-profiling.patch  # Scheduler timestamp logging
└── combined/
    ├── decision_model.py               # Unified offload decision function
    └── plots/                          # Final analysis figures
```