# Microbenchmark TODO

## Overview

Two microbenchmarks to establish the quantitative foundation for the elastic migration decision:

1. **Interference Microbench** — quantify TPOT degradation from prefill-decode co-execution
2. **Transfer Lifecycle Microbench** — profile the full PD-sep request lifecycle, especially RDMA transfer cost

Together they answer: **"For a given request in a given runtime state, is offload cheaper than co-execution?"**

---

## Microbench 1: Prefill-Decode Interference

**Design doc**: `microbench/interference_microbench_design.md`

**Produces**: `f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms`
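
The penalty above can be derived purely from client-side token arrival times. The following is a minimal sketch, not the design doc's definition: it assumes the penalty is the mean inter-token gap while the injected prefill is in flight minus the mean gap before injection. The function name and the `inject_time_s` / `prefill_done_time_s` parameters are hypothetical.

```python
from statistics import mean


def tpot_penalty_ms(arrival_times_s, inject_time_s, prefill_done_time_s):
    """Mean decode gap during the injected prefill minus the mean gap before it.

    arrival_times_s: client-side arrival time of each decode token (seconds).
    inject_time_s / prefill_done_time_s: assumed bounds of the prefill window.
    """
    baseline, disturbed = [], []
    for prev, cur in zip(arrival_times_s, arrival_times_s[1:]):
        gap_ms = (cur - prev) * 1000.0
        if cur <= inject_time_s:
            # Gap finished before the prefill was injected.
            baseline.append(gap_ms)
        elif prev < prefill_done_time_s:
            # Gap overlaps the window in which the prefill was running.
            disturbed.append(gap_ms)
    if not baseline or not disturbed:
        raise ValueError("need gaps both before and during the prefill window")
    return mean(disturbed) - mean(baseline)
```

Whether mean, median, or a tail percentile is the right aggregate is left to the design doc; swapping `mean` for another statistic does not change the structure.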

### Tasks

| # | Task | Status | Notes |
|---|------|--------|-------|
| 1.1 | Implement microbench driver (async streaming client) | TODO | Python, httpx + asyncio SSE |
| 1.2 | Implement steady-state detector (32 tokens, variance check) | TODO | |
| 1.3 | Implement prefill injection + timestamp collection | TODO | |
| 1.4 | Validate on single config (D=4, P=8192, chunk=8192) | TODO | Sanity check: penalty > 0 |
| 1.5 | Run reduced sweep (chunk=8192, 7×7=49 configs, 5 reps) | TODO | ~2.5h on 1×H20 |
| 1.6 | Fit interference cost model | TODO | Linear/polynomial regression |
| 1.7 | Generate heatmap + break-even plot | TODO | |
| 1.8 | (Optional) Full sweep with 4 chunk sizes | TODO | ~10h |
| 1.9 | (Optional) Ablation: `--enforce-eager` vs CUDA graphs | TODO | |
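
One possible shape for the steady-state detector in task 1.2, as a sketch only. It assumes "32 tokens" means a sliding window of the last 32 inter-token gaps and that the "variance check" is a coefficient-of-variation threshold; the class name and the 5% default threshold are assumptions, not values from the design doc.

```python
from collections import deque
from statistics import mean, pstdev


class SteadyStateDetector:
    """Flags steady state once the last `window` gaps have low relative spread."""

    def __init__(self, window=32, max_cv=0.05):
        self.window = window
        self.max_cv = max_cv  # assumed threshold on stddev / mean
        self._gaps = deque(maxlen=window)

    def observe(self, gap_ms):
        """Record one inter-token gap; return True if the window is steady."""
        self._gaps.append(gap_ms)
        if len(self._gaps) < self.window:
            return False  # not enough samples yet
        mu = mean(self._gaps)
        if mu <= 0:
            return False  # degenerate timestamps, never report steady
        return pstdev(self._gaps) / mu <= self.max_cv
```

The driver would call `observe()` once per streamed token and inject the prefill on the first `True`.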

### Dependencies
- Single H20 GPU with Qwen3-Coder-30B-A3B loaded
- No vLLM source modifications needed (pure client-side measurement)

---

## Microbench 2: PD Transfer Lifecycle

**Design doc**: `microbench/transfer_lifecycle_design.md`

**Produces**: Per-phase latency breakdown + transfer bandwidth model
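
A sketch of how the per-phase breakdown could be assembled from collected timestamps. It relies on the shared clock noted under Dependencies so that timestamps from proxy, P, and D are directly comparable. The event names in the usage note and both function names are hypothetical; the real phase boundaries are defined in the design doc.

```python
def phase_breakdown_ms(events):
    """Turn (event_name, timestamp_s) pairs into durations between consecutive events.

    Events are sorted by timestamp, so input order does not matter.
    """
    ordered = sorted(events, key=lambda e: e[1])
    return {
        f"{a_name}->{b_name}": (b_t - a_t) * 1000.0
        for (a_name, a_t), (b_name, b_t) in zip(ordered, ordered[1:])
    }


def unaccounted_ms(phases, e2e_ms):
    """Independently measured E2E latency minus the sum of instrumented phases."""
    return e2e_ms - sum(phases.values())
```

For example, with events such as `proxy_recv`, `prefill_start`, `prefill_end`, `transfer_end`, and `first_token`, a large `unaccounted_ms` against the client-measured E2E would indicate a phase that is missing instrumentation.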

### Tasks

| # | Task | Status | Notes |
|---|------|--------|-------|
| 2.1 | Write vLLM instrumentation patch (mooncake_connector timestamps) | TODO | ~100 lines, non-invasive logging |
| 2.2 | Write scheduler instrumentation patch (promote timestamps) | TODO | ~20 lines |
| 2.3 | Write proxy instrumentation (routing + dispatch timestamps) | TODO | Already partially in cache_aware_proxy.py |
| 2.4 | Implement cache-seeding script (warm D's prefix cache to target C) | TODO | Send C-token request to D in combined mode |
| 2.5 | Implement lifecycle driver (orchestrates P/D, collects all timestamps) | TODO | |
| 2.6 | Validate on single config (C=0, N=8192, O=1) | TODO | Check all phases sum to E2E |
| 2.7 | Also run same config on combined instance (overhead baseline) | TODO | |
| 2.8 | Run full sweep (6×6×4=144 configs, 5 reps) | TODO | ~2h + 30min cache seeding |
| 2.9 | Verify incremental transfer: bytes_transferred independent of C | TODO | Critical correctness check |
| 2.10 | Fit transfer bandwidth model: `t = α + β × bytes` | TODO | |
| 2.11 | Generate stacked bar charts + overhead comparison plots | TODO | |
| 2.12 | Compute break-even: when does transfer overhead exceed interference cost? | TODO | Combines results from both microbenchmarks |
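
The bandwidth model in task 2.10, `t = α + β × bytes`, is a one-variable linear fit. Below is a dependency-free ordinary least squares sketch; the function name is hypothetical, and `analyze.py` may well use a library fit instead.

```python
def fit_transfer_model(bytes_list, latency_ms_list):
    """Least-squares fit of t = alpha + beta * bytes.

    Returns (alpha_ms, beta_ms_per_byte). alpha is the fixed per-transfer
    overhead; 1 / beta is the effective bandwidth in bytes per millisecond.
    """
    n = len(bytes_list)
    if n < 2 or n != len(latency_ms_list):
        raise ValueError("need two or more paired samples")
    mean_x = sum(bytes_list) / n
    mean_y = sum(latency_ms_list) / n
    sxx = sum((x - mean_x) ** 2 for x in bytes_list)
    if sxx == 0:
        raise ValueError("need at least two distinct transfer sizes")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(bytes_list, latency_ms_list)
    )
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta
```

Fitting per repetition and comparing the resulting `alpha` values is one way to see how noisy the fixed overhead is.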

### Dependencies
- 2× H20 GPUs (P + D) on same machine (shared clock)
- vLLM source patch (tasks 2.1-2.3)
- Mooncake configured for P/D mode

---

## Combined Analysis (After Both Complete)

| # | Task | Status | Notes |
|---|------|--------|-------|
| 3.1 | Build unified offload decision model | TODO | `interference_cost(D,P,chunk) vs transfer_cost(N)` |
| 3.2 | Identify "offload wins" region in (D, N, C) space | TODO | The key deliverable |
| 3.3 | Estimate improvement from layerwise pipeline | TODO | Optimistic bound: `transfer_cost_layerwise ≈ transfer_cost / num_layers` (only the last layer's transfer stays exposed) |
| 3.4 | Quantify maximum possible gain over LMetric | TODO | Upper bound: all requests in "offload wins" region use offload |
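
A sketch of what the unified decision function in task 3.1 could look like once both fitted models exist. It assumes both costs are expressed in the same unit (ms of added latency) and that `P` in the interference model and `N` in the transfer model both denote the request's new prompt tokens; neither assumption is confirmed here. All names are hypothetical.

```python
from typing import Callable


def make_offload_policy(
    interference_cost_ms: Callable[[int, int, int], float],
    transfer_cost_ms: Callable[[int], float],
) -> Callable[[int, int, int], bool]:
    """Build a policy from the two fitted cost models.

    interference_cost_ms(D, P, chunk): fitted output of Microbench 1.
    transfer_cost_ms(N): fitted output of Microbench 2.
    """

    def should_offload(decode_batch_size, new_prefill_tokens, chunk_size):
        # Offload only when shipping the KV cache costs less than co-executing.
        co_exec = interference_cost_ms(
            decode_batch_size, new_prefill_tokens, chunk_size
        )
        return transfer_cost_ms(new_prefill_tokens) < co_exec

    return should_offload
```

Evaluating `should_offload` over a grid of `(D, N)` values gives the "offload wins" region from task 3.2.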

---

## Execution Order

```
Week 1:
  [1.1-1.4] Implement + validate interference microbench
  [2.1-2.3] Write vLLM instrumentation patches

Week 2:
  [1.5-1.7] Run interference sweep + fit model
  [2.4-2.6] Implement + validate lifecycle microbench

Week 3:
  [2.7-2.12] Run lifecycle sweep + analysis + break-even
  [3.1-3.4] Combined analysis → offload decision model
```

**Critical path**: Task 2.1 (vLLM patch) gates all of Microbench 2.
**Quick win**: Microbench 1 needs zero vLLM modifications — can start immediately.

---

## File Structure

```
microbench/
├── interference_microbench_design.md    # Design doc (done)
├── transfer_lifecycle_design.md         # Design doc (done)
├── TODO.md                              # This file
├── interference/
│   ├── driver.py                        # Microbench 1 client
│   ├── analyze.py                       # Fit model + plots
│   └── results/                         # Output JSON + CSV
├── lifecycle/
│   ├── driver.py                        # Microbench 2 orchestrator
│   ├── seed_cache.py                    # D-side cache warming
│   ├── analyze.py                       # Breakdown plots
│   └── results/                         # Output JSON + CSV
├── patches/
│   ├── 0001-connector-profiling.patch   # Mooncake timestamp logging
│   └── 0002-scheduler-profiling.patch   # Scheduler timestamp logging
└── combined/
    ├── decision_model.py                # Unified offload decision function
    └── plots/                           # Final analysis figures
```