agentic-kvc/microbench/interference_microbench_design.md
Gahow Wang f784e49c07 Microbench: prefill-decode interference + PD transfer lifecycle
Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
2026-05-26 00:57:06 +08:00

# Prefill-Decode Interference Microbenchmark
## Goal
Quantify the **per-chunk TPOT degradation** caused by prefill interference on ongoing decode batches, producing a lookup table:
```
f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms
```
This table is the foundation for the runtime offload decision:
```
interference_cost = num_chunks × decode_batch_size × TPOT_penalty
if interference_cost > layerwise_transfer_cost:
    offload()
```
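The decision rule above can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function names `interference_cost_ms` and `should_offload` are hypothetical, `num_chunks` is taken as `ceil(P / chunk_size)` as in the protocol section, and `tpot_penalty_ms` is assumed to come from the lookup table this benchmark produces.

```python
import math


def interference_cost_ms(prefill_tokens: int, chunk_size: int,
                         decode_batch_size: int, tpot_penalty_ms: float) -> float:
    """num_chunks x decode_batch_size x TPOT_penalty, as in the rule above."""
    num_chunks = math.ceil(prefill_tokens / chunk_size)
    return num_chunks * decode_batch_size * tpot_penalty_ms


def should_offload(prefill_tokens: int, chunk_size: int, decode_batch_size: int,
                   tpot_penalty_ms: float, transfer_cost_ms: float) -> bool:
    """Offload when local interference would cost more than the transfer."""
    cost = interference_cost_ms(prefill_tokens, chunk_size,
                                decode_batch_size, tpot_penalty_ms)
    return cost > transfer_cost_ms
```

With no ongoing decode requests (`decode_batch_size=0`) the cost is zero, so this sketch never offloads in that case.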
---
## Hardware & Model
| Parameter | Value |
|-----------|-------|
| GPU | NVIDIA H20 96GB × 1 (single instance) |
| Model | Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active) |
| TP | 1 |
| `max_model_len` | 200000 |
| `block_size` | 16 (vLLM default) |
| `enable_prefix_caching` | true |
| `enable_chunked_prefill` | true |
| `max_num_batched_tokens` | 8192 (H20 default for openai API server) |
| `gpu_memory_utilization` | 0.9 |
---
## Experiment Design
### Independent Variables
| Variable | Values | Rationale |
|----------|--------|-----------|
| `decode_batch_size` (D) | 0, 1, 2, 4, 6, 8, 12 | Covers low→saturated decode concurrency |
| `new_prefill_tokens` (P) | 512, 1024, 2048, 4096, 8192, 16384, 32768 | Range from small warm turn to full cold heavy |
| `chunk_size` | 2048, 4096, 8192 (default), 16384 | Sweep the dominant scheduling knob |
Full sweep: 7 × 7 × 4 = 196 configurations.
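As a quick check of the sweep size, the grid can be enumerated with `itertools.product`. The constant names and dictionary keys below are illustrative choices (the keys mirror the per-run record fields); `chunk_size` is placed outermost because the protocol restarts the instance whenever it changes.

```python
from itertools import product

DECODE_BATCH_SIZES = [0, 1, 2, 4, 6, 8, 12]
PREFILL_TOKENS = [512, 1024, 2048, 4096, 8192, 16384, 32768]
CHUNK_SIZES = [2048, 4096, 8192, 16384]

# chunk_size varies slowest, so each server restart covers all (D, P) pairs.
SWEEP = [
    {"chunk_size": c, "decode_batch_size": d, "new_prefill_tokens": p}
    for c, d, p in product(CHUNK_SIZES, DECODE_BATCH_SIZES, PREFILL_TOKENS)
]

print(len(SWEEP))  # 196
```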
### Dependent Variables (Measured)
| Metric | Definition | How to measure |
|--------|-----------|----------------|
| `TPOT_baseline` | Inter-token latency with decode-only batch (no prefill) | Send D dummy decode requests, measure steady-state TPOT |
| `TPOT_interference` | Inter-token latency while prefill chunks execute | Measure TPOT of ongoing decode requests during the window when prefill chunks are being processed |
| `TPOT_penalty` | `TPOT_interference - TPOT_baseline` | Per-token penalty from prefill co-execution |
| `prefill_duration` | Wall time from prefill request submission to first token | Includes queuing + chunked execution |
| `num_chunks_actual` | Number of scheduler iterations the prefill occupied | From vLLM engine logs or step counter |
| `step_time_baseline` | Scheduler step duration with decode-only | From engine internals or proxy measurement |
| `step_time_mixed` | Scheduler step duration with prefill+decode | Same |
### Control Variables (Fixed per experiment)
| Variable | Value | Rationale |
|----------|-------|-----------|
| Decode output length | 256 tokens each | Long enough to span the entire prefill window |
| Decode context length | 4096 tokens each | Realistic session history, pre-warmed via prefix cache |
| Prefill output length | 1 token | Minimize post-prefill decode interference |
| KV cache state | Prefill is fully cold (no cache hit) | Worst case: maximum chunks |
| Temperature | 0 (greedy) | Deterministic, no sampling variance |
---
## Protocol
### Phase 1: Baseline TPOT Measurement (Decode-Only)
```
1. Launch vLLM instance (TP=1, single H20 GPU)
2. Pre-fill D decode "seed" requests:
   - Each has 4096-token context (pre-warmed via identical prompt prefix)
   - Set max_tokens=256, temperature=0
3. Once all D requests are in active decode, start timer
4. Collect per-token timestamps for each decode request over 256 tokens
5. Compute TPOT_baseline = median(inter-token-intervals) across all D requests
6. Record step_time_baseline from vLLM metrics endpoint (/metrics)
```
**Warm-up**: Discard first 16 tokens per request (CUDA graph warm-up, attention ramp-up).
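Steps 4 and 5 plus the warm-up rule could be computed roughly as below. This is a sketch under assumptions: the helper names are hypothetical, timestamps are the per-token nanosecond lists gathered by the streaming client, and intervals from all D streams are pooled before taking the median.

```python
import statistics

WARMUP_TOKENS = 16  # tokens discarded per request, per the warm-up rule


def inter_token_intervals_ms(timestamps_ns: list[int],
                             warmup: int = WARMUP_TOKENS) -> list[float]:
    """Gaps between consecutive tokens in ms, after dropping warm-up tokens."""
    kept = timestamps_ns[warmup:]
    return [(b - a) / 1e6 for a, b in zip(kept, kept[1:])]


def baseline_tpot_ms(streams: list[list[int]]) -> float:
    """Median inter-token interval pooled across all decode streams."""
    pooled = [iv for ts in streams for iv in inter_token_intervals_ms(ts)]
    return statistics.median(pooled)
```

`statistics.median` raises `StatisticsError` on an empty list, so a stream shorter than the warm-up length surfaces as an error rather than a silent zero.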
### Phase 2: Interference Measurement (Prefill Injected)
```
1. Same setup as Phase 1: D decode requests in steady-state
2. At token ~32 of the decode stream, inject prefill request:
   - Input: P random tokens (no prefix cache hit)
   - max_tokens=1
3. Continue collecting per-token timestamps for all D decode requests
4. Measure:
   a. TPOT of decode requests DURING prefill window
      (from prefill injection to prefill's first token)
   b. TPOT of decode requests AFTER prefill completes (recovery)
   c. Total prefill_duration
   d. num_chunks = ceil(P / chunk_size) [verify against actual]
```
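One way to separate measurements 4a and 4b is to classify each inter-token interval by when its closing token arrived. The function name `split_by_prefill_window` is hypothetical, and attributing an interval to the window in which it ends is an assumption; intervals that straddle a boundary may deserve different handling.

```python
def split_by_prefill_window(timestamps_ns: list[int], inject_ns: int,
                            prefill_first_token_ns: int):
    """Return (during, after) interval lists in ms for one decode stream.

    `inject_ns` is when the prefill request was sent; `prefill_first_token_ns`
    is when its first token arrived. Both use the same clock as the stream.
    """
    during, after = [], []
    for prev, cur in zip(timestamps_ns, timestamps_ns[1:]):
        interval_ms = (cur - prev) / 1e6
        if inject_ns < cur <= prefill_first_token_ns:
            during.append(interval_ms)
        elif cur > prefill_first_token_ns:
            after.append(interval_ms)
        # Intervals ending at or before injection are pre-injection decode.
    return during, after
```

The median of `during` minus the baseline TPOT then gives a per-stream penalty estimate, and `after` shows whether decode latency recovers.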
### Phase 3: Repeat for All Configurations
```
For chunk_size in [2048, 4096, 8192, 16384]:
    Configure vLLM with --max-num-batched-tokens=chunk_size
    Restart instance (clean KV cache state)
    For D in [0, 1, 2, 4, 6, 8, 12]:
        For P in [512, 1024, 2048, 4096, 8192, 16384, 32768]:
            Run Phase 1 (baseline) → record TPOT_baseline[D]
            Run Phase 2 (interference) → record TPOT_interference[D, P]
            Compute TPOT_penalty[D, P] = TPOT_interference - TPOT_baseline
            Wait 5s for KV eviction and state cleanup
```
**Optimization**: Phase 1 only needs to run once per (chunk_size, D) pair since it doesn't depend on P.
---
## Implementation
### Client Architecture
```
┌──────────────────────────────────────────────────┐
│                Microbench Driver                 │
├──────────────────────────────────────────────────┤
│  1. Spawn D "background decode" streams (async)  │
│  2. Wait for steady-state (all D in decode)      │
│  3. Inject prefill request                       │
│  4. Collect streaming token timestamps           │
│  5. Compute metrics                              │
└──────────────────────────────────────────────────┘
                         │ OpenAI-compatible streaming API
                         ▼
┌──────────────────────────────────────────────────┐
│            vLLM Instance (single GPU)            │
│  --enable-chunked-prefill                        │
│  --max-num-batched-tokens={chunk_size}           │
│  --enable-prefix-caching                         │
└──────────────────────────────────────────────────┘
```
### Request Construction
**Decode seed requests** (to create ongoing decode batch):
```python
{
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": FIXED_4K_PROMPT}],
    "max_tokens": 256,
    "temperature": 0,
    "stream": True
}
```
All D requests share the same 4K prompt prefix (ensures prefix cache hit → instant prefill for seeds, isolating decode-only behavior).
**Interference prefill request**:
```python
{
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": RANDOM_P_TOKEN_PROMPT}],
    "max_tokens": 1,
    "temperature": 0,
    "stream": True
}
```
Use random content (UUID-based) to guarantee zero prefix cache hit → forces full P-token prefill.
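A cold prompt along these lines might be built as below. The helper name `make_cold_prompt` and the `num_segments` parameter are hypothetical. The leading nonce combines a UUID with `time.time_ns()` so that no two calls share a prefix; how many tokens each segment yields depends on the tokenizer, so `num_segments` must be calibrated against the real tokenizer to hit a target P.

```python
import time
import uuid


def make_cold_prompt(num_segments: int) -> str:
    """Build a prompt that starts with a per-call unique nonce.

    A unique leading nonce means the prompt text diverges from every earlier
    prompt at the very start, so a shared cached prefix is not expected.
    """
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    # Fresh random filler; token count per segment is tokenizer-dependent.
    body = " ".join(uuid.uuid4().hex for _ in range(num_segments))
    return f"{nonce} {body}"
```

Generating a fresh prompt for every repetition, rather than once per configuration, keeps later repetitions from hitting blocks cached by earlier ones.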
### Timestamp Collection
Use SSE streaming with `time.perf_counter_ns()` on each `data: {"choices":[{"delta":...}]}` chunk:
```python
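import time

import aiohttp


async def collect_stream(session: aiohttp.ClientSession, url: str, payload: dict) -> list[int]:
    """Return nanosecond timestamps, one per SSE data chunk (roughly one per token)."""
    timestamps = []
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()
        # Iterating resp.content yields the response body line by line.
        async for line in resp.content:
            if line.startswith(b"data: ") and b"[DONE]" not in line:
                timestamps.append(time.perf_counter_ns())
    return timestamps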
```
### Steady-State Detection
Before injecting prefill, verify all D requests are in active decode:
1. Wait until each stream has emitted ≥ 32 tokens
2. Check that the last 8 inter-token intervals are within 2× of each other (no startup variance)
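The two checks above could be expressed as a single predicate per stream. The name `is_steady_state` is hypothetical, and "within 2× of each other" is interpreted here as the longest of the last 8 intervals being at most twice the shortest.

```python
def is_steady_state(timestamps_ns: list[int], min_tokens: int = 32,
                    window: int = 8, max_spread: float = 2.0) -> bool:
    """True once a stream has enough tokens and its recent intervals are stable."""
    # window intervals need window + 1 timestamps.
    if len(timestamps_ns) < max(min_tokens, window + 1):
        return False
    recent = timestamps_ns[-(window + 1):]
    intervals = [b - a for a, b in zip(recent, recent[1:])]
    shortest = min(intervals)
    if shortest <= 0:
        # Identical timestamps make the ratio meaningless; treat as unsteady.
        return False
    return max(intervals) <= max_spread * shortest
```

The driver would inject the prefill only once this returns `True` for all D streams.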
---
## Output Format
### Per-Run Record (`results/{chunk_size}/D{d}_P{p}.json`)
```json
{
  "config": {
    "decode_batch_size": 4,
    "new_prefill_tokens": 8192,
    "chunk_size": 8192,
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "gpu": "H20"
  },
  "baseline": {
    "tpot_p50_ms": 42.3,
    "tpot_p90_ms": 45.1,
    "tpot_p99_ms": 48.7,
    "step_time_ms": 43.0
  },
  "interference": {
    "tpot_during_prefill_p50_ms": 89.2,
    "tpot_during_prefill_p90_ms": 95.4,
    "tpot_after_prefill_p50_ms": 43.1,
    "num_chunks_actual": 1,
    "prefill_duration_ms": 91.0,
    "prefill_ttft_ms": 91.0
  },
  "derived": {
    "tpot_penalty_p50_ms": 46.9,
    "tpot_penalty_ratio": 1.11,
    "total_interference_ms": 46.9,
    "decode_tokens_delayed": 4
  }
}
```
### Aggregated Table (`results/interference_table.csv`)
```csv
chunk_size,decode_batch_size,new_prefill_tokens,tpot_baseline_ms,tpot_interference_ms,tpot_penalty_ms,penalty_ratio,num_chunks,prefill_duration_ms
8192,4,8192,42.3,89.2,46.9,1.11,1,91.0
8192,4,16384,42.3,88.5,46.2,1.09,2,178.3
8192,8,8192,78.1,156.3,78.2,1.00,1,159.0
...
```
---
## Analysis Deliverables
### 1. Interference Heatmap
X-axis: `new_prefill_tokens`, Y-axis: `decode_batch_size`, Color: `tpot_penalty_ratio`
Expected pattern:
- Penalty increases with decode_batch_size (more requests disrupted)
- Penalty per-request is roughly constant for same chunk_size (step time is dominated by the larger of prefill-chunk or decode-batch)
- Penalty increases with num_chunks (more disrupted iterations)
### 2. Total Interference Cost Model
```
total_interference_cost(D, P, chunk_size) =
    num_chunks(P, chunk_size) × D × tpot_penalty_per_chunk(D, chunk_size)
```
If the model fits well (R² > 0.9), it becomes the offload decision function.
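The goodness-of-fit check can be done without extra dependencies. The sketch below computes the coefficient of determination between observed and model-predicted totals; `r_squared` is a hypothetical helper, and what counts as the "observed" total interference (for example, summed extra decode latency during the prefill window) is an assumption left to the analysis.

```python
def r_squared(observed: list[float], predicted: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    # ss_tot is zero when all observations are identical; R² is undefined then.
    return 1.0 - ss_res / ss_tot
```

A value of 1.0 means the model reproduces every observation; predicting the mean everywhere gives 0.0.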
### 3. Break-Even Analysis
For each (D, P, chunk_size), compute:
```
break_even_transfer_time = total_interference_cost(D, P, chunk_size)
```
If layerwise pipeline transfer cost < break_even_transfer_time, offload wins.
Plot: "offload wins" region in the (D, P) space for chunk_size=8192.
### 4. Sensitivity to chunk_size
How does `--max-num-batched-tokens` (effective chunk size) trade off:
- Smaller chunk → more chunks → longer total prefill → more interrupted decode steps, but each step is shorter
- Larger chunk → fewer chunks → shorter total prefill → fewer interrupted steps, but each step takes longer
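To make the chunk-count side of this trade-off concrete, the nominal number of chunks for a given P follows `ceil(P / chunk_size)`. The helper `chunks_per_size` is illustrative; the real scheduler may produce a different count when decode tokens share the per-step token budget, which is why the protocol verifies against the actual value.

```python
import math

CHUNK_SIZES = (2048, 4096, 8192, 16384)


def chunks_per_size(prefill_tokens: int) -> dict[int, int]:
    """Nominal chunk count for each swept chunk size."""
    return {c: math.ceil(prefill_tokens / c) for c in CHUNK_SIZES}


print(chunks_per_size(32768))  # {2048: 16, 4096: 8, 8192: 4, 16384: 2}
```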
---
## Risks & Mitigations
| Risk | Impact | Mitigation |
|------|--------|------------|
| CUDA graph optimization masks real penalty | Underestimate interference | Run with `--enforce-eager` as ablation |
| vLLM internal batching merges decode+prefill differently than expected | Wrong chunk count | Verify with `/metrics` endpoint (`vllm:num_prefill_tokens_iter`) |
| Network jitter in timestamp collection | Noisy TPOT | Run on localhost (127.0.0.1), use 5 repetitions per config |
| KV cache pressure at high D | OOM or eviction | Keep decode context at 4K, monitor `gpu_cache_usage_perc` |
| MoE routing variance | Non-deterministic step time | Use greedy decoding, report p50 over 5 runs |
---
## Execution Estimate
| Phase | Time |
|-------|------|
| Per configuration (1 baseline + 1 interference, 5 reps) | ~3 min |
| Full sweep (196 configs × 3 min) | ~10 hours |
| Reduced sweep (chunk_size=8192 only, 49 configs) | ~2.5 hours |
| Analysis & plotting | 1 hour |
**Recommended**: Start with reduced sweep (default chunk_size=8192), then expand to other chunk sizes if results are promising.
---
## Success Criteria
1. **Interference is measurable**: `tpot_penalty_ratio > 1.1` for D ≥ 4 and P ≥ 4096
2. **Model fits**: Linear or polynomial model of `total_interference_cost` achieves R² > 0.85
3. **Break-even exists**: There exists a realistic (D, P) region where `interference_cost > 50ms` (layerwise pipeline transfer budget)
4. **Reproducible**: Coefficient of variation < 15% across 5 repetitions per config
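The reproducibility criterion can be checked per configuration as sketched below. The function names are hypothetical, and using the sample standard deviation (n-1 denominator) is an assumption; the document does not specify which estimator to use.

```python
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def is_reproducible(samples: list[float], max_cv: float = 0.15) -> bool:
    """True when the repetitions of one config vary by less than max_cv."""
    return coefficient_of_variation(samples) < max_cv
```

Here `samples` would hold one metric, such as the p50 TPOT penalty, from each of the 5 repetitions of a configuration.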