Two microbenchmarks quantifying the elastic offload decision:
1. Interference (corrected): cold prefill causes 14-214x TPOT p90
degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
Earlier run had a prefix-cache bug (deterministic prompts hit cache
after rep 0); fixed with uuid+time_ns unique prompts.
2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
measuring prefill→RDMA→decode startup overhead.
Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
# Prefill-Decode Interference Microbenchmark

## Goal

Quantify the **per-chunk TPOT degradation** caused by prefill interference on ongoing decode batches, producing a lookup table:

```
f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms
```

This table is the foundation for the runtime offload decision:

```
interference_cost = num_chunks × decode_batch_size × TPOT_penalty
if interference_cost > layerwise_transfer_cost:
    offload()
```
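A minimal Python sketch of the decision rule above, assuming the lookup table is held as a dict keyed by `(decode_batch_size, new_prefill_tokens, chunk_size)`. The names `PENALTY_TABLE_MS`, `interference_cost_ms`, and `should_offload` are hypothetical. The two table entries are the illustrative penalties from the sample CSV rows in this document, and the 50 ms transfer cost is the budget named in the success criteria, not a measurement.

```python
import math

# Hypothetical lookup table: (D, P, chunk_size) -> TPOT penalty in ms.
# Values are the illustrative sample rows from this document, not measurements.
PENALTY_TABLE_MS = {
    (4, 8192, 8192): 46.9,
    (4, 16384, 8192): 46.2,
}


def interference_cost_ms(d, p, chunk_size, table=PENALTY_TABLE_MS):
    """num_chunks x decode_batch_size x TPOT_penalty, as in the rule above."""
    num_chunks = math.ceil(p / chunk_size)
    return num_chunks * d * table[(d, p, chunk_size)]


def should_offload(d, p, chunk_size, transfer_cost_ms):
    """Offload when keeping the prefill local costs more than the transfer."""
    return interference_cost_ms(d, p, chunk_size) > transfer_cost_ms


# 1 chunk x 4 decode requests x 46.9 ms = 187.6 ms, above the 50 ms budget.
print(should_offload(4, 8192, 8192, transfer_cost_ms=50.0))  # True
```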
---

## Hardware & Model

| Parameter | Value |
|-----------|-------|
| GPU | NVIDIA H20 96GB × 1 (single instance) |
| Model | Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active) |
| TP | 1 |
| `max_model_len` | 200000 |
| `block_size` | 16 (vLLM default) |
| `enable_prefix_caching` | true |
| `enable_chunked_prefill` | true |
| `max_num_batched_tokens` | 8192 (H20 default for openai API server) |
| `gpu_memory_utilization` | 0.9 |

---
## Experiment Design

### Independent Variables

| Variable | Values | Rationale |
|----------|--------|-----------|
| `decode_batch_size` (D) | 0, 1, 2, 4, 6, 8, 12 | Covers low→saturated decode concurrency |
| `new_prefill_tokens` (P) | 512, 1024, 2048, 4096, 8192, 16384, 32768 | Ranges from a small warm turn to a heavy, fully cold prefill |
| `chunk_size` | 2048, 4096, 8192 (default), 16384 | Sweeps the dominant scheduling knob |

Full sweep: 7 × 7 × 4 = 196 configurations.
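The grid can be enumerated directly from the table. The sketch below uses hypothetical names (`sweep_configs` and the three constants) and puts `chunk_size` outermost, on the assumption that it is the only variable that requires a server restart.

```python
from itertools import product

DECODE_BATCH_SIZES = [0, 1, 2, 4, 6, 8, 12]
PREFILL_TOKENS = [512, 1024, 2048, 4096, 8192, 16384, 32768]
CHUNK_SIZES = [2048, 4096, 8192, 16384]


def sweep_configs(chunk_sizes=CHUNK_SIZES):
    """Yield (chunk_size, D, P) tuples with chunk_size varying slowest."""
    yield from product(chunk_sizes, DECODE_BATCH_SIZES, PREFILL_TOKENS)


configs = list(sweep_configs())
print(len(configs))                      # 196
print(len(list(sweep_configs([8192]))))  # 49
```

Passing a single chunk size gives the 49-configuration slice for that setting.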
### Dependent Variables (Measured)

| Metric | Definition | How to measure |
|--------|-----------|----------------|
| `TPOT_baseline` | Inter-token latency with decode-only batch (no prefill) | Send D dummy decode requests, measure steady-state TPOT |
| `TPOT_interference` | Inter-token latency while prefill chunks execute | Measure TPOT of ongoing decode requests during the window when prefill chunks are being processed |
| `TPOT_penalty` | `TPOT_interference - TPOT_baseline` | Per-token penalty from prefill co-execution |
| `prefill_duration` | Wall time from prefill request submission to first token | Includes queuing + chunked execution |
| `num_chunks_actual` | Number of scheduler iterations the prefill occupied | From vLLM engine logs or step counter |
| `step_time_baseline` | Scheduler step duration with decode-only | From engine internals or proxy measurement |
| `step_time_mixed` | Scheduler step duration with prefill+decode | Same |

### Control Variables (Fixed per experiment)

| Variable | Value | Rationale |
|----------|-------|-----------|
| Decode output length | 256 tokens each | Long enough to span the entire prefill window |
| Decode context length | 4096 tokens each | Realistic session history, pre-warmed via prefix cache |
| Prefill output length | 1 token | Minimize post-prefill decode interference |
| KV cache state | Prefill is fully cold (no cache hit) | Worst case: maximum chunks |
| Temperature | 0 (greedy) | Deterministic, no sampling variance |

---
## Protocol

### Phase 1: Baseline TPOT Measurement (Decode-Only)

```
1. Launch vLLM instance (TP=1, single H20 GPU)
2. Pre-fill D decode "seed" requests:
   - Each has 4096-token context (pre-warmed via identical prompt prefix)
   - Set max_tokens=256, temperature=0
3. Once all D requests are in active decode, start timer
4. Collect per-token timestamps for each decode request over 256 tokens
5. Compute TPOT_baseline = median(inter-token-intervals) across all D requests
6. Record step_time_baseline from vLLM metrics endpoint (/metrics)
```

**Warm-up**: Discard first 16 tokens per request (CUDA graph warm-up, attention ramp-up).
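Step 5 and the warm-up rule could be computed as in the sketch below. It assumes each stream is a list of `perf_counter_ns()` timestamps, one per token, and pools the intervals of all D streams before taking the median; per-stream medians would be an equally defensible reading. The function names are hypothetical.

```python
import statistics

WARMUP_TOKENS = 16  # tokens discarded per request, per the warm-up rule


def inter_token_intervals_ms(timestamps_ns, warmup=WARMUP_TOKENS):
    """Intervals between consecutive tokens, after dropping warm-up tokens."""
    kept = timestamps_ns[warmup:]
    return [(later - earlier) / 1e6 for earlier, later in zip(kept, kept[1:])]


def tpot_baseline_ms(streams):
    """Median inter-token interval pooled across all D decode streams."""
    pooled = [iv for ts in streams for iv in inter_token_intervals_ms(ts)]
    return statistics.median(pooled)
```

With two synthetic streams that emit a token every 40 ms, `tpot_baseline_ms` returns 40.0.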
### Phase 2: Interference Measurement (Prefill Injected)

```
1. Same setup as Phase 1: D decode requests in steady-state
2. At token ~32 of the decode stream, inject prefill request:
   - Input: P random tokens (no prefix cache hit)
   - max_tokens=1
3. Continue collecting per-token timestamps for all D decode requests
4. Measure:
   a. TPOT of decode requests DURING prefill window
      (from prefill injection to prefill's first token)
   b. TPOT of decode requests AFTER prefill completes (recovery)
   c. Total prefill_duration
   d. num_chunks = ceil(P / chunk_size)  [verify against actual]
```
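One way to separate measurements 4a and 4b is to classify each inter-token interval by whether it overlaps the prefill window. The sketch below is an assumption about how to draw that boundary (any overlap counts as "during"), and the function names are hypothetical. The expected chunk count is only the nominal value; if decode tokens share the per-step token budget, the actual count may be higher, which is presumably why the protocol asks to verify it.

```python
import math


def split_intervals_ms(timestamps_ns, inject_ns, first_token_ns):
    """Split decode intervals into those overlapping the prefill window and those after it."""
    during, after = [], []
    for prev, cur in zip(timestamps_ns, timestamps_ns[1:]):
        if cur <= inject_ns:
            continue  # interval ended before the prefill was injected
        interval_ms = (cur - prev) / 1e6
        # An interval that started before the prefill's first token overlaps the window.
        (during if prev < first_token_ns else after).append(interval_ms)
    return during, after


def expected_num_chunks(p, chunk_size):
    """Nominal chunk count, to be compared with num_chunks_actual."""
    return math.ceil(p / chunk_size)
```

Percentiles of `during` give the "during prefill" TPOT figures, and `after` gives the recovery figures.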
### Phase 3: Repeat for All Configurations

```
For chunk_size in [2048, 4096, 8192, 16384]:
    Configure vLLM with --max-num-batched-tokens=chunk_size
    Restart instance (clean KV cache state)

    For D in [0, 1, 2, 4, 6, 8, 12]:
        For P in [512, 1024, 2048, 4096, 8192, 16384, 32768]:
            Run Phase 1 (baseline) → record TPOT_baseline[D]
            Run Phase 2 (interference) → record TPOT_interference[D, P]
            Compute TPOT_penalty[D, P] = TPOT_interference - TPOT_baseline
            Wait 5s for KV eviction and state cleanup
```

**Optimization**: Phase 1 only needs to run once per (chunk_size, D) pair since it doesn't depend on P.

---
## Implementation

### Client Architecture

```
┌──────────────────────────────────────────────────┐
│  Microbench Driver                               │
├──────────────────────────────────────────────────┤
│  1. Spawn D "background decode" streams (async)  │
│  2. Wait for steady-state (all D in decode)      │
│  3. Inject prefill request                       │
│  4. Collect streaming token timestamps           │
│  5. Compute metrics                              │
└──────────────────────────────────────────────────┘
                          │  OpenAI-compatible streaming API
                          ▼
┌──────────────────────────────────────────────────┐
│  vLLM Instance (single GPU)                      │
│  --enable-chunked-prefill                        │
│  --max-num-batched-tokens={chunk_size}           │
│  --enable-prefix-caching                         │
└──────────────────────────────────────────────────┘
```
### Request Construction

**Decode seed requests** (to create ongoing decode batch):

```python
{
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": FIXED_4K_PROMPT}],
    "max_tokens": 256,
    "temperature": 0,
    "stream": True
}
```

All D requests share the same 4K prompt prefix (ensures prefix cache hit → instant prefill for seeds, isolating decode-only behavior).

**Interference prefill request**:

```python
{
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": RANDOM_P_TOKEN_PROMPT}],
    "max_tokens": 1,
    "temperature": 0,
    "stream": True
}
```

Use random content (UUID-based) to guarantee zero prefix cache hit → forces full P-token prefill.
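A sketch of such a prompt generator, combining a UUID with `time.time_ns()` as in the fix described in the summary at the top of this document. The function name `unique_prompt` and the word-based sizing are assumptions: words are not tokens, so the length would still need to be calibrated against the model's tokenizer to hit exactly P tokens.

```python
import time
import uuid


def unique_prompt(num_words):
    """Build a prompt that differs from every earlier one starting at its first word."""
    # The nonce goes first so two prompts never share a leading prefix of content.
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    filler = " ".join(uuid.uuid4().hex[:8] for _ in range(num_words))
    return f"{nonce} {filler}"
```

Placing the nonce at the start matters because prefix caching matches from the beginning of the sequence; a unique suffix alone would still leave the shared leading blocks cacheable.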
### Timestamp Collection

Use SSE streaming with `time.perf_counter_ns()` on each `data: {"choices":[{"delta":...}]}` chunk:

```python
import time

import aiohttp

async def collect_stream(session: aiohttp.ClientSession, url: str, payload: dict) -> list[int]:
    """Return nanosecond timestamps, one per streamed chunk (assumed to carry one token each)."""
    timestamps = []
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()
        # aiohttp's StreamReader yields the response body line by line.
        async for line in resp.content:
            # Skip blank separator lines and the final "data: [DONE]" sentinel.
            if line.startswith(b"data: ") and b"[DONE]" not in line:
                timestamps.append(time.perf_counter_ns())
    return timestamps
```
### Steady-State Detection

Before injecting prefill, verify all D requests are in active decode:

1. Wait until each stream has emitted ≥ 32 tokens
2. Check that the last 8 inter-token intervals are within 2× of each other (no startup variance)
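The two checks can be expressed as a single predicate per stream. This sketch reads "within 2× of each other" as the largest of the last 8 intervals being at most twice the smallest; the function name and parameter names are hypothetical.

```python
def is_steady_state(timestamps_ns, min_tokens=32, window=8, max_ratio=2.0):
    """True once a stream has enough tokens and its recent intervals are stable."""
    if len(timestamps_ns) < max(min_tokens, window + 1):
        return False
    recent = timestamps_ns[-(window + 1):]  # window + 1 timestamps give window intervals
    intervals = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return min(intervals) > 0 and max(intervals) <= max_ratio * min(intervals)


# The driver would inject the prefill only when this holds for every stream, e.g.:
# if all(is_steady_state(ts) for ts in streams): inject_prefill()
```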
---

## Output Format

### Per-Run Record (`results/{chunk_size}/D{d}_P{p}.json`)

```json
{
  "config": {
    "decode_batch_size": 4,
    "new_prefill_tokens": 8192,
    "chunk_size": 8192,
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "gpu": "H20"
  },
  "baseline": {
    "tpot_p50_ms": 42.3,
    "tpot_p90_ms": 45.1,
    "tpot_p99_ms": 48.7,
    "step_time_ms": 43.0
  },
  "interference": {
    "tpot_during_prefill_p50_ms": 89.2,
    "tpot_during_prefill_p90_ms": 95.4,
    "tpot_after_prefill_p50_ms": 43.1,
    "num_chunks_actual": 1,
    "prefill_duration_ms": 91.0,
    "prefill_ttft_ms": 91.0
  },
  "derived": {
    "tpot_penalty_p50_ms": 46.9,
    "tpot_penalty_ratio": 1.11,
    "total_interference_ms": 46.9,
    "decode_tokens_delayed": 4
  }
}
```
### Aggregated Table (`results/interference_table.csv`)

```csv
chunk_size,decode_batch_size,new_prefill_tokens,tpot_baseline_ms,tpot_interference_ms,tpot_penalty_ms,penalty_ratio,num_chunks,prefill_duration_ms
8192,4,8192,42.3,89.2,46.9,1.11,1,91.0
8192,4,16384,42.3,88.5,46.2,1.09,2,178.3
8192,8,8192,78.1,156.3,78.2,1.00,1,159.0
...
```
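The sample rows are consistent with `tpot_penalty_ms` being interference minus baseline and `penalty_ratio` being that penalty divided by the baseline. The sketch below assumes exactly that reading (the document does not define the ratio explicitly), with a hypothetical helper name `derive`.

```python
def derive(baseline_p50_ms, interference_p50_ms):
    """Return (tpot_penalty_ms, penalty_ratio), rounded as in the sample rows."""
    penalty_ms = interference_p50_ms - baseline_p50_ms
    return round(penalty_ms, 1), round(penalty_ms / baseline_p50_ms, 2)


# Reproduces the three illustrative rows above.
print(derive(42.3, 89.2))   # (46.9, 1.11)
print(derive(42.3, 88.5))   # (46.2, 1.09)
print(derive(78.1, 156.3))  # (78.2, 1.0)
```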
---

## Analysis Deliverables

### 1. Interference Heatmap

X-axis: `new_prefill_tokens`, Y-axis: `decode_batch_size`, Color: `tpot_penalty_ratio`

Expected pattern:

- Penalty increases with decode_batch_size (more requests disrupted)
- Penalty per-request is roughly constant for same chunk_size (step time is dominated by the larger of prefill-chunk or decode-batch)
- Penalty increases with num_chunks (more disrupted iterations)
### 2. Total Interference Cost Model

```
total_interference_cost(D, P, chunk_size) =
    num_chunks(P, chunk_size) × D × tpot_penalty_per_chunk(D, chunk_size)
```

If the model fits well (R² > 0.9), it becomes the offload decision function.
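The fit check can be done with the standard coefficient of determination. This is a generic sketch, not the author's analysis script: `observed` would hold measured total interference per configuration and `predicted` the model's output for the same configurations, both hypothetical inputs here.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives 1.0, and predicting the mean of the observations for every configuration gives 0.0.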
### 3. Break-Even Analysis

For each (D, P, chunk_size), compute:

```
break_even_transfer_time = total_interference_cost(D, P, chunk_size)
```

If layerwise pipeline transfer cost < break_even_transfer_time, offload wins.

Plot: "offload wins" region in the (D, P) space for chunk_size=8192.

### 4. Sensitivity to chunk_size

How `--max-num-batched-tokens` (the effective chunk size) trades off:

- Smaller chunk → more chunks → longer total prefill → more interrupted decode steps, but each step is shorter
- Larger chunk → fewer chunks → shorter total prefill → fewer interrupted steps, but each step takes longer

---
## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| CUDA graph optimization masks real penalty | Underestimate interference | Run with `--enforce-eager` as ablation |
| vLLM internal batching merges decode+prefill differently than expected | Wrong chunk count | Verify with `/metrics` endpoint (`vllm:num_prefill_tokens_iter`) |
| Network jitter in timestamp collection | Noisy TPOT | Run on localhost (127.0.0.1), use 5 repetitions per config |
| KV cache pressure at high D | OOM or eviction | Keep decode context at 4K, monitor `gpu_cache_usage_perc` |
| MoE routing variance | Non-deterministic step time | Use greedy decoding, report p50 over 5 runs |

---
## Execution Estimate

| Phase | Time |
|-------|------|
| Per configuration (1 baseline + 1 interference, 5 reps) | ~3 min |
| Full sweep (196 configs × 3 min) | ~10 hours |
| Reduced sweep (chunk_size=8192 only, 49 configs) | ~2.5 hours |
| Analysis & plotting | 1 hour |

**Recommended**: Start with reduced sweep (default chunk_size=8192), then expand to other chunk sizes if results are promising.

---
## Success Criteria

1. **Interference is measurable**: `tpot_penalty_ratio > 1.1` for D ≥ 4 and P ≥ 4096
2. **Model fits**: Linear or polynomial model of `total_interference_cost` achieves R² > 0.85
3. **Break-even exists**: There exists a realistic (D, P) region where `interference_cost > 50ms` (layerwise pipeline transfer budget)
4. **Reproducible**: Coefficient of variation < 15% across 5 repetitions per config
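Criterion 4 could be checked per configuration as sketched below. The sketch assumes the coefficient of variation is the sample standard deviation divided by the mean of the 5 per-repetition penalty values; the function names and the example numbers are hypothetical.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def is_reproducible(samples, threshold=0.15):
    """True when the spread across repetitions stays under the threshold."""
    return coefficient_of_variation(samples) < threshold


# Hypothetical penalties (ms) from 5 repetitions of one configuration.
print(is_reproducible([46.9, 47.5, 45.8, 46.2, 48.0]))  # True
```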