Two microbenchmarks quantifying the elastic offload decision:
1. Interference (corrected): cold prefill causes 14-214x TPOT p90
degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
Earlier run had a prefix-cache bug (deterministic prompts hit cache
after rep 0); fixed with uuid+time_ns unique prompts.
2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
measuring prefill→RDMA→decode startup overhead.
Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
Prefill-Decode Interference Microbenchmark
Goal
Quantify the per-chunk TPOT degradation caused by prefill interference on ongoing decode batches, producing a lookup table:
f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms
This table is the foundation for the runtime offload decision:
interference_cost = num_chunks × decode_batch_size × TPOT_penalty
if interference_cost > layerwise_transfer_cost:
    offload()
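The decision rule above can be sketched in Python as follows. This is a minimal illustration, not the runtime implementation: the function name `should_offload` and its parameter names are hypothetical, `tpot_penalty_ms` is assumed to come from the lookup table, and `num_chunks` is approximated as `ceil(P / chunk_size)`.

```python
import math


def should_offload(decode_batch_size: int, new_prefill_tokens: int, chunk_size: int,
                   tpot_penalty_ms: float, layerwise_transfer_cost_ms: float) -> bool:
    """Offload when the estimated interference cost exceeds the transfer cost."""
    # Assumption: the prefill occupies ceil(P / chunk_size) scheduler iterations
    num_chunks = math.ceil(new_prefill_tokens / chunk_size)
    # Every ongoing decode request is delayed by the penalty once per chunk
    interference_cost_ms = num_chunks * decode_batch_size * tpot_penalty_ms
    return interference_cost_ms > layerwise_transfer_cost_ms
```

With no ongoing decode requests (D = 0) the interference cost is zero, so this rule never offloads in that case.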
Hardware & Model
| Parameter | Value |
|---|---|
| GPU | NVIDIA H20 96GB × 1 (single instance) |
| Model | Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active) |
| TP | 1 |
| `max_model_len` | 200000 |
| `block_size` | 16 (vLLM default) |
| `enable_prefix_caching` | true |
| `enable_chunked_prefill` | true |
| `max_num_batched_tokens` | 8192 (H20 default for openai API server) |
| `gpu_memory_utilization` | 0.9 |
Experiment Design
Independent Variables
| Variable | Values | Rationale |
|---|---|---|
| `decode_batch_size` (D) | 0, 1, 2, 4, 6, 8, 12 | Covers low→saturated decode concurrency |
| `new_prefill_tokens` (P) | 512, 1024, 2048, 4096, 8192, 16384, 32768 | Range from small warm turn to full cold heavy |
| `chunk_size` | 2048, 4096, 8192 (default), 16384 | Sweep the dominant scheduling knob |
Full sweep: 7 × 7 × 4 = 196 configurations.
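As a quick check of the configuration count, the sweep can be enumerated directly. This is a small sketch; the constant names and dictionary keys are illustrative choices that reuse the variable names from the table.

```python
from itertools import product

DECODE_BATCH_SIZES = [0, 1, 2, 4, 6, 8, 12]
NEW_PREFILL_TOKENS = [512, 1024, 2048, 4096, 8192, 16384, 32768]
CHUNK_SIZES = [2048, 4096, 8192, 16384]

# One dict per (chunk_size, D, P) combination, chunk_size outermost
sweep = [
    {"chunk_size": c, "decode_batch_size": d, "new_prefill_tokens": p}
    for c, d, p in product(CHUNK_SIZES, DECODE_BATCH_SIZES, NEW_PREFILL_TOKENS)
]
print(len(sweep))  # 196
```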
Dependent Variables (Measured)
| Metric | Definition | How to measure |
|---|---|---|
| `TPOT_baseline` | Inter-token latency with decode-only batch (no prefill) | Send D dummy decode requests, measure steady-state TPOT |
| `TPOT_interference` | Inter-token latency while prefill chunks execute | Measure TPOT of ongoing decode requests during the window when prefill chunks are being processed |
| `TPOT_penalty` | `TPOT_interference - TPOT_baseline` | Per-token penalty from prefill co-execution |
| `prefill_duration` | Wall time from prefill request submission to first token | Includes queuing + chunked execution |
| `num_chunks_actual` | Number of scheduler iterations the prefill occupied | From vLLM engine logs or step counter |
| `step_time_baseline` | Scheduler step duration with decode-only | From engine internals or proxy measurement |
| `step_time_mixed` | Scheduler step duration with prefill+decode | Same |
Control Variables (Fixed per experiment)
| Variable | Value | Rationale |
|---|---|---|
| Decode output length | 256 tokens each | Long enough to span the entire prefill window |
| Decode context length | 4096 tokens each | Realistic session history, pre-warmed via prefix cache |
| Prefill output length | 1 token | Minimize post-prefill decode interference |
| KV cache state | Prefill is fully cold (no cache hit) | Worst case: maximum chunks |
| Temperature | 0 (greedy) | Deterministic, no sampling variance |
Protocol
Phase 1: Baseline TPOT Measurement (Decode-Only)
1. Launch vLLM instance (TP=1, single H20 GPU)
2. Submit D decode "seed" requests:
   - Each has 4096-token context (pre-warmed via identical prompt prefix)
   - Set max_tokens=256, temperature=0
3. Once all D requests are in active decode, start timer
4. Collect per-token timestamps for each decode request over 256 tokens
5. Compute TPOT_baseline = median(inter-token-intervals) across all D requests
6. Record step_time_baseline from vLLM metrics endpoint (/metrics)
Warm-up: Discard first 16 tokens per request (CUDA graph warm-up, attention ramp-up).
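Steps 4 and 5 plus the warm-up rule could be computed roughly as below. This is a sketch under stated assumptions: each stream is a list of `time.perf_counter_ns()` timestamps (one per token), intervals are pooled across all D streams before taking the median, and the helper names are hypothetical.

```python
from statistics import median

WARMUP_TOKENS = 16  # tokens discarded per request, per the warm-up rule


def inter_token_intervals_ms(timestamps_ns: list[int],
                             warmup: int = WARMUP_TOKENS) -> list[float]:
    """Gaps between consecutive token timestamps in ms, after dropping warm-up tokens."""
    kept = timestamps_ns[warmup:]
    return [(b - a) / 1e6 for a, b in zip(kept, kept[1:])]


def tpot_baseline_ms(streams: list[list[int]]) -> float:
    """Median inter-token interval pooled across all D decode streams."""
    pooled = [gap for ts in streams for gap in inter_token_intervals_ms(ts)]
    return median(pooled)
```

Pooling before the median weights every token equally; taking a per-stream median first and then averaging would be an equally defensible choice.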
Phase 2: Interference Measurement (Prefill Injected)
1. Same setup as Phase 1: D decode requests in steady-state
2. At token ~32 of the decode stream, inject prefill request:
   - Input: P random tokens (no prefix cache hit)
   - max_tokens=1
3. Continue collecting per-token timestamps for all D decode requests
4. Measure:
   a. TPOT of decode requests DURING prefill window (from prefill injection to prefill's first token)
   b. TPOT of decode requests AFTER prefill completes (recovery)
   c. Total prefill_duration
   d. num_chunks = ceil(P / chunk_size) [verify against actual]
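One way to separate measurements (a) and (b) is to classify each inter-token interval by where it falls relative to the prefill window. The sketch below assumes the driver records `inject_ns` (prefill submission) and `prefill_done_ns` (arrival of the prefill's first token) on the same `perf_counter_ns` clock as the decode timestamps; the function name is hypothetical.

```python
def split_intervals_by_window(timestamps_ns: list[int], inject_ns: int,
                              prefill_done_ns: int) -> dict[str, list[float]]:
    """Classify each inter-token interval (ms) as before, during, or after the prefill window."""
    windows: dict[str, list[float]] = {"before": [], "during": [], "after": []}
    for prev, cur in zip(timestamps_ns, timestamps_ns[1:]):
        gap_ms = (cur - prev) / 1e6
        if cur <= inject_ns:
            windows["before"].append(gap_ms)   # finished before the prefill was injected
        elif prev >= prefill_done_ns:
            windows["after"].append(gap_ms)    # started after the prefill completed
        else:
            windows["during"].append(gap_ms)   # overlaps the prefill window
    return windows
```

Intervals that straddle either boundary are counted as "during", which slightly favors attributing delay to the prefill; a stricter split is possible if that bias matters.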
Phase 3: Repeat for All Configurations
For chunk_size in [2048, 4096, 8192, 16384]:
    Configure vLLM with --max-num-batched-tokens=chunk_size
    Restart instance (clean KV cache state)
    For D in [0, 1, 2, 4, 6, 8, 12]:
        For P in [512, 1024, 2048, 4096, 8192, 16384, 32768]:
            Run Phase 1 (baseline) → record TPOT_baseline[D]
            Run Phase 2 (interference) → record TPOT_interference[D, P]
            Compute TPOT_penalty[D, P] = TPOT_interference - TPOT_baseline
            Wait 5s for KV eviction and state cleanup
Optimization: Phase 1 only needs to run once per (chunk_size, D) pair since it doesn't depend on P.
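A driver loop that applies this optimization might look like the sketch below. The callables `restart_server`, `run_baseline`, and `run_interference` are hypothetical placeholders for the actual server management and measurement code, and the result keys mirror the aggregated table columns.

```python
import time


def run_sweep(chunk_sizes, decode_batch_sizes, prefill_sizes,
              restart_server, run_baseline, run_interference, cooldown_s=5.0):
    """Run the nested sweep, measuring the baseline once per (chunk_size, D) pair."""
    results = []
    for chunk_size in chunk_sizes:
        restart_server(chunk_size)  # fresh instance, clean KV cache state
        for d in decode_batch_sizes:
            baseline_ms = run_baseline(d)  # independent of P, so measured once here
            for p in prefill_sizes:
                interference_ms = run_interference(d, p)
                results.append({
                    "chunk_size": chunk_size,
                    "decode_batch_size": d,
                    "new_prefill_tokens": p,
                    "tpot_baseline_ms": baseline_ms,
                    "tpot_interference_ms": interference_ms,
                    "tpot_penalty_ms": interference_ms - baseline_ms,
                })
                time.sleep(cooldown_s)  # wait for KV eviction and state cleanup
    return results
```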
Implementation
Client Architecture
┌──────────────────────────────────────────────────┐
│ Microbench Driver │
├──────────────────────────────────────────────────┤
│ 1. Spawn D "background decode" streams (async) │
│ 2. Wait for steady-state (all D in decode) │
│ 3. Inject prefill request │
│ 4. Collect streaming token timestamps │
│ 5. Compute metrics │
└──────────────────────────────────────────────────┘
│ OpenAI-compatible streaming API
▼
┌──────────────────────────────────────────────────┐
│ vLLM Instance (single GPU) │
│ --enable-chunked-prefill │
│ --max-num-batched-tokens={chunk_size} │
│ --enable-prefix-caching │
└──────────────────────────────────────────────────┘
Request Construction
Decode seed requests (to create ongoing decode batch):
{
"model": "Qwen3-Coder-30B-A3B-Instruct",
"messages": [{"role": "user", "content": FIXED_4K_PROMPT}],
"max_tokens": 256,
"temperature": 0,
"stream": True
}
All D requests share the same 4K prompt prefix (ensures prefix cache hit → instant prefill for seeds, isolating decode-only behavior).
Interference prefill request:
{
"model": "Qwen3-Coder-30B-A3B-Instruct",
"messages": [{"role": "user", "content": RANDOM_P_TOKEN_PROMPT}],
"max_tokens": 1,
"temperature": 0,
"stream": True
}
Use random content (UUID-based) to guarantee zero prefix cache hit → forces full P-token prefill.
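A possible way to build such a prompt is sketched below. The helper name `make_cold_prompt` is hypothetical, and the word-to-token mapping is an assumption: the number of words needed to reach exactly P tokens depends on the model's tokenizer and chat template and should be calibrated against them.

```python
import time
import uuid


def make_cold_prompt(num_words: int) -> str:
    """Build a prompt that is unique from its first characters onward."""
    # The leading nonce combines a UUID with a nanosecond timestamp, so repeated
    # runs of the same configuration never share a cached prefix
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    # Assumption: random hex words as filler; tokens per word is tokenizer-dependent
    words = [uuid.uuid4().hex[:8] for _ in range(num_words)]
    return nonce + " " + " ".join(words)
```

Placing the nonce at the very start matters because prefix caching matches from the beginning of the sequence; unique content appended only at the end would still hit the cache for the shared leading blocks.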
Timestamp Collection
Use SSE streaming with time.perf_counter_ns() on each data: {"choices":[{"delta":...}]} chunk:
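import time

import aiohttp

async def collect_stream(session: aiohttp.ClientSession, url: str, payload: dict) -> list[int]:
    """Returns list of nanosecond timestamps, one per streamed token chunk."""
    timestamps = []
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()  # fail fast rather than timing an error response
        # resp.content is a StreamReader that yields the body line by line
        async for line in resp.content:
            # Count only SSE data lines; skip blank separators and the [DONE] sentinel
            if line.startswith(b"data: ") and b"[DONE]" not in line:
                timestamps.append(time.perf_counter_ns())
    return timestamps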
Steady-State Detection
Before injecting prefill, verify all D requests are in active decode:
- Wait until each stream has emitted ≥ 32 tokens
- Check that the last 8 inter-token intervals are within 2× of each other (no startup variance)
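The two checks above can be expressed as a single predicate per stream. This is a sketch: the function name and parameter names are hypothetical, and "within 2× of each other" is interpreted as the largest of the last 8 intervals being at most twice the smallest.

```python
def is_steady_state(timestamps_ns: list[int], min_tokens: int = 32,
                    window: int = 8, max_ratio: float = 2.0) -> bool:
    """True once a stream has enough tokens and its recent intervals are stable."""
    if len(timestamps_ns) < max(min_tokens, window + 1):
        return False
    # The last `window` intervals need the last `window + 1` timestamps
    recent = timestamps_ns[-(window + 1):]
    gaps = [b - a for a, b in zip(recent, recent[1:])]
    # A zero gap (two chunks read at the same instant) is treated as unstable
    return min(gaps) > 0 and max(gaps) <= max_ratio * min(gaps)
```

The driver would inject the prefill only when this returns true for all D streams.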
Output Format
Per-Run Record (results/{chunk_size}/D{d}_P{p}.json)
{
  "config": {
    "decode_batch_size": 4,
    "new_prefill_tokens": 8192,
    "chunk_size": 8192,
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "gpu": "H20"
  },
  "baseline": {
    "tpot_p50_ms": 42.3,
    "tpot_p90_ms": 45.1,
    "tpot_p99_ms": 48.7,
    "step_time_ms": 43.0
  },
  "interference": {
    "tpot_during_prefill_p50_ms": 89.2,
    "tpot_during_prefill_p90_ms": 95.4,
    "tpot_after_prefill_p50_ms": 43.1,
    "num_chunks_actual": 1,
    "prefill_duration_ms": 91.0,
    "prefill_ttft_ms": 91.0
  },
  "derived": {
    "tpot_penalty_p50_ms": 46.9,
    "tpot_penalty_ratio": 1.11,
    "total_interference_ms": 46.9,
    "decode_tokens_delayed": 4
  }
}
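The penalty fields in `derived` appear to follow from the p50 values, with the ratio taken as penalty divided by baseline. That definition is inferred from the example numbers (46.9 / 42.3 ≈ 1.11) rather than stated explicitly, so treat the sketch below as an assumption; the function name is hypothetical.

```python
def derive_penalty(tpot_baseline_p50_ms: float,
                   tpot_during_prefill_p50_ms: float) -> dict[str, float]:
    """Compute the penalty fields of the `derived` section from p50 TPOT values."""
    penalty_ms = tpot_during_prefill_p50_ms - tpot_baseline_p50_ms
    return {
        "tpot_penalty_p50_ms": round(penalty_ms, 1),
        # Assumed definition: penalty relative to the decode-only baseline
        "tpot_penalty_ratio": round(penalty_ms / tpot_baseline_p50_ms, 2),
    }
```

Under this definition a ratio of 1.0 means TPOT roughly doubled while the prefill was running.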
Aggregated Table (results/interference_table.csv)
chunk_size,decode_batch_size,new_prefill_tokens,tpot_baseline_ms,tpot_interference_ms,tpot_penalty_ms,penalty_ratio,num_chunks,prefill_duration_ms
8192,4,8192,42.3,89.2,46.9,1.11,1,91.0
8192,4,16384,42.3,88.5,46.2,1.09,2,178.3
8192,8,8192,78.1,156.3,78.2,1.00,1,159.0
...
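Flattening a per-run record into one row of this table could be done as sketched below. The field mapping (p50 values for the TPOT columns, `num_chunks_actual` for `num_chunks`) is inferred from the matching example values and is an assumption; `record_to_row` and `rows_to_csv` are hypothetical helper names.

```python
import csv
import io

CSV_COLUMNS = [
    "chunk_size", "decode_batch_size", "new_prefill_tokens",
    "tpot_baseline_ms", "tpot_interference_ms", "tpot_penalty_ms",
    "penalty_ratio", "num_chunks", "prefill_duration_ms",
]


def record_to_row(record: dict) -> dict:
    """Map one per-run JSON record onto the aggregated table columns."""
    cfg, base = record["config"], record["baseline"]
    intf, der = record["interference"], record["derived"]
    return {
        "chunk_size": cfg["chunk_size"],
        "decode_batch_size": cfg["decode_batch_size"],
        "new_prefill_tokens": cfg["new_prefill_tokens"],
        "tpot_baseline_ms": base["tpot_p50_ms"],
        "tpot_interference_ms": intf["tpot_during_prefill_p50_ms"],
        "tpot_penalty_ms": der["tpot_penalty_p50_ms"],
        "penalty_ratio": der["tpot_penalty_ratio"],
        "num_chunks": intf["num_chunks_actual"],
        "prefill_duration_ms": intf["prefill_duration_ms"],
    }


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=CSV_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```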
Analysis Deliverables
1. Interference Heatmap
X-axis: new_prefill_tokens, Y-axis: decode_batch_size, Color: tpot_penalty_ratio
Expected pattern:
- Penalty increases with decode_batch_size (more requests disrupted)
- Penalty per-request is roughly constant for same chunk_size (step time is dominated by the larger of prefill-chunk or decode-batch)
- Penalty increases with num_chunks (more disrupted iterations)
2. Total Interference Cost Model
total_interference_cost(D, P, chunk_size) =
    num_chunks(P, chunk_size) × D × tpot_penalty_per_chunk(D, chunk_size)
If the model fits well (R² > 0.9), it becomes the offload decision function.
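The goodness-of-fit check can be computed without external dependencies. The sketch below uses the standard coefficient of determination, comparing measured total interference costs with the model's predictions; the function name is an illustrative choice.

```python
def r_squared(observed: list[float], predicted: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: how far predictions are from the measurements
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the measurements around their mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Here `observed` would hold the measured cost per configuration and `predicted` the value of `total_interference_cost` for the same configuration. The function assumes the measurements are not all identical, since `ss_tot` would then be zero.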
3. Break-Even Analysis
For each (D, P, chunk_size), compute:
break_even_transfer_time = total_interference_cost(D, P, chunk_size)
If layerwise pipeline transfer cost < break_even_transfer_time, offload wins.
Plot: "offload wins" region in the (D, P) space for chunk_size=8192.
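The region to plot could be derived as in the sketch below. It assumes a per-chunk penalty lookup keyed by D for a fixed chunk size, a single scalar transfer cost, and `num_chunks = ceil(P / chunk_size)`; the function and parameter names are hypothetical.

```python
import math


def offload_wins_region(penalty_ms_by_d: dict[int, float], p_values: list[int],
                        chunk_size: int, transfer_cost_ms: float) -> list[tuple[int, int]]:
    """Return the (D, P) points where interference cost exceeds the transfer cost."""
    wins = []
    for d, penalty_ms in sorted(penalty_ms_by_d.items()):
        for p in p_values:
            # Break-even transfer time equals the total interference cost
            break_even_ms = math.ceil(p / chunk_size) * d * penalty_ms
            if transfer_cost_ms < break_even_ms:
                wins.append((d, p))
    return wins
```

In practice the transfer cost likely grows with P, so a callable `transfer_cost_ms(p)` may be a better fit than a constant once transfer measurements are available.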
4. Sensitivity to chunk_size
How does --max-num-batched-tokens (effective chunk size) trade off:
- Smaller chunk → more chunks → longer total prefill → more interrupted decode steps, but each step is shorter
- Larger chunk → fewer chunks → shorter total prefill → fewer interrupted steps, but each step takes longer
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| CUDA graph optimization masks real penalty | Underestimate interference | Run with --enforce-eager as ablation |
| vLLM internal batching merges decode+prefill differently than expected | Wrong chunk count | Verify with /metrics endpoint (vllm:num_prefill_tokens_iter) |
| Network jitter in timestamp collection | Noisy TPOT | Run on localhost (127.0.0.1), use 5 repetitions per config |
| KV cache pressure at high D | OOM or eviction | Keep decode context at 4K, monitor gpu_cache_usage_perc |
| MoE routing variance | Non-deterministic step time | Use greedy decoding, report p50 over 5 runs |
Execution Estimate
| Phase | Time |
|---|---|
| Per configuration (1 baseline + 1 interference, 5 reps) | ~3 min |
| Full sweep (196 configs × 3 min) | ~10 hours |
| Reduced sweep (chunk_size=8192 only, 49 configs) | ~2.5 hours |
| Analysis & plotting | 1 hour |
Recommended: Start with reduced sweep (default chunk_size=8192), then expand to other chunk sizes if results are promising.
Success Criteria
- Interference is measurable: `tpot_penalty_ratio > 1.1` for D ≥ 4 and P ≥ 4096
- Model fits: Linear or polynomial model of `total_interference_cost` achieves R² > 0.85
- Break-even exists: There exists a realistic (D, P) region where `interference_cost > 50ms` (layerwise pipeline transfer budget)
- Reproducible: Coefficient of variation < 15% across 5 repetitions per config
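The reproducibility criterion can be checked per configuration with a few lines. This sketch assumes the coefficient of variation is the sample standard deviation (n − 1 denominator) divided by the mean of the per-repetition TPOT values; the function names are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return stdev(samples) / mean(samples)


def is_reproducible(samples: list[float], max_cv: float = 0.15) -> bool:
    """True when the repetitions vary by less than the allowed CV."""
    return coefficient_of_variation(samples) < max_cv
```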