# Prefill-Decode Interference Microbenchmark

## Goal

Quantify the **per-chunk TPOT degradation** caused by prefill interference on ongoing decode batches, producing a lookup table:

```
f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms
```

This table is the foundation for the runtime offload decision:

```
interference_cost = num_chunks × decode_batch_size × TPOT_penalty
if interference_cost > layerwise_transfer_cost:
    offload()
```

---

## Hardware & Model

| Parameter | Value |
|-----------|-------|
| GPU | NVIDIA H20 96GB × 1 (single instance) |
| Model | Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active) |
| TP | 1 |
| `max_model_len` | 200000 |
| `block_size` | 16 (vLLM default) |
| `enable_prefix_caching` | true |
| `enable_chunked_prefill` | true |
| `max_num_batched_tokens` | 8192 (H20 default for openai API server) |
| `gpu_memory_utilization` | 0.9 |

---

## Experiment Design

### Independent Variables

| Variable | Values | Rationale |
|----------|--------|-----------|
| `decode_batch_size` (D) | 0, 1, 2, 4, 6, 8, 12 | Covers low→saturated decode concurrency |
| `new_prefill_tokens` (P) | 512, 1024, 2048, 4096, 8192, 16384, 32768 | Range from small warm turn to full cold heavy |
| `chunk_size` | 2048, 4096, 8192 (default), 16384 | Sweep the dominant scheduling knob |

Full sweep: 7 × 7 × 4 = 196 configurations.
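
As a rough sketch of the sweep and the decision rule above, the Python below enumerates the grid and evaluates the offload inequality. The function names and the `layerwise_transfer_cost_ms` parameter are illustrative and do not come from an existing driver. The chunk count uses the nominal `ceil(P / chunk_size)` estimate, which the protocol verifies against the engine.

```python
import itertools
import math

# Values copied from the Independent Variables table.
DECODE_BATCH_SIZES = [0, 1, 2, 4, 6, 8, 12]
PREFILL_TOKENS = [512, 1024, 2048, 4096, 8192, 16384, 32768]
CHUNK_SIZES = [2048, 4096, 8192, 16384]


def sweep_configs() -> list[tuple[int, int, int]]:
    """Return every (chunk_size, D, P) triple in the full sweep."""
    return list(itertools.product(CHUNK_SIZES, DECODE_BATCH_SIZES, PREFILL_TOKENS))


def expected_num_chunks(prefill_tokens: int, chunk_size: int) -> int:
    """Nominal chunk count; the scheduler's actual count may differ."""
    return math.ceil(prefill_tokens / chunk_size)


def should_offload(
    num_chunks: int,
    decode_batch_size: int,
    tpot_penalty_ms: float,
    layerwise_transfer_cost_ms: float,  # hypothetical input, measured elsewhere
) -> bool:
    """Apply the decision rule: offload when interference costs more than transfer."""
    interference_cost_ms = num_chunks * decode_batch_size * tpot_penalty_ms
    return interference_cost_ms > layerwise_transfer_cost_ms


if __name__ == "__main__":
    configs = sweep_configs()
    print(len(configs), "configurations")
    print("chunks for P=32768, chunk_size=2048:", expected_num_chunks(32768, 2048))
```

Keeping the chunk-count estimate in its own function makes it easy to swap in the measured `num_chunks_actual` once the engine reports it.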

### Dependent Variables (Measured)

| Metric | Definition | How to measure |
|--------|-----------|----------------|
| `TPOT_baseline` | Inter-token latency with decode-only batch (no prefill) | Send D dummy decode requests, measure steady-state TPOT |
| `TPOT_interference` | Inter-token latency while prefill chunks execute | Measure TPOT of ongoing decode requests during the window when prefill chunks are being processed |
| `TPOT_penalty` | `TPOT_interference - TPOT_baseline` | Per-token penalty from prefill co-execution |
| `prefill_duration` | Wall time from prefill request submission to first token | Includes queuing + chunked execution |
| `num_chunks_actual` | Number of scheduler iterations the prefill occupied | From vLLM engine logs or step counter |
| `step_time_baseline` | Scheduler step duration with decode-only | From engine internals or proxy measurement |
| `step_time_mixed` | Scheduler step duration with prefill+decode | Same as `step_time_baseline` |

### Control Variables (Fixed per experiment)

| Variable | Value | Rationale |
|----------|-------|-----------|
| Decode output length | 256 tokens each | Long enough to span the entire prefill window |
| Decode context length | 4096 tokens each | Realistic session history, pre-warmed via prefix cache |
| Prefill output length | 1 token | Minimize post-prefill decode interference |
| KV cache state | Prefill is fully cold (no cache hit) | Worst case: maximum chunks |
| Temperature | 0 (greedy) | Deterministic, no sampling variance |

---

## Protocol

### Phase 1: Baseline TPOT Measurement (Decode-Only)

```
1. Launch vLLM instance (TP=1, single H20 GPU)
2. Pre-fill D decode "seed" requests:
   - Each has 4096-token context (pre-warmed via identical prompt prefix)
   - Set max_tokens=256, temperature=0
3. Once all D requests are in active decode, start timer
4. Collect per-token timestamps for each decode request over 256 tokens
5. Compute TPOT_baseline = median(inter-token-intervals) across all D requests
6. Record step_time_baseline from vLLM metrics endpoint (/metrics)
```

**Warm-up**: Discard first 16 tokens per request (CUDA graph warm-up, attention ramp-up).

### Phase 2: Interference Measurement (Prefill Injected)

```
1. Same setup as Phase 1: D decode requests in steady-state
2. At token ~32 of the decode stream, inject prefill request:
   - Input: P random tokens (no prefix cache hit)
   - max_tokens=1
3. Continue collecting per-token timestamps for all D decode requests
4. Measure:
   a. TPOT of decode requests DURING prefill window
      (from prefill injection to prefill's first token)
   b. TPOT of decode requests AFTER prefill completes (recovery)
   c. Total prefill_duration
   d. num_chunks = ceil(P / chunk_size) [verify against actual]
```

### Phase 3: Repeat for All Configurations

```
For chunk_size in [2048, 4096, 8192, 16384]:
    Configure vLLM with --max-num-batched-tokens=chunk_size
    Restart instance (clean KV cache state)
    For D in [0, 1, 2, 4, 6, 8, 12]:
        For P in [512, 1024, 2048, 4096, 8192, 16384, 32768]:
            Run Phase 1 (baseline) → record TPOT_baseline[D]
            Run Phase 2 (interference) → record TPOT_interference[D, P]
            Compute TPOT_penalty[D, P] = TPOT_interference[D, P] - TPOT_baseline[D]
            Wait 5s for KV eviction and state cleanup
```

**Optimization**: Phase 1 only needs to run once per (chunk_size, D) pair, since the baseline does not depend on P.

---

## Implementation

### Client Architecture

```
┌──────────────────────────────────────────────────┐
│                Microbench Driver                 │
├──────────────────────────────────────────────────┤
│ 1. Spawn D "background decode" streams (async)   │
│ 2. Wait for steady-state (all D in decode)       │
│ 3. Inject prefill request                        │
│ 4. Collect streaming token timestamps            │
│ 5. Compute metrics                               │
└──────────────────────────────────────────────────┘
                        │
                        │ OpenAI-compatible streaming API
                        ▼
┌──────────────────────────────────────────────────┐
│ vLLM Instance (single GPU)                       │
│   --enable-chunked-prefill                       │
│   --max-num-batched-tokens={chunk_size}          │
│   --enable-prefix-caching                        │
└──────────────────────────────────────────────────┘
```

### Request Construction

**Decode seed requests** (to create ongoing decode batch):

```python
# FIXED_4K_PROMPT is the shared ~4096-token prompt built once by the driver.
decode_seed_payload = {
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": FIXED_4K_PROMPT}],
    "max_tokens": 256,
    "temperature": 0,
    "stream": True,
}
```

All D requests share the same 4K prompt prefix. This ensures a prefix cache hit, so seed prefill is near-instant and the measurement isolates decode-only behavior.

**Interference prefill request**:

```python
# RANDOM_P_TOKEN_PROMPT is regenerated for every run so it never hits the prefix cache.
interference_payload = {
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": RANDOM_P_TOKEN_PROMPT}],
    "max_tokens": 1,
    "temperature": 0,
    "stream": True,
}
```

Use random content (UUID-based) to guarantee zero prefix cache hits, which forces a full P-token prefill.

### Timestamp Collection

Use SSE streaming with `time.perf_counter_ns()` on each `data: {"choices":[{"delta":...}]}` chunk:

```python
import time
import aiohttp

async def collect_stream(session: aiohttp.ClientSession, url: str, payload: dict) -> list[int]:
    """Return nanosecond timestamps, one per streamed SSE data chunk (roughly one per token)."""
    timestamps = []
    async with session.post(url, json=payload) as resp:
        async for line in resp.content:
            # Count data chunks only; skip blank lines and the final "data: [DONE]" sentinel.
            if line.startswith(b"data: ") and line.strip() != b"data: [DONE]":
                timestamps.append(time.perf_counter_ns())
    return timestamps
```

### Steady-State Detection

Before injecting prefill, verify all D requests are in active decode:

1. Wait until each stream has emitted ≥ 32 tokens
2. Check that the last 8 inter-token intervals are within 2× of each other (no startup variance)

---

## Output Format

### Per-Run Record (`results/{chunk_size}/D{d}_P{p}.json`)

```json
{
  "config": {
    "decode_batch_size": 4,
    "new_prefill_tokens": 8192,
    "chunk_size": 8192,
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "gpu": "H20"
  },
  "baseline": {
    "tpot_p50_ms": 42.3,
    "tpot_p90_ms": 45.1,
    "tpot_p99_ms": 48.7,
    "step_time_ms": 43.0
  },
  "interference": {
    "tpot_during_prefill_p50_ms": 89.2,
    "tpot_during_prefill_p90_ms": 95.4,
    "tpot_after_prefill_p50_ms": 43.1,
    "num_chunks_actual": 1,
    "prefill_duration_ms": 91.0,
    "prefill_ttft_ms": 91.0
  },
  "derived": {
    "tpot_penalty_p50_ms": 46.9,
    "tpot_penalty_ratio": 1.11,
    "total_interference_ms": 46.9,
    "decode_tokens_delayed": 4
  }
}
```

In this record, `tpot_penalty_p50_ms` is `tpot_during_prefill_p50_ms - tpot_p50_ms` (89.2 - 42.3 = 46.9), and `tpot_penalty_ratio` is the penalty divided by the baseline p50 (46.9 / 42.3 ≈ 1.11).

### Aggregated Table (`results/interference_table.csv`)

```csv
chunk_size,decode_batch_size,new_prefill_tokens,tpot_baseline_ms,tpot_interference_ms,tpot_penalty_ms,penalty_ratio,num_chunks,prefill_duration_ms
8192,4,8192,42.3,89.2,46.9,1.11,1,91.0
8192,4,16384,42.3,88.5,46.2,1.09,2,178.3
8192,8,8192,78.1,156.3,78.2,1.00,1,159.0
...
```

---

## Analysis Deliverables

### 1. Interference Heatmap

X-axis: `new_prefill_tokens`, Y-axis: `decode_batch_size`, Color: `tpot_penalty_ratio`

Expected pattern:

- Total penalty increases with `decode_batch_size` (more requests disrupted)
- Penalty per request is roughly constant for the same `chunk_size` (step time is dominated by the larger of prefill-chunk or decode-batch)
- Total penalty increases with `num_chunks` (more disrupted iterations)

### 2. Total Interference Cost Model

```
total_interference_cost(D, P, chunk_size) =
    num_chunks(P, chunk_size) × D × tpot_penalty_per_chunk(D, chunk_size)
```

If the model fits well (R² > 0.9), it becomes the offload decision function.

### 3. Break-Even Analysis

For each (D, P, chunk_size), compute:

```
break_even_transfer_time = total_interference_cost(D, P, chunk_size)
```

If layerwise pipeline transfer cost < break_even_transfer_time, offload wins.
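
The cost model and break-even rule above can be sketched in Python as follows. This is a minimal illustration, not the analysis script: the function names are hypothetical, the sample rows are the example values from the aggregated table, and the 50 ms transfer cost in the usage lines is only a placeholder input.

```python
import csv
import io

# Example rows copied from the aggregated table format above.
SAMPLE_TABLE = """\
chunk_size,decode_batch_size,new_prefill_tokens,tpot_baseline_ms,tpot_interference_ms,tpot_penalty_ms,penalty_ratio,num_chunks,prefill_duration_ms
8192,4,8192,42.3,89.2,46.9,1.11,1,91.0
8192,4,16384,42.3,88.5,46.2,1.09,2,178.3
8192,8,8192,78.1,156.3,78.2,1.00,1,159.0
"""


def total_interference_cost_ms(
    num_chunks: int, decode_batch_size: int, tpot_penalty_ms: float
) -> float:
    """num_chunks × D × per-chunk TPOT penalty, as in the cost model."""
    return num_chunks * decode_batch_size * tpot_penalty_ms


def break_even_rows(table_text: str, transfer_cost_ms: float) -> list[dict]:
    """Compute the break-even transfer time for each row of the table."""
    rows = []
    for row in csv.DictReader(io.StringIO(table_text)):
        cost = total_interference_cost_ms(
            int(row["num_chunks"]),
            int(row["decode_batch_size"]),
            float(row["tpot_penalty_ms"]),
        )
        rows.append(
            {
                "D": int(row["decode_batch_size"]),
                "P": int(row["new_prefill_tokens"]),
                "chunk_size": int(row["chunk_size"]),
                "break_even_transfer_time_ms": cost,
                # Offload wins when transferring is cheaper than the interference.
                "offload_wins": transfer_cost_ms < cost,
            }
        )
    return rows


if __name__ == "__main__":
    for entry in break_even_rows(SAMPLE_TABLE, transfer_cost_ms=50.0):
        print(entry)
```

The boolean `offload_wins` column can be used directly as the mask for the (D, P) region plot.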

Plot: "offload wins" region in the (D, P) space for chunk_size=8192.

### 4. Sensitivity to chunk_size

How `--max-num-batched-tokens` (the effective chunk size) trades off the two effects:

- Smaller chunk → more chunks → longer total prefill → more interrupted decode steps, but each step is shorter
- Larger chunk → fewer chunks → shorter total prefill → fewer interrupted steps, but each step takes longer

---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| CUDA graph optimization masks real penalty | Underestimate interference | Run with `--enforce-eager` as ablation |
| vLLM internal batching merges decode+prefill differently than expected | Wrong chunk count | Verify with `/metrics` endpoint (`vllm:num_prefill_tokens_iter`) |
| Network jitter in timestamp collection | Noisy TPOT | Run on localhost (127.0.0.1), use 5 repetitions per config |
| KV cache pressure at high D | OOM or eviction | Keep decode context at 4K, monitor `gpu_cache_usage_perc` |
| MoE routing variance | Non-deterministic step time | Use greedy decoding, report p50 over 5 runs |

---

## Execution Estimate

| Phase | Time |
|-------|------|
| Per configuration (1 baseline + 1 interference, 5 reps) | ~3 min |
| Full sweep (196 configs × 3 min) | ~10 hours |
| Reduced sweep (chunk_size=8192 only, 49 configs) | ~2.5 hours |
| Analysis & plotting | 1 hour |

**Recommended**: Start with the reduced sweep (default chunk_size=8192), then expand to other chunk sizes if results are promising.

---

## Success Criteria

1. **Interference is measurable**: `tpot_penalty_ratio > 1.1` for D ≥ 4 and P ≥ 4096
2. **Model fits**: Linear or polynomial model of `total_interference_cost` achieves R² > 0.85
3. **Break-even exists**: There exists a realistic (D, P) region where `interference_cost > 50ms` (layerwise pipeline transfer budget)
4. **Reproducible**: Coefficient of variation < 15% across 5 repetitions per config
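
Criteria 1 and 4 can be checked mechanically once results exist. The sketch below is one possible reading, with several assumptions: the coefficient of variation is taken over the per-repetition TPOT penalty using the sample standard deviation, the ratio is computed as penalty divided by baseline (matching the example record), and the function names are hypothetical.

```python
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by the mean (assumed CV definition)."""
    return statistics.stdev(samples) / statistics.fmean(samples)


def is_reproducible(penalty_ms_per_rep: list[float], max_cv: float = 0.15) -> bool:
    """Criterion 4: CV below 15% across the repetitions of one config."""
    return coefficient_of_variation(penalty_ms_per_rep) < max_cv


def interference_is_measurable(
    tpot_penalty_ms: float, tpot_baseline_ms: float, min_ratio: float = 1.1
) -> bool:
    """Criterion 1: penalty ratio above 1.1 (ratio assumed to be penalty / baseline)."""
    return tpot_penalty_ms / tpot_baseline_ms > min_ratio


if __name__ == "__main__":
    # Placeholder repetitions, not measured data.
    reps = [46.9, 47.5, 45.8, 46.2, 48.0]
    print("reproducible:", is_reproducible(reps))
    print("measurable:", interference_is_measurable(46.9, 42.3))
```

If the CV is meant to apply to a different quantity (for example `TPOT_interference` rather than the penalty), only the input list changes.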