agentic-kvc/microbench/interference_microbench_design.md
Gahow Wang f784e49c07 Microbench: prefill-decode interference + PD transfer lifecycle
Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
2026-05-26 00:57:06 +08:00


Prefill-Decode Interference Microbenchmark

Goal

Quantify the per-chunk TPOT degradation caused by prefill interference on ongoing decode batches, producing a lookup table:

f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms

This table is the foundation for the runtime offload decision:

interference_cost = num_chunks × decode_batch_size × TPOT_penalty
if interference_cost > layerwise_transfer_cost:
    offload()
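A minimal Python sketch of this rule, assuming the lookup table is held as a dict keyed by (decode_batch_size, new_prefill_tokens, chunk_size) and that the chunk count can be approximated as ceil(P / chunk_size). The names `should_offload`, `penalty_table`, and `transfer_cost_ms` are illustrative and not part of the benchmark code.

```python
import math


def should_offload(
    decode_batch_size: int,
    new_prefill_tokens: int,
    chunk_size: int,
    penalty_table: dict[tuple[int, int, int], float],
    transfer_cost_ms: float,
) -> bool:
    """Return True when estimated interference cost exceeds the transfer cost."""
    # Assumption: each prefill chunk occupies exactly one scheduler step.
    num_chunks = math.ceil(new_prefill_tokens / chunk_size)
    # Hypothetical table layout: (D, P, chunk_size) -> TPOT penalty in ms.
    key = (decode_batch_size, new_prefill_tokens, chunk_size)
    tpot_penalty_ms = penalty_table[key]
    interference_cost_ms = num_chunks * decode_batch_size * tpot_penalty_ms
    return interference_cost_ms > transfer_cost_ms
```

In practice the table would likely need interpolation for (D, P) pairs that were not measured; the sketch only handles exact hits.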

Hardware & Model

| Parameter | Value |
| --- | --- |
| GPU | NVIDIA H20 96GB × 1 (single instance) |
| Model | Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active) |
| TP | 1 |
| max_model_len | 200000 |
| block_size | 16 (vLLM default) |
| enable_prefix_caching | true |
| enable_chunked_prefill | true |
| max_num_batched_tokens | 8192 (H20 default for openai API server) |
| gpu_memory_utilization | 0.9 |

Experiment Design

Independent Variables

| Variable | Values | Rationale |
| --- | --- | --- |
| decode_batch_size (D) | 0, 1, 2, 4, 6, 8, 12 | Covers low→saturated decode concurrency |
| new_prefill_tokens (P) | 512, 1024, 2048, 4096, 8192, 16384, 32768 | Range from small warm turn to full cold heavy |
| chunk_size | 2048, 4096, 8192 (default), 16384 | Sweep the dominant scheduling knob |

Full sweep: 7 × 7 × 4 = 196 configurations.

Dependent Variables (Measured)

| Metric | Definition | How to measure |
| --- | --- | --- |
| TPOT_baseline | Inter-token latency with decode-only batch (no prefill) | Send D dummy decode requests, measure steady-state TPOT |
| TPOT_interference | Inter-token latency while prefill chunks execute | Measure TPOT of ongoing decode requests during the window when prefill chunks are being processed |
| TPOT_penalty | TPOT_interference - TPOT_baseline | Per-token penalty from prefill co-execution |
| prefill_duration | Wall time from prefill request submission to first token | Includes queuing + chunked execution |
| num_chunks_actual | Number of scheduler iterations the prefill occupied | From vLLM engine logs or step counter |
| step_time_baseline | Scheduler step duration with decode-only | From engine internals or proxy measurement |
| step_time_mixed | Scheduler step duration with prefill+decode | Same |

Control Variables (Fixed per experiment)

| Variable | Value | Rationale |
| --- | --- | --- |
| Decode output length | 256 tokens each | Long enough to span the entire prefill window |
| Decode context length | 4096 tokens each | Realistic session history, pre-warmed via prefix cache |
| Prefill output length | 1 token | Minimize post-prefill decode interference |
| KV cache state | Prefill is fully cold (no cache hit) | Worst case: maximum chunks |
| Temperature | 0 (greedy) | Deterministic, no sampling variance |

Protocol

Phase 1: Baseline TPOT Measurement (Decode-Only)

1. Launch vLLM instance (TP=1, single H20 GPU)
2. Pre-fill D decode "seed" requests:
   - Each has 4096-token context (pre-warmed via identical prompt prefix)
   - Set max_tokens=256, temperature=0
3. Once all D requests are in active decode, start timer
4. Collect per-token timestamps for each decode request over 256 tokens
5. Compute TPOT_baseline = median(inter-token-intervals) across all D requests
6. Record step_time_baseline from vLLM metrics endpoint (/metrics)

Warm-up: Discard first 16 tokens per request (CUDA graph warm-up, attention ramp-up).
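Steps 4 and 5 plus the warm-up rule could be computed roughly as follows. This is a sketch that assumes each stream is a list of nanosecond timestamps (one per received token) and pools intervals from all D requests before taking the median; `tpot_baseline_ms` is an illustrative name.

```python
import statistics

WARMUP_TOKENS = 16  # tokens discarded per request, as stated above


def tpot_baseline_ms(streams: list[list[int]], warmup: int = WARMUP_TOKENS) -> float:
    """Median inter-token interval in ms, pooled across all decode streams."""
    intervals_ms = []
    for timestamps_ns in streams:
        steady = timestamps_ns[warmup:]
        # Consecutive differences are inter-token intervals in nanoseconds.
        intervals_ms.extend((b - a) / 1e6 for a, b in zip(steady, steady[1:]))
    if not intervals_ms:
        raise ValueError("not enough tokens left after warm-up")
    return statistics.median(intervals_ms)
```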

Phase 2: Interference Measurement (Prefill Injected)

1. Same setup as Phase 1: D decode requests in steady-state
2. At token ~32 of the decode stream, inject prefill request:
   - Input: P random tokens (no prefix cache hit)
   - max_tokens=1
3. Continue collecting per-token timestamps for all D decode requests
4. Measure:
   a. TPOT of decode requests DURING prefill window
      (from prefill injection to prefill's first token)
   b. TPOT of decode requests AFTER prefill completes (recovery)
   c. Total prefill_duration
   d. num_chunks = ceil(P / chunk_size) [verify against actual]
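Measurements 4a and 4b can be separated by classifying each inter-token interval against the prefill window. The sketch below assumes an interval counts as "during" when it overlaps the window between injection and the prefill's first token, and as "after" when it starts at or after that first token; other boundary conventions are equally defensible. `window_tpot_ms` is an illustrative name.

```python
import statistics
from typing import Optional


def window_tpot_ms(
    timestamps_ns: list[int], inject_ns: int, prefill_done_ns: int
) -> tuple[Optional[float], Optional[float]]:
    """Return (median TPOT during prefill, median TPOT after prefill) in ms."""
    during, after = [], []
    for prev, cur in zip(timestamps_ns, timestamps_ns[1:]):
        interval_ms = (cur - prev) / 1e6
        if prev >= prefill_done_ns:
            after.append(interval_ms)   # recovery phase
        elif cur > inject_ns:
            during.append(interval_ms)  # overlaps the prefill window
        # Intervals that end before injection belong to the pre-injection phase.
    return (
        statistics.median(during) if during else None,
        statistics.median(after) if after else None,
    )
```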

Phase 3: Repeat for All Configurations

For chunk_size in [2048, 4096, 8192, 16384]:
    Configure vLLM with --max-num-batched-tokens=chunk_size
    Restart instance (clean KV cache state)
    
    For D in [0, 1, 2, 4, 6, 8, 12]:
        For P in [512, 1024, 2048, 4096, 8192, 16384, 32768]:
            Run Phase 1 (baseline) → record TPOT_baseline[D]
            Run Phase 2 (interference) → record TPOT_interference[D, P]
            Compute TPOT_penalty[D, P] = TPOT_interference - TPOT_baseline
            Wait 5s for KV eviction and state cleanup

Optimization: Phase 1 only needs to run once per (chunk_size, D) pair since it doesn't depend on P.
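One way to apply this optimization is to hoist the baseline run out of the P loop. The sketch assumes `run_baseline` and `run_interference` are caller-supplied callables (hypothetical names) that execute Phase 1 and Phase 2 and return a TPOT in ms; restarting vLLM per chunk size is left as a comment.

```python
from typing import Callable

CHUNK_SIZES = [2048, 4096, 8192, 16384]
DECODE_BATCH_SIZES = [0, 1, 2, 4, 6, 8, 12]
PREFILL_TOKENS = [512, 1024, 2048, 4096, 8192, 16384, 32768]


def run_sweep(
    run_baseline: Callable[[int, int], float],
    run_interference: Callable[[int, int, int], float],
) -> dict[tuple[int, int, int], float]:
    """Return TPOT_penalty keyed by (chunk_size, D, P)."""
    penalties = {}
    for chunk_size in CHUNK_SIZES:
        # A real driver would restart vLLM here with
        # --max-num-batched-tokens=chunk_size.
        for d in DECODE_BATCH_SIZES:
            # Phase 1 runs once per (chunk_size, D) and is reused for every P.
            baseline = run_baseline(chunk_size, d)
            for p in PREFILL_TOKENS:
                interference = run_interference(chunk_size, d, p)
                penalties[(chunk_size, d, p)] = interference - baseline
    return penalties
```

This cuts the number of baseline runs from 196 to 28 (4 chunk sizes × 7 decode batch sizes).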


Implementation

Client Architecture

┌──────────────────────────────────────────────────┐
│                  Microbench Driver                │
├──────────────────────────────────────────────────┤
│  1. Spawn D "background decode" streams (async)  │
│  2. Wait for steady-state (all D in decode)      │
│  3. Inject prefill request                       │
│  4. Collect streaming token timestamps           │
│  5. Compute metrics                              │
└──────────────────────────────────────────────────┘
         │ OpenAI-compatible streaming API
         ▼
┌──────────────────────────────────────────────────┐
│           vLLM Instance (single GPU)             │
│  --enable-chunked-prefill                        │
│  --max-num-batched-tokens={chunk_size}           │
│  --enable-prefix-caching                         │
└──────────────────────────────────────────────────┘

Request Construction

Decode seed requests (to create ongoing decode batch):

{
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": FIXED_4K_PROMPT}],
    "max_tokens": 256,
    "temperature": 0,
    "stream": True
}

All D requests share the same 4K prompt prefix (ensures prefix cache hit → instant prefill for seeds, isolating decode-only behavior).

Interference prefill request:

{
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": RANDOM_P_TOKEN_PROMPT}],
    "max_tokens": 1,
    "temperature": 0,
    "stream": True
}

Use random content (UUID-based) to guarantee zero prefix cache hit → forces full P-token prefill.
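A sketch of how such a prompt might be built, following the uuid + time_ns approach mentioned in the commit message. It places the unique nonce at the start of the content so that later blocks should not match a cached prefix. The filler length is given in words here; mapping that to exactly P tokens requires the model tokenizer and is not shown. `make_cold_prefill_payload` is an illustrative name.

```python
import time
import uuid

MODEL = "Qwen3-Coder-30B-A3B-Instruct"


def make_cold_prefill_payload(filler_words: int) -> dict:
    """Build a prefill request whose prompt starts with a never-repeated nonce."""
    # uuid4 combined with time_ns makes the leading text unique per call.
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    # Assumption: length is controlled in words, not tokens.
    filler = " ".join(f"w{i}" for i in range(filler_words))
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": f"{nonce} {filler}"}],
        "max_tokens": 1,
        "temperature": 0,
        "stream": True,
    }
```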

Timestamp Collection

Use SSE streaming with time.perf_counter_ns() on each data: {"choices":[{"delta":...}]} chunk:

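import time

import aiohttp


async def collect_stream(session: aiohttp.ClientSession, url: str, payload: dict) -> list[int]:
    """Return one perf_counter_ns() timestamp per SSE data chunk (roughly one per token)."""
    timestamps = []
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()
        # aiohttp yields the streamed body line by line.
        async for line in resp.content:
            # Count every data chunk except the terminating [DONE] sentinel.
            if line.startswith(b"data: ") and b"[DONE]" not in line:
                timestamps.append(time.perf_counter_ns())
    return timestamps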

Steady-State Detection

Before injecting prefill, verify all D requests are in active decode:

  1. Wait until each stream has emitted ≥ 32 tokens
  2. Check that the last 8 inter-token intervals are within 2× of each other (no startup variance)
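Both checks can be expressed as a small predicate. The sketch reads "within 2× of each other" as max interval ≤ 2 × min interval over the last 8 intervals, which is one possible interpretation; `is_steady_state` and `all_steady` are illustrative names.

```python
def is_steady_state(
    timestamps_ns: list[int],
    min_tokens: int = 32,
    window: int = 8,
    max_ratio: float = 2.0,
) -> bool:
    """True when a stream has enough tokens and its recent intervals are stable."""
    if len(timestamps_ns) < max(min_tokens, window + 1):
        return False
    # window intervals need window + 1 timestamps.
    recent = timestamps_ns[-(window + 1):]
    intervals = [b - a for a, b in zip(recent, recent[1:])]
    shortest = min(intervals)
    if shortest <= 0:
        return False  # clock anomaly or duplicate timestamps
    return max(intervals) <= max_ratio * shortest


def all_steady(streams: list[list[int]]) -> bool:
    """True when every background decode stream passes the check."""
    return all(is_steady_state(s) for s in streams)
```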

Output Format

Per-Run Record (results/{chunk_size}/D{d}_P{p}.json)

{
    "config": {
        "decode_batch_size": 4,
        "new_prefill_tokens": 8192,
        "chunk_size": 8192,
        "model": "Qwen3-Coder-30B-A3B-Instruct",
        "gpu": "H20"
    },
    "baseline": {
        "tpot_p50_ms": 42.3,
        "tpot_p90_ms": 45.1,
        "tpot_p99_ms": 48.7,
        "step_time_ms": 43.0
    },
    "interference": {
        "tpot_during_prefill_p50_ms": 89.2,
        "tpot_during_prefill_p90_ms": 95.4,
        "tpot_after_prefill_p50_ms": 43.1,
        "num_chunks_actual": 1,
        "prefill_duration_ms": 91.0,
        "prefill_ttft_ms": 91.0
    },
    "derived": {
        "tpot_penalty_p50_ms": 46.9,
        "tpot_penalty_ratio": 1.11,
        "total_interference_ms": 46.9,
        "decode_tokens_delayed": 4
    }
}
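The sample values above are consistent with tpot_penalty_p50_ms = during-prefill p50 minus baseline p50, and tpot_penalty_ratio = penalty / baseline. The second formula is inferred from the numbers rather than stated in the design, so treat it as an assumption. A sketch with the illustrative name `derive_metrics`:

```python
def derive_metrics(baseline_p50_ms: float, during_prefill_p50_ms: float) -> dict:
    """Compute the p50 penalty and its ratio to the baseline."""
    penalty_ms = during_prefill_p50_ms - baseline_p50_ms
    return {
        "tpot_penalty_p50_ms": round(penalty_ms, 1),
        # Assumed definition: penalty relative to baseline TPOT.
        "tpot_penalty_ratio": round(penalty_ms / baseline_p50_ms, 2),
    }
```

With the record's values, `derive_metrics(42.3, 89.2)` reproduces the 46.9 ms penalty and the 1.11 ratio.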

Aggregated Table (results/interference_table.csv)

chunk_size,decode_batch_size,new_prefill_tokens,tpot_baseline_ms,tpot_interference_ms,tpot_penalty_ms,penalty_ratio,num_chunks,prefill_duration_ms
8192,4,8192,42.3,89.2,46.9,1.11,1,91.0
8192,4,16384,42.3,88.5,46.2,1.09,2,178.3
8192,8,8192,78.1,156.3,78.2,1.00,1,159.0
...

Analysis Deliverables

1. Interference Heatmap

X-axis: new_prefill_tokens, Y-axis: decode_batch_size, Color: tpot_penalty_ratio (TPOT_penalty / TPOT_baseline, the definition implied by the sample record values)

Expected pattern:

  • Penalty increases with decode_batch_size (more requests disrupted)
  • Per-request penalty is roughly constant for a given chunk_size (step time is dominated by the larger of the prefill chunk and the decode batch)
  • Penalty increases with num_chunks (more disrupted iterations)

2. Total Interference Cost Model

total_interference_cost(D, P, chunk_size) = 
    num_chunks(P, chunk_size) × D × tpot_penalty_per_chunk(D, chunk_size)

If the model fits well (R² > 0.9), it becomes the offload decision function.
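The goodness of fit can be checked by comparing measured total interference against the model's predictions. The sketch below computes the standard coefficient of determination in plain Python; `r_squared` is an illustrative helper and the inputs are whatever the sweep produces.

```python
def r_squared(measured: list[float], predicted: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured values have zero variance")
    return 1.0 - ss_res / ss_tot
```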

3. Break-Even Analysis

For each (D, P, chunk_size), compute:

break_even_transfer_time = total_interference_cost(D, P, chunk_size)

If layerwise pipeline transfer cost < break_even_transfer_time, offload wins.

Plot: "offload wins" region in the (D, P) space for chunk_size=8192.

4. Sensitivity to chunk_size

How --max-num-batched-tokens (the effective chunk size) trades step count against step length:

  • Smaller chunk → more chunks → longer total prefill → more interrupted decode steps, but each step is shorter
  • Larger chunk → fewer chunks → shorter total prefill → fewer interrupted steps, but each step takes longer
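The step-count side of this trade-off is simple arithmetic; the per-step penalty has to come from measurement. The sketch assumes the nominal chunk count ceil(P / chunk_size) and takes measured per-step penalties as input. Both function names are illustrative.

```python
import math

CHUNK_SIZES = (2048, 4096, 8192, 16384)


def interrupted_steps(new_prefill_tokens: int, chunk_sizes=CHUNK_SIZES) -> dict:
    """Nominal number of decode steps that share a batch with a prefill chunk."""
    return {c: math.ceil(new_prefill_tokens / c) for c in chunk_sizes}


def total_delay_ms(new_prefill_tokens: int, penalty_ms_by_chunk: dict) -> dict:
    """Per-request delay: interrupted steps times the measured per-step penalty."""
    steps = interrupted_steps(new_prefill_tokens, tuple(penalty_ms_by_chunk))
    return {c: steps[c] * penalty_ms_by_chunk[c] for c in penalty_ms_by_chunk}
```

For P = 32768 the nominal counts are 16, 8, 4, and 2 steps for the four chunk sizes, so a smaller chunk only wins if its per-step penalty shrinks faster than its step count grows.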

Risks & Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| CUDA graph optimization masks real penalty | Underestimate interference | Run with --enforce-eager as ablation |
| vLLM internal batching merges decode+prefill differently than expected | Wrong chunk count | Verify with /metrics endpoint (vllm:num_prefill_tokens_iter) |
| Network jitter in timestamp collection | Noisy TPOT | Run on localhost (127.0.0.1), use 5 repetitions per config |
| KV cache pressure at high D | OOM or eviction | Keep decode context at 4K, monitor gpu_cache_usage_perc |
| MoE routing variance | Non-deterministic step time | Use greedy decoding, report p50 over 5 runs |

Execution Estimate

| Phase | Time |
| --- | --- |
| Per configuration (1 baseline + 1 interference, 5 reps) | ~3 min |
| Full sweep (196 configs × 3 min) | ~10 hours |
| Reduced sweep (chunk_size=8192 only, 49 configs) | ~2.5 hours |
| Analysis & plotting | 1 hour |

Recommended: Start with reduced sweep (default chunk_size=8192), then expand to other chunk sizes if results are promising.


Success Criteria

  1. Interference is measurable: tpot_penalty_ratio > 1.1 for D ≥ 4 and P ≥ 4096
  2. Model fits: Linear or polynomial model of total_interference_cost achieves R² > 0.85
  3. Break-even exists: There exists a realistic (D, P) region where interference_cost > 50ms (layerwise pipeline transfer budget)
  4. Reproducible: Coefficient of variation < 15% across 5 repetitions per config
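Criterion 4 can be checked per configuration with a few lines. The sketch assumes the sample standard deviation (n - 1 denominator) over the 5 repetitions; the design does not say which estimator is intended. Function names are illustrative.

```python
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def is_reproducible(samples: list[float], max_cv: float = 0.15) -> bool:
    """True when repetitions of one config vary by less than max_cv."""
    return coefficient_of_variation(samples) < max_cv
```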