
Gahow Wang f784e49c07 Microbench: prefill-decode interference + PD transfer lifecycle

Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.

2026-05-26 00:57:06 +08:00


Prefill-Decode Interference Microbenchmark

Goal

Quantify the per-chunk TPOT degradation caused by prefill interference on ongoing decode batches, producing a lookup table:

f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms

This table is the foundation for the runtime offload decision:

interference_cost = num_chunks × decode_batch_size × TPOT_penalty
if interference_cost > layerwise_transfer_cost:
    offload()
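A minimal Python sketch of this decision rule, assuming the lookup table is held as a dict keyed by (D, P, chunk_size) and that the caller supplies the transfer cost. The names `PenaltyTable`, `interference_cost_ms`, and `should_offload` are hypothetical, and the single table entry is taken from the illustrative sample row in the output format section, not from a measurement.

```python
import math

# Hypothetical table: (decode_batch_size, new_prefill_tokens, chunk_size) -> TPOT penalty in ms.
PenaltyTable = dict[tuple[int, int, int], float]


def interference_cost_ms(table: PenaltyTable, d: int, p: int, chunk_size: int) -> float:
    """num_chunks x decode_batch_size x TPOT_penalty, as in the rule above."""
    num_chunks = math.ceil(p / chunk_size)  # nominal chunk count; verify against the scheduler
    return num_chunks * d * table[(d, p, chunk_size)]


def should_offload(table: PenaltyTable, d: int, p: int, chunk_size: int,
                   transfer_cost_ms: float) -> bool:
    """Offload when keeping the prefill local costs more than moving it."""
    return interference_cost_ms(table, d, p, chunk_size) > transfer_cost_ms


# Illustrative entry only (D=4, P=16384, chunk_size=8192, penalty 46.2 ms).
table: PenaltyTable = {(4, 16384, 8192): 46.2}
print(should_offload(table, 4, 16384, 8192, transfer_cost_ms=50.0))
```

Keeping the transfer cost as an argument lets the same function serve both the runtime decision and the offline break-even analysis.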

Hardware & Model

Parameter	Value
GPU	NVIDIA H20 96GB × 1 (single instance)
Model	Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active)
TP	1
`max_model_len`	200000
`block_size`	16 (vLLM default)
`enable_prefix_caching`	true
`enable_chunked_prefill`	true
`max_num_batched_tokens`	8192 (H20 default for openai API server)
`gpu_memory_utilization`	0.9

Experiment Design

Independent Variables

Variable	Values	Rationale
`decode_batch_size` (D)	0, 1, 2, 4, 6, 8, 12	Covers low→saturated decode concurrency
`new_prefill_tokens` (P)	512, 1024, 2048, 4096, 8192, 16384, 32768	Range from small warm turn to full cold heavy
`chunk_size`	2048, 4096, 8192 (default), 16384	Sweep the dominant scheduling knob

Full sweep: 7 × 7 × 4 = 196 configurations.

Dependent Variables (Measured)

Metric	Definition	How to measure
`TPOT_baseline`	Inter-token latency with decode-only batch (no prefill)	Send D dummy decode requests, measure steady-state TPOT
`TPOT_interference`	Inter-token latency while prefill chunks execute	Measure TPOT of ongoing decode requests during the window when prefill chunks are being processed
`TPOT_penalty`	`TPOT_interference - TPOT_baseline`	Per-token penalty from prefill co-execution
`prefill_duration`	Wall time from prefill request submission to first token	Includes queuing + chunked execution
`num_chunks_actual`	Number of scheduler iterations the prefill occupied	From vLLM engine logs or step counter
`step_time_baseline`	Scheduler step duration with decode-only	From engine internals or proxy measurement
`step_time_mixed`	Scheduler step duration with prefill+decode	Same

Control Variables (Fixed per experiment)

Variable	Value	Rationale
Decode output length	256 tokens each	Long enough to span the entire prefill window
Decode context length	4096 tokens each	Realistic session history, pre-warmed via prefix cache
Prefill output length	1 token	Minimize post-prefill decode interference
KV cache state	Prefill is fully cold (no cache hit)	Worst case: maximum chunks
Temperature	0 (greedy)	Deterministic, no sampling variance

Protocol

Phase 1: Baseline TPOT Measurement (Decode-Only)

1. Launch vLLM instance (TP=1, single H20 GPU)
2. Pre-fill D decode "seed" requests:
   - Each has 4096-token context (pre-warmed via identical prompt prefix)
   - Set max_tokens=256, temperature=0
3. Once all D requests are in active decode, start timer
4. Collect per-token timestamps for each decode request over 256 tokens
5. Compute TPOT_baseline = median(inter-token-intervals) across all D requests
6. Record step_time_baseline from vLLM metrics endpoint (/metrics)

Warm-up: Discard first 16 tokens per request (CUDA graph warm-up, attention ramp-up).
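Steps 4 and 5 plus the warm-up rule can be sketched as follows. This assumes the timestamps are nanosecond values from `time.perf_counter_ns()` and reads "median across all D requests" as a median over the pooled intervals; a median of per-request medians would be an equally defensible reading. The function names are hypothetical.

```python
from statistics import median

WARMUP_TOKENS = 16  # first tokens discarded per request, as stated above


def inter_token_intervals_ms(timestamps_ns: list[int],
                             warmup: int = WARMUP_TOKENS) -> list[float]:
    """Intervals between consecutive tokens after dropping the first `warmup` tokens."""
    kept = timestamps_ns[warmup:]
    return [(b - a) / 1e6 for a, b in zip(kept, kept[1:])]


def tpot_baseline_ms(streams: list[list[int]]) -> float:
    """Median inter-token interval pooled across all D decode streams."""
    pooled = [iv for ts in streams for iv in inter_token_intervals_ms(ts)]
    return median(pooled)
```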

Phase 2: Interference Measurement (Prefill Injected)

1. Same setup as Phase 1: D decode requests in steady-state
2. At token ~32 of the decode stream, inject prefill request:
   - Input: P random tokens (no prefix cache hit)
   - max_tokens=1
3. Continue collecting per-token timestamps for all D decode requests
4. Measure:
   a. TPOT of decode requests DURING prefill window
      (from prefill injection to prefill's first token)
   b. TPOT of decode requests AFTER prefill completes (recovery)
   c. Total prefill_duration
   d. num_chunks = ceil(P / chunk_size) [verify against actual]
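The measurements in step 4 could be separated as in the sketch below. It assumes all timestamps share one `perf_counter_ns()` clock, and it counts any decode interval that overlaps the prefill window as "during"; that boundary rule is an assumption, not something the protocol specifies. The function names are hypothetical.

```python
import math


def expected_num_chunks(p: int, chunk_size: int) -> int:
    """Nominal chunk count from step 4d; compare against the scheduler's actual count."""
    return math.ceil(p / chunk_size)


def split_intervals_ms(decode_ts_ns: list[int], inject_ns: int,
                       first_token_ns: int) -> tuple[list[float], list[float]]:
    """Split decode inter-token intervals into (during prefill, after prefill)."""
    during: list[float] = []
    after: list[float] = []
    for a, b in zip(decode_ts_ns, decode_ts_ns[1:]):
        interval_ms = (b - a) / 1e6
        if b <= inject_ns:
            continue  # interval ended before the prefill was injected
        if a >= first_token_ns:
            after.append(interval_ms)  # recovery phase
        else:
            during.append(interval_ms)  # overlaps the prefill window
    return during, after
```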

Phase 3: Repeat for All Configurations

For chunk_size in [2048, 4096, 8192, 16384]:
    Configure vLLM with --max-num-batched-tokens=chunk_size
    Restart instance (clean KV cache state)
    
    For D in [0, 1, 2, 4, 6, 8, 12]:
        For P in [512, 1024, 2048, 4096, 8192, 16384, 32768]:
            Run Phase 1 (baseline) → record TPOT_baseline[D]
            Run Phase 2 (interference) → record TPOT_interference[D, P]
            Compute TPOT_penalty[D, P] = TPOT_interference - TPOT_baseline
            Wait 5s for KV eviction and state cleanup

Optimization: Phase 1 only needs to run once per (chunk_size, D) pair since it doesn't depend on P.
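One way to express the loop with this optimization applied is a plan generator that emits the baseline step once per (chunk_size, D) pair. This is a sketch only: `sweep_plan` and the step labels are hypothetical, and actually restarting vLLM or issuing requests is left to the caller.

```python
CHUNK_SIZES = [2048, 4096, 8192, 16384]
DECODE_BATCH_SIZES = [0, 1, 2, 4, 6, 8, 12]
PREFILL_TOKENS = [512, 1024, 2048, 4096, 8192, 16384, 32768]


def sweep_plan():
    """Yield (step, chunk_size, D, P); D and P are None where they do not apply."""
    for chunk_size in CHUNK_SIZES:
        yield ("restart", chunk_size, None, None)  # clean KV cache state
        for d in DECODE_BATCH_SIZES:
            yield ("baseline", chunk_size, d, None)  # Phase 1, once per (chunk_size, D)
            for p in PREFILL_TOKENS:
                yield ("interference", chunk_size, d, p)  # Phase 2


plan = list(sweep_plan())
counts = {kind: sum(1 for step in plan if step[0] == kind)
          for kind in ("restart", "baseline", "interference")}
print(counts)
```

The interference count matches the 196 configurations of the full sweep, while the baseline runs drop from 196 to 28.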

Implementation

Client Architecture

┌──────────────────────────────────────────────────┐
│                  Microbench Driver                │
├──────────────────────────────────────────────────┤
│  1. Spawn D "background decode" streams (async)  │
│  2. Wait for steady-state (all D in decode)      │
│  3. Inject prefill request                       │
│  4. Collect streaming token timestamps           │
│  5. Compute metrics                              │
└──────────────────────────────────────────────────┘
         │ OpenAI-compatible streaming API
         ▼
┌──────────────────────────────────────────────────┐
│           vLLM Instance (single GPU)             │
│  --enable-chunked-prefill                        │
│  --max-num-batched-tokens={chunk_size}           │
│  --enable-prefix-caching                         │
└──────────────────────────────────────────────────┘
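The driver's five steps can be sketched with asyncio as below. To watch progress while the streams are still running, this sketch assumes each stream coroutine appends timestamps to a caller-provided list rather than returning them at the end. `run_trial`, `start_decode`, and `start_prefill` are hypothetical names; the two callables stand in for whatever issues the real streaming requests.

```python
import asyncio
import time


async def run_trial(start_decode, start_prefill, d: int, steady_tokens: int = 32):
    """start_decode(sink) and start_prefill(sink) are caller-supplied coroutines that
    append one perf_counter_ns() timestamp to `sink` per streamed token."""
    decode_sinks: list[list[int]] = [[] for _ in range(d)]
    decode_tasks = [asyncio.create_task(start_decode(sink)) for sink in decode_sinks]

    # Step 2: wait until every decode stream has emitted enough tokens.
    while any(len(sink) < steady_tokens for sink in decode_sinks):
        if any(task.done() for task in decode_tasks):
            raise RuntimeError("a decode stream ended before reaching steady state")
        await asyncio.sleep(0.001)

    # Step 3: inject the prefill while the decode streams keep running.
    prefill_sink: list[int] = []
    inject_ns = time.perf_counter_ns()
    await start_prefill(prefill_sink)

    # Step 4: let the decode streams finish so recovery is captured too.
    await asyncio.gather(*decode_tasks)
    return decode_sinks, inject_ns, prefill_sink
```

Metrics (step 5) are then computed offline from the returned timestamps, which keeps the timing path free of extra work.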

Request Construction

Decode seed requests (to create ongoing decode batch):

{
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": FIXED_4K_PROMPT}],
    "max_tokens": 256,
    "temperature": 0,
    "stream": True
}

All D requests share the same 4K prompt, so once the first request has populated the prefix cache the remaining seeds hit it and prefill almost instantly, isolating decode-only behavior.

Interference prefill request:

{
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": RANDOM_P_TOKEN_PROMPT}],
    "max_tokens": 1,
    "temperature": 0,
    "stream": True
}

Use random content (UUID-based) to guarantee zero prefix cache hit → forces full P-token prefill.
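A minimal sketch of such a prompt builder, combining a UUID with `time.time_ns()` as the commit note describes. The helper name `unique_prompt` is hypothetical, and `num_words` is a word count, not a token count: hitting exactly P tokens would require sizing the filler with the model's tokenizer, which is omitted here.

```python
import time
import uuid


def unique_prompt(num_words: int) -> str:
    """Build a prompt that starts with a per-call nonce so no earlier prompt shares its prefix."""
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    # Fresh random filler on every call; deterministic filler would be reusable across reps.
    filler = " ".join(uuid.uuid4().hex[:8] for _ in range(num_words))
    return f"[{nonce}] {filler}"
```

Placing the nonce at the very start matters because prefix caching can only reuse blocks from the beginning of a prompt. Any chat-template tokens that precede the user content are still shared between requests.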

Timestamp Collection

Use SSE streaming with time.perf_counter_ns() on each data: {"choices":[{"delta":...}]} chunk:

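import time

import aiohttp


async def collect_stream(session: aiohttp.ClientSession, url: str, payload: dict) -> list[int]:
    """Return one nanosecond timestamp per SSE data chunk (roughly one per token)."""
    timestamps: list[int] = []
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()
        # resp.content is a StreamReader; iterating it yields the body line by line.
        async for line in resp.content:
            if line.startswith(b"data: ") and b"[DONE]" not in line:
                timestamps.append(time.perf_counter_ns())
    return timestamps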

Steady-State Detection

Before injecting prefill, verify all D requests are in active decode:

- Wait until each stream has emitted ≥ 32 tokens
- Check that the last 8 inter-token intervals are within 2× of each other (no startup variance)
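These two checks could be implemented as in the sketch below. It reads "within 2× of each other" as: the largest of the last 8 intervals is at most twice the smallest. That reading, and the names `is_steady` and `all_steady`, are assumptions.

```python
def is_steady(timestamps_ns: list[int], min_tokens: int = 32,
              window: int = 8, max_spread: float = 2.0) -> bool:
    """True once a stream has enough tokens and its last `window` intervals agree."""
    if len(timestamps_ns) < max(min_tokens, window + 1):
        return False
    recent = timestamps_ns[-(window + 1):]  # window + 1 timestamps give `window` intervals
    intervals = [b - a for a, b in zip(recent, recent[1:])]
    return min(intervals) > 0 and max(intervals) <= max_spread * min(intervals)


def all_steady(streams: list[list[int]]) -> bool:
    """All D streams are in steady state; trivially true for D = 0."""
    return all(is_steady(ts) for ts in streams)
```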

Output Format

Per-Run Record (`results/{chunk_size}/D{d}_P{p}.json`)

{
    "config": {
        "decode_batch_size": 4,
        "new_prefill_tokens": 8192,
        "chunk_size": 8192,
        "model": "Qwen3-Coder-30B-A3B-Instruct",
        "gpu": "H20"
    },
    "baseline": {
        "tpot_p50_ms": 42.3,
        "tpot_p90_ms": 45.1,
        "tpot_p99_ms": 48.7,
        "step_time_ms": 43.0
    },
    "interference": {
        "tpot_during_prefill_p50_ms": 89.2,
        "tpot_during_prefill_p90_ms": 95.4,
        "tpot_after_prefill_p50_ms": 43.1,
        "num_chunks_actual": 1,
        "prefill_duration_ms": 91.0,
        "prefill_ttft_ms": 91.0
    },
    "derived": {
        "tpot_penalty_p50_ms": 46.9,
        "tpot_penalty_ratio": 1.11,
        "total_interference_ms": 187.6,
        "decode_tokens_delayed": 4
    }
}
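The penalty fields in the `derived` block follow from the `baseline` and `interference` blocks. The sketch below assumes `tpot_penalty_ratio` means penalty divided by baseline; the record does not define it, but that definition reproduces the sample values. The function name `derive` is hypothetical.

```python
def derive(baseline_p50_ms: float, during_p50_ms: float) -> dict[str, float]:
    """Compute the penalty fields from the p50 TPOT values of one run."""
    penalty = during_p50_ms - baseline_p50_ms
    return {
        "tpot_penalty_p50_ms": round(penalty, 1),
        # Assumed definition: penalty relative to the decode-only baseline.
        "tpot_penalty_ratio": round(penalty / baseline_p50_ms, 2),
    }


print(derive(42.3, 89.2))
```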

Aggregated Table (`results/interference_table.csv`)

chunk_size,decode_batch_size,new_prefill_tokens,tpot_baseline_ms,tpot_interference_ms,tpot_penalty_ms,penalty_ratio,num_chunks,prefill_duration_ms
8192,4,8192,42.3,89.2,46.9,1.11,1,91.0
8192,4,16384,42.3,88.5,46.2,1.09,2,178.3
8192,8,8192,78.1,156.3,78.2,1.00,1,159.0
...

Analysis Deliverables

1. Interference Heatmap

X-axis: new_prefill_tokens, Y-axis: decode_batch_size, Color: tpot_penalty_ratio (TPOT_penalty / TPOT_baseline, consistent with the sample records above)

Expected pattern:

- Total penalty increases with decode_batch_size (more requests disrupted)
- Per-request penalty is roughly constant for a given chunk_size (step time is dominated by the larger of the prefill chunk or the decode batch)
- Total penalty increases with num_chunks (more disrupted iterations)

2. Total Interference Cost Model

total_interference_cost(D, P, chunk_size) = 
    num_chunks(P, chunk_size) × D × tpot_penalty_per_chunk(D, chunk_size)

If the model fits well (R² > 0.9), it becomes the offload decision function.
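The fit check could be computed with a plain coefficient of determination, as sketched below. How the "measured" total is obtained for each configuration is not specified here, so the inputs are left to the caller; `predicted_cost_ms` and `r_squared` are hypothetical names.

```python
def predicted_cost_ms(num_chunks: int, d: int, penalty_per_chunk_ms: float) -> float:
    """The cost model above: num_chunks x D x tpot_penalty_per_chunk."""
    return num_chunks * d * penalty_per_chunk_ms


def r_squared(measured: list[float], predicted: list[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot.

    Requires at least two distinct measured values, otherwise SS_tot is zero.
    """
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot
```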

3. Break-Even Analysis

For each (D, P, chunk_size), compute:

break_even_transfer_time = total_interference_cost(D, P, chunk_size)

If layerwise pipeline transfer cost < break_even_transfer_time, offload wins.

Plot: "offload wins" region in the (D, P) space for chunk_size=8192.

4. Sensitivity to chunk_size

How does --max-num-batched-tokens (the effective chunk size) trade the number of interrupted steps against the length of each step?

- Smaller chunk → more chunks → longer total prefill → more interrupted decode steps, but each step is shorter
- Larger chunk → fewer chunks → shorter total prefill → fewer interrupted steps, but each step takes longer

Risks & Mitigations

Risk	Impact	Mitigation
CUDA graph optimization masks real penalty	Underestimate interference	Run with `--enforce-eager` as ablation
vLLM internal batching merges decode+prefill differently than expected	Wrong chunk count	Verify with `/metrics` endpoint (`vllm:num_prefill_tokens_iter`)
Network jitter in timestamp collection	Noisy TPOT	Run on localhost (127.0.0.1), use 5 repetitions per config
KV cache pressure at high D	OOM or eviction	Keep decode context at 4K, monitor `gpu_cache_usage_perc`
MoE routing variance	Non-deterministic step time	Use greedy decoding, report p50 over 5 runs

Execution Estimate

Phase	Time
Per configuration (1 baseline + 1 interference, 5 reps)	~3 min
Full sweep (196 configs × 3 min)	~10 hours
Reduced sweep (chunk_size=8192 only, 49 configs)	~2.5 hours
Analysis & plotting	1 hour

Recommended: Start with reduced sweep (default chunk_size=8192), then expand to other chunk sizes if results are promising.

Success Criteria

- Interference is measurable: tpot_penalty_ratio > 1.1 for D ≥ 4 and P ≥ 4096
- Model fits: Linear or polynomial model of total_interference_cost achieves R² > 0.85
- Break-even exists: There exists a realistic (D, P) region where interference_cost > 50ms (layerwise pipeline transfer budget)
- Reproducible: Coefficient of variation < 15% across 5 repetitions per config
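The reproducibility criterion can be checked per configuration as in this sketch. It assumes the coefficient of variation is taken over the per-repetition metric (for example the p50 TPOT penalty) using the sample standard deviation; the function names are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by the mean; needs at least two samples."""
    return stdev(samples) / mean(samples)


def is_reproducible(samples: list[float], max_cv: float = 0.15) -> bool:
    """True when the repetitions of one configuration vary by less than max_cv."""
    return coefficient_of_variation(samples) < max_cv
```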
