agentic-kvc/microbench/interference_microbench_design.md

# Prefill-Decode Interference Microbenchmark

## Goal

Quantify the **per-chunk TPOT degradation** caused by prefill interference on ongoing decode batches, producing a lookup table:

```
f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms
```

This table is the foundation for the runtime offload decision:

```
interference_cost = num_chunks × decode_batch_size × TPOT_penalty
if interference_cost > layerwise_transfer_cost:
    offload()
```
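
A minimal Python sketch of this decision, assuming the lookup table is held as a dictionary keyed by `(decode_batch_size, new_prefill_tokens, chunk_size)`. The names `PenaltyTable` and `should_offload` are illustrative and not part of any existing codebase, and `num_chunks` is taken as `ceil(P / chunk_size)`.

```python
from typing import Dict, Tuple

# Hypothetical in-memory form of the lookup table:
# (decode_batch_size, new_prefill_tokens, chunk_size) -> TPOT penalty in ms.
PenaltyTable = Dict[Tuple[int, int, int], float]


def should_offload(
    table: PenaltyTable,
    decode_batch_size: int,
    new_prefill_tokens: int,
    chunk_size: int,
    layerwise_transfer_cost_ms: float,
) -> bool:
    """Return True when the estimated interference exceeds the transfer cost."""
    # Ceiling division: number of chunks the prefill is split into.
    num_chunks = -(-new_prefill_tokens // chunk_size)
    penalty_ms = table[(decode_batch_size, new_prefill_tokens, chunk_size)]
    interference_cost_ms = num_chunks * decode_batch_size * penalty_ms
    return interference_cost_ms > layerwise_transfer_cost_ms
```

A production version would likely interpolate between measured grid points instead of requiring an exact key match.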

---

## Hardware & Model

| Parameter | Value |
|-----------|-------|
| GPU | NVIDIA H20 96GB × 1 (single instance) |
| Model | Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active) |
| TP | 1 |
| `max_model_len` | 200000 |
| `block_size` | 16 (vLLM default) |
| `enable_prefix_caching` | true |
| `enable_chunked_prefill` | true |
| `max_num_batched_tokens` | 8192 (H20 default for the OpenAI-compatible API server) |
| `gpu_memory_utilization` | 0.9 |

---

## Experiment Design

### Independent Variables

| Variable | Values | Rationale |
|----------|--------|-----------|
| `decode_batch_size` (D) | 0, 1, 2, 4, 6, 8, 12 | Covers low→saturated decode concurrency; D = 0 is the prefill-only reference (no decode TPOT to measure, only `prefill_duration`) |
| `new_prefill_tokens` (P) | 512, 1024, 2048, 4096, 8192, 16384, 32768 | Ranges from a small warm turn to a heavy, fully cold prompt |
| `chunk_size` | 2048, 4096, 8192 (default), 16384 | Sweep the dominant scheduling knob |

Full sweep: 7 × 7 × 4 = 196 configurations.
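
The grid can be enumerated as below. This is only a sketch; the constant names are illustrative. `chunk_size` is placed outermost because changing it requires a server restart.

```python
from itertools import product

DECODE_BATCH_SIZES = [0, 1, 2, 4, 6, 8, 12]
PREFILL_TOKENS = [512, 1024, 2048, 4096, 8192, 16384, 32768]
CHUNK_SIZES = [2048, 4096, 8192, 16384]

# Each entry is (chunk_size, decode_batch_size, new_prefill_tokens).
configs = list(product(CHUNK_SIZES, DECODE_BATCH_SIZES, PREFILL_TOKENS))
print(len(configs))  # 196
```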

### Dependent Variables (Measured)

| Metric | Definition | How to measure |
|--------|-----------|----------------|
| `TPOT_baseline` | Inter-token latency with decode-only batch (no prefill) | Send D dummy decode requests, measure steady-state TPOT |
| `TPOT_interference` | Inter-token latency while prefill chunks execute | Measure TPOT of ongoing decode requests during the window when prefill chunks are being processed |
| `TPOT_penalty` | `TPOT_interference - TPOT_baseline` | Per-token penalty from prefill co-execution |
| `prefill_duration` | Wall time from prefill request submission to first token | Includes queuing + chunked execution |
| `num_chunks_actual` | Number of scheduler iterations the prefill occupied | From vLLM engine logs or step counter |
| `step_time_baseline` | Scheduler step duration with decode-only | From engine internals or proxy measurement |
| `step_time_mixed` | Scheduler step duration with prefill+decode | Same |

### Control Variables (Fixed per experiment)

| Variable | Value | Rationale |
|----------|-------|-----------|
| Decode output length | 256 tokens each | Long enough to span the entire prefill window |
| Decode context length | 4096 tokens each | Realistic session history, pre-warmed via prefix cache |
| Prefill output length | 1 token | Minimize post-prefill decode interference |
| KV cache state | Prefill is fully cold (no cache hit) | Worst case: maximum chunks |
| Temperature | 0 (greedy) | Deterministic, no sampling variance |

---

## Protocol

### Phase 1: Baseline TPOT Measurement (Decode-Only)

```
1. Launch vLLM instance (TP=1, single H20 GPU)
2. Submit D decode "seed" requests:
   - Each has 4096-token context (pre-warmed via identical prompt prefix)
   - Set max_tokens=256, temperature=0
3. Once all D requests are in active decode, start timer
4. Collect per-token timestamps for each decode request over 256 tokens
5. Compute TPOT_baseline = median(inter-token-intervals) across all D requests
6. Record step_time_baseline from vLLM metrics endpoint (/metrics)
```

**Warm-up**: Discard first 16 tokens per request (CUDA graph warm-up, attention ramp-up).
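
Steps 4 and 5 together with the warm-up rule could be computed as follows. The sketch assumes each stream is a list of nanosecond arrival timestamps (one per token) and that intervals from all D streams are pooled before taking the median; the function names are illustrative.

```python
from statistics import median
from typing import List

WARMUP_TOKENS = 16  # tokens discarded at the start of each stream


def inter_token_intervals_ms(
    timestamps_ns: List[int], warmup: int = WARMUP_TOKENS
) -> List[float]:
    """Inter-token intervals in ms, after dropping the warm-up tokens."""
    kept = timestamps_ns[warmup:]
    return [(b - a) / 1e6 for a, b in zip(kept, kept[1:])]


def tpot_baseline_ms(streams: List[List[int]]) -> float:
    """Median inter-token interval pooled across all decode streams."""
    pooled = [iv for ts in streams for iv in inter_token_intervals_ms(ts)]
    return median(pooled)  # raises StatisticsError if no intervals remain
```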

### Phase 2: Interference Measurement (Prefill Injected)

```
1. Same setup as Phase 1: D decode requests in steady-state
2. At token ~32 of the decode stream, inject prefill request:
   - Input: P random tokens (no prefix cache hit)
   - max_tokens=1
3. Continue collecting per-token timestamps for all D decode requests
4. Measure:
   a. TPOT of decode requests DURING prefill window
      (from prefill injection to prefill's first token)
   b. TPOT of decode requests AFTER prefill completes (recovery)
   c. Total prefill_duration
   d. num_chunks = ceil(P / chunk_size) [verify against actual: decode tokens
      share the per-step token budget, so the actual count can be higher]
```
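
One way to separate measurements 4a and 4b is to classify each inter-token interval by the arrival time of its later token. This is a sketch under that assumption; the function name is illustrative. The decode token produced in the same step as the final prefill chunk may reach the client slightly after the prefill's first token, so the boundary may need a small tolerance in practice.

```python
from statistics import median
from typing import List, Optional, Tuple


def split_tpot_by_window(
    timestamps_ns: List[int], inject_ns: int, prefill_done_ns: int
) -> Tuple[Optional[float], Optional[float]]:
    """Return (median TPOT during prefill, median TPOT after prefill) in ms."""
    during: List[float] = []
    after: List[float] = []
    for prev, cur in zip(timestamps_ns, timestamps_ns[1:]):
        interval_ms = (cur - prev) / 1e6
        if cur <= inject_ns:
            continue  # interval ended before the prefill was injected
        if cur <= prefill_done_ns:
            during.append(interval_ms)
        else:
            after.append(interval_ms)
    return (
        median(during) if during else None,
        median(after) if after else None,
    )
```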

### Phase 3: Repeat for All Configurations

```
For chunk_size in [2048, 4096, 8192, 16384]:
    Configure vLLM with --max-num-batched-tokens=chunk_size
    Restart instance (clean KV cache state)

    For D in [0, 1, 2, 4, 6, 8, 12]:
        For P in [512, 1024, 2048, 4096, 8192, 16384, 32768]:
            Run Phase 1 (baseline) → record TPOT_baseline[D]
            Run Phase 2 (interference) → record TPOT_interference[D, P]
            Compute TPOT_penalty[D, P] = TPOT_interference - TPOT_baseline
            Wait 5s for KV eviction and state cleanup
```

**Optimization**: Phase 1 only needs to run once per (chunk_size, D) pair since it doesn't depend on P.
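
A sketch of the sweep loop with this optimization applied, so the baseline is measured once per `(chunk_size, D)`. `run_baseline` and `run_interference` are hypothetical callables standing in for Phase 1 and Phase 2; the server restart is only indicated by a comment.

```python
from typing import Callable, Dict, Iterable, Tuple


def run_sweep(
    chunk_sizes: Iterable[int],
    batch_sizes: Iterable[int],
    prefill_sizes: Iterable[int],
    run_baseline: Callable[[int, int], float],
    run_interference: Callable[[int, int, int], float],
) -> Dict[Tuple[int, int, int], float]:
    """Return TPOT penalty in ms keyed by (chunk_size, D, P)."""
    batch_sizes = list(batch_sizes)
    prefill_sizes = list(prefill_sizes)
    penalties: Dict[Tuple[int, int, int], float] = {}
    for chunk_size in chunk_sizes:
        # Restart the server with --max-num-batched-tokens=chunk_size here.
        for d in batch_sizes:
            baseline_ms = run_baseline(chunk_size, d)  # once per (chunk_size, D)
            for p in prefill_sizes:
                mixed_ms = run_interference(chunk_size, d, p)
                penalties[(chunk_size, d, p)] = mixed_ms - baseline_ms
    return penalties
```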

---

## Implementation

### Client Architecture

```
┌──────────────────────────────────────────────────┐
│                  Microbench Driver                │
├──────────────────────────────────────────────────┤
│  1. Spawn D "background decode" streams (async)  │
│  2. Wait for steady-state (all D in decode)      │
│  3. Inject prefill request                       │
│  4. Collect streaming token timestamps           │
│  5. Compute metrics                              │
└──────────────────────────────────────────────────┘
         │ OpenAI-compatible streaming API
         ▼
┌──────────────────────────────────────────────────┐
│           vLLM Instance (single GPU)             │
│  --enable-chunked-prefill                        │
│  --max-num-batched-tokens={chunk_size}           │
│  --enable-prefix-caching                         │
└──────────────────────────────────────────────────┘
```
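
The driver's control flow might look like the following `asyncio` sketch. The three callables (`decode_stream`, `wait_for_steady_state`, `prefill_stream`) are hypothetical hooks that the real driver would implement on top of the streaming API; only the ordering of steps is taken from the diagram.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple

Timestamps = List[int]


async def run_trial(
    d: int,
    decode_stream: Callable[[int], Awaitable[Timestamps]],
    wait_for_steady_state: Callable[[], Awaitable[None]],
    prefill_stream: Callable[[], Awaitable[Timestamps]],
) -> Tuple[int, Timestamps, List[Timestamps]]:
    # 1. Spawn D background decode streams; they run concurrently.
    decode_tasks = [asyncio.ensure_future(decode_stream(i)) for i in range(d)]
    # 2. Block until the caller-supplied steady-state check passes.
    await wait_for_steady_state()
    # 3. Inject the prefill request and record when it was sent.
    inject_ns = time.perf_counter_ns()
    prefill_ts = await prefill_stream()
    # 4. Let the decode streams finish and collect their timestamps.
    decode_ts = await asyncio.gather(*decode_tasks)
    return inject_ns, prefill_ts, list(decode_ts)
```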

### Request Construction

**Decode seed requests** (to create ongoing decode batch):
```python
{
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": FIXED_4K_PROMPT}],
    "max_tokens": 256,
    "temperature": 0,
    "stream": True
}
```

All D requests share the same 4K prompt. After a one-time warm-up request has populated the prefix cache, each seed's prefill is a cache hit and completes almost immediately, which isolates decode-only behavior.

**Interference prefill request**:
```python
{
    "model": "Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": RANDOM_P_TOKEN_PROMPT}],
    "max_tokens": 1,
    "temperature": 0,
    "stream": True
}
```

Use random content (UUID-based) to guarantee zero prefix cache hit → forces full P-token prefill.
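
A sketch of how such a prompt could be built. `count_tokens` is a hypothetical callable wrapping the model's tokenizer; per-piece counts are only an estimate because tokenization is not strictly additive, so the final length should be checked against the prompt token count the server reports.

```python
import uuid
from typing import Callable, List


def build_random_prompt(
    target_tokens: int, count_tokens: Callable[[str], int]
) -> str:
    """Concatenate random UUID strings until roughly target_tokens is reached."""
    parts: List[str] = []
    total = 0
    while total < target_tokens:
        piece = uuid.uuid4().hex
        parts.append(piece)
        # Estimate the contribution of this piece; guard against a zero count
        # so the loop always terminates.
        total += max(1, count_tokens(" " + piece))
    return " ".join(parts)
```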

### Timestamp Collection

Use SSE streaming with `time.perf_counter_ns()` on each `data: {"choices":[{"delta":...}]}` chunk:

```python
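import time

import aiohttp


async def collect_stream(session: aiohttp.ClientSession, url: str, payload: dict) -> list[int]:
    """Return one nanosecond timestamp per SSE data event (roughly one per token)."""
    timestamps = []
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()  # fail fast on HTTP errors instead of timing an error body
        # Iterating resp.content yields the response body line by line.
        async for line in resp.content:
            if line.startswith(b"data: ") and b"[DONE]" not in line:
                timestamps.append(time.perf_counter_ns())
    return timestamps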
```
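
`collect_stream` timestamps every `data:` line, which can include events that carry no generated text, such as a role-only first chunk or a final chunk with only a finish reason. A stricter filter is sketched below. `is_token_event` is an illustrative helper and assumes the chat-completions chunk layout shown above. It would also skip a token that decodes to an empty string, and it counts an event carrying several tokens only once, so it remains an approximation.

```python
import json


def is_token_event(line: bytes) -> bool:
    """True if an SSE line carries non-empty generated content."""
    prefix = b"data: "
    if not line.startswith(prefix):
        return False
    body = line[len(prefix):].strip()
    if body == b"[DONE]":
        return False
    try:
        event = json.loads(body)
    except json.JSONDecodeError:
        return False
    choices = event.get("choices") or []
    return any((c.get("delta") or {}).get("content") for c in choices)
```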

### Steady-State Detection

Before injecting prefill, verify all D requests are in active decode:
1. Wait until each stream has emitted ≥ 32 tokens
2. Check that the last 8 inter-token intervals are within 2× of each other (max/min ≤ 2), i.e. no residual startup variance
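
Both checks can be expressed per stream as below. The sketch reads "within 2× of each other" as a max/min ratio of at most 2; the function name and defaults are illustrative.

```python
from typing import List


def is_steady_state(
    timestamps_ns: List[int],
    min_tokens: int = 32,
    window: int = 8,
    max_spread: float = 2.0,
) -> bool:
    """True once a stream has enough tokens and its recent intervals are stable."""
    if len(timestamps_ns) < max(min_tokens, window + 1):
        return False
    recent = timestamps_ns[-(window + 1):]  # window intervals need window+1 stamps
    intervals = [b - a for a, b in zip(recent, recent[1:])]
    if min(intervals) <= 0:
        return False  # duplicate or out-of-order timestamps
    return max(intervals) / min(intervals) <= max_spread
```

The driver would inject the prefill only when this holds for all D streams.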

---

## Output Format

### Per-Run Record (`results/{chunk_size}/D{d}_P{p}.json`)

```json
{
    "config": {
        "decode_batch_size": 4,
        "new_prefill_tokens": 8192,
        "chunk_size": 8192,
        "model": "Qwen3-Coder-30B-A3B-Instruct",
        "gpu": "H20"
    },
    "baseline": {
        "tpot_p50_ms": 42.3,
        "tpot_p90_ms": 45.1,
        "tpot_p99_ms": 48.7,
        "step_time_ms": 43.0
    },
    "interference": {
        "tpot_during_prefill_p50_ms": 89.2,
        "tpot_during_prefill_p90_ms": 95.4,
        "tpot_after_prefill_p50_ms": 43.1,
        "num_chunks_actual": 1,
        "prefill_duration_ms": 91.0,
        "prefill_ttft_ms": 91.0
    },
    "derived": {
        "tpot_penalty_p50_ms": 46.9,
        "tpot_penalty_ratio": 1.11,
        "total_interference_ms": 187.6,
        "decode_tokens_delayed": 4
    }
}
```

### Aggregated Table (`results/interference_table.csv`)

```csv
chunk_size,decode_batch_size,new_prefill_tokens,tpot_baseline_ms,tpot_interference_ms,tpot_penalty_ms,penalty_ratio,num_chunks,prefill_duration_ms
8192,4,8192,42.3,89.2,46.9,1.11,1,91.0
8192,4,16384,42.3,88.5,46.2,1.09,2,178.3
8192,8,8192,78.1,156.3,78.2,1.00,1,159.0
...
```
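
A sketch of how a per-run record could be flattened into one row of this table. It assumes `penalty_ratio` is `tpot_penalty / tpot_baseline`, which is the definition the sample values are consistent with; `record_to_row` and `rows_to_csv` are illustrative names.

```python
import csv
import io
from typing import Dict, Iterable

FIELDS = [
    "chunk_size", "decode_batch_size", "new_prefill_tokens",
    "tpot_baseline_ms", "tpot_interference_ms", "tpot_penalty_ms",
    "penalty_ratio", "num_chunks", "prefill_duration_ms",
]


def record_to_row(record: Dict) -> Dict:
    """Flatten one per-run JSON record into a CSV row (p50 values)."""
    cfg, base, intf = record["config"], record["baseline"], record["interference"]
    baseline = base["tpot_p50_ms"]
    mixed = intf["tpot_during_prefill_p50_ms"]
    penalty = mixed - baseline
    return {
        "chunk_size": cfg["chunk_size"],
        "decode_batch_size": cfg["decode_batch_size"],
        "new_prefill_tokens": cfg["new_prefill_tokens"],
        "tpot_baseline_ms": baseline,
        "tpot_interference_ms": mixed,
        "tpot_penalty_ms": round(penalty, 1),
        "penalty_ratio": round(penalty / baseline, 2),  # assumed definition
        "num_chunks": intf["num_chunks_actual"],
        "prefill_duration_ms": intf["prefill_duration_ms"],
    }


def rows_to_csv(rows: Iterable[Dict]) -> str:
    """Serialize rows to CSV text with the header used above."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```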

---

## Analysis Deliverables

### 1. Interference Heatmap

X-axis: `new_prefill_tokens`, Y-axis: `decode_batch_size`, Color: `tpot_penalty_ratio` (`tpot_penalty / tpot_baseline`)

Expected pattern:
- Aggregate penalty (summed over decode requests) increases with `decode_batch_size`, since more requests are disrupted per chunk
- Per-request penalty is roughly constant for a given `chunk_size` (step time is dominated by the larger of prefill-chunk or decode-batch)
- Aggregate penalty increases with `num_chunks` (more disrupted iterations)

### 2. Total Interference Cost Model

```
total_interference_cost(D, P, chunk_size) =
    num_chunks(P, chunk_size) × D × tpot_penalty_per_chunk(D, chunk_size)
```

If the model fits well (R² > 0.85, matching the success criteria), it becomes the offload decision function.
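
Goodness of fit can be checked with a plain coefficient of determination between measured and predicted totals. This is a sketch; how the measured total is aggregated (for example, summed extra decode latency across all D streams during the prefill window) is an assumption left to the analysis code.

```python
from typing import Sequence


def predicted_cost_ms(num_chunks: int, d: int, penalty_per_chunk_ms: float) -> float:
    """Cost model: num_chunks x D x per-chunk TPOT penalty."""
    return num_chunks * d * penalty_per_chunk_ms


def r_squared(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """R^2 = 1 - SS_res / SS_tot for paired measured and predicted values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured values have zero variance")
    return 1.0 - ss_res / ss_tot
```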

### 3. Break-Even Analysis

For each (D, P, chunk_size), compute:
```
break_even_transfer_time = total_interference_cost(D, P, chunk_size)
```

If layerwise pipeline transfer cost < break_even_transfer_time, offload wins.

Plot: "offload wins" region in the (D, P) space for chunk_size=8192.
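
The region to plot can be derived by filtering the cost table against the transfer cost. In this sketch `total_cost_ms` maps `(D, P)` to the modeled interference cost for one chunk size; the function name is illustrative.

```python
from typing import Dict, List, Tuple


def offload_wins_region(
    total_cost_ms: Dict[Tuple[int, int], float], transfer_cost_ms: float
) -> List[Tuple[int, int]]:
    """Return the (D, P) points where transfer is cheaper than interference."""
    return sorted(
        key for key, cost in total_cost_ms.items() if transfer_cost_ms < cost
    )
```

For example, with illustrative costs `{(1, 512): 12.0, (4, 8192): 187.6}` and a 50 ms transfer cost, only `(4, 8192)` falls in the region.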

### 4. Sensitivity to chunk_size

`--max-num-batched-tokens` (the effective chunk size) trades the number of interrupted decode steps against the length of each step:
- Smaller chunk → more chunks → longer total prefill → more interrupted decode steps, but each step is shorter
- Larger chunk → fewer chunks → shorter total prefill → fewer interrupted steps, but each step takes longer
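
The step-count side of this trade-off is simple arithmetic, sketched below. The optional `decode_tokens` argument reflects an assumption that ongoing decode tokens consume part of the per-step token budget, which can add a chunk when P is close to a multiple of `chunk_size`; verify this against `num_chunks_actual`.

```python
import math


def num_chunks(prefill_tokens: int, chunk_size: int, decode_tokens: int = 0) -> int:
    """Scheduler steps needed to prefill, given the per-step token budget."""
    budget = chunk_size - decode_tokens  # tokens left for prefill in each step
    if budget <= 0:
        raise ValueError("decode tokens exhaust the per-step budget")
    return math.ceil(prefill_tokens / budget)


# Number of interrupted steps for the largest prefill in the sweep.
for chunk_size in (2048, 4096, 8192, 16384):
    print(chunk_size, num_chunks(32768, chunk_size))
```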

---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| CUDA graph optimization masks real penalty | Underestimate interference | Run with `--enforce-eager` as ablation |
| vLLM internal batching merges decode+prefill differently than expected | Wrong chunk count | Verify per-step token counts via the `/metrics` endpoint (e.g. the `vllm:iteration_tokens_total` histogram; metric names vary across vLLM versions) |
| Network jitter in timestamp collection | Noisy TPOT | Run on localhost (127.0.0.1), use 5 repetitions per config |
| KV cache pressure at high D | OOM or eviction | Keep decode context at 4K, monitor `gpu_cache_usage_perc` |
| MoE routing variance | Non-deterministic step time | Use greedy decoding, report p50 over 5 runs |

---

## Execution Estimate

| Phase | Time |
|-------|------|
| Per configuration (1 baseline + 1 interference, 5 reps) | ~3 min |
| Full sweep (196 configs × 3 min) | ~10 hours |
| Reduced sweep (chunk_size=8192 only, 49 configs) | ~2.5 hours |
| Analysis & plotting | 1 hour |

**Recommended**: Start with reduced sweep (default chunk_size=8192), then expand to other chunk sizes if results are promising.

---

## Success Criteria

1. **Interference is measurable**: `tpot_penalty_ratio > 1.1` for D ≥ 4 and P ≥ 4096
2. **Model fits**: Linear or polynomial model of `total_interference_cost` achieves R² > 0.85
3. **Break-even exists**: There exists a realistic (D, P) region where `interference_cost > 50ms` (layerwise pipeline transfer budget)
4. **Reproducible**: Coefficient of variation < 15% across 5 repetitions per config
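
Criterion 4 can be checked per configuration as sketched below. The sample standard deviation (n − 1 denominator) is an assumption, since the criterion does not say which estimator to use; the function names are illustrative.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return stdev(samples) / mean(samples)


def is_reproducible(samples: Sequence[float], max_cv: float = 0.15) -> bool:
    """True when the repetitions for one config vary by less than max_cv."""
    return coefficient_of_variation(samples) < max_cv
```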