agentic-kvc/microbench/TODO.md
Gahow Wang f784e49c07 Microbench: prefill-decode interference + PD transfer lifecycle
Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
2026-05-26 00:57:06 +08:00


Microbenchmark TODO

Overview

Two microbenchmarks to establish the quantitative foundation for the elastic migration decision:

  1. Interference Microbench — quantify TPOT degradation from prefill-decode co-execution
  2. Transfer Lifecycle Microbench — profile the full PD-sep request lifecycle, especially RDMA transfer cost

Together they answer: "For a given request in a given runtime state, is offload cheaper than co-execution?"


Microbench 1: Prefill-Decode Interference

Design doc: microbench/interference_microbench_design.md

Produces: f(decode_batch_size, new_prefill_tokens, chunk_size) → TPOT_penalty_ms

Tasks

| # | Task | Status | Notes |
|---|------|--------|-------|
| 1.1 | Implement microbench driver (async streaming client) | TODO | Python, httpx + asyncio SSE |
| 1.2 | Implement steady-state detector (32 tokens, variance check) | TODO | |
| 1.3 | Implement prefill injection + timestamp collection | TODO | |
| 1.4 | Validate on single config (D=4, P=8192, chunk=8192) | TODO | Sanity check: penalty > 0 |
| 1.5 | Run reduced sweep (chunk=8192, 7×7=49 configs, 5 reps) | TODO | ~2.5h on 1×H20 |
| 1.6 | Fit interference cost model | TODO | Linear/polynomial regression |
| 1.7 | Generate heatmap + break-even plot | TODO | |
| 1.8 | (Optional) Full sweep with 4 chunk sizes | TODO | ~10h |
| 1.9 | (Optional) Ablation: `--enforce-eager` vs CUDA graphs | TODO | |
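
Task 1.2 could be sketched roughly as below. This is a minimal illustration, not the planned implementation: the 32-token window comes from the task description, while the class name, the use of the coefficient of variation as the "variance check", and the 0.10 threshold are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class SteadyStateDetector:
    """Flags steady state once the last `window` inter-token gaps are stable."""

    def __init__(self, window: int = 32, max_cv: float = 0.10):
        # window=32 follows task 1.2; max_cv is an assumed threshold.
        self.window = window
        self.max_cv = max_cv
        self.gaps_ms = deque(maxlen=window)
        self.last_ts_ns = None

    def observe(self, ts_ns: int) -> bool:
        """Record one token arrival time; return True if decode looks steady."""
        if self.last_ts_ns is not None:
            self.gaps_ms.append((ts_ns - self.last_ts_ns) / 1e6)
        self.last_ts_ns = ts_ns
        if len(self.gaps_ms) < self.window:
            return False  # not enough gaps observed yet
        avg = mean(self.gaps_ms)
        if avg <= 0:
            return False
        # Coefficient of variation: std-dev relative to the mean gap.
        return pstdev(self.gaps_ms) / avg <= self.max_cv
```

The driver would call `observe(time.perf_counter_ns())` on each streamed token and inject the prefill only after it returns `True`.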

Dependencies

  • Single H20 GPU with Qwen3-Coder-30B-A3B loaded
  • No vLLM source modifications needed (pure client-side measurement)
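
Because the measurement is purely client-side, the prompts themselves must guarantee a cold prefill. The commit note above reports that deterministic prompts hit the prefix cache after rep 0 and that the fix used uuid + time_ns unique prompts. A sketch of one way to build such prompts follows; the helper name, the filler word, and the prompt layout are hypothetical, and the word count only approximates the token count.

```python
import time
import uuid


def make_cold_prompt(approx_words: int, filler: str = "lorem") -> str:
    """Build a prompt whose leading text is unique, so no cached prefix matches."""
    # Nonce combines a random UUID with a nanosecond timestamp (uuid + time_ns).
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    body = " ".join([filler] * approx_words)
    # The nonce goes first: prefix caching matches from the start of the prompt,
    # so a unique suffix alone would not prevent a cache hit.
    return f"[run {nonce}] {body}"
```

The actual prefill length should still be checked against the server-reported prompt token count, since words and tokens do not map one to one.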

Microbench 2: PD Transfer Lifecycle

Design doc: microbench/transfer_lifecycle_design.md

Produces: Per-phase latency breakdown + transfer bandwidth model

Tasks

| # | Task | Status | Notes |
|---|------|--------|-------|
| 2.1 | Write vLLM instrumentation patch (mooncake_connector timestamps) | TODO | ~100 lines, non-invasive logging |
| 2.2 | Write scheduler instrumentation patch (promote timestamps) | TODO | ~20 lines |
| 2.3 | Write proxy instrumentation (routing + dispatch timestamps) | TODO | Already partially in cache_aware_proxy.py |
| 2.4 | Implement cache-seeding script (warm D's prefix cache to target C) | TODO | Send C-token request to D in combined mode |
| 2.5 | Implement lifecycle driver (orchestrates P/D, collects all timestamps) | TODO | |
| 2.6 | Validate on single config (C=0, N=8192, O=1) | TODO | Check all phases sum to E2E |
| 2.7 | Also run same config on combined instance (overhead baseline) | TODO | |
| 2.8 | Run full sweep (6×6×4=144 configs, 5 reps) | TODO | ~2h + 30min cache seeding |
| 2.9 | Verify incremental transfer: bytes_transferred independent of C | TODO | Critical correctness check |
| 2.10 | Fit transfer bandwidth model: t = α + β × bytes | TODO | |
| 2.11 | Generate stacked bar charts + overhead comparison plots | TODO | |
| 2.12 | Compute break-even: when does transfer overhead exceed interference cost? | TODO | Combines results from both microbenchmarks |

Dependencies

  • 2× H20 GPUs (P + D) on same machine (shared clock)
  • vLLM source patch (tasks 2.1-2.3)
  • Mooncake configured for P/D mode
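
The shared clock is what makes the validation in task 2.6 meaningful: timestamps logged by the proxy, P, and D can be subtracted directly. A sketch of that check is below. The event names in the usage note are hypothetical placeholders, not the names the instrumentation patches will emit, and the 0.5 ms tolerance is an assumption.

```python
def phase_breakdown(stamps_ns: dict, order: list) -> dict:
    """Turn ordered event timestamps (ns) into per-phase durations (ms)."""
    return {
        f"{a}->{b}": (stamps_ns[b] - stamps_ns[a]) / 1e6
        for a, b in zip(order, order[1:])
    }


def negative_phases(phases_ms: dict) -> list:
    """Phases with negative duration hint at clock skew or mis-ordered events."""
    return [name for name, ms in phases_ms.items() if ms < 0]


def sums_to_e2e(phases_ms: dict, e2e_ms: float, tol_ms: float = 0.5) -> bool:
    """Compare the summed phases against an independently measured E2E latency."""
    return abs(sum(phases_ms.values()) - e2e_ms) <= tol_ms
```

For example, with an event order such as `["proxy_recv", "prefill_done", "transfer_done", "first_token"]`, the summed phases would be compared against the TTFT that the client measures on its own.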

Combined Analysis (After Both Complete)

| # | Task | Status | Notes |
|---|------|--------|-------|
| 3.1 | Build unified offload decision model | TODO | interference_cost(D,P,chunk) vs transfer_cost(N) |
| 3.2 | Identify "offload wins" region in (D, N, C) space | TODO | The key deliverable |
| 3.3 | Estimate improvement from layerwise pipeline | TODO | transfer_cost_layerwise = transfer_cost / num_layers |
| 3.4 | Quantify maximum possible gain over LMetric | TODO | Upper bound: all requests in "offload wins" region use offload |
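
The comparison in task 3.1 reduces to a single predicate once both cost models are fitted. The sketch below takes the two models as callables, since their functional forms are not yet known. The function name and signature are assumptions, and the layerwise branch applies the idealized estimate from task 3.3 as written.

```python
from typing import Callable, Optional


def should_offload(
    d: int,
    p: int,
    chunk: int,
    n: int,
    interference_cost_ms: Callable[[int, int, int], float],
    transfer_cost_ms: Callable[[int], float],
    num_layers: Optional[int] = None,
) -> bool:
    """Return True when estimated transfer cost is below interference cost."""
    transfer = transfer_cost_ms(n)
    if num_layers:
        # Idealized layerwise pipeline estimate: transfer_cost / num_layers.
        transfer /= num_layers
    return transfer < interference_cost_ms(d, p, chunk)
```

Evaluating this predicate over a grid of operating points would trace out the "offload wins" region that task 3.2 asks for.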

Execution Order

Week 1:
  [1.1-1.4] Implement + validate interference microbench
  [2.1-2.3] Write vLLM instrumentation patches

Week 2:
  [1.5-1.7] Run interference sweep + fit model
  [2.4-2.6] Implement + validate lifecycle microbench

Week 3:
  [2.7-2.12] Run lifecycle sweep + analysis + break-even
  [3.1-3.4] Combined analysis → offload decision model

Critical path: Task 2.1 (vLLM patch) gates all of Microbench 2.

Quick win: Microbench 1 needs zero vLLM modifications, so it can start immediately.


File Structure

microbench/
├── interference_microbench_design.md    # Design doc (done)
├── transfer_lifecycle_design.md         # Design doc (done)
├── TODO.md                              # This file
├── interference/
│   ├── driver.py                        # Microbench 1 client
│   ├── analyze.py                       # Fit model + plots
│   └── results/                         # Output JSON + CSV
├── lifecycle/
│   ├── driver.py                        # Microbench 2 orchestrator
│   ├── seed_cache.py                    # D-side cache warming
│   ├── analyze.py                       # Breakdown plots
│   └── results/                         # Output JSON + CSV
├── patches/
│   ├── 0001-connector-profiling.patch   # Mooncake timestamp logging
│   └── 0002-scheduler-profiling.patch   # Scheduler timestamp logging
└── combined/
    ├── decision_model.py                # Unified offload decision function
    └── plots/                           # Final analysis figures