Frontier vs vLLM H20 Alignment Summary

Date: 2026-06-25

This document summarizes the current ReplayServe comparison between Frontier simulation and real vLLM runs on H20 for Qwen3-30B-A3B. It covers TP=1/2/4, different timestamp scales, and 100/200/500-request windows from qwen_coder_blksz_16.jsonl.

The source data and plots are generated by:

~/.venv/plot/bin/python tools/build_frontier_vllm_alignment_report.py

Generated artifacts:

  • docs/assets/frontier_vllm_alignment/frontier_vllm_alignment.csv
  • docs/assets/frontier_vllm_alignment/frontier_vllm_alignment.json
  • docs/assets/frontier_vllm_alignment/throughput_ratio.png
  • docs/assets/frontier_vllm_alignment/latency_ratios.png
  • docs/assets/frontier_vllm_alignment/tp_scaling_total_tps.png
  • docs/assets/frontier_vllm_alignment/completion_prefix.png

Bottom Line

Functional replay is now usable for the clean 200-request runs:

  • TP1 scale 2 and scale 3, after the Frontier lifecycle fix: 200/200 completed.
  • TP2/TP4 scale 2 and scale 3: 200/200 completed, no preemption on either side, KV block counts matched to vLLM, and an exact match on the trace-side prefix reuse ratio.

Performance is not fully calibrated:

  • TP1 at scale 2 and scale 3 is the closest current operating point: Frontier throughput is about 0.74x vLLM and TPOT p50/p95 is close.
  • TP2/TP4 is functionally aligned but slower: Frontier throughput is only 0.55-0.63x vLLM, and TP4 TPOT is too pessimistic.
  • Frontier underestimates the TP2->TP4 speedup. vLLM improves total throughput by 1.15-1.20x; Frontier improves by only 1.07-1.10x.

Current use: acceptable for integration work and rough qualitative trends, not yet acceptable as a calibrated absolute performance predictor.

Run Matrix

All vLLM runs use vLLM 0.11.1, H20, Qwen3-30B-A3B, max_model_len=32768, max_num_seqs=64, max_num_batched_tokens=32768, gpu_memory_utilization=0.85, prefix caching, and chunked prefill.
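The shared settings above can be written down as engine keyword arguments. This is a minimal sketch, not the launch script used for these runs: the helper name `vllm_engine_kwargs` is hypothetical, the Hugging Face model ID is an assumption, and the flags for prefix caching and chunked prefill are assumed to map to `enable_prefix_caching` and `enable_chunked_prefill`.

```python
def vllm_engine_kwargs(tensor_parallel_size):
    """Return keyword arguments mirroring the shared vLLM settings listed above.

    Hypothetical helper for illustration; the real runs may have been
    launched differently (for example through the vLLM CLI).
    """
    if tensor_parallel_size not in (1, 2, 4):
        raise ValueError("this summary only covers TP=1, 2, and 4")
    return {
        "model": "Qwen/Qwen3-30B-A3B",  # assumed Hugging Face model ID
        "tensor_parallel_size": tensor_parallel_size,
        "max_model_len": 32768,
        "max_num_seqs": 64,
        "max_num_batched_tokens": 32768,
        "gpu_memory_utilization": 0.85,
        "enable_prefix_caching": True,
        "enable_chunked_prefill": True,
    }


if __name__ == "__main__":
    for tp in (1, 2, 4):
        print(tp, vllm_engine_kwargs(tp))
```

Under these assumptions the dictionary could be passed to the vLLM Python entry point as `LLM(**vllm_engine_kwargs(2))`. Only the TP degree changes between runs.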

F/V denotes Frontier/vLLM.

| run | Frontier rows | preempt F/V | prefix hit F/V | total tok/s F/V | ratio | TPOT p50 F/V (s) | E2E p95 F/V (s) |
|---|---|---|---|---|---|---|---|
| TP1 N100 raw | 96/100 | 0/8 | 0.249/0.251 | 2349/3832 | 0.61 | 0.0569/0.0661 | 119.6/97.4 |
| TP1 N500 raw | 439/500 | 0/63 | 0.119/0.387 | 4734/5283 | 0.90 | 0.0564/0.0498 | 397.3/417.4 |
| TP1 N200 scale 0.667 | 176/200 | 0/26 | 0.170/0.270 | 3913/4865 | 0.80 | 0.0584/0.0515 | 189.2/183.7 |
| TP1 N200 scale 2 | 200/200 | 33/43 | 0.231/0.270 | 3506/4743 | 0.74 | 0.0542/0.0497 | 174.5/142.3 |
| TP1 N200 scale 3 | 200/200 | 20/16 | 0.218/0.270 | 3390/4608 | 0.74 | 0.0534/0.0462 | 154.5/122.8 |
| TP2 N200 scale 2 | 200/200 | 0/0 | 0.270/0.270 | 4581/7547 | 0.61 | 0.0430/0.0300 | 106.8/72.5 |
| TP2 N200 scale 3 | 200/200 | 0/0 | 0.270/0.270 | 4062/6426 | 0.63 | 0.0394/0.0191 | 101.6/54.0 |
| TP4 N200 scale 2 | 200/200 | 0/0 | 0.270/0.270 | 5035/9073 | 0.55 | 0.0337/0.0163 | 84.9/43.6 |
| TP4 N200 scale 3 | 200/200 | 0/0 | 0.270/0.270 | 4355/7403 | 0.59 | 0.0311/0.0094 | 83.0/27.9 |

Important prefix caveat: the vLLM prefix-hit column in this table is the trace-side synthetic estimate from the vLLM summaries. For TP1 runs with preemption and finite KV pressure, the observed vLLM scheduler computed: signal is the better comparator. Earlier analysis in docs/rs4_frontier_h20_tp1_alignment.md records those finite-cache comparisons. For TP2/TP4, no preemption occurs and the trace-side prefix ratio matches Frontier exactly.
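The ratio column in the run matrix is Frontier total tok/s divided by vLLM total tok/s. The sketch below recomputes it from the table values. The dictionary and function names are illustrative and are not part of the report tooling.

```python
# (Frontier tok/s, vLLM tok/s), copied from the "total tok/s F/V" column.
TOTAL_TPS = {
    "TP1 N100 raw": (2349, 3832),
    "TP1 N500 raw": (4734, 5283),
    "TP1 N200 scale 0.667": (3913, 4865),
    "TP1 N200 scale 2": (3506, 4743),
    "TP1 N200 scale 3": (3390, 4608),
    "TP2 N200 scale 2": (4581, 7547),
    "TP2 N200 scale 3": (4062, 6426),
    "TP4 N200 scale 2": (5035, 9073),
    "TP4 N200 scale 3": (4355, 7403),
}


def throughput_ratio(frontier_tps, vllm_tps):
    """Frontier/vLLM throughput ratio; values below 1 mean Frontier is slower."""
    return frontier_tps / vllm_tps


# Round to two decimals to match the precision used in the table.
RATIOS = {
    run: round(throughput_ratio(frontier, vllm), 2)
    for run, (frontier, vllm) in TOTAL_TPS.items()
}

if __name__ == "__main__":
    for run, ratio in RATIOS.items():
        print(f"{run:<22} {ratio:.2f}")
```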

Plots

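  • Throughput ratio: docs/assets/frontier_vllm_alignment/throughput_ratio.png
  • Latency ratios: docs/assets/frontier_vllm_alignment/latency_ratios.png
  • TP scaling: docs/assets/frontier_vllm_alignment/tp_scaling_total_tps.png
  • Completion and prefix reuse: docs/assets/frontier_vllm_alignment/completion_prefix.png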

Interpretation

TP1

The early TP1 100/500/scale-0.667 runs are still useful as historical stress points, but they were run before the decode-preemption lifecycle fix. Frontier therefore missed rows in those runs:

  • 96/100 for N100 raw
  • 439/500 for N500 raw
  • 176/200 for N200 scale 0.667

After the lifecycle fix, TP1 scale 2 and scale 3 both complete 200/200. Preemption counts are now comparable to vLLM's:

  • scale 2: Frontier 33 vs vLLM 43
  • scale 3: Frontier 20 vs vLLM 16

TP1 is currently the best-calibrated region. Throughput is about 0.74x vLLM, TPOT p50/p95 is close (p50 is 1.09-1.16x vLLM in the run matrix), and E2E p95 is about 1.23-1.26x vLLM. This is not perfect, but it is usable for integration-level trend checks.

TP2 and TP4

The TP2/TP4 runs are functionally cleaner than TP1:

  • 200/200 completed for all four runs.
  • Frontier and vLLM both report no preemption.
  • Frontier uses explicit vLLM KV capacities:
    • TP2: 69,055 blocks
    • TP4: 177,077 blocks
  • Prefix hit ratio matches exactly: 0.2697549478.

TP2/TP4 true-mixed attention has been profiled. The active RS12 profile includes:

  • attention_tp2_tp4_combined.csv: 36,163 rows, including 1,260 true-mixed prefill+decode rows for TP2/TP4.
  • linear_op_tp2_tp4_full32k.csv: covers up to 32,768 tokens.
  • moe_tp2_tp4_full32k.csv: covers up to 32,768 tokens.

Without the true-mixed rows, Frontier fails with missing attn_decode_in_mixed predictions. With them, all RS12 runs complete.

The remaining TP2/TP4 gap is therefore not a missing-profile blocker. It is a timing-model gap:

  • TP2 throughput is 0.61-0.63x vLLM.
  • TP4 throughput is 0.55-0.59x vLLM.
  • TP4 TPOT p50 is 2.06-3.30x vLLM.

Scaling

For the same first-200 request fixtures:

| fixture | metric | Frontier TP4/TP2 | vLLM TP4/TP2 |
|---|---|---|---|
| scale 2 | total tok/s | 1.10 | 1.20 |
| scale 2 | decode tok/s | 1.10 | 1.20 |
| scale 2 | TPOT p50 | 0.78 | 0.54 |
| scale 3 | total tok/s | 1.07 | 1.15 |
| scale 3 | decode tok/s | 1.07 | 1.15 |
| scale 3 | TPOT p50 | 0.79 | 0.49 |

Frontier sees some TP4 improvement, but much less than real vLLM. This is the clearest current evidence that the simulator is not yet modeling vLLM's TP-dependent decode execution path well enough.
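The TP4/TP2 figures for total tok/s and TPOT p50 can be reproduced from the run matrix by dividing each TP4 value by the matching TP2 value, separately for Frontier and vLLM. The sketch below does this. Names such as `RUNS` and `tp4_over_tp2` are illustrative. Decode tok/s is left out because the run matrix does not list it.

```python
# Per-run values as (Frontier, vLLM), copied from the run matrix.
RUNS = {
    ("TP2", "scale 2"): {"total_tps": (4581, 7547), "tpot_p50": (0.0430, 0.0300)},
    ("TP2", "scale 3"): {"total_tps": (4062, 6426), "tpot_p50": (0.0394, 0.0191)},
    ("TP4", "scale 2"): {"total_tps": (5035, 9073), "tpot_p50": (0.0337, 0.0163)},
    ("TP4", "scale 3"): {"total_tps": (4355, 7403), "tpot_p50": (0.0311, 0.0094)},
}


def tp4_over_tp2(fixture, metric):
    """Return (Frontier TP4/TP2, vLLM TP4/TP2) for one fixture and metric.

    For throughput, values above 1 mean TP4 is faster; for TPOT, values
    below 1 mean TP4 has lower per-token latency.
    """
    frontier_tp2, vllm_tp2 = RUNS[("TP2", fixture)][metric]
    frontier_tp4, vllm_tp4 = RUNS[("TP4", fixture)][metric]
    return round(frontier_tp4 / frontier_tp2, 2), round(vllm_tp4 / vllm_tp2, 2)


if __name__ == "__main__":
    for fixture in ("scale 2", "scale 3"):
        for metric in ("total_tps", "tpot_p50"):
            frontier, vllm = tp4_over_tp2(fixture, metric)
            print(f"{fixture} {metric:<10} Frontier {frontier:.2f}  vLLM {vllm:.2f}")
```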

Likely Gap Sources

The main unresolved issues are:

  • CPU/scheduler overhead is still skipped (skip_cpu_overhead_modeling=true).
  • Decode CUDA graph behavior is not modeled in the current Frontier runs (decode_cuda_graph_mode=none).
  • Random-forest predictors interpolate over profile grids, while real online mixed batches may concentrate on shapes not directly sampled.
  • Some TP4 predictor fit errors are nontrivial, for example attn_kv_cache_save MAPE around 11% in the TP4 profile log.
  • Frontier's scheduler and preemption behavior is close but not identical for TP1 under finite KV pressure.
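For reference, the MAPE figure quoted above is read here as the standard mean absolute percentage error. The profile log's exact definition is not reproduced in this summary, so the sketch below is an assumption, and the sample values are made up purely to show the arithmetic.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Standard definition: mean of |actual - predicted| / |actual|, times 100.
    Every actual value must be non-zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Made-up kernel timings (not from the profile log): each prediction is off by 10%.
    measured = [10.0, 20.0, 40.0]
    predicted = [11.0, 18.0, 44.0]
    print(f"MAPE = {mape(measured, predicted):.1f}%")
```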

ReplayServe TODO

  1. Rerun the 500-request TP1 stress after the decode-preemption lifecycle fix, so the 500-row result is no longer mixed with the old incomplete behavior.
  2. Record vLLM observed scheduler prefix/preemption metrics in machine-readable summaries, not only in docs, especially first-start and last-start computed: ratios.
  3. Add a shape-ledger analysis: compare Frontier's actual online batch shapes against the profile grid and identify hot shapes that are interpolated.
  4. Profile or import vLLM CPU overhead and test skip_cpu_overhead_modeling=false.
  5. Collect kernel-only / decode-CUDA-graph timing profiles before enabling a Frontier CUDA-graph decode mode.
  6. Calibrate TP2/TP4 timing only after the above, because current functional replay is aligned but the TP scaling is not.
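The shape-ledger analysis in item 3 could start from something like the sketch below. It is not an existing tool. It assumes a batch shape can be summarized as a `(prefill_tokens, decode_seqs)` tuple and that the profile grid is available as a set of such tuples; both are hypothetical representations chosen for illustration.

```python
from collections import Counter


def shape_ledger(observed_shapes, profiled_shapes, top_k=5):
    """Rank observed batch shapes that are not present in the profile grid.

    observed_shapes: iterable of hashable shape keys, one per simulated batch.
    profiled_shapes: iterable of shape keys that were profiled directly.
    Returns (hot_off_grid, off_grid_share): the top_k off-grid shapes with
    their counts, and the fraction of all batches that fall off the grid.
    """
    counts = Counter(observed_shapes)
    total = sum(counts.values())
    profiled = set(profiled_shapes)
    # most_common() yields shapes from most to least frequent.
    off_grid = [(shape, n) for shape, n in counts.most_common() if shape not in profiled]
    off_grid_share = sum(n for _, n in off_grid) / total if total else 0.0
    return off_grid[:top_k], off_grid_share


if __name__ == "__main__":
    # Made-up shapes as (prefill_tokens, decode_seqs); not taken from any run.
    observed = [(512, 8)] * 3 + [(1024, 16)] * 2 + [(300, 5)] * 5
    grid = {(512, 8), (1024, 16)}
    hot, share = shape_ledger(observed, grid)
    print(hot, share)
```

Shapes that rank high in the off-grid list are the ones whose timings the predictors have to interpolate, which makes them natural candidates for additional profiling.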