# Frontier vs vLLM H20 Alignment Summary

Date: 2026-06-25

This document summarizes the current ReplayServe comparison between Frontier
simulation and real vLLM runs on H20 for Qwen3-30B-A3B. It covers TP=1/2/4,
different timestamp scales, and 100/200/500-request windows from
`qwen_coder_blksz_16.jsonl`.

The source data and plots are generated by:

```bash
~/.venv/plot/bin/python tools/build_frontier_vllm_alignment_report.py
```

Generated artifacts:

- `docs/assets/frontier_vllm_alignment/frontier_vllm_alignment.csv`
- `docs/assets/frontier_vllm_alignment/frontier_vllm_alignment.json`
- `docs/assets/frontier_vllm_alignment/throughput_ratio.png`
- `docs/assets/frontier_vllm_alignment/latency_ratios.png`
- `docs/assets/frontier_vllm_alignment/tp_scaling_total_tps.png`
- `docs/assets/frontier_vllm_alignment/completion_prefix.png`

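
To inspect the generated CSV without relying on any particular schema, a minimal sketch like the following may help. The helper name `peek_csv` is hypothetical and not part of the repository; it makes no assumption about column names and simply reports whatever header the report script wrote.

```python
import csv
from pathlib import Path


def peek_csv(path, limit=3):
    """Return (column names, first `limit` rows) of a CSV file.

    Hypothetical helper for illustration; it does not assume any schema.
    """
    with Path(path).open(newline="") as handle:
        reader = csv.DictReader(handle)
        rows = []
        for row in reader:
            if len(rows) >= limit:
                break
            rows.append(row)
        # fieldnames is None for an empty file, so fall back to an empty list.
        return list(reader.fieldnames or []), rows


if __name__ == "__main__":
    csv_path = Path("docs/assets/frontier_vllm_alignment/frontier_vllm_alignment.csv")
    if csv_path.exists():
        columns, sample = peek_csv(csv_path)
        print(columns)
        for row in sample:
            print(row)
```

Run it from the repository root after the report script has produced the artifacts; if the file is missing, the sketch prints nothing.
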
## Bottom Line

Functional replay is now usable for the clean 200-request runs:

- TP1 scale 2/3 after the Frontier lifecycle fix: `200/200` completed.
- TP2/TP4 scale 2/3: `200/200` completed, no preemption on either side, matched
  vLLM KV block counts, and an exact match on the trace-side prefix reuse ratio.

Performance is not fully calibrated:

- TP1 scale 2/3 is the closest current operating point: Frontier throughput is
  about `0.74x` vLLM, and TPOT p50/p95 are close.
- TP2/TP4 is functionally aligned but slower: Frontier throughput is only
  `0.55-0.63x` vLLM, and TP4 TPOT is too pessimistic.
- Frontier underestimates the TP2->TP4 speedup. vLLM improves total throughput
  by `1.15-1.20x`; Frontier improves by only `1.07-1.10x`.

Current use: acceptable for integration work and rough qualitative trends, not
yet acceptable as a calibrated absolute performance predictor.

## Run Matrix

All vLLM runs use vLLM 0.11.1, H20, Qwen3-30B-A3B,
`max_model_len=32768`, `max_num_seqs=64`,
`max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`, prefix caching,
and chunked prefill.

In the table, `F/V` means Frontier/vLLM, "Frontier rows" is completed/requested
rows on the Frontier side, and `ratio` is Frontier total tok/s divided by vLLM
total tok/s.

| run | Frontier rows | preempt F/V | prefix hit F/V | total tok/s F/V | ratio | TPOT p50 F/V (s) | E2E p95 F/V (s) |
|---|---:|---:|---:|---:|---:|---:|---:|
| TP1 N100 raw | 96/100 | 0/8 | 0.249/0.251 | 2349/3832 | 0.61 | 0.0569/0.0661 | 119.6/97.4 |
| TP1 N500 raw | 439/500 | 0/63 | 0.119/0.387 | 4734/5283 | 0.90 | 0.0564/0.0498 | 397.3/417.4 |
| TP1 N200 scale 0.667 | 176/200 | 0/26 | 0.170/0.270 | 3913/4865 | 0.80 | 0.0584/0.0515 | 189.2/183.7 |
| TP1 N200 scale 2 | 200/200 | 33/43 | 0.231/0.270 | 3506/4743 | 0.74 | 0.0542/0.0497 | 174.5/142.3 |
| TP1 N200 scale 3 | 200/200 | 20/16 | 0.218/0.270 | 3390/4608 | 0.74 | 0.0534/0.0462 | 154.5/122.8 |
| TP2 N200 scale 2 | 200/200 | 0/0 | 0.270/0.270 | 4581/7547 | 0.61 | 0.0430/0.0300 | 106.8/72.5 |
| TP2 N200 scale 3 | 200/200 | 0/0 | 0.270/0.270 | 4062/6426 | 0.63 | 0.0394/0.0191 | 101.6/54.0 |
| TP4 N200 scale 2 | 200/200 | 0/0 | 0.270/0.270 | 5035/9073 | 0.55 | 0.0337/0.0163 | 84.9/43.6 |
| TP4 N200 scale 3 | 200/200 | 0/0 | 0.270/0.270 | 4355/7403 | 0.59 | 0.0311/0.0094 | 83.0/27.9 |

Important prefix caveat: the vLLM prefix-hit column in this table is the
trace-side synthetic estimate from the vLLM summaries. For TP1 runs with
preemption and finite KV pressure, the observed vLLM scheduler `computed:`
signal is the better comparator. Earlier analysis in
`docs/rs4_frontier_h20_tp1_alignment.md` records those finite-cache comparisons.
For TP2/TP4, no preemption occurs and the trace-side prefix ratio matches
Frontier exactly.

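
As a sanity check on the `ratio` column, the following sketch recomputes Frontier/vLLM throughput from the total tok/s values printed in the Run Matrix table. The dictionary and function names are illustrative only, and the inputs are the rounded table values rather than the raw CSV, so small differences from the generated artifacts are possible.

```python
# (Frontier total tok/s, vLLM total tok/s), copied from the Run Matrix table.
RUNS = {
    "TP1 N100 raw": (2349, 3832),
    "TP1 N500 raw": (4734, 5283),
    "TP1 N200 scale 0.667": (3913, 4865),
    "TP1 N200 scale 2": (3506, 4743),
    "TP1 N200 scale 3": (3390, 4608),
    "TP2 N200 scale 2": (4581, 7547),
    "TP2 N200 scale 3": (4062, 6426),
    "TP4 N200 scale 2": (5035, 9073),
    "TP4 N200 scale 3": (4355, 7403),
}


def throughput_ratio(frontier_tps, vllm_tps):
    """Frontier throughput as a fraction of vLLM throughput."""
    return frontier_tps / vllm_tps


# Round to two decimals to match the precision of the table's ratio column.
RATIOS = {
    name: round(throughput_ratio(frontier, vllm), 2)
    for name, (frontier, vllm) in RUNS.items()
}

if __name__ == "__main__":
    for name, ratio in RATIOS.items():
        print(f"{name:22s} {ratio:.2f}")
```

With these inputs the recomputed ratios agree with the table to two decimals, including the `0.74` for both clean TP1 runs and the `0.55-0.63` band for TP2/TP4.
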
## Plots

![Throughput ratio]()

![Latency ratios]()

![TP scaling]()

![Completion and prefix]()

## Interpretation

### TP1

The early TP1 100/500/scale-0.667 runs are still useful as historical stress
points, but they were run before the decode-preemption lifecycle fix. Frontier
therefore missed rows in those runs:

- `96/100` for N100 raw
- `439/500` for N500 raw
- `176/200` for N200 scale 0.667

After the lifecycle fix, TP1 scale 2 and scale 3 both complete `200/200`.
Preemption counts are now of the same order as vLLM's:

- scale 2: Frontier 33 vs vLLM 43
- scale 3: Frontier 20 vs vLLM 16

TP1 timing is currently the best-calibrated region. Throughput is about
`0.74x` vLLM, TPOT p50/p95 are close, and E2E p95 is about `1.23-1.26x` vLLM.
This is not perfect, but it is usable for integration-level trend checks.

### TP2 and TP4

The TP2/TP4 runs are functionally cleaner than TP1:

- `200/200` completed for all four runs.
- Frontier and vLLM both report no preemption.
- Frontier uses explicit vLLM KV capacities:
  - TP2: 69,055 blocks
  - TP4: 177,077 blocks
- Prefix hit ratio matches exactly: `0.2697549478`.

TP2/TP4 true-mixed attention has been profiled. The active RS12 profile includes:

- `attention_tp2_tp4_combined.csv`: 36,163 rows, including 1,260 true-mixed
  prefill+decode rows for TP2/TP4.
- `linear_op_tp2_tp4_full32k.csv`: covers up to 32,768 tokens.
- `moe_tp2_tp4_full32k.csv`: covers up to 32,768 tokens.

Without the true-mixed rows, Frontier fails with missing
`attn_decode_in_mixed` predictions. With them, all RS12 runs complete.

The remaining TP2/TP4 gap is therefore not a missing-profile blocker. It is a
timing-model gap:

- TP2 throughput is `0.61-0.63x` vLLM.
- TP4 throughput is `0.55-0.59x` vLLM.
- TP4 TPOT p50 is `2.06-3.30x` vLLM.

## Scaling

For the same first-200-request fixtures, each cell is the TP4 value divided by
the TP2 value. For throughput, higher is better; for TPOT, lower is better.

| fixture | metric | Frontier TP4/TP2 | vLLM TP4/TP2 |
|---|---|---:|---:|
| scale 2 | total tok/s | 1.10 | 1.20 |
| scale 2 | decode tok/s | 1.10 | 1.20 |
| scale 2 | TPOT p50 | 0.78 | 0.54 |
| scale 3 | total tok/s | 1.07 | 1.15 |
| scale 3 | decode tok/s | 1.07 | 1.15 |
| scale 3 | TPOT p50 | 0.79 | 0.49 |

Frontier sees some TP4 improvement, but much less than real vLLM. This is the
clearest current evidence that the simulator is not yet modeling vLLM's
TP-dependent decode execution path well enough.

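
The total tok/s and TPOT p50 rows of the scaling table can be reproduced from the N200 rows of the Run Matrix. The sketch below does that arithmetic; the `RUN_MATRIX` layout and the `tp4_over_tp2` helper are illustrative names, not part of the report script, and decode tok/s is omitted because the Run Matrix does not list it.

```python
# (total tok/s, TPOT p50 in seconds) keyed by (system, TP degree, fixture),
# copied from the N200 rows of the Run Matrix.
RUN_MATRIX = {
    ("frontier", 2, "scale 2"): (4581, 0.0430),
    ("frontier", 4, "scale 2"): (5035, 0.0337),
    ("vllm", 2, "scale 2"): (7547, 0.0300),
    ("vllm", 4, "scale 2"): (9073, 0.0163),
    ("frontier", 2, "scale 3"): (4062, 0.0394),
    ("frontier", 4, "scale 3"): (4355, 0.0311),
    ("vllm", 2, "scale 3"): (6426, 0.0191),
    ("vllm", 4, "scale 3"): (7403, 0.0094),
}


def tp4_over_tp2(system, fixture):
    """Return (throughput ratio, TPOT p50 ratio) of TP4 relative to TP2."""
    tps2, tpot2 = RUN_MATRIX[(system, 2, fixture)]
    tps4, tpot4 = RUN_MATRIX[(system, 4, fixture)]
    return round(tps4 / tps2, 2), round(tpot4 / tpot2, 2)


if __name__ == "__main__":
    for fixture in ("scale 2", "scale 3"):
        for system in ("frontier", "vllm"):
            tps_ratio, tpot_ratio = tp4_over_tp2(system, fixture)
            print(f"{fixture} {system:8s} tok/s x{tps_ratio:.2f}  TPOT p50 x{tpot_ratio:.2f}")
```

The gap shows up in both metrics: at scale 2, vLLM's TPOT p50 roughly halves from TP2 to TP4, while Frontier's drops by only about a fifth.
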
## Likely Gap Sources

The main unresolved issues are:

- CPU/scheduler overhead is still skipped (`skip_cpu_overhead_modeling=true`).
- Decode CUDA graph behavior is not modeled in the current Frontier runs
  (`decode_cuda_graph_mode=none`).
- Random-forest predictors interpolate over profile grids, while real online
  mixed batches may concentrate on shapes not directly sampled.
- Some TP4 predictor fit errors are nontrivial, for example
  `attn_kv_cache_save` MAPE around 11% in the TP4 profile log.
- Frontier's scheduler and preemption behavior is close but not identical for
  TP1 under finite KV pressure.

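
For readers unfamiliar with the MAPE figure quoted above, here is a minimal sketch of the standard mean absolute percentage error. The sample latencies are made up for illustration and do not come from the TP4 profile log; how Frontier itself computes the metric is not shown in this document.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for measured, estimate in zip(actual, predicted):
        if measured == 0:
            raise ValueError("MAPE is undefined when a measured value is zero")
        total += abs((measured - estimate) / measured)
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Hypothetical kernel timings in microseconds, for illustration only.
    measured_us = [10.0, 20.0, 40.0]
    predicted_us = [11.0, 18.0, 44.0]
    print(f"MAPE = {mape(measured_us, predicted_us):.1f}%")
```

Each made-up prediction is off by 10% of its measured value, so the sketch reports a MAPE of about 10%, close to the figure quoted for `attn_kv_cache_save`. Because MAPE is an average, an 11% figure can hide much larger errors on individual shapes.
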
## ReplayServe TODO

1. Rerun the 500-request TP1 stress after the decode-preemption lifecycle fix,
   so the 500-row result is no longer mixed with the old incomplete behavior.
2. Record vLLM observed scheduler prefix/preemption metrics in machine-readable
   summaries, not only in docs, especially first-start and last-start
   `computed:` ratios.
3. Add a shape-ledger analysis: compare Frontier's actual online batch shapes
   against the profile grid and identify hot shapes that are interpolated.
4. Profile or import vLLM CPU overhead and test
   `skip_cpu_overhead_modeling=false`.
5. Collect kernel-only / decode-CUDA-graph timing profiles before enabling a
   Frontier CUDA-graph decode mode.
6. Calibrate TP2/TP4 timing only after the above, because current functional
   replay is aligned but the TP scaling is not.

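
The shape-ledger analysis in item 3 could start from something like the sketch below. It is only an outline under stated assumptions: batch shapes are represented as hypothetical `(prefill_tokens, decode_seqs)` tuples, and the function name and inputs are invented for illustration. How Frontier actually logs batch shapes or stores its profile grid is not specified here.

```python
from collections import Counter


def shape_ledger(online_shapes, profiled_shapes, top_k=5):
    """List the most frequent online batch shapes missing from the profile grid.

    Returns (shape, count, share of all online batches) tuples, hottest first.
    Shapes are hashable keys; the tuple layout is a hypothetical choice.
    """
    profiled = set(profiled_shapes)
    counts = Counter(online_shapes)
    total = sum(counts.values())
    ledger = [
        (shape, count, count / total)
        for shape, count in counts.most_common()
        if shape not in profiled  # these shapes would be interpolated
    ]
    return ledger[:top_k]


if __name__ == "__main__":
    # Made-up data: (prefill_tokens, decode_seqs) per scheduled batch.
    online = [(0, 48)] * 6 + [(512, 32)] * 3 + [(0, 64)]
    grid = {(0, 32), (0, 64), (512, 32)}
    for shape, count, share in shape_ledger(online, grid):
        print(f"{shape}: {count} batches ({share:.0%}) not in profile grid")
```

In this made-up input, the `(0, 48)` shape accounts for 60% of batches yet is absent from the grid, which is the kind of hot interpolated shape the TODO item is meant to surface.
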