Add ReplayServe Frontier vLLM alignment report

2026-06-25 17:10:30 +08:00
commit a99bd00782
63 changed files with 17033 additions and 0 deletions


@@ -0,0 +1,740 @@
# RS4 Frontier H20 TP1 Alignment
This note compares Frontier H20 TP1 against the real vLLM TP1 run on dash2 for
`coder_100`.
## Setup
Real vLLM:
- Runtime: vLLM 0.11.1
- Host/GPU: dash2, NVIDIA H20
- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
- TP: 1
- KV capacity: 244,496 tokens = 15,281 blocks at block size 16
- Run: `runs/vllm_gpu_smoke_20260624/tp1_coder100_uncapped`
Frontier:
- Frontier root: `/tmp/replayserve-frontier-rs1b`
- Frontier commit: `d9cfeb6d8791fbf2f295dd9744c56a666171776e`
- Model config name: `qwen3-a3b-30b-moe`
- Device: `h20`
- Network node SKU: `h20_dgx`
- TP: `attn_tensor_parallel_size=1`, `moe_tensor_parallel_size=1`,
`moe_expert_parallel_size=1`
- `max_tokens_in_batch=32768`, `batch_size_cap=64`, block size 16
- Prefix cache on, chunked prefill on
- `long_prefill_token_threshold=32768`
- Config: `configs/rs4_frontier_h20_tp1.json`
- Run: `runs/rs4_frontier_h20_tp1_20260624`
The high long-prefill threshold is deliberate. Frontier's earlier run with
`long_prefill_token_threshold=64` under-counted prefix hits because long prompts
were admitted in 64-token chunks, unlike the current real vLLM run.
## KV Capacity
| run | KV blocks | KV tokens | note |
|---|---:|---:|---|
| Frontier `planner_kv` | 17,385 | 278,160 | Frontier H20 memory planner, no non-KV overhead |
| Frontier `vllm_kv_15281` | 15,281 | 244,496 | Explicitly matched to real vLLM TP1 |
| vLLM TP1 | 15,281 | 244,496 | From vLLM memory profiling |
So only `vllm_kv_15281` has the same KV block count as real vLLM TP1.
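The block and token counts in the table are related by the block size of 16. A minimal sketch of that arithmetic follows; the helper name `kv_tokens` and the dictionary layout are illustrative, not part of the ReplayServe or Frontier code.

```python
BLOCK_SIZE = 16  # tokens per KV block, as stated in the setup section


def kv_tokens(num_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the KV capacity in tokens for a given block count."""
    return num_blocks * block_size


# Block counts copied from the KV capacity table.
kv_blocks = {
    "planner_kv": 17_385,
    "vllm_kv_15281": 15_281,
    "vllm_tp1": 15_281,
}

capacity = {name: kv_tokens(blocks) for name, blocks in kv_blocks.items()}
extra_blocks = kv_blocks["planner_kv"] - kv_blocks["vllm_tp1"]

for name, tokens in capacity.items():
    print(f"{name}: {kv_blocks[name]} blocks = {tokens} tokens")
print(f"planner_kv has {extra_blocks} more blocks than real vLLM TP1")
```

Under these numbers the Frontier planner over-provisions by 2,104 blocks relative to the vLLM memory profile, which is why the explicit `vllm_kv_15281` point is used for comparison.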
## Results
| run | completed | prefix hit tokens / ratio | preemptions | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 | decode tok/s |
|---|---:|---:|---:|---:|---:|---:|---:|
| Frontier `planner_kv` | 96/100 | 110,608 / 0.240691 | 0 | 0.986/128.991s | 0.582/0.582s | 279.092/1706.675s | 19.4 |
| Frontier `vllm_kv_15281` | 92/100 | 103,168 / 0.242542 | 0 | 0.964/182.639s | 0.582/0.582s | 305.290/1765.347s | 19.4 |
| vLLM TP1 real | 100/100 | 119,152 / 0.251082 sidecar estimate | 8 | 4.503/29.060s | 0.066/0.621s | 41.841/97.366s | 567.4 |
The latency/throughput columns are not calibrated. Frontier still uses dummy
execution timing, so TPOT is a constant simulator artifact.
## Prefix Admission Check
Real vLLM TP1 preempts requests, so the sidecar theoretical prefix-hit estimate
is not the right observed comparator for every request. The observed vLLM
scheduler signal is the first `computed:` value in `stdout.log` for each
request start.
Using first-start `computed:` tokens:
| Frontier run | compared rows | Frontier computed sum | vLLM first-start computed sum | mismatch |
|---|---:|---:|---:|---:|
| `planner_kv` | 96 | 110,608 | 108,208 | one request differs |
| `vllm_kv_15281` | 92 | 103,168 | 103,168 | exact match |
So with the KV block count explicitly matched, Frontier's prefix-cache admission
matches real vLLM TP1 for every row where Frontier emits complete cache metrics.
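The comparison above can be sketched as a per-request join on request id. This is a hypothetical helper, not the actual ReplayServe tooling: it assumes the first-start `computed:` values have already been extracted from the vLLM log into a dictionary keyed by request id, and the sample data is invented for illustration.

```python
from typing import Dict, List, Tuple


def compare_first_start(
    frontier: Dict[int, int], vllm: Dict[int, int]
) -> Tuple[int, int, int, List[int]]:
    """Compare prefix-hit tokens on the request ids both sides report.

    Returns (compared rows, Frontier sum, vLLM sum, mismatched ids).
    """
    shared = sorted(set(frontier) & set(vllm))
    frontier_sum = sum(frontier[rid] for rid in shared)
    vllm_sum = sum(vllm[rid] for rid in shared)
    mismatched = [rid for rid in shared if frontier[rid] != vllm[rid]]
    return len(shared), frontier_sum, vllm_sum, mismatched


# Toy data: request 3 has no Frontier row, request 2 disagrees.
frontier_hits = {0: 0, 1: 1024, 2: 2400}
vllm_hits = {0: 0, 1: 1024, 2: 0, 3: 512}

rows, f_sum, v_sum, bad_ids = compare_first_start(frontier_hits, vllm_hits)
print(rows, f_sum, v_sum, bad_ids)
```

Joining on shared ids keeps the sums comparable when Frontier emits fewer rows than vLLM, which is the situation in both runs in the table.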
## Current Alignment Judgment
Aligned:
- H20 device and Qwen3-30B-A3B structural model config can run in Frontier.
- TP1 scheduler knobs can be matched.
- KV block count can be matched explicitly at 15,281 blocks.
- First-admission prefix-cache hit tokens match real vLLM TP1 on completed rows
when KV blocks are explicit.
Not aligned:
- Frontier emits complete request/cache metrics for only 92/100 requests in the
explicit-KV run, while vLLM completes 100/100.
- Frontier reports 0 preemptions; real vLLM TP1 reports 8 preemptions across 5
repeated-start requests.
- Frontier timing is not comparable because it still uses dummy execution
prediction. The current latency/throughput gap is expected and not a
calibrated simulator error.
Next work:
- Treat RS6 as the current profiled baseline and investigate why it omits
complete latency/cache metrics for requests `70`, `77`, `88`, and `90`.
- Instrument Frontier's vLLM V1 scheduler around KV block allocation, free-block
count, and preemption victim selection. Real vLLM TP1 has 8 preemptions, while
Frontier still reports 0 with the same explicit 15,281-block capacity.
- Add a per-request Frontier/vLLM comparator that reports TTFT/TPOT/E2E ratios,
prefix hits, and completion/preemption status on the same request ids.
- Calibrate CPU/scheduler/CUDA-graph effects separately from op profile timing;
RS6 removed the 4096-token linear/MoE extrapolation as the primary explanation
for the remaining gap.
## Performance Gap
Use Frontier `vllm_kv_15281` as the current aligned-KV simulator point. This
matches the real vLLM TP1 KV block count, but it still uses Frontier dummy
execution timing.
| metric | Frontier H20 TP1 explicit KV | real vLLM H20 TP1 | gap |
|---|---:|---:|---:|
| completed requests | 92/100 | 100/100 | not aligned |
| TTFT p50 | 0.964s | 4.503s | Frontier 0.21x real |
| TTFT p95 | 182.639s | 29.060s | Frontier 6.28x real |
| TPOT p50 | 0.582s | 0.066s | Frontier 8.81x real |
| TPOT p95 | 0.582s | 0.621s | Frontier 0.94x real |
| E2E p50 | 305.290s | 41.841s | Frontier 7.30x real |
| E2E p95 | 1765.347s | 97.366s | Frontier 18.13x real |
| RPS | 0.0217 | 0.6880 | vLLM 31.74x Frontier |
| decode tok/s | 19.4 | 567.4 | vLLM 29.20x Frontier |
Interpretation:
- The prefix admission path is close after explicit KV matching, but performance
is not calibrated.
- Frontier uses dummy execution timing; its TPOT is nearly constant at 582 ms,
while real vLLM TP1 has p50 TPOT 66 ms and p95 TPOT 621 ms.
- Frontier does not reproduce real vLLM's TP1 preemption behavior: real vLLM had
8 preemptions, while Frontier reported 0.
- Frontier emits complete request/cache metrics for only 92 rows in this run,
so p95 and throughput are not yet on the same request set.
- The TTFT sign is mixed: Frontier p50 TTFT is too optimistic, but p95 TTFT is
far too pessimistic. This is consistent with uncalibrated execution timing plus
different queue/preemption dynamics.
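The gap column in the table is a plain ratio of the Frontier value to the real vLLM value. A minimal sketch that recomputes the latency ratios from the rounded table entries (the dictionary keys are illustrative names):

```python
# Latency percentiles in seconds, copied from the performance gap table.
frontier = {
    "ttft_p50": 0.964, "ttft_p95": 182.639,
    "tpot_p50": 0.582, "tpot_p95": 0.582,
    "e2e_p50": 305.290, "e2e_p95": 1765.347,
}
real_vllm = {
    "ttft_p50": 4.503, "ttft_p95": 29.060,
    "tpot_p50": 0.066, "tpot_p95": 0.621,
    "e2e_p50": 41.841, "e2e_p95": 97.366,
}


def gap_ratios(sim: dict, real: dict) -> dict:
    """Return simulator / real ratios for every metric in `sim`."""
    return {name: sim[name] / real[name] for name in sim}


ratios = gap_ratios(frontier, real_vllm)
for name, ratio in ratios.items():
    # For latency metrics a ratio below 1 means the simulator is too fast.
    direction = "optimistic" if ratio < 1.0 else "pessimistic"
    print(f"{name}: Frontier {ratio:.2f}x real ({direction})")
```

Because the inputs here are already rounded, a recomputed ratio can differ from the table in the last digit, as happens for TPOT p50.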
## RS5 Profiled Frontier Timing
Frontier does support replacing dummy timing with real CSV profiles through the
random-forest execution-time predictor. The required non-dummy flags are wired
in `tools/run_frontier_sweep.py`, and the active profiled config is
`configs/rs5_frontier_h20_tp1_profile.json`.
Profile data collected on dash2 H20 TP1:
- Linear ops: `linear_op.csv`, CUDA event, max tokens 4096.
- Attention: `attention_combined.csv`, CUDA event, max sequence/chunk 18000,
with 15417 standard rows plus 612 true-mixed rows. Online replay needs the
true-mixed rows to train `attn_prefill_mixed` and `attn_decode_in_mixed`.
- MoE: `moe_vllm_fused.csv`, CUDA event, max tokens 4096, vLLM fused MoE
backend.
Frontier vLLM 0.11.1 profiling needed local compatibility patches in
`patches/frontier-vllm-0.11.1-profiling-compat.patch`:
- RoPE helper fallback when vLLM 0.11.1 `get_rope()` no longer accepts the
legacy `rotary_dim` keyword.
- `_get_config_dtype_str` fallback for vLLM fused MoE config dtype.
- `ReplicatedLinear(disable_tp=True)` fallback to torch `Linear` when vLLM TP
group is not initialized in standalone profiling.
- `fused_topk()` variable-return handling.
- `invoke_fused_moe_kernel()` 0.11.1 signature compatibility.
The first profiled MoE attempt used Frontier's `frontier_loop` backend and was
not faithful to vLLM serving. It predicted `moe_grouped_gemm` at about 16 ms for
24 tokens and 19 ms for 1024 tokens, causing TPOT around 0.93 s. The vLLM fused
MoE profile predicts about 0.32 ms for 24 tokens and 0.87 ms for 1024 tokens.
| run | completed | prefix hit ratio | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 | total tok/s | decode tok/s |
|---|---:|---:|---:|---:|---:|---:|---:|
| Frontier dummy `vllm_kv_15281` | 92/100 | 0.2422 | 0.964/182.639s | 0.582/0.582s | 305.290/1765.347s | 131.3 | 19.4 |
| Frontier profiled `frontier_loop` MoE | 93/100 | 0.2492 | 3.320/310.235s | 0.930/1.767s | 492.097/2038.538s | 165.9 | 24.6 |
| Frontier profiled vLLM fused MoE | 97/100 | 0.2376 | 0.355/13.695s | 0.056/0.098s | 27.032/119.019s | 2056.7 | 304.5 |
| Frontier profiled vLLM fused MoE, linear/MoE 32K | 96/100 | 0.2484 | 0.909/12.763s | 0.057/0.146s | 30.939/119.636s | 2348.9 | 347.8 |
| vLLM TP1 real | 100/100 | 0.2511 | 4.503/29.060s | 0.066/0.621s | 41.841/97.366s | 3832.3 | 567.4 |
Current judgment:
- The profiled vLLM fused MoE run is the first useful timing baseline. TPOT p50
is close to real vLLM, but throughput is still about 54% of real vLLM and
TTFT/E2E tails do not align.
- After extending linear and MoE profiles to 32768 tokens and adding
`prefill_hot` MoE rows, the cache hit ratio is nearly aligned
(0.2484 vs vLLM 0.2511), throughput improves to about 61% of real vLLM, and
TTFT p50 moves from 0.08x to 0.20x of real vLLM. This confirms that the 4096
profile ceiling was a real source of error.
- Prefix/cache accounting remains close but not exact: the profiled run emits
complete cache metrics for 96/100 requests in the 32K run, with token hit
ratio 0.2488 vs vLLM's sidecar estimate 0.2511.
- Frontier still reports zero preemptions, while real vLLM TP1 had 8 preemption
events. This affects completion set, TTFT tail, and E2E tail.
- The remaining gaps are no longer explained by the linear/MoE 4096-token
extrapolation alone. The 32K run still has TTFT p50 at 0.20x, TTFT p95 at
0.44x, TPOT p95 at 0.23x, and throughput at 0.61x of real vLLM. This points
to missing CPU/scheduler/CUDA-graph modeling plus Frontier's scheduler and
completion/preemption fidelity.
- The 32K run still completes only 96/100 requests in latency/cache metrics
(`70`, `77`, `88`, `90` missing), while real vLLM completes 100/100. This is
a Frontier lifecycle/metrics or scheduler-fidelity issue to debug separately.
## 2026-06-24 Follow-Up
Handled in the ReplayServe harness:
- `tools/run_frontier_sweep.py` now passes an absolute metrics output path into
Frontier. Frontier runs with `cwd=/tmp/replayserve-frontier-rs1b`, so relative
metrics paths could otherwise be written under the Frontier scratch directory
instead of ReplayServe's run directory.
- `tools/postprocess_frontier_smoke.py` now emits a `completion` block with
`completed_requests`, `total_requests`, and `missing_latency_request_ids`.
- `tools/aggregate_runs.py` now marks a run as `incomplete` when postprocess
reports missing latency rows. The latest RS6 summary is therefore incomplete,
not a clean pass.
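The completion bookkeeping described above can be sketched as follows. Only the three field names and the `incomplete` status come from this note; the function names, the 0-based request ids, and the `pass` status string are assumptions for illustration and do not reproduce the actual tool code.

```python
from typing import Dict, Iterable


def completion_block(
    all_ids: Iterable[int], latency_ids: Iterable[int]
) -> Dict[str, object]:
    """Summarize which requests have a latency row."""
    total = set(all_ids)
    done = set(latency_ids) & total
    return {
        "completed_requests": len(done),
        "total_requests": len(total),
        "missing_latency_request_ids": sorted(total - done),
    }


def run_status(block: Dict[str, object]) -> str:
    """Mark a run incomplete when any latency row is missing."""
    return "incomplete" if block["missing_latency_request_ids"] else "pass"


# RS6-like example: 100 requests, four without latency rows.
missing = {70, 77, 88, 90}
block = completion_block(range(100), [i for i in range(100) if i not in missing])
status = run_status(block)
print(block, status)
```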
Latest RS6 vs real vLLM TP1 after the 32K profile and harness fixes:
| metric | Frontier RS6 32K profile | real vLLM TP1 | Frontier / vLLM |
|---|---:|---:|---:|
| completed requests | 96/100 | 100/100 | 0.96 |
| prefix token hit ratio | 0.2488 | 0.2511 | 0.99 |
| preemption events | 0 | 8 | 0.00 |
| TTFT p50 | 0.909s | 4.503s | 0.20 |
| TTFT p95 | 12.763s | 29.060s | 0.44 |
| TPOT p50 | 0.0569s | 0.0661s | 0.86 |
| TPOT p95 | 0.146s | 0.621s | 0.23 |
| E2E p50 | 30.939s | 41.841s | 0.74 |
| E2E p95 | 119.636s | 97.366s | 1.23 |
| total tok/s | 2348.9 | 3832.3 | 0.61 |
| decode tok/s | 347.8 | 567.4 | 0.61 |
Preemption experiment:
- A local trial enabled waiting-admission preemption in Frontier Phase 2. It did
produce preemption events, but it was not a valid alignment improvement:
Frontier completed only 79/100 requests and amplified the early-decode
disappearance pattern. That config was removed from `configs/`.
- This means the remaining preemption gap is not just "turn on preemption in
Phase 2". Frontier's batch/runtime-epoch lifecycle needs a deeper fix before
its preemption behavior can be considered faithful to vLLM TP1.
Current interpretation:
- Prefix/cache replay is close: token-weighted prefix hit ratio is within about
1% relative of the vLLM synthetic replay estimate.
- Completion/preemption is not aligned. Requests `70`, `77`, `88`, and `90`
begin decode in RS6 but never reach completion metrics; vLLM completes all
100 requests and logs 8 preemption events.
- Timing is partially useful but not fully calibrated. Linear and MoE profiles
now cover the trace's long-prefill range up to 32768 tokens, so the old 4096
extrapolation is no longer the main explanation. The remaining TTFT/TPOT/E2E
gap likely comes from missing CPU/scheduler overhead, decode CUDA graph
modeling, and Frontier scheduler lifecycle differences.
## 2026-06-25 500-Request Stress
Generated `traces/fixtures/coder_500` from the first 500 rows of
`qwen_coder_blksz_16.jsonl`:
- `row_count=500`
- `max_total_tokens=21318`
- `overflow_count=0`
- `partial_final_block_rows=466`
Frontier RS8 used the same H20 TP1 Qwen3-30B-A3B full32K profile and explicit
KV block count as RS6:
- Config:
`configs/rs8_frontier_h20_tp1_profile_full32k_coder500.json`
- Run:
`runs/rs8_frontier_h20_tp1_profile_full32k_coder500_20260625`
- Runtime: 492 seconds
- Status: incomplete
| metric | Frontier RS6 100 reqs | Frontier RS8 500 reqs |
|---|---:|---:|
| completed requests | 96/100 | 439/500 |
| missing latency/cache rows | 4 | 61 |
| prefix token hit ratio | 0.2488 | 0.1192 |
| preemption events | 0 | 0 |
| TTFT p50/p95 | 0.909/12.763s | 136.776/340.237s |
| TPOT p50/p95 | 0.0569/0.146s | 0.0564/0.0894s |
| E2E p50/p95 | 30.939/119.636s | 177.800/397.291s |
| total tok/s | 2348.9 | 4733.7 |
| decode tok/s | 347.8 | 656.2 |
Missing request ids in RS8:
```text
70,77,88,90,103,106,134,135,142,143,153,154,176,178,183,184,186,188,210,211,216,222,245,246,263,272,274,278,291,298,299,300,320,325,334,335,347,348,363,367,373,374,393,399,403,409,412,413,414,433,434,437,439,450,453,460,469,475,476,479,497
```
The incomplete-row issue grows with load: 4/100 missing rows (4%) in RS6 become
61/500 (12.2%) in RS8. This makes RS8 invalid for final performance claims, but
useful as a stress signal for Frontier lifecycle/metrics fidelity.
The lower prefix hit ratio is not by itself proof of adapter failure. The
unbounded trace-side trie estimate for `coder_500` is 0.3868 token hit ratio,
but the H20 TP1 configuration has finite KV capacity (`num_blocks=15281`, about
244K tokens). The 500-request window has 2.7M prompt tokens, so KV eviction can
substantially reduce real prefix hits. The dash1 vLLM run below is the current
finite-cache comparator for whether Frontier's behavior is faithful.
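To illustrate why a finite cache can fall well below the unbounded trie estimate, here is a toy block-level prefix cache with LRU eviction. It is a simplified model under stated assumptions: it is not vLLM's or Frontier's eviction policy, the block ids are invented, and a hit counts only the leading run of cached blocks.

```python
from collections import OrderedDict
from typing import Hashable, Optional, Sequence


def prefix_hit_ratio(
    requests: Sequence[Sequence[Hashable]], capacity: Optional[int]
) -> float:
    """Block hit ratio for a prefix cache; capacity=None means unbounded."""
    cache: "OrderedDict[Hashable, None]" = OrderedDict()
    hit_blocks = 0
    total_blocks = 0
    for blocks in requests:
        total_blocks += len(blocks)
        # A prefix hit is the leading run of blocks that are already cached.
        for block in blocks:
            if block not in cache:
                break
            hit_blocks += 1
        # Insert or refresh every block, evicting the least recently used.
        for block in blocks:
            cache[block] = None
            cache.move_to_end(block)
            while capacity is not None and len(cache) > capacity:
                cache.popitem(last=False)
    return hit_blocks / total_blocks if total_blocks else 0.0


prompt_a = ["a0", "a1", "a2", "a3"]
prompt_b = ["b0", "b1", "b2", "b3"]
trace = [prompt_a, prompt_b, prompt_a]  # prompt A repeats after B

unbounded = prefix_hit_ratio(trace, capacity=None)
bounded = prefix_hit_ratio(trace, capacity=4)
print(f"unbounded={unbounded:.4f} bounded={bounded:.4f}")
```

In this toy trace the repeated prompt is fully reusable with an unbounded cache, but a four-block cache evicts it before it returns, so the bounded hit ratio drops to zero.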
Real vLLM TP1 500 was first attempted on dash2 with the same settings as
`tp1_coder100_uncapped` (`max_num_seqs=64`, `max_num_batched_tokens=32768`,
`gpu_memory_utilization=0.85`, `CUDA_VISIBLE_DEVICES=0`), but did not start
because dash2 was already occupied by eight existing `agentic-kvc` vLLM serve
processes on ports 8000-8007. Each H20 had about 89GB allocated, and vLLM failed
with free memory below the required 0.85 utilization target. Those processes
were not killed; the temporary ReplayServe GPU lock was released.
A replacement vLLM TP1 500 run completed on dash1:
- Run:
`runs/vllm_gpu_smoke_20260625_dash1/tp1_coder500_uncapped`
- Runtime: vLLM 0.11.1
- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
- Command knobs: `TP=1`, `max_model_len=32768`, `max_num_seqs=64`,
`max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
prefix caching on, chunked prefill on
- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
- Replay wall time after engine startup: 595.116 seconds
- Process elapsed including model load/startup: 2026-06-25T03:08:18Z to
2026-06-25T03:19:41Z
| metric | Frontier RS8 500 reqs | vLLM TP1 500 reqs | vLLM / Frontier |
|---|---:|---:|---:|
| completed requests | 439/500 | 500/500 | not aligned |
| preemption events | 0 | 63 | not aligned |
| repeated/preempted request ids | 0 | 57 | not aligned |
| TTFT p50 | 136.776s | 185.658s | 1.36 |
| TTFT p95 | 340.237s | 375.895s | 1.10 |
| TPOT p50 | 0.0564s | 0.0498s | 0.88 |
| TPOT p95 | 0.0894s | 0.0919s | 1.03 |
| E2E p50 | 177.800s | 224.270s | 1.26 |
| E2E p95 | 397.291s | 417.356s | 1.05 |
| requests/s | 0.661 | 0.840 | 1.27 |
| total tok/s | 4733.7 | 5282.9 | 1.12 |
| decode tok/s | 656.2 | 732.3 | 1.12 |
Because Frontier emits latency/cache rows for only 439 requests, the latency
comparison above mixes Frontier's completed subset with vLLM's complete 500-row
run. Restricting vLLM to the same 439 request ids gives:
| metric | Frontier RS8 439 rows | vLLM same 439 ids | vLLM / Frontier |
|---|---:|---:|---:|
| TTFT p50 | 136.776s | 169.968s | 1.24 |
| TTFT p95 | 340.237s | 375.760s | 1.10 |
| TPOT p50 | 0.0564s | 0.0498s | 0.88 |
| TPOT p95 | 0.0894s | 0.1071s | 1.20 |
| E2E p50 | 177.800s | 218.606s | 1.23 |
| E2E p95 | 397.291s | 416.110s | 1.05 |
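The same-id restriction can be sketched as below. The nearest-rank percentile is an assumption, since this note does not state which percentile method the postprocess tools use, and the per-request values are toy data.

```python
import math
from typing import Dict, Iterable, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (an assumed method, for illustration)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def restrict(metric: Dict[int, float], ids: Iterable[int]) -> List[float]:
    """Keep only the values whose request id is in `ids`."""
    return [metric[rid] for rid in sorted(ids) if rid in metric]


# Toy TTFT values in seconds; request 2 has no Frontier row.
frontier_ttft = {0: 1.0, 1: 2.0, 3: 4.0}
vllm_ttft = {0: 1.5, 1: 2.5, 2: 90.0, 3: 5.0}

shared_ids = set(frontier_ttft) & set(vllm_ttft)
vllm_all_p95 = percentile(list(vllm_ttft.values()), 95)
vllm_same_p95 = percentile(restrict(vllm_ttft, shared_ids), 95)
frontier_p95 = percentile(restrict(frontier_ttft, shared_ids), 95)

print(f"vLLM p95 all rows: {vllm_all_p95}")
print(f"vLLM p95 same ids: {vllm_same_p95}")
print(f"vLLM / Frontier on same ids: {vllm_same_p95 / frontier_p95:.2f}")
```

In the toy data the request missing from Frontier is also the slowest one in vLLM, so the unrestricted comparison overstates the tail gap. Restricting to shared ids removes that selection effect.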
Prefix/cache comparison needs careful metric naming:
- The unbounded ReplayServe trie estimate for all 500 rows is 1,047,632 hit
tokens / 2,708,110 prompt tokens = 0.3868 token hit ratio.
- vLLM's finite-cache scheduler log is much lower under this pressure:
first-start `computed:` ratio is 0.0979, last-start ratio is 0.1643, and
max-per-request ratio is 0.1655.
- On the same 439 request ids where Frontier emits complete metrics, vLLM's
first-start `computed:` ratio is 0.1050, last-start ratio is 0.1665, and
max-per-request ratio is 0.1679.
- Frontier RS8 reports `replayserve_token_hit_ratio=0.1192` and
`frontier_block_hit_ratio=0.1191`, which is in the same range as vLLM's
finite-cache scheduler signal but far below the unbounded trace-side estimate.
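The ratios in this list can be recomputed from the token counts, and the position of the Frontier value relative to the vLLM finite-cache signals can be checked directly. A minimal sketch using the numbers quoted above (the helper names are illustrative):

```python
def token_hit_ratio(hit_tokens: int, prompt_tokens: int) -> float:
    """Token-weighted prefix hit ratio."""
    return hit_tokens / prompt_tokens


def lies_between(value: float, low: float, high: float) -> bool:
    """True when `value` falls inside the closed interval [low, high]."""
    return low <= value <= high


# Unbounded trie estimate for all 500 rows.
unbounded_500 = token_hit_ratio(1_047_632, 2_708_110)

# vLLM finite-cache signals on the 439 ids with complete Frontier metrics.
vllm_first_start = 0.1050
vllm_last_start = 0.1665
frontier_token_ratio = 0.1192

print(f"unbounded estimate: {unbounded_500:.4f}")
print(
    "Frontier between vLLM first-start and last-start:",
    lies_between(frontier_token_ratio, vllm_first_start, vllm_last_start),
)
```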
Current 500-request judgment:
- Frontier's timing profile is now in the right broad range for this stressed
H20 TP1 run: TPOT p50/p95 and E2E p95 are close to vLLM, and aggregate token
throughput is within about 12%.
- The run is still not a faithful simulator result because completion and
preemption diverge: Frontier drops 61 latency/cache rows and reports zero
preemptions, while real vLLM completes all 500 requests and logs 63
preemption events across 57 request ids.
- The 500-request trace invalidates the earlier use of the unbounded sidecar
prefix estimate as the primary comparator. Finite KV capacity, eviction, and
preemption must be part of the prefix-cache replay metric.
ReplayServe TODO:
- Treat incomplete Frontier runs as invalid for final performance claims unless
the comparison explicitly reports the missing request set.
- Keep the focused Frontier debug guard in the local patch: sequential mode now
fails if `completed_requests < total_requests` at drain time and reports the
missing request state.
- Add a comparator that reports both unbounded trace-side prefix reuse and
finite-cache observed reuse from vLLM scheduler logs; do not compare
Frontier's finite-cache hit ratio directly to the unbounded trie estimate.
- Profile or import vLLM CPU overhead records for H20 TP1 before enabling
`skip_cpu_overhead_modeling=false`; without those records Frontier falls back
to zero CPU overhead.
- Collect kernel-only/decode-CUDA-graph timing profiles before using
`decode_cuda_graph_mode=full_decode_only`; the current RS6 profile is CUDA
event/eager timing.
## 2026-06-25 200-Request Timestamp Scale 2/3
Generated `traces/fixtures/coder_200_ts0667` from the first 200 rows of
`qwen_coder_blksz_16.jsonl`, with each timestamp multiplied by `2/3` in the
fixture files:
- `row_count=200`
- `timestamp_scale=0.6666666666666666`
- `last_timestamp=30.711333333333332`
- `max_total_tokens=18985`
- `partial_final_block_rows=182`
Important: in the current replay semantics, smaller timestamp scale makes
arrivals denser. It reduces the arrival window from about 46.1s to 30.7s for the
first 200 requests. This does not reduce queue pressure relative to the same
200 requests at scale 1.0; it only reduces the request count relative to the
500-request stress.
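The effect of the timestamp scale on arrival density can be sketched with the fixture numbers. The unscaled last timestamp of 46.067 s is derived from the scaled value (30.7113 / (2/3)). Treating the first arrival as t=0, so that the window equals the last timestamp, is an assumption.

```python
NUM_REQUESTS = 200
BASE_LAST_TIMESTAMP = 46.067  # seconds at scale 1.0, derived from 30.7113 / (2/3)


def scaled_window(last_timestamp: float, scale: float) -> float:
    """Arrival window after multiplying every timestamp by `scale`."""
    return last_timestamp * scale


def mean_arrival_rate(num_requests: int, window: float) -> float:
    """Average requests per second over the arrival window."""
    return num_requests / window


scales = {"2/3": 2 / 3, "1": 1.0, "2": 2.0, "3": 3.0}
windows = {k: scaled_window(BASE_LAST_TIMESTAMP, s) for k, s in scales.items()}
rates = {k: mean_arrival_rate(NUM_REQUESTS, w) for k, w in windows.items()}

for label in scales:
    print(f"scale {label}: window {windows[label]:.3f}s, {rates[label]:.2f} req/s")
```

A scale below 1 compresses the window and raises the mean arrival rate, while scales of 2 and 3 stretch it to about 92.1 s and 138.2 s.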
Frontier RS9:
- Config:
`configs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667.json`
- Run:
`runs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667`
- Runtime: 460 seconds
- Status: incomplete
vLLM dash1 TP1:
- Run:
`runs/vllm_gpu_smoke_20260625_dash1/tp1_coder200_ts0667_uncapped`
- Runtime: vLLM 0.11.1
- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
- Replay wall time after engine startup: 242.813 seconds
| metric | Frontier RS9 200 ts=2/3 | vLLM TP1 200 ts=2/3 | vLLM / Frontier |
|---|---:|---:|---:|
| completed requests | 176/200 | 200/200 | not aligned |
| preemption events | 0 | 26 | not aligned |
| TTFT p50 | 20.580s | 34.563s | 1.68 |
| TTFT p95 | 96.718s | 120.804s | 1.25 |
| TPOT p50 | 0.0584s | 0.0515s | 0.88 |
| TPOT p95 | 0.2359s | 0.2535s | 1.07 |
| E2E p50 | 73.207s | 83.622s | 1.14 |
| E2E p95 | 189.240s | 183.727s | 0.97 |
| requests/s | 0.583 | 0.824 | 1.41 |
| total tok/s | 3913.4 | 4864.8 | 1.24 |
| decode tok/s | 593.3 | 737.5 | 1.24 |
Restricting vLLM to the same 176 request ids where Frontier emits complete
metrics gives:
| metric | Frontier RS9 176 rows | vLLM same 176 ids | vLLM / Frontier |
|---|---:|---:|---:|
| TTFT p50 | 20.580s | 27.896s | 1.36 |
| TTFT p95 | 96.718s | 120.804s | 1.25 |
| TPOT p50 | 0.0584s | 0.0520s | 0.89 |
| TPOT p95 | 0.2359s | 0.2539s | 1.08 |
| E2E p50 | 73.207s | 82.645s | 1.13 |
| E2E p95 | 189.240s | 183.727s | 0.97 |
Prefix/cache comparison:
- The unbounded ReplayServe trie estimate for all 200 rows is 270,336 hit
tokens / 1,002,154 prompt tokens = 0.2698 token hit ratio.
- vLLM finite-cache scheduler signal for all 200 rows: first-start `computed:`
ratio 0.1392, last-start ratio 0.2126, max-per-request ratio 0.2129.
- On the same 176 request ids where Frontier emits complete metrics, vLLM
first-start ratio is 0.1487, last-start ratio is 0.1926, and max-per-request
ratio is 0.1927.
- Frontier RS9 reports `replayserve_token_hit_ratio=0.1703` and
`frontier_block_hit_ratio=0.1700`, again between vLLM first-start and
last/max finite-cache scheduler signals.
Missing request ids in RS9:
```text
70,78,80,86,87,89,96,101,102,105,125,126,131,132,135,144,145,146,147,148,149,150,151,198
```
Current 200-request judgment:
- Reducing the request count from 500 to 200 substantially reduces TTFT and E2E
tails, but `scale=2/3` is still a dense-arrival stress test. vLLM TTFT p95 is
still 120.8s.
- Frontier timing is closer than the old 100-request dummy/profile baselines:
TPOT p50/p95 and E2E p50/p95 are broadly aligned.
- Completion/preemption remains the blocking fidelity issue: Frontier drops 24
rows and reports zero preemptions; vLLM completes all 200 and logs 26
preemptions across 22 repeated-start request ids.
- To actually reduce queue pressure for the same first 200 requests, use a
timestamp scale greater than 1. The follow-up scale 2 and 3 runs below do
this.
## 2026-06-25 200-Request Timestamp Scale 2 and 3
Generated two more first-200 fixtures from `qwen_coder_blksz_16.jsonl`:
| fixture | timestamp scale | last timestamp | max total tokens |
|---|---:|---:|---:|
| `coder_200_ts2` | 2.0 | 92.134s | 18,985 |
| `coder_200_ts3` | 3.0 | 138.201s | 18,985 |
These are the intended lower-arrival-pressure runs. The request payloads are the
same first 200 rows as `coder_200_ts0667`; only timestamps differ.
Frontier RS10:
- Config:
`configs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3.json`
- Run:
`runs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3`
- Status: incomplete for both fixtures
vLLM dash1 TP1:
- Runs:
`runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts2_uncapped`
and `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts3_uncapped`
- Runtime: vLLM 0.11.1
- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
Run-level comparison:
| metric | Frontier scale 2 | vLLM scale 2 | Frontier scale 3 | vLLM scale 3 |
|---|---:|---:|---:|---:|
| completed requests | 182/200 | 200/200 | 184/200 | 200/200 |
| preemption events | 0 | 43 | 0 | 16 |
| TTFT p50 | 8.118s | 9.217s | 0.779s | 1.166s |
| TTFT p95 | 67.850s | 69.211s | 35.918s | 32.258s |
| TPOT p50 | 0.0544s | 0.0497s | 0.0544s | 0.0462s |
| TPOT p95 | 0.0747s | 0.0686s | 0.0773s | 0.0714s |
| E2E p50 | 51.118s | 55.002s | 40.641s | 33.213s |
| E2E p95 | 162.607s | 142.338s | 158.434s | 122.789s |
| requests/s | 0.593 | 0.803 | 0.544 | 0.780 |
| total tok/s | 3846.1 | 4742.5 | 3490.6 | 4608.1 |
| decode tok/s | 583.1 | 719.0 | 529.2 | 698.6 |
Restricting vLLM to the same request ids where Frontier emits complete metrics:
| metric | Frontier scale 2 182 rows | vLLM same 182 ids | Frontier scale 3 184 rows | vLLM same 184 ids |
|---|---:|---:|---:|---:|
| TTFT p50 | 8.118s | 8.574s | 0.779s | 0.945s |
| TTFT p95 | 67.850s | 68.934s | 35.918s | 32.258s |
| TPOT p50 | 0.0544s | 0.0501s | 0.0544s | 0.0461s |
| TPOT p95 | 0.0747s | 0.0686s | 0.0773s | 0.0679s |
| E2E p50 | 51.118s | 53.263s | 40.641s | 33.213s |
| E2E p95 | 162.607s | 141.264s | 158.434s | 122.789s |
Prefix/cache comparison:
| metric | scale 2 | scale 3 |
|---|---:|---:|
| unbounded trace-side token hit ratio | 0.2698 | 0.2698 |
| vLLM first-start `computed:` ratio | 0.1433 | 0.1471 |
| vLLM last-start `computed:` ratio | 0.2382 | 0.1968 |
| vLLM max-per-request `computed:` ratio | 0.2383 | 0.1998 |
| Frontier `replayserve_token_hit_ratio` | 0.1448 | 0.1523 |
| Frontier `frontier_block_hit_ratio` | 0.1446 | 0.1521 |
Current scale 2 and 3 judgment:
- The user's intended `scale=2` and `scale=3` runs do reduce queueing. vLLM
TTFT p95 drops from 120.8s at `scale=2/3` to 69.2s at `scale=2` and 32.3s at
`scale=3`.
- `scale=3` is the first run where vLLM p50 TTFT is near 1s. The p95 is still
high because long prompts and KV pressure remain, but the severe all-request
queueing seen in the 500-request run is much reduced.
- Frontier timing is now close on TTFT and TPOT for the completed-row subset,
especially at `scale=2`. However, Frontier still misses completion/cache rows
and still reports zero preemptions.
- Completion/preemption is therefore still the main Frontier fidelity blocker:
`scale=2` misses 18 rows and vLLM logs 43 preemptions; `scale=3` misses 16 rows
and vLLM logs 16 preemptions.
## 2026-06-25 Frontier Lifecycle Fix For RS10
The missing-row root cause was Frontier lifecycle handling after decode-phase
preemption. Missing requests were preempted after prefill/decode had started,
then left in this inconsistent state:
```text
preempted=True
is_prefill_complete=True
num_processed_tokens=0
scheduled=False
completed=False
```
The next waiting admission computed `num_new_tokens=0` and removed the request
from the queue, so sequential simulation drained with fewer completed requests
but no remaining scheduler work.
The updated ReplayServe Frontier patch now:
- replays decode-phase preemption by treating already-produced tokens as the
next prefill segment and the remaining tokens as decode work;
- preserves unfinished zero-token waiting requests instead of silently dropping
them;
- reports metrics against user-facing trace prompt/output lengths after runtime
token splitting;
- fails fast if sequential mode drains before all generated requests complete.
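The replay and fail-fast behavior in the list above can be sketched as follows. This is a hypothetical illustration, not the patch itself: the class and function names are invented, and it assumes the re-admitted prefill covers the prompt plus the tokens already produced.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TraceRequest:
    request_id: int
    prompt_tokens: int         # user-facing prompt length from the trace
    output_tokens: int         # user-facing output length from the trace
    generated_tokens: int = 0  # tokens produced before preemption


def replay_segments(req: TraceRequest) -> Tuple[int, int]:
    """Split a preempted request into (prefill, decode) work.

    Assumption: already-produced tokens are recomputed as part of the next
    prefill segment, and only the remaining output tokens stay decode work.
    """
    prefill = req.prompt_tokens + req.generated_tokens
    decode = req.output_tokens - req.generated_tokens
    return prefill, decode


def check_drain(completed_requests: int, total_requests: int) -> None:
    """Fail fast when the scheduler drains with unfinished requests."""
    if completed_requests < total_requests:
        missing = total_requests - completed_requests
        raise RuntimeError(f"drained with {missing} unfinished request(s)")


req = TraceRequest(request_id=70, prompt_tokens=1000, output_tokens=200,
                   generated_tokens=50)
prefill, decode = replay_segments(req)
print(prefill, decode)
```

Splitting only the runtime work keeps `prompt_tokens` and `output_tokens` untouched, so metrics can still be reported against the user-facing trace lengths.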
Verification runs:
| run | old completion | fixed completion | Frontier preemptions | prefix token hit ratio | status |
|---|---:|---:|---:|---:|---|
| `coder_200_ts2` | 182/200 | 200/200 | 33 | 0.2313 | pass |
| `coder_200_ts3` | 184/200 | 200/200 | 20 | 0.2177 | pass |
Fixed-run paths:
- `runs/rs10_preemption_replay_fix_ts2/frontier_h20_tp1_profile_full32k/coder_200_ts2/vllm_kv_15281_profile_full32k`
- `runs/rs10_preemption_replay_fix_ts3/frontier_h20_tp1_profile_full32k/coder_200_ts3/vllm_kv_15281_profile_full32k`
Updated run-level comparison:
| metric | Frontier scale 2 fixed | vLLM scale 2 | Frontier scale 3 fixed | vLLM scale 3 |
|---|---:|---:|---:|---:|
| completed requests | 200/200 | 200/200 | 200/200 | 200/200 |
| preemption events | 33 | 43 | 20 | 16 |
| TTFT p50 | 9.595s | 9.217s | 1.001s | 1.166s |
| TTFT p95 | 77.503s | 69.211s | 45.947s | 32.258s |
| TPOT p50 | 0.0542s | 0.0497s | 0.0534s | 0.0462s |
| TPOT p95 | 0.0665s | 0.0686s | 0.0686s | 0.0714s |
| E2E p50 | 61.458s | 55.002s | 44.761s | 33.213s |
| E2E p95 | 174.484s | 142.338s | 154.548s | 122.789s |
| requests/s | 0.594 | 0.803 | 0.574 | 0.780 |
| total tok/s | 3506.3 | 4742.5 | 3390.0 | 4608.1 |
| decode tok/s | 531.6 | 719.0 | 513.9 | 698.6 |
Current judgment after the fix:
- The completion/preemption lifecycle blocker for RS10 is fixed: both scale 2
and scale 3 now emit 200 request rows and complete postprocess.
- Frontier preemption is now of the same order as vLLM, but not exact:
scale 2 is 33 vs 43 events, scale 3 is 20 vs 16 events.
- Prefix hit ratio changed materially because preempted requests now replay and
re-enter prefix-cache admission instead of disappearing. It is no longer valid
to compare the old incomplete RS10 prefix ratios against vLLM.
- Timing remains close in TPOT but Frontier is still slower in aggregate
throughput, about 0.74x of vLLM total/decode token throughput for both scale 2
and scale 3. TTFT/E2E tails are still worse after the completion set becomes
complete.
- Remaining gap is no longer "missing metrics rows"; it is scheduler/preemption
fidelity plus CPU/scheduler/CUDA-graph timing calibration.
## 2026-06-25 H20 TP2/TP4 Comparison
The TP2/TP4 comparison uses the same first-200 `coder_200_ts2` and
`coder_200_ts3` fixtures. The vLLM runs are on dash1 with
`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`, vLLM 0.11.1,
`max_model_len=32768`, `max_num_seqs=64`,
`max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
prefix caching on, and chunked prefill on.
vLLM measured KV capacity:
| TP | KV tokens | KV blocks |
|---:|---:|---:|
| 2 | 1,104,880 | 69,055 |
| 4 | 2,833,232 | 177,077 |
Frontier RS12 uses explicit matching KV blocks and fresh H20 TP2/TP4 profiles:
- Config:
`configs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3.json`
- Run:
`runs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3`
- Profile source:
`dash1:/home/admin/cpfs/wjh/replayserve_frontier_profiles/h20_tp2_tp4_qwen3_30ba3b_full32k_20260625_true_mixed`
- Linear/MoE profiles cover TP2/TP4 up to 32768 tokens.
- Attention profile covers TP2/TP4 standard attention plus 1260 true-mixed
prefill+decode rows. The true-mixed rows are required; standard attention
alone fails with missing `attn_decode_in_mixed` predictions.
All four Frontier runs completed 200/200 request rows. Neither Frontier nor the
vLLM TP2/TP4 logs reported preemption events. Prefix token hit ratio is exactly
the same in Frontier postprocess and vLLM's trace-side synthetic estimate:
0.2697549478.
Run-level comparison:
| TP | fixture | metric | Frontier | vLLM | Frontier / vLLM |
|---:|---|---|---:|---:|---:|
| 2 | `coder_200_ts2` | requests/s | 0.776 | 1.278 | 0.61 |
| 2 | `coder_200_ts2` | total tok/s | 4581 | 7547 | 0.61 |
| 2 | `coder_200_ts2` | decode tok/s | 695 | 1144 | 0.61 |
| 2 | `coder_200_ts2` | TTFT p50/p95 | 0.269/6.745s | 0.225/0.715s | 1.20/9.43 |
| 2 | `coder_200_ts2` | TPOT p50/p95 | 0.0430/0.0529s | 0.0300/0.0434s | 1.43/1.22 |
| 2 | `coder_200_ts2` | E2E p50/p95 | 26.05/106.76s | 16.45/72.53s | 1.58/1.47 |
| 4 | `coder_200_ts2` | requests/s | 0.853 | 1.536 | 0.55 |
| 4 | `coder_200_ts2` | total tok/s | 5035 | 9073 | 0.55 |
| 4 | `coder_200_ts2` | decode tok/s | 763 | 1376 | 0.55 |
| 4 | `coder_200_ts2` | TTFT p50/p95 | 0.098/0.386s | 0.170/1.420s | 0.57/0.27 |
| 4 | `coder_200_ts2` | TPOT p50/p95 | 0.0337/0.0384s | 0.0163/0.0283s | 2.06/1.36 |
| 4 | `coder_200_ts2` | E2E p50/p95 | 18.65/84.94s | 9.26/43.62s | 2.01/1.95 |
| 2 | `coder_200_ts3` | requests/s | 0.688 | 1.088 | 0.63 |
| 2 | `coder_200_ts3` | total tok/s | 4062 | 6426 | 0.63 |
| 2 | `coder_200_ts3` | decode tok/s | 616 | 974 | 0.63 |
| 2 | `coder_200_ts3` | TTFT p50/p95 | 0.134/0.574s | 0.154/0.627s | 0.87/0.92 |
| 2 | `coder_200_ts3` | TPOT p50/p95 | 0.0394/0.0467s | 0.0191/0.0280s | 2.07/1.67 |
| 2 | `coder_200_ts3` | E2E p50/p95 | 21.79/101.59s | 9.96/53.98s | 2.19/1.88 |
| 4 | `coder_200_ts3` | requests/s | 0.737 | 1.254 | 0.59 |
| 4 | `coder_200_ts3` | total tok/s | 4355 | 7403 | 0.59 |
| 4 | `coder_200_ts3` | decode tok/s | 660 | 1122 | 0.59 |
| 4 | `coder_200_ts3` | TTFT p50/p95 | 0.089/0.346s | 0.100/0.318s | 0.89/1.09 |
| 4 | `coder_200_ts3` | TPOT p50/p95 | 0.0311/0.0358s | 0.0094/0.0128s | 3.30/2.80 |
| 4 | `coder_200_ts3` | E2E p50/p95 | 16.90/83.01s | 5.55/27.87s | 3.05/2.98 |
TP scaling comparison:
| fixture | metric | Frontier TP4 / TP2 | vLLM TP4 / TP2 |
|---|---|---:|---:|
| `coder_200_ts2` | total tok/s speedup | 1.10 | 1.20 |
| `coder_200_ts2` | decode tok/s speedup | 1.10 | 1.20 |
| `coder_200_ts2` | TPOT p50 ratio (lower is better) | 0.78 | 0.54 |
| `coder_200_ts3` | total tok/s speedup | 1.07 | 1.15 |
| `coder_200_ts3` | decode tok/s speedup | 1.07 | 1.15 |
| `coder_200_ts3` | TPOT p50 ratio (lower is better) | 0.79 | 0.49 |
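The TPOT p50 entries in the TP scaling table are TP4/TP2 ratios, so the percentage improvement is one minus the ratio. A minimal sketch recomputing both from the run-level table (the dictionary layout is illustrative):

```python
# TPOT p50 in seconds from the run-level comparison, keyed by TP degree.
tpot_p50 = {
    ("Frontier", "coder_200_ts2"): {2: 0.0430, 4: 0.0337},
    ("vLLM", "coder_200_ts2"): {2: 0.0300, 4: 0.0163},
    ("Frontier", "coder_200_ts3"): {2: 0.0394, 4: 0.0311},
    ("vLLM", "coder_200_ts3"): {2: 0.0191, 4: 0.0094},
}


def tp4_over_tp2(by_tp: dict) -> float:
    """TP4 / TP2 ratio; below 1 means TP4 is faster per output token."""
    return by_tp[4] / by_tp[2]


scaling = {key: tp4_over_tp2(values) for key, values in tpot_p50.items()}
improvement_pct = {key: (1.0 - ratio) * 100 for key, ratio in scaling.items()}

for (system, fixture), ratio in scaling.items():
    pct = improvement_pct[(system, fixture)]
    print(f"{system} {fixture}: ratio {ratio:.2f}, TPOT p50 improves {pct:.0f}%")
```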
Current TP2/TP4 judgment:
- Functional replay is aligned for this setting: same request rows, same
trace-side prefix reuse ratio, matched vLLM KV block counts, and no
preemption on either side.
- Absolute performance is not aligned. Frontier reports only 55-63% of vLLM
total/decode throughput across TP2/TP4, and TPOT is especially pessimistic at
TP4.
- Relative TP scaling is also under-estimated. vLLM's TP4 improves TPOT p50 by
about 46-51% over TP2, while Frontier improves by only about 21-22%.
- The remaining gap is therefore not caused by missing rows, prefix-cache
mismatch, or KV capacity mismatch in these runs. It points to timing model
limitations: missing CPU/scheduler/CUDA-graph modeling, random-forest profile
interpolation error, and imperfect modeling of vLLM's TP-dependent decode
execution path.
- These RS12 results are acceptable for continuing ReplayServe integration and
rough qualitative trends. They are not yet acceptable as calibrated absolute
performance predictions.