Add ReplayServe Frontier vLLM alignment report
740
docs/rs4_frontier_h20_tp1_alignment.md
Normal file
@@ -0,0 +1,740 @@
# RS4 Frontier H20 TP1 Alignment

This note compares Frontier H20 TP1 against the real vLLM TP1 run on dash2 for
`coder_100`.

## Setup

Real vLLM:

- Runtime: vLLM 0.11.1
- Host/GPU: dash2, NVIDIA H20
- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
- TP: 1
- KV capacity: 244,496 tokens = 15,281 blocks at block size 16
- Run: `runs/vllm_gpu_smoke_20260624/tp1_coder100_uncapped`

Frontier:

- Frontier root: `/tmp/replayserve-frontier-rs1b`
- Frontier commit: `d9cfeb6d8791fbf2f295dd9744c56a666171776e`
- Model config name: `qwen3-a3b-30b-moe`
- Device: `h20`
- Network node SKU: `h20_dgx`
- TP: `attn_tensor_parallel_size=1`, `moe_tensor_parallel_size=1`,
  `moe_expert_parallel_size=1`
- `max_tokens_in_batch=32768`, `batch_size_cap=64`, block size 16
- Prefix cache on, chunked prefill on
- `long_prefill_token_threshold=32768`
- Config: `configs/rs4_frontier_h20_tp1.json`
- Run: `runs/rs4_frontier_h20_tp1_20260624`

The high long-prefill threshold is deliberate. An earlier Frontier run with the
threshold set to 64 under-counted prefix hits because long prompts were admitted
in 64-token chunks, unlike the current real vLLM run.

## KV Capacity

| run | KV blocks | KV tokens | note |
|---|---:|---:|---|
| Frontier `planner_kv` | 17,385 | 278,160 | Frontier H20 memory planner, no non-KV overhead |
| Frontier `vllm_kv_15281` | 15,281 | 244,496 | Explicitly matched to real vLLM TP1 |
| vLLM TP1 | 15,281 | 244,496 | From vLLM memory profiling |

Only `vllm_kv_15281` has the same KV block count as real vLLM TP1; `planner_kv`
has 2,104 more blocks (about 13.8%).
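
The block and token counts in the table are related by the block size of 16. The following is a minimal sketch of that arithmetic, not code from the ReplayServe tools; the helper name `kv_tokens` and the `runs` dictionary are illustrative only.

```python
BLOCK_SIZE = 16  # tokens per KV block, as stated in the Setup section


def kv_tokens(num_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Convert a KV block count into a token capacity."""
    return num_blocks * block_size


# Block counts copied from the KV Capacity table.
runs = {"planner_kv": 17_385, "vllm_kv_15281": 15_281, "vllm_tp1_real": 15_281}
reference = runs["vllm_tp1_real"]

for name, blocks in runs.items():
    extra = blocks - reference
    print(f"{name}: {blocks} blocks = {kv_tokens(blocks)} tokens ({extra:+d} blocks vs real vLLM)")
```

A nonzero block difference means the simulated cache can hold prefixes that the real run would already have evicted, which is why the explicit-KV run is the one used for admission comparisons.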

## Results

| run | completed | prefix hit tokens / ratio | preemptions | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 | decode tok/s |
|---|---:|---:|---:|---:|---:|---:|---:|
| Frontier `planner_kv` | 96/100 | 110,608 / 0.240691 | 0 | 0.986/128.991s | 0.582/0.582s | 279.092/1706.675s | 19.4 |
| Frontier `vllm_kv_15281` | 92/100 | 103,168 / 0.242542 | 0 | 0.964/182.639s | 0.582/0.582s | 305.290/1765.347s | 19.4 |
| vLLM TP1 real | 100/100 | 119,152 / 0.251082 sidecar estimate | 8 | 4.503/29.060s | 0.066/0.621s | 41.841/97.366s | 567.4 |

The Frontier latency and throughput figures are not calibrated. Frontier still
uses dummy execution timing, so TPOT is a constant simulator artifact.

## Prefix Admission Check

Real vLLM preempts requests at TP1, so the sidecar's theoretical prefix-hit
estimate is not the right observed comparator for every request. The observed
vLLM scheduler signal is the first `computed:` value in `stdout.log` for each
request start.

Using first-start `computed:` tokens:

| Frontier run | compared rows | Frontier computed sum | vLLM first-start computed sum | mismatch |
|---|---:|---:|---:|---:|
| `planner_kv` | 96 | 110,608 | 108,208 | one request differs |
| `vllm_kv_15281` | 92 | 103,168 | 103,168 | exact match |

So with the KV block count explicitly matched, Frontier's prefix-cache admission
matches real vLLM TP1 for every row where Frontier emits complete cache metrics.
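
The comparison in the table can be sketched as a per-request diff over the request ids that both sides report. This is an illustration under assumptions, not the actual comparator: it takes two already-parsed mappings from request id to computed (prefix-hit) tokens, and it does not attempt to parse `stdout.log`, whose exact line format is not given in this note. The function name and the toy data are hypothetical.

```python
def compare_first_start(frontier: dict[int, int], vllm: dict[int, int]) -> dict:
    """Compare per-request computed tokens on the request ids present in both inputs."""
    shared = sorted(frontier.keys() & vllm.keys())
    mismatched = [rid for rid in shared if frontier[rid] != vllm[rid]]
    return {
        "compared_rows": len(shared),
        "frontier_sum": sum(frontier[rid] for rid in shared),
        "vllm_sum": sum(vllm[rid] for rid in shared),
        "mismatched_ids": mismatched,
    }


# Toy data, not values from the runs above. Request 3 has no Frontier row,
# so it is excluded from the comparison rather than counted as a mismatch.
frontier_rows = {0: 0, 1: 1024, 2: 2048}
vllm_rows = {0: 0, 1: 1024, 2: 1600, 3: 512}

report = compare_first_start(frontier_rows, vllm_rows)
print(report)
```

Restricting to shared ids is what makes the "compared rows" column smaller than 100 when Frontier omits rows.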

## Current Alignment Judgment

Aligned:

- H20 device and Qwen3-30B-A3B structural model config can run in Frontier.
- TP1 scheduler knobs can be matched.
- KV block count can be matched explicitly at 15,281 blocks.
- First-admission prefix-cache hit tokens match real vLLM TP1 on completed rows
  when the KV block count is set explicitly.

Not aligned:

- Frontier emits complete request/cache metrics for only 92/100 requests in the
  explicit-KV run, while vLLM completes 100/100.
- Frontier reports 0 preemptions; real vLLM TP1 reports 8 preemptions across 5
  repeated-start requests.
- Frontier timing is not comparable because it still uses dummy execution
  prediction. The current latency/throughput gap is expected and does not
  measure the error of a calibrated simulator.

Next work:

- Treat RS6 as the current profiled baseline and investigate why it omits
  complete latency/cache metrics for requests `70`, `77`, `88`, and `90`.
- Instrument Frontier's vLLM V1 scheduler around KV block allocation, free-block
  count, and preemption victim selection. Real vLLM TP1 has 8 preemptions, while
  Frontier still reports 0 with the same explicit 15,281-block capacity.
- Add a per-request Frontier/vLLM comparator that reports TTFT/TPOT/E2E ratios,
  prefix hits, and completion/preemption status on the same request ids.
- Calibrate CPU/scheduler/CUDA-graph effects separately from op profile timing;
  RS6 ruled out the 4096-token linear/MoE extrapolation as the primary
  explanation for the remaining gap.

## Performance Gap

Use Frontier `vllm_kv_15281` as the current aligned-KV simulator point. This
matches the real vLLM TP1 KV block count, but it still uses Frontier dummy
execution timing.

| metric | Frontier H20 TP1 explicit KV | real vLLM H20 TP1 | gap |
|---|---:|---:|---:|
| completed requests | 92/100 | 100/100 | not aligned |
| TTFT p50 | 0.964s | 4.503s | Frontier 0.21x real |
| TTFT p95 | 182.639s | 29.060s | Frontier 6.28x real |
| TPOT p50 | 0.582s | 0.066s | Frontier 8.81x real |
| TPOT p95 | 0.582s | 0.621s | Frontier 0.94x real |
| E2E p50 | 305.290s | 41.841s | Frontier 7.30x real |
| E2E p95 | 1765.347s | 97.366s | Frontier 18.13x real |
| RPS | 0.0217 | 0.6880 | vLLM 31.74x Frontier |
| decode tok/s | 19.4 | 567.4 | vLLM 29.20x Frontier |

Interpretation:

- The prefix admission path is close after explicit KV matching, but performance
  is not calibrated.
- Frontier uses dummy execution timing; its TPOT is nearly constant at 582 ms,
  while real vLLM TP1 has p50 TPOT 66 ms and p95 TPOT 621 ms.
- Frontier does not reproduce real vLLM's TP1 preemption behavior: real vLLM had
  8 preemptions, while Frontier reported 0.
- Frontier emits complete request/cache metrics for only 92 rows in this run,
  so p95 and throughput are not yet on the same request set.
- The TTFT error changes sign: Frontier p50 TTFT is too optimistic, but p95 TTFT
  is far too pessimistic. This is consistent with uncalibrated execution timing
  plus different queue/preemption dynamics.
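
The latency entries in the gap column are Frontier divided by real vLLM, while the two throughput rows are quoted in the opposite direction. The sketch below recomputes the latency ratios from the rounded table values; the `ratio` helper and the `metrics` dictionary are illustrative and not part of the ReplayServe tools.

```python
def ratio(frontier: float, real: float) -> float:
    """Frontier value expressed as a multiple of the real vLLM value."""
    return frontier / real


# (Frontier, real vLLM) pairs copied from the Performance Gap table, in seconds.
metrics = {
    "TTFT p50": (0.964, 4.503),
    "TTFT p95": (182.639, 29.060),
    "TPOT p95": (0.582, 0.621),
    "E2E p50": (305.290, 41.841),
    "E2E p95": (1765.347, 97.366),
}

for name, (frontier, real) in metrics.items():
    print(f"{name}: Frontier {ratio(frontier, real):.2f}x real")
```

TPOT p50 is left out on purpose: the rounded inputs 0.582 and 0.066 give 8.82, not the reported 8.81, so the table value was presumably derived from unrounded measurements.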

## RS5 Profiled Frontier Timing

Frontier does support replacing dummy timing with real CSV profiles through the
random-forest execution-time predictor. The required non-dummy flags are wired
in `tools/run_frontier_sweep.py`, and the active profiled config is
`configs/rs5_frontier_h20_tp1_profile.json`.

Profile data collected on dash2 H20 TP1:

- Linear ops: `linear_op.csv`, CUDA event, max tokens 4096.
- Attention: `attention_combined.csv`, CUDA event, max sequence/chunk 18000,
  with 15417 standard rows plus 612 true-mixed rows. Online replay needs the
  true-mixed rows to train `attn_prefill_mixed` and `attn_decode_in_mixed`.
- MoE: `moe_vllm_fused.csv`, CUDA event, max tokens 4096, vLLM fused MoE
  backend.

Frontier vLLM 0.11.1 profiling needed local compatibility patches in
`patches/frontier-vllm-0.11.1-profiling-compat.patch`:

- RoPE helper fallback when vLLM 0.11.1 `get_rope()` no longer accepts the
  legacy `rotary_dim` keyword.
- `_get_config_dtype_str` fallback for vLLM fused MoE config dtype.
- `ReplicatedLinear(disable_tp=True)` fallback to torch `Linear` when vLLM TP
  group is not initialized in standalone profiling.
- `fused_topk()` variable-return handling.
- `invoke_fused_moe_kernel()` 0.11.1 signature compatibility.

The first profiled MoE attempt used Frontier's `frontier_loop` backend and was
not faithful to vLLM serving. It predicted `moe_grouped_gemm` at about 16 ms for
24 tokens and 19 ms for 1024 tokens, causing TPOT around 0.93 s. The vLLM fused
MoE profile predicts about 0.32 ms for 24 tokens and 0.87 ms for 1024 tokens.

| run | completed | prefix hit ratio | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 | total tok/s | decode tok/s |
|---|---:|---:|---:|---:|---:|---:|---:|
| Frontier dummy `vllm_kv_15281` | 92/100 | 0.2422 | 0.964/182.639s | 0.582/0.582s | 305.290/1765.347s | 131.3 | 19.4 |
| Frontier profiled `frontier_loop` MoE | 93/100 | 0.2492 | 3.320/310.235s | 0.930/1.767s | 492.097/2038.538s | 165.9 | 24.6 |
| Frontier profiled vLLM fused MoE | 97/100 | 0.2376 | 0.355/13.695s | 0.056/0.098s | 27.032/119.019s | 2056.7 | 304.5 |
| Frontier profiled vLLM fused MoE, linear/MoE 32K | 96/100 | 0.2484 | 0.909/12.763s | 0.057/0.146s | 30.939/119.636s | 2348.9 | 347.8 |
| vLLM TP1 real | 100/100 | 0.2511 | 4.503/29.060s | 0.066/0.621s | 41.841/97.366s | 3832.3 | 567.4 |

Current judgment:

- The profiled vLLM fused MoE run is the first useful timing baseline. TPOT p50
  is close to real vLLM, but throughput is still about 54% of real vLLM and
  TTFT/E2E tails do not align.
- After extending linear and MoE profiles to 32768 tokens and adding
  `prefill_hot` MoE rows, the cache hit ratio is nearly aligned
  (0.2484 vs vLLM 0.2511), throughput improves to about 61% of real vLLM, and
  TTFT p50 moves from 0.08x to 0.20x of real vLLM. This confirms that the 4096
  profile ceiling was a real source of error.
- Prefix/cache accounting remains close but not exact: the profiled run emits
  complete cache metrics for 96/100 requests in the 32K run, with token hit
  ratio 0.2488 vs vLLM's sidecar estimate 0.2511.
- Frontier still reports zero preemptions, while real vLLM TP1 had 8 preemption
  events. This affects completion set, TTFT tail, and E2E tail.
- The remaining gaps are no longer explained by the linear/MoE 4096-token
  extrapolation alone. The 32K run still has TTFT p50 at 0.20x, TTFT p95 at
  0.44x, TPOT p95 at 0.23x, and throughput at 0.61x of real vLLM. This points
  to missing CPU/scheduler/CUDA-graph modeling plus Frontier's scheduler and
  completion/preemption fidelity.
- The 32K run still completes only 96/100 requests in latency/cache metrics
  (`70`, `77`, `88`, `90` missing), while real vLLM completes 100/100. This is
  a Frontier lifecycle/metrics or scheduler-fidelity issue to debug separately.

## 2026-06-24 Follow-Up

Handled in the ReplayServe harness:

- `tools/run_frontier_sweep.py` now passes an absolute metrics output path into
  Frontier. Frontier runs with `cwd=/tmp/replayserve-frontier-rs1b`; relative
  metrics paths can otherwise be written under the Frontier scratch instead of
  ReplayServe's run directory.
- `tools/postprocess_frontier_smoke.py` now emits a `completion` block with
  `completed_requests`, `total_requests`, and `missing_latency_request_ids`.
- `tools/aggregate_runs.py` now marks a run as `incomplete` when postprocess
  reports missing latency rows. The latest RS6 summary is therefore incomplete,
  not a clean pass.
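
The completion bookkeeping described above can be sketched as follows. Only the three field names and the `incomplete` status come from this note; the function names, their signatures, and the use of `pass` as the complementary status are assumptions for illustration and are not the actual code in `tools/postprocess_frontier_smoke.py` or `tools/aggregate_runs.py`.

```python
def completion_block(total_ids, latency_ids) -> dict:
    """Summarize which trace request ids have a latency row (hypothetical helper)."""
    total = sorted(set(total_ids))
    seen = set(latency_ids)
    missing = [rid for rid in total if rid not in seen]
    return {
        "completed_requests": len(total) - len(missing),
        "total_requests": len(total),
        "missing_latency_request_ids": missing,
    }


def run_status(completion: dict) -> str:
    """Mark a run incomplete as soon as any latency row is missing."""
    return "incomplete" if completion["missing_latency_request_ids"] else "pass"


# Reproduce the RS6 situation: 100 requests, four without latency rows.
all_ids = range(100)
emitted = [rid for rid in all_ids if rid not in {70, 77, 88, 90}]

block = completion_block(all_ids, emitted)
print(block, run_status(block))
```

Reporting the missing ids, not just the count, is what later allows a real vLLM run to be restricted to the same request set.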

Latest RS6 vs real vLLM TP1 after the 32K profile and harness fixes:

| metric | Frontier RS6 32K profile | real vLLM TP1 | Frontier / vLLM |
|---|---:|---:|---:|
| completed requests | 96/100 | 100/100 | 0.96 |
| prefix token hit ratio | 0.2488 | 0.2511 | 0.99 |
| preemption events | 0 | 8 | 0.00 |
| TTFT p50 | 0.909s | 4.503s | 0.20 |
| TTFT p95 | 12.763s | 29.060s | 0.44 |
| TPOT p50 | 0.0569s | 0.0661s | 0.86 |
| TPOT p95 | 0.146s | 0.621s | 0.23 |
| E2E p50 | 30.939s | 41.841s | 0.74 |
| E2E p95 | 119.636s | 97.366s | 1.23 |
| total tok/s | 2348.9 | 3832.3 | 0.61 |
| decode tok/s | 347.8 | 567.4 | 0.61 |

Preemption experiment:

- A local trial enabled waiting-admission preemption in Frontier Phase 2. It did
  produce preemption events, but it was not a valid alignment improvement:
  Frontier completed only 79/100 requests and amplified the early-decode
  disappearance pattern. That config was removed from `configs/`.
- This means the remaining preemption gap is not just "turn on preemption in
  Phase 2". Frontier's batch/runtime-epoch lifecycle needs a deeper fix before
  its preemption behavior can be considered faithful to vLLM TP1.

Current interpretation:

- Prefix/cache replay is close: token-weighted prefix hit ratio is within about
  1% relative of the vLLM synthetic replay estimate.
- Completion/preemption is not aligned. Requests `70`, `77`, `88`, and `90`
  begin decode in RS6 but never reach completion metrics; vLLM completes all
  100 requests and logs 8 preemption events.
- Timing is partially useful but not fully calibrated. Linear and MoE profiles
  now cover the trace's long-prefill range up to 32768 tokens, so the old 4096
  extrapolation is no longer the main explanation. The remaining TTFT/TPOT/E2E
  gap likely comes from missing CPU/scheduler overhead, decode CUDA graph
  modeling, and Frontier scheduler lifecycle differences.

## 2026-06-25 500-Request Stress

Generated `traces/fixtures/coder_500` from the first 500 rows of
`qwen_coder_blksz_16.jsonl`:

- `row_count=500`
- `max_total_tokens=21318`
- `overflow_count=0`
- `partial_final_block_rows=466`

Frontier RS8 used the same H20 TP1 Qwen3-30B-A3B full32K profile and explicit
KV block count as RS6:

- Config:
  `configs/rs8_frontier_h20_tp1_profile_full32k_coder500.json`
- Run:
  `runs/rs8_frontier_h20_tp1_profile_full32k_coder500_20260625`
- Runtime: 492 seconds
- Status: incomplete

| metric | Frontier RS6 100 reqs | Frontier RS8 500 reqs |
|---|---:|---:|
| completed requests | 96/100 | 439/500 |
| missing latency/cache rows | 4 | 61 |
| prefix token hit ratio | 0.2488 | 0.1192 |
| preemption events | 0 | 0 |
| TTFT p50/p95 | 0.909/12.763s | 136.776/340.237s |
| TPOT p50/p95 | 0.0569/0.146s | 0.0564/0.0894s |
| E2E p50/p95 | 30.939/119.636s | 177.800/397.291s |
| total tok/s | 2348.9 | 4733.7 |
| decode tok/s | 347.8 | 656.2 |

Missing request ids in RS8:

```text
70,77,88,90,103,106,134,135,142,143,153,154,176,178,183,184,186,188,210,211,216,222,245,246,263,272,274,278,291,298,299,300,320,325,334,335,347,348,363,367,373,374,393,399,403,409,412,413,414,433,434,437,439,450,453,460,469,475,476,479,497
```

The incomplete-row issue grows faster than the request count: 4/100 (4.0%)
missing in RS6 becomes 61/500 (12.2%) missing in RS8. This makes RS8 invalid for
final performance claims, but useful as a stress signal for Frontier
lifecycle/metrics fidelity.

The lower prefix hit ratio is not by itself proof of adapter failure. The
unbounded trace-side trie estimate for `coder_500` is 0.3868 token hit ratio,
but the H20 TP1 configuration has finite KV capacity (`num_blocks=15281`, about
244K tokens). The 500-request window has 2.7M prompt tokens, so KV eviction can
substantially reduce real prefix hits. The dash1 vLLM run below is the current
finite-cache comparator for whether Frontier's behavior is faithful.

Real vLLM TP1 500 was first attempted on dash2 with the same settings as
`tp1_coder100_uncapped` (`max_num_seqs=64`, `max_num_batched_tokens=32768`,
`gpu_memory_utilization=0.85`, `CUDA_VISIBLE_DEVICES=0`), but did not start
because dash2 was already occupied by eight existing `agentic-kvc` vLLM serve
processes on ports 8000-8007. Each H20 had about 89GB allocated, and vLLM failed
with free memory below the required 0.85 utilization target. Those processes
were not killed; the temporary ReplayServe GPU lock was released.

A replacement vLLM TP1 500 run completed on dash1:

- Run:
  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder500_uncapped`
- Runtime: vLLM 0.11.1
- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
- Command knobs: `TP=1`, `max_model_len=32768`, `max_num_seqs=64`,
  `max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
  prefix caching on, chunked prefill on
- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
- Replay wall time after engine startup: 595.116 seconds
- Process elapsed including model load/startup: 2026-06-25T03:08:18Z to
  2026-06-25T03:19:41Z

| metric | Frontier RS8 500 reqs | vLLM TP1 500 reqs | vLLM / Frontier |
|---|---:|---:|---:|
| completed requests | 439/500 | 500/500 | not aligned |
| preemption events | 0 | 63 | not aligned |
| repeated/preempted request ids | 0 | 57 | not aligned |
| TTFT p50 | 136.776s | 185.658s | 1.36 |
| TTFT p95 | 340.237s | 375.895s | 1.10 |
| TPOT p50 | 0.0564s | 0.0498s | 0.88 |
| TPOT p95 | 0.0894s | 0.0919s | 1.03 |
| E2E p50 | 177.800s | 224.270s | 1.26 |
| E2E p95 | 397.291s | 417.356s | 1.05 |
| requests/s | 0.661 | 0.840 | 1.27 |
| total tok/s | 4733.7 | 5282.9 | 1.12 |
| decode tok/s | 656.2 | 732.3 | 1.12 |

Because Frontier emits latency/cache rows for only 439 requests, the latency
comparison above mixes Frontier's completed subset with vLLM's complete 500-row
run. Restricting vLLM to the same 439 request ids gives:

| metric | Frontier RS8 439 rows | vLLM same 439 ids | vLLM / Frontier |
|---|---:|---:|---:|
| TTFT p50 | 136.776s | 169.968s | 1.24 |
| TTFT p95 | 340.237s | 375.760s | 1.10 |
| TPOT p50 | 0.0564s | 0.0498s | 0.88 |
| TPOT p95 | 0.0894s | 0.1071s | 1.20 |
| E2E p50 | 177.800s | 218.606s | 1.23 |
| E2E p95 | 397.291s | 416.110s | 1.05 |
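
Restricting one run to the request ids that the other run reports can be sketched as below. This is an illustration only: the note does not say which percentile method the ReplayServe tools use, so the nearest-rank definition here is an assumption, and the helper names and toy latencies are hypothetical.

```python
import math


def percentile(values, pct: float) -> float:
    """Nearest-rank percentile (an assumed method, see the lead-in)."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def restrict(latency_by_id: dict, keep_ids) -> list:
    """Keep only the latencies whose request id is in keep_ids."""
    return [latency_by_id[rid] for rid in sorted(keep_ids) if rid in latency_by_id]


# Toy TTFT values in seconds; request 3 has no Frontier row.
vllm_ttft = {0: 1.0, 1: 4.0, 2: 2.0, 3: 9.0, 4: 3.0}
frontier_ttft = {0: 0.8, 1: 3.5, 2: 2.2, 4: 2.9}

shared_ids = frontier_ttft.keys() & vllm_ttft.keys()
vllm_subset = restrict(vllm_ttft, shared_ids)

print("vLLM p95, all ids:   ", percentile(list(vllm_ttft.values()), 95))
print("vLLM p95, shared ids:", percentile(vllm_subset, 95))
```

In the toy data the slowest vLLM request is exactly the one Frontier dropped, so the subset tail looks better than the full-run tail. The same effect is why the full-run and same-id tables above differ.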

Prefix/cache comparison needs careful metric naming:

- The unbounded ReplayServe trie estimate for all 500 rows is 1,047,632 hit
  tokens / 2,708,110 prompt tokens = 0.3868 token hit ratio.
- vLLM's finite-cache scheduler log is much lower under this pressure:
  first-start `computed:` ratio is 0.0979, last-start ratio is 0.1643, and
  max-per-request ratio is 0.1655.
- On the same 439 request ids where Frontier emits complete metrics, vLLM's
  first-start `computed:` ratio is 0.1050, last-start ratio is 0.1665, and
  max-per-request ratio is 0.1679.
- Frontier RS8 reports `replayserve_token_hit_ratio=0.1192` and
  `frontier_block_hit_ratio=0.1191`, which lies between vLLM's first-start and
  last-start finite-cache scheduler signals but far below the unbounded
  trace-side estimate.
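
The gap between an unbounded estimate and a finite-cache observation can be illustrated with a toy model. The sketch below is not the ReplayServe trie estimator and not vLLM's eviction policy: it assumes each block id already identifies the whole prefix up to that block, uses plain LRU eviction, ignores preemption and partial final blocks, and counts blocks rather than tokens. All names and data are hypothetical.

```python
from collections import OrderedDict


def prefix_hit_blocks(requests, capacity=None):
    """Return (hit_blocks, total_blocks); capacity=None models an unbounded cache."""
    cache = OrderedDict()  # block id -> None, ordered from least to most recently used
    hits = total = 0
    for block_ids in requests:
        total += len(block_ids)
        still_matching = True  # a prefix hit must be contiguous from block 0
        for block_id in block_ids:
            if still_matching and block_id in cache:
                hits += 1
            else:
                still_matching = False
            cache[block_id] = None
            cache.move_to_end(block_id)  # mark as most recently used
            if capacity is not None and len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return hits, total


# Toy trace: the last request repeats the first one after unrelated traffic.
trace = [[1, 2, 3], [1, 2, 4], [5, 6, 7], [1, 2, 3]]

print("unbounded:", prefix_hit_blocks(trace))
print("capacity 4:", prefix_hit_blocks(trace, capacity=4))
```

With unlimited capacity the repeated request hits all three blocks, while a four-block cache has already evicted them. That is the mechanism by which a 0.3868 unbounded ratio can coexist with a much lower observed ratio under KV pressure.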

Current 500-request judgment:

- Frontier's timing profile is now in the right broad range for this stressed
  H20 TP1 run: TPOT p50/p95 and E2E p95 are close to vLLM, and aggregate token
  throughput is within about 12%.
- The run is still not a faithful simulator result because completion and
  preemption diverge: Frontier drops 61 latency/cache rows and reports zero
  preemptions, while real vLLM completes all 500 requests and logs 63
  preemption events across 57 request ids.
- The 500-request trace invalidates the earlier use of the unbounded sidecar
  prefix estimate as the primary comparator. Finite KV capacity, eviction, and
  preemption must be part of the prefix-cache replay metric.

ReplayServe TODO:

- Treat incomplete Frontier runs as invalid for final performance claims unless
  the comparison explicitly reports the missing request set.
- Keep the focused Frontier debug guard in the local patch: sequential mode now
  fails if `completed_requests < total_requests` at drain time and reports the
  missing request state.
- Add a comparator that reports both unbounded trace-side prefix reuse and
  finite-cache observed reuse from vLLM scheduler logs; do not compare
  Frontier's finite-cache hit ratio directly to the unbounded trie estimate.
- Profile or import vLLM CPU overhead records for H20 TP1 before enabling
  `skip_cpu_overhead_modeling=false`; without those records Frontier falls back
  to zero CPU overhead.
- Collect kernel-only/decode-CUDA-graph timing profiles before using
  `decode_cuda_graph_mode=full_decode_only`; the current RS6 profile is CUDA
  event/eager timing.

## 2026-06-25 200-Request Timestamp Scale 2/3

Generated `traces/fixtures/coder_200_ts0667` from the first 200 rows of
`qwen_coder_blksz_16.jsonl`, with each timestamp multiplied by `2/3` in the
fixture files:

- `row_count=200`
- `timestamp_scale=0.6666666666666666`
- `last_timestamp=30.711333333333332`
- `max_total_tokens=18985`
- `partial_final_block_rows=182`

Important: in the current replay semantics, a smaller timestamp scale makes
arrivals denser. Scale `2/3` shrinks the arrival window from about 46.1s to
30.7s for the first 200 requests. This does not reduce queue pressure relative
to the same 200 requests at scale 1.0; it only reduces the request count
relative to the 500-request stress.
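
The effect of the timestamp scale on the arrival window can be sketched as follows. The row layout with a `timestamp` key is an assumption about the fixture format, the helper name is hypothetical, and the intermediate timestamp is made up; the unscaled window of 46.067s is the value implied by the reported `last_timestamp` at scale 2/3.

```python
def scale_timestamps(rows: list, scale: float) -> list:
    """Return copies of the rows with each arrival timestamp multiplied by scale."""
    return [{**row, "timestamp": row["timestamp"] * scale} for row in rows]


# Toy rows: only the first and last arrivals matter for the window length.
rows = [{"timestamp": 0.0}, {"timestamp": 23.0}, {"timestamp": 46.067}]

for scale in (2 / 3, 1.0, 2.0, 3.0):
    scaled = scale_timestamps(rows, scale)
    window = scaled[-1]["timestamp"] - scaled[0]["timestamp"]
    print(f"scale={scale:.4f}: arrival window {window:.3f}s")
```

A scale below 1 compresses the same requests into a shorter window and raises the arrival rate, while a scale above 1 spreads them out.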

Frontier RS9:

- Config:
  `configs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667.json`
- Run:
  `runs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667`
- Runtime: 460 seconds
- Status: incomplete

vLLM dash1 TP1:

- Run:
  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder200_ts0667_uncapped`
- Runtime: vLLM 0.11.1
- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
- Replay wall time after engine startup: 242.813 seconds

| metric | Frontier RS9 200 ts=2/3 | vLLM TP1 200 ts=2/3 | vLLM / Frontier |
|---|---:|---:|---:|
| completed requests | 176/200 | 200/200 | not aligned |
| preemption events | 0 | 26 | not aligned |
| TTFT p50 | 20.580s | 34.563s | 1.68 |
| TTFT p95 | 96.718s | 120.804s | 1.25 |
| TPOT p50 | 0.0584s | 0.0515s | 0.88 |
| TPOT p95 | 0.2359s | 0.2535s | 1.07 |
| E2E p50 | 73.207s | 83.622s | 1.14 |
| E2E p95 | 189.240s | 183.727s | 0.97 |
| requests/s | 0.583 | 0.824 | 1.41 |
| total tok/s | 3913.4 | 4864.8 | 1.24 |
| decode tok/s | 593.3 | 737.5 | 1.24 |

Restricting vLLM to the same 176 request ids where Frontier emits complete
metrics gives:

| metric | Frontier RS9 176 rows | vLLM same 176 ids | vLLM / Frontier |
|---|---:|---:|---:|
| TTFT p50 | 20.580s | 27.896s | 1.36 |
| TTFT p95 | 96.718s | 120.804s | 1.25 |
| TPOT p50 | 0.0584s | 0.0520s | 0.89 |
| TPOT p95 | 0.2359s | 0.2539s | 1.08 |
| E2E p50 | 73.207s | 82.645s | 1.13 |
| E2E p95 | 189.240s | 183.727s | 0.97 |

Prefix/cache comparison:

- The unbounded ReplayServe trie estimate for all 200 rows is 270,336 hit
  tokens / 1,002,154 prompt tokens = 0.2698 token hit ratio.
- vLLM finite-cache scheduler signal for all 200 rows: first-start `computed:`
  ratio 0.1392, last-start ratio 0.2126, max-per-request ratio 0.2129.
- On the same 176 request ids where Frontier emits complete metrics, vLLM
  first-start ratio is 0.1487, last-start ratio is 0.1926, and max-per-request
  ratio is 0.1927.
- Frontier RS9 reports `replayserve_token_hit_ratio=0.1703` and
  `frontier_block_hit_ratio=0.1700`, again between vLLM first-start and
  last/max finite-cache scheduler signals.

Missing request ids in RS9:

```text
70,78,80,86,87,89,96,101,102,105,125,126,131,132,135,144,145,146,147,148,149,150,151,198
```

Current 200-request judgment:

- Reducing the request count from 500 to 200 substantially reduces TTFT and E2E
  tails, but `scale=2/3` is still a dense-arrival stress test. vLLM TTFT p95 is
  still 120.8s.
- Frontier timing is closer than the old 100-request dummy/profile baselines:
  TPOT p50/p95 and E2E p50/p95 are broadly aligned.
- Completion/preemption remains the blocking fidelity issue: Frontier drops 24
  rows and reports zero preemptions; vLLM completes all 200 and logs 26
  preemptions across 22 repeated-start request ids.
- To actually reduce queue pressure for the same first 200 requests, use a
  timestamp scale greater than 1. The follow-up scale 2 and 3 runs below do
  this.

## 2026-06-25 200-Request Timestamp Scale 2 and 3

Generated two more first-200 fixtures from `qwen_coder_blksz_16.jsonl`:

| fixture | timestamp scale | last timestamp | max total tokens |
|---|---:|---:|---:|
| `coder_200_ts2` | 2.0 | 92.134s | 18,985 |
| `coder_200_ts3` | 3.0 | 138.201s | 18,985 |

These are the intended lower-arrival-pressure runs. The request payloads are the
same first 200 rows as `coder_200_ts0667`; only timestamps differ.

Frontier RS10:

- Config:
  `configs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3.json`
- Run:
  `runs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3`
- Status: incomplete for both fixtures

vLLM dash1 TP1:

- Runs:
  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts2_uncapped`
  and `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts3_uncapped`
- Runtime: vLLM 0.11.1
- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16

Run-level comparison:

| metric | Frontier scale 2 | vLLM scale 2 | Frontier scale 3 | vLLM scale 3 |
|---|---:|---:|---:|---:|
| completed requests | 182/200 | 200/200 | 184/200 | 200/200 |
| preemption events | 0 | 43 | 0 | 16 |
| TTFT p50 | 8.118s | 9.217s | 0.779s | 1.166s |
| TTFT p95 | 67.850s | 69.211s | 35.918s | 32.258s |
| TPOT p50 | 0.0544s | 0.0497s | 0.0544s | 0.0462s |
| TPOT p95 | 0.0747s | 0.0686s | 0.0773s | 0.0714s |
| E2E p50 | 51.118s | 55.002s | 40.641s | 33.213s |
| E2E p95 | 162.607s | 142.338s | 158.434s | 122.789s |
| requests/s | 0.593 | 0.803 | 0.544 | 0.780 |
| total tok/s | 3846.1 | 4742.5 | 3490.6 | 4608.1 |
| decode tok/s | 583.1 | 719.0 | 529.2 | 698.6 |

Restricting vLLM to the same request ids where Frontier emits complete metrics:

| metric | Frontier scale 2 182 rows | vLLM same 182 ids | Frontier scale 3 184 rows | vLLM same 184 ids |
|---|---:|---:|---:|---:|
| TTFT p50 | 8.118s | 8.574s | 0.779s | 0.945s |
| TTFT p95 | 67.850s | 68.934s | 35.918s | 32.258s |
| TPOT p50 | 0.0544s | 0.0501s | 0.0544s | 0.0461s |
| TPOT p95 | 0.0747s | 0.0686s | 0.0773s | 0.0679s |
| E2E p50 | 51.118s | 53.263s | 40.641s | 33.213s |
| E2E p95 | 162.607s | 141.264s | 158.434s | 122.789s |

Prefix/cache comparison:

| metric | scale 2 | scale 3 |
|---|---:|---:|
| unbounded trace-side token hit ratio | 0.2698 | 0.2698 |
| vLLM first-start `computed:` ratio | 0.1433 | 0.1471 |
| vLLM last-start `computed:` ratio | 0.2382 | 0.1968 |
| vLLM max-per-request `computed:` ratio | 0.2383 | 0.1998 |
| Frontier `replayserve_token_hit_ratio` | 0.1448 | 0.1523 |
| Frontier `frontier_block_hit_ratio` | 0.1446 | 0.1521 |

Current scale 2 and 3 judgment:

- The user's intended `scale=2` and `scale=3` runs do reduce queueing. vLLM
  TTFT p95 drops from 120.8s at `scale=2/3` to 69.2s at `scale=2` and 32.3s at
  `scale=3`.
- `scale=3` is the first run where vLLM p50 TTFT is near 1s. The p95 is still
  high because long prompts and KV pressure remain, but the severe all-request
  queueing seen in the 500-request run is much reduced.
- Frontier timing is now close on TTFT and TPOT for the completed-row subset,
  especially at `scale=2`. However, Frontier still misses completion/cache rows
  and still reports zero preemptions.
- Completion/preemption is therefore still the main Frontier fidelity blocker:
  `scale=2` misses 18 rows and vLLM logs 43 preemptions; `scale=3` misses 16 rows
  and vLLM logs 16 preemptions.

## 2026-06-25 Frontier Lifecycle Fix For RS10

The missing-row root cause was Frontier lifecycle handling after decode-phase
preemption. Missing requests were preempted after prefill/decode had started,
then left in this inconsistent state:

```text
preempted=True
is_prefill_complete=True
num_processed_tokens=0
scheduled=False
completed=False
```

The next waiting admission computed `num_new_tokens=0` and removed the request
from the queue, so sequential simulation drained with fewer completed requests
but no remaining scheduler work.

The updated ReplayServe Frontier patch now:

- replays decode-phase preemption by treating already-produced tokens as the
  next prefill segment and the remaining tokens as decode work;
- preserves unfinished zero-token waiting requests instead of silently dropping
  them;
- reports metrics against user-facing trace prompt/output lengths after runtime
  token splitting;
- fails fast if sequential mode drains before all generated requests complete.
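
The replay split and the drain guard described above can be sketched as below. This is not the Frontier patch: the `Request` fields, both function names, and the choice to recompute the prompt together with the already-generated tokens are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    output_tokens: int
    generated: int = 0  # output tokens produced before the preemption


def replay_after_preemption(req: Request) -> tuple[int, int]:
    """Return (prefill_tokens, decode_tokens) for re-admitting a preempted request.

    Assumption: everything processed so far (prompt plus generated tokens)
    is recomputed as prefill, and only the remaining output is decode work.
    """
    prefill = req.prompt_tokens + req.generated
    decode = req.output_tokens - req.generated
    return prefill, decode


def check_drained(completed: int, total: int, waiting_ids: list) -> None:
    """Fail fast when the simulation drains before every request completes."""
    if completed < total:
        raise RuntimeError(
            f"drained with {completed}/{total} requests completed; "
            f"still waiting: {waiting_ids}"
        )


req = Request(prompt_tokens=1000, output_tokens=200, generated=50)
print(replay_after_preemption(req))  # nonzero prefill work, so the request is not dropped
check_drained(completed=200, total=200, waiting_ids=[])  # passes silently
```

Because the replayed prefill segment is never empty for a request that has started, re-admission no longer sees zero new tokens, which is the condition that previously removed the request from the queue.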

Verification runs:

| run | old completion | fixed completion | Frontier preemptions | prefix token hit ratio | status |
|---|---:|---:|---:|---:|---|
| `coder_200_ts2` | 182/200 | 200/200 | 33 | 0.2313 | pass |
| `coder_200_ts3` | 184/200 | 200/200 | 20 | 0.2177 | pass |

Fixed-run paths:

- `runs/rs10_preemption_replay_fix_ts2/frontier_h20_tp1_profile_full32k/coder_200_ts2/vllm_kv_15281_profile_full32k`
- `runs/rs10_preemption_replay_fix_ts3/frontier_h20_tp1_profile_full32k/coder_200_ts3/vllm_kv_15281_profile_full32k`

Updated run-level comparison:

| metric | Frontier scale 2 fixed | vLLM scale 2 | Frontier scale 3 fixed | vLLM scale 3 |
|---|---:|---:|---:|---:|
| completed requests | 200/200 | 200/200 | 200/200 | 200/200 |
| preemption events | 33 | 43 | 20 | 16 |
| TTFT p50 | 9.595s | 9.217s | 1.001s | 1.166s |
| TTFT p95 | 77.503s | 69.211s | 45.947s | 32.258s |
| TPOT p50 | 0.0542s | 0.0497s | 0.0534s | 0.0462s |
| TPOT p95 | 0.0665s | 0.0686s | 0.0686s | 0.0714s |
| E2E p50 | 61.458s | 55.002s | 44.761s | 33.213s |
| E2E p95 | 174.484s | 142.338s | 154.548s | 122.789s |
| requests/s | 0.594 | 0.803 | 0.574 | 0.780 |
| total tok/s | 3506.3 | 4742.5 | 3390.0 | 4608.1 |
| decode tok/s | 531.6 | 719.0 | 513.9 | 698.6 |
|
||||
|
||||
Current judgment after the fix:

- The completion/preemption lifecycle blocker for RS10 is fixed: both scale 2
  and scale 3 now emit 200 request rows and complete postprocess.
- Frontier preemption counts are now of the same order of magnitude as vLLM's,
  but do not match exactly: scale 2 is 33 vs 43 events, scale 3 is 20 vs 16
  events.
- Prefix hit ratio changed materially because preempted requests now replay and
  re-enter prefix-cache admission instead of disappearing. It is no longer valid
  to compare the old incomplete RS10 prefix ratios against vLLM.
- TPOT remains close, but Frontier is still slower in aggregate throughput,
  about 0.74x of vLLM total/decode token throughput for both scale 2 and
  scale 3. TTFT/E2E tails are still worse than vLLM's even now that every
  request completes.
- The remaining gap is no longer "missing metrics rows"; it is
  scheduler/preemption fidelity plus CPU/scheduler/CUDA-graph timing
  calibration.

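The 0.74x figure follows directly from the run-level comparison table. A minimal sketch of the arithmetic, with the values copied from that table (the names `THROUGHPUT` and `RATIOS` are illustrative only):

```python
# (Frontier fixed, vLLM) pairs copied from the run-level comparison table.
THROUGHPUT = {
    ("scale 2", "requests/s"): (0.594, 0.803),
    ("scale 2", "total tok/s"): (3506.3, 4742.5),
    ("scale 2", "decode tok/s"): (531.6, 719.0),
    ("scale 3", "requests/s"): (0.574, 0.780),
    ("scale 3", "total tok/s"): (3390.0, 4608.1),
    ("scale 3", "decode tok/s"): (513.9, 698.6),
}

# Frontier / vLLM ratio per metric, rounded to two decimals.
RATIOS = {
    key: round(frontier / vllm, 2) for key, (frontier, vllm) in THROUGHPUT.items()
}

for (scale, metric), ratio in RATIOS.items():
    print(f"{scale:8s} {metric:13s} Frontier/vLLM = {ratio:.2f}")
```

All six ratios round to 0.74, so the throughput shortfall is uniform across request rate, total tokens, and decode tokens at both scales.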
## 2026-06-25 H20 TP2/TP4 Comparison

The TP2/TP4 comparison uses the same first-200 `coder_200_ts2` and
`coder_200_ts3` fixtures. The vLLM runs are on dash1 with
`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`, vLLM 0.11.1,
`max_model_len=32768`, `max_num_seqs=64`,
`max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
prefix caching on, and chunked prefill on.

vLLM measured KV capacity:

| TP | KV tokens | KV blocks |
|---:|---:|---:|
| 2 | 1,104,880 | 69,055 |
| 4 | 2,833,232 | 177,077 |

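As a consistency check, the token counts above equal the block counts multiplied by the block size. The sketch below assumes the TP2/TP4 runs use the same block size of 16 as the TP1 setup; the constant names are illustrative.

```python
BLOCK_SIZE = 16  # tokens per KV block; assumed equal to the TP1 setup

# KV block counts measured by vLLM, keyed by tensor-parallel degree.
KV_BLOCKS = {2: 69_055, 4: 177_077}

# Token capacity implied by the block counts.
KV_TOKENS = {tp: blocks * BLOCK_SIZE for tp, blocks in KV_BLOCKS.items()}

for tp, tokens in KV_TOKENS.items():
    print(f"TP{tp}: {KV_BLOCKS[tp]:,} blocks -> {tokens:,} tokens")
```

Both products reproduce the KV token column, which supports the block-size assumption.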
Frontier RS12 uses explicit matching KV blocks and fresh H20 TP2/TP4 profiles:

- Config:
  `configs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3.json`
- Run:
  `runs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3`
- Profile source:
  `dash1:/home/admin/cpfs/wjh/replayserve_frontier_profiles/h20_tp2_tp4_qwen3_30ba3b_full32k_20260625_true_mixed`
- Linear/MoE profiles cover TP2/TP4 up to 32768 tokens.
- Attention profile covers TP2/TP4 standard attention plus 1260 true-mixed
  prefill+decode rows. The true-mixed rows are required; standard attention
  alone fails with missing `attn_decode_in_mixed` predictions.

All four Frontier runs completed 200/200 request rows. Neither Frontier nor the
vLLM TP2/TP4 logs reported preemption events. Prefix token hit ratio is exactly
the same in Frontier postprocess and vLLM's trace-side synthetic estimate:
0.2697549478.

Run-level comparison:

| TP | fixture | metric | Frontier | vLLM | Frontier / vLLM |
|---:|---|---|---:|---:|---:|
| 2 | `coder_200_ts2` | requests/s | 0.776 | 1.278 | 0.61 |
| 2 | `coder_200_ts2` | total tok/s | 4581 | 7547 | 0.61 |
| 2 | `coder_200_ts2` | decode tok/s | 695 | 1144 | 0.61 |
| 2 | `coder_200_ts2` | TTFT p50/p95 | 0.269/6.745s | 0.225/0.715s | 1.20/9.43 |
| 2 | `coder_200_ts2` | TPOT p50/p95 | 0.0430/0.0529s | 0.0300/0.0434s | 1.43/1.22 |
| 2 | `coder_200_ts2` | E2E p50/p95 | 26.05/106.76s | 16.45/72.53s | 1.58/1.47 |
| 4 | `coder_200_ts2` | requests/s | 0.853 | 1.536 | 0.55 |
| 4 | `coder_200_ts2` | total tok/s | 5035 | 9073 | 0.55 |
| 4 | `coder_200_ts2` | decode tok/s | 763 | 1376 | 0.55 |
| 4 | `coder_200_ts2` | TTFT p50/p95 | 0.098/0.386s | 0.170/1.420s | 0.57/0.27 |
| 4 | `coder_200_ts2` | TPOT p50/p95 | 0.0337/0.0384s | 0.0163/0.0283s | 2.06/1.36 |
| 4 | `coder_200_ts2` | E2E p50/p95 | 18.65/84.94s | 9.26/43.62s | 2.01/1.95 |
| 2 | `coder_200_ts3` | requests/s | 0.688 | 1.088 | 0.63 |
| 2 | `coder_200_ts3` | total tok/s | 4062 | 6426 | 0.63 |
| 2 | `coder_200_ts3` | decode tok/s | 616 | 974 | 0.63 |
| 2 | `coder_200_ts3` | TTFT p50/p95 | 0.134/0.574s | 0.154/0.627s | 0.87/0.92 |
| 2 | `coder_200_ts3` | TPOT p50/p95 | 0.0394/0.0467s | 0.0191/0.0280s | 2.07/1.67 |
| 2 | `coder_200_ts3` | E2E p50/p95 | 21.79/101.59s | 9.96/53.98s | 2.19/1.88 |
| 4 | `coder_200_ts3` | requests/s | 0.737 | 1.254 | 0.59 |
| 4 | `coder_200_ts3` | total tok/s | 4355 | 7403 | 0.59 |
| 4 | `coder_200_ts3` | decode tok/s | 660 | 1122 | 0.59 |
| 4 | `coder_200_ts3` | TTFT p50/p95 | 0.089/0.346s | 0.100/0.318s | 0.89/1.09 |
| 4 | `coder_200_ts3` | TPOT p50/p95 | 0.0311/0.0358s | 0.0094/0.0128s | 3.30/2.80 |
| 4 | `coder_200_ts3` | E2E p50/p95 | 16.90/83.01s | 5.55/27.87s | 3.05/2.98 |

TP scaling comparison:

| fixture | metric | Frontier TP4 / TP2 | vLLM TP4 / TP2 |
|---|---|---:|---:|
| `coder_200_ts2` | total tok/s speedup | 1.10 | 1.20 |
| `coder_200_ts2` | decode tok/s speedup | 1.10 | 1.20 |
| `coder_200_ts2` | TPOT p50 ratio (lower is better) | 0.78 | 0.54 |
| `coder_200_ts3` | total tok/s speedup | 1.07 | 1.15 |
| `coder_200_ts3` | decode tok/s speedup | 1.07 | 1.15 |
| `coder_200_ts3` | TPOT p50 ratio (lower is better) | 0.79 | 0.49 |

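The TP4 / TP2 columns can be reproduced from the rounded values in the run-level comparison. A small sketch, with `RUNS` and `tp_scaling` as illustrative names and the inputs copied from that table:

```python
# (total tok/s, TPOT p50 in seconds) per TP degree, from the run-level table.
RUNS = {
    ("coder_200_ts2", "Frontier"): {2: (4581, 0.0430), 4: (5035, 0.0337)},
    ("coder_200_ts2", "vLLM"): {2: (7547, 0.0300), 4: (9073, 0.0163)},
    ("coder_200_ts3", "Frontier"): {2: (4062, 0.0394), 4: (4355, 0.0311)},
    ("coder_200_ts3", "vLLM"): {2: (6426, 0.0191), 4: (7403, 0.0094)},
}


def tp_scaling(by_tp):
    """Return (throughput speedup, TPOT p50 ratio) for TP4 relative to TP2."""
    tok2, tpot2 = by_tp[2]
    tok4, tpot4 = by_tp[4]
    return round(tok4 / tok2, 2), round(tpot4 / tpot2, 2)


SCALING = {key: tp_scaling(by_tp) for key, by_tp in RUNS.items()}

for (fixture, system), (speedup, tpot_ratio) in SCALING.items():
    print(f"{fixture} {system:8s} tok/s x{speedup:.2f}, TPOT p50 x{tpot_ratio:.2f}")
```

A TPOT p50 value of 0.54 means TP4 takes 54% of the TP2 time per output token, that is, a 46% improvement; a value of 0.78 corresponds to a 22% improvement.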
Current TP2/TP4 judgment:

- Functional replay is aligned for this setting: same request rows, same
  trace-side prefix reuse ratio, matched vLLM KV block counts, and no
  preemption on either side.
- Absolute performance is not aligned. Frontier reports only 55-63% of vLLM
  total/decode throughput across TP2/TP4, and TPOT is especially pessimistic at
  TP4.
- Relative TP scaling is also under-estimated. vLLM's TP4 improves TPOT p50 by
  about 46-51% over TP2, while Frontier improves by only about 21-22%.
- The remaining gap is therefore not caused by missing rows, prefix-cache
  mismatch, or KV capacity mismatch in these runs. It points to timing model
  limitations: missing CPU/scheduler/CUDA-graph modeling, random-forest profile
  interpolation error, and imperfect modeling of vLLM's TP-dependent decode
  execution path.
- These RS12 results are acceptable for continuing ReplayServe integration and
  rough qualitative trends. They are not yet acceptable as calibrated absolute
  performance predictions.