# RS4 Frontier H20 TP1 Alignment

This note compares Frontier H20 TP1 against the real vLLM TP1 run on dash2 for
`coder_100`.

## Setup

Real vLLM:

- Runtime: vLLM 0.11.1
- Host/GPU: dash2, NVIDIA H20
- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
- TP: 1
- KV capacity: 244,496 tokens = 15,281 blocks at block size 16
- Run: `runs/vllm_gpu_smoke_20260624/tp1_coder100_uncapped`

Frontier:

- Frontier root: `/tmp/replayserve-frontier-rs1b`
- Frontier commit: `d9cfeb6d8791fbf2f295dd9744c56a666171776e`
- Model config name: `qwen3-a3b-30b-moe`
- Device: `h20`
- Network node SKU: `h20_dgx`
- TP: `attn_tensor_parallel_size=1`, `moe_tensor_parallel_size=1`,
  `moe_expert_parallel_size=1`
- `max_tokens_in_batch=32768`, `batch_size_cap=64`, block size 16
- Prefix cache on, chunked prefill on
- `long_prefill_token_threshold=32768`
- Config: `configs/rs4_frontier_h20_tp1.json`
- Run: `runs/rs4_frontier_h20_tp1_20260624`

The high long-prefill threshold is deliberate. An earlier Frontier run with
threshold 64 under-counted prefix hits because long prompts were admitted in
64-token chunks, unlike the current real vLLM run.

## KV Capacity

| run | KV blocks | KV tokens | note |
|---|---:|---:|---|
| Frontier `planner_kv` | 17,385 | 278,160 | Frontier H20 memory planner, no non-KV overhead |
| Frontier `vllm_kv_15281` | 15,281 | 244,496 | Explicitly matched to real vLLM TP1 |
| vLLM TP1 | 15,281 | 244,496 | From vLLM memory profiling |

Only `vllm_kv_15281` has the same KV block count as real vLLM TP1.

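The block and token columns above are related by the block size of 16. A minimal sketch of that arithmetic follows; the helper names `kv_tokens` and `kv_blocks` are illustrative and are not part of Frontier or vLLM.

```python
BLOCK_SIZE = 16  # tokens per KV block, as stated in the setup


def kv_tokens(num_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Token capacity of a KV cache with the given number of blocks."""
    return num_blocks * block_size


def kv_blocks(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Whole blocks that fit in a token budget (floor division)."""
    return num_tokens // block_size


# Block counts copied from the KV capacity table.
runs = {"planner_kv": 17_385, "vllm_kv_15281": 15_281, "vllm_tp1": 15_281}
for name, blocks in runs.items():
    print(f"{name}: {blocks} blocks = {kv_tokens(blocks)} tokens")

# How much larger the planner estimate is than the measured vLLM capacity.
extra_blocks = runs["planner_kv"] - runs["vllm_tp1"]
print(f"planner_kv exceeds vLLM by {extra_blocks} blocks "
      f"({extra_blocks / runs['vllm_tp1']:.1%})")
```

The planner estimate is 2,104 blocks above the measured vLLM capacity, which is why the explicit `vllm_kv_15281` run is the one used for alignment.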
## Results

| run | completed | prefix hit tokens / ratio | preemptions | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 | decode tok/s |
|---|---:|---:|---:|---:|---:|---:|---:|
| Frontier `planner_kv` | 96/100 | 110,608 / 0.240691 | 0 | 0.986/128.991s | 0.582/0.582s | 279.092/1706.675s | 19.4 |
| Frontier `vllm_kv_15281` | 92/100 | 103,168 / 0.242542 | 0 | 0.964/182.639s | 0.582/0.582s | 305.290/1765.347s | 19.4 |
| vLLM TP1 real | 100/100 | 119,152 / 0.251082 sidecar estimate | 8 | 4.503/29.060s | 0.066/0.621s | 41.841/97.366s | 567.4 |

The latency and throughput columns are not calibrated. Frontier still uses
dummy execution timing, so its constant TPOT is a simulator artifact.

## Prefix Admission Check

Real vLLM TP1 preempts requests, so the sidecar theoretical prefix-hit estimate
is not the right observed comparator for every request. The observed vLLM
scheduler signal is the `computed:` value that `stdout.log` records the first
time each request starts.

Using first-start `computed:` tokens:

| Frontier run | compared rows | Frontier computed sum | vLLM first-start computed sum | mismatch |
|---|---:|---:|---:|---:|
| `planner_kv` | 96 | 110,608 | 108,208 | one request differs |
| `vllm_kv_15281` | 92 | 103,168 | 103,168 | exact match |

So with the KV block count explicitly matched, Frontier's prefix-cache admission
matches real vLLM TP1 for every row where Frontier emits complete cache metrics.

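The comparison behind this table can be sketched as below. This is a minimal sketch, assuming the first-start `computed:` values have already been extracted into per-request dictionaries; the log parsing itself is not shown because the note does not give the log line format. The function name and the sample data are made up for illustration.

```python
def compare_first_start(frontier: dict[int, int], vllm: dict[int, int]) -> dict:
    """Compare computed-token counts on request ids present in both runs."""
    shared = sorted(frontier.keys() & vllm.keys())
    mismatches = [
        (rid, frontier[rid], vllm[rid])
        for rid in shared
        if frontier[rid] != vllm[rid]
    ]
    return {
        "compared_rows": len(shared),
        "frontier_sum": sum(frontier[rid] for rid in shared),
        "vllm_sum": sum(vllm[rid] for rid in shared),
        "mismatches": mismatches,
    }


# Toy data, not taken from the real runs: request 3 has no Frontier row and
# request 2 differs between the two sides.
frontier_computed = {0: 0, 1: 1024, 2: 4096}
vllm_computed = {0: 0, 1: 1024, 2: 1696, 3: 512}

report = compare_first_start(frontier_computed, vllm_computed)
print(report)
```

Restricting to shared ids matters here because Frontier emits complete cache metrics for only a subset of requests.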
## Current Alignment Judgment

Aligned:

- H20 device and Qwen3-30B-A3B structural model config can run in Frontier.
- TP1 scheduler knobs can be matched.
- KV block count can be matched explicitly at 15,281 blocks.
- First-admission prefix-cache hit tokens match real vLLM TP1 on completed rows
  when KV blocks are explicit.

Not aligned:

- Frontier emits complete request/cache metrics for only 92/100 requests in the
  explicit-KV run, while vLLM completes 100/100.
- Frontier reports 0 preemptions; real vLLM TP1 reports 8 preemptions across 5
  repeated-start requests.
- Frontier timing is not comparable because it still uses dummy execution
  prediction. The current latency/throughput gap is expected and not a
  calibrated simulator error.

Next work:

- Treat RS6 as the current profiled baseline and investigate why it omits
  complete latency/cache metrics for requests `70`, `77`, `88`, and `90`.
- Instrument Frontier's vLLM V1 scheduler around KV block allocation, free-block
  count, and preemption victim selection. Real vLLM TP1 has 8 preemptions, while
  Frontier still reports 0 with the same explicit 15,281-block capacity.
- Add a per-request Frontier/vLLM comparator that reports TTFT/TPOT/E2E ratios,
  prefix hits, and completion/preemption status on the same request ids.
- Calibrate CPU/scheduler/CUDA-graph effects separately from op profile timing;
  RS6 removed the 4096-token linear/MoE extrapolation as the primary explanation
  for the remaining gap.

## Performance Gap

Use Frontier `vllm_kv_15281` as the current aligned-KV simulator point. This
matches the real vLLM TP1 KV block count, but it still uses Frontier dummy
execution timing.

| metric | Frontier H20 TP1 explicit KV | real vLLM H20 TP1 | gap |
|---|---:|---:|---:|
| completed requests | 92/100 | 100/100 | not aligned |
| TTFT p50 | 0.964s | 4.503s | Frontier 0.21x real |
| TTFT p95 | 182.639s | 29.060s | Frontier 6.28x real |
| TPOT p50 | 0.582s | 0.066s | Frontier 8.81x real |
| TPOT p95 | 0.582s | 0.621s | Frontier 0.94x real |
| E2E p50 | 305.290s | 41.841s | Frontier 7.30x real |
| E2E p95 | 1765.347s | 97.366s | Frontier 18.13x real |
| RPS | 0.0217 | 0.6880 | vLLM 31.74x Frontier |
| decode tok/s | 19.4 | 567.4 | vLLM 29.20x Frontier |

Interpretation:

- The prefix admission path is close after explicit KV matching, but performance
  is not calibrated.
- Frontier uses dummy execution timing; its TPOT is nearly constant at 582 ms,
  while real vLLM TP1 has p50 TPOT 66 ms and p95 TPOT 621 ms.
- Frontier does not reproduce real vLLM's TP1 preemption behavior: real vLLM had
  8 preemptions, while Frontier reported 0.
- Frontier emits complete request/cache metrics for only 92 rows in this run,
  so p95 and throughput are not yet on the same request set.
- The sign of the TTFT error is mixed: Frontier p50 TTFT is too optimistic, but
  p95 TTFT is far too pessimistic. This is consistent with uncalibrated
  execution timing plus different queue/preemption dynamics.

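The gap column for the latency rows is a plain ratio of simulated to measured latency. A small sketch of that calculation, with values copied from the table above, is shown here; the `latency_gap` helper and its labels are illustrative, not part of the ReplayServe tooling.

```python
# Latencies in seconds, copied from the performance gap table.
frontier = {"TTFT p50": 0.964, "TTFT p95": 182.639, "TPOT p95": 0.582,
            "E2E p50": 305.290, "E2E p95": 1765.347}
real_vllm = {"TTFT p50": 4.503, "TTFT p95": 29.060, "TPOT p95": 0.621,
             "E2E p50": 41.841, "E2E p95": 97.366}


def latency_gap(sim: float, real: float) -> tuple[float, str]:
    """Return sim/real and whether the simulator is optimistic or pessimistic."""
    ratio = sim / real
    if ratio < 1.0:
        label = "optimistic"
    elif ratio > 1.0:
        label = "pessimistic"
    else:
        label = "matched"
    return ratio, label


gaps = {name: latency_gap(frontier[name], real_vllm[name]) for name in frontier}
for name, (ratio, label) in gaps.items():
    print(f"{name}: Frontier {ratio:.2f}x real ({label})")
```

For latency metrics a ratio below 1 means the simulator predicts a faster system than the one measured, which is the mixed-sign TTFT pattern described in the interpretation.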
## RS5 Profiled Frontier Timing

Frontier supports replacing dummy timing with real CSV profiles through the
random-forest execution-time predictor. The required non-dummy flags are wired
in `tools/run_frontier_sweep.py`, and the active profiled config is
`configs/rs5_frontier_h20_tp1_profile.json`.

Profile data collected on dash2 H20 TP1:

- Linear ops: `linear_op.csv`, CUDA event, max tokens 4096.
- Attention: `attention_combined.csv`, CUDA event, max sequence/chunk 18000,
  with 15417 standard rows plus 612 true-mixed rows. Online replay needs the
  true-mixed rows to train `attn_prefill_mixed` and `attn_decode_in_mixed`.
- MoE: `moe_vllm_fused.csv`, CUDA event, max tokens 4096, vLLM fused MoE
  backend.

Frontier vLLM 0.11.1 profiling needed local compatibility patches in
`patches/frontier-vllm-0.11.1-profiling-compat.patch`:

- RoPE helper fallback when vLLM 0.11.1 `get_rope()` no longer accepts the
  legacy `rotary_dim` keyword.
- `_get_config_dtype_str` fallback for vLLM fused MoE config dtype.
- `ReplicatedLinear(disable_tp=True)` fallback to torch `Linear` when vLLM TP
  group is not initialized in standalone profiling.
- `fused_topk()` variable-return handling.
- `invoke_fused_moe_kernel()` 0.11.1 signature compatibility.

The first profiled MoE attempt used Frontier's `frontier_loop` backend and was
not faithful to vLLM serving. It predicted `moe_grouped_gemm` at about 16 ms for
24 tokens and 19 ms for 1024 tokens, causing TPOT around 0.93 s. The vLLM fused
MoE profile predicts about 0.32 ms for 24 tokens and 0.87 ms for 1024 tokens.

| run | completed | prefix hit ratio | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 | total tok/s | decode tok/s |
|---|---:|---:|---:|---:|---:|---:|---:|
| Frontier dummy `vllm_kv_15281` | 92/100 | 0.2422 | 0.964/182.639s | 0.582/0.582s | 305.290/1765.347s | 131.3 | 19.4 |
| Frontier profiled `frontier_loop` MoE | 93/100 | 0.2492 | 3.320/310.235s | 0.930/1.767s | 492.097/2038.538s | 165.9 | 24.6 |
| Frontier profiled vLLM fused MoE | 97/100 | 0.2376 | 0.355/13.695s | 0.056/0.098s | 27.032/119.019s | 2056.7 | 304.5 |
| Frontier profiled vLLM fused MoE, linear/MoE 32K | 96/100 | 0.2484 | 0.909/12.763s | 0.057/0.146s | 30.939/119.636s | 2348.9 | 347.8 |
| vLLM TP1 real | 100/100 | 0.2511 | 4.503/29.060s | 0.066/0.621s | 41.841/97.366s | 3832.3 | 567.4 |

Current judgment:

- The profiled vLLM fused MoE run is the first useful timing baseline. TPOT p50
  is close to real vLLM, but throughput is still about 54% of real vLLM and
  TTFT/E2E tails do not align.
- After extending linear and MoE profiles to 32768 tokens and adding
  `prefill_hot` MoE rows, the cache hit ratio is nearly aligned
  (0.2484 vs vLLM 0.2511), throughput improves to about 61% of real vLLM, and
  TTFT p50 moves from 0.08x to 0.20x of real vLLM. This confirms that the 4096
  profile ceiling was a real source of error.
- Prefix/cache accounting remains close but not exact: the profiled run emits
  complete cache metrics for 96/100 requests in the 32K run, with token hit
  ratio 0.2488 vs vLLM's sidecar estimate 0.2511.
- Frontier still reports zero preemptions, while real vLLM TP1 had 8 preemption
  events. This affects the completion set, the TTFT tail, and the E2E tail.
- The remaining gaps are no longer explained by the linear/MoE 4096-token
  extrapolation alone. The 32K run still has TTFT p50 at 0.20x, TTFT p95 at
  0.44x, TPOT p95 at 0.23x, and throughput at 0.61x of real vLLM. This points
  to missing CPU/scheduler/CUDA-graph modeling plus Frontier's scheduler and
  completion/preemption fidelity.
- The 32K run still completes only 96/100 requests in latency/cache metrics
  (`70`, `77`, `88`, `90` missing), while real vLLM completes 100/100. This is
  a Frontier lifecycle/metrics or scheduler-fidelity issue to debug separately.

## 2026-06-24 Follow-Up

Handled in the ReplayServe harness:

- `tools/run_frontier_sweep.py` now passes an absolute metrics output path into
  Frontier. Frontier runs with `cwd=/tmp/replayserve-frontier-rs1b`; relative
  metrics paths can otherwise be written under the Frontier scratch instead of
  ReplayServe's run directory.
- `tools/postprocess_frontier_smoke.py` now emits a `completion` block with
  `completed_requests`, `total_requests`, and `missing_latency_request_ids`.
- `tools/aggregate_runs.py` now marks a run as `incomplete` when postprocess
  reports missing latency rows. The latest RS6 summary is therefore incomplete,
  not a clean pass.

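The three harness changes above can be illustrated with a short sketch. It is not the code in `tools/`; only the `completion` field names and the `incomplete` status come from this note. The function names, the `request_id` row key, the `complete` status string, and the `metrics` directory name are assumptions.

```python
from pathlib import Path


def completion_block(expected_ids, latency_rows) -> dict:
    """Summarize which expected requests have a latency row."""
    expected = set(expected_ids)
    seen = {row["request_id"] for row in latency_rows}  # assumed row key
    missing = sorted(expected - seen)
    return {
        "completed_requests": len(expected) - len(missing),
        "total_requests": len(expected),
        "missing_latency_request_ids": missing,
    }


def run_status(completion: dict) -> str:
    """Mark the run incomplete when any latency row is missing."""
    return "incomplete" if completion["missing_latency_request_ids"] else "complete"


def absolute_metrics_path(run_dir: str, name: str = "metrics") -> Path:
    """Resolve the output path before the simulator changes its working directory."""
    return Path(run_dir).resolve() / name


# Reproduce the RS6 situation: 100 requests, four without latency rows.
rows = [{"request_id": rid} for rid in range(100) if rid not in {70, 77, 88, 90}]
completion = completion_block(range(100), rows)
print(completion, run_status(completion))
print(absolute_metrics_path("runs/example").is_absolute())
```

Resolving the path up front is what keeps the metrics out of the simulator's scratch directory, since a relative path is interpreted against whatever `cwd` the child process runs with.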
Latest RS6 vs real vLLM TP1 after the 32K profile and harness fixes:

| metric | Frontier RS6 32K profile | real vLLM TP1 | Frontier / vLLM |
|---|---:|---:|---:|
| completed requests | 96/100 | 100/100 | 0.96 |
| prefix token hit ratio | 0.2488 | 0.2511 | 0.99 |
| preemption events | 0 | 8 | 0.00 |
| TTFT p50 | 0.909s | 4.503s | 0.20 |
| TTFT p95 | 12.763s | 29.060s | 0.44 |
| TPOT p50 | 0.0569s | 0.0661s | 0.86 |
| TPOT p95 | 0.146s | 0.621s | 0.23 |
| E2E p50 | 30.939s | 41.841s | 0.74 |
| E2E p95 | 119.636s | 97.366s | 1.23 |
| total tok/s | 2348.9 | 3832.3 | 0.61 |
| decode tok/s | 347.8 | 567.4 | 0.61 |

Preemption experiment:

- A local trial enabled waiting-admission preemption in Frontier Phase 2. It did
  produce preemption events, but it was not a valid alignment improvement:
  Frontier completed only 79/100 requests and amplified the early-decode
  disappearance pattern. That config was removed from `configs/`.
- This means the remaining preemption gap is not just "turn on preemption in
  Phase 2". Frontier's batch/runtime-epoch lifecycle needs a deeper fix before
  its preemption behavior can be considered faithful to vLLM TP1.

Current interpretation:

- Prefix/cache replay is close: token-weighted prefix hit ratio is within about
  1% relative of the vLLM synthetic replay estimate.
- Completion/preemption is not aligned. Requests `70`, `77`, `88`, and `90`
  begin decode in RS6 but never reach completion metrics; vLLM completes all
  100 requests and logs 8 preemption events.
- Timing is partially useful but not fully calibrated. Linear and MoE profiles
  now cover the trace's long-prefill range up to 32768 tokens, so the old 4096
  extrapolation is no longer the main explanation. The remaining TTFT/TPOT/E2E
  gap likely comes from missing CPU/scheduler overhead, decode CUDA graph
  modeling, and Frontier scheduler lifecycle differences.

## 2026-06-25 500-Request Stress

Generated `traces/fixtures/coder_500` from the first 500 rows of
`qwen_coder_blksz_16.jsonl`:

- `row_count=500`
- `max_total_tokens=21318`
- `overflow_count=0`
- `partial_final_block_rows=466`

Frontier RS8 used the same H20 TP1 Qwen3-30B-A3B full32K profile and explicit
KV block count as RS6:

- Config:
  `configs/rs8_frontier_h20_tp1_profile_full32k_coder500.json`
- Run:
  `runs/rs8_frontier_h20_tp1_profile_full32k_coder500_20260625`
- Runtime: 492 seconds
- Status: incomplete

| metric | Frontier RS6 100 reqs | Frontier RS8 500 reqs |
|---|---:|---:|
| completed requests | 96/100 | 439/500 |
| missing latency/cache rows | 4 | 61 |
| prefix token hit ratio | 0.2488 | 0.1192 |
| preemption events | 0 | 0 |
| TTFT p50/p95 | 0.909/12.763s | 136.776/340.237s |
| TPOT p50/p95 | 0.0569/0.146s | 0.0564/0.0894s |
| E2E p50/p95 | 30.939/119.636s | 177.800/397.291s |
| total tok/s | 2348.9 | 4733.7 |
| decode tok/s | 347.8 | 656.2 |

Missing request ids in RS8:

```text
70,77,88,90,103,106,134,135,142,143,153,154,176,178,183,184,186,188,210,211,216,222,245,246,263,272,274,278,291,298,299,300,320,325,334,335,347,348,363,367,373,374,393,399,403,409,412,413,414,433,434,437,439,450,453,460,469,475,476,479,497
```

The incomplete-row issue grows with load: 4/100 missing rows (4%) in RS6
becomes 61/500 (12.2%) in RS8. This makes RS8 invalid for final performance
claims, but useful as a stress signal for Frontier lifecycle/metrics fidelity.

The lower prefix hit ratio is not by itself proof of adapter failure. The
unbounded trace-side trie estimate for `coder_500` is 0.3868 token hit ratio,
but the H20 TP1 configuration has finite KV capacity (`num_blocks=15281`, about
244K tokens). The 500-request window has 2.7M prompt tokens, so KV eviction can
substantially reduce real prefix hits. The dash1 vLLM run below is the current
finite-cache comparator for whether Frontier's behavior is faithful.

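To make the term "unbounded trace-side trie estimate" concrete, here is a minimal sketch of one way to compute such an estimate. It assumes each prompt is available as a list of token ids and that reuse is counted only in whole leading blocks; it is not the ReplayServe sidecar code, and the function name is hypothetical. Because nothing is ever evicted, the result is an upper bound on what a finite cache can achieve.

```python
def unbounded_prefix_hits(prompts, block_size=16):
    """Return (hit_tokens, total_tokens) for a never-evicting block trie."""
    root = {}  # nested dicts keyed by a tuple of token ids, one level per block
    hit_tokens = 0
    total_tokens = 0
    for tokens in prompts:
        total_tokens += len(tokens)
        node = root
        # Walk full blocks only; a partial final block can never be a hit.
        for start in range(0, len(tokens) - block_size + 1, block_size):
            block = tuple(tokens[start:start + block_size])
            child = node.get(block)
            if child is None:
                # First time this prefix is seen: record it. Every later
                # block of this prompt misses too, because the new node is empty.
                child = node[block] = {}
            else:
                hit_tokens += block_size
            node = child
    return hit_tokens, total_tokens


# Toy example with block_size=2 so the blocks are easy to read.
toy_prompts = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 9, 9], [1, 2, 7, 7]]
hits, total = unbounded_prefix_hits(toy_prompts, block_size=2)
print(hits, total, hits / total)
```

A finite-cache replay would additionally evict blocks under memory pressure, which is the effect the comparison against the vLLM scheduler log is meant to capture.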
A real vLLM TP1 500-request run was first attempted on dash2 with the same
settings as `tp1_coder100_uncapped` (`max_num_seqs=64`,
`max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
`CUDA_VISIBLE_DEVICES=0`), but it did not start because dash2 was already
occupied by eight existing `agentic-kvc` vLLM serve processes on ports
8000-8007. Each H20 had about 89GB allocated, and vLLM failed because free
memory was below the required 0.85 utilization target. Those processes were not
killed; the temporary ReplayServe GPU lock was released.

A replacement vLLM TP1 500 run completed on dash1:

- Run:
  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder500_uncapped`
- Runtime: vLLM 0.11.1
- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
- Command knobs: `TP=1`, `max_model_len=32768`, `max_num_seqs=64`,
  `max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
  prefix caching on, chunked prefill on
- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
- Replay wall time after engine startup: 595.116 seconds
- Process elapsed including model load/startup: 2026-06-25T03:08:18Z to
  2026-06-25T03:19:41Z

| metric | Frontier RS8 500 reqs | vLLM TP1 500 reqs | vLLM / Frontier |
|---|---:|---:|---:|
| completed requests | 439/500 | 500/500 | not aligned |
| preemption events | 0 | 63 | not aligned |
| repeated/preempted request ids | 0 | 57 | not aligned |
| TTFT p50 | 136.776s | 185.658s | 1.36 |
| TTFT p95 | 340.237s | 375.895s | 1.10 |
| TPOT p50 | 0.0564s | 0.0498s | 0.88 |
| TPOT p95 | 0.0894s | 0.0919s | 1.03 |
| E2E p50 | 177.800s | 224.270s | 1.26 |
| E2E p95 | 397.291s | 417.356s | 1.05 |
| requests/s | 0.661 | 0.840 | 1.27 |
| total tok/s | 4733.7 | 5282.9 | 1.12 |
| decode tok/s | 656.2 | 732.3 | 1.12 |

Because Frontier emits latency/cache rows for only 439 requests, the latency
comparison above mixes Frontier's completed subset with vLLM's complete 500-row
run. Restricting vLLM to the same 439 request ids gives:

| metric | Frontier RS8 439 rows | vLLM same 439 ids | vLLM / Frontier |
|---|---:|---:|---:|
| TTFT p50 | 136.776s | 169.968s | 1.24 |
| TTFT p95 | 340.237s | 375.760s | 1.10 |
| TPOT p50 | 0.0564s | 0.0498s | 0.88 |
| TPOT p95 | 0.0894s | 0.1071s | 1.20 |
| E2E p50 | 177.800s | 218.606s | 1.23 |
| E2E p95 | 397.291s | 416.110s | 1.05 |

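The same-id restriction can be sketched as follows. This is an illustrative sketch, not the postprocess code: the note does not say which percentile definition the tools use, so linear interpolation between order statistics is assumed here, and the helper names and toy latencies are made up.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of empty data")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)  # pos is non-negative, so int() is a floor
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def restrict(metric_by_id, keep_ids):
    """Keep only the values whose request id also appears in keep_ids."""
    return [value for rid, value in metric_by_id.items() if rid in keep_ids]


# Toy TTFT values in seconds; request 3 has no row on the simulator side.
vllm_ttft = {0: 1.0, 1: 2.0, 2: 3.0, 3: 10.0}
frontier_ids = {0, 1, 2}

p50_all = percentile(vllm_ttft.values(), 50)
p50_same_ids = percentile(restrict(vllm_ttft, frontier_ids), 50)
print(p50_all, p50_same_ids)
```

Dropping the slow request moves the toy median from 2.5 to 2.0, the same kind of shift seen when vLLM is restricted to the ids where Frontier emits complete metrics.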
Prefix/cache comparison needs careful metric naming:

- The unbounded ReplayServe trie estimate for all 500 rows is 1,047,632 hit
  tokens / 2,708,110 prompt tokens = 0.3868 token hit ratio.
- vLLM's finite-cache scheduler log is much lower under this pressure:
  first-start `computed:` ratio is 0.0979, last-start ratio is 0.1643, and
  max-per-request ratio is 0.1655.
- On the same 439 request ids where Frontier emits complete metrics, vLLM's
  first-start `computed:` ratio is 0.1050, last-start ratio is 0.1665, and
  max-per-request ratio is 0.1679.
- Frontier RS8 reports `replayserve_token_hit_ratio=0.1192` and
  `frontier_block_hit_ratio=0.1191`, which is of the same order as vLLM's
  finite-cache scheduler signal but far below the unbounded trace-side estimate.

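The three vLLM ratios named above differ only in which start event is used for a request that starts more than once after preemption. The sketch below shows one plausible definition, assuming the ratios are token-weighted (sum of `computed:` tokens over sum of prompt tokens) and that start events have already been parsed into `(request_id, computed_tokens)` pairs in log order. Both assumptions go beyond what this note states, and the function name is hypothetical.

```python
def start_ratios(start_events, prompt_tokens):
    """Token-weighted first-start, last-start, and max-per-request ratios."""
    first, last, peak = {}, {}, {}
    for rid, computed in start_events:
        first.setdefault(rid, computed)   # keep the earliest start only
        last[rid] = computed              # overwritten by each later start
        peak[rid] = max(peak.get(rid, 0), computed)
    total = sum(prompt_tokens[rid] for rid in first)
    return {
        "first_start": sum(first.values()) / total,
        "last_start": sum(last.values()) / total,
        "max_per_request": sum(peak.values()) / total,
    }


# Toy log: request 1 starts three times (two preemptions), request 2 once.
events = [(1, 16), (2, 0), (1, 48), (1, 32)]
prompts = {1: 64, 2: 64}
ratios = start_ratios(events, prompts)
print(ratios)
```

Later starts usually see more cached blocks because the request's own earlier prefill populated the cache, which is why the last-start and max ratios sit above the first-start ratio.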
Current 500-request judgment:

- Frontier's timing profile is now in the right broad range for this stressed
  H20 TP1 run: TPOT p50/p95 and E2E p95 are close to vLLM, and aggregate token
  throughput is within about 12%.
- The run is still not a faithful simulator result because completion and
  preemption diverge: Frontier drops 61 latency/cache rows and reports zero
  preemptions, while real vLLM completes all 500 requests and logs 63
  preemption events across 57 request ids.
- The 500-request trace invalidates the earlier use of the unbounded sidecar
  prefix estimate as the primary comparator. Finite KV capacity, eviction, and
  preemption must be part of the prefix-cache replay metric.

ReplayServe TODO:

- Treat incomplete Frontier runs as invalid for final performance claims unless
  the comparison explicitly reports the missing request set.
- Keep the focused Frontier debug guard in the local patch: sequential mode now
  fails if `completed_requests < total_requests` at drain time and reports the
  missing request state.
- Add a comparator that reports both unbounded trace-side prefix reuse and
  finite-cache observed reuse from vLLM scheduler logs; do not compare
  Frontier's finite-cache hit ratio directly to the unbounded trie estimate.
- Profile or import vLLM CPU overhead records for H20 TP1 before enabling
  `skip_cpu_overhead_modeling=false`; without those records Frontier falls back
  to zero CPU overhead.
- Collect kernel-only/decode-CUDA-graph timing profiles before using
  `decode_cuda_graph_mode=full_decode_only`; the current RS6 profile is CUDA
  event/eager timing.

## 2026-06-25 200-Request Timestamp Scale 2/3

Generated `traces/fixtures/coder_200_ts0667` from the first 200 rows of
`qwen_coder_blksz_16.jsonl`, with each timestamp multiplied by `2/3` in the
fixture files:

- `row_count=200`
- `timestamp_scale=0.6666666666666666`
- `last_timestamp=30.711333333333332`
- `max_total_tokens=18985`
- `partial_final_block_rows=182`

Important: in the current replay semantics, a smaller timestamp scale makes
arrivals denser. Scale 2/3 shrinks the arrival window for the first 200 requests
from about 46.1s to 30.7s. This does not reduce queue pressure relative to the
same 200 requests at scale 1.0; it only reduces the request count relative to
the 500-request stress.

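The effect of the timestamp scale on arrival density can be checked with a few lines. This is a minimal sketch, assuming each fixture row carries a `timestamp` field in seconds; the field name and the helper names are assumptions, and the unscaled last timestamp of 46.067s is back-computed from the scaled value reported above.

```python
def scale_timestamps(rows, scale):
    """Return copies of the rows with every arrival timestamp multiplied by scale."""
    return [{**row, "timestamp": row["timestamp"] * scale} for row in rows]


def mean_arrival_rate(rows):
    """Requests per second over the arrival window (first row assumed at t=0)."""
    return len(rows) / max(row["timestamp"] for row in rows)


# Toy fixture: 200 evenly spaced arrivals ending at 46.067s.
base = [{"request_id": i, "timestamp": 46.067 * i / 199} for i in range(200)]

for scale in (2 / 3, 1.0, 2.0, 3.0):
    scaled = scale_timestamps(base, scale)
    print(f"scale={scale:.3f} window={scaled[-1]['timestamp']:.3f}s "
          f"rate={mean_arrival_rate(scaled):.2f} req/s")
```

A scale below 1 compresses the window and raises the offered request rate, while a scale above 1 stretches the window and lowers it.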
Frontier RS9:

- Config:
  `configs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667.json`
- Run:
  `runs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667`
- Runtime: 460 seconds
- Status: incomplete

vLLM dash1 TP1:

- Run:
  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder200_ts0667_uncapped`
- Runtime: vLLM 0.11.1
- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
- Replay wall time after engine startup: 242.813 seconds

| metric | Frontier RS9 200 ts=2/3 | vLLM TP1 200 ts=2/3 | vLLM / Frontier |
|---|---:|---:|---:|
| completed requests | 176/200 | 200/200 | not aligned |
| preemption events | 0 | 26 | not aligned |
| TTFT p50 | 20.580s | 34.563s | 1.68 |
| TTFT p95 | 96.718s | 120.804s | 1.25 |
| TPOT p50 | 0.0584s | 0.0515s | 0.88 |
| TPOT p95 | 0.2359s | 0.2535s | 1.07 |
| E2E p50 | 73.207s | 83.622s | 1.14 |
| E2E p95 | 189.240s | 183.727s | 0.97 |
| requests/s | 0.583 | 0.824 | 1.41 |
| total tok/s | 3913.4 | 4864.8 | 1.24 |
| decode tok/s | 593.3 | 737.5 | 1.24 |

Restricting vLLM to the same 176 request ids where Frontier emits complete
metrics gives:

| metric | Frontier RS9 176 rows | vLLM same 176 ids | vLLM / Frontier |
|---|---:|---:|---:|
| TTFT p50 | 20.580s | 27.896s | 1.36 |
| TTFT p95 | 96.718s | 120.804s | 1.25 |
| TPOT p50 | 0.0584s | 0.0520s | 0.89 |
| TPOT p95 | 0.2359s | 0.2539s | 1.08 |
| E2E p50 | 73.207s | 82.645s | 1.13 |
| E2E p95 | 189.240s | 183.727s | 0.97 |

Prefix/cache comparison:

- The unbounded ReplayServe trie estimate for all 200 rows is 270,336 hit
  tokens / 1,002,154 prompt tokens = 0.2698 token hit ratio.
- vLLM finite-cache scheduler signal for all 200 rows: first-start `computed:`
  ratio 0.1392, last-start ratio 0.2126, max-per-request ratio 0.2129.
- On the same 176 request ids where Frontier emits complete metrics, vLLM
  first-start ratio is 0.1487, last-start ratio is 0.1926, and max-per-request
  ratio is 0.1927.
- Frontier RS9 reports `replayserve_token_hit_ratio=0.1703` and
  `frontier_block_hit_ratio=0.1700`, again between vLLM first-start and
  last/max finite-cache scheduler signals.

Missing request ids in RS9:

```text
70,78,80,86,87,89,96,101,102,105,125,126,131,132,135,144,145,146,147,148,149,150,151,198
```

Current 200-request judgment:

- Reducing the request count from 500 to 200 substantially reduces TTFT and E2E
  tails, but `scale=2/3` is still a dense-arrival stress test. vLLM TTFT p95 is
  still 120.8s.
- Frontier timing is closer than the old 100-request dummy/profile baselines:
  TPOT p50/p95 and E2E p50/p95 are broadly aligned.
- Completion/preemption remains the blocking fidelity issue: Frontier drops 24
  rows and reports zero preemptions; vLLM completes all 200 and logs 26
  preemptions across 22 repeated-start request ids.
- To actually reduce queue pressure for the same first 200 requests, use a
  timestamp scale greater than 1. The follow-up scale 2 and 3 runs below do
  this.

## 2026-06-25 200-Request Timestamp Scale 2 and 3

Generated two more first-200 fixtures from `qwen_coder_blksz_16.jsonl`:

| fixture | timestamp scale | last timestamp | max total tokens |
|---|---:|---:|---:|
| `coder_200_ts2` | 2.0 | 92.134s | 18,985 |
| `coder_200_ts3` | 3.0 | 138.201s | 18,985 |

These are the intended lower-arrival-pressure runs. The request payloads are the
same first 200 rows as `coder_200_ts0667`; only timestamps differ.

Frontier RS10:

- Config:
  `configs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3.json`
- Run:
  `runs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3`
- Status: incomplete for both fixtures

vLLM dash1 TP1:

- Runs:
  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts2_uncapped`
  and `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts3_uncapped`
- Runtime: vLLM 0.11.1
- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16

Run-level comparison:

| metric | Frontier scale 2 | vLLM scale 2 | Frontier scale 3 | vLLM scale 3 |
|---|---:|---:|---:|---:|
| completed requests | 182/200 | 200/200 | 184/200 | 200/200 |
| preemption events | 0 | 43 | 0 | 16 |
| TTFT p50 | 8.118s | 9.217s | 0.779s | 1.166s |
| TTFT p95 | 67.850s | 69.211s | 35.918s | 32.258s |
| TPOT p50 | 0.0544s | 0.0497s | 0.0544s | 0.0462s |
| TPOT p95 | 0.0747s | 0.0686s | 0.0773s | 0.0714s |
| E2E p50 | 51.118s | 55.002s | 40.641s | 33.213s |
| E2E p95 | 162.607s | 142.338s | 158.434s | 122.789s |
| requests/s | 0.593 | 0.803 | 0.544 | 0.780 |
| total tok/s | 3846.1 | 4742.5 | 3490.6 | 4608.1 |
| decode tok/s | 583.1 | 719.0 | 529.2 | 698.6 |

Restricting vLLM to the same request ids where Frontier emits complete metrics:

| metric | Frontier scale 2 182 rows | vLLM same 182 ids | Frontier scale 3 184 rows | vLLM same 184 ids |
|---|---:|---:|---:|---:|
| TTFT p50 | 8.118s | 8.574s | 0.779s | 0.945s |
| TTFT p95 | 67.850s | 68.934s | 35.918s | 32.258s |
| TPOT p50 | 0.0544s | 0.0501s | 0.0544s | 0.0461s |
| TPOT p95 | 0.0747s | 0.0686s | 0.0773s | 0.0679s |
| E2E p50 | 51.118s | 53.263s | 40.641s | 33.213s |
| E2E p95 | 162.607s | 141.264s | 158.434s | 122.789s |

Prefix/cache comparison:

| metric | scale 2 | scale 3 |
|---|---:|---:|
| unbounded trace-side token hit ratio | 0.2698 | 0.2698 |
| vLLM first-start `computed:` ratio | 0.1433 | 0.1471 |
| vLLM last-start `computed:` ratio | 0.2382 | 0.1968 |
| vLLM max-per-request `computed:` ratio | 0.2383 | 0.1998 |
| Frontier `replayserve_token_hit_ratio` | 0.1448 | 0.1523 |
| Frontier `frontier_block_hit_ratio` | 0.1446 | 0.1521 |

Current scale 2 and 3 judgment:

- The user's intended `scale=2` and `scale=3` runs do reduce queueing. vLLM
  TTFT p95 drops from 120.8s at `scale=2/3` to 69.2s at `scale=2` and 32.3s at
  `scale=3`.
- `scale=3` is the first run where vLLM p50 TTFT is near 1s. The p95 is still
  high because long prompts and KV pressure remain, but the severe all-request
  queueing seen in the 500-request run is much reduced.
- Frontier timing is now close on TTFT and TPOT for the completed-row subset,
  especially at `scale=2`. However, Frontier still misses completion/cache rows
  and still reports zero preemptions.
- Completion/preemption is therefore still the main Frontier fidelity blocker:
  `scale=2` misses 18 rows and vLLM logs 43 preemptions; `scale=3` misses 16 rows
  and vLLM logs 16 preemptions.

## 2026-06-25 Frontier Lifecycle Fix For RS10

The root cause of the missing rows was Frontier's lifecycle handling after
decode-phase preemption. The missing requests were preempted after
prefill/decode had started, then left in this inconsistent state:

```text
preempted=True
is_prefill_complete=True
num_processed_tokens=0
scheduled=False
completed=False
```

The next waiting admission computed `num_new_tokens=0` and removed the request
from the queue, so sequential simulation drained with fewer completed requests
but no remaining scheduler work.

The updated ReplayServe Frontier patch now:

- replays decode-phase preemption by treating already-produced tokens as the
  next prefill segment and the remaining tokens as decode work;
- preserves unfinished zero-token waiting requests instead of silently dropping
  them;
- reports metrics against user-facing trace prompt/output lengths after runtime
  token splitting;
- fails fast if sequential mode drains before all generated requests complete.

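The replay rule and the drain guard can be sketched as below. This is one reading of the fix, not the patch itself: it assumes a recompute-style preemption in which the freed KV must be rebuilt, so the next prefill segment covers the prompt plus the tokens already produced. The `Request` class, its field names, and both functions are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    prompt_tokens: int   # user-facing prompt length from the trace
    output_tokens: int   # user-facing output length from the trace
    generated: int = 0   # decode tokens produced before preemption
    # Runtime split used by the scheduler; differs from the trace after replay.
    prefill_target: int = field(default=0, init=False)
    decode_target: int = field(default=0, init=False)

    def __post_init__(self):
        self.prefill_target = self.prompt_tokens
        self.decode_target = self.output_tokens


def replay_after_preemption(req: Request) -> int:
    """Re-split the request and return the tokens the next admission must schedule."""
    req.prefill_target = req.prompt_tokens + req.generated
    req.decode_target = req.output_tokens - req.generated
    # The total work is unchanged, so metrics can still be reported against
    # the user-facing trace lengths.
    assert req.prefill_target + req.decode_target == req.prompt_tokens + req.output_tokens
    return req.prefill_target  # never zero, so the request is not dropped


def check_drained(completed: int, total: int) -> None:
    """Fail fast when the simulation drains with unfinished requests."""
    if completed < total:
        raise RuntimeError(f"drained with {total - completed} unfinished requests")


req = Request(request_id=70, prompt_tokens=100, output_tokens=50, generated=20)
print(replay_after_preemption(req), req.decode_target)
```

Returning a non-zero token count on re-admission is the property that prevents the `num_new_tokens=0` removal described above.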
Verification runs:

| run | old completion | fixed completion | Frontier preemptions | prefix token hit ratio | status |
|---|---:|---:|---:|---:|---|
| `coder_200_ts2` | 182/200 | 200/200 | 33 | 0.2313 | pass |
| `coder_200_ts3` | 184/200 | 200/200 | 20 | 0.2177 | pass |

Fixed-run paths:

- `runs/rs10_preemption_replay_fix_ts2/frontier_h20_tp1_profile_full32k/coder_200_ts2/vllm_kv_15281_profile_full32k`
- `runs/rs10_preemption_replay_fix_ts3/frontier_h20_tp1_profile_full32k/coder_200_ts3/vllm_kv_15281_profile_full32k`

Updated run-level comparison:

| metric | Frontier scale 2 fixed | vLLM scale 2 | Frontier scale 3 fixed | vLLM scale 3 |
|---|---:|---:|---:|---:|
| completed requests | 200/200 | 200/200 | 200/200 | 200/200 |
| preemption events | 33 | 43 | 20 | 16 |
| TTFT p50 | 9.595s | 9.217s | 1.001s | 1.166s |
| TTFT p95 | 77.503s | 69.211s | 45.947s | 32.258s |
| TPOT p50 | 0.0542s | 0.0497s | 0.0534s | 0.0462s |
| TPOT p95 | 0.0665s | 0.0686s | 0.0686s | 0.0714s |
| E2E p50 | 61.458s | 55.002s | 44.761s | 33.213s |
| E2E p95 | 174.484s | 142.338s | 154.548s | 122.789s |
| requests/s | 0.594 | 0.803 | 0.574 | 0.780 |
| total tok/s | 3506.3 | 4742.5 | 3390.0 | 4608.1 |
| decode tok/s | 531.6 | 719.0 | 513.9 | 698.6 |

Current judgment after the fix:

- The completion/preemption lifecycle blocker for RS10 is fixed: both scale 2
  and scale 3 now emit 200 request rows and complete postprocess.
- Frontier's preemption count is now of the same order as vLLM's, but not
  exact: scale 2 is 33 vs 43 events, scale 3 is 20 vs 16 events.
- Prefix hit ratio changed materially because preempted requests now replay and
  re-enter prefix-cache admission instead of disappearing. It is no longer valid
  to compare the old incomplete RS10 prefix ratios against vLLM.
- Timing remains close in TPOT but Frontier is still slower in aggregate
  throughput, about 0.74x of vLLM total/decode token throughput for both scale 2
  and scale 3. TTFT/E2E tails are still worse now that the completion set is
  complete.
- The remaining gap is no longer "missing metrics rows"; it is
  scheduler/preemption fidelity plus CPU/scheduler/CUDA-graph timing
  calibration.

## 2026-06-25 H20 TP2/TP4 Comparison
|
|
|
|
The TP2/TP4 comparison uses the same first-200 `coder_200_ts2` and
|
|
`coder_200_ts3` fixtures. The vLLM runs are on dash1 with
|
|
`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`, vLLM 0.11.1,
|
|
`max_model_len=32768`, `max_num_seqs=64`,
|
|
`max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
|
|
prefix caching on, and chunked prefill on.
|
|
|
|
vLLM measured KV capacity:

| TP | KV tokens | KV blocks |
|---:|---:|---:|
| 2 | 1,104,880 | 69,055 |
| 4 | 2,833,232 | 177,077 |

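The token and block counts above are related by the KV block size. A small consistency check, assuming the TP2/TP4 runs use the same block size of 16 as the TP1 setup (the note does not restate it for these runs); the variable and function names are illustrative:

```python
BLOCK_SIZE = 16  # tokens per KV block; assumed to match the TP1 setup

# (KV tokens, KV blocks) per TP degree, copied from the table above.
kv_capacity = {
    2: (1_104_880, 69_055),
    4: (2_833_232, 177_077),
}


def blocks_for(tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Convert a KV token capacity into whole KV blocks."""
    blocks, remainder = divmod(tokens, block_size)
    if remainder:
        raise ValueError(f"{tokens} is not a multiple of {block_size}")
    return blocks


for tp, (tokens, blocks) in kv_capacity.items():
    status = "ok" if blocks_for(tokens) == blocks else "MISMATCH"
    print(f"TP{tp}: {tokens} tokens -> {blocks_for(tokens)} blocks ({status})")
```

Both rows divide evenly by 16 and reproduce the listed block counts.
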
Frontier RS12 uses explicitly matched KV block counts and fresh H20 TP2/TP4
profiles:

- Config:
`configs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3.json`
- Run:
`runs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3`
- Profile source:
`dash1:/home/admin/cpfs/wjh/replayserve_frontier_profiles/h20_tp2_tp4_qwen3_30ba3b_full32k_20260625_true_mixed`
- Linear/MoE profiles cover TP2/TP4 up to 32768 tokens.
- Attention profile covers TP2/TP4 standard attention plus 1260 true-mixed
prefill+decode rows. The true-mixed rows are required; standard attention
alone fails with missing `attn_decode_in_mixed` predictions.

All four Frontier runs completed 200/200 request rows. Neither Frontier nor the
vLLM TP2/TP4 logs reported preemption events. Prefix token hit ratio is exactly
the same in Frontier postprocess and vLLM's trace-side synthetic estimate:
0.2697549478.

Run-level comparison:

| TP | fixture | metric | Frontier | vLLM | Frontier / vLLM |
|---:|---|---|---:|---:|---:|
| 2 | `coder_200_ts2` | requests/s | 0.776 | 1.278 | 0.61 |
| 2 | `coder_200_ts2` | total tok/s | 4581 | 7547 | 0.61 |
| 2 | `coder_200_ts2` | decode tok/s | 695 | 1144 | 0.61 |
| 2 | `coder_200_ts2` | TTFT p50/p95 | 0.269/6.745s | 0.225/0.715s | 1.20/9.43 |
| 2 | `coder_200_ts2` | TPOT p50/p95 | 0.0430/0.0529s | 0.0300/0.0434s | 1.43/1.22 |
| 2 | `coder_200_ts2` | E2E p50/p95 | 26.05/106.76s | 16.45/72.53s | 1.58/1.47 |
| 4 | `coder_200_ts2` | requests/s | 0.853 | 1.536 | 0.55 |
| 4 | `coder_200_ts2` | total tok/s | 5035 | 9073 | 0.55 |
| 4 | `coder_200_ts2` | decode tok/s | 763 | 1376 | 0.55 |
| 4 | `coder_200_ts2` | TTFT p50/p95 | 0.098/0.386s | 0.170/1.420s | 0.57/0.27 |
| 4 | `coder_200_ts2` | TPOT p50/p95 | 0.0337/0.0384s | 0.0163/0.0283s | 2.06/1.36 |
| 4 | `coder_200_ts2` | E2E p50/p95 | 18.65/84.94s | 9.26/43.62s | 2.01/1.95 |
| 2 | `coder_200_ts3` | requests/s | 0.688 | 1.088 | 0.63 |
| 2 | `coder_200_ts3` | total tok/s | 4062 | 6426 | 0.63 |
| 2 | `coder_200_ts3` | decode tok/s | 616 | 974 | 0.63 |
| 2 | `coder_200_ts3` | TTFT p50/p95 | 0.134/0.574s | 0.154/0.627s | 0.87/0.92 |
| 2 | `coder_200_ts3` | TPOT p50/p95 | 0.0394/0.0467s | 0.0191/0.0280s | 2.07/1.67 |
| 2 | `coder_200_ts3` | E2E p50/p95 | 21.79/101.59s | 9.96/53.98s | 2.19/1.88 |
| 4 | `coder_200_ts3` | requests/s | 0.737 | 1.254 | 0.59 |
| 4 | `coder_200_ts3` | total tok/s | 4355 | 7403 | 0.59 |
| 4 | `coder_200_ts3` | decode tok/s | 660 | 1122 | 0.59 |
| 4 | `coder_200_ts3` | TTFT p50/p95 | 0.089/0.346s | 0.100/0.318s | 0.89/1.09 |
| 4 | `coder_200_ts3` | TPOT p50/p95 | 0.0311/0.0358s | 0.0094/0.0128s | 3.30/2.80 |
| 4 | `coder_200_ts3` | E2E p50/p95 | 16.90/83.01s | 5.55/27.87s | 3.05/2.98 |

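The `Frontier / vLLM` column can be recomputed from the two value columns. The following is a minimal sketch covering total tok/s and TPOT p50 only; the tuple layout and names are illustrative, and the inputs are the rounded values printed in the table rather than raw run output.

```python
# (TP, fixture) -> (Frontier, vLLM), copied from the run-level table above.
total_tok_s = {
    (2, "coder_200_ts2"): (4581, 7547),
    (4, "coder_200_ts2"): (5035, 9073),
    (2, "coder_200_ts3"): (4062, 6426),
    (4, "coder_200_ts3"): (4355, 7403),
}
tpot_p50_s = {
    (2, "coder_200_ts2"): (0.0430, 0.0300),
    (4, "coder_200_ts2"): (0.0337, 0.0163),
    (2, "coder_200_ts3"): (0.0394, 0.0191),
    (4, "coder_200_ts3"): (0.0311, 0.0094),
}


def ratio_table(pairs):
    """Frontier / vLLM ratio for every (TP, fixture) key."""
    return {key: frontier / vllm for key, (frontier, vllm) in pairs.items()}


throughput_ratio = ratio_table(total_tok_s)
tpot_ratio = ratio_table(tpot_p50_s)

for key in total_tok_s:
    tp, fixture = key
    print(f"TP{tp} {fixture}: throughput {throughput_ratio[key]:.2f}x, "
          f"TPOT p50 {tpot_ratio[key]:.2f}x")
```

Because the inputs are already rounded, a recomputed TPOT ratio can differ from the table in the last digit; the throughput ratios span roughly 0.55 to 0.63.
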
TP scaling comparison (all values are TP4 / TP2 ratios; for TPOT, lower is
better):

| fixture | metric | Frontier TP4 / TP2 | vLLM TP4 / TP2 |
|---|---|---:|---:|
| `coder_200_ts2` | total tok/s speedup | 1.10 | 1.20 |
| `coder_200_ts2` | decode tok/s speedup | 1.10 | 1.20 |
| `coder_200_ts2` | TPOT p50 ratio | 0.78 | 0.54 |
| `coder_200_ts3` | total tok/s speedup | 1.07 | 1.15 |
| `coder_200_ts3` | decode tok/s speedup | 1.07 | 1.15 |
| `coder_200_ts3` | TPOT p50 ratio | 0.79 | 0.49 |

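The scaling figures follow directly from the run-level numbers: each entry is the TP4 value divided by the TP2 value for the same system and fixture. A sketch under that reading, with illustrative names and the rounded table values as input:

```python
# system -> fixture -> (TP2 value, TP4 value), from the run-level table.
total_tok_s = {
    "Frontier": {"coder_200_ts2": (4581, 5035), "coder_200_ts3": (4062, 4355)},
    "vLLM": {"coder_200_ts2": (7547, 9073), "coder_200_ts3": (6426, 7403)},
}
tpot_p50_s = {
    "Frontier": {"coder_200_ts2": (0.0430, 0.0337), "coder_200_ts3": (0.0394, 0.0311)},
    "vLLM": {"coder_200_ts2": (0.0300, 0.0163), "coder_200_ts3": (0.0191, 0.0094)},
}


def tp4_over_tp2(table):
    """TP4 / TP2 ratio per system and fixture."""
    return {
        system: {fixture: tp4 / tp2 for fixture, (tp2, tp4) in fixtures.items()}
        for system, fixtures in table.items()
    }


speedup = tp4_over_tp2(total_tok_s)
tpot_scale = tp4_over_tp2(tpot_p50_s)

for system in ("Frontier", "vLLM"):
    for fixture in ("coder_200_ts2", "coder_200_ts3"):
        improvement = (1.0 - tpot_scale[system][fixture]) * 100.0
        print(f"{system:>8} {fixture}: throughput {speedup[system][fixture]:.2f}x, "
              f"TPOT p50 {tpot_scale[system][fixture]:.2f}x "
              f"({improvement:.0f}% lower)")
```

Expressed as a percentage, `1 - ratio` gives the TPOT p50 improvement of TP4 over TP2: roughly 46-51% for vLLM and 21-22% for Frontier.
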
Current TP2/TP4 judgment:

- Functional replay is aligned for this setting: same request rows, same
trace-side prefix reuse ratio, matched vLLM KV block counts, and no
preemption on either side.
- Absolute performance is not aligned. Frontier reports only 55-63% of vLLM
total/decode throughput across TP2/TP4, and TPOT is especially pessimistic at
TP4 (TPOT p50 is 2.06x vLLM on `coder_200_ts2` and 3.30x on `coder_200_ts3`).
- Relative TP scaling is also underestimated. vLLM's TP4 improves TPOT p50 by
about 46-51% over TP2, while Frontier improves by only about 21-22%.
- The remaining gap is therefore not caused by missing rows, prefix-cache
mismatch, or KV capacity mismatch in these runs. It points to timing model
limitations: missing CPU/scheduler/CUDA-graph modeling, random-forest profile
interpolation error, and imperfect modeling of vLLM's TP-dependent decode
execution path.
- These RS12 results are acceptable for continuing ReplayServe integration and
for reading rough qualitative trends. They are not yet acceptable as
calibrated absolute performance predictions.