gahow/replaysim

Gahow Wang a99bd00782 Add ReplayServe Frontier vLLM alignment report

2026-06-25 17:10:30 +08:00

RS4 Frontier H20 TP1 Alignment

This note compares Frontier H20 TP1 against the real vLLM TP1 run on dash2 for coder_100.

Setup

Real vLLM:

Runtime: vLLM 0.11.1
Host/GPU: dash2, NVIDIA H20
Model: /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B
TP: 1
KV capacity: 244,496 tokens = 15,281 blocks at block size 16
Run: runs/vllm_gpu_smoke_20260624/tp1_coder100_uncapped

Frontier:

Frontier root: /tmp/replayserve-frontier-rs1b
Frontier commit: d9cfeb6d8791fbf2f295dd9744c56a666171776e
Model config name: qwen3-a3b-30b-moe
Device: h20
Network node SKU: h20_dgx
TP: attn_tensor_parallel_size=1, moe_tensor_parallel_size=1, moe_expert_parallel_size=1
max_tokens_in_batch=32768, batch_size_cap=64, block size 16
Prefix cache on, chunked prefill on
long_prefill_token_threshold=32768
Config: configs/rs4_frontier_h20_tp1.json
Run: runs/rs4_frontier_h20_tp1_20260624

The high long-prefill threshold is deliberate. Frontier's earlier run with a threshold of 64 under-counted prefix hits because long prompts were admitted in 64-token chunks, which the current real vLLM run does not do.

KV Capacity

run	KV blocks	KV tokens	note
Frontier `planner_kv`	17,385	278,160	Frontier H20 memory planner, no non-KV overhead
Frontier `vllm_kv_15281`	15,281	244,496	Explicitly matched to real vLLM TP1
vLLM TP1	15,281	244,496	From vLLM memory profiling

So only vllm_kv_15281 has the same KV block count as real vLLM TP1.
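
The block and token counts in the table are related by the block size of 16. A minimal sketch of that arithmetic, using only the numbers from the table (the function and variable names are illustrative, not part of Frontier or vLLM):

```python
BLOCK_SIZE = 16  # tokens per KV block, as stated in the setup


def kv_tokens(num_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Token capacity implied by a KV block count."""
    return num_blocks * block_size


# KV block counts copied from the table above.
capacities = {
    "planner_kv": 17_385,
    "vllm_kv_15281": 15_281,
    "vLLM TP1": 15_281,
}
for name, blocks in capacities.items():
    print(f"{name}: {blocks:,} blocks = {kv_tokens(blocks):,} tokens")

# Extra blocks the Frontier planner grants relative to real vLLM TP1.
extra_blocks = capacities["planner_kv"] - capacities["vLLM TP1"]
print(f"planner_kv surplus: {extra_blocks:,} blocks = {kv_tokens(extra_blocks):,} tokens")
```

The surplus is what makes planner_kv an unfair comparator for cache behavior: it has more room for cached prefixes than the real engine had.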

Results

run	completed	prefix hit tokens / ratio	preemptions	TTFT p50/p95	TPOT p50/p95	E2E p50/p95	decode tok/s
Frontier `planner_kv`	96/100	110,608 / 0.240691	0	0.986/128.991s	0.582/0.582s	279.092/1706.675s	19.4
Frontier `vllm_kv_15281`	92/100	103,168 / 0.242542	0	0.964/182.639s	0.582/0.582s	305.290/1765.347s	19.4
vLLM TP1 real	100/100	119,152 / 0.251082 sidecar estimate	8	4.503/29.060s	0.066/0.621s	41.841/97.366s	567.4

The latency/throughput columns are not calibrated. Frontier still uses dummy execution timing, so TPOT is a constant simulator artifact.

Prefix Admission Check

Real vLLM TP1 preempts requests, so a preempted request can start more than once and the sidecar's theoretical prefix-hit estimate is not the right observed comparator for every request. The observed vLLM scheduler signal is instead the computed: value logged in stdout.log the first time each request starts.

Using first-start computed: tokens:

Frontier run	compared rows	Frontier computed sum	vLLM first-start computed sum	mismatch
`planner_kv`	96	110,608	108,208	one request differs
`vllm_kv_15281`	92	103,168	103,168	exact match

So with the KV block count explicitly matched, Frontier's prefix-cache admission matches real vLLM TP1 for every row where Frontier emits complete cache metrics.
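
The check above can be sketched as follows. This is a hedged illustration, not the actual comparator: it assumes the vLLM start events have already been extracted from stdout.log as (request_id, computed_tokens) pairs in log order, and the function names and toy data are hypothetical.

```python
def first_start_computed(start_events):
    """Map request id -> computed tokens at that request's first start.

    start_events is an iterable of (request_id, computed_tokens) pairs in
    log order. A preempted request appears more than once; only its first
    start is kept.
    """
    first = {}
    for request_id, computed in start_events:
        first.setdefault(request_id, computed)
    return first


def compare_admission(frontier_hits, vllm_start_events):
    """Compare Frontier prefix-hit tokens with vLLM first-start values."""
    vllm_first = first_start_computed(vllm_start_events)
    shared = sorted(set(frontier_hits) & set(vllm_first))
    return {
        "compared_rows": len(shared),
        "frontier_sum": sum(frontier_hits[r] for r in shared),
        "vllm_first_start_sum": sum(vllm_first[r] for r in shared),
        "mismatched_ids": [r for r in shared if frontier_hits[r] != vllm_first[r]],
    }


# Toy data (not from the runs): request 1 is preempted and restarts with a
# larger computed value, which must not replace its first-start value.
vllm_events = [(0, 0), (1, 1024), (2, 2048), (1, 1536)]
frontier_hits = {0: 0, 1: 1024, 2: 2048}
report = compare_admission(frontier_hits, vllm_events)
print(report)
```

Only request ids present on both sides are compared, which mirrors the "compared rows" column: Frontier rows without complete cache metrics drop out of the sum.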

Current Alignment Judgment

Aligned:

H20 device and Qwen3-30B-A3B structural model config can run in Frontier.
TP1 scheduler knobs can be matched.
KV block count can be matched explicitly at 15,281 blocks.
First-admission prefix-cache hit tokens match real vLLM TP1 on completed rows when KV blocks are explicit.

Not aligned:

Frontier emits complete request/cache metrics for only 92/100 requests in the explicit-KV run, while vLLM completes 100/100.
Frontier reports 0 preemptions; real vLLM TP1 reports 8 preemptions across 5 repeated-start requests.
Frontier timing is not comparable because it still uses dummy execution prediction. The current latency/throughput gap is therefore expected and does not measure calibrated simulator error.

Next work:

Treat RS6 as the current profiled baseline and investigate why it omits complete latency/cache metrics for requests 70, 77, 88, and 90.
Instrument Frontier's vLLM V1 scheduler around KV block allocation, free-block count, and preemption victim selection. Real vLLM TP1 has 8 preemptions, while Frontier still reports 0 with the same explicit 15,281-block capacity.
Add a per-request Frontier/vLLM comparator that reports TTFT/TPOT/E2E ratios, prefix hits, and completion/preemption status on the same request ids.
Calibrate CPU/scheduler/CUDA-graph effects separately from op profile timing; RS6 ruled out the 4096-token linear/MoE extrapolation as the primary explanation for the remaining gap.

Performance Gap

Use Frontier vllm_kv_15281 as the current aligned-KV simulator point. This matches the real vLLM TP1 KV block count, but it still uses Frontier dummy execution timing.

metric	Frontier H20 TP1 explicit KV	real vLLM H20 TP1	gap
completed requests	92/100	100/100	not aligned
TTFT p50	0.964s	4.503s	Frontier 0.21x real
TTFT p95	182.639s	29.060s	Frontier 6.28x real
TPOT p50	0.582s	0.066s	Frontier 8.81x real
TPOT p95	0.582s	0.621s	Frontier 0.94x real
E2E p50	305.290s	41.841s	Frontier 7.30x real
E2E p95	1765.347s	97.366s	Frontier 18.13x real
RPS	0.0217	0.6880	vLLM 31.74x Frontier
decode tok/s	19.4	567.4	vLLM 29.20x Frontier

Interpretation:

The prefix admission path is close after explicit KV matching, but performance is not calibrated.
Frontier uses dummy execution timing; its TPOT is nearly constant at 582 ms, while real vLLM TP1 has p50 TPOT 66 ms and p95 TPOT 621 ms.
Frontier does not reproduce real vLLM's TP1 preemption behavior: real vLLM had 8 preemptions, while Frontier reported 0.
Frontier emits complete request/cache metrics for only 92 rows in this run, so p95 and throughput are not yet on the same request set.
The sign of the TTFT error is mixed: Frontier p50 TTFT is too optimistic (0.21x real), but p95 TTFT is far too pessimistic (6.28x real). This is consistent with uncalibrated execution timing plus different queue/preemption dynamics.
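
The gap column can be reproduced from the two value columns. A small sketch, using a subset of the rows above (the dictionary layout and helper name are illustrative); a ratio below 1 means Frontier is optimistic, above 1 pessimistic:

```python
# (Frontier, real vLLM) values in seconds, copied from the gap table above.
LATENCY = {
    "TTFT p50": (0.964, 4.503),
    "TTFT p95": (182.639, 29.060),
    "TPOT p95": (0.582, 0.621),
    "E2E p50": (305.290, 41.841),
    "E2E p95": (1765.347, 97.366),
}


def frontier_over_real(frontier: float, real: float) -> float:
    """Ratio used in the gap column: Frontier value divided by real value."""
    return frontier / real


ratios = {
    name: round(frontier_over_real(frontier, real), 2)
    for name, (frontier, real) in LATENCY.items()
}
for name, ratio in ratios.items():
    direction = "optimistic" if ratio < 1 else "pessimistic"
    print(f"{name}: Frontier {ratio:.2f}x real ({direction})")
```

Ratios recomputed from the rounded table values can differ in the last digit from ratios computed on unrounded measurements.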

RS5 Profiled Frontier Timing

Frontier does support replacing dummy timing with real CSV profiles through the random-forest execution-time predictor. The required non-dummy flags are wired in tools/run_frontier_sweep.py, and the active profiled config is configs/rs5_frontier_h20_tp1_profile.json.

Profile data collected on dash2 H20 TP1:

Linear ops: linear_op.csv, CUDA event, max tokens 4096.
Attention: attention_combined.csv, CUDA event, max sequence/chunk 18000, with 15417 standard rows plus 612 true-mixed rows. Online replay needs the true-mixed rows to train attn_prefill_mixed and attn_decode_in_mixed.
MoE: moe_vllm_fused.csv, CUDA event, max tokens 4096, vLLM fused MoE backend.

Frontier vLLM 0.11.1 profiling needed local compatibility patches in patches/frontier-vllm-0.11.1-profiling-compat.patch:

RoPE helper fallback when vLLM 0.11.1 get_rope() no longer accepts the legacy rotary_dim keyword.
_get_config_dtype_str fallback for vLLM fused MoE config dtype.
ReplicatedLinear(disable_tp=True) fallback to torch Linear when vLLM TP group is not initialized in standalone profiling.
fused_topk() variable-return handling.
invoke_fused_moe_kernel() 0.11.1 signature compatibility.

The first profiled MoE attempt used Frontier's frontier_loop backend and was not faithful to vLLM serving. It predicted moe_grouped_gemm at about 16 ms for 24 tokens and 19 ms for 1024 tokens, causing TPOT around 0.93 s. The vLLM fused MoE profile predicts about 0.32 ms for 24 tokens and 0.87 ms for 1024 tokens.

run	completed	prefix hit ratio	TTFT p50/p95	TPOT p50/p95	E2E p50/p95	total tok/s	decode tok/s
Frontier dummy `vllm_kv_15281`	92/100	0.2422	0.964/182.639s	0.582/0.582s	305.290/1765.347s	131.3	19.4
Frontier profiled `frontier_loop` MoE	93/100	0.2492	3.320/310.235s	0.930/1.767s	492.097/2038.538s	165.9	24.6
Frontier profiled vLLM fused MoE	97/100	0.2376	0.355/13.695s	0.056/0.098s	27.032/119.019s	2056.7	304.5
Frontier profiled vLLM fused MoE, linear/MoE 32K	96/100	0.2484	0.909/12.763s	0.057/0.146s	30.939/119.636s	2348.9	347.8
vLLM TP1 real	100/100	0.2511	4.503/29.060s	0.066/0.621s	41.841/97.366s	3832.3	567.4

Current judgment:

The profiled vLLM fused MoE run is the first useful timing baseline. TPOT p50 is close to real vLLM, but throughput is still about 54% of real vLLM and TTFT/E2E tails do not align.
After extending linear and MoE profiles to 32768 tokens and adding prefill_hot MoE rows, the cache hit ratio is nearly aligned (0.2484 vs vLLM 0.2511), throughput improves to about 61% of real vLLM, and TTFT p50 moves from 0.08x to 0.20x of real vLLM. This confirms that the 4096 profile ceiling was a real source of error.
Prefix/cache accounting remains close but not exact: the profiled run emits complete cache metrics for 96/100 requests in the 32K run, with token hit ratio 0.2488 vs vLLM's sidecar estimate 0.2511.
Frontier still reports zero preemptions, while real vLLM TP1 had 8 preemption events. This affects completion set, TTFT tail, and E2E tail.
The remaining gaps are no longer explained by the linear/MoE 4096-token extrapolation alone. The 32K run still has TTFT p50 at 0.20x, TTFT p95 at 0.44x, TPOT p95 at 0.23x, and throughput at 0.61x of real vLLM. This points to missing CPU/scheduler/CUDA-graph modeling plus Frontier's scheduler and completion/preemption fidelity.
The 32K run still completes only 96/100 requests in latency/cache metrics (70, 77, 88, 90 missing), while real vLLM completes 100/100. This is a Frontier lifecycle/metrics or scheduler-fidelity issue to debug separately.

2026-06-24 Follow-Up

Handled in the ReplayServe harness:

tools/run_frontier_sweep.py now passes an absolute metrics output path into Frontier. Frontier runs with cwd=/tmp/replayserve-frontier-rs1b; relative metrics paths can otherwise be written under the Frontier scratch instead of ReplayServe's run directory.
tools/postprocess_frontier_smoke.py now emits a completion block with completed_requests, total_requests, and missing_latency_request_ids.
tools/aggregate_runs.py now marks a run as incomplete when postprocess reports missing latency rows. The latest RS6 summary is therefore incomplete, not a clean pass.
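
A minimal sketch of what such a completion block and the resulting status could look like. The three field names come from the list above; the row layout, the request_id key, the 0-based ids, and the status strings are assumptions for illustration, not the actual postprocess code.

```python
def completion_block(expected_ids, latency_rows):
    """Summarize which expected requests have a latency row."""
    expected = set(expected_ids)
    seen = {row["request_id"] for row in latency_rows}
    return {
        "completed_requests": len(expected & seen),
        "total_requests": len(expected),
        "missing_latency_request_ids": sorted(expected - seen),
    }


def run_status(block):
    """A run with any missing latency row is incomplete, not a clean pass."""
    return "incomplete" if block["missing_latency_request_ids"] else "pass"


# RS6-shaped example: 100 requests, with the four ids named in this note
# missing. Assumes request ids are 0-based.
expected_ids = range(100)
rows = [{"request_id": i} for i in expected_ids if i not in {70, 77, 88, 90}]
block = completion_block(expected_ids, rows)
print(block, run_status(block))
```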

Latest RS6 vs real vLLM TP1 after the 32K profile and harness fixes:

metric	Frontier RS6 32K profile	real vLLM TP1	Frontier / vLLM
completed requests	96/100	100/100	0.96
prefix token hit ratio	0.2488	0.2511	0.99
preemption events	0	8	0.00
TTFT p50	0.909s	4.503s	0.20
TTFT p95	12.763s	29.060s	0.44
TPOT p50	0.0569s	0.0661s	0.86
TPOT p95	0.146s	0.621s	0.23
E2E p50	30.939s	41.841s	0.74
E2E p95	119.636s	97.366s	1.23
total tok/s	2348.9	3832.3	0.61
decode tok/s	347.8	567.4	0.61

Preemption experiment:

A local trial enabled waiting-admission preemption in Frontier Phase 2. It did produce preemption events, but it was not a valid alignment improvement: Frontier completed only 79/100 requests and amplified the early-decode disappearance pattern. That config was removed from configs/.
This means the remaining preemption gap is not just "turn on preemption in Phase 2". Frontier's batch/runtime-epoch lifecycle needs a deeper fix before its preemption behavior can be considered faithful to vLLM TP1.

Current interpretation:

Prefix/cache replay is close: token-weighted prefix hit ratio is within about 1% relative of the vLLM synthetic replay estimate.
Completion/preemption is not aligned. Requests 70, 77, 88, and 90 begin decode in RS6 but never reach completion metrics; vLLM completes all 100 requests and logs 8 preemption events.
Timing is partially useful but not fully calibrated. Linear and MoE profiles now cover the trace's long-prefill range up to 32768 tokens, so the old 4096 extrapolation is no longer the main explanation. The remaining TTFT/TPOT/E2E gap likely comes from missing CPU/scheduler overhead, decode CUDA graph modeling, and Frontier scheduler lifecycle differences.

2026-06-25 500-Request Stress

Generated traces/fixtures/coder_500 from the first 500 rows of qwen_coder_blksz_16.jsonl:

row_count=500
max_total_tokens=21318
overflow_count=0
partial_final_block_rows=466

Frontier RS8 used the same H20 TP1 Qwen3-30B-A3B full32K profile and explicit KV block count as RS6:

Config: configs/rs8_frontier_h20_tp1_profile_full32k_coder500.json
Run: runs/rs8_frontier_h20_tp1_profile_full32k_coder500_20260625
Runtime: 492 seconds
Status: incomplete

metric	Frontier RS6 100 reqs	Frontier RS8 500 reqs
completed requests	96/100	439/500
missing latency/cache rows	4	61
prefix token hit ratio	0.2488	0.1192
preemption events	0	0
TTFT p50/p95	0.909/12.763s	136.776/340.237s
TPOT p50/p95	0.0569/0.146s	0.0564/0.0894s
E2E p50/p95	30.939/119.636s	177.800/397.291s
total tok/s	2348.9	4733.7
decode tok/s	347.8	656.2

Missing request ids in RS8:

70,77,88,90,103,106,134,135,142,143,153,154,176,178,183,184,186,188,210,211,216,222,245,246,263,272,274,278,291,298,299,300,320,325,334,335,347,348,363,367,373,374,393,399,403,409,412,413,414,433,434,437,439,450,453,460,469,475,476,479,497

The incomplete-row issue grows with request count: 4/100 missing (4.0%) in RS6 becomes 61/500 missing (12.2%) in RS8. This makes RS8 invalid for final performance claims, but useful as a stress signal for Frontier lifecycle/metrics fidelity.
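
As a quick consistency check on the id list above, the following sketch parses it, confirms the count, and shows that the four ids missing in RS6 are also missing in RS8. Variable names are illustrative.

```python
RS6_MISSING = {70, 77, 88, 90}
RS8_MISSING_TEXT = (
    "70,77,88,90,103,106,134,135,142,143,153,154,176,178,183,184,186,188,"
    "210,211,216,222,245,246,263,272,274,278,291,298,299,300,320,325,334,"
    "335,347,348,363,367,373,374,393,399,403,409,412,413,414,433,434,437,"
    "439,450,453,460,469,475,476,479,497"
)
rs8_missing = {int(token) for token in RS8_MISSING_TEXT.split(",")}

rs6_rate = len(RS6_MISSING) / 100
rs8_rate = len(rs8_missing) / 500
print(f"RS6 missing: {len(RS6_MISSING)} ({rs6_rate:.1%})")
print(f"RS8 missing: {len(rs8_missing)} ({rs8_rate:.1%})")
print("RS6 ids also missing in RS8:", RS6_MISSING <= rs8_missing)
```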

The lower prefix hit ratio is not by itself proof of adapter failure. The unbounded trace-side trie estimate for coder_500 is 0.3868 token hit ratio, but the H20 TP1 configuration has finite KV capacity (num_blocks=15281, about 244K tokens). The 500-request window has 2.7M prompt tokens, so KV eviction can substantially reduce real prefix hits. The dash1 vLLM run below is the current finite-cache comparator for whether Frontier's behavior is faithful.
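
To illustrate why a finite cache lowers the observed hit ratio, here is a toy sketch that counts prefix hits over block-id sequences with and without a capacity limit. It assumes each request is a list of block ids and uses plain LRU eviction; this is a simplification and not vLLM's or Frontier's actual eviction policy. All names are hypothetical.

```python
from collections import OrderedDict


def prefix_hit_ratio(requests, capacity_blocks=None):
    """Fraction of blocks served from cache as a contiguous prefix.

    requests: list of block-id lists, in arrival order.
    capacity_blocks: None for an unbounded cache, else an LRU block limit.
    """
    trie = {}                 # (parent_node, block_id) -> node id, identity only
    resident = OrderedDict()  # node ids currently cached, oldest first
    hit = total = 0
    for blocks in requests:
        node = 0              # root of the prefix trie
        prefix_intact = True
        for block_id in blocks:
            node = trie.setdefault((node, block_id), len(trie) + 1)
            if prefix_intact and node in resident:
                hit += 1
            else:
                prefix_intact = False  # a hit must extend an unbroken prefix
            resident[node] = None
            resident.move_to_end(node)
            if capacity_blocks is not None and len(resident) > capacity_blocks:
                resident.popitem(last=False)  # evict least recently used
        total += len(blocks)
    return hit / total


# Toy trace: the last request repeats the first, but an unrelated request
# in between evicts the shared prefix when capacity is small.
toy = [[1, 2, 3, 4], [1, 2, 5, 6], [7, 8, 9, 10], [1, 2, 3, 4]]
unbounded = prefix_hit_ratio(toy)
finite = prefix_hit_ratio(toy, capacity_blocks=4)
print(unbounded, finite)
```

In the toy trace the unbounded estimate counts the final request as a full hit, while the 4-block cache has already evicted that prefix. The same mechanism, at much larger scale, separates a trace-side trie estimate from a finite-cache observation.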

Real vLLM TP1 500 was first attempted on dash2 with the same settings as tp1_coder100_uncapped (max_num_seqs=64, max_num_batched_tokens=32768, gpu_memory_utilization=0.85, CUDA_VISIBLE_DEVICES=0), but did not start because dash2 was already occupied by eight existing agentic-kvc vLLM serve processes on ports 8000-8007. Each H20 had about 89GB allocated, and vLLM failed with free memory below the required 0.85 utilization target. Those processes were not killed; the temporary ReplayServe GPU lock was released.

A replacement vLLM TP1 500 run completed on dash1:

Run: runs/vllm_gpu_smoke_20260625_dash1/tp1_coder500_uncapped
Runtime: vLLM 0.11.1
Host/GPU: dash1, one NVIDIA H20 via CUDA_VISIBLE_DEVICES=0
Model: /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B
Command knobs: TP=1, max_model_len=32768, max_num_seqs=64, max_num_batched_tokens=32768, gpu_memory_utilization=0.85, prefix caching on, chunked prefill on
vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
Replay wall time after engine startup: 595.116 seconds
Process elapsed including model load/startup: 2026-06-25T03:08:18Z to 2026-06-25T03:19:41Z

metric	Frontier RS8 500 reqs	vLLM TP1 500 reqs	vLLM / Frontier
completed requests	439/500	500/500	not aligned
preemption events	0	63	not aligned
repeated/preempted request ids	0	57	not aligned
TTFT p50	136.776s	185.658s	1.36
TTFT p95	340.237s	375.895s	1.10
TPOT p50	0.0564s	0.0498s	0.88
TPOT p95	0.0894s	0.0919s	1.03
E2E p50	177.800s	224.270s	1.26
E2E p95	397.291s	417.356s	1.05
requests/s	0.661	0.840	1.27
total tok/s	4733.7	5282.9	1.12
decode tok/s	656.2	732.3	1.12

Because Frontier emits latency/cache rows for only 439 requests, the latency comparison above mixes Frontier's completed subset with vLLM's complete 500-row run. Restricting vLLM to the same 439 request ids gives:

metric	Frontier RS8 439 rows	vLLM same 439 ids	vLLM / Frontier
TTFT p50	136.776s	169.968s	1.24
TTFT p95	340.237s	375.760s	1.10
TPOT p50	0.0564s	0.0498s	0.88
TPOT p95	0.0894s	0.1071s	1.20
E2E p50	177.800s	218.606s	1.23
E2E p95	397.291s	416.110s	1.05
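
Restricting one run to another run's completed ids before taking percentiles can be sketched as below. The percentile definition (nearest rank) and the data layout are assumptions; the note does not state which percentile method the postprocess uses.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list (q in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def restrict(latency_by_id, keep_ids):
    """Latencies for only the request ids that the other run completed."""
    return [latency_by_id[r] for r in keep_ids if r in latency_by_id]


# Toy data (not from the runs): request 3 is slow in vLLM and missing in
# Frontier, so it inflates the unrestricted vLLM tail.
vllm_ttft = {0: 1.0, 1: 2.0, 2: 3.0, 3: 10.0, 4: 4.0}
frontier_completed_ids = {0, 1, 2, 4}

all_rows = list(vllm_ttft.values())
same_ids = restrict(vllm_ttft, frontier_completed_ids)
print("all ids  p50/p95:", percentile(all_rows, 50), percentile(all_rows, 95))
print("same ids p50/p95:", percentile(same_ids, 50), percentile(same_ids, 95))
```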

Prefix/cache comparison needs careful metric naming:

The unbounded ReplayServe trie estimate for all 500 rows is 1,047,632 hit tokens / 2,708,110 prompt tokens = 0.3868 token hit ratio.
vLLM's finite-cache scheduler log is much lower under this pressure: first-start computed: ratio is 0.0979, last-start ratio is 0.1643, and max-per-request ratio is 0.1655.
On the same 439 request ids where Frontier emits complete metrics, vLLM's first-start computed: ratio is 0.1050, last-start ratio is 0.1665, and max-per-request ratio is 0.1679.
Frontier RS8 reports replayserve_token_hit_ratio=0.1192 and frontier_block_hit_ratio=0.1191, which is of the same order as vLLM's finite-cache scheduler signal (between its first-start and last-start ratios) but far below the unbounded trace-side estimate.

Current 500-request judgment:

Frontier's timing profile is now in the right broad range for this stressed H20 TP1 run: TPOT p50/p95 and E2E p95 are close to vLLM, and aggregate token throughput is within about 12%.
The run is still not a faithful simulator result because completion and preemption diverge: Frontier drops 61 latency/cache rows and reports zero preemptions, while real vLLM completes all 500 requests and logs 63 preemption events across 57 request ids.
The 500-request trace invalidates the earlier use of the unbounded sidecar prefix estimate as the primary comparator. Finite KV capacity, eviction, and preemption must be part of the prefix-cache replay metric.

ReplayServe TODO:

Treat incomplete Frontier runs as invalid for final performance claims unless the comparison explicitly reports the missing request set.
Keep the focused Frontier debug guard in the local patch: sequential mode now fails if completed_requests < total_requests at drain time and reports the missing request state.
Add a comparator that reports both unbounded trace-side prefix reuse and finite-cache observed reuse from vLLM scheduler logs; do not compare Frontier's finite-cache hit ratio directly to the unbounded trie estimate.
Profile or import vLLM CPU overhead records for H20 TP1 before enabling skip_cpu_overhead_modeling=false; without those records Frontier falls back to zero CPU overhead.
Collect kernel-only/decode-CUDA-graph timing profiles before using decode_cuda_graph_mode=full_decode_only; the current RS6 profile is CUDA event/eager timing.

2026-06-25 200-Request Timestamp Scale 2/3

Generated traces/fixtures/coder_200_ts0667 from the first 200 rows of qwen_coder_blksz_16.jsonl, with each timestamp multiplied by 2/3 in the fixture files:

row_count=200
timestamp_scale=0.6666666666666666
last_timestamp=30.711333333333332
max_total_tokens=18985
partial_final_block_rows=182

Important: in the current replay semantics, a smaller timestamp scale makes arrivals denser. Scale 2/3 shrinks the arrival window for the first 200 requests from about 46.1s to 30.7s. This does not reduce queue pressure relative to the same 200 requests at scale 1.0; it only reduces the request count relative to the 500-request stress.
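
A short sketch of the scaling semantics described above. The timestamp field name and row layout are assumptions about the fixture format; the last-timestamp value is the one reported for coder_200_ts0667.

```python
def scale_timestamps(rows, scale):
    """Return copies of trace rows with each arrival timestamp multiplied by scale."""
    return [{**row, "timestamp": row["timestamp"] * scale} for row in rows]


# Toy rows: a scale below 1 pulls arrivals closer together.
toy_rows = [{"request_id": 0, "timestamp": 0.0}, {"request_id": 1, "timestamp": 3.0}]
denser = scale_timestamps(toy_rows, 2 / 3)

LAST_TS_AT_TWO_THIRDS = 30.711333333333332  # last_timestamp of coder_200_ts0667
base_last_ts = LAST_TS_AT_TWO_THIRDS / (2 / 3)  # implied last timestamp at scale 1.0
for scale in (2 / 3, 1.0, 2.0, 3.0):
    print(f"scale={scale:.3f}: last arrival at about {base_last_ts * scale:.3f}s")
```

Only a scale above 1 spreads the same requests over a longer window and so lowers the arrival rate.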

Frontier RS9:

Config: configs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667.json
Run: runs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667
Runtime: 460 seconds
Status: incomplete

vLLM dash1 TP1:

Run: runs/vllm_gpu_smoke_20260625_dash1/tp1_coder200_ts0667_uncapped
Runtime: vLLM 0.11.1
Host/GPU: dash1, one NVIDIA H20 via CUDA_VISIBLE_DEVICES=0
vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
Replay wall time after engine startup: 242.813 seconds

metric	Frontier RS9 200 ts=2/3	vLLM TP1 200 ts=2/3	vLLM / Frontier
completed requests	176/200	200/200	not aligned
preemption events	0	26	not aligned
TTFT p50	20.580s	34.563s	1.68
TTFT p95	96.718s	120.804s	1.25
TPOT p50	0.0584s	0.0515s	0.88
TPOT p95	0.2359s	0.2535s	1.07
E2E p50	73.207s	83.622s	1.14
E2E p95	189.240s	183.727s	0.97
requests/s	0.583	0.824	1.41
total tok/s	3913.4	4864.8	1.24
decode tok/s	593.3	737.5	1.24

Restricting vLLM to the same 176 request ids where Frontier emits complete metrics gives:

metric	Frontier RS9 176 rows	vLLM same 176 ids	vLLM / Frontier
TTFT p50	20.580s	27.896s	1.36
TTFT p95	96.718s	120.804s	1.25
TPOT p50	0.0584s	0.0520s	0.89
TPOT p95	0.2359s	0.2539s	1.08
E2E p50	73.207s	82.645s	1.13
E2E p95	189.240s	183.727s	0.97

Prefix/cache comparison:

The unbounded ReplayServe trie estimate for all 200 rows is 270,336 hit tokens / 1,002,154 prompt tokens = 0.2698 token hit ratio.
vLLM finite-cache scheduler signal for all 200 rows: first-start computed: ratio 0.1392, last-start ratio 0.2126, max-per-request ratio 0.2129.
On the same 176 request ids where Frontier emits complete metrics, vLLM first-start ratio is 0.1487, last-start ratio is 0.1926, and max-per-request ratio is 0.1927.
Frontier RS9 reports replayserve_token_hit_ratio=0.1703 and frontier_block_hit_ratio=0.1700, again between vLLM first-start and last/max finite-cache scheduler signals.

Missing request ids in RS9:

70,78,80,86,87,89,96,101,102,105,125,126,131,132,135,144,145,146,147,148,149,150,151,198

Current 200-request judgment:

Reducing the request count from 500 to 200 substantially reduces TTFT and E2E tails, but scale=2/3 is still a dense-arrival stress test. vLLM TTFT p95 is still 120.8s.
Frontier timing is closer than the old 100-request dummy/profile baselines: TPOT p50/p95 and E2E p50/p95 are broadly aligned.
Completion/preemption remains the blocking fidelity issue: Frontier drops 24 rows and reports zero preemptions; vLLM completes all 200 and logs 26 preemptions across 22 repeated-start request ids.
To actually reduce queue pressure for the same first 200 requests, use a timestamp scale greater than 1. The follow-up scale 2 and 3 runs below do this.

2026-06-25 200-Request Timestamp Scale 2 and 3

Generated two more first-200 fixtures from qwen_coder_blksz_16.jsonl:

fixture	timestamp scale	last timestamp	max total tokens
`coder_200_ts2`	2.0	92.134s	18,985
`coder_200_ts3`	3.0	138.201s	18,985

These are the intended lower-arrival-pressure runs. The request payloads are the same first 200 rows as coder_200_ts0667; only timestamps differ.

Frontier RS10:

Config: configs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3.json
Run: runs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3
Status: incomplete for both fixtures

vLLM dash1 TP1:

Runs: runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts2_uncapped and runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts3_uncapped
Runtime: vLLM 0.11.1
Host/GPU: dash1, one NVIDIA H20 via CUDA_VISIBLE_DEVICES=0
vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16

Run-level comparison:

metric	Frontier scale 2	vLLM scale 2	Frontier scale 3	vLLM scale 3
completed requests	182/200	200/200	184/200	200/200
preemption events	0	43	0	16
TTFT p50	8.118s	9.217s	0.779s	1.166s
TTFT p95	67.850s	69.211s	35.918s	32.258s
TPOT p50	0.0544s	0.0497s	0.0544s	0.0462s
TPOT p95	0.0747s	0.0686s	0.0773s	0.0714s
E2E p50	51.118s	55.002s	40.641s	33.213s
E2E p95	162.607s	142.338s	158.434s	122.789s
requests/s	0.593	0.803	0.544	0.780
total tok/s	3846.1	4742.5	3490.6	4608.1
decode tok/s	583.1	719.0	529.2	698.6

Restricting vLLM to the same request ids where Frontier emits complete metrics:

metric	Frontier scale 2 182 rows	vLLM same 182 ids	Frontier scale 3 184 rows	vLLM same 184 ids
TTFT p50	8.118s	8.574s	0.779s	0.945s
TTFT p95	67.850s	68.934s	35.918s	32.258s
TPOT p50	0.0544s	0.0501s	0.0544s	0.0461s
TPOT p95	0.0747s	0.0686s	0.0773s	0.0679s
E2E p50	51.118s	53.263s	40.641s	33.213s
E2E p95	162.607s	141.264s	158.434s	122.789s

Prefix/cache comparison:

metric	scale 2	scale 3
unbounded trace-side token hit ratio	0.2698	0.2698
vLLM first-start `computed:` ratio	0.1433	0.1471
vLLM last-start `computed:` ratio	0.2382	0.1968
vLLM max-per-request `computed:` ratio	0.2383	0.1998
Frontier `replayserve_token_hit_ratio`	0.1448	0.1523
Frontier `frontier_block_hit_ratio`	0.1446	0.1521

Current scale 2 and 3 judgment:

The scale=2 and scale=3 runs reduce queueing as intended. vLLM TTFT p95 drops from 120.8s at scale=2/3 to 69.2s at scale=2 and 32.3s at scale=3.
scale=3 is the first run where vLLM p50 TTFT is near 1s. The p95 is still high because long prompts and KV pressure remain, but the severe all-request queueing seen in the 500-request run is much reduced.
Frontier timing is now close on TTFT and TPOT for the completed-row subset, especially at scale=2. However, Frontier still misses completion/cache rows and still reports zero preemptions.
Completion/preemption is therefore still the main Frontier fidelity blocker: scale=2 misses 18 rows and vLLM logs 43 preemptions; scale=3 misses 16 rows and vLLM logs 16 preemptions.

2026-06-25 Frontier Lifecycle Fix For RS10

The missing-row root cause was Frontier lifecycle handling after decode-phase preemption. Missing requests were preempted after prefill/decode had started, then left in this inconsistent state:

preempted=True
is_prefill_complete=True
num_processed_tokens=0
scheduled=False
completed=False

The next waiting admission computed num_new_tokens=0 and removed the request from the queue, so sequential simulation drained with fewer completed requests but no remaining scheduler work.

The updated ReplayServe Frontier patch now:

replays decode-phase preemption by treating already-produced tokens as the next prefill segment and the remaining tokens as decode work;
preserves unfinished zero-token waiting requests instead of silently dropping them;
reports metrics against user-facing trace prompt/output lengths after runtime token splitting;
fails fast if sequential mode drains before all generated requests complete.
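
The failure mode and the fix can be sketched with a toy request object. This is an illustration under stated assumptions, not the actual Frontier patch: the class, the prompt_tokens/decode_tokens/produced_tokens fields, and the choice to recompute prompt plus produced tokens as the new prefill segment are one reading of the list above. Only the flag names in the inconsistent state come from the note.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: int
    prompt_tokens: int             # runtime prefill length (may be re-split)
    decode_tokens: int             # runtime decode length
    produced_tokens: int = 0       # decode tokens produced before preemption
    num_processed_tokens: int = 0
    is_prefill_complete: bool = False
    preempted: bool = False
    completed: bool = False


def new_tokens_on_admission(req: Request) -> int:
    """Prefill tokens a waiting request still needs when it is admitted."""
    if req.is_prefill_complete:
        return 0
    return req.prompt_tokens - req.num_processed_tokens


def replay_preempted(req: Request) -> Request:
    """Re-split a decode-phase preemption so the request has work again."""
    req.prompt_tokens += req.produced_tokens  # recompute produced tokens as prefill
    req.decode_tokens -= req.produced_tokens  # only the remainder is decode work
    req.produced_tokens = 0
    req.num_processed_tokens = 0
    req.is_prefill_complete = False
    return req


def check_drained(requests):
    """Fail fast if the simulation drains with unfinished requests."""
    unfinished = [r.request_id for r in requests if not r.completed]
    if unfinished:
        raise RuntimeError(f"drained with unfinished requests: {unfinished}")


# Hypothetical request in the inconsistent state listed above.
stuck = Request(request_id=70, prompt_tokens=100, decode_tokens=50,
                produced_tokens=20, is_prefill_complete=True, preempted=True)
stuck_new_tokens = new_tokens_on_admission(stuck)  # zero work: would be dropped
fixed_new_tokens = new_tokens_on_admission(replay_preempted(stuck))
print(stuck_new_tokens, fixed_new_tokens, stuck.decode_tokens)
```

Because the runtime lengths change after re-splitting, metrics need the original trace prompt/output lengths kept separately, which is what the third item in the list addresses.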

Verification runs:

run	old completion	fixed completion	Frontier preemptions	prefix token hit ratio	status
`coder_200_ts2`	182/200	200/200	33	0.2313	pass
`coder_200_ts3`	184/200	200/200	20	0.2177	pass

Fixed-run paths:

runs/rs10_preemption_replay_fix_ts2/frontier_h20_tp1_profile_full32k/coder_200_ts2/vllm_kv_15281_profile_full32k
runs/rs10_preemption_replay_fix_ts3/frontier_h20_tp1_profile_full32k/coder_200_ts3/vllm_kv_15281_profile_full32k

Updated run-level comparison:

metric	Frontier scale 2 fixed	vLLM scale 2	Frontier scale 3 fixed	vLLM scale 3
completed requests	200/200	200/200	200/200	200/200
preemption events	33	43	20	16
TTFT p50	9.595s	9.217s	1.001s	1.166s
TTFT p95	77.503s	69.211s	45.947s	32.258s
TPOT p50	0.0542s	0.0497s	0.0534s	0.0462s
TPOT p95	0.0665s	0.0686s	0.0686s	0.0714s
E2E p50	61.458s	55.002s	44.761s	33.213s
E2E p95	174.484s	142.338s	154.548s	122.789s
requests/s	0.594	0.803	0.574	0.780
total tok/s	3506.3	4742.5	3390.0	4608.1
decode tok/s	531.6	719.0	513.9	698.6

Current judgment after the fix:

The completion/preemption lifecycle blocker for RS10 is fixed: both scale 2 and scale 3 now emit 200 request rows and complete postprocess.
Frontier preemption counts are now of the same order as vLLM's, but not exact: scale 2 has 33 vs 43 events, and scale 3 has 20 vs 16 events.
Prefix hit ratio changed materially because preempted requests now replay and re-enter prefix-cache admission instead of disappearing. It is no longer valid to compare the old incomplete RS10 prefix ratios against vLLM.
Timing remains close in TPOT but Frontier is still slower in aggregate throughput, about 0.74x of vLLM total/decode token throughput for both scale 2 and scale 3. TTFT/E2E tails are still worse after the completion set becomes complete.
Remaining gap is no longer "missing metrics rows"; it is scheduler/preemption fidelity plus CPU/scheduler/CUDA-graph timing calibration.

2026-06-25 H20 TP2/TP4 Comparison

The TP2/TP4 comparison uses the same first-200 coder_200_ts2 and coder_200_ts3 fixtures. The vLLM runs are on dash1 with /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B, vLLM 0.11.1, max_model_len=32768, max_num_seqs=64, max_num_batched_tokens=32768, gpu_memory_utilization=0.85, prefix caching on, and chunked prefill on.

vLLM measured KV capacity:

TP	KV tokens	KV blocks
2	1,104,880	69,055
4	2,833,232	177,077

Frontier RS12 uses explicit matching KV blocks and fresh H20 TP2/TP4 profiles:

Config: configs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3.json
Run: runs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3
Profile source: dash1:/home/admin/cpfs/wjh/replayserve_frontier_profiles/h20_tp2_tp4_qwen3_30ba3b_full32k_20260625_true_mixed
Linear/MoE profiles cover TP2/TP4 up to 32768 tokens.
Attention profile covers TP2/TP4 standard attention plus 1260 true-mixed prefill+decode rows. The true-mixed rows are required; standard attention alone fails with missing attn_decode_in_mixed predictions.

All four Frontier runs completed 200/200 request rows. Neither Frontier nor the vLLM TP2/TP4 logs reported preemption events. Prefix token hit ratio is exactly the same in Frontier postprocess and vLLM's trace-side synthetic estimate: 0.2697549478.