Add ReplayServe Frontier vLLM alignment report

2026-06-25 17:10:30 +08:00
commit a99bd00782
63 changed files with 17033 additions and 0 deletions
--- a/docs/rs4_frontier_h20_tp1_alignment.md
+++ b/docs/rs4_frontier_h20_tp1_alignment.md
@@ -0,0 +1,740 @@
+# RS4 Frontier H20 TP1 Alignment
+
+This note compares Frontier H20 TP1 against the real vLLM TP1 run on dash2 for
+`coder_100`.
+
+## Setup
+
+Real vLLM:
+
+- Runtime: vLLM 0.11.1
+- Host/GPU: dash2, NVIDIA H20
+- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
+- TP: 1
+- KV capacity: 244,496 tokens = 15,281 blocks at block size 16
+- Run: `runs/vllm_gpu_smoke_20260624/tp1_coder100_uncapped`
+
+Frontier:
+
+- Frontier root: `/tmp/replayserve-frontier-rs1b`
+- Frontier commit: `d9cfeb6d8791fbf2f295dd9744c56a666171776e`
+- Model config name: `qwen3-a3b-30b-moe`
+- Device: `h20`
+- Network node SKU: `h20_dgx`
+- TP: `attn_tensor_parallel_size=1`, `moe_tensor_parallel_size=1`,
+  `moe_expert_parallel_size=1`
+- `max_tokens_in_batch=32768`, `batch_size_cap=64`, block size 16
+- Prefix cache on, chunked prefill on
+- `long_prefill_token_threshold=32768`
+- Config: `configs/rs4_frontier_h20_tp1.json`
+- Run: `runs/rs4_frontier_h20_tp1_20260624`
+
+The high long-prefill threshold is deliberate. An earlier Frontier run with the
+threshold set to 64 under-counted prefix hits because long prompts were admitted
+in 64-token chunks, unlike the current real vLLM run.
+
+## KV Capacity
+
+| run | KV blocks | KV tokens | note |
+|---|---:|---:|---|
+| Frontier `planner_kv` | 17,385 | 278,160 | Frontier H20 memory planner, no non-KV overhead |
+| Frontier `vllm_kv_15281` | 15,281 | 244,496 | Explicitly matched to real vLLM TP1 |
+| vLLM TP1 | 15,281 | 244,496 | From vLLM memory profiling |
+
+Only `vllm_kv_15281` therefore has the same KV block count as real vLLM TP1;
+`planner_kv` has 2,104 more blocks.
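+
+As a quick consistency check on the table, KV tokens should equal KV blocks
+times the block size of 16. A minimal sketch (the dictionary and variable names
+are only illustrative):
+
+```python
+BLOCK_SIZE = 16  # tokens per KV block, as stated in the setup
+
+# KV block counts copied from the table above.
+kv_blocks = {
+    "planner_kv": 17_385,
+    "vllm_kv_15281": 15_281,
+}
+
+# Token capacity is simply blocks * block size.
+kv_tokens = {name: blocks * BLOCK_SIZE for name, blocks in kv_blocks.items()}
+for name, tokens in kv_tokens.items():
+    print(f"{name}: {tokens:,} tokens")
+```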
+
+## Results
+
+| run | completed | prefix hit tokens / ratio | preemptions | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 | decode tok/s |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| Frontier `planner_kv` | 96/100 | 110,608 / 0.240691 | 0 | 0.986/128.991s | 0.582/0.582s | 279.092/1706.675s | 19.4 |
+| Frontier `vllm_kv_15281` | 92/100 | 103,168 / 0.242542 | 0 | 0.964/182.639s | 0.582/0.582s | 305.290/1765.347s | 19.4 |
+| vLLM TP1 real | 100/100 | 119,152 / 0.251082 sidecar estimate | 8 | 4.503/29.060s | 0.066/0.621s | 41.841/97.366s | 567.4 |
+
+The latency/throughput columns are not calibrated. Frontier still uses dummy
+execution timing, so its constant TPOT (0.582s at both p50 and p95) is a
+simulator artifact.
+
+## Prefix Admission Check
+
+Real vLLM preempts requests at TP1, so the sidecar's theoretical prefix-hit
+estimate is not a valid observed comparator for every request. The observed
+vLLM scheduler signal is the first `computed:` value logged in `stdout.log`
+when each request starts.
+
+Using first-start `computed:` tokens:
+
+| Frontier run | compared rows | Frontier computed sum | vLLM first-start computed sum | mismatch |
+|---|---:|---:|---:|---:|
+| `planner_kv` | 96 | 110,608 | 108,208 | one request differs |
+| `vllm_kv_15281` | 92 | 103,168 | 103,168 | exact match |
+
+So with the KV block count explicitly matched, Frontier's prefix-cache admission
+matches real vLLM TP1 for every row where Frontier emits complete cache metrics.
+
+## Current Alignment Judgment
+
+Aligned:
+
+- H20 device and Qwen3-30B-A3B structural model config can run in Frontier.
+- TP1 scheduler knobs can be matched.
+- KV block count can be matched explicitly at 15,281 blocks.
+- First-admission prefix-cache hit tokens match real vLLM TP1 on completed rows
+  when KV blocks are explicit.
+
+Not aligned:
+
+- Frontier emits complete request/cache metrics for only 92/100 requests in the
+  explicit-KV run, while vLLM completes 100/100.
+- Frontier reports 0 preemptions; real vLLM TP1 reports 8 preemptions across 5
+  repeated-start requests.
+- Frontier timing is not comparable because it still uses dummy execution
+  prediction. The current latency/throughput gap is expected and does not
+  indicate an error in a calibrated simulator.
+
+Next work:
+
+- Treat RS6 (the 32K-profile run described below) as the current profiled
+  baseline and investigate why it omits complete latency/cache metrics for
+  requests `70`, `77`, `88`, and `90`.
+- Instrument Frontier's vLLM V1 scheduler around KV block allocation, free-block
+  count, and preemption victim selection. Real vLLM TP1 has 8 preemptions, while
+  Frontier still reports 0 with the same explicit 15,281-block capacity.
+- Add a per-request Frontier/vLLM comparator that reports TTFT/TPOT/E2E ratios,
+  prefix hits, and completion/preemption status on the same request ids.
+- Calibrate CPU/scheduler/CUDA-graph effects separately from op profile timing;
+  RS6 removed the 4096-token linear/MoE extrapolation as the primary explanation
+  for the remaining gap.
+
+## Performance Gap
+
+Use Frontier `vllm_kv_15281` as the current aligned-KV simulator point. This
+matches the real vLLM TP1 KV block count, but it still uses Frontier dummy
+execution timing.
+
+| metric | Frontier H20 TP1 explicit KV | real vLLM H20 TP1 | gap |
+|---|---:|---:|---:|
+| completed requests | 92/100 | 100/100 | not aligned |
+| TTFT p50 | 0.964s | 4.503s | Frontier 0.21x real |
+| TTFT p95 | 182.639s | 29.060s | Frontier 6.28x real |
+| TPOT p50 | 0.582s | 0.066s | Frontier 8.81x real |
+| TPOT p95 | 0.582s | 0.621s | Frontier 0.94x real |
+| E2E p50 | 305.290s | 41.841s | Frontier 7.30x real |
+| E2E p95 | 1765.347s | 97.366s | Frontier 18.13x real |
+| RPS | 0.0217 | 0.6880 | vLLM 31.74x Frontier |
+| decode tok/s | 19.4 | 567.4 | vLLM 29.20x Frontier |
+
+Interpretation:
+
+- The prefix admission path is close after explicit KV matching, but performance
+  is not calibrated.
+- Frontier uses dummy execution timing; its TPOT is nearly constant at 582 ms,
+  while real vLLM TP1 has p50 TPOT 66 ms and p95 TPOT 621 ms.
+- Frontier does not reproduce real vLLM's TP1 preemption behavior: real vLLM had
+  8 preemptions, while Frontier reported 0.
+- Frontier emits complete request/cache metrics for only 92 rows in this run,
+  so p95 and throughput are not yet on the same request set.
+- The TTFT error has mixed sign: Frontier p50 TTFT is too optimistic (0.21x
+  real), but p95 TTFT is far too pessimistic (6.28x real). This is consistent
+  with uncalibrated execution timing plus different queue/preemption dynamics.
+
+## RS5 Profiled Frontier Timing
+
+Frontier does support replacing dummy timing with real CSV profiles through the
+random-forest execution-time predictor. The required non-dummy flags are wired
+in `tools/run_frontier_sweep.py`, and the active profiled config is
+`configs/rs5_frontier_h20_tp1_profile.json`.
+
+Profile data collected on dash2 H20 TP1:
+
+- Linear ops: `linear_op.csv`, CUDA event, max tokens 4096.
+- Attention: `attention_combined.csv`, CUDA event, max sequence/chunk 18000,
+  with 15417 standard rows plus 612 true-mixed rows. Online replay needs the
+  true-mixed rows to train `attn_prefill_mixed` and `attn_decode_in_mixed`.
+- MoE: `moe_vllm_fused.csv`, CUDA event, max tokens 4096, vLLM fused MoE
+  backend.
+
+Frontier vLLM 0.11.1 profiling needed local compatibility patches in
+`patches/frontier-vllm-0.11.1-profiling-compat.patch`:
+
+- RoPE helper fallback when vLLM 0.11.1 `get_rope()` no longer accepts the
+  legacy `rotary_dim` keyword.
+- `_get_config_dtype_str` fallback for vLLM fused MoE config dtype.
+- `ReplicatedLinear(disable_tp=True)` fallback to torch `Linear` when vLLM TP
+  group is not initialized in standalone profiling.
+- `fused_topk()` variable-return handling.
+- `invoke_fused_moe_kernel()` 0.11.1 signature compatibility.
+
+The first profiled MoE attempt used Frontier's `frontier_loop` backend and was
+not faithful to vLLM serving. It predicted `moe_grouped_gemm` at about 16 ms for
+24 tokens and 19 ms for 1024 tokens, causing TPOT around 0.93 s. The vLLM fused
+MoE profile predicts about 0.32 ms for 24 tokens and 0.87 ms for 1024 tokens.
+
+| run | completed | prefix hit ratio | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 | total tok/s | decode tok/s |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| Frontier dummy `vllm_kv_15281` | 92/100 | 0.2422 | 0.964/182.639s | 0.582/0.582s | 305.290/1765.347s | 131.3 | 19.4 |
+| Frontier profiled `frontier_loop` MoE | 93/100 | 0.2492 | 3.320/310.235s | 0.930/1.767s | 492.097/2038.538s | 165.9 | 24.6 |
+| Frontier profiled vLLM fused MoE | 97/100 | 0.2376 | 0.355/13.695s | 0.056/0.098s | 27.032/119.019s | 2056.7 | 304.5 |
+| Frontier profiled vLLM fused MoE, linear/MoE 32K | 96/100 | 0.2484 | 0.909/12.763s | 0.057/0.146s | 30.939/119.636s | 2348.9 | 347.8 |
+| vLLM TP1 real | 100/100 | 0.2511 | 4.503/29.060s | 0.066/0.621s | 41.841/97.366s | 3832.3 | 567.4 |
+
+Current judgment:
+
+- The profiled vLLM fused MoE run is the first useful timing baseline. TPOT p50
+  is close to real vLLM, but throughput is still about 54% of real vLLM and
+  TTFT/E2E tails do not align.
+- After extending linear and MoE profiles to 32768 tokens and adding
+  `prefill_hot` MoE rows, the cache hit ratio is nearly aligned
+  (0.2484 vs vLLM 0.2511), throughput improves to about 61% of real vLLM, and
+  TTFT p50 moves from 0.08x to 0.20x of real vLLM. This confirms that the 4096
+  profile ceiling was a real source of error.
+- Prefix/cache accounting remains close but not exact: the 32K profiled run
+  emits complete cache metrics for 96/100 requests, with token hit ratio 0.2488
+  vs vLLM's sidecar estimate 0.2511.
+- Frontier still reports zero preemptions, while real vLLM TP1 had 8 preemption
+  events. This affects completion set, TTFT tail, and E2E tail.
+- The remaining gaps are no longer explained by the linear/MoE 4096-token
+  extrapolation alone. The 32K run still has TTFT p50 at 0.20x, TTFT p95 at
+  0.44x, TPOT p95 at 0.23x, and throughput at 0.61x of real vLLM. This points
+  to missing CPU/scheduler/CUDA-graph modeling plus Frontier's scheduler and
+  completion/preemption fidelity.
+- The 32K run still completes only 96/100 requests in latency/cache metrics
+  (`70`, `77`, `88`, `90` missing), while real vLLM completes 100/100. This is
+  a Frontier lifecycle/metrics or scheduler-fidelity issue to debug separately.
+
+## 2026-06-24 Follow-Up
+
+Handled in the ReplayServe harness:
+
+- `tools/run_frontier_sweep.py` now passes an absolute metrics output path into
+  Frontier. Frontier runs with `cwd=/tmp/replayserve-frontier-rs1b`, so relative
+  metrics paths can otherwise be written under the Frontier scratch directory
+  instead of ReplayServe's run directory.
+- `tools/postprocess_frontier_smoke.py` now emits a `completion` block with
+  `completed_requests`, `total_requests`, and `missing_latency_request_ids`.
+- `tools/aggregate_runs.py` now marks a run as `incomplete` when postprocess
+  reports missing latency rows. The latest RS6 summary is therefore incomplete,
+  not a clean pass.
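+
+The completion check can be pictured roughly as follows. The three field names
+match the `completion` block described above; the function names, the
+`request_id` key, the 0-based ids, and the `pass` label are assumptions for
+illustration and do not reproduce the actual tool code.
+
+```python
+def build_completion_block(expected_ids, latency_rows):
+    """Summarize which expected requests have a latency row (hypothetical helper)."""
+    expected = list(expected_ids)
+    seen = {row["request_id"] for row in latency_rows}
+    missing = sorted(set(expected) - seen)
+    return {
+        "completed_requests": len(expected) - len(missing),
+        "total_requests": len(expected),
+        "missing_latency_request_ids": missing,
+    }
+
+
+def run_status(completion):
+    """Mark a run incomplete as soon as any latency row is missing."""
+    return "incomplete" if completion["missing_latency_request_ids"] else "pass"
+
+
+# Example shaped like RS6: 100 requests, four of them without latency rows.
+rows = [{"request_id": i} for i in range(100) if i not in {70, 77, 88, 90}]
+block = build_completion_block(range(100), rows)
+print(block["completed_requests"], run_status(block))
+```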
+
+Latest RS6 vs real vLLM TP1 after the 32K profile and harness fixes:
+
+| metric | Frontier RS6 32K profile | real vLLM TP1 | Frontier / vLLM |
+|---|---:|---:|---:|
+| completed requests | 96/100 | 100/100 | 0.96 |
+| prefix token hit ratio | 0.2488 | 0.2511 | 0.99 |
+| preemption events | 0 | 8 | 0.00 |
+| TTFT p50 | 0.909s | 4.503s | 0.20 |
+| TTFT p95 | 12.763s | 29.060s | 0.44 |
+| TPOT p50 | 0.0569s | 0.0661s | 0.86 |
+| TPOT p95 | 0.146s | 0.621s | 0.23 |
+| E2E p50 | 30.939s | 41.841s | 0.74 |
+| E2E p95 | 119.636s | 97.366s | 1.23 |
+| total tok/s | 2348.9 | 3832.3 | 0.61 |
+| decode tok/s | 347.8 | 567.4 | 0.61 |
+
+Preemption experiment:
+
+- A local trial enabled waiting-admission preemption in Frontier Phase 2. It did
+  produce preemption events, but it was not a valid alignment improvement:
+  Frontier completed only 79/100 requests and amplified the early-decode
+  disappearance pattern. That config was removed from `configs/`.
+- This means the remaining preemption gap is not just "turn on preemption in
+  Phase 2". Frontier's batch/runtime-epoch lifecycle needs a deeper fix before
+  its preemption behavior can be considered faithful to vLLM TP1.
+
+Current interpretation:
+
+- Prefix/cache replay is close: token-weighted prefix hit ratio is within about
+  1% (relative) of vLLM's synthetic replay estimate.
+- Completion/preemption is not aligned. Requests `70`, `77`, `88`, and `90`
+  begin decode in RS6 but never reach completion metrics; vLLM completes all
+  100 requests and logs 8 preemption events.
+- Timing is partially useful but not fully calibrated. Linear and MoE profiles
+  now cover the trace's long-prefill range up to 32768 tokens, so the old 4096
+  extrapolation is no longer the main explanation. The remaining TTFT/TPOT/E2E
+  gap likely comes from missing CPU/scheduler overhead, decode CUDA graph
+  modeling, and Frontier scheduler lifecycle differences.
+
+## 2026-06-25 500-Request Stress
+
+Generated `traces/fixtures/coder_500` from the first 500 rows of
+`qwen_coder_blksz_16.jsonl`:
+
+- `row_count=500`
+- `max_total_tokens=21318`
+- `overflow_count=0`
+- `partial_final_block_rows=466`
+
+Frontier RS8 used the same H20 TP1 Qwen3-30B-A3B full32K profile and explicit
+KV block count as RS6:
+
+- Config:
+  `configs/rs8_frontier_h20_tp1_profile_full32k_coder500.json`
+- Run:
+  `runs/rs8_frontier_h20_tp1_profile_full32k_coder500_20260625`
+- Runtime: 492 seconds
+- Status: incomplete
+
+| metric | Frontier RS6 100 reqs | Frontier RS8 500 reqs |
+|---|---:|---:|
+| completed requests | 96/100 | 439/500 |
+| missing latency/cache rows | 4 | 61 |
+| prefix token hit ratio | 0.2488 | 0.1192 |
+| preemption events | 0 | 0 |
+| TTFT p50/p95 | 0.909/12.763s | 136.776/340.237s |
+| TPOT p50/p95 | 0.0569/0.146s | 0.0564/0.0894s |
+| E2E p50/p95 | 30.939/119.636s | 177.800/397.291s |
+| total tok/s | 2348.9 | 4733.7 |
+| decode tok/s | 347.8 | 656.2 |
+
+Missing request ids in RS8:
+
+```text
+70,77,88,90,103,106,134,135,142,143,153,154,176,178,183,184,186,188,210,211,216,222,245,246,263,272,274,278,291,298,299,300,320,325,334,335,347,348,363,367,373,374,393,399,403,409,412,413,414,433,434,437,439,450,453,460,469,475,476,479,497
+```
+
+The incomplete-row issue grows with load: 4/100 missing (4%) in RS6 becomes
+61/500 missing (12.2%) in RS8. This makes RS8 invalid for final performance
+claims, but useful as a stress signal for Frontier lifecycle/metrics fidelity.
+
+The lower prefix hit ratio is not by itself proof of adapter failure. The
+unbounded trace-side trie estimate for `coder_500` is 0.3868 token hit ratio,
+but the H20 TP1 configuration has finite KV capacity (`num_blocks=15281`, about
+244K tokens). The 500-request window has 2.7M prompt tokens, so KV eviction can
+substantially reduce real prefix hits. The dash1 vLLM run below is the current
+finite-cache comparator for whether Frontier's behavior is faithful.
+
+Real vLLM TP1 500 was first attempted on dash2 with the same settings as
+`tp1_coder100_uncapped` (`max_num_seqs=64`, `max_num_batched_tokens=32768`,
+`gpu_memory_utilization=0.85`, `CUDA_VISIBLE_DEVICES=0`), but did not start
+because dash2 was already occupied by eight existing `agentic-kvc` vLLM serve
+processes on ports 8000-8007. Each H20 had about 89GB allocated, and vLLM failed
+with free memory below the required 0.85 utilization target. Those processes
+were not killed; the temporary ReplayServe GPU lock was released.
+
+A replacement vLLM TP1 500 run completed on dash1:
+
+- Run:
+  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder500_uncapped`
+- Runtime: vLLM 0.11.1
+- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
+- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
+- Command knobs: `TP=1`, `max_model_len=32768`, `max_num_seqs=64`,
+  `max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
+  prefix caching on, chunked prefill on
+- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
+- Replay wall time after engine startup: 595.116 seconds
+- Process elapsed including model load/startup: 2026-06-25T03:08:18Z to
+  2026-06-25T03:19:41Z
+
+| metric | Frontier RS8 500 reqs | vLLM TP1 500 reqs | vLLM / Frontier |
+|---|---:|---:|---:|
+| completed requests | 439/500 | 500/500 | not aligned |
+| preemption events | 0 | 63 | not aligned |
+| repeated/preempted request ids | 0 | 57 | not aligned |
+| TTFT p50 | 136.776s | 185.658s | 1.36 |
+| TTFT p95 | 340.237s | 375.895s | 1.10 |
+| TPOT p50 | 0.0564s | 0.0498s | 0.88 |
+| TPOT p95 | 0.0894s | 0.0919s | 1.03 |
+| E2E p50 | 177.800s | 224.270s | 1.26 |
+| E2E p95 | 397.291s | 417.356s | 1.05 |
+| requests/s | 0.661 | 0.840 | 1.27 |
+| total tok/s | 4733.7 | 5282.9 | 1.12 |
+| decode tok/s | 656.2 | 732.3 | 1.12 |
+
+Because Frontier emits latency/cache rows for only 439 requests, the latency
+comparison above mixes Frontier's completed subset with vLLM's complete 500-row
+run. Restricting vLLM to the same 439 request ids gives:
+
+| metric | Frontier RS8 439 rows | vLLM same 439 ids | vLLM / Frontier |
+|---|---:|---:|---:|
+| TTFT p50 | 136.776s | 169.968s | 1.24 |
+| TTFT p95 | 340.237s | 375.760s | 1.10 |
+| TPOT p50 | 0.0564s | 0.0498s | 0.88 |
+| TPOT p95 | 0.0894s | 0.1071s | 1.20 |
+| E2E p50 | 177.800s | 218.606s | 1.23 |
+| E2E p95 | 397.291s | 416.110s | 1.05 |
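+
+A same-id comparison like the one above can be sketched as follows. The
+nearest-rank percentile and the `{request_id: latency}` input shape are
+assumptions; the actual comparator may interpolate percentiles differently.
+
+```python
+import math
+
+
+def percentile(values, pct):
+    """Nearest-rank percentile (an assumed definition, not the harness's)."""
+    ordered = sorted(values)
+    rank = max(1, math.ceil(pct / 100 * len(ordered)))
+    return ordered[rank - 1]
+
+
+def same_id_ratio(frontier, vllm, pct):
+    """Return the vLLM / Frontier percentile ratio on shared request ids only."""
+    shared = sorted(frontier.keys() & vllm.keys())
+    f = percentile([frontier[i] for i in shared], pct)
+    v = percentile([vllm[i] for i in shared], pct)
+    return v / f
+
+
+# Toy data: Frontier has no row for request 3, so it is excluded on both sides.
+frontier_ttft = {0: 1.0, 1: 2.0, 2: 4.0}
+vllm_ttft = {0: 1.5, 1: 2.5, 2: 5.0, 3: 60.0}
+print(same_id_ratio(frontier_ttft, vllm_ttft, 50))
+```
+
+Filtering before computing percentiles keeps a slow request that only one side
+completed from skewing the tail on that side.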
+
+Prefix/cache comparison needs careful metric naming:
+
+- The unbounded ReplayServe trie estimate for all 500 rows is 1,047,632 hit
+  tokens / 2,708,110 prompt tokens = 0.3868 token hit ratio.
+- vLLM's finite-cache scheduler log is much lower under this pressure:
+  first-start `computed:` ratio is 0.0979, last-start ratio is 0.1643, and
+  max-per-request ratio is 0.1655.
+- On the same 439 request ids where Frontier emits complete metrics, vLLM's
+  first-start `computed:` ratio is 0.1050, last-start ratio is 0.1665, and
+  max-per-request ratio is 0.1679.
+- Frontier RS8 reports `replayserve_token_hit_ratio=0.1192` and
+  `frontier_block_hit_ratio=0.1191`, which lies between vLLM's first-start and
+  last-start finite-cache ratios but far below the unbounded trace-side estimate.
+
+Current 500-request judgment:
+
+- Frontier's timing profile is now in the right broad range for this stressed
+  H20 TP1 run: TPOT p50/p95 and E2E p95 are close to vLLM, and aggregate token
+  throughput is within about 12%.
+- The run is still not a faithful simulator result because completion and
+  preemption diverge: Frontier drops 61 latency/cache rows and reports zero
+  preemptions, while real vLLM completes all 500 requests and logs 63
+  preemption events across 57 request ids.
+- The 500-request trace invalidates the earlier use of the unbounded sidecar
+  prefix estimate as the primary comparator. Finite KV capacity, eviction, and
+  preemption must be part of the prefix-cache replay metric.
+
+ReplayServe TODO:
+
+- Treat incomplete Frontier runs as invalid for final performance claims unless
+  the comparison explicitly reports the missing request set.
+- Keep the focused Frontier debug guard in the local patch: sequential mode now
+  fails if `completed_requests < total_requests` at drain time and reports the
+  missing request state.
+- Add a comparator that reports both unbounded trace-side prefix reuse and
+  finite-cache observed reuse from vLLM scheduler logs; do not compare
+  Frontier's finite-cache hit ratio directly to the unbounded trie estimate.
+- Profile or import vLLM CPU overhead records for H20 TP1 before enabling
+  `skip_cpu_overhead_modeling=false`; without those records Frontier falls back
+  to zero CPU overhead.
+- Collect kernel-only/decode-CUDA-graph timing profiles before using
+  `decode_cuda_graph_mode=full_decode_only`; the current RS6 profile is CUDA
+  event/eager timing.
+
+## 2026-06-25 200-Request Timestamp Scale 2/3
+
+Generated `traces/fixtures/coder_200_ts0667` from the first 200 rows of
+`qwen_coder_blksz_16.jsonl`, with each timestamp multiplied by `2/3` in the
+fixture files:
+
+- `row_count=200`
+- `timestamp_scale=0.6666666666666666`
+- `last_timestamp=30.711333333333332`
+- `max_total_tokens=18985`
+- `partial_final_block_rows=182`
+
+Important: in the current replay semantics, a smaller timestamp scale makes
+arrivals denser. Scaling by 2/3 shrinks the arrival window for the first 200
+requests from about 46.1s to 30.7s. This does not reduce queue pressure relative
+to the same 200 requests at scale 1.0; it only reduces the request count
+relative to the 500-request stress.
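+
+The effect of the scale factor can be illustrated with a small sketch. The
+`timestamp` field name and the helper are assumptions about the fixture rows;
+the unscaled last arrival of about 46.067s is inferred from `last_timestamp`
+above divided by 2/3.
+
+```python
+def scale_timestamps(rows, scale):
+    """Return copies of rows with each arrival timestamp multiplied by scale."""
+    return [{**row, "timestamp": row["timestamp"] * scale} for row in rows]
+
+
+# Two toy arrivals; the second mimics the unscaled last arrival.
+rows = [{"timestamp": 0.0}, {"timestamp": 46.067}]
+dense = scale_timestamps(rows, 2 / 3)  # scale < 1: shorter window, denser arrivals
+sparse = scale_timestamps(rows, 2.0)   # scale > 1: longer window, lower pressure
+
+print(round(dense[-1]["timestamp"], 3), round(sparse[-1]["timestamp"], 3))
+```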
+
+Frontier RS9:
+
+- Config:
+  `configs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667.json`
+- Run:
+  `runs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667`
+- Runtime: 460 seconds
+- Status: incomplete
+
+vLLM dash1 TP1:
+
+- Run:
+  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder200_ts0667_uncapped`
+- Runtime: vLLM 0.11.1
+- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
+- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
+- Replay wall time after engine startup: 242.813 seconds
+
+| metric | Frontier RS9 200 ts=2/3 | vLLM TP1 200 ts=2/3 | vLLM / Frontier |
+|---|---:|---:|---:|
+| completed requests | 176/200 | 200/200 | not aligned |
+| preemption events | 0 | 26 | not aligned |
+| TTFT p50 | 20.580s | 34.563s | 1.68 |
+| TTFT p95 | 96.718s | 120.804s | 1.25 |
+| TPOT p50 | 0.0584s | 0.0515s | 0.88 |
+| TPOT p95 | 0.2359s | 0.2535s | 1.07 |
+| E2E p50 | 73.207s | 83.622s | 1.14 |
+| E2E p95 | 189.240s | 183.727s | 0.97 |
+| requests/s | 0.583 | 0.824 | 1.41 |
+| total tok/s | 3913.4 | 4864.8 | 1.24 |
+| decode tok/s | 593.3 | 737.5 | 1.24 |
+
+Restricting vLLM to the same 176 request ids where Frontier emits complete
+metrics gives:
+
+| metric | Frontier RS9 176 rows | vLLM same 176 ids | vLLM / Frontier |
+|---|---:|---:|---:|
+| TTFT p50 | 20.580s | 27.896s | 1.36 |
+| TTFT p95 | 96.718s | 120.804s | 1.25 |
+| TPOT p50 | 0.0584s | 0.0520s | 0.89 |
+| TPOT p95 | 0.2359s | 0.2539s | 1.08 |
+| E2E p50 | 73.207s | 82.645s | 1.13 |
+| E2E p95 | 189.240s | 183.727s | 0.97 |
+
+Prefix/cache comparison:
+
+- The unbounded ReplayServe trie estimate for all 200 rows is 270,336 hit
+  tokens / 1,002,154 prompt tokens = 0.2698 token hit ratio.
+- vLLM finite-cache scheduler signal for all 200 rows: first-start `computed:`
+  ratio 0.1392, last-start ratio 0.2126, max-per-request ratio 0.2129.
+- On the same 176 request ids where Frontier emits complete metrics, vLLM
+  first-start ratio is 0.1487, last-start ratio is 0.1926, and max-per-request
+  ratio is 0.1927.
+- Frontier RS9 reports `replayserve_token_hit_ratio=0.1703` and
+  `frontier_block_hit_ratio=0.1700`, again between vLLM first-start and
+  last/max finite-cache scheduler signals.
+
+Missing request ids in RS9:
+
+```text
+70,78,80,86,87,89,96,101,102,105,125,126,131,132,135,144,145,146,147,148,149,150,151,198
+```
+
+Current 200-request judgment:
+
+- Reducing the request count from 500 to 200 substantially reduces TTFT and E2E
+  tails, but `scale=2/3` is still a dense-arrival stress test. vLLM TTFT p95 is
+  still 120.8s.
+- Frontier timing is closer than the old 100-request dummy/profile baselines:
+  TPOT p50/p95 and E2E p50/p95 are broadly aligned.
+- Completion/preemption remains the blocking fidelity issue: Frontier drops 24
+  rows and reports zero preemptions; vLLM completes all 200 and logs 26
+  preemptions across 22 repeated-start request ids.
+- To actually reduce queue pressure for the same first 200 requests, use a
+  timestamp scale greater than 1. The follow-up scale 2 and 3 runs below do
+  this.
+
+## 2026-06-25 200-Request Timestamp Scale 2 and 3
+
+Generated two more first-200 fixtures from `qwen_coder_blksz_16.jsonl`:
+
+| fixture | timestamp scale | last timestamp | max total tokens |
+|---|---:|---:|---:|
+| `coder_200_ts2` | 2.0 | 92.134s | 18,985 |
+| `coder_200_ts3` | 3.0 | 138.201s | 18,985 |
+
+These are the intended lower-arrival-pressure runs. The request payloads are the
+same first 200 rows as `coder_200_ts0667`; only timestamps differ.
+
+Frontier RS10:
+
+- Config:
+  `configs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3.json`
+- Run:
+  `runs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3`
+- Status: incomplete for both fixtures
+
+vLLM dash1 TP1:
+
+- Runs:
+  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts2_uncapped`
+  and `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts3_uncapped`
+- Runtime: vLLM 0.11.1
+- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
+- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
+
+Run-level comparison:
+
+| metric | Frontier scale 2 | vLLM scale 2 | Frontier scale 3 | vLLM scale 3 |
+|---|---:|---:|---:|---:|
+| completed requests | 182/200 | 200/200 | 184/200 | 200/200 |
+| preemption events | 0 | 43 | 0 | 16 |
+| TTFT p50 | 8.118s | 9.217s | 0.779s | 1.166s |
+| TTFT p95 | 67.850s | 69.211s | 35.918s | 32.258s |
+| TPOT p50 | 0.0544s | 0.0497s | 0.0544s | 0.0462s |
+| TPOT p95 | 0.0747s | 0.0686s | 0.0773s | 0.0714s |
+| E2E p50 | 51.118s | 55.002s | 40.641s | 33.213s |
+| E2E p95 | 162.607s | 142.338s | 158.434s | 122.789s |
+| requests/s | 0.593 | 0.803 | 0.544 | 0.780 |
+| total tok/s | 3846.1 | 4742.5 | 3490.6 | 4608.1 |
+| decode tok/s | 583.1 | 719.0 | 529.2 | 698.6 |
+
+Restricting vLLM to the same request ids where Frontier emits complete metrics:
+
+| metric | Frontier scale 2 182 rows | vLLM same 182 ids | Frontier scale 3 184 rows | vLLM same 184 ids |
+|---|---:|---:|---:|---:|
+| TTFT p50 | 8.118s | 8.574s | 0.779s | 0.945s |
+| TTFT p95 | 67.850s | 68.934s | 35.918s | 32.258s |
+| TPOT p50 | 0.0544s | 0.0501s | 0.0544s | 0.0461s |
+| TPOT p95 | 0.0747s | 0.0686s | 0.0773s | 0.0679s |
+| E2E p50 | 51.118s | 53.263s | 40.641s | 33.213s |
+| E2E p95 | 162.607s | 141.264s | 158.434s | 122.789s |
+
+Prefix/cache comparison:
+
+| metric | scale 2 | scale 3 |
+|---|---:|---:|
+| unbounded trace-side token hit ratio | 0.2698 | 0.2698 |
+| vLLM first-start `computed:` ratio | 0.1433 | 0.1471 |
+| vLLM last-start `computed:` ratio | 0.2382 | 0.1968 |
+| vLLM max-per-request `computed:` ratio | 0.2383 | 0.1998 |
+| Frontier `replayserve_token_hit_ratio` | 0.1448 | 0.1523 |
+| Frontier `frontier_block_hit_ratio` | 0.1446 | 0.1521 |
+
+Current scale 2 and 3 judgment:
+
+- The user's intended `scale=2` and `scale=3` runs do reduce queueing. vLLM
+  TTFT p95 drops from 120.8s at `scale=2/3` to 69.2s at `scale=2` and 32.3s at
+  `scale=3`.
+- `scale=3` is the first run where vLLM p50 TTFT is near 1s. The p95 is still
+  high because long prompts and KV pressure remain, but the severe all-request
+  queueing seen in the 500-request run is much reduced.
+- Frontier timing is now close on TTFT and TPOT for the completed-row subset,
+  especially at `scale=2`. However, Frontier still misses completion/cache rows
+  and still reports zero preemptions.
+- Completion/preemption is therefore still the main Frontier fidelity blocker:
+  `scale=2` misses 18 rows and vLLM logs 43 preemptions; `scale=3` misses 16 rows
+  and vLLM logs 16 preemptions.
+
+## 2026-06-25 Frontier Lifecycle Fix For RS10
+
+The missing-row root cause was Frontier lifecycle handling after decode-phase
+preemption. Missing requests were preempted after prefill/decode had started,
+then left in this inconsistent state:
+
+```text
+preempted=True
+is_prefill_complete=True
+num_processed_tokens=0
+scheduled=False
+completed=False
+```
+
+The next waiting admission computed `num_new_tokens=0` and removed the request
+from the queue, so sequential simulation drained with fewer completed requests
+but no remaining scheduler work.
+
+The updated ReplayServe Frontier patch now:
+
+- replays decode-phase preemption by treating already-produced tokens as the
+  next prefill segment and the remaining tokens as decode work;
+- preserves unfinished zero-token waiting requests instead of silently dropping
+  them;
+- reports metrics against user-facing trace prompt/output lengths after runtime
+  token splitting;
+- fails fast if sequential mode drains before all generated requests complete.
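+
+The replay split and the drain guard can be pictured with a toy model. This is
+a simplified sketch, not Frontier's scheduler code: the `Request` class, its
+fields, and both helpers are hypothetical, and it assumes the prompt is
+recomputed together with the already-produced tokens.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Request:
+    prompt_len: int          # user-facing prompt tokens from the trace
+    output_len: int          # user-facing output tokens from the trace
+    produced: int = 0        # output tokens generated before preemption
+    completed: bool = False
+
+
+def replay_split(req):
+    """After decode-phase preemption, replay produced tokens as prefill work
+    and keep the remaining output tokens as decode work."""
+    prefill_tokens = req.prompt_len + req.produced
+    decode_tokens = req.output_len - req.produced
+    return prefill_tokens, decode_tokens
+
+
+def check_drain(requests):
+    """Fail fast if the simulation drains while requests are unfinished."""
+    missing = [i for i, req in enumerate(requests) if not req.completed]
+    if missing:
+        raise RuntimeError(f"drained with unfinished requests: {missing}")
+
+
+req = Request(prompt_len=1000, output_len=200, produced=50)
+print(replay_split(req))
+```
+
+Metrics would still be reported against `prompt_len` and `output_len`, the
+user-facing lengths, rather than the split lengths.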
+
+Verification runs:
+
+| run | old completion | fixed completion | Frontier preemptions | prefix token hit ratio | status |
+|---|---:|---:|---:|---:|---|
+| `coder_200_ts2` | 182/200 | 200/200 | 33 | 0.2313 | pass |
+| `coder_200_ts3` | 184/200 | 200/200 | 20 | 0.2177 | pass |
+
+Fixed-run paths:
+
+- `runs/rs10_preemption_replay_fix_ts2/frontier_h20_tp1_profile_full32k/coder_200_ts2/vllm_kv_15281_profile_full32k`
+- `runs/rs10_preemption_replay_fix_ts3/frontier_h20_tp1_profile_full32k/coder_200_ts3/vllm_kv_15281_profile_full32k`
+
+Updated run-level comparison:
+
+| metric | Frontier scale 2 fixed | vLLM scale 2 | Frontier scale 3 fixed | vLLM scale 3 |
+|---|---:|---:|---:|---:|
+| completed requests | 200/200 | 200/200 | 200/200 | 200/200 |
+| preemption events | 33 | 43 | 20 | 16 |
+| TTFT p50 | 9.595s | 9.217s | 1.001s | 1.166s |
+| TTFT p95 | 77.503s | 69.211s | 45.947s | 32.258s |
+| TPOT p50 | 0.0542s | 0.0497s | 0.0534s | 0.0462s |
+| TPOT p95 | 0.0665s | 0.0686s | 0.0686s | 0.0714s |
+| E2E p50 | 61.458s | 55.002s | 44.761s | 33.213s |
+| E2E p95 | 174.484s | 142.338s | 154.548s | 122.789s |
+| requests/s | 0.594 | 0.803 | 0.574 | 0.780 |
+| total tok/s | 3506.3 | 4742.5 | 3390.0 | 4608.1 |
+| decode tok/s | 531.6 | 719.0 | 513.9 | 698.6 |
+
+Current judgment after the fix:
+
+- The completion/preemption lifecycle blocker for RS10 is fixed: both scale 2
+  and scale 3 now emit 200 request rows and complete postprocess.
+- Frontier's preemption count is now of the same order of magnitude as vLLM's,
+  but not exact: scale 2 is 33 vs 43 events, scale 3 is 20 vs 16 events.
+- Prefix hit ratio changed materially because preempted requests now replay and
+  re-enter prefix-cache admission instead of disappearing. It is no longer valid
+  to compare the old incomplete RS10 prefix ratios against vLLM.
+- Timing remains close in TPOT but Frontier is still slower in aggregate
+  throughput, about 0.74x of vLLM total/decode token throughput for both scale 2
+  and scale 3. TTFT/E2E tails are still worse after the completion set becomes
+  complete.
+- The remaining gap is no longer "missing metrics rows"; it is
+  scheduler/preemption fidelity plus CPU/scheduler/CUDA-graph timing calibration.
+
+## 2026-06-25 H20 TP2/TP4 Comparison
+
+The TP2/TP4 comparison uses the same first-200 `coder_200_ts2` and
+`coder_200_ts3` fixtures. The vLLM runs are on dash1 with
+`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`, vLLM 0.11.1,
+`max_model_len=32768`, `max_num_seqs=64`,
+`max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
+prefix caching on, and chunked prefill on.
+
+vLLM measured KV capacity:
+
+| TP | KV tokens | KV blocks |
+|---:|---:|---:|
+| 2 | 1,104,880 | 69,055 |
+| 4 | 2,833,232 | 177,077 |
+
+Frontier RS12 uses explicit matching KV blocks and fresh H20 TP2/TP4 profiles:
+
+- Config:
+  `configs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3.json`
+- Run:
+  `runs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3`
+- Profile source:
+  `dash1:/home/admin/cpfs/wjh/replayserve_frontier_profiles/h20_tp2_tp4_qwen3_30ba3b_full32k_20260625_true_mixed`
+- Linear/MoE profiles cover TP2/TP4 up to 32768 tokens.
+- Attention profile covers TP2/TP4 standard attention plus 1260 true-mixed
+  prefill+decode rows. The true-mixed rows are required; standard attention
+  alone fails with missing `attn_decode_in_mixed` predictions.
+
+All four Frontier runs completed 200/200 request rows. Neither Frontier nor the
+vLLM TP2/TP4 logs reported preemption events. Prefix token hit ratio is exactly
+the same in Frontier postprocess and vLLM's trace-side synthetic estimate:
+0.2697549478.
+
+Run-level comparison:
+
+| TP | fixture | metric | Frontier | vLLM | Frontier / vLLM |
+|---:|---|---|---:|---:|---:|
+| 2 | `coder_200_ts2` | requests/s | 0.776 | 1.278 | 0.61 |
+| 2 | `coder_200_ts2` | total tok/s | 4581 | 7547 | 0.61 |
+| 2 | `coder_200_ts2` | decode tok/s | 695 | 1144 | 0.61 |
+| 2 | `coder_200_ts2` | TTFT p50/p95 | 0.269/6.745s | 0.225/0.715s | 1.20/9.43 |
+| 2 | `coder_200_ts2` | TPOT p50/p95 | 0.0430/0.0529s | 0.0300/0.0434s | 1.43/1.22 |
+| 2 | `coder_200_ts2` | E2E p50/p95 | 26.05/106.76s | 16.45/72.53s | 1.58/1.47 |
+| 4 | `coder_200_ts2` | requests/s | 0.853 | 1.536 | 0.55 |
+| 4 | `coder_200_ts2` | total tok/s | 5035 | 9073 | 0.55 |
+| 4 | `coder_200_ts2` | decode tok/s | 763 | 1376 | 0.55 |
+| 4 | `coder_200_ts2` | TTFT p50/p95 | 0.098/0.386s | 0.170/1.420s | 0.57/0.27 |
+| 4 | `coder_200_ts2` | TPOT p50/p95 | 0.0337/0.0384s | 0.0163/0.0283s | 2.06/1.36 |
+| 4 | `coder_200_ts2` | E2E p50/p95 | 18.65/84.94s | 9.26/43.62s | 2.01/1.95 |
+| 2 | `coder_200_ts3` | requests/s | 0.688 | 1.088 | 0.63 |
+| 2 | `coder_200_ts3` | total tok/s | 4062 | 6426 | 0.63 |
+| 2 | `coder_200_ts3` | decode tok/s | 616 | 974 | 0.63 |
+| 2 | `coder_200_ts3` | TTFT p50/p95 | 0.134/0.574s | 0.154/0.627s | 0.87/0.92 |
+| 2 | `coder_200_ts3` | TPOT p50/p95 | 0.0394/0.0467s | 0.0191/0.0280s | 2.07/1.67 |
+| 2 | `coder_200_ts3` | E2E p50/p95 | 21.79/101.59s | 9.96/53.98s | 2.19/1.88 |
+| 4 | `coder_200_ts3` | requests/s | 0.737 | 1.254 | 0.59 |
+| 4 | `coder_200_ts3` | total tok/s | 4355 | 7403 | 0.59 |
+| 4 | `coder_200_ts3` | decode tok/s | 660 | 1122 | 0.59 |
+| 4 | `coder_200_ts3` | TTFT p50/p95 | 0.089/0.346s | 0.100/0.318s | 0.89/1.09 |
+| 4 | `coder_200_ts3` | TPOT p50/p95 | 0.0311/0.0358s | 0.0094/0.0128s | 3.30/2.80 |
+| 4 | `coder_200_ts3` | E2E p50/p95 | 16.90/83.01s | 5.55/27.87s | 3.05/2.98 |
+
+TP scaling comparison:
+
+| fixture | metric | Frontier TP4 / TP2 | vLLM TP4 / TP2 |
+|---|---|---:|---:|
+| `coder_200_ts2` | total tok/s speedup | 1.10 | 1.20 |
+| `coder_200_ts2` | decode tok/s speedup | 1.10 | 1.20 |
+| `coder_200_ts2` | TPOT p50 ratio (lower is better) | 0.78 | 0.54 |
+| `coder_200_ts3` | total tok/s speedup | 1.07 | 1.15 |
+| `coder_200_ts3` | decode tok/s speedup | 1.07 | 1.15 |
+| `coder_200_ts3` | TPOT p50 ratio (lower is better) | 0.79 | 0.49 |
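+
+The values in this table can be reproduced from the run-level numbers. The
+sketch below uses the `coder_200_ts2` rows; the dictionary layout is only
+illustrative. Each TPOT value is a TP4 / TP2 ratio, so 0.54 corresponds to a
+46% lower TPOT p50.
+
+```python
+# coder_200_ts2 values copied from the run-level comparison table.
+runs = {
+    "Frontier": {"tok_s": (4581, 5035), "tpot_p50": (0.0430, 0.0337)},
+    "vLLM": {"tok_s": (7547, 9073), "tpot_p50": (0.0300, 0.0163)},
+}
+
+scaling = {}
+for name, run in runs.items():
+    tp2_tok, tp4_tok = run["tok_s"]
+    tp2_tpot, tp4_tpot = run["tpot_p50"]
+    speedup = tp4_tok / tp2_tok        # higher is better
+    tpot_ratio = tp4_tpot / tp2_tpot   # lower is better
+    scaling[name] = (round(speedup, 2), round(tpot_ratio, 2))
+    print(f"{name}: speedup {speedup:.2f}x, TPOT p50 ratio {tpot_ratio:.2f} "
+          f"({1 - tpot_ratio:.0%} lower)")
+```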
+
+Current TP2/TP4 judgment:
+
+- Functional replay is aligned for this setting: same request rows, same
+  trace-side prefix reuse ratio, matched vLLM KV block counts, and no
+  preemption on either side.
+- Absolute performance is not aligned. Frontier reports only 55-63% of vLLM
+  total/decode throughput across TP2/TP4, and TPOT is especially pessimistic at
+  TP4.
+- Relative TP scaling is also under-estimated. vLLM's TP4 improves TPOT p50 by
+  about 46-51% over TP2, while Frontier improves by only about 21-22%.
+- The remaining gap is therefore not caused by missing rows, prefix-cache
+  mismatch, or KV capacity mismatch in these runs. It points to timing model
+  limitations: missing CPU/scheduler/CUDA-graph modeling, random-forest profile
+  interpolation error, and imperfect modeling of vLLM's TP-dependent decode
+  execution path.
+- These RS12 results are acceptable for continuing ReplayServe integration and
+  rough qualitative trends. They are not yet acceptable as calibrated absolute
+  performance predictions.