RS2 Simulator Comparison
Checked on 2026-06-24. RS2 compares simulator capabilities and first local ReplayServe results. It does not start the RS3 sweep and does not make performance-quality claims.
Sources
| Source | Local path | Commit / HEAD | RS2 use |
|---|---|---|---|
| ReplayServe | `/home/gahow/phd/replayserve` | local RS0/RS1/RS1B artifacts | Adapter, fixtures, runs, postprocess summaries |
| Qwen trace | `/home/gahow/phd/qwen-bailian-usagetraces-anon` | `5f7439c51ec248a0c585f7d90a41a6f57773b912` | Source `qwen_coder_blksz_16.jsonl` |
| Frontier canonical | `/tmp/toc-llm-sim-research/Frontier` | `d9cfeb6d8791fbf2f295dd9744c56a666171776e` | RS1 fixed config and source inspection |
| Frontier patched scratch | `/tmp/replayserve-frontier-rs1b` | base `d9cfeb6...` plus `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch` | RS1B unblock verification |
| Vidur | `/tmp/toc-llm-sim-research/vidur` | `8383d2935bc62723a212090baa9f98ada206fc14` | Source inspection for baseline capability |
| AIConfigurator | `/tmp/toc-llm-sim-research/aiconfigurator` | `e46ece7510e727fafefb8212e5846172145a30ea` | Source/docs inspection for config-estimator capability |
Key local evidence:
- Frontier trace replay: `/tmp/toc-llm-sim-research/Frontier/frontier/request_generator/trace_replay_request_generator.py`
- Frontier prefix-cache validation: `/tmp/toc-llm-sim-research/Frontier/frontier/scheduler/cluster_scheduler/base_cluster_scheduler.py`
- Frontier prefix-cache request metrics: `/tmp/toc-llm-sim-research/Frontier/frontier/metrics/metrics_store.py`
- Vidur trace replay: `/tmp/toc-llm-sim-research/vidur/vidur/request_generator/trace_replay_request_generator.py`
- Vidur request entity: `/tmp/toc-llm-sim-research/vidur/vidur/entities/request.py`
- AIConfigurator CLI/docs: `/tmp/toc-llm-sim-research/aiconfigurator/README.md` and `/tmp/toc-llm-sim-research/aiconfigurator/src/aiconfigurator/cli/main.py`
Capability Matrix
| Capability | Frontier | Vidur | AIConfigurator |
|---|---|---|---|
| Per-request timestamp replay | Yes. `trace_replay` consumes `arrived_at` and RS1 runs `simulation_mode=online`. | Yes. `TraceReplayRequestGenerator` consumes `arrived_at`. | No per-request replay. CLI consumes workload summaries such as `--isl`, `--osl`, and SLA targets. |
| Input/output length replay | Yes. Consumes `num_prefill_tokens` and `num_decode_tokens`. Frontier can clip overflows internally, so the ReplayServe adapter validates before the run. | Yes. Consumes `num_prefill_tokens` and `num_decode_tokens`; current code clips prefill length if total exceeds max tokens. | Only summary lengths, not per-request traces. |
| Explicit `block_hash_ids` / prefix KV reuse replay | Yes. Current Frontier parses `session_id` and `block_hash_ids`, validates they are present when prefix caching is enabled, and applies prefix-cache accounting in vLLM v1. RS1B needs a patch for prefix cache plus chunked prefill under pressure. | No in this checkout. Request objects carry arrival/prefill/decode lengths and processed-token state, but no `session_id`, block hashes, or explicit prefix-reuse replay. README says prefix-caching work lives on a canary branch with sharp edges, not this main checkout. | No. `--prefix` is an aggregate prefix length/workload parameter, not an explicit hash/session replay model. |
| Online arrival pattern | Yes. RS1 fixed config uses online mode and trace replay. | Yes for trace replay baseline. | No event-level online replay. It estimates candidate deployments from summary workload/SLA inputs. |
| Prefix-cache hit-ratio output | Yes. Frontier emits request metrics including cached prefill tokens, query blocks, and hit blocks when present. ReplayServe postprocess adds token-weighted hit ratio using sidecar partial-block counts. | No native prefix-hit ratio in current main because no explicit prefix replay. | No prefix-hit replay metric. |
| TTFT / TPOT / E2E / throughput output | Yes. Request and system metrics are emitted under Frontier metrics dirs. RS1 uses dummy execution predictor, so values are plumbing-only. | Yes. Vidur metrics include request E2E, prefill/TTFT-style, decode-normalized, and system metrics. Fidelity depends on matching profiles. | Yes as estimates: best throughput, per-GPU throughput, per-user throughput, TTFT, TPOT, and request latency. These are configuration-search outputs, not replay observations. |
| TP / EP / DP / config knobs | Yes. Frontier has model, device, network device, attention TP/DP, MoE TP/EP, PP, scheduler, block, batch, prefix-cache, chunked-prefill, and memory-planner knobs. | Partial. Vidur exposes model, device, network device, tensor parallel size, pipeline stages, scheduler and batch/KV knobs. The inspected checkout is not a faithful EP/DP prefix-replay candidate. | Strong for config search. Supports TP/PP/DP and expert TP/EP style search across supported backends/systems. |
| Arbitrary model/hardware/config boundary | Not arbitrary. Model/device configs may exist, but reliable latency/throughput requires compute/network profiles, scheduler support, matching parallel semantics, and bug-free code paths. RS1 Qwen3-32B on A800 uses dummy predictor because public A800 dense Qwen3-32B compute profiles are absent. | Not arbitrary. README lists supported model/device/profile combinations; docs say profiling on actual GPUs is needed for new model/hardware fidelity. Current public device/profile coverage does not match A800 Qwen3-32B. | Not arbitrary. It depends on supported model families, backend/system databases, and estimate modes. The support matrix includes H100/H200/B200/GB200/A100 variants; no A800 built-in silicon database was found. |
| Needs profile/calibration | Yes for performance claims. Dummy predictor plus analytical comm is only a smoke. | Yes for performance claims. New model/device requires compute, network, and CPU-overhead style profiling. | Yes for production-quality estimates. It relies on collected silicon/perf databases or rough estimate modes; README warns memory/results need validation. |
First Results
Frontier Canonical
Fixed RS1 config:
- `simulation_mode=online`
- `sys_arch=co-location`
- `replica_scheduler=vllm_v1`
- `device=a800`
- `network_device=a800_dgx`
- `model_name=Qwen/Qwen3-32B`
- `attn_tensor_parallel_size=2`
- dummy execution predictor
- analytical communication backend
- `trace_request_generator_config_max_tokens=32768`
- prefix caching enabled
- block size 16
- chunked prefill enabled
- batch cap 128
- max batch tokens 32768
- KV capacity from Frontier memory planner with `gpu_memory_utilization=0.9` and `non_kv_cache_overhead_bytes=0`
Results:
| Run | Result | Evidence | Notes |
|---|---|---|---|
| `coder_100` | Pass | `runs/rs1/coder_100/` | Frontier block hit ratio 0.04948661841440835; ReplayServe token-weighted hit ratio 0.04956232588915065; no preemptions. |
| `coder_2000` | Fail | `runs/rs1/coder_2000/` | Exit code 1 after 4 seconds with `ValueError: Request 194 already scheduled`. Traceback ends at Frontier vLLM v1 waiting scheduling calling `request.on_cache_hit(prefix_cached_tokens)`. |
The canonical failure was minimized in `docs/rs1_frontier_blocker.md`: truncating
the trace to the first N requests, N=192 passes and N=193 fails with
`Request 192 already scheduled`, and larger fixed-config probes fail around the
same preempted prefix-cache path. Disabling prefix caching, disabling chunked
prefill, or setting a high long-prefill threshold each avoids the failure, so the
cause is not a bad Qwen trace row.
Frontier Patched Scratch
Patch:
- File: `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`
- Documentation: `docs/rs1_frontier_patch.md`
- Scratch checkout only: `/tmp/replayserve-frontier-rs1b`
The patch resets preempted request scheduler/cache-hit admission state before the request re-enters the waiting path. As of 2026-06-25 it also replays decode-phase preemption by moving already-produced tokens into the next prefill segment, preserves user-facing lengths for metrics, and fails fast if sequential simulation drains before all generated requests complete. It keeps the canonical Frontier checkout clean.
| Run | Result | Evidence | Hit ratios | Preemption | Memory planner facts |
|---|---|---|---|---|---|
| N=193 fixed config | Pass | `runs/rs1b/patched/n193_fixed_v2/` | Frontier block 0.12458971786194112; ReplayServe token-weighted 0.12476981408429115 | 5 total events, 1 request | `num_blocks=36902`, `gpu_memory_utilization=0.9`, non-KV overhead 0, weight shard estimate 26.953125 GiB |
| `coder_100` fixed config | Pass | `runs/rs1b/patched/coder_100/` | Frontier block 0.04948661841440835; ReplayServe token-weighted 0.04956232588915065 | 0 | same derived memory planner point |
| `coder_2000` fixed config | Pass | `runs/rs1b/patched/coder_2000/` | Frontier block 0.12318930248025924; ReplayServe token-weighted 0.12332978217090633 | 35940 total events, 1061 requests | same derived memory planner point |
Metrics caveats:
- These are plumbing-smoke metrics. The run uses dummy 1 ms execution time and analytical communication, not calibrated Qwen3-32B A800 compute profiles.
- `coder_2000` produced `request_metrics.csv` with 2000 rows, but 745 rows have blank request-level prefix-cache fields. ReplayServe token-weighted hit ratio therefore uses the 1255 rows with complete cache metrics. Frontier's aggregate prefix-cache statistics in the same summary also report 1255 requests with cache metrics. This is acceptable for blocker-removal evidence, but it is not a final metric-quality result.
- No allocation/OOM pressure log lines were found in the postprocess summaries.
Vidur
No Vidur baseline run was executed for RS2. Based on source inspection, Vidur is useful as an arrival-and-length baseline candidate, but it cannot faithfully compare ReplayServe prefix reuse without additional code:
- `vidur/request_generator/trace_replay_request_generator.py` consumes `arrived_at`, `num_prefill_tokens`, and `num_decode_tokens`.
- `vidur/entities/request.py` stores arrival, prefill length, decode length, processed tokens, schedule/completion timestamps, and preemption state.
- The inspected request path does not carry `session_id`, `block_hash_ids`, or sidecar block-token accounting.
- Current Vidur trace replay clips prefill lengths when total tokens exceed `max_tokens`; ReplayServe must keep its own hard-fail validation if Vidur is used later as a length-only baseline.
Conclusion for Vidur in RS2: it can likely replay coder_100/coder_2000
arrival and length after a simple CSV compatibility conversion, but it would
measure a different workload because prefix KV reuse is absent.
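The hard-fail length validation mentioned above could look roughly like the following. This is a minimal sketch, not ReplayServe's actual adapter code: the `TraceRow` class and `validate_lengths` function are hypothetical names, and only the three column names come from the trace-replay inputs described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRow:
    """One replayed request (hypothetical container; fields mirror the trace columns)."""
    arrived_at: float
    num_prefill_tokens: int
    num_decode_tokens: int


def validate_lengths(rows, max_tokens):
    """Fail fast on any row a simulator would otherwise clip silently.

    Returns the number of validated rows; raises ValueError on the first bad row.
    """
    for index, row in enumerate(rows):
        if row.num_prefill_tokens < 0 or row.num_decode_tokens < 0:
            raise ValueError(f"row {index}: negative token count")
        total = row.num_prefill_tokens + row.num_decode_tokens
        if total > max_tokens:
            raise ValueError(
                f"row {index}: {total} tokens exceeds max_tokens={max_tokens}"
            )
    return len(rows)
```

Running a check like this before handing the converted CSV to a length-only baseline keeps clipped rows from silently changing the replayed workload.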
AIConfigurator
No AIConfigurator run was executed for RS2 because it is not a per-request replay simulator. Source/docs show it is a deployment/config search estimator:
- CLI examples take workload summaries such as `--isl`, `--osl`, `--prefix`, `--ttft`, `--tpot`, `--total-gpus`, `--system`, and model path.
- Outputs are best throughput, per-GPU throughput, per-user throughput, TTFT, TPOT, request latency, concurrency, and parallel deployment choices.
- It models operations and searches aggregated/disaggregated serving configurations using collected or estimated performance data.
Conclusion for AIConfigurator in RS2: it is useful for config candidates and reference sizing assumptions. It cannot directly compare faithful per-request prefix-hit replay on Qwen trace fixtures.
Metric Definitions
TTFT:
Time from request arrival to first generated token / prefill completion. Frontier
and Vidur both have request-level prefill/first-token style timing fields, but
RS1 Frontier values are not performance claims because the execution predictor is
dummy.
TPOT:
Decode time per output token. Tools differ on whether they report total decode
normalized by output tokens, inter-token latency, or a configured SLA target.
Use each tool's native field only within that tool unless calibrated against the
same serving definition.
E2E latency:
Completion time minus arrival time for one request.
Throughput:
Completed tokens or requests per unit time. AIConfigurator reports estimated
tokens/s style capacity; Frontier/Vidur report simulated metrics. RS1 Frontier
throughput is plumbing-only because compute is dummy.
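As an illustration of the per-request timing definitions, the sketch below derives TTFT, TPOT, and E2E latency from three timestamps. It assumes one particular TPOT convention (total decode time divided by output tokens); the function name and argument names are hypothetical and do not correspond to any tool's native fields.

```python
def request_timing(arrived_at, first_token_at, completed_at, num_decode_tokens):
    """Compute TTFT, TPOT, and E2E latency for one request.

    Assumption: TPOT is total decode time normalized by output tokens,
    which is only one of the conventions the tools may use.
    """
    if not (arrived_at <= first_token_at <= completed_at):
        raise ValueError("timestamps must be ordered: arrival, first token, completion")
    ttft = first_token_at - arrived_at
    e2e = completed_at - arrived_at
    if num_decode_tokens > 0:
        tpot = (completed_at - first_token_at) / num_decode_tokens
    else:
        tpot = None  # undefined when no output tokens were produced
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}
```

For example, a request arriving at 10.0 s, producing its first token at 10.5 s, and completing at 12.5 s with 4 output tokens gives TTFT 0.5 s, TPOT 0.5 s per token, and E2E 2.5 s under this convention.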
KV-cache hit ratio:
- Frontier native block-level ratio: `sum(request_prefix_cache_hit_blocks) / sum(request_prefix_cache_query_blocks)`.
- ReplayServe token-weighted ratio: use sidecar `block_token_counts` and count the first `request_prefix_cache_hit_blocks` blocks by true token count, so a partial final block contributes its actual token count instead of always 16.
For coder_2000 patched, both ratios are computed only for request rows with
complete cache fields because 745 request metric rows have blank cache fields.
This is a metrics completeness caveat, not evidence that the trace has invalid
hashes.
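The two hit-ratio definitions can be sketched as follows. This is an illustrative reading of the definitions, not the ReplayServe postprocess code: the `request_id` key, the `hit_ratios` function, and the shape of the sidecar mapping are assumptions, and the token-weighted denominator is assumed to be the true token count of the queried blocks. Rows with blank cache fields are skipped, matching the completeness caveat above.

```python
def _is_blank(value):
    """True for missing or empty CSV cells."""
    return value is None or str(value).strip() == ""


def hit_ratios(rows, block_token_counts):
    """Return block-level and token-weighted prefix-cache hit ratios.

    rows: dicts with 'request_id' (hypothetical key) plus the two Frontier
          fields request_prefix_cache_hit_blocks / request_prefix_cache_query_blocks.
    block_token_counts: request_id -> list of true token counts per block (sidecar).
    """
    hit_blocks = query_blocks = hit_tokens = query_tokens = rows_used = 0
    for row in rows:
        hit = row.get("request_prefix_cache_hit_blocks")
        query = row.get("request_prefix_cache_query_blocks")
        if _is_blank(hit) or _is_blank(query):
            continue  # incomplete cache metrics: excluded from both ratios
        hit, query = int(float(hit)), int(float(query))
        counts = block_token_counts[row["request_id"]]
        hit_blocks += hit
        query_blocks += query
        # A partial final block contributes its actual token count, not 16.
        hit_tokens += sum(counts[:hit])
        query_tokens += sum(counts[:query])  # assumed denominator
        rows_used += 1
    return {
        "block_ratio": hit_blocks / query_blocks if query_blocks else None,
        "token_ratio": hit_tokens / query_tokens if query_tokens else None,
        "rows_used": rows_used,
    }
```

The two ratios differ only when queried blocks are partially filled, which is consistent with the small gap between the Frontier and ReplayServe values reported in the results tables.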
Non-Comparable Items
- Frontier canonical and patched scratch are not equivalent artifacts. The patched result demonstrates an RS1 unblock path; it is not an upstream Frontier release.
- Frontier/Vidur simulator timings and AIConfigurator estimator timings are not directly comparable without shared profiles, calibration, and metric definitions.
- Prefix-reuse fidelity is not comparable across all three tools. Only Frontier currently consumes explicit block hash traces in the inspected checkouts.
- AIConfigurator's `prefix` workload parameter is not the same as ReplayServe `block_hash_ids`; it cannot recover session-level sharing or partial-block token accounting.
Conclusions
No single open-source implementation satisfies ReplayServe's target out of the box.
Frontier is the closest because it supports online trace replay, prefix-cache metadata, vLLM v1 style scheduler controls, memory planning, and request/system metrics. It is still not out of the box for ReplayServe: RS0 needed an adapter, RS1B needed a local patch or upstream fix for prefix cache plus chunked prefill, and performance-quality claims need Qwen3-32B A800 profiles/calibration.
Vidur can be a useful arrival-plus-length baseline, but not a faithful prefix KV reuse replay engine in the inspected checkout.
AIConfigurator can guide candidate deployment/config choices, but it is a workload-summary estimator rather than a per-request simulator.
Frontier also does not support arbitrary model plus arbitrary hardware plus arbitrary config in a performance-reliable sense. A model/device config may be accepted syntactically, but fidelity depends on compute profiles, network profiles, scheduler support, parallelism semantics, memory-planner assumptions, and bug surface. For RS1, A800 network profiles exist, but the public checkout does not provide dense Qwen3-32B A800 compute profiles, so latency/throughput remain plumbing smoke.
Next Steps
RS3 sweep prerequisites:
- Decide whether RS3 uses the local RS1B patch, waits for upstream Frontier, or carries both canonical and patched modes explicitly.
- Keep fixed-config smoke and any sweep configs separate from performance claims.
- Add a small run manifest/check script that records Frontier commit, patch status, fixture, command, and metric completeness.
- Treat the `coder_2000` blank cache fields as a metrics issue to investigate before using request-level hit ratios as a headline metric.
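A run manifest along the lines of the prerequisite above might be assembled as in this sketch. It is a hypothetical helper, not an existing ReplayServe script: the function names and manifest keys are assumptions, and only the two cache-field column names come from the Frontier request metrics discussed in this document.

```python
import csv
import io
import json

# Frontier request-metric columns used for the completeness check.
CACHE_FIELDS = (
    "request_prefix_cache_hit_blocks",
    "request_prefix_cache_query_blocks",
)


def cache_metric_completeness(csv_text):
    """Count rows in request_metrics.csv content that have all cache fields filled."""
    total = complete = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        total += 1
        if all((row.get(field) or "").strip() for field in CACHE_FIELDS):
            complete += 1
    return {"rows": total, "rows_complete": complete, "rows_blank": total - complete}


def build_manifest(frontier_commit, patch_applied, fixture, command, csv_text):
    """Collect the facts a run should record (keys are illustrative)."""
    return {
        "frontier_commit": frontier_commit,
        "patch_applied": bool(patch_applied),
        "fixture": fixture,
        "command": list(command),
        "metric_completeness": cache_metric_completeness(csv_text),
    }


if __name__ == "__main__":
    sample_csv = (
        "request_id,request_prefix_cache_hit_blocks,request_prefix_cache_query_blocks\n"
        "0,2,3\n"
        "1,,\n"
    )
    manifest = build_manifest(
        "d9cfeb6d8791fbf2f295dd9744c56a666171776e",
        True,
        "coder_100",
        ["<frontier run command>"],  # placeholder, not a real CLI invocation
        sample_csv,
    )
    print(json.dumps(manifest, indent=2))
```

In practice the commit could be read with `git rev-parse HEAD` in the Frontier checkout, and the manifest written next to each run directory so that canonical and patched runs stay distinguishable.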
RS4 calibration prerequisites:
- Collect or obtain dense `Qwen/Qwen3-32B` A800 compute profiles for the Frontier predictor path.
- Verify the A800 network profile and node SKU semantics match the target deployment.
- Add non-KV memory overhead assumptions from a real serving stack instead of using `0`.
- Validate simulator TTFT/TPOT/E2E/throughput against measured vLLM runs before making performance conclusions.
Patch path recommendation:
- Keep `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch` pinned in ReplayServe until an upstream Frontier fix is available; it now covers both the original prefix-cache/chunked-prefill preemption bug and the RS10 decode-phase preemption lifecycle bug.
- Open an upstream issue or PR with the RS1B minimal repro (`N=193`) and evidence from `docs/rs1_frontier_blocker.md`.
- Re-run `coder_100`, `N=193`, and `coder_2000` when changing Frontier commit or patch status.