RS2 Simulator Comparison

Checked on 2026-06-24. RS2 compares simulator capabilities and first local ReplayServe results. It does not start the RS3 sweep and does not make performance-quality claims.

Sources

| Source | Local path | Commit / HEAD | RS2 use |
| --- | --- | --- | --- |
| ReplayServe | /home/gahow/phd/replayserve | local RS0/RS1/RS1B artifacts | Adapter, fixtures, runs, postprocess summaries |
| Qwen trace | /home/gahow/phd/qwen-bailian-usagetraces-anon | 5f7439c51ec248a0c585f7d90a41a6f57773b912 | Source qwen_coder_blksz_16.jsonl |
| Frontier canonical | /tmp/toc-llm-sim-research/Frontier | d9cfeb6d8791fbf2f295dd9744c56a666171776e | RS1 fixed config and source inspection |
| Frontier patched scratch | /tmp/replayserve-frontier-rs1b | base d9cfeb6... plus patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch | RS1B unblock verification |
| Vidur | /tmp/toc-llm-sim-research/vidur | 8383d2935bc62723a212090baa9f98ada206fc14 | Source inspection for baseline capability |
| AIConfigurator | /tmp/toc-llm-sim-research/aiconfigurator | e46ece7510e727fafefb8212e5846172145a30ea | Source/docs inspection for config-estimator capability |

Key local evidence:

  • Frontier trace replay: /tmp/toc-llm-sim-research/Frontier/frontier/request_generator/trace_replay_request_generator.py
  • Frontier prefix-cache validation: /tmp/toc-llm-sim-research/Frontier/frontier/scheduler/cluster_scheduler/base_cluster_scheduler.py
  • Frontier prefix-cache request metrics: /tmp/toc-llm-sim-research/Frontier/frontier/metrics/metrics_store.py
  • Vidur trace replay: /tmp/toc-llm-sim-research/vidur/vidur/request_generator/trace_replay_request_generator.py
  • Vidur request entity: /tmp/toc-llm-sim-research/vidur/vidur/entities/request.py
  • AIConfigurator CLI/docs: /tmp/toc-llm-sim-research/aiconfigurator/README.md and /tmp/toc-llm-sim-research/aiconfigurator/src/aiconfigurator/cli/main.py

Capability Matrix

| Capability | Frontier | Vidur | AIConfigurator |
| --- | --- | --- | --- |
| Per-request timestamp replay | Yes. trace_replay consumes arrived_at and RS1 runs simulation_mode=online. | Yes. TraceReplayRequestGenerator consumes arrived_at. | No per-request replay. CLI consumes workload summaries such as --isl, --osl, and SLA targets. |
| Input/output length replay | Yes. Consumes num_prefill_tokens and num_decode_tokens. Frontier can clip overflows internally, so ReplayServe adapter validates before run. | Yes. Consumes num_prefill_tokens and num_decode_tokens; current code clips prefill length if total exceeds max tokens. | Only summary lengths, not per-request traces. |
| Explicit block_hash_ids / prefix KV reuse replay | Yes. Current Frontier parses session_id and block_hash_ids, validates they are present when prefix caching is enabled, and applies prefix-cache accounting in vLLM v1. RS1B needs a patch for prefix cache plus chunked prefill under pressure. | No in this checkout. Request objects carry arrival/prefill/decode lengths and processed-token state, but no session_id, block hashes, or explicit prefix-reuse replay. README says prefix-caching work lives on a canary branch with sharp edges, not this main checkout. | No. --prefix is an aggregate prefix length/workload parameter, not an explicit hash/session replay model. |
| Online arrival pattern | Yes. RS1 fixed config uses online mode and trace replay. | Yes for trace replay baseline. | No event-level online replay. It estimates candidate deployments from summary workload/SLA inputs. |
| Prefix-cache hit-ratio output | Yes. Frontier emits request metrics including cached prefill tokens, query blocks, and hit blocks when present. ReplayServe postprocess adds token-weighted hit ratio using sidecar partial-block counts. | No native prefix-hit ratio in current main because no explicit prefix replay. | No prefix-hit replay metric. |
| TTFT / TPOT / E2E / throughput output | Yes. Request and system metrics are emitted under Frontier metrics dirs. RS1 uses dummy execution predictor, so values are plumbing-only. | Yes. Vidur metrics include request E2E, prefill/TTFT-style, decode-normalized, and system metrics. Fidelity depends on matching profiles. | Yes as estimates: best throughput, per-GPU throughput, per-user throughput, TTFT, TPOT, and request latency. These are configuration-search outputs, not replay observations. |
| TP / EP / DP / config knobs | Yes. Frontier has model, device, network device, attention TP/DP, MoE TP/EP, PP, scheduler, block, batch, prefix-cache, chunked-prefill, and memory-planner knobs. | Partial. Vidur exposes model, device, network device, tensor parallel size, pipeline stages, scheduler and batch/KV knobs. The inspected checkout is not a faithful EP/DP prefix-replay candidate. | Strong for config search. Supports TP/PP/DP and expert TP/EP style search across supported backends/systems. |
| Arbitrary model/hardware/config boundary | Not arbitrary. Model/device configs may exist, but reliable latency/throughput requires compute/network profiles, scheduler support, matching parallel semantics, and bug-free code paths. RS1 Qwen3-32B on A800 uses dummy predictor because public A800 dense Qwen3-32B compute profiles are absent. | Not arbitrary. README lists supported model/device/profile combinations; docs say profiling on actual GPUs is needed for new model/hardware fidelity. Current public device/profile coverage does not match A800 Qwen3-32B. | Not arbitrary. It depends on supported model families, backend/system databases, and estimate modes. The support matrix includes H100/H200/B200/GB200/A100 variants; no A800 built-in silicon database was found. |
| Needs profile/calibration | Yes for performance claims. Dummy predictor plus analytical comm is only a smoke. | Yes for performance claims. New model/device requires compute, network, and CPU-overhead style profiling. | Yes for production-quality estimates. It relies on collected silicon/perf databases or rough estimate modes; README warns memory/results need validation. |

First Results

Frontier Canonical

Fixed RS1 config:

  • simulation_mode=online
  • sys_arch=co-location
  • replica_scheduler=vllm_v1
  • device=a800
  • network_device=a800_dgx
  • model_name=Qwen/Qwen3-32B
  • attn_tensor_parallel_size=2
  • dummy execution predictor
  • analytical communication backend
  • trace_request_generator_config_max_tokens=32768
  • prefix caching enabled
  • block size 16
  • chunked prefill enabled
  • batch cap 128
  • max batch tokens 32768
  • KV capacity from Frontier memory planner with gpu_memory_utilization=0.9 and non_kv_cache_overhead_bytes=0

Results:

| Run | Result | Evidence | Notes |
| --- | --- | --- | --- |
| coder_100 | Pass | runs/rs1/coder_100/ | Frontier block hit ratio 0.04948661841440835; ReplayServe token-weighted hit ratio 0.04956232588915065; no preemptions. |
| coder_2000 | Fail | runs/rs1/coder_2000/ | Exit code 1 after 4 seconds with ValueError: Request 194 already scheduled. Traceback ends at Frontier vLLM v1 waiting scheduling calling request.on_cache_hit(prefix_cached_tokens). |

The canonical failure was minimized in docs/rs1_frontier_blocker.md: replaying the first N requests passes at N=192 and fails at N=193 with "Request 192 already scheduled", and larger fixed-config probes fail around the same preempted prefix-cache path. Disabling prefix caching, disabling chunked prefill, or setting a high long-prefill threshold each avoids the failure, so the cause is not a bad row in the Qwen trace.
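The first-N minimization above can be sketched as a bisection over the prefix length. This is only an illustration, not the procedure recorded in docs/rs1_frontier_blocker.md: `smallest_failing_n` and `fake_run` are hypothetical names, the real check would launch the simulator on the first N trace rows, and the search assumes that once a prefix fails, every longer prefix fails too.

```python
from typing import Callable


def smallest_failing_n(run_passes: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest N in (lo, hi] for which run_passes(N) is False.

    Assumptions: run_passes(lo) is True, run_passes(hi) is False, and failure
    is monotonic in N (a failing prefix keeps failing when extended).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_passes(mid):
            lo = mid  # first `mid` requests still pass, so the boundary is above mid
        else:
            hi = mid  # first `mid` requests already fail, so the boundary is at or below mid
    return hi


def fake_run(n: int) -> bool:
    """Hypothetical stand-in for a simulator launch on the first n requests."""
    return n < 193


# coder_100 passes and coder_2000 fails, so 100 and 2000 bracket the boundary.
print(smallest_failing_n(fake_run, 100, 2000))  # prints 193
```

Each probe is one simulator run, so the bracket [100, 2000] needs about eleven runs rather than a linear scan.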

Frontier Patched Scratch

Patch:

  • File: patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
  • Documentation: docs/rs1_frontier_patch.md
  • Scratch checkout only: /tmp/replayserve-frontier-rs1b

The patch resets preempted request scheduler/cache-hit admission state before the request re-enters the waiting path. As of 2026-06-25 it also replays decode-phase preemption by moving already-produced tokens into the next prefill segment, preserves user-facing lengths for metrics, and fails fast if sequential simulation drains before all generated requests complete. It keeps the canonical Frontier checkout clean.

| Run | Result | Evidence | Hit ratios | Preemption | Memory planner facts |
| --- | --- | --- | --- | --- | --- |
| N=193 fixed config | Pass | runs/rs1b/patched/n193_fixed_v2/ | Frontier block 0.12458971786194112; ReplayServe token-weighted 0.12476981408429115 | 5 total events, 1 request | num_blocks=36902, gpu_memory_utilization=0.9, non-KV overhead 0, weight shard estimate 26.953125 GiB |
| coder_100 fixed config | Pass | runs/rs1b/patched/coder_100/ | Frontier block 0.04948661841440835; ReplayServe token-weighted 0.04956232588915065 | 0 | same derived memory planner point |
| coder_2000 fixed config | Pass | runs/rs1b/patched/coder_2000/ | Frontier block 0.12318930248025924; ReplayServe token-weighted 0.12332978217090633 | 35940 total events, 1061 requests | same derived memory planner point |

Metrics caveats:

  • These are plumbing-smoke metrics. The run uses dummy 1 ms execution time and analytical communication, not calibrated Qwen3-32B A800 compute profiles.
  • coder_2000 produced request_metrics.csv with 2000 rows, but 745 rows have blank request-level prefix-cache fields. ReplayServe token-weighted hit ratio therefore uses the 1255 rows with complete cache metrics. Frontier's aggregate prefix-cache statistics in the same summary also report 1255 requests with cache metrics. This is acceptable for blocker removal evidence, but it is not a final metric-quality result.
  • No allocation/OOM pressure log lines were found in the postprocess summaries.
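The completeness caveat above could be checked mechanically with something like the sketch below. The two cache column names are taken from the hit-ratio definitions in this document; the `request_id` column, the helper name `cache_completeness`, and the inline sample data are assumptions for illustration, not the actual layout of request_metrics.csv.

```python
import csv
import io

# Column names as used in the hit-ratio definitions of this document.
CACHE_FIELDS = (
    "request_prefix_cache_hit_blocks",
    "request_prefix_cache_query_blocks",
)


def cache_completeness(rows):
    """Count rows whose cache fields are all non-blank."""
    total = 0
    complete = 0
    for row in rows:
        total += 1
        # A field counts as present if it is a non-empty string; "0" is present.
        if all((row.get(field) or "").strip() for field in CACHE_FIELDS):
            complete += 1
    return {"total": total, "complete": complete, "blank": total - complete}


# Hypothetical three-row sample; a real check would open request_metrics.csv.
sample = io.StringIO(
    "request_id,request_prefix_cache_hit_blocks,request_prefix_cache_query_blocks\n"
    "0,2,10\n"
    "1,,\n"
    "2,0,4\n"
)
report = cache_completeness(csv.DictReader(sample))
print(report)  # {'total': 3, 'complete': 2, 'blank': 1}
```

Applied to the patched coder_2000 run, a check of this shape would be expected to report 2000 total rows, 1255 complete, and 745 blank.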

Vidur

No Vidur baseline run was executed for RS2. Based on source inspection, Vidur is useful as an arrival-and-length baseline candidate, but it cannot faithfully compare ReplayServe prefix reuse without additional code:

  • vidur/request_generator/trace_replay_request_generator.py consumes arrived_at, num_prefill_tokens, and num_decode_tokens.
  • vidur/entities/request.py stores arrival, prefill length, decode length, processed tokens, schedule/completion timestamps, and preemption state.
  • The inspected request path does not carry session_id, block_hash_ids, or sidecar block-token accounting.
  • Current Vidur trace replay clips prefill lengths when total tokens exceed max_tokens; ReplayServe must keep its own hard-fail validation if Vidur is used later as a length-only baseline.
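A hard-fail length check of the kind mentioned in the last bullet might look like the following. It is a minimal sketch, not the ReplayServe adapter's actual code: `TraceRow`, `TraceLengthError`, and `validate_lengths` are hypothetical names, and the default limit simply reuses the 32768 value from the RS1 fixed config.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TraceRow:
    """One replay row, using the field names both simulators consume."""
    arrived_at: float
    num_prefill_tokens: int
    num_decode_tokens: int


class TraceLengthError(ValueError):
    """Raised instead of silently clipping an oversized request."""


def validate_lengths(rows: Iterable[TraceRow], max_tokens: int = 32768) -> int:
    """Return the number of rows checked, or raise on the first violation."""
    checked = 0
    for index, row in enumerate(rows):
        total = row.num_prefill_tokens + row.num_decode_tokens
        if total > max_tokens:
            raise TraceLengthError(
                f"row {index}: prefill+decode={total} exceeds max_tokens={max_tokens}"
            )
        checked += 1
    return checked


rows = [TraceRow(0.0, 1200, 64), TraceRow(0.5, 30000, 2000)]
print(validate_lengths(rows))  # prints 2
```

Failing before the run keeps the replayed workload identical across simulators, whereas internal clipping would change request lengths without leaving a trace in the results.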

Conclusion for Vidur in RS2: it can likely replay coder_100/coder_2000 arrival and length after a simple CSV compatibility conversion, but it would measure a different workload because prefix KV reuse is absent.

AIConfigurator

No AIConfigurator run was executed for RS2 because it is not a per-request replay simulator. Source/docs show it is a deployment/config search estimator:

  • CLI examples take workload summaries such as --isl, --osl, --prefix, --ttft, --tpot, --total-gpus, --system, and model path.
  • Outputs are best throughput, per-GPU throughput, per-user throughput, TTFT, TPOT, request latency, concurrency, and parallel deployment choices.
  • It models operations and searches aggregated/disaggregated serving configurations using collected or estimated performance data.

Conclusion for AIConfigurator in RS2: it is useful for config candidates and reference sizing assumptions. It cannot serve as a direct comparison for faithful per-request prefix-hit replay on the Qwen trace fixtures.

Metric Definitions

TTFT: Time from request arrival to first generated token / prefill completion. Frontier and Vidur both have request-level prefill/first-token style timing fields, but RS1 Frontier values are not performance claims because the execution predictor is dummy.

TPOT: Decode time per output token. Tools differ on whether they report total decode normalized by output tokens, inter-token latency, or a configured SLA target. Use each tool's native field only within that tool unless calibrated against the same serving definition.

E2E latency: Completion time minus arrival time for one request.

Throughput: Completed tokens or requests per unit time. AIConfigurator reports estimated tokens/s style capacity; Frontier/Vidur report simulated metrics. RS1 Frontier throughput is plumbing-only because compute is dummy.
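The timing definitions above can be written out per request as follows. This is a sketch under assumptions: `first_token_at` and `completed_at` are hypothetical field names rather than any tool's native columns, and the two TPOT variants are included only to show why values from different tools should not be mixed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTiming:
    arrived_at: float        # request arrival time, seconds
    first_token_at: float    # time the first output token is produced
    completed_at: float      # time the last output token is produced
    num_decode_tokens: int   # number of output tokens


def ttft(t: RequestTiming) -> float:
    return t.first_token_at - t.arrived_at


def e2e_latency(t: RequestTiming) -> float:
    return t.completed_at - t.arrived_at


def tpot_decode_normalized(t: RequestTiming) -> float:
    """Total decode time divided by the number of output tokens."""
    return (t.completed_at - t.first_token_at) / t.num_decode_tokens


def tpot_mean_inter_token(t: RequestTiming) -> float:
    """Mean gap between consecutive output tokens (n tokens have n - 1 gaps)."""
    if t.num_decode_tokens < 2:
        raise ValueError("inter-token latency needs at least two output tokens")
    return (t.completed_at - t.first_token_at) / (t.num_decode_tokens - 1)


timing = RequestTiming(arrived_at=0.0, first_token_at=0.5, completed_at=2.5, num_decode_tokens=5)
print(ttft(timing))                    # 0.5
print(e2e_latency(timing))             # 2.5
print(tpot_decode_normalized(timing))  # 0.4
print(tpot_mean_inter_token(timing))   # 0.5
```

The same request yields a TPOT of 0.4 s or 0.5 s depending on the definition, which is why each tool's native field should stay within that tool unless the definitions are reconciled.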

KV-cache hit ratio:

  • Frontier native block-level ratio: sum(request_prefix_cache_hit_blocks) / sum(request_prefix_cache_query_blocks).
  • ReplayServe token-weighted ratio: use sidecar block_token_counts and count the first request_prefix_cache_hit_blocks blocks by true token count, so a partial final block contributes its actual token count instead of always 16.

For coder_2000 patched, both ratios are computed only for request rows with complete cache fields because 745 request metric rows have blank cache fields. This is a metrics completeness caveat, not evidence that the trace has invalid hashes.

Non-Comparable Items

  • Frontier canonical and patched scratch are not equivalent artifacts. The patched result demonstrates an RS1 unblock path; it is not an upstream Frontier release.
  • Frontier/Vidur simulator timings and AIConfigurator estimator timings are not directly comparable without shared profiles, calibration, and metric definitions.
  • Prefix-reuse fidelity is not comparable across all three tools. Only Frontier currently consumes explicit block hash traces in the inspected checkouts.
  • AIConfigurator's prefix workload parameter is not the same as ReplayServe block_hash_ids; it cannot recover session-level sharing or partial-block token accounting.

Conclusions

No open-source implementation inspected here satisfies ReplayServe's target out of the box.

Frontier is the closest because it supports online trace replay, prefix-cache metadata, vLLM v1 style scheduler controls, memory planning, and request/system metrics. It still does not work out of the box for ReplayServe: RS0 needed an adapter, RS1B needed a local patch or upstream fix for prefix cache plus chunked prefill, and performance-quality claims need Qwen3-32B A800 profiles/calibration.

Vidur can be a useful arrival-plus-length baseline, but not a faithful prefix KV reuse replay engine in the inspected checkout.

AIConfigurator can guide candidate deployment/config choices, but it is a workload-summary estimator rather than a per-request simulator.

Frontier also does not support arbitrary model plus arbitrary hardware plus arbitrary config in a performance-reliable sense. A model/device config may be accepted syntactically, but fidelity depends on compute profiles, network profiles, scheduler support, parallelism semantics, memory-planner assumptions, and bug surface. For RS1, A800 network profiles exist, but the public checkout does not provide dense Qwen3-32B A800 compute profiles, so latency/throughput remain plumbing smoke.

Next Steps

RS3 sweep prerequisites:

  • Decide whether RS3 uses the local RS1B patch, waits for upstream Frontier, or carries both canonical and patched modes explicitly.
  • Keep fixed-config smoke and any sweep configs separate from performance claims.
  • Add a small run manifest/check script that records Frontier commit, patch status, fixture, command, and metric completeness.
  • Treat the coder_2000 blank cache fields as a metrics issue to investigate before using request-level hit ratios as a headline metric.
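The run manifest suggested in the list above might be shaped roughly like this. It is a sketch only: `RunManifest` and its field names are hypothetical, the command is left as a placeholder because the launch command is not recorded here, and the sample values are the coder_2000 patched figures from this document.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunManifest:
    frontier_commit: str
    patch_file: Optional[str]      # None for a canonical (unpatched) run
    fixture: str
    command: List[str]
    total_rows: int                # rows in request_metrics.csv
    rows_with_cache_metrics: int   # rows with complete cache fields

    @property
    def metric_completeness(self) -> float:
        if self.total_rows == 0:
            return 0.0
        return self.rows_with_cache_metrics / self.total_rows

    def to_json(self) -> str:
        record = asdict(self)
        record["patch_applied"] = self.patch_file is not None
        record["metric_completeness"] = self.metric_completeness
        return json.dumps(record, indent=2, sort_keys=True)


manifest = RunManifest(
    frontier_commit="d9cfeb6d8791fbf2f295dd9744c56a666171776e",
    patch_file="patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch",
    fixture="coder_2000",
    command=["<launch command placeholder>"],  # placeholder, not a real CLI
    total_rows=2000,
    rows_with_cache_metrics=1255,
)
print(manifest.to_json())
```

Writing one such record next to each run directory would make the canonical-versus-patched distinction and the cache-metric completeness visible without re-reading the postprocess summaries.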

RS4 calibration prerequisites:

  • Collect or obtain dense Qwen/Qwen3-32B A800 compute profiles for the Frontier predictor path.
  • Verify the A800 network profile and node SKU semantics match the target deployment.
  • Add non-KV memory overhead assumptions from a real serving stack instead of using 0.
  • Validate simulator TTFT/TPOT/E2E/throughput against measured vLLM runs before making performance conclusions.

Patch path recommendation:

  • Keep patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch pinned in ReplayServe until an upstream Frontier fix is available; it now covers both the original prefix-cache/chunked-prefill preemption bug and the RS10 decode-phase preemption lifecycle bug.
  • Open an upstream issue or PR with the RS1B minimal repro (N=193) and evidence from docs/rs1_frontier_blocker.md.
  • Re-run coder_100, N=193, and coder_2000 when changing Frontier commit or patch status.