# RS2 Simulator Comparison

Checked on 2026-06-24. RS2 compares simulator capabilities and first local
ReplayServe results. It does not start the RS3 sweep and does not make
performance-quality claims.

## Sources

| Source | Local path | Commit / HEAD | RS2 use |
|---|---|---:|---|
| ReplayServe | `/home/gahow/phd/replayserve` | local RS0/RS1/RS1B artifacts | Adapter, fixtures, runs, postprocess summaries |
| Qwen trace | `/home/gahow/phd/qwen-bailian-usagetraces-anon` | `5f7439c51ec248a0c585f7d90a41a6f57773b912` | Source `qwen_coder_blksz_16.jsonl` |
| Frontier canonical | `/tmp/toc-llm-sim-research/Frontier` | `d9cfeb6d8791fbf2f295dd9744c56a666171776e` | RS1 fixed config and source inspection |
| Frontier patched scratch | `/tmp/replayserve-frontier-rs1b` | base `d9cfeb6...` plus `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch` | RS1B unblock verification |
| Vidur | `/tmp/toc-llm-sim-research/vidur` | `8383d2935bc62723a212090baa9f98ada206fc14` | Source inspection for baseline capability |
| AIConfigurator | `/tmp/toc-llm-sim-research/aiconfigurator` | `e46ece7510e727fafefb8212e5846172145a30ea` | Source/docs inspection for config-estimator capability |

Key local evidence:

- Frontier trace replay: `/tmp/toc-llm-sim-research/Frontier/frontier/request_generator/trace_replay_request_generator.py`
- Frontier prefix-cache validation: `/tmp/toc-llm-sim-research/Frontier/frontier/scheduler/cluster_scheduler/base_cluster_scheduler.py`
- Frontier prefix-cache request metrics: `/tmp/toc-llm-sim-research/Frontier/frontier/metrics/metrics_store.py`
- Vidur trace replay: `/tmp/toc-llm-sim-research/vidur/vidur/request_generator/trace_replay_request_generator.py`
- Vidur request entity: `/tmp/toc-llm-sim-research/vidur/vidur/entities/request.py`
- AIConfigurator CLI/docs: `/tmp/toc-llm-sim-research/aiconfigurator/README.md` and `/tmp/toc-llm-sim-research/aiconfigurator/src/aiconfigurator/cli/main.py`

## Capability Matrix

| Capability | Frontier | Vidur | AIConfigurator |
|---|---|---|---|
| Per-request timestamp replay | Yes. `trace_replay` consumes `arrived_at` and RS1 runs `simulation_mode=online`. | Yes. `TraceReplayRequestGenerator` consumes `arrived_at`. | No per-request replay. CLI consumes workload summaries such as `--isl`, `--osl`, and SLA targets. |
| Input/output length replay | Yes. Consumes `num_prefill_tokens` and `num_decode_tokens`. Frontier can clip overflows internally, so ReplayServe adapter validates before run. | Yes. Consumes `num_prefill_tokens` and `num_decode_tokens`; current code clips prefill length if total exceeds max tokens. | Only summary lengths, not per-request traces. |
| Explicit `block_hash_ids` / prefix KV reuse replay | Yes. Current Frontier parses `session_id` and `block_hash_ids`, validates they are present when prefix caching is enabled, and applies prefix-cache accounting in vLLM v1. RS1B needs a patch for prefix cache plus chunked prefill under pressure. | No in this checkout. Request objects carry arrival/prefill/decode lengths and processed-token state, but no `session_id`, block hashes, or explicit prefix-reuse replay. README says prefix-caching work lives on a canary branch with sharp edges, not this main checkout. | No. `--prefix` is an aggregate prefix length/workload parameter, not an explicit hash/session replay model. |
| Online arrival pattern | Yes. RS1 fixed config uses online mode and trace replay. | Yes for trace replay baseline. | No event-level online replay. It estimates candidate deployments from summary workload/SLA inputs. |
| Prefix-cache hit-ratio output | Yes. Frontier emits request metrics including cached prefill tokens, query blocks, and hit blocks when present. ReplayServe postprocess adds token-weighted hit ratio using sidecar partial-block counts. | No native prefix-hit ratio in current main because no explicit prefix replay. | No prefix-hit replay metric. |
| TTFT / TPOT / E2E / throughput output | Yes. Request and system metrics are emitted under Frontier metrics dirs. RS1 uses dummy execution predictor, so values are plumbing-only. | Yes. Vidur metrics include request E2E, prefill/TTFT-style, decode-normalized, and system metrics. Fidelity depends on matching profiles. | Yes as estimates: best throughput, per-GPU throughput, per-user throughput, TTFT, TPOT, and request latency. These are configuration-search outputs, not replay observations. |
| TP / EP / DP / config knobs | Yes. Frontier has model, device, network device, attention TP/DP, MoE TP/EP, PP, scheduler, block, batch, prefix-cache, chunked-prefill, and memory-planner knobs. | Partial. Vidur exposes model, device, network device, tensor parallel size, pipeline stages, scheduler and batch/KV knobs. The inspected checkout is not a faithful EP/DP prefix-replay candidate. | Strong for config search. Supports TP/PP/DP and expert TP/EP style search across supported backends/systems. |
| Arbitrary model/hardware/config boundary | Not arbitrary. Model/device configs may exist, but reliable latency/throughput requires compute/network profiles, scheduler support, matching parallel semantics, and bug-free code paths. RS1 Qwen3-32B on A800 uses dummy predictor because public A800 dense Qwen3-32B compute profiles are absent. | Not arbitrary. README lists supported model/device/profile combinations; docs say profiling on actual GPUs is needed for new model/hardware fidelity. Current public device/profile coverage does not match A800 Qwen3-32B. | Not arbitrary. It depends on supported model families, backend/system databases, and estimate modes. The support matrix includes H100/H200/B200/GB200/A100 variants; no A800 built-in silicon database was found. |
| Needs profile/calibration | Yes for performance claims. Dummy predictor plus analytical comm is only a smoke. | Yes for performance claims. New model/device requires compute, network, and CPU-overhead style profiling. | Yes for production-quality estimates. It relies on collected silicon/perf databases or rough estimate modes; README warns memory/results need validation. |

## First Results

### Frontier Canonical

Fixed RS1 config:

- `simulation_mode=online`
- `sys_arch=co-location`
- `replica_scheduler=vllm_v1`
- `device=a800`
- `network_device=a800_dgx`
- `model_name=Qwen/Qwen3-32B`
- `attn_tensor_parallel_size=2`
- dummy execution predictor
- analytical communication backend
- `trace_request_generator_config_max_tokens=32768`
- prefix caching enabled
- block size 16
- chunked prefill enabled
- batch cap 128
- max batch tokens 32768
- KV capacity from Frontier memory planner with `gpu_memory_utilization=0.9` and `non_kv_cache_overhead_bytes=0`

Results:

| Run | Result | Evidence | Notes |
|---|---|---|---|
| `coder_100` | Pass | `runs/rs1/coder_100/` | Frontier block hit ratio `0.04948661841440835`; ReplayServe token-weighted hit ratio `0.04956232588915065`; no preemptions. |
| `coder_2000` | Fail | `runs/rs1/coder_2000/` | Exit code 1 after 4 seconds with `ValueError: Request 194 already scheduled.` Traceback ends at Frontier vLLM v1 waiting scheduling calling `request.on_cache_hit(prefix_cached_tokens)`. |

The canonical failure was minimized in `docs/rs1_frontier_blocker.md`: replaying
the first `N=192` requests passes, `N=193` fails with
`Request 192 already scheduled`, and larger fixed-config probes fail around the
same preempted prefix-cache path. Disabling prefix caching, disabling chunked
prefill, or setting a high long-prefill threshold each avoids the failure, so
the cause is not a bad Qwen trace row.

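The first-N minimization above can be illustrated with a small bisection sketch. This is not the procedure recorded in `docs/rs1_frontier_blocker.md`; it only shows one way such a boundary could be located. It assumes failure is monotone in `N` (once a prefix of the trace fails, every longer prefix also fails), and `passes` is a hypothetical callable standing in for "run Frontier on the first N rows with the fixed config and report success".

```python
from typing import Callable


def first_failing_n(passes: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest N in (lo, hi] for which passes(N) is False.

    Assumptions (not taken from the source document):
    - passes(lo) is True and passes(hi) is False;
    - failure is monotone in N.
    """
    if not passes(lo):
        raise ValueError(f"lower bound N={lo} already fails")
    if passes(hi):
        raise ValueError(f"upper bound N={hi} does not fail")
    # Invariant: passes(lo) is True and passes(hi) is False.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid
        else:
            hi = mid
    return hi


def fake_run(n: int) -> bool:
    """Hypothetical stand-in for a real simulator run on the first n rows."""
    return n <= 192


if __name__ == "__main__":
    # Bracket the search with the two fixtures that pass and fail.
    print(first_failing_n(fake_run, 100, 2000))
```

With a real run in place of `fake_run`, each probe costs one simulator invocation, so bracketing between `coder_100` and `coder_2000` needs roughly eleven probes.
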
### Frontier Patched Scratch

Patch:

- File: `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`
- Documentation: `docs/rs1_frontier_patch.md`
- Scratch checkout only: `/tmp/replayserve-frontier-rs1b`

The patch resets preempted request scheduler/cache-hit admission state before
the request re-enters the waiting path. As of 2026-06-25 it also replays
decode-phase preemption by moving already-produced tokens into the next prefill
segment, preserves user-facing lengths for metrics, and fails fast if
sequential simulation drains before all generated requests complete. It keeps
the canonical Frontier checkout clean.

| Run | Result | Evidence | Hit ratios | Preemption | Memory planner facts |
|---|---|---|---|---|---|
| `N=193` fixed config | Pass | `runs/rs1b/patched/n193_fixed_v2/` | Frontier block `0.12458971786194112`; ReplayServe token-weighted `0.12476981408429115` | 5 total events, 1 request | `num_blocks=36902`, `gpu_memory_utilization=0.9`, non-KV overhead `0`, weight shard estimate `26.953125 GiB` |
| `coder_100` fixed config | Pass | `runs/rs1b/patched/coder_100/` | Frontier block `0.04948661841440835`; ReplayServe token-weighted `0.04956232588915065` | 0 | same derived memory planner point |
| `coder_2000` fixed config | Pass | `runs/rs1b/patched/coder_2000/` | Frontier block `0.12318930248025924`; ReplayServe token-weighted `0.12332978217090633` | 35940 total events, 1061 requests | same derived memory planner point |

Metrics caveats:

- These are plumbing-smoke metrics. The run uses dummy 1 ms execution time and
  analytical communication, not calibrated Qwen3-32B A800 compute profiles.
- `coder_2000` produced `request_metrics.csv` with 2000 rows, but 745 rows have
  blank request-level prefix-cache fields. ReplayServe token-weighted hit ratio
  therefore uses the 1255 rows with complete cache metrics. Frontier's aggregate
  prefix-cache statistics in the same summary also report 1255 requests with
  cache metrics. This is acceptable for blocker removal evidence, but it is not
  a final metric-quality result.
- No allocation/OOM pressure log lines were found in the postprocess summaries.

### Vidur

No Vidur baseline run was executed for RS2. Based on source inspection, Vidur is
useful as an arrival-and-length baseline candidate, but it cannot faithfully
compare ReplayServe prefix reuse without additional code:

- `vidur/request_generator/trace_replay_request_generator.py` consumes
  `arrived_at`, `num_prefill_tokens`, and `num_decode_tokens`.
- `vidur/entities/request.py` stores arrival, prefill length, decode length,
  processed tokens, schedule/completion timestamps, and preemption state.
- The inspected request path does not carry `session_id`, `block_hash_ids`, or
  sidecar block-token accounting.
- Current Vidur trace replay clips prefill lengths when total tokens exceed
  `max_tokens`; ReplayServe must keep its own hard-fail validation if Vidur is
  used later as a length-only baseline.

Conclusion for Vidur in RS2: it can likely replay `coder_100`/`coder_2000`
arrival and length after a simple CSV compatibility conversion, but it would
measure a different workload because prefix KV reuse is absent.

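A minimal sketch of what such a CSV compatibility conversion with hard-fail validation might look like is given below. The output column names `arrived_at`, `num_prefill_tokens`, and `num_decode_tokens` come from the Vidur source inspection above. The input field names `timestamp`, `input_length`, and `output_length` are assumptions about the JSONL fixture, not confirmed names, and the timestamp is passed through without unit conversion.

```python
import csv
import io
import json
from typing import Dict, Iterable, List, TextIO

VIDUR_COLUMNS = ["arrived_at", "num_prefill_tokens", "num_decode_tokens"]


def to_vidur_rows(jsonl_lines: Iterable[str], max_tokens: int) -> List[Dict]:
    """Convert JSONL trace lines into Vidur-style rows.

    Raises ValueError instead of clipping when a request exceeds max_tokens,
    so a length-only baseline never silently replays a shortened request.
    Input field names are hypothetical.
    """
    rows = []
    for lineno, line in enumerate(jsonl_lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines in the fixture
        rec = json.loads(line)
        prefill = int(rec["input_length"])
        decode = int(rec["output_length"])
        if prefill + decode > max_tokens:
            raise ValueError(
                f"line {lineno}: {prefill}+{decode} tokens exceeds {max_tokens}"
            )
        rows.append(
            {
                "arrived_at": float(rec["timestamp"]),
                "num_prefill_tokens": prefill,
                "num_decode_tokens": decode,
            }
        )
    return rows


def write_vidur_csv(rows: List[Dict], out: TextIO) -> None:
    """Write rows with the three columns Vidur trace replay consumes."""
    writer = csv.DictWriter(out, fieldnames=VIDUR_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    sample = ['{"timestamp": 0.5, "input_length": 10, "output_length": 5}']
    buf = io.StringIO()
    write_vidur_csv(to_vidur_rows(sample, max_tokens=32768), buf)
    print(buf.getvalue())
```

The conversion drops `session_id` and `block_hash_ids` by construction, which is exactly why the resulting Vidur run would measure a different workload.
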
### AIConfigurator

No AIConfigurator run was executed for RS2 because it is not a per-request
replay simulator. Source/docs show it is a deployment/config search estimator:

- CLI examples take workload summaries such as `--isl`, `--osl`, `--prefix`,
  `--ttft`, `--tpot`, `--total-gpus`, `--system`, and model path.
- Outputs are best throughput, per-GPU throughput, per-user throughput, TTFT,
  TPOT, request latency, concurrency, and parallel deployment choices.
- It models operations and searches aggregated/disaggregated serving
  configurations using collected or estimated performance data.

Conclusion for AIConfigurator in RS2: it is useful for config candidates and
reference sizing assumptions. It cannot directly compare faithful per-request
prefix-hit replay on Qwen trace fixtures.

## Metric Definitions

`TTFT`:
Time from request arrival to first generated token / prefill completion. Frontier
and Vidur both have request-level prefill/first-token style timing fields, but
RS1 Frontier values are not performance claims because the execution predictor is
dummy.

`TPOT`:
Decode time per output token. Tools differ on whether they report total decode
normalized by output tokens, inter-token latency, or a configured SLA target.
Use each tool's native field only within that tool unless calibrated against the
same serving definition.

`E2E latency`:
Completion time minus arrival time for one request.

`Throughput`:
Completed tokens or requests per unit time. AIConfigurator reports estimated
tokens/s style capacity; Frontier/Vidur report simulated metrics. RS1 Frontier
throughput is plumbing-only because compute is dummy.

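The timing definitions above can be made concrete with a short sketch. It assumes three per-request timestamps are available; the names `first_token_at` and `completed_at` are hypothetical and do not correspond to any tool's native column names. For TPOT it picks one of the conventions mentioned above (total decode time normalized by output tokens), so its values would not be comparable with a tool that reports inter-token latency.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestTimes:
    """Hypothetical per-request record; field names are illustrative."""
    arrived_at: float
    first_token_at: float
    completed_at: float
    num_decode_tokens: int


def ttft(r: RequestTimes) -> float:
    """Time from arrival to first generated token."""
    return r.first_token_at - r.arrived_at


def e2e_latency(r: RequestTimes) -> float:
    """Completion time minus arrival time."""
    return r.completed_at - r.arrived_at


def tpot(r: RequestTimes) -> float:
    """Total decode time normalized by output tokens (one convention of several)."""
    if r.num_decode_tokens <= 0:
        raise ValueError("TPOT is undefined without output tokens")
    return (r.completed_at - r.first_token_at) / r.num_decode_tokens


def token_throughput(requests: Sequence[RequestTimes]) -> float:
    """Completed output tokens per unit time over the observed span."""
    start = min(r.arrived_at for r in requests)
    end = max(r.completed_at for r in requests)
    if end <= start:
        raise ValueError("observed span must be positive")
    return sum(r.num_decode_tokens for r in requests) / (end - start)


if __name__ == "__main__":
    r = RequestTimes(arrived_at=1.0, first_token_at=1.5,
                     completed_at=3.5, num_decode_tokens=4)
    print(ttft(r), tpot(r), e2e_latency(r), token_throughput([r]))
```
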
`KV-cache hit ratio`:

- Frontier native block-level ratio:
  `sum(request_prefix_cache_hit_blocks) / sum(request_prefix_cache_query_blocks)`.
- ReplayServe token-weighted ratio:
  use sidecar `block_token_counts` and count the first
  `request_prefix_cache_hit_blocks` blocks by true token count, so a partial
  final block contributes its actual token count instead of always 16.

For `coder_2000` patched, both ratios are computed only for request rows with
complete cache fields because 745 request metric rows have blank cache fields.
This is a metrics completeness caveat, not evidence that the trace has invalid
hashes.

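The two ratios can be sketched as follows. This is an illustration under stated assumptions, not the ReplayServe postprocess code. The column names `request_prefix_cache_hit_blocks` and `request_prefix_cache_query_blocks` and the sidecar name `block_token_counts` come from the definitions above; the `request_id` key and the choice of denominator for the token-weighted ratio (true token count of the queried blocks) are assumptions.

```python
from typing import Dict, Iterable, List, Mapping, Optional

HIT = "request_prefix_cache_hit_blocks"
QUERY = "request_prefix_cache_query_blocks"


def _blank(value) -> bool:
    """True for missing or empty CSV cells."""
    return value is None or str(value).strip() == ""


def hit_ratios(
    rows: Iterable[Mapping[str, object]],
    block_token_counts: Mapping[str, List[int]],
) -> Dict[str, Optional[float]]:
    """Compute block-level and token-weighted hit ratios.

    Rows with blank cache fields are skipped and counted, mirroring the
    completeness caveat. `request_id` is a hypothetical join key.
    """
    hit_blocks = query_blocks = hit_tokens = query_tokens = 0
    used = skipped = 0
    for row in rows:
        if _blank(row.get(HIT)) or _blank(row.get(QUERY)):
            skipped += 1
            continue
        hit = int(float(row[HIT]))
        query = int(float(row[QUERY]))
        counts = block_token_counts[row["request_id"]]
        hit_blocks += hit
        query_blocks += query
        # A partial final block contributes its true token count, not 16.
        hit_tokens += sum(counts[:hit])
        query_tokens += sum(counts[:query])  # assumed denominator
        used += 1
    return {
        "block_ratio": hit_blocks / query_blocks if query_blocks else None,
        "token_ratio": hit_tokens / query_tokens if query_tokens else None,
        "rows_used": used,
        "rows_skipped": skipped,
    }


if __name__ == "__main__":
    rows = [
        {"request_id": "a", HIT: "1", QUERY: "2"},
        {"request_id": "b", HIT: "", QUERY: ""},  # blank cache fields
    ]
    sidecar = {"a": [16, 4], "b": [16]}
    print(hit_ratios(rows, sidecar))
```

In the toy input, request `a` hits one full block out of two queried blocks, the second of which holds only 4 tokens, so the token-weighted ratio (16 of 20 tokens) is higher than the block-level ratio (1 of 2 blocks).
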
## Non-Comparable Items

- Frontier canonical and patched scratch are not equivalent artifacts. The
  patched result demonstrates an RS1 unblock path; it is not an upstream
  Frontier release.
- Frontier/Vidur simulator timings and AIConfigurator estimator timings are not
  directly comparable without shared profiles, calibration, and metric
  definitions.
- Prefix-reuse fidelity is not comparable across all three tools. Only Frontier
  currently consumes explicit block hash traces in the inspected checkouts.
- AIConfigurator's `prefix` workload parameter is not the same as ReplayServe
  `block_hash_ids`; it cannot recover session-level sharing or partial-block
  token accounting.

## Conclusions

There is no best open-source implementation that satisfies ReplayServe's target
out of the box.

Frontier is the closest because it supports online trace replay, prefix-cache
metadata, vLLM v1 style scheduler controls, memory planning, and request/system
metrics. It is still not out of the box for ReplayServe: RS0 needed an adapter,
RS1B needed a local patch or upstream fix for prefix cache plus chunked prefill,
and performance-quality claims need Qwen3-32B A800 profiles/calibration.

Vidur can be a useful arrival-plus-length baseline, but not a faithful prefix KV
reuse replay engine in the inspected checkout.

AIConfigurator can guide candidate deployment/config choices, but it is a
workload-summary estimator rather than a per-request simulator.

Frontier also does not support arbitrary model plus arbitrary hardware plus
arbitrary config in a performance-reliable sense. A model/device config may be
accepted syntactically, but fidelity depends on compute profiles, network
profiles, scheduler support, parallelism semantics, memory-planner assumptions,
and bug surface. For RS1, A800 network profiles exist, but the public checkout
does not provide dense Qwen3-32B A800 compute profiles, so latency/throughput
remain plumbing smoke.

## Next Steps

RS3 sweep prerequisites:

- Decide whether RS3 uses the local RS1B patch, waits for upstream Frontier, or
  carries both canonical and patched modes explicitly.
- Keep fixed-config smoke and any sweep configs separate from performance claims.
- Add a small run manifest/check script that records Frontier commit, patch
  status, fixture, command, and metric completeness.
- Treat the `coder_2000` blank cache fields as a metrics issue to investigate
  before using request-level hit ratios as a headline metric.

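One possible shape for the run manifest mentioned in the list above is sketched here. The recorded fields follow that bullet (Frontier commit, patch status, fixture, command, metric completeness); the class name, field names, and JSON layout are hypothetical. The commit is passed in as a string (it could be obtained with `git rev-parse HEAD` in the checkout), and the command is a placeholder because the actual Frontier invocation is not given in this document.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List, Mapping

# Cache columns named in the metric definitions of this document.
CACHE_FIELDS = (
    "request_prefix_cache_hit_blocks",
    "request_prefix_cache_query_blocks",
)


@dataclass
class RunManifest:
    """Hypothetical manifest record for one simulator run."""
    frontier_commit: str
    patch_applied: bool
    fixture: str
    command: List[str]
    total_rows: int
    rows_with_cache_metrics: int


def count_complete(rows: Iterable[Mapping[str, object]]) -> int:
    """Count rows whose cache fields are all present and non-blank."""
    return sum(
        1
        for row in rows
        if all(str(row.get(f) or "").strip() != "" for f in CACHE_FIELDS)
    )


def build_manifest(commit: str, patch_applied: bool, fixture: str,
                   command: List[str],
                   rows: List[Mapping[str, object]]) -> RunManifest:
    """Assemble a manifest from run facts and request metric rows."""
    return RunManifest(
        frontier_commit=commit,
        patch_applied=patch_applied,
        fixture=fixture,
        command=list(command),
        total_rows=len(rows),
        rows_with_cache_metrics=count_complete(rows),
    )


if __name__ == "__main__":
    demo_rows = [
        {CACHE_FIELDS[0]: "3", CACHE_FIELDS[1]: "8"},
        {CACHE_FIELDS[0]: "", CACHE_FIELDS[1]: ""},
    ]
    manifest = build_manifest(
        commit="d9cfeb6d8791fbf2f295dd9744c56a666171776e",
        patch_applied=True,
        fixture="coder_2000",
        command=["PLACEHOLDER_COMMAND"],  # real invocation not specified here
        rows=demo_rows,
    )
    print(json.dumps(asdict(manifest), indent=2))
```

A check script could compare `rows_with_cache_metrics` against `total_rows` and flag any run where they differ, which would have surfaced the `coder_2000` blank cache fields automatically.
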
RS4 calibration prerequisites:

- Collect or obtain dense `Qwen/Qwen3-32B` A800 compute profiles for the Frontier
  predictor path.
- Verify the A800 network profile and node SKU semantics match the target
  deployment.
- Add non-KV memory overhead assumptions from a real serving stack instead of
  using `0`.
- Validate simulator TTFT/TPOT/E2E/throughput against measured vLLM runs before
  making performance conclusions.

Patch path recommendation:

- Keep `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch` pinned in
  ReplayServe until an upstream Frontier fix is available; it now covers both
  the original prefix-cache/chunked-prefill preemption bug and the RS10
  decode-phase preemption lifecycle bug.
- Open an upstream issue or PR with the RS1B minimal repro (`N=193`) and
  evidence from `docs/rs1_frontier_blocker.md`.
- Re-run `coder_100`, `N=193`, and `coder_2000` when changing Frontier commit or
  patch status.