# RS2 Simulator Comparison
Checked on 2026-06-24. RS2 compares simulator capabilities and first local
ReplayServe results. It does not start the RS3 sweep and does not make
performance-quality claims.
## Sources
| Source | Local path | Commit / HEAD | RS2 use |
|---|---|---:|---|
| ReplayServe | `/home/gahow/phd/replayserve` | local RS0/RS1/RS1B artifacts | Adapter, fixtures, runs, postprocess summaries |
| Qwen trace | `/home/gahow/phd/qwen-bailian-usagetraces-anon` | `5f7439c51ec248a0c585f7d90a41a6f57773b912` | Source `qwen_coder_blksz_16.jsonl` |
| Frontier canonical | `/tmp/toc-llm-sim-research/Frontier` | `d9cfeb6d8791fbf2f295dd9744c56a666171776e` | RS1 fixed config and source inspection |
| Frontier patched scratch | `/tmp/replayserve-frontier-rs1b` | base `d9cfeb6...` plus `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch` | RS1B unblock verification |
| Vidur | `/tmp/toc-llm-sim-research/vidur` | `8383d2935bc62723a212090baa9f98ada206fc14` | Source inspection for baseline capability |
| AIConfigurator | `/tmp/toc-llm-sim-research/aiconfigurator` | `e46ece7510e727fafefb8212e5846172145a30ea` | Source/docs inspection for config-estimator capability |

Key local evidence:
- Frontier trace replay: `/tmp/toc-llm-sim-research/Frontier/frontier/request_generator/trace_replay_request_generator.py`
- Frontier prefix-cache validation: `/tmp/toc-llm-sim-research/Frontier/frontier/scheduler/cluster_scheduler/base_cluster_scheduler.py`
- Frontier prefix-cache request metrics: `/tmp/toc-llm-sim-research/Frontier/frontier/metrics/metrics_store.py`
- Vidur trace replay: `/tmp/toc-llm-sim-research/vidur/vidur/request_generator/trace_replay_request_generator.py`
- Vidur request entity: `/tmp/toc-llm-sim-research/vidur/vidur/entities/request.py`
- AIConfigurator CLI/docs: `/tmp/toc-llm-sim-research/aiconfigurator/README.md` and `/tmp/toc-llm-sim-research/aiconfigurator/src/aiconfigurator/cli/main.py`
## Capability Matrix
| Capability | Frontier | Vidur | AIConfigurator |
|---|---|---|---|
| Per-request timestamp replay | Yes. `trace_replay` consumes `arrived_at` and RS1 runs `simulation_mode=online`. | Yes. `TraceReplayRequestGenerator` consumes `arrived_at`. | No per-request replay. CLI consumes workload summaries such as `--isl`, `--osl`, and SLA targets. |
| Input/output length replay | Yes. Consumes `num_prefill_tokens` and `num_decode_tokens`. Frontier can clip overflows internally, so ReplayServe adapter validates before run. | Yes. Consumes `num_prefill_tokens` and `num_decode_tokens`; current code clips prefill length if total exceeds max tokens. | Only summary lengths, not per-request traces. |
| Explicit `block_hash_ids` / prefix KV reuse replay | Yes. Current Frontier parses `session_id` and `block_hash_ids`, validates they are present when prefix caching is enabled, and applies prefix-cache accounting in vLLM v1. RS1B needs a patch for prefix cache plus chunked prefill under pressure. | No in this checkout. Request objects carry arrival/prefill/decode lengths and processed-token state, but no `session_id`, block hashes, or explicit prefix-reuse replay. README says prefix-caching work lives on a canary branch with sharp edges, not this main checkout. | No. `--prefix` is an aggregate prefix length/workload parameter, not an explicit hash/session replay model. |
| Online arrival pattern | Yes. RS1 fixed config uses online mode and trace replay. | Yes for trace replay baseline. | No event-level online replay. It estimates candidate deployments from summary workload/SLA inputs. |
| Prefix-cache hit-ratio output | Yes. Frontier emits request metrics including cached prefill tokens, query blocks, and hit blocks when present. ReplayServe postprocess adds token-weighted hit ratio using sidecar partial-block counts. | No native prefix-hit ratio in current main because no explicit prefix replay. | No prefix-hit replay metric. |
| TTFT / TPOT / E2E / throughput output | Yes. Request and system metrics are emitted under Frontier metrics dirs. RS1 uses dummy execution predictor, so values are plumbing-only. | Yes. Vidur metrics include request E2E, prefill/TTFT-style, decode-normalized, and system metrics. Fidelity depends on matching profiles. | Yes as estimates: best throughput, per-GPU throughput, per-user throughput, TTFT, TPOT, and request latency. These are configuration-search outputs, not replay observations. |
| TP / EP / DP / config knobs | Yes. Frontier has model, device, network device, attention TP/DP, MoE TP/EP, PP, scheduler, block, batch, prefix-cache, chunked-prefill, and memory-planner knobs. | Partial. Vidur exposes model, device, network device, tensor parallel size, pipeline stages, scheduler and batch/KV knobs. The inspected checkout is not a faithful EP/DP prefix-replay candidate. | Strong for config search. Supports TP/PP/DP and expert TP/EP style search across supported backends/systems. |
| Arbitrary model/hardware/config boundary | Not arbitrary. Model/device configs may exist, but reliable latency/throughput requires compute/network profiles, scheduler support, matching parallel semantics, and bug-free code paths. RS1 Qwen3-32B on A800 uses dummy predictor because public A800 dense Qwen3-32B compute profiles are absent. | Not arbitrary. README lists supported model/device/profile combinations; docs say profiling on actual GPUs is needed for new model/hardware fidelity. Current public device/profile coverage does not match A800 Qwen3-32B. | Not arbitrary. It depends on supported model families, backend/system databases, and estimate modes. The support matrix includes H100/H200/B200/GB200/A100 variants; no A800 built-in silicon database was found. |
| Needs profile/calibration | Yes for performance claims. Dummy predictor plus analytical comm is only a smoke test. | Yes for performance claims. New model/device requires compute, network, and CPU-overhead style profiling. | Yes for production-quality estimates. It relies on collected silicon/perf databases or rough estimate modes; README warns memory/results need validation. |
## First Results
### Frontier Canonical
Fixed RS1 config:
- `simulation_mode=online`
- `sys_arch=co-location`
- `replica_scheduler=vllm_v1`
- `device=a800`
- `network_device=a800_dgx`
- `model_name=Qwen/Qwen3-32B`
- `attn_tensor_parallel_size=2`
- dummy execution predictor
- analytical communication backend
- `trace_request_generator_config_max_tokens=32768`
- prefix caching enabled
- block size 16
- chunked prefill enabled
- batch cap 128
- max batch tokens 32768
- KV capacity from Frontier memory planner with `gpu_memory_utilization=0.9` and `non_kv_cache_overhead_bytes=0`

Results:

| Run | Result | Evidence | Notes |
|---|---|---|---|
| `coder_100` | Pass | `runs/rs1/coder_100/` | Frontier block hit ratio `0.04948661841440835`; ReplayServe token-weighted hit ratio `0.04956232588915065`; no preemptions. |
| `coder_2000` | Fail | `runs/rs1/coder_2000/` | Exit code 1 after 4 seconds with `ValueError: Request 194 already scheduled.` The traceback ends in Frontier's vLLM v1 waiting-path scheduling, at the call to `request.on_cache_hit(prefix_cached_tokens)`. |

The canonical failure was minimized in `docs/rs1_frontier_blocker.md`: the
first-N prefix with `N=192` passes, `N=193` fails with
`Request 192 already scheduled`, and larger fixed-config probes fail around the
same preempted prefix-cache path. Turning prefix caching off, turning chunked
prefill off, or setting a high long-prefill threshold each avoids the failure,
so the cause is not a bad Qwen trace row.
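One way to locate a first-N boundary like the `N=192`/`N=193` pair above is a bisection over prefix lengths. The sketch below is illustrative only: `smallest_failing_n` and `fake_run_passes` are hypothetical names, the stand-in predicate replaces a real simulator run, and the search assumes that once a prefix fails every longer prefix fails too.

```python
from typing import Callable


def smallest_failing_n(run_passes: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest N in (lo, hi] for which run_passes(N) is False.

    Assumptions (not verified by this function beyond the endpoints):
    run_passes(lo) is True, run_passes(hi) is False, and failure is
    monotonic in N.
    """
    if not run_passes(lo):
        raise ValueError("lo must be a passing prefix length")
    if run_passes(hi):
        raise ValueError("hi must be a failing prefix length")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_passes(mid):
            lo = mid  # boundary is above mid
        else:
            hi = mid  # boundary is at or below mid
    return hi


def fake_run_passes(n: int) -> bool:
    """Hypothetical stand-in for 'replay the first n trace rows and check exit code'."""
    return n <= 192


first_bad = smallest_failing_n(fake_run_passes, lo=100, hi=2000)
```

In a real setup the predicate would launch the fixed-config run on the first `n` rows, so each probe costs one simulator invocation and the search needs roughly `log2(hi - lo)` runs.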
### Frontier Patched Scratch
Patch:
- File: `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`
- Documentation: `docs/rs1_frontier_patch.md`
- Scratch checkout only: `/tmp/replayserve-frontier-rs1b`

The patch resets preempted request scheduler/cache-hit admission state before
the request re-enters the waiting path. As of 2026-06-25 it also replays
decode-phase preemption by moving already-produced tokens into the next prefill
segment, preserves user-facing lengths for metrics, and fails fast if
sequential simulation drains before all generated requests complete. It keeps
the canonical Frontier checkout clean.

| Run | Result | Evidence | Hit ratios | Preemption | Memory planner facts |
|---|---|---|---|---|---|
| `N=193` fixed config | Pass | `runs/rs1b/patched/n193_fixed_v2/` | Frontier block `0.12458971786194112`; ReplayServe token-weighted `0.12476981408429115` | 5 total events, 1 request | `num_blocks=36902`, `gpu_memory_utilization=0.9`, non-KV overhead `0`, weight shard estimate `26.953125 GiB` |
| `coder_100` fixed config | Pass | `runs/rs1b/patched/coder_100/` | Frontier block `0.04948661841440835`; ReplayServe token-weighted `0.04956232588915065` | 0 | same derived memory planner point |
| `coder_2000` fixed config | Pass | `runs/rs1b/patched/coder_2000/` | Frontier block `0.12318930248025924`; ReplayServe token-weighted `0.12332978217090633` | 35940 total events, 1061 requests | same derived memory planner point |

Metrics caveats:
- These are plumbing-smoke metrics. The run uses dummy 1 ms execution time and
analytical communication, not calibrated Qwen3-32B A800 compute profiles.
- `coder_2000` produced `request_metrics.csv` with 2000 rows, but 745 rows have
blank request-level prefix-cache fields. ReplayServe token-weighted hit ratio
therefore uses the 1255 rows with complete cache metrics. Frontier's aggregate
prefix-cache statistics in the same summary also report 1255 requests with
cache metrics. This is acceptable for blocker removal evidence, but it is not
a final metric-quality result.
- No allocation/OOM pressure log lines were found in the postprocess summaries.
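The completeness caveat above (745 of 2000 rows with blank cache fields) can be checked mechanically. The following is a minimal sketch, assuming the request-metrics CSV uses `request_prefix_cache_query_blocks` and `request_prefix_cache_hit_blocks` as column names and that a blank cell means the metric is missing; the `request_id` column and the inline sample data are hypothetical.

```python
import csv
import io

# Assumed column names; adjust to the actual request_metrics.csv header.
CACHE_FIELDS = (
    "request_prefix_cache_query_blocks",
    "request_prefix_cache_hit_blocks",
)


def cache_metric_completeness(csv_file, fields=CACHE_FIELDS):
    """Return (total_rows, complete_rows) for a request-metrics CSV stream.

    A row counts as complete when every cache field is present and non-blank.
    The string "0" is a valid value and counts as present.
    """
    total = 0
    complete = 0
    for row in csv.DictReader(csv_file):
        total += 1
        if all((row.get(name) or "").strip() for name in fields):
            complete += 1
    return total, complete


# Tiny hypothetical sample: the middle row has blank cache fields.
sample = io.StringIO(
    "request_id,request_prefix_cache_query_blocks,request_prefix_cache_hit_blocks\n"
    "0,10,2\n"
    "1,,\n"
    "2,8,0\n"
)
total_rows, complete_rows = cache_metric_completeness(sample)
```

For the patched `coder_2000` run, the same count would be expected to return 2000 total rows and 1255 complete rows if the assumptions about the column names hold.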
### Vidur
No Vidur baseline run was executed for RS2. Based on source inspection, Vidur is
useful as an arrival-and-length baseline candidate, but it cannot faithfully
compare ReplayServe prefix reuse without additional code:
- `vidur/request_generator/trace_replay_request_generator.py` consumes
`arrived_at`, `num_prefill_tokens`, and `num_decode_tokens`.
- `vidur/entities/request.py` stores arrival, prefill length, decode length,
processed tokens, schedule/completion timestamps, and preemption state.
- The inspected request path does not carry `session_id`, `block_hash_ids`, or
sidecar block-token accounting.
- Current Vidur trace replay clips prefill lengths when total tokens exceed
`max_tokens`; ReplayServe must keep its own hard-fail validation if Vidur is
used later as a length-only baseline.

Conclusion for Vidur in RS2: it can likely replay `coder_100`/`coder_2000`
arrival and length after a simple CSV compatibility conversion, but it would
measure a different workload because prefix KV reuse is absent.
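A length-only conversion with hard-fail validation could look like the sketch below. It assumes the input rows are already dictionaries carrying `arrived_at`, `num_prefill_tokens`, and `num_decode_tokens`, and that a plain CSV with those three columns is what the trace replay reader expects; `write_length_only_csv` is a hypothetical helper, not part of Vidur or ReplayServe.

```python
import csv
import io

VIDUR_COLUMNS = ("arrived_at", "num_prefill_tokens", "num_decode_tokens")


def write_length_only_csv(requests, out_file, max_tokens):
    """Validate every request, then write arrival/length rows as CSV.

    Raises ValueError on the first over-length request instead of letting a
    downstream simulator clip the prefill silently. Nothing is written unless
    all rows pass validation.
    """
    rows = list(requests)
    for index, req in enumerate(rows):
        total = req["num_prefill_tokens"] + req["num_decode_tokens"]
        if total > max_tokens:
            raise ValueError(
                f"request {index}: {total} tokens exceeds max_tokens={max_tokens}"
            )
    writer = csv.DictWriter(out_file, fieldnames=VIDUR_COLUMNS)
    writer.writeheader()
    for req in rows:
        # Drop prefix-reuse fields such as block_hash_ids; they are not replayed.
        writer.writerow({name: req[name] for name in VIDUR_COLUMNS})


# Hypothetical sample rows; real rows would come from the ReplayServe fixtures.
sample_requests = [
    {"arrived_at": 0.0, "num_prefill_tokens": 100, "num_decode_tokens": 20},
    {"arrived_at": 1.5, "num_prefill_tokens": 4000, "num_decode_tokens": 300},
]
buffer = io.StringIO()
write_length_only_csv(sample_requests, buffer, max_tokens=32768)
csv_text = buffer.getvalue()
```

Validating before writing keeps the output file all-or-nothing, which makes a clipped or partial baseline harder to produce by accident.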
### AIConfigurator
No AIConfigurator run was executed for RS2 because it is not a per-request
replay simulator. Source/docs show it is a deployment/config search estimator:
- CLI examples take workload summaries such as `--isl`, `--osl`, `--prefix`,
`--ttft`, `--tpot`, `--total-gpus`, `--system`, and model path.
- Outputs are best throughput, per-GPU throughput, per-user throughput, TTFT,
TPOT, request latency, concurrency, and parallel deployment choices.
- It models operations and searches aggregated/disaggregated serving
configurations using collected or estimated performance data.

Conclusion for AIConfigurator in RS2: it is useful for config candidates and
reference sizing assumptions. It does not replay individual requests, so it
cannot be compared directly on per-request prefix-hit replay of the Qwen trace
fixtures.
## Metric Definitions
`TTFT`:
Time from request arrival to first generated token / prefill completion. Frontier
and Vidur both have request-level prefill/first-token style timing fields, but
RS1 Frontier values are not performance claims because the execution predictor is
dummy.

`TPOT`:
Decode time per output token. Tools differ on whether they report total decode
normalized by output tokens, inter-token latency, or a configured SLA target.
Use each tool's native field only within that tool unless calibrated against the
same serving definition.

`E2E latency`:
Completion time minus arrival time for one request.

`Throughput`:
Completed tokens or requests per unit time. AIConfigurator reports estimated
tokens/s style capacity; Frontier/Vidur report simulated metrics. RS1 Frontier
throughput is plumbing-only because compute is dummy.
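The timing definitions above can be written down as a small sketch. The field names in `RequestTiming` are hypothetical and do not correspond to any tool's native columns, and the TPOT shown is only one of the variants mentioned (total decode time divided by output tokens); the throughput window from first arrival to last completion is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestTiming:
    """Hypothetical per-request timestamps, all in seconds."""
    arrived_at: float
    first_token_at: float
    completed_at: float
    num_decode_tokens: int


def ttft(r: RequestTiming) -> float:
    """Time from arrival to first generated token."""
    return r.first_token_at - r.arrived_at


def e2e_latency(r: RequestTiming) -> float:
    """Completion time minus arrival time."""
    return r.completed_at - r.arrived_at


def tpot_decode_normalized(r: RequestTiming) -> float:
    """One TPOT variant: total decode time divided by output tokens."""
    if r.num_decode_tokens <= 0:
        raise ValueError("num_decode_tokens must be positive")
    return (r.completed_at - r.first_token_at) / r.num_decode_tokens


def token_throughput(requests: Iterable[RequestTiming]) -> float:
    """Output tokens per second over the window first arrival..last completion."""
    reqs = list(requests)
    start = min(r.arrived_at for r in reqs)
    end = max(r.completed_at for r in reqs)
    return sum(r.num_decode_tokens for r in reqs) / (end - start)


# Made-up timings for illustration only.
timings = [
    RequestTiming(arrived_at=0.0, first_token_at=0.5, completed_at=2.5, num_decode_tokens=4),
    RequestTiming(arrived_at=1.0, first_token_at=1.25, completed_at=4.0, num_decode_tokens=11),
]
overall_throughput = token_throughput(timings)
```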
`KV-cache hit ratio`:
- Frontier native block-level ratio:
  `sum(request_prefix_cache_hit_blocks) / sum(request_prefix_cache_query_blocks)`.
- ReplayServe token-weighted ratio:
  use sidecar `block_token_counts` and count the first
  `request_prefix_cache_hit_blocks` blocks by true token count, so a partial
  final block contributes its actual token count instead of always 16.

For `coder_2000` patched, both ratios are computed only for request rows with
complete cache fields because 745 request metric rows have blank cache fields.
This is a metrics completeness caveat, not evidence that the trace has invalid
hashes.
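The two hit-ratio definitions can be sketched as follows. This is an illustration under stated assumptions rather than the ReplayServe postprocess code: rows are plain dictionaries, a missing cache field is represented as `None`, and the token-weighted denominator is taken to be the true token count of the queried blocks, which the text above does not spell out.

```python
HIT = "request_prefix_cache_hit_blocks"
QUERY = "request_prefix_cache_query_blocks"


def _has_cache_fields(row):
    """True when both block counters are present (not None)."""
    return row.get(HIT) is not None and row.get(QUERY) is not None


def block_hit_ratio(rows):
    """Block-level ratio: sum(hit blocks) / sum(query blocks) over complete rows."""
    complete = [r for r in rows if _has_cache_fields(r)]
    query = sum(r[QUERY] for r in complete)
    return sum(r[HIT] for r in complete) / query if query else 0.0


def token_weighted_hit_ratio(rows):
    """Token-weighted ratio using per-block true token counts.

    Assumption: the denominator is the token count of the queried blocks.
    """
    hit_tokens = 0
    query_tokens = 0
    for r in rows:
        if not _has_cache_fields(r):
            continue  # rows with blank cache fields are excluded
        counts = r["block_token_counts"]
        hit_tokens += sum(counts[: r[HIT]])
        query_tokens += sum(counts[: r[QUERY]])
    return hit_tokens / query_tokens if query_tokens else 0.0


# Made-up rows: block size 16 with partial final blocks of 5 and 7 tokens.
rows = [
    {HIT: 2, QUERY: 4, "block_token_counts": [16, 16, 16, 5]},
    {HIT: 0, QUERY: 2, "block_token_counts": [16, 7]},
    {HIT: None, QUERY: None, "block_token_counts": [16, 16]},  # incomplete row
]
block_ratio = block_hit_ratio(rows)            # 2 hit blocks of 6 queried
token_ratio = token_weighted_hit_ratio(rows)   # 32 hit tokens of 76 queried
```

Because prefix hits cover leading blocks, which are full, while partial final blocks shrink only the denominator, the token-weighted ratio in this sketch comes out slightly higher than the block-level one.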
## Non-Comparable Items
- Frontier canonical and patched scratch are not equivalent artifacts. The
patched result demonstrates an RS1 unblock path; it is not an upstream
Frontier release.
- Frontier/Vidur simulator timings and AIConfigurator estimator timings are not
directly comparable without shared profiles, calibration, and metric
definitions.
- Prefix-reuse fidelity is not comparable across all three tools. Only Frontier
currently consumes explicit block hash traces in the inspected checkouts.
- AIConfigurator's `prefix` workload parameter is not the same as ReplayServe
`block_hash_ids`; it cannot recover session-level sharing or partial-block
token accounting.
## Conclusions
No open-source implementation satisfies ReplayServe's target out of the box.

Frontier is the closest because it supports online trace replay, prefix-cache
metadata, vLLM v1 style scheduler controls, memory planning, and request/system
metrics. It still does not work out of the box for ReplayServe: RS0 needed an
adapter, RS1B needed a local patch or upstream fix for prefix cache plus chunked
prefill, and performance-quality claims need Qwen3-32B A800
profiles/calibration.

Vidur can be a useful arrival-plus-length baseline, but not a faithful prefix KV
reuse replay engine in the inspected checkout.

AIConfigurator can guide candidate deployment/config choices, but it is a
workload-summary estimator rather than a per-request simulator.

Frontier also does not support arbitrary model plus arbitrary hardware plus
arbitrary config in a performance-reliable sense. A model/device config may be
accepted syntactically, but fidelity depends on compute profiles, network
profiles, scheduler support, parallelism semantics, memory-planner assumptions,
and bug surface. For RS1, A800 network profiles exist, but the public checkout
does not provide dense Qwen3-32B A800 compute profiles, so latency/throughput
remain plumbing-smoke values.
## Next Steps
RS3 sweep prerequisites:
- Decide whether RS3 uses the local RS1B patch, waits for upstream Frontier, or
carries both canonical and patched modes explicitly.
- Keep fixed-config smoke and any sweep configs separate from performance claims.
- Add a small run manifest/check script that records Frontier commit, patch
status, fixture, command, and metric completeness.
- Treat the `coder_2000` blank cache fields as a metrics issue to investigate
before using request-level hit ratios as a headline metric.

RS4 calibration prerequisites:
- Collect or obtain dense `Qwen/Qwen3-32B` A800 compute profiles for the Frontier
predictor path.
- Verify the A800 network profile and node SKU semantics match the target
deployment.
- Add non-KV memory overhead assumptions from a real serving stack instead of
using `0`.
- Validate simulator TTFT/TPOT/E2E/throughput against measured vLLM runs before
making performance conclusions.

Patch path recommendation:
- Keep `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch` pinned in
ReplayServe until an upstream Frontier fix is available; it now covers both
the original prefix-cache/chunked-prefill preemption bug and the RS10
decode-phase preemption lifecycle bug.
- Open an upstream issue or PR with the RS1B minimal repro (`N=193`) and
evidence from `docs/rs1_frontier_blocker.md`.
- Re-run `coder_100`, `N=193`, and `coder_2000` when changing Frontier commit or
patch status.
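The run manifest proposed under the RS3 prerequisites could start from something like the sketch below, which records the Frontier commit, patch status, fixture, command, and metric completeness for each re-run. `build_manifest` and its key names are hypothetical, and the command entry is a placeholder rather than a real invocation.

```python
import json


def build_manifest(frontier_commit, patch_file, fixture, command,
                   total_rows, complete_rows):
    """Build a JSON-serializable record for one simulator run.

    patch_file is None for the canonical (unpatched) Frontier checkout.
    """
    if not 0 <= complete_rows <= total_rows:
        raise ValueError("complete_rows must be between 0 and total_rows")
    return {
        "frontier_commit": frontier_commit,
        "patch_file": patch_file,
        "fixture": fixture,
        "command": list(command),
        "metric_rows_total": total_rows,
        "metric_rows_complete": complete_rows,
        # False signals that request-level hit ratios are not headline-ready.
        "cache_metrics_complete": complete_rows == total_rows,
    }


manifest = build_manifest(
    frontier_commit="d9cfeb6d8791fbf2f295dd9744c56a666171776e",
    patch_file="patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch",
    fixture="coder_2000",
    command=["<command used for the run>"],  # placeholder, not a real CLI
    total_rows=2000,
    complete_rows=1255,
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```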