
Gahow Wang a99bd00782 Add ReplayServe Frontier vLLM alignment report

2026-06-25 17:10:30 +08:00


RS2 Simulator Comparison

Checked on 2026-06-24. RS2 compares simulator capabilities and first local ReplayServe results. It does not start the RS3 sweep and does not make performance-quality claims.

Sources

Source	Local path	Commit / HEAD	RS2 use
ReplayServe	`/home/gahow/phd/replayserve`	local RS0/RS1/RS1B artifacts	Adapter, fixtures, runs, postprocess summaries
Qwen trace	`/home/gahow/phd/qwen-bailian-usagetraces-anon`	`5f7439c51ec248a0c585f7d90a41a6f57773b912`	Source `qwen_coder_blksz_16.jsonl`
Frontier canonical	`/tmp/toc-llm-sim-research/Frontier`	`d9cfeb6d8791fbf2f295dd9744c56a666171776e`	RS1 fixed config and source inspection
Frontier patched scratch	`/tmp/replayserve-frontier-rs1b`	base `d9cfeb6...` plus `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`	RS1B unblock verification
Vidur	`/tmp/toc-llm-sim-research/vidur`	`8383d2935bc62723a212090baa9f98ada206fc14`	Source inspection for baseline capability
AIConfigurator	`/tmp/toc-llm-sim-research/aiconfigurator`	`e46ece7510e727fafefb8212e5846172145a30ea`	Source/docs inspection for config-estimator capability

Key local evidence:

Frontier trace replay: /tmp/toc-llm-sim-research/Frontier/frontier/request_generator/trace_replay_request_generator.py
Frontier prefix-cache validation: /tmp/toc-llm-sim-research/Frontier/frontier/scheduler/cluster_scheduler/base_cluster_scheduler.py
Frontier prefix-cache request metrics: /tmp/toc-llm-sim-research/Frontier/frontier/metrics/metrics_store.py
Vidur trace replay: /tmp/toc-llm-sim-research/vidur/vidur/request_generator/trace_replay_request_generator.py
Vidur request entity: /tmp/toc-llm-sim-research/vidur/vidur/entities/request.py
AIConfigurator CLI/docs: /tmp/toc-llm-sim-research/aiconfigurator/README.md and /tmp/toc-llm-sim-research/aiconfigurator/src/aiconfigurator/cli/main.py

Capability Matrix

Capability	Frontier	Vidur	AIConfigurator
Per-request timestamp replay	Yes. `trace_replay` consumes `arrived_at` and RS1 runs `simulation_mode=online`.	Yes. `TraceReplayRequestGenerator` consumes `arrived_at`.	No per-request replay. CLI consumes workload summaries such as `--isl`, `--osl`, and SLA targets.
Input/output length replay	Yes. Consumes `num_prefill_tokens` and `num_decode_tokens`. Frontier can clip overflows internally, so ReplayServe adapter validates before run.	Yes. Consumes `num_prefill_tokens` and `num_decode_tokens`; current code clips prefill length if total exceeds max tokens.	Only summary lengths, not per-request traces.
Explicit `block_hash_ids` / prefix KV reuse replay	Yes. Current Frontier parses `session_id` and `block_hash_ids`, validates they are present when prefix caching is enabled, and applies prefix-cache accounting in vLLM v1. RS1B needs a patch for prefix cache plus chunked prefill under pressure.	No in this checkout. Request objects carry arrival/prefill/decode lengths and processed-token state, but no `session_id`, block hashes, or explicit prefix-reuse replay. README says prefix-caching work lives on a canary branch with sharp edges, not this main checkout.	No. `--prefix` is an aggregate prefix length/workload parameter, not an explicit hash/session replay model.
Online arrival pattern	Yes. RS1 fixed config uses online mode and trace replay.	Yes for trace replay baseline.	No event-level online replay. It estimates candidate deployments from summary workload/SLA inputs.
Prefix-cache hit-ratio output	Yes. Frontier emits request metrics including cached prefill tokens, query blocks, and hit blocks when present. ReplayServe postprocess adds token-weighted hit ratio using sidecar partial-block counts.	No native prefix-hit ratio in current main because no explicit prefix replay.	No prefix-hit replay metric.
TTFT / TPOT / E2E / throughput output	Yes. Request and system metrics are emitted under Frontier metrics dirs. RS1 uses dummy execution predictor, so values are plumbing-only.	Yes. Vidur metrics include request E2E, prefill/TTFT-style, decode-normalized, and system metrics. Fidelity depends on matching profiles.	Yes as estimates: best throughput, per-GPU throughput, per-user throughput, TTFT, TPOT, and request latency. These are configuration-search outputs, not replay observations.
TP / EP / DP / config knobs	Yes. Frontier has model, device, network device, attention TP/DP, MoE TP/EP, PP, scheduler, block, batch, prefix-cache, chunked-prefill, and memory-planner knobs.	Partial. Vidur exposes model, device, network device, tensor parallel size, pipeline stages, scheduler and batch/KV knobs. The inspected checkout is not a faithful EP/DP prefix-replay candidate.	Strong for config search. Supports TP/PP/DP and expert TP/EP style search across supported backends/systems.
Arbitrary model/hardware/config boundary	Not arbitrary. Model/device configs may exist, but reliable latency/throughput requires compute/network profiles, scheduler support, matching parallel semantics, and bug-free code paths. RS1 Qwen3-32B on A800 uses dummy predictor because public A800 dense Qwen3-32B compute profiles are absent.	Not arbitrary. README lists supported model/device/profile combinations; docs say profiling on actual GPUs is needed for new model/hardware fidelity. Current public device/profile coverage does not match A800 Qwen3-32B.	Not arbitrary. It depends on supported model families, backend/system databases, and estimate modes. The support matrix includes H100/H200/B200/GB200/A100 variants; no A800 built-in silicon database was found.
Needs profile/calibration	Yes for performance claims. Dummy predictor plus analytical communication is only a smoke test.	Yes for performance claims. New model/device requires compute, network, and CPU-overhead style profiling.	Yes for production-quality estimates. It relies on collected silicon/perf databases or rough estimate modes; README warns memory/results need validation.

First Results

Frontier Canonical

Fixed RS1 config:

simulation_mode=online
sys_arch=co-location
replica_scheduler=vllm_v1
device=a800
network_device=a800_dgx
model_name=Qwen/Qwen3-32B
attn_tensor_parallel_size=2
dummy execution predictor
analytical communication backend
trace_request_generator_config_max_tokens=32768
prefix caching enabled
block size 16
chunked prefill enabled
batch cap 128
max batch tokens 32768
KV capacity from Frontier memory planner with gpu_memory_utilization=0.9 and non_kv_cache_overhead_bytes=0

Results:

Run	Result	Evidence	Notes
`coder_100`	Pass	`runs/rs1/coder_100/`	Frontier block hit ratio `0.04948661841440835`; ReplayServe token-weighted hit ratio `0.04956232588915065`; no preemptions.
`coder_2000`	Fail	`runs/rs1/coder_2000/`	Exit code 1 after 4 seconds with `ValueError: Request 194 already scheduled.` Traceback ends at Frontier vLLM v1 waiting scheduling calling `request.on_cache_hit(prefix_cached_tokens)`.

The canonical failure was minimized in docs/rs1_frontier_blocker.md: the first-N probe passes at N=192 and fails at N=193 with "Request 192 already scheduled", and larger fixed-config probes fail on the same preempted prefix-cache path. Disabling prefix caching, disabling chunked prefill, or setting a high long-prefill threshold each avoids the failure, so the cause is not a bad Qwen trace row.
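The first-N minimization described above can be sketched as a binary search over the prefix length. This is an illustration only, not the procedure recorded in docs/rs1_frontier_blocker.md: `run_passes` is a hypothetical callable that replays the first N trace rows under the fixed config and reports whether the run exited cleanly, and the search assumes failures are monotonic in N (once the first N rows fail, every longer prefix fails too).

```python
from typing import Callable, Optional


def smallest_failing_prefix(
    run_passes: Callable[[int], bool], total: int
) -> Optional[int]:
    """Return the smallest N in [1, total] whose first-N replay fails.

    Returns None when the full fixture passes. `run_passes` is a
    hypothetical hook; monotonic failure in N is an assumption.
    """
    if run_passes(total):
        return None  # full fixture passes, nothing to minimize
    lo, hi = 0, total  # invariant: first-hi fails; first-lo passes or lo == 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_passes(mid):
            lo = mid
        else:
            hi = mid
    return hi


# Stand-in predicate shaped like the RS1 observation: N <= 192 passes.
print(smallest_failing_prefix(lambda n: n < 193, 2000))
```

If monotonicity is in doubt, a linear scan upward from the last known-good N is the safer fallback, at the cost of more replays.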

Frontier Patched Scratch

Patch:

File: patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
Documentation: docs/rs1_frontier_patch.md
Scratch checkout only: /tmp/replayserve-frontier-rs1b

The patch resets preempted request scheduler/cache-hit admission state before the request re-enters the waiting path. As of 2026-06-25 it also replays decode-phase preemption by moving already-produced tokens into the next prefill segment, preserves user-facing lengths for metrics, and fails fast if sequential simulation drains before all generated requests complete. It keeps the canonical Frontier checkout clean.

Run	Result	Evidence	Hit ratios	Preemption	Memory planner facts
`N=193` fixed config	Pass	`runs/rs1b/patched/n193_fixed_v2/`	Frontier block `0.12458971786194112`; ReplayServe token-weighted `0.12476981408429115`	5 total events, 1 request	`num_blocks=36902`, `gpu_memory_utilization=0.9`, non-KV overhead `0`, weight shard estimate `26.953125 GiB`
`coder_100` fixed config	Pass	`runs/rs1b/patched/coder_100/`	Frontier block `0.04948661841440835`; ReplayServe token-weighted `0.04956232588915065`	0	same derived memory planner point
`coder_2000` fixed config	Pass	`runs/rs1b/patched/coder_2000/`	Frontier block `0.12318930248025924`; ReplayServe token-weighted `0.12332978217090633`	35940 total events, 1061 requests	same derived memory planner point

Metrics caveats:

These are plumbing-smoke metrics. The run uses dummy 1 ms execution time and analytical communication, not calibrated Qwen3-32B A800 compute profiles.
coder_2000 produced request_metrics.csv with 2000 rows, but 745 rows have blank request-level prefix-cache fields. ReplayServe token-weighted hit ratio therefore uses the 1255 rows with complete cache metrics. Frontier's aggregate prefix-cache statistics in the same summary also report 1255 requests with cache metrics. This is acceptable for blocker removal evidence, but it is not a final metric-quality result.
No allocation/OOM pressure log lines were found in the postprocess summaries.
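
The blank-field caveat above can be checked mechanically. The sketch below counts rows of a request metrics CSV whose prefix-cache fields are all populated. It assumes the CSV columns carry the names used in the hit-ratio definitions (`request_prefix_cache_query_blocks`, `request_prefix_cache_hit_blocks`); the sample data is made up for illustration and is not taken from any run.

```python
import csv
import io

# Assumed column names, borrowed from the hit-ratio definitions in this report.
CACHE_FIELDS = (
    "request_prefix_cache_query_blocks",
    "request_prefix_cache_hit_blocks",
)


def cache_field_completeness(csv_file, fields=CACHE_FIELDS):
    """Count rows with every cache field populated versus rows with any blank."""
    complete = blank = 0
    for row in csv.DictReader(csv_file):
        # A literal "0" is a populated value; only empty or missing cells are blank.
        if all((row.get(name) or "").strip() for name in fields):
            complete += 1
        else:
            blank += 1
    return {"rows": complete + blank, "complete": complete, "blank": blank}


# Made-up sample: the middle row has blank cache fields.
SAMPLE = """request_id,request_prefix_cache_query_blocks,request_prefix_cache_hit_blocks
0,4,2
1,,
2,3,0
"""

summary = cache_field_completeness(io.StringIO(SAMPLE))
print(summary)
```

Applied to the coder_2000 patched run, a check of this shape would be expected to report 2000 rows, 1255 complete, and 745 blank, matching the counts above.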

Vidur

No Vidur baseline run was executed for RS2. Based on source inspection, Vidur is useful as an arrival-and-length baseline candidate, but it cannot faithfully compare ReplayServe prefix reuse without additional code:

vidur/request_generator/trace_replay_request_generator.py consumes arrived_at, num_prefill_tokens, and num_decode_tokens.
vidur/entities/request.py stores arrival, prefill length, decode length, processed tokens, schedule/completion timestamps, and preemption state.
The inspected request path does not carry session_id, block_hash_ids, or sidecar block-token accounting.
Current Vidur trace replay clips prefill lengths when total tokens exceed max_tokens; ReplayServe must keep its own hard-fail validation if Vidur is used later as a length-only baseline.
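
A hard-fail length check of the kind mentioned above could look like the following. This is a minimal sketch, not the ReplayServe adapter's actual code: `TraceRow` and `validate_lengths` are hypothetical names, and the only rule enforced is that prefill plus decode tokens must not exceed `max_tokens`, so that no row reaches a simulator that would silently clip it.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TraceRow:
    """Hypothetical row type; field names follow the replay CSV columns."""
    request_id: int
    num_prefill_tokens: int
    num_decode_tokens: int


def validate_lengths(rows: Iterable[TraceRow], max_tokens: int) -> int:
    """Raise ValueError on the first row whose total length exceeds max_tokens.

    Returns the number of rows checked when every row fits.
    """
    checked = 0
    for row in rows:
        total = row.num_prefill_tokens + row.num_decode_tokens
        if total > max_tokens:
            raise ValueError(
                f"request {row.request_id}: {total} tokens exceeds "
                f"max_tokens={max_tokens}"
            )
        checked += 1
    return checked


# 32768 matches trace_request_generator_config_max_tokens in the RS1 config.
print(validate_lengths([TraceRow(0, 30000, 2768), TraceRow(1, 10, 5)], 32768))
```

Failing before the run keeps the replayed workload identical across simulators, whereas clipping inside each tool would change request lengths in tool-specific ways.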

Conclusion for Vidur in RS2: it can likely replay coder_100/coder_2000 arrival and length after a simple CSV compatibility conversion, but it would measure a different workload because prefix KV reuse is absent.

AIConfigurator

No AIConfigurator run was executed for RS2 because it is not a per-request replay simulator. Source/docs show it is a deployment/config search estimator:

CLI examples take workload summaries such as --isl, --osl, --prefix, --ttft, --tpot, --total-gpus, --system, and model path.
Outputs are best throughput, per-GPU throughput, per-user throughput, TTFT, TPOT, request latency, concurrency, and parallel deployment choices.
It models operations and searches aggregated/disaggregated serving configurations using collected or estimated performance data.

Conclusion for AIConfigurator in RS2: it is useful for config candidates and reference sizing assumptions. It cannot directly compare faithful per-request prefix-hit replay on Qwen trace fixtures.

Metric Definitions

TTFT: Time from request arrival to first generated token / prefill completion. Frontier and Vidur both have request-level prefill/first-token style timing fields, but RS1 Frontier values are not performance claims because the execution predictor is dummy.

TPOT: Decode time per output token. Tools differ on whether they report total decode normalized by output tokens, inter-token latency, or a configured SLA target. Use each tool's native field only within that tool unless calibrated against the same serving definition.

E2E latency: Completion time minus arrival time for one request.

Throughput: Completed tokens or requests per unit time. AIConfigurator reports estimated tokens/s style capacity; Frontier/Vidur report simulated metrics. RS1 Frontier throughput is plumbing-only because compute is dummy.
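
To make the TPOT ambiguity concrete, the sketch below derives the request-level metrics from three timestamps. It is illustrative only: `first_token_at` and `completed_at` are hypothetical field names (only `arrived_at` appears in the trace format), and it assumes `completed_at` is the time of the last generated token. The two TPOT variants are one plausible reading of "decode normalized by output tokens" versus "inter-token latency", not any tool's confirmed definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTiming:
    arrived_at: float
    first_token_at: float   # hypothetical field name
    completed_at: float     # hypothetical; assumed to be the last-token time
    num_decode_tokens: int


def ttft(t: RequestTiming) -> float:
    """Time from arrival to the first generated token."""
    return t.first_token_at - t.arrived_at


def e2e_latency(t: RequestTiming) -> float:
    """Completion time minus arrival time."""
    return t.completed_at - t.arrived_at


def tpot_decode_normalized(t: RequestTiming) -> float:
    """Decode span divided by the number of output tokens."""
    if t.num_decode_tokens <= 0:
        return 0.0
    return (t.completed_at - t.first_token_at) / t.num_decode_tokens


def tpot_inter_token(t: RequestTiming) -> float:
    """Decode span divided by the number of gaps between output tokens."""
    gaps = t.num_decode_tokens - 1
    if gaps <= 0:
        return 0.0
    return (t.completed_at - t.first_token_at) / gaps


timing = RequestTiming(arrived_at=0.0, first_token_at=0.5,
                       completed_at=2.5, num_decode_tokens=5)
print(ttft(timing), e2e_latency(timing),
      tpot_decode_normalized(timing), tpot_inter_token(timing))
```

For the same request the two TPOT variants differ (2.0 s over 5 tokens versus 2.0 s over 4 gaps), which is why native TPOT fields should not be compared across tools without a shared definition.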

KV-cache hit ratio:

Frontier native block-level ratio: sum(request_prefix_cache_hit_blocks) / sum(request_prefix_cache_query_blocks).
ReplayServe token-weighted ratio: use sidecar block_token_counts and count the first request_prefix_cache_hit_blocks blocks by true token count, so a partial final block contributes its actual token count instead of always 16.

For coder_2000 patched, both ratios are computed only for request rows with complete cache fields because 745 request metric rows have blank cache fields. This is a metrics completeness caveat, not evidence that the trace has invalid hashes.
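
The two hit-ratio definitions can be sketched as follows. This is a hedged illustration, not the ReplayServe postprocess code: rows are plain dicts, rows with blank cache fields are represented by `None` and skipped, and the token-weighted denominator (true token counts of the queried blocks) is an assumption, since the text above only specifies the numerator.

```python
def complete_rows(rows):
    """Keep only rows whose cache fields are populated (not None)."""
    return [
        r for r in rows
        if r["request_prefix_cache_query_blocks"] is not None
        and r["request_prefix_cache_hit_blocks"] is not None
    ]


def block_hit_ratio(rows):
    """Native block-level ratio: sum(hit blocks) / sum(query blocks)."""
    rows = complete_rows(rows)
    query = sum(r["request_prefix_cache_query_blocks"] for r in rows)
    hit = sum(r["request_prefix_cache_hit_blocks"] for r in rows)
    return hit / query if query else 0.0


def token_weighted_hit_ratio(rows):
    """Weight each block by its true token count from the sidecar."""
    hit_tokens = query_tokens = 0
    for r in complete_rows(rows):
        counts = r["block_token_counts"]
        # The first hit_blocks blocks count as hits, by true token count.
        hit_tokens += sum(counts[: r["request_prefix_cache_hit_blocks"]])
        # Assumed denominator: tokens in the queried blocks.
        query_tokens += sum(counts[: r["request_prefix_cache_query_blocks"]])
    return hit_tokens / query_tokens if query_tokens else 0.0


# Made-up rows with block size 16; the final block of each request is partial.
rows = [
    {"request_prefix_cache_query_blocks": 4, "request_prefix_cache_hit_blocks": 2,
     "block_token_counts": [16, 16, 16, 5]},
    {"request_prefix_cache_query_blocks": 2, "request_prefix_cache_hit_blocks": 2,
     "block_token_counts": [16, 8]},
    {"request_prefix_cache_query_blocks": None, "request_prefix_cache_hit_blocks": None,
     "block_token_counts": [16, 16]},
]
print(block_hit_ratio(rows), token_weighted_hit_ratio(rows))
```

In this made-up sample the block-level ratio is 4/6 and the token-weighted ratio is 56/77; the two diverge because partial final blocks carry fewer than 16 tokens.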

Non-Comparable Items

Frontier canonical and patched scratch are not equivalent artifacts. The patched result demonstrates an RS1 unblock path; it is not an upstream Frontier release.
Frontier/Vidur simulator timings and AIConfigurator estimator timings are not directly comparable without shared profiles, calibration, and metric definitions.
Prefix-reuse fidelity is not comparable across all three tools. Only Frontier currently consumes explicit block hash traces in the inspected checkouts.
AIConfigurator's prefix workload parameter is not the same as ReplayServe block_hash_ids; it cannot recover session-level sharing or partial-block token accounting.

Conclusions

No open-source implementation satisfies ReplayServe's target out of the box.

Frontier is the closest because it supports online trace replay, prefix-cache metadata, vLLM v1 style scheduler controls, memory planning, and request/system metrics. It still does not work out of the box for ReplayServe: RS0 needed an adapter, RS1B needed a local patch or upstream fix for prefix cache plus chunked prefill, and performance-quality claims need Qwen3-32B A800 profiles/calibration.

Vidur can be a useful arrival-plus-length baseline, but not a faithful prefix KV reuse replay engine in the inspected checkout.

AIConfigurator can guide candidate deployment/config choices, but it is a workload-summary estimator rather than a per-request simulator.

Frontier also does not support arbitrary combinations of model, hardware, and config in a performance-reliable sense. A model/device config may be accepted syntactically, but fidelity depends on compute profiles, network profiles, scheduler support, parallelism semantics, memory-planner assumptions, and bug surface. For RS1, A800 network profiles exist, but the public checkout does not provide dense Qwen3-32B A800 compute profiles, so latency and throughput values remain plumbing-smoke only.

Next Steps

RS3 sweep prerequisites:

Decide whether RS3 uses the local RS1B patch, waits for upstream Frontier, or carries both canonical and patched modes explicitly.
Keep fixed-config smoke and any sweep configs separate from performance claims.
Add a small run manifest/check script that records Frontier commit, patch status, fixture, command, and metric completeness.
Treat the coder_2000 blank cache fields as a metrics issue to investigate before using request-level hit ratios as a headline metric.
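
The run manifest suggested above might take a shape like the following. This is a sketch under stated assumptions, not an existing ReplayServe script: `RunManifest` and its field names are hypothetical, and the command entry is a placeholder because the actual replay command line is not recorded in this report.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of what a run used and how complete its metrics are."""
    frontier_commit: str
    patch: str                 # empty string means the canonical checkout
    fixture: str
    command: List[str]
    rows_total: int
    rows_with_cache_metrics: int

    @property
    def cache_metrics_complete(self) -> bool:
        return self.rows_total == self.rows_with_cache_metrics

    def to_json(self) -> str:
        record = asdict(self)
        record["cache_metrics_complete"] = self.cache_metrics_complete
        return json.dumps(record, indent=2, sort_keys=True)


manifest = RunManifest(
    frontier_commit="d9cfeb6d8791fbf2f295dd9744c56a666171776e",
    patch="patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch",
    fixture="coder_2000",
    command=["<replay command placeholder>"],  # placeholder, not a real command
    rows_total=2000,
    rows_with_cache_metrics=1255,
)
print(manifest.to_json())
```

Writing one such record next to each run directory would make the canonical versus patched distinction and the metric-completeness caveat visible without rereading logs.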

RS4 calibration prerequisites:

Collect or obtain dense Qwen/Qwen3-32B A800 compute profiles for the Frontier predictor path.
Verify the A800 network profile and node SKU semantics match the target deployment.
Add non-KV memory overhead assumptions from a real serving stack instead of using 0.
Validate simulator TTFT/TPOT/E2E/throughput against measured vLLM runs before making performance conclusions.

Patch path recommendation:

Keep patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch pinned in ReplayServe until an upstream Frontier fix is available; it now covers both the original prefix-cache/chunked-prefill preemption bug and the RS10 decode-phase preemption lifecycle bug.
Open an upstream issue or PR with the RS1B minimal repro (N=193) and evidence from docs/rs1_frontier_blocker.md.
Re-run coder_100, N=193, and coder_2000 when changing Frontier commit or patch status.
