ReplayServe
ReplayServe is a small trace-replay workspace for reproducing real LLM serving traces in open simulators. The first target is trace replay in Frontier that preserves the timestamp, prompt length, decode length, session, and prefix block reuse recorded in the Qwen Bailian anonymized JSONL traces.
RS0 only bootstraps the repository, documents source versions, implements the Qwen JSONL to Frontier CSV adapter semantics, and creates canonical fixtures. It does not run the Frontier simulator; RS1 owns simulator smoke runs.
First Frontier Smoke Point
The first Frontier smoke is fixed to this plumbing-only configuration:
- simulation_mode=online
- sys_arch=co-location
- replica_scheduler=vllm_v1
- device=a800
- model_name=Qwen/Qwen3-32B
- attn_tensor_parallel_size=2
- dummy execution predictor
- analytical communication backend
- trace_request_generator_config_max_tokens=32768
- prefix cache enabled
- block size 16
- chunked prefill enabled
- batch cap 128
- max batch tokens 32768
- KV capacity estimated by Frontier memory planner
Frontier currently has A800 network profiles, but the public A800 compute
profiles we checked do not include dense Qwen/Qwen3-32B. RS1 latency and
throughput numbers from this point are therefore plumbing smoke checks only,
not profile-faithful performance conclusions.
Real vLLM GPU Baseline
RS4 starts a real-backend baseline on dash2 H20 with
/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B. The runner
tools/vllm_synthetic_replay.py uses synthetic prompt_token_ids derived from
the Qwen block hashes, so equal trace blocks become equal token blocks and vLLM
prefix-cache hits can be observed directly.
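The idea of deriving prompt tokens from block hashes can be sketched as follows. This is a minimal illustration, not the logic of tools/vllm_synthetic_replay.py: the function names, the `vocab_size` default, and the seeded-RNG scheme are all assumptions made for the example. The only property it aims to show is that equal hash IDs always expand to equal token blocks.

```python
import random


def block_tokens(hash_id: int, block_size: int = 16,
                 vocab_size: int = 32000, seed: int = 0) -> list[int]:
    """Expand one block hash into a deterministic block of token IDs.

    Hypothetical scheme: seed a private RNG with the hash ID, so the same
    hash always yields the same tokens. vocab_size is a placeholder value.
    """
    rng = random.Random(f"{seed}:{hash_id}")  # string seeds are stable across runs
    return [rng.randrange(vocab_size) for _ in range(block_size)]


def synthetic_prompt(hash_ids: list[int], input_length: int,
                     block_size: int = 16, vocab_size: int = 32000) -> list[int]:
    """Concatenate per-block tokens and trim the padded final block."""
    tokens: list[int] = []
    for hash_id in hash_ids:
        tokens.extend(block_tokens(hash_id, block_size, vocab_size))
    # The last trace block may be partial, so cut to the true prompt length.
    return tokens[:input_length]


if __name__ == "__main__":
    a = synthetic_prompt([11, 22, 33], input_length=40)
    b = synthetic_prompt([11, 22, 99], input_length=48)
    # The two prompts share their first two blocks (32 tokens).
    print(a[:32] == b[:32], len(a), len(b))
```

Because each block depends only on its own hash ID, two requests that share a hash prefix in the trace also share a token prefix, which is the condition a block-based prefix cache needs in order to hit.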
See docs/rs4_vllm_gpu_smoke.md for the first TP=1/2 smoke results and the
reliability boundary.
Adapter Semantics
tools/qwen_to_frontier.py converts Qwen JSONL rows to Frontier CSV rows and a
ReplayServe sidecar JSONL.
Field mapping:
| Qwen JSONL | Frontier CSV | Notes |
|---|---|---|
| timestamp | arrived_at | Trace-relative seconds. |
| input_length | num_prefill_tokens | Serving input length, already measured after the chat template is applied. |
| output_length | num_decode_tokens | Generation length. |
| chat_id | session_id | Preserved for session-aware analysis. |
| hash_ids | block_hash_ids | Joined into a single delimited field. |
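The field mapping can be illustrated with a short sketch. It is not the code of tools/qwen_to_frontier.py: the helper names, the CSV column order, and in particular the `HASH_JOINER` character are assumptions, since the actual delimiter is not stated here.

```python
import csv
import io

# Qwen JSONL field -> Frontier CSV column, as listed in the table.
FIELD_MAP = {
    "timestamp": "arrived_at",
    "input_length": "num_prefill_tokens",
    "output_length": "num_decode_tokens",
    "chat_id": "session_id",
}
HASH_JOINER = "|"  # ASSUMPTION: placeholder delimiter for the joined hash IDs
COLUMNS = list(FIELD_MAP.values()) + ["block_hash_ids"]  # assumed column order


def to_frontier_row(qwen_row: dict) -> dict:
    """Rename the scalar fields and join hash_ids into one string."""
    row = {dst: qwen_row[src] for src, dst in FIELD_MAP.items()}
    row["block_hash_ids"] = HASH_JOINER.join(str(h) for h in qwen_row["hash_ids"])
    return row


def rows_to_csv(qwen_rows: list[dict]) -> str:
    """Render converted rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for qwen_row in qwen_rows:
        writer.writerow(to_frontier_row(qwen_row))
    return buf.getvalue()


if __name__ == "__main__":
    sample = {"timestamp": 1.5, "input_length": 40, "output_length": 12,
              "chat_id": 7, "hash_ids": [11, 22, 33]}
    print(rows_to_csv([sample]))
```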
The Qwen trace uses 16-token salted SipHash blocks. The adapter asserts
len(hash_ids) == ceil(input_length / block_size). The final block can be a
padded partial block; its true token count is input_length % block_size, or
block_size when the prompt length is divisible by the block size. The sidecar
records block_token_counts so downstream analyses can compute token-weighted
prefix-cache accounting while Frontier replays the original block hashes.
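The block-count assertion and the final-block rule above can be written out as a small sketch. The function name and the handling of an empty prompt are assumptions; only the `ceil` check and the `input_length % block_size` rule come from the text.

```python
import math


def block_token_counts(input_length: int, num_hash_ids: int,
                       block_size: int = 16) -> list[int]:
    """Return the true token count of every block in one request."""
    expected_blocks = math.ceil(input_length / block_size)
    if num_hash_ids != expected_blocks:
        raise ValueError(
            f"expected {expected_blocks} blocks for input_length={input_length}, "
            f"got {num_hash_ids}"
        )
    if num_hash_ids == 0:
        return []  # assumption: an empty prompt has no blocks
    # A remainder of 0 means the final block is completely full.
    last = input_length % block_size or block_size
    return [block_size] * (num_hash_ids - 1) + [last]


if __name__ == "__main__":
    print(block_token_counts(40, 3))  # [16, 16, 8]
    print(block_token_counts(32, 2))  # [16, 16]
```

The counts always sum to input_length, which is what makes token-weighted hit accounting possible: a hit on a padded final block is credited with its true token count rather than a full 16.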
Overflow handling is intentionally explicit. The adapter never clips prompt or
decode tokens. With --fail-on-overflow, any row where
input_length + output_length > --max-tokens exits with an error before
publishing output files.
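A minimal sketch of the check-before-publish behavior described above, assuming rows are already parsed into dicts. The function names are hypothetical and the real adapter may report errors differently; the point is that all rows are checked before any output is written and that no row is modified.

```python
def find_overflow_rows(rows: list[dict], max_tokens: int) -> list[int]:
    """Return indices of rows whose prompt plus decode length exceeds the cap."""
    return [
        i for i, row in enumerate(rows)
        if row["input_length"] + row["output_length"] > max_tokens
    ]


def check_overflow(rows: list[dict], max_tokens: int) -> None:
    """Raise before any output is written if a row overflows; never clip."""
    bad = find_overflow_rows(rows, max_tokens)
    if bad:
        raise ValueError(
            f"{len(bad)} row(s) exceed max_tokens={max_tokens}; first index: {bad[0]}"
        )


if __name__ == "__main__":
    rows = [
        {"input_length": 30000, "output_length": 2000},  # 32000, fits
        {"input_length": 32000, "output_length": 1000},  # 33000, overflows
    ]
    print(find_overflow_rows(rows, max_tokens=32768))  # [1]
```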
Example:
python3 tools/qwen_to_frontier.py \
--input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
--frontier-csv traces/fixtures/coder_100/frontier.csv \
--sidecar-jsonl traces/fixtures/coder_100/sidecar.jsonl \
--source-jsonl traces/fixtures/coder_100/source.jsonl \
--manifest-json traces/fixtures/coder_100/manifest.json \
--fixture-name coder_100 \
--limit 100 \
--max-tokens 32768 \
--block-size 16 \
--fail-on-overflow
Validate fixtures:
python3 tools/validate_fixtures.py \
traces/fixtures/coder_100 \
traces/fixtures/coder_2000 \
--max-tokens 32768 \
--block-size 16
Fixture Layout
Each fixture directory under traces/fixtures/ contains:
- source.jsonl: the original Qwen JSONL slice.
- frontier.csv: Frontier trace replay CSV.
- sidecar.jsonl: ReplayServe metadata with original fields and block token counts.
- manifest.json: generation parameters and basic stats.
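As a quick illustration of this layout, the sketch below reports which of the four files are missing from a fixture directory. It is a hypothetical helper, not what tools/validate_fixtures.py does; that script's actual checks are not described here.

```python
from pathlib import Path

# The four files each fixture directory is expected to contain.
REQUIRED_FILES = ("source.jsonl", "frontier.csv", "sidecar.jsonl", "manifest.json")


def missing_fixture_files(fixture_dir: str) -> list[str]:
    """Return the required file names that are absent from fixture_dir."""
    base = Path(fixture_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_fixture_files("traces/fixtures/coder_100")
    print("missing:", missing if missing else "none")
```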