ReplayServe

ReplayServe is a small trace-replay workspace for reproducing real LLM serving traces in open simulators. The first target is Frontier trace replay with timestamp, prompt length, decode length, session, and prefix block reuse preserved from the Qwen Bailian anonymized JSONL traces.

RS0 only bootstraps the repository, documents source versions, implements the Qwen JSONL to Frontier CSV adapter semantics, and creates canonical fixtures. It does not run the Frontier simulator; RS1 owns simulator smoke runs.

First Frontier Smoke Point

The first Frontier smoke run is pinned to this plumbing-only configuration:

simulation_mode=online
sys_arch=co-location
replica_scheduler=vllm_v1
device=a800
model_name=Qwen/Qwen3-32B
attn_tensor_parallel_size=2
dummy execution predictor
analytical communication backend
trace_request_generator_config_max_tokens=32768
prefix cache enabled
block size 16
chunked prefill enabled
batch cap 128
max batch tokens 32768
KV capacity estimated by Frontier memory planner

Frontier currently has A800 network profiles, but the checked public A800 compute profiles do not include dense Qwen/Qwen3-32B. Latency and throughput numbers that RS1 produces at this point are therefore plumbing checks only and should not be read as profile-faithful performance conclusions.

Real vLLM GPU Baseline

RS4 starts a real-backend baseline on dash2 H20 with /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B. The runner tools/vllm_synthetic_replay.py uses synthetic prompt_token_ids derived from the Qwen block hashes, so equal trace blocks become equal token blocks and vLLM prefix-cache hits can be observed directly.
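
The idea of deriving token blocks from block hashes can be illustrated with a minimal sketch. This is not the code in tools/vllm_synthetic_replay.py: the function names, the use of a seeded random generator, the integer hash IDs, and the vocab_size value are all assumptions made for illustration. The only property it aims to show is that equal hash IDs expand to equal 16-token blocks.

```python
import random


def tokens_for_block(hash_id: int, block_size: int = 16, vocab_size: int = 32000) -> list[int]:
    """Deterministically expand one block hash into block_size token IDs.

    vocab_size is a placeholder (assumption); a real run would need IDs that
    are valid for the served model's tokenizer.
    """
    rng = random.Random(hash_id)  # seeded by the hash: equal hashes give equal tokens
    return [rng.randrange(vocab_size) for _ in range(block_size)]


def synthetic_prompt_token_ids(hash_ids, input_length, block_size=16, vocab_size=32000):
    """Concatenate per-block tokens and trim the padded final block."""
    tokens = []
    for hash_id in hash_ids:
        tokens.extend(tokens_for_block(hash_id, block_size, vocab_size))
    return tokens[:input_length]


# Two prompts that share their first two blocks share their first 32 tokens.
a = synthetic_prompt_token_ids([101, 102, 103], input_length=40)
b = synthetic_prompt_token_ids([101, 102, 999], input_length=40)
print(a[:32] == b[:32])
```

Because each block is generated independently from its own hash, a shared run of leading hash IDs always yields a shared token prefix, which is the condition a prefix cache needs in order to register a hit.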

See docs/rs4_vllm_gpu_smoke.md for the first TP=1 and TP=2 smoke results and the reliability boundary.

Adapter Semantics

tools/qwen_to_frontier.py converts Qwen JSONL rows to Frontier CSV rows and a ReplayServe sidecar JSONL.

Field mapping:

Qwen JSONL	Frontier CSV	Notes
`timestamp`	`arrived_at`	Trace-relative seconds.
`input_length`	`num_prefill_tokens`	Serving input length, already measured after the chat template is applied.
`output_length`	`num_decode_tokens`	Generation length.
`chat_id`	`session_id`	Preserved for session-aware analysis.
`hash_ids`	`block_hash_ids`	Joined with `
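
As a rough sketch of the mapping in the table, the conversion of one record could look like the following. The function name qwen_row_to_frontier is hypothetical, and the separator used to join the hash IDs is left as a required argument rather than assumed; the actual behavior is defined by tools/qwen_to_frontier.py.

```python
def qwen_row_to_frontier(row: dict, hash_sep: str) -> dict:
    """Map one Qwen JSONL record to the Frontier CSV columns from the table."""
    return {
        "arrived_at": row["timestamp"],               # trace-relative seconds
        "num_prefill_tokens": row["input_length"],    # prompt length, unclipped
        "num_decode_tokens": row["output_length"],    # generation length, unclipped
        "session_id": row["chat_id"],
        # hash_sep is an assumption supplied by the caller, not a documented default.
        "block_hash_ids": hash_sep.join(str(h) for h in row["hash_ids"]),
    }
```

The resulting dictionaries can be written with csv.DictWriter from the standard library, using the five keys as the header row.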

The Qwen trace uses 16-token salted SipHash blocks. The adapter asserts len(hash_ids) == ceil(input_length / block_size). The final block can be a padded partial block; its true token count is input_length % block_size, or block_size when the prompt length is divisible by the block size. The sidecar records block_token_counts so downstream analyses can compute token-weighted prefix-cache accounting while Frontier replays the original block hashes.
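
The block accounting described above can be sketched in a few lines. This is an illustration of the stated invariant, not the adapter's source; the function name is hypothetical and simply mirrors the block_token_counts sidecar field.

```python
import math


def block_token_counts(input_length: int, num_hash_ids: int, block_size: int = 16) -> list[int]:
    """Return the true token count of every block for one trace row."""
    expected_blocks = math.ceil(input_length / block_size)
    # Same invariant the adapter asserts: one hash per (possibly partial) block.
    assert num_hash_ids == expected_blocks, (num_hash_ids, expected_blocks)
    if expected_blocks == 0:
        return []
    # The final block holds the remainder, or a full block when divisible.
    last = input_length % block_size or block_size
    return [block_size] * (expected_blocks - 1) + [last]


print(block_token_counts(37, 3))  # [16, 16, 5]
print(block_token_counts(32, 2))  # [16, 16]
```

The counts always sum to input_length, which is what makes token-weighted prefix-cache accounting possible even though the hashes themselves describe padded blocks.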

Overflow handling is intentionally explicit. The adapter never clips prompt or decode tokens. With --fail-on-overflow, any row where input_length + output_length > --max-tokens exits with an error before publishing output files.
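
A minimal sketch of this check, under the assumption that all rows are scanned before any file is written. The helper names are hypothetical, and what the adapter does with overflowing rows when --fail-on-overflow is absent is not specified here; the sketch simply returns them.

```python
def find_overflow_rows(rows, max_tokens):
    """Return (index, total_tokens) for rows whose prompt + decode exceeds max_tokens."""
    overflows = []
    for index, row in enumerate(rows):
        total = row["input_length"] + row["output_length"]
        if total > max_tokens:
            overflows.append((index, total))
    return overflows


def check_overflow(rows, max_tokens, fail_on_overflow):
    """Exit with an error before any output is produced when fail_on_overflow is set."""
    overflows = find_overflow_rows(rows, max_tokens)
    if overflows and fail_on_overflow:
        raise SystemExit(
            f"{len(overflows)} row(s) exceed max_tokens={max_tokens}; first: {overflows[0]}"
        )
    return overflows  # rows are reported, never clipped
```

Running the check over the full slice first, and writing outputs only afterwards, is one straightforward way to guarantee that a failed run leaves no partial fixture behind.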

Example:

python3 tools/qwen_to_frontier.py \
  --input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
  --frontier-csv traces/fixtures/coder_100/frontier.csv \
  --sidecar-jsonl traces/fixtures/coder_100/sidecar.jsonl \
  --source-jsonl traces/fixtures/coder_100/source.jsonl \
  --manifest-json traces/fixtures/coder_100/manifest.json \
  --fixture-name coder_100 \
  --limit 100 \
  --max-tokens 32768 \
  --block-size 16 \
  --fail-on-overflow

Validate fixtures:

python3 tools/validate_fixtures.py \
  traces/fixtures/coder_100 \
  traces/fixtures/coder_2000 \
  --max-tokens 32768 \
  --block-size 16

Fixture Layout

Each fixture directory under traces/fixtures/ contains:

source.jsonl: the original Qwen JSONL slice.
frontier.csv: Frontier trace replay CSV.
sidecar.jsonl: ReplayServe metadata with original fields and block token counts.
manifest.json: generation parameters and basic stats.
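
A quick way to confirm that a fixture directory has this layout is sketched below. It only checks that the four files exist; it is not the logic of tools/validate_fixtures.py, and the helper name is hypothetical.

```python
from pathlib import Path

EXPECTED_FILES = ("source.jsonl", "frontier.csv", "sidecar.jsonl", "manifest.json")


def missing_fixture_files(fixture_dir):
    """Return the expected file names that are absent from fixture_dir."""
    root = Path(fixture_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Report any incomplete fixture directories under traces/fixtures/.
    for fixture in sorted(Path("traces/fixtures").glob("*")):
        if fixture.is_dir():
            print(fixture.name, missing_fixture_files(fixture) or "complete")
```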
