ReplayServe

ReplayServe is a small trace-replay workspace for reproducing real LLM serving traces in open simulators. The first target is Frontier trace replay with timestamp, prompt length, decode length, session, and prefix block reuse preserved from the Qwen Bailian anonymized JSONL traces.

RS0 only bootstraps the repository, documents source versions, implements the Qwen JSONL to Frontier CSV adapter semantics, and creates canonical fixtures. It does not run the Frontier simulator; RS1 owns simulator smoke runs.

First Frontier Smoke Point

The first Frontier smoke run is fixed to this plumbing-only configuration:

  • simulation_mode=online
  • sys_arch=co-location
  • replica_scheduler=vllm_v1
  • device=a800
  • model_name=Qwen/Qwen3-32B
  • attn_tensor_parallel_size=2
  • dummy execution predictor
  • analytical communication backend
  • trace_request_generator_config_max_tokens=32768
  • prefix cache enabled
  • block size 16
  • chunked prefill enabled
  • batch cap 128
  • max batch tokens 32768
  • KV capacity estimated by Frontier memory planner

Frontier currently has A800 network profiles, but the public A800 compute profiles that were checked do not include dense Qwen/Qwen3-32B. Latency and throughput numbers that RS1 reports at this smoke point therefore validate plumbing only; they are not profile-faithful performance conclusions.

Real vLLM GPU Baseline

RS4 starts a real-backend baseline on dash2 H20 with /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B. The runner tools/vllm_synthetic_replay.py uses synthetic prompt_token_ids derived from the Qwen block hashes, so equal trace blocks become equal token blocks and vLLM prefix-cache hits can be observed directly.
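
The idea of turning block hashes into synthetic token blocks can be illustrated with a minimal sketch. This is not the code in tools/vllm_synthetic_replay.py, whose derivation is not described here. The helper names, the SHA-256 based mapping, the vocab_size of 32000, and the reserved low-ID range are all assumptions chosen for illustration. The only property the sketch aims for is the one stated above: equal block hashes produce equal token blocks.

```python
import hashlib


def tokens_for_block(hash_id, block_size=16, vocab_size=32000, reserved=1000):
    """Map one block hash to a deterministic block of token ids.

    Hypothetical helper: vocab_size and reserved are illustrative values,
    not taken from the real runner or from any tokenizer.
    """
    span = vocab_size - reserved
    tokens = []
    for pos in range(block_size):
        digest = hashlib.sha256(f"{hash_id}:{pos}".encode()).digest()
        # Keep ids inside [reserved, vocab_size) to stay clear of low special ids.
        tokens.append(reserved + int.from_bytes(digest[:8], "big") % span)
    return tokens


def synthetic_prompt_token_ids(hash_ids, input_length, block_size=16):
    """Concatenate per-block tokens and trim the padded final block."""
    ids = []
    for hash_id in hash_ids:
        ids.extend(tokens_for_block(hash_id, block_size))
    return ids[:input_length]


if __name__ == "__main__":
    a = synthetic_prompt_token_ids([101, 102, 103], input_length=40)
    b = synthetic_prompt_token_ids([101, 102, 999], input_length=48)
    # The first two blocks share hashes, so the first 32 token ids match.
    print(a[:32] == b[:32], len(a), len(b))
```

Because the mapping depends only on the hash value and the position inside the block, two requests that share leading hash_ids also share a leading token prefix, which is the condition a token-level prefix cache needs in order to register a hit.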

See docs/rs4_vllm_gpu_smoke.md for the first TP=1/2 smoke results and the reliability boundary.

Adapter Semantics

tools/qwen_to_frontier.py converts Qwen JSONL rows to Frontier CSV rows and a ReplayServe sidecar JSONL.

Field mapping:

Qwen JSONL     Frontier CSV        Notes
timestamp      arrived_at          Trace-relative seconds.
input_length   num_prefill_tokens  Already post chat-template serving input.
output_length  num_decode_tokens   Generation length.
chat_id        session_id          Preserved for session-aware analysis.
hash_ids       block_hash_ids      Joined with `
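
The field mapping above can be sketched as a small row converter. This is an illustration, not the code in tools/qwen_to_frontier.py. The function name qwen_row_to_frontier is hypothetical, the column order is assumed to follow the table, and the join delimiter for block_hash_ids is taken as a required argument because this document does not fix it. Frontier may expect additional columns that are not listed here.

```python
import csv
import io

# Column order follows the mapping table; the real CSV layout is an assumption.
FRONTIER_COLUMNS = [
    "arrived_at",
    "num_prefill_tokens",
    "num_decode_tokens",
    "session_id",
    "block_hash_ids",
]


def qwen_row_to_frontier(row, hash_sep):
    """Rename Qwen JSONL fields to Frontier CSV fields (hypothetical helper)."""
    return {
        "arrived_at": row["timestamp"],
        "num_prefill_tokens": row["input_length"],
        "num_decode_tokens": row["output_length"],
        "session_id": row["chat_id"],
        # hash_sep is caller-supplied; the actual delimiter is not assumed here.
        "block_hash_ids": hash_sep.join(str(h) for h in row["hash_ids"]),
    }


if __name__ == "__main__":
    qwen_row = {
        "timestamp": 1.25,
        "input_length": 35,
        "output_length": 120,
        "chat_id": 42,
        "hash_ids": [11, 12, 13],
    }
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FRONTIER_COLUMNS)
    writer.writeheader()
    # ";" is an arbitrary delimiter picked only for this demonstration.
    writer.writerow(qwen_row_to_frontier(qwen_row, hash_sep=";"))
    print(buffer.getvalue())
```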

The Qwen trace uses 16-token salted SipHash blocks. The adapter asserts len(hash_ids) == ceil(input_length / block_size). The final block can be a padded partial block; its true token count is input_length % block_size, or block_size when the prompt length is divisible by the block size. The sidecar records block_token_counts so downstream analyses can compute token-weighted prefix-cache accounting while Frontier replays the original block hashes.
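
The block-count assertion and the final-block rule can be written out as a short sketch. The function names below are hypothetical and are not taken from the adapter. prefix_hit_tokens shows one possible token-weighted accounting, under the assumption that a prefix-cache hit covers the longest run of leading blocks already present in the cache.

```python
import math


def block_token_counts(input_length, num_blocks, block_size=16):
    """Return the true token count of each block for one request."""
    expected = math.ceil(input_length / block_size)
    if num_blocks != expected:
        raise ValueError(
            f"expected {expected} blocks for {input_length} tokens, got {num_blocks}"
        )
    if num_blocks == 0:
        return []
    # The final block holds the remainder, or a full block when the length divides evenly.
    tail = input_length % block_size or block_size
    return [block_size] * (num_blocks - 1) + [tail]


def prefix_hit_tokens(hash_ids, counts, cached_hashes):
    """Sum the tokens of the leading blocks whose hashes are already cached."""
    hit = 0
    for hash_id, count in zip(hash_ids, counts):
        if hash_id not in cached_hashes:
            break
        hit += count
    return hit


if __name__ == "__main__":
    counts = block_token_counts(input_length=35, num_blocks=3)
    print(counts)  # [16, 16, 3]
    print(prefix_hit_tokens([11, 12, 13], counts, cached_hashes={11, 12}))  # 32
```

Counting tokens rather than blocks matters only for the final block: treating a 3-token tail as a full 16-token block would overstate the cached share of a 35-token prompt.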

Overflow handling is intentionally explicit. The adapter never clips prompt or decode tokens. With --fail-on-overflow, any row where input_length + output_length > --max-tokens exits with an error before publishing output files.
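
A minimal sketch of the overflow rule follows. The function names are hypothetical, and the behavior without --fail-on-overflow is not specified in this document, so the sketch only reports the offending row indices in that case. The check runs over all rows before anything would be written, matching the "before publishing output files" ordering.

```python
def find_overflow_rows(rows, max_tokens):
    """Return indices of rows where input_length + output_length > max_tokens."""
    return [
        index
        for index, row in enumerate(rows)
        if row["input_length"] + row["output_length"] > max_tokens
    ]


def check_overflow(rows, max_tokens, fail_on_overflow):
    """Run the check up front; exit with an error when asked to fail on overflow."""
    overflow = find_overflow_rows(rows, max_tokens)
    if overflow and fail_on_overflow:
        raise SystemExit(
            f"{len(overflow)} row(s) exceed max_tokens={max_tokens}: {overflow[:10]}"
        )
    # No clipping happens here; rows are left untouched either way.
    return overflow


if __name__ == "__main__":
    rows = [
        {"input_length": 30000, "output_length": 2768},  # exactly 32768, allowed
        {"input_length": 32000, "output_length": 1000},  # 33000, overflow
    ]
    print(check_overflow(rows, max_tokens=32768, fail_on_overflow=False))  # [1]
```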

Example:

python3 tools/qwen_to_frontier.py \
  --input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
  --frontier-csv traces/fixtures/coder_100/frontier.csv \
  --sidecar-jsonl traces/fixtures/coder_100/sidecar.jsonl \
  --source-jsonl traces/fixtures/coder_100/source.jsonl \
  --manifest-json traces/fixtures/coder_100/manifest.json \
  --fixture-name coder_100 \
  --limit 100 \
  --max-tokens 32768 \
  --block-size 16 \
  --fail-on-overflow

Validate fixtures:

python3 tools/validate_fixtures.py \
  traces/fixtures/coder_100 \
  traces/fixtures/coder_2000 \
  --max-tokens 32768 \
  --block-size 16

Fixture Layout

Each fixture directory under traces/fixtures/ contains:

  • source.jsonl: the original Qwen JSONL slice.
  • frontier.csv: Frontier trace replay CSV.
  • sidecar.jsonl: ReplayServe metadata with original fields and block token counts.
  • manifest.json: generation parameters and basic stats.
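
A quick layout check for a fixture directory could look like the sketch below. It only confirms that the four files listed above exist. It does not replace tools/validate_fixtures.py, and the helper name missing_fixture_files is hypothetical.

```python
from pathlib import Path

EXPECTED_FILES = ("source.jsonl", "frontier.csv", "sidecar.jsonl", "manifest.json")


def missing_fixture_files(fixture_dir):
    """Return the expected file names that are absent from fixture_dir."""
    root = Path(fixture_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_fixture_files("traces/fixtures/coder_100")
    print("missing:", missing if missing else "none")
```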