# ReplayServe

ReplayServe is a small trace-replay workspace for reproducing real LLM serving traces in open simulators. The first target is Frontier trace replay with timestamp, prompt length, decode length, session, and prefix block reuse preserved from the Qwen Bailian anonymized JSONL traces.

RS0 only bootstraps the repository, documents source versions, implements the Qwen JSONL to Frontier CSV adapter semantics, and creates canonical fixtures. It does not run the Frontier simulator; RS1 owns simulator smoke runs.

## First Frontier Smoke Point

The first Frontier smoke is fixed to this plumbing-only configuration:

- `simulation_mode=online`
- `sys_arch=co-location`
- `replica_scheduler=vllm_v1`
- `device=a800`
- `model_name=Qwen/Qwen3-32B`
- `attn_tensor_parallel_size=2`
- dummy execution predictor
- analytical communication backend
- `trace_request_generator_config_max_tokens=32768`
- prefix cache enabled
- block size 16
- chunked prefill enabled
- batch cap 128
- max batch tokens 32768
- KV capacity estimated by Frontier memory planner

Frontier currently has A800 network profiles, but the checked public A800 compute profiles do not include dense `Qwen/Qwen3-32B`. RS1 latency and throughput numbers from this point are therefore plumbing smoke only, not profile-faithful performance conclusions.

## Real vLLM GPU Baseline

RS4 starts a real-backend baseline on dash2 H20 with `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`. The runner `tools/vllm_synthetic_replay.py` uses synthetic `prompt_token_ids` derived from the Qwen block hashes, so equal trace blocks become equal token blocks and vLLM prefix-cache hits can be observed directly.

See `docs/rs4_vllm_gpu_smoke.md` for the first TP=1/2 smoke results and the reliability boundary.

## Adapter Semantics

`tools/qwen_to_frontier.py` converts Qwen JSONL rows to Frontier CSV rows and a ReplayServe sidecar JSONL.

Field mapping:

| Qwen JSONL | Frontier CSV | Notes |
|---|---|---|
| `timestamp` | `arrived_at` | Trace-relative seconds. |
| `input_length` | `num_prefill_tokens` | Already post chat-template serving input. |
| `output_length` | `num_decode_tokens` | Generation length. |
| `chat_id` | `session_id` | Preserved for session-aware analysis. |
| `hash_ids` | `block_hash_ids` | Joined with `\|` for Frontier. |

The Qwen trace uses 16-token salted SipHash blocks. The adapter asserts `len(hash_ids) == ceil(input_length / block_size)`. The final block can be a padded partial block; its true token count is `input_length % block_size`, or `block_size` when the prompt length is divisible by the block size. The sidecar records `block_token_counts` so downstream analyses can compute token-weighted prefix-cache accounting while Frontier replays the original block hashes.

Overflow handling is intentionally explicit. The adapter never clips prompt or decode tokens. With `--fail-on-overflow`, any row where `input_length + output_length > --max-tokens` exits with an error before publishing output files.

Example:

```bash
python3 tools/qwen_to_frontier.py \
  --input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
  --frontier-csv traces/fixtures/coder_100/frontier.csv \
  --sidecar-jsonl traces/fixtures/coder_100/sidecar.jsonl \
  --source-jsonl traces/fixtures/coder_100/source.jsonl \
  --manifest-json traces/fixtures/coder_100/manifest.json \
  --fixture-name coder_100 \
  --limit 100 \
  --max-tokens 32768 \
  --block-size 16 \
  --fail-on-overflow
```

Validate fixtures:

```bash
python3 tools/validate_fixtures.py \
  traces/fixtures/coder_100 \
  traces/fixtures/coder_2000 \
  --max-tokens 32768 \
  --block-size 16
```

## Fixture Layout

Each fixture directory under `traces/fixtures/` contains:

- `source.jsonl`: the original Qwen JSONL slice.
- `frontier.csv`: Frontier trace replay CSV.
- `sidecar.jsonl`: ReplayServe metadata with original fields and block token counts.
- `manifest.json`: generation parameters and basic stats.
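
The `block_token_counts` stored in `sidecar.jsonl` and the overflow rule can be illustrated with a minimal sketch. This is not the code in `tools/qwen_to_frontier.py`; the function names `block_token_counts` and `overflows` are hypothetical, and treating an empty prompt as zero blocks is an assumption. It only restates the rules from Adapter Semantics: every block holds `block_size` tokens except possibly the last, and a row overflows when prompt plus decode tokens exceed the limit.

```python
import math


def block_token_counts(input_length: int, num_blocks: int, block_size: int = 16) -> list[int]:
    """Return the true token count of each prompt block (hypothetical helper)."""
    # Same invariant the adapter asserts: len(hash_ids) == ceil(input_length / block_size).
    expected = math.ceil(input_length / block_size)
    if num_blocks != expected:
        raise ValueError(f"expected {expected} blocks, got {num_blocks}")
    if num_blocks == 0:
        # Assumption: an empty prompt has no blocks.
        return []
    # The final block may be partial; a zero remainder means it is full.
    remainder = input_length % block_size
    last = remainder if remainder else block_size
    return [block_size] * (num_blocks - 1) + [last]


def overflows(input_length: int, output_length: int, max_tokens: int = 32768) -> bool:
    """True when the row would be rejected under --fail-on-overflow (hypothetical helper)."""
    return input_length + output_length > max_tokens


if __name__ == "__main__":
    print(block_token_counts(35, 3))   # 35 tokens over 3 blocks of size 16
    print(overflows(32000, 769))       # 32769 tokens against a 32768 limit
```

Summing the returned counts gives back `input_length`, which is the property that makes token-weighted prefix-cache accounting possible on top of block-level hashes.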