# ReplayServe

ReplayServe is a small trace-replay workspace for reproducing real LLM serving
traces in open simulators. The first target is Frontier trace replay with
timestamp, prompt length, decode length, session, and prefix block reuse
preserved from the Qwen Bailian anonymized JSONL traces.

RS0 only bootstraps the repository, documents source versions, implements the
Qwen JSONL to Frontier CSV adapter semantics, and creates canonical fixtures.
It does not run the Frontier simulator; RS1 owns simulator smoke runs.

## First Frontier Smoke Point

The first Frontier smoke is fixed to this plumbing-only configuration:

- `simulation_mode=online`
- `sys_arch=co-location`
- `replica_scheduler=vllm_v1`
- `device=a800`
- `model_name=Qwen/Qwen3-32B`
- `attn_tensor_parallel_size=2`
- dummy execution predictor
- analytical communication backend
- `trace_request_generator_config_max_tokens=32768`
- prefix cache enabled
- block size 16
- chunked prefill enabled
- batch cap 128
- max batch tokens 32768
- KV capacity estimated by Frontier memory planner

Frontier currently has A800 network profiles, but the checked public A800
compute profiles do not include dense `Qwen/Qwen3-32B`. RS1 latency and
throughput numbers from this point are therefore plumbing smoke only, not
profile-faithful performance conclusions.

## Real vLLM GPU Baseline

RS4 starts a real-backend baseline on dash2 H20 with
`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`. The runner
`tools/vllm_synthetic_replay.py` uses synthetic `prompt_token_ids` derived from
the Qwen block hashes, so equal trace blocks become equal token blocks and vLLM
prefix-cache hits can be observed directly.

See `docs/rs4_vllm_gpu_smoke.md` for the first TP=1/2 smoke results and the
reliability boundary.

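The idea of deriving synthetic `prompt_token_ids` from block hashes can be sketched as follows. This is only an illustration, not the code in `tools/vllm_synthetic_replay.py`: the function names, the BLAKE2 expansion, and `vocab_size=32000` are assumptions. A real runner would need the model's actual vocabulary size and would likely avoid special token ids.

```python
import hashlib


def synthetic_block_tokens(hash_id, n_tokens, vocab_size=32000):
    """Deterministically expand one block hash into n_tokens token ids.

    vocab_size is a hypothetical placeholder, not the real model's vocabulary.
    """
    tokens = []
    for pos in range(n_tokens):
        digest = hashlib.blake2b(
            f"{hash_id}:{pos}".encode(), digest_size=8
        ).digest()
        tokens.append(int.from_bytes(digest, "big") % vocab_size)
    return tokens


def synthetic_prompt_token_ids(hash_ids, input_length, block_size=16,
                               vocab_size=32000):
    """Build a prompt of exactly input_length tokens from trace block hashes.

    Equal hash ids always yield equal token blocks; the final block is
    truncated to its true (possibly partial) token count.
    """
    prompt = []
    for i, hash_id in enumerate(hash_ids):
        n_tokens = min(block_size, input_length - i * block_size)
        if n_tokens <= 0:
            raise ValueError("more hash_ids than input_length allows")
        prompt.extend(synthetic_block_tokens(hash_id, n_tokens, vocab_size))
    if len(prompt) != input_length:
        raise ValueError("hash_ids do not cover input_length")
    return prompt
```

Two requests that share their leading hash ids then share the same leading token ids, which is the property a prefix cache needs in order to register a hit.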
## Adapter Semantics

`tools/qwen_to_frontier.py` converts Qwen JSONL rows to Frontier CSV rows and a
ReplayServe sidecar JSONL.

Field mapping:

| Qwen JSONL | Frontier CSV | Notes |
|---|---|---|
| `timestamp` | `arrived_at` | Trace-relative seconds. |
| `input_length` | `num_prefill_tokens` | Serving input length, already measured after the chat template is applied. |
| `output_length` | `num_decode_tokens` | Generation length. |
| `chat_id` | `session_id` | Preserved for session-aware analysis. |
| `hash_ids` | `block_hash_ids` | Joined with `\|` for Frontier. |

The Qwen trace uses 16-token salted SipHash blocks. The adapter asserts
`len(hash_ids) == ceil(input_length / block_size)`. The final block can be a
padded partial block; its true token count is `input_length % block_size`, or
`block_size` when the prompt length is divisible by the block size. The sidecar
records `block_token_counts` so downstream analyses can compute token-weighted
prefix-cache accounting while Frontier replays the original block hashes.

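As a rough illustration of the mapping and the block-count rule above, the sketch below converts one Qwen row into a Frontier row and a sidecar record. The function names and the exact sidecar layout are assumptions made for the example; `tools/qwen_to_frontier.py` may structure this differently.

```python
def block_token_counts(input_length, block_size=16):
    """True token count of each block; only the last block may be partial."""
    n_blocks = (input_length + block_size - 1) // block_size  # integer ceil
    if n_blocks == 0:
        return []
    last = input_length % block_size or block_size
    return [block_size] * (n_blocks - 1) + [last]


def convert_row(row, block_size=16):
    """Map one Qwen JSONL row to (frontier_row, sidecar_row).

    Hypothetical helper: the sidecar here is the original row plus
    block_token_counts, which is an assumed layout.
    """
    hash_ids = row["hash_ids"]
    counts = block_token_counts(row["input_length"], block_size)
    if len(hash_ids) != len(counts):
        raise ValueError(
            f"expected {len(counts)} hash_ids, got {len(hash_ids)}"
        )
    frontier_row = {
        "arrived_at": row["timestamp"],
        "num_prefill_tokens": row["input_length"],
        "num_decode_tokens": row["output_length"],
        "session_id": row["chat_id"],
        "block_hash_ids": "|".join(str(h) for h in hash_ids),
    }
    sidecar_row = dict(row, block_token_counts=counts)
    return frontier_row, sidecar_row
```

For a 40-token prompt with `block_size=16`, the counts are `[16, 16, 8]`: two full blocks and one partial block of 8 tokens, and the counts always sum to `input_length`.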
Overflow handling is intentionally explicit. The adapter never clips prompt or
decode tokens. With `--fail-on-overflow`, any row where
`input_length + output_length > --max-tokens` exits with an error before
publishing output files.

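The overflow rule can be sketched as a check that runs over all rows before anything is written. This is a minimal illustration under assumptions: the function name is hypothetical, and returning the offending indices when `fail_on_overflow` is false is a choice made for the example, since the behaviour without the flag is not described here.

```python
def check_overflow(rows, max_tokens=32768, fail_on_overflow=True):
    """Return indices of rows whose prompt plus decode length exceeds max_tokens.

    Rows are never clipped. With fail_on_overflow, any offending row raises
    before the caller gets a chance to publish output files.
    """
    overflows = [
        i for i, row in enumerate(rows)
        if row["input_length"] + row["output_length"] > max_tokens
    ]
    if overflows and fail_on_overflow:
        raise ValueError(
            f"{len(overflows)} row(s) exceed max_tokens={max_tokens}; "
            f"first offending row index: {overflows[0]}"
        )
    return overflows
```

Note that the comparison is strict: a row whose total equals `max_tokens` exactly is accepted.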
Example:

```bash
python3 tools/qwen_to_frontier.py \
  --input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
  --frontier-csv traces/fixtures/coder_100/frontier.csv \
  --sidecar-jsonl traces/fixtures/coder_100/sidecar.jsonl \
  --source-jsonl traces/fixtures/coder_100/source.jsonl \
  --manifest-json traces/fixtures/coder_100/manifest.json \
  --fixture-name coder_100 \
  --limit 100 \
  --max-tokens 32768 \
  --block-size 16 \
  --fail-on-overflow
```

Validate fixtures:

```bash
python3 tools/validate_fixtures.py \
  traces/fixtures/coder_100 \
  traces/fixtures/coder_2000 \
  --max-tokens 32768 \
  --block-size 16
```

## Fixture Layout

Each fixture directory under `traces/fixtures/` contains:

- `source.jsonl`: the original Qwen JSONL slice.
- `frontier.csv`: Frontier trace replay CSV.
- `sidecar.jsonl`: ReplayServe metadata with original fields and block token
  counts.
- `manifest.json`: generation parameters and basic stats.
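A quick structural check of a fixture directory might look like the sketch below. It only confirms that the four files listed above exist. The helper name is hypothetical, and this is not a substitute for `tools/validate_fixtures.py`, whose checks are not reproduced here.

```python
from pathlib import Path

# File names taken from the fixture layout described above.
EXPECTED_FILES = ("source.jsonl", "frontier.csv", "sidecar.jsonl", "manifest.json")


def missing_fixture_files(fixture_dir):
    """Return the expected file names that are absent from fixture_dir."""
    directory = Path(fixture_dir)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]
```

An empty result means the directory has the expected layout; any returned names point at files that still need to be generated.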