Add ReplayServe Frontier vLLM alignment report

2026-06-25 17:10:30 +08:00
commit a99bd00782
63 changed files with 17033 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,109 @@
+# ReplayServe
+
+ReplayServe is a small trace-replay workspace for reproducing real LLM serving
+traces in open simulators. The first target is Frontier trace replay with
+timestamp, prompt length, decode length, session, and prefix block reuse
+preserved from the Qwen Bailian anonymized JSONL traces.
+
+RS0 only bootstraps the repository, documents source versions, implements the
+Qwen JSONL to Frontier CSV adapter semantics, and creates canonical fixtures.
+It does not run the Frontier simulator; RS1 owns simulator smoke runs.
+
+## First Frontier Smoke Point
+
+The first Frontier smoke run is pinned to this plumbing-only configuration:
+
+- `simulation_mode=online`
+- `sys_arch=co-location`
+- `replica_scheduler=vllm_v1`
+- `device=a800`
+- `model_name=Qwen/Qwen3-32B`
+- `attn_tensor_parallel_size=2`
+- dummy execution predictor
+- analytical communication backend
+- `trace_request_generator_config_max_tokens=32768`
+- prefix cache enabled
+- block size 16
+- chunked prefill enabled
+- batch cap 128
+- max batch tokens 32768
+- KV capacity estimated by Frontier memory planner
+
+Frontier currently has A800 network profiles, but the public A800 compute
+profiles we checked do not include dense `Qwen/Qwen3-32B`. RS1 latency and
+throughput numbers from this configuration therefore only validate the
+plumbing; they are not profile-faithful performance conclusions.
+
+## Real vLLM GPU Baseline
+
+RS4 starts a real-backend baseline on dash2 H20 with
+`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`. The runner
+`tools/vllm_synthetic_replay.py` uses synthetic `prompt_token_ids` derived from
+the Qwen block hashes, so equal trace blocks become equal token blocks and vLLM
+prefix-cache hits can be observed directly.
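+
+As a rough illustration of that idea (not the actual logic in
+`tools/vllm_synthetic_replay.py`), the sketch below expands each block hash
+into a deterministic run of token ids, so two prompts that share a leading
+block hash also share the same leading tokens. The function names, the SHA-256
+expansion, and `vocab_size=32000` are all assumptions made for the example.
+
+```python
+import hashlib
+
+
+def block_tokens(hash_id, n_tokens=16, vocab_size=32000):
+    """Deterministically expand one block hash into n_tokens token ids."""
+    tokens = []
+    for i in range(n_tokens):
+        digest = hashlib.sha256(f"{hash_id}:{i}".encode()).digest()
+        # vocab_size is a placeholder; a real runner would use the model's vocabulary.
+        tokens.append(int.from_bytes(digest[:4], "big") % vocab_size)
+    return tokens
+
+
+def synthetic_prompt(hash_ids, input_length, block_size=16, vocab_size=32000):
+    """Concatenate per-block tokens and trim the padded tail to input_length."""
+    tokens = []
+    for hash_id in hash_ids:
+        tokens.extend(block_tokens(hash_id, block_size, vocab_size))
+    return tokens[:input_length]
+```
+
+Because each block depends only on its own hash id, equal hashes always yield
+equal token blocks regardless of which request they appear in.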
+
+See `docs/rs4_vllm_gpu_smoke.md` for the first TP=1/2 smoke results and the
+reliability boundary.
+
+## Adapter Semantics
+
+`tools/qwen_to_frontier.py` converts Qwen JSONL rows to Frontier CSV rows and a
+ReplayServe sidecar JSONL.
+
+Field mapping:
+
+| Qwen JSONL | Frontier CSV | Notes |
+|---|---|---|
+| `timestamp` | `arrived_at` | Trace-relative seconds. |
+| `input_length` | `num_prefill_tokens` | Serving input length, already measured after the chat template is applied. |
+| `output_length` | `num_decode_tokens` | Generation length. |
+| `chat_id` | `session_id` | Preserved for session-aware analysis. |
+| `hash_ids` | `block_hash_ids` | Joined with `|` for Frontier. |
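+
+A minimal sketch of this mapping, assuming each Qwen row has already been
+parsed into a Python dict. The function name `qwen_row_to_frontier` and the
+sample row are made up for illustration and are not taken from
+`tools/qwen_to_frontier.py`.
+
+```python
+# Column renames copied from the mapping table; hash_ids is handled separately.
+FIELD_MAP = {
+    "timestamp": "arrived_at",
+    "input_length": "num_prefill_tokens",
+    "output_length": "num_decode_tokens",
+    "chat_id": "session_id",
+}
+
+
+def qwen_row_to_frontier(row):
+    """Rename the scalar fields and join hash_ids with '|' for the CSV cell."""
+    out = {dst: row[src] for src, dst in FIELD_MAP.items()}
+    out["block_hash_ids"] = "|".join(str(h) for h in row["hash_ids"])
+    return out
+
+
+# Hypothetical row, not real trace data.
+example = {
+    "chat_id": 7,
+    "timestamp": 1.5,
+    "input_length": 20,
+    "output_length": 4,
+    "hash_ids": [11, 12],
+}
+converted = qwen_row_to_frontier(example)
+```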
+
+The Qwen trace uses 16-token salted SipHash blocks. The adapter asserts
+`len(hash_ids) == ceil(input_length / block_size)`. The final block can be a
+padded partial block; its true token count is `input_length % block_size`, or
+`block_size` when the prompt length is divisible by the block size. The sidecar
+records `block_token_counts` so downstream analyses can compute token-weighted
+prefix-cache accounting while Frontier replays the original block hashes.
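+
+The block-count assertion and the partial final block can be sketched as
+follows. This is an illustrative reading of the rule above, with hypothetical
+function names, not the adapter's actual code.
+
+```python
+import math
+
+
+def block_token_counts(input_length, block_size=16):
+    """True token count of each block; only the final block may be partial."""
+    n_blocks = math.ceil(input_length / block_size)
+    if n_blocks == 0:
+        return []
+    # A remainder of 0 means the final block is full.
+    tail = input_length % block_size or block_size
+    return [block_size] * (n_blocks - 1) + [tail]
+
+
+def check_row(input_length, hash_ids, block_size=16):
+    """Assert one hash per block, then return the per-block token counts."""
+    counts = block_token_counts(input_length, block_size)
+    assert len(hash_ids) == len(counts), "hash_ids length does not match block count"
+    return counts
+```
+
+For example, a 20-token prompt with block size 16 has two hashes and token
+counts `[16, 4]`, while a 32-token prompt has counts `[16, 16]`.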
+
+Overflow handling is intentionally explicit. The adapter never clips prompt or
+decode tokens. With `--fail-on-overflow`, the adapter exits with an error,
+before publishing any output files, if any row has
+`input_length + output_length > --max-tokens`.
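+
+One way to get the "fail before publishing" behavior is to scan every row
+first and only write outputs once the scan passes. The sketch below shows that
+ordering with hypothetical helper names; it covers only the
+`--fail-on-overflow` case and is not the adapter's actual implementation.
+
+```python
+def find_overflow_rows(rows, max_tokens):
+    """Return indices of rows whose prompt + decode length exceeds max_tokens."""
+    return [
+        i
+        for i, row in enumerate(rows)
+        if row["input_length"] + row["output_length"] > max_tokens
+    ]
+
+
+def check_no_overflow(rows, max_tokens):
+    """Raise before any output is written if at least one row overflows."""
+    bad = find_overflow_rows(rows, max_tokens)
+    if bad:
+        raise ValueError(
+            f"{len(bad)} row(s) exceed max_tokens={max_tokens}; "
+            f"first offending row index: {bad[0]}"
+        )
+```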
+
+Example:
+
+```bash
+python3 tools/qwen_to_frontier.py \
+  --input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
+  --frontier-csv traces/fixtures/coder_100/frontier.csv \
+  --sidecar-jsonl traces/fixtures/coder_100/sidecar.jsonl \
+  --source-jsonl traces/fixtures/coder_100/source.jsonl \
+  --manifest-json traces/fixtures/coder_100/manifest.json \
+  --fixture-name coder_100 \
+  --limit 100 \
+  --max-tokens 32768 \
+  --block-size 16 \
+  --fail-on-overflow
+```
+
+Validate fixtures:
+
+```bash
+python3 tools/validate_fixtures.py \
+  traces/fixtures/coder_100 \
+  traces/fixtures/coder_2000 \
+  --max-tokens 32768 \
+  --block-size 16
+```
+
+## Fixture Layout
+
+Each fixture directory under `traces/fixtures/` contains:
+
+- `source.jsonl`: the original Qwen JSONL slice.
+- `frontier.csv`: Frontier trace replay CSV.
+- `sidecar.jsonl`: ReplayServe metadata with original fields and block token
+  counts.
+- `manifest.json`: generation parameters and basic stats.