# ReplayServe

ReplayServe is a small trace-replay workspace for reproducing real LLM serving
traces in open simulators. The first target is Frontier trace replay with
timestamp, prompt length, decode length, session, and prefix block reuse
preserved from the Qwen Bailian anonymized JSONL traces.

RS0 only bootstraps the repository, documents source versions, implements the
Qwen JSONL to Frontier CSV adapter semantics, and creates canonical fixtures.
It does not run the Frontier simulator; RS1 owns simulator smoke runs.

## First Frontier Smoke Point

The first Frontier smoke run is fixed to this plumbing-only configuration:

- `simulation_mode=online`
- `sys_arch=co-location`
- `replica_scheduler=vllm_v1`
- `device=a800`
- `model_name=Qwen/Qwen3-32B`
- `attn_tensor_parallel_size=2`
- dummy execution predictor
- analytical communication backend
- `trace_request_generator_config_max_tokens=32768`
- prefix cache enabled
- block size 16
- chunked prefill enabled
- batch cap 128
- max batch tokens 32768
- KV capacity estimated by Frontier memory planner

Frontier currently has A800 network profiles, but the public A800 compute
profiles that were checked do not include dense `Qwen/Qwen3-32B`. Latency and
throughput numbers from RS1 at this point therefore validate plumbing only;
they are not profile-faithful performance conclusions.

## Real vLLM GPU Baseline

RS4 starts a real-backend baseline on dash2 H20 with
`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`. The runner
`tools/vllm_synthetic_replay.py` uses synthetic `prompt_token_ids` derived from
the Qwen block hashes, so equal trace blocks become equal token blocks and vLLM
prefix-cache hits can be observed directly.
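
The idea of turning block hashes into token blocks can be sketched as follows. This is a minimal illustration, not the logic of `tools/vllm_synthetic_replay.py`: the SHA-256 derivation, the helper names, and the placeholder `vocab_size` are all assumptions. The only property it aims to show is that equal hash IDs yield equal token blocks.

```python
import hashlib


def block_to_tokens(hash_id: int, block_size: int = 16, vocab_size: int = 32000) -> list[int]:
    """Derive a deterministic block of token IDs from one block hash.

    Hypothetical scheme: hash "<hash_id>:<position>" and reduce modulo a
    placeholder vocabulary size. A real runner may also need to avoid
    special token IDs.
    """
    tokens = []
    for pos in range(block_size):
        digest = hashlib.sha256(f"{hash_id}:{pos}".encode()).digest()
        tokens.append(int.from_bytes(digest[:4], "big") % vocab_size)
    return tokens


def synthetic_prompt_token_ids(
    hash_ids: list[int],
    input_length: int,
    block_size: int = 16,
    vocab_size: int = 32000,
) -> list[int]:
    """Concatenate per-block tokens and trim the padded final block."""
    tokens: list[int] = []
    for hash_id in hash_ids:
        tokens.extend(block_to_tokens(hash_id, block_size, vocab_size))
    return tokens[:input_length]
```

Two prompts that share leading hash IDs then share the same leading token IDs, which is the property a prefix cache keys on.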

See `docs/rs4_vllm_gpu_smoke.md` for the first TP=1/2 smoke results and the
reliability boundary.

## Adapter Semantics

`tools/qwen_to_frontier.py` converts Qwen JSONL rows to Frontier CSV rows and a
ReplayServe sidecar JSONL.

Field mapping:

| Qwen JSONL | Frontier CSV | Notes |
|---|---|---|
| `timestamp` | `arrived_at` | Trace-relative seconds. |
| `input_length` | `num_prefill_tokens` | Serving input length, measured after the chat template is applied. |
| `output_length` | `num_decode_tokens` | Generation length. |
| `chat_id` | `session_id` | Preserved for session-aware analysis. |
| `hash_ids` | `block_hash_ids` | Joined with `\|` for Frontier. |
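
The mapping in the table can be sketched in a few lines of Python. The helper names, the column order, and the presence of a header row are assumptions for illustration; `tools/qwen_to_frontier.py` remains the reference for the actual output format.

```python
import csv
import io
import json

# Column order is an assumption for illustration; the real adapter may differ.
FRONTIER_COLUMNS = [
    "arrived_at",
    "num_prefill_tokens",
    "num_decode_tokens",
    "session_id",
    "block_hash_ids",
]


def qwen_row_to_frontier(row: dict) -> dict:
    """Map one Qwen JSONL record to one Frontier CSV record (hypothetical helper)."""
    return {
        "arrived_at": row["timestamp"],
        "num_prefill_tokens": row["input_length"],
        "num_decode_tokens": row["output_length"],
        "session_id": row["chat_id"],
        # Block hashes are joined with "|" into a single CSV cell.
        "block_hash_ids": "|".join(str(h) for h in row["hash_ids"]),
    }


def convert_lines(jsonl_text: str) -> str:
    """Convert JSONL text to CSV text, skipping blank lines."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FRONTIER_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():
            writer.writerow(qwen_row_to_frontier(json.loads(line)))
    return out.getvalue()
```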

The Qwen trace uses 16-token salted SipHash blocks. The adapter asserts
`len(hash_ids) == ceil(input_length / block_size)`. The final block can be a
padded partial block; its true token count is `input_length % block_size`, or
`block_size` when the prompt length is divisible by the block size. The sidecar
records `block_token_counts` so downstream analyses can compute token-weighted
prefix-cache accounting while Frontier replays the original block hashes.
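
A minimal sketch of the block-count assertion and the per-block token counts is shown below. The function names are hypothetical. `prefix_hit_tokens` additionally assumes that a cache hit is the longest leading run of blocks whose hashes are already cached, which may not match the policy of any particular simulator or serving engine.

```python
import math


def block_token_counts(input_length: int, num_hashes: int, block_size: int = 16) -> list[int]:
    """Return the true token count of each block, checking the hash count first."""
    expected = math.ceil(input_length / block_size)
    if num_hashes != expected:
        raise ValueError(
            f"expected {expected} blocks for input_length={input_length}, got {num_hashes}"
        )
    if expected == 0:
        return []
    remainder = input_length % block_size
    # The final block is full when the prompt length divides evenly.
    last = remainder if remainder else block_size
    return [block_size] * (expected - 1) + [last]


def prefix_hit_tokens(hash_ids: list[int], counts: list[int], cached: set[int]) -> int:
    """Token-weighted prefix hit: sum tokens over the leading run of cached blocks."""
    hit = 0
    for hash_id, n_tokens in zip(hash_ids, counts):
        if hash_id not in cached:
            break
        hit += n_tokens
    return hit
```

For example, a 35-token prompt with three hashes yields counts `[16, 16, 3]`, so the counts always sum to `input_length` even though the last block is padded in the trace.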

Overflow handling is intentionally explicit. The adapter never clips prompt or
decode tokens. With `--fail-on-overflow`, any row where
`input_length + output_length > --max-tokens` exits with an error before
publishing output files.
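
The overflow check can be sketched as a pass over all rows before anything is written. The function names are hypothetical, and returning the offending row indices when the flag is off is an assumption made for this sketch; it is not a description of what the adapter does without `--fail-on-overflow`.

```python
def find_overflow_rows(rows: list[dict], max_tokens: int) -> list[int]:
    """Return indices of rows whose prompt plus decode length exceeds max_tokens."""
    return [
        i
        for i, row in enumerate(rows)
        if row["input_length"] + row["output_length"] > max_tokens
    ]


def check_overflow(rows: list[dict], max_tokens: int, fail_on_overflow: bool) -> list[int]:
    """Check every row first so no output is published when a failure is requested."""
    overflow = find_overflow_rows(rows, max_tokens)
    if overflow and fail_on_overflow:
        raise SystemExit(
            f"{len(overflow)} row(s) exceed max_tokens={max_tokens}; first index {overflow[0]}"
        )
    # Rows are never clipped; the caller decides how to report the indices.
    return overflow
```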

Example:

```bash
python3 tools/qwen_to_frontier.py \
  --input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
  --frontier-csv traces/fixtures/coder_100/frontier.csv \
  --sidecar-jsonl traces/fixtures/coder_100/sidecar.jsonl \
  --source-jsonl traces/fixtures/coder_100/source.jsonl \
  --manifest-json traces/fixtures/coder_100/manifest.json \
  --fixture-name coder_100 \
  --limit 100 \
  --max-tokens 32768 \
  --block-size 16 \
  --fail-on-overflow
```

Validate fixtures:

```bash
python3 tools/validate_fixtures.py \
  traces/fixtures/coder_100 \
  traces/fixtures/coder_2000 \
  --max-tokens 32768 \
  --block-size 16
```

## Fixture Layout

Each fixture directory under `traces/fixtures/` contains:

- `source.jsonl`: the original Qwen JSONL slice.
- `frontier.csv`: Frontier trace replay CSV.
- `sidecar.jsonl`: ReplayServe metadata with original fields and block token
  counts.
- `manifest.json`: generation parameters and basic stats.
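
A quick layout check for a fixture directory might look like the sketch below. The helper name is hypothetical and the check only confirms that the four files exist; it is not a substitute for `tools/validate_fixtures.py`.

```python
from pathlib import Path

# File names taken from the fixture layout described above.
FIXTURE_FILES = ("source.jsonl", "frontier.csv", "sidecar.jsonl", "manifest.json")


def missing_fixture_files(fixture_dir: str) -> list[str]:
    """Return the expected fixture files that are absent from fixture_dir."""
    root = Path(fixture_dir)
    return [name for name in FIXTURE_FILES if not (root / name).is_file()]
```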