# RS3 Sweep Harness
RS3 adds a reproducible Frontier sweep harness and a tiny smoke run that
exercises it. It does not perform the full TP/EP/DP/config scan.
## Files
- Config: `configs/rs3_tiny_sweep.json`
- Runner: `tools/run_frontier_sweep.py`
- Aggregator: `tools/aggregate_runs.py`
- Tiny smoke outputs: `runs/rs3_tiny_smoke_20260624/`

The output layout is:
```text
runs/<suite>/<sim>/<fixture>/<config_id>/
  command.txt
  env.txt
  run_manifest.json
  run_status.json
  stdout.log
  stderr.log
  exit_code.txt
  runtime_seconds.txt
  frontier_metrics/...
  postprocess_summary.json
  postprocess_summary.md
runs/<suite>/summary.csv
runs/<suite>/summary.md
```
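
As a minimal sketch of how this layout could be derived in code, the helpers below build the per-run directory and the suite-level summary paths with `pathlib`. The function names are illustrative and are not taken from `tools/run_frontier_sweep.py`; the `sim` and fixture values in the usage line are placeholders.

```python
from pathlib import Path


def run_dir(runs_root, suite, sim, fixture, config_id):
    """Return runs/<suite>/<sim>/<fixture>/<config_id> as a Path."""
    return Path(runs_root) / suite / sim / fixture / config_id


def suite_summaries(runs_root, suite):
    """Return the suite-level summary.csv and summary.md paths."""
    suite_dir = Path(runs_root) / suite
    return suite_dir / "summary.csv", suite_dir / "summary.md"


if __name__ == "__main__":
    # "example_sim" and "example_fixture" are placeholder names.
    print(run_dir("runs", "rs3_tiny_smoke_20260624",
                  "example_sim", "example_fixture", "fixed_prefix_on"))
```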
## Config Scheme
`configs/rs3_tiny_sweep.json` is an intentionally small JSON file with the
following keys:
- `suite_id`: output suite under `runs/`.
- `sim`: simulator/mode name used in the run path.
- `frontier`: Frontier checkout metadata. The tiny smoke points at the patched
  scratch checkout `/tmp/replayserve-frontier-rs1b`, not the canonical Frontier
  checkout.
- `fixtures`: fixture names under `traces/fixtures/`.
- `defaults`: fixed Frontier knobs shared by each config.
- `configs`: named variants with optional `overrides`.

The exposed Frontier knobs include:
- parallelism: `attn_tensor_parallel_size`, `attn_data_parallel_size`,
`moe_tensor_parallel_size`, `moe_expert_parallel_size`,
`num_pipeline_stages`, `num_replicas`
- scheduler: `batch_size_cap` / max-num-seqs equivalent,
`max_tokens_in_batch` / max-batch-tokens equivalent, `block_size`,
`enable_prefix_caching`, `enable_chunked_prefill`,
`long_prefill_token_threshold`
- fixed smoke context: model, device, network device, trace max tokens,
memory-planner mode, GPU memory utilization, non-KV overhead, and dummy
execution time

For dense `Qwen/Qwen3-32B`, the EP-like knobs stay at `1` in the tiny smoke.
They are present so later MoE configs can be represented without changing the
harness schema.
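
The relationship between `defaults` and per-config `overrides` can be sketched as a plain dictionary merge. This is an illustration only: it assumes `configs` is a list of objects with an `id` field, which this document does not confirm, and the knob values are made up for the example.

```python
def expand_configs(sweep):
    """Merge shared defaults with each named config's optional overrides."""
    defaults = sweep.get("defaults", {})
    expanded = {}
    for cfg in sweep["configs"]:
        knobs = dict(defaults)                    # copy so defaults stay untouched
        knobs.update(cfg.get("overrides", {}))    # overrides win over defaults
        expanded[cfg["id"]] = knobs               # "id" is an assumed field name
    return expanded


# Illustrative config; values are placeholders, not the checked-in file.
EXAMPLE_SWEEP = {
    "defaults": {
        "enable_prefix_caching": True,
        "enable_chunked_prefill": True,
        "block_size": 16,
    },
    "configs": [
        {"id": "fixed_prefix_on"},
        {"id": "prefix_cache_off", "overrides": {"enable_prefix_caching": False}},
    ],
}

if __name__ == "__main__":
    for config_id, knobs in expand_configs(EXAMPLE_SWEEP).items():
        print(config_id, knobs)
```

Copying `defaults` before applying overrides keeps each expanded config independent, so a later variant cannot inherit an earlier variant's override by accident.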
## Run Commands
From `/home/gahow/phd/replayserve`:
```bash
python3 tools/run_frontier_sweep.py \
--config configs/rs3_tiny_sweep.json \
--suite-id rs3_tiny_smoke_20260624
python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624
```
The runner refuses to replace an existing selected run directory unless
`--force` is passed. Use `--dry-run` to emit commands/manifests without running
Frontier, and `--only-config` / `--only-fixture` to narrow the selected matrix.
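
The selection and overwrite behavior described above could look roughly like the following. This is a hedged sketch, not the runner's actual code: the function names are hypothetical, and deleting the old directory under `--force` is an assumption about what "replace" means.

```python
import shutil
from itertools import product
from pathlib import Path


def select_matrix(fixtures, config_ids, only_fixture=None, only_config=None):
    """Return (fixture, config_id) pairs, narrowed by the optional filters."""
    return [
        (fixture, config_id)
        for fixture, config_id in product(fixtures, config_ids)
        if (only_fixture is None or fixture == only_fixture)
        and (only_config is None or config_id == only_config)
    ]


def claim_run_dir(run_dir, force=False):
    """Create run_dir, refusing to replace an existing one unless force is set."""
    run_dir = Path(run_dir)
    if run_dir.exists():
        if not force:
            raise FileExistsError(
                f"{run_dir} already exists; pass --force to replace it"
            )
        shutil.rmtree(run_dir)  # assumed meaning of "replace"
    run_dir.mkdir(parents=True)
    return run_dir
```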
## Frontier Mode
The RS3 tiny smoke uses:
- `frontier.root=/tmp/replayserve-frontier-rs1b`
- `frontier.mode=patched_scratch`
- patch file `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`

The canonical checkout `/tmp/toc-llm-sim-research/Frontier` remains clean and is
not modified by the harness. `summary.csv` records `frontier_dirty=true` for the
patched scratch checkout because the local patch is applied there; that is
expected.

To run canonical mode for a safe config, copy the JSON config, set
`frontier.root` to `/tmp/toc-llm-sim-research/Frontier`, change `sim` (the name
used in the run path), and run a small selected config. Do not use canonical
fixed `coder_2000` until the prefix-cache chunked-prefill bug is fixed upstream.
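
The copy-and-edit step can be sketched on an in-memory config as follows. It assumes `frontier` is a nested object with a `root` key, as the dotted names above suggest, and the new `sim` value is a placeholder. Other `frontier` metadata such as `mode` is left alone here because this document does not state the value used for canonical runs.

```python
import copy

CANONICAL_ROOT = "/tmp/toc-llm-sim-research/Frontier"


def to_canonical(sweep, sim_name):
    """Return a copy of the sweep config that points at the canonical checkout."""
    derived = copy.deepcopy(sweep)          # never mutate the checked-in config
    derived["frontier"]["root"] = CANONICAL_ROOT
    derived["sim"] = sim_name               # changes the <sim> path component
    return derived


if __name__ == "__main__":
    base = {
        "sim": "example_patched_sim",  # placeholder name
        "frontier": {
            "root": "/tmp/replayserve-frontier-rs1b",
            "mode": "patched_scratch",
        },
    }
    print(to_canonical(base, "example_canonical_sim"))
```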
## Tiny Smoke Results
Command:
```bash
python3 tools/run_frontier_sweep.py \
--config configs/rs3_tiny_sweep.json \
--suite-id rs3_tiny_smoke_20260624
python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624
```
Results:

| config | status | runtime | prefix cache | chunked prefill | Frontier block hit ratio | ReplayServe token hit ratio | preemptions |
|---|---:|---:|---:|---:|---:|---:|---:|
| `fixed_prefix_on` | pass | 8s | on | on | `0.049486618` | `0.049562326` | 0 |
| `prefix_cache_off` | pass | 7s | off | on | n/a | n/a | 0 |

Aggregated files:
- `runs/rs3_tiny_smoke_20260624/summary.csv`
- `runs/rs3_tiny_smoke_20260624/summary.md`

The prefix-off run does not have Frontier cache columns in `request_metrics.csv`;
`summary.csv` records `cache_metrics_available=false` and the missing-column
reason.

TTFT/TPOT/E2E/throughput fields are aggregated from Frontier `system_metrics.json`
when present. In this tiny smoke they are dummy-predictor plumbing outputs, not
performance results.
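
When consuming `summary.csv`, rows without cache metrics should be filtered out before any hit-ratio comparison. The sketch below shows one way to do that with the standard `csv` module. The `cache_metrics_available` column comes from the description above; the `config_id` column name and the lowercase `true`/`false` encoding are assumptions, and the sample text is illustrative rather than a copy of the real file.

```python
import csv
import io


def rows_with_cache_metrics(csv_text):
    """Return only the summary rows whose cache metrics are available."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if row.get("cache_metrics_available", "").strip().lower() == "true"
    ]


# Illustrative two-row sample; the real summary.csv has more columns.
SAMPLE = """config_id,cache_metrics_available
fixed_prefix_on,true
prefix_cache_off,false
"""

if __name__ == "__main__":
    for row in rows_with_cache_metrics(SAMPLE):
        print(row["config_id"])
```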
## Not Yet Run
- No `coder_2000` sweep was run in RS3.
- No TP/DP/EP matrix was swept.
- No batch cap, max batch tokens, block size, chunked-prefill, or threshold
matrix was swept beyond the two-config smoke.
- No canonical Frontier patched-vs-unpatched comparison was rerun.
- No Vidur or AIConfigurator run is part of this harness yet.
## Next Harness Work
- Add a small checked-in config for a real RS3 candidate grid only after deciding
the patch/upstream policy.
- Add guardrails for invalid dense/MoE parallelism combinations before launching
larger matrices.
- Investigate why `coder_2000` is missing request-level cache fields before
  using request-level hit ratio as a headline sweep metric.
- Keep latency/throughput result tables clearly separated by predictor/profile
mode: dummy smoke, profiled Frontier, or calibrated run.
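
For the parallelism guardrail item above, one possible starting point is a pre-launch check that dense configs leave the MoE knobs at `1`, matching what the tiny smoke does for `Qwen/Qwen3-32B`. This is a sketch of a check that does not exist in the harness yet. Treating `moe_tensor_parallel_size` and `moe_expert_parallel_size` as the "EP-like" knobs is an assumption, and the `is_moe` flag is hypothetical.

```python
# Assumed set of "EP-like" knobs that only make sense for MoE models.
MOE_KNOBS = ("moe_tensor_parallel_size", "moe_expert_parallel_size")


def check_dense_parallelism(knobs, is_moe):
    """Return a list of problems; an empty list means the config passes."""
    if is_moe:
        return []  # MoE validity rules are out of scope for this sketch
    return [
        f"{name}={knobs[name]} but dense models expect 1"
        for name in MOE_KNOBS
        if knobs.get(name, 1) != 1  # a missing knob is treated as 1
    ]


if __name__ == "__main__":
    print(check_dense_parallelism({"moe_expert_parallel_size": 2}, is_moe=False))
```

Returning a list of messages instead of raising on the first problem lets a runner report every invalid config in a matrix before launching anything.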