# RS3 Sweep Harness

RS3 adds a reproducible Frontier sweep harness and a tiny smoke. This is not the
full TP/EP/DP/config scan.

## Files

- Config: `configs/rs3_tiny_sweep.json`
- Runner: `tools/run_frontier_sweep.py`
- Aggregator: `tools/aggregate_runs.py`
- Tiny smoke outputs: `runs/rs3_tiny_smoke_20260624/`

The output layout is:

```text
runs/<suite>/<sim>/<fixture>/<config_id>/
  command.txt
  env.txt
  run_manifest.json
  run_status.json
  stdout.log
  stderr.log
  exit_code.txt
  runtime_seconds.txt
  frontier_metrics/...
  postprocess_summary.json
  postprocess_summary.md
runs/<suite>/summary.csv
runs/<suite>/summary.md
```

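As a rough illustration of the layout above, the sketch below builds a per-run directory path and reports which of the listed files are absent. The helper names `run_dir` and `missing_artifacts` are hypothetical and are not taken from `tools/run_frontier_sweep.py`; only the file names come from the layout.

```python
from pathlib import Path

# File names copied from the layout above. frontier_metrics/ is a directory
# with unspecified contents, so it is left out of this per-file check.
EXPECTED_FILES = [
    "command.txt",
    "env.txt",
    "run_manifest.json",
    "run_status.json",
    "stdout.log",
    "stderr.log",
    "exit_code.txt",
    "runtime_seconds.txt",
    "postprocess_summary.json",
    "postprocess_summary.md",
]


def run_dir(runs_root, suite, sim, fixture, config_id):
    """Build runs/<suite>/<sim>/<fixture>/<config_id>/ (hypothetical helper)."""
    return Path(runs_root) / suite / sim / fixture / config_id


def missing_artifacts(directory):
    """Return the expected per-run files that are absent from `directory`."""
    directory = Path(directory)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]
```

A check like this could run after a sweep to flag incomplete run directories before aggregation.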
## Config Scheme

`configs/rs3_tiny_sweep.json` is an intentionally small JSON file:

- `suite_id`: output suite under `runs/`.
- `sim`: simulator/mode name used in the run path.
- `frontier`: Frontier checkout metadata. The tiny smoke points at the patched
  scratch checkout `/tmp/replayserve-frontier-rs1b`, not canonical Frontier.
- `fixtures`: fixture names under `traces/fixtures/`.
- `defaults`: fixed Frontier knobs shared by each config.
- `configs`: named variants with optional `overrides`.

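One plausible way to read `defaults` plus per-config `overrides` is a shallow merge, sketched below. This is an assumption about the semantics, not the runner's actual code: the `id` key, the list-of-objects shape of `configs`, and the `example` document are all illustrative. Only the two config names and their on/off settings mirror the tiny smoke.

```python
import copy


def resolve_configs(sweep):
    """Return {config id: defaults with that config's overrides applied}.

    Assumes `configs` is a list of objects with a hypothetical "id" key
    and an optional "overrides" mapping.
    """
    resolved = {}
    for entry in sweep["configs"]:
        knobs = copy.deepcopy(sweep["defaults"])  # never mutate the shared defaults
        knobs.update(entry.get("overrides", {}))  # shallow merge: overrides win
        resolved[entry["id"]] = knobs
    return resolved


# Illustrative document, not the contents of configs/rs3_tiny_sweep.json.
example = {
    "defaults": {"enable_prefix_caching": True, "enable_chunked_prefill": True},
    "configs": [
        {"id": "fixed_prefix_on"},
        {"id": "prefix_cache_off", "overrides": {"enable_prefix_caching": False}},
    ],
}

resolved = resolve_configs(example)
```

Copying the defaults for each variant keeps one config's overrides from leaking into the next.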
The exposed Frontier knobs include:

- parallelism: `attn_tensor_parallel_size`, `attn_data_parallel_size`,
  `moe_tensor_parallel_size`, `moe_expert_parallel_size`,
  `num_pipeline_stages`, `num_replicas`
- scheduler: `batch_size_cap` / max-num-seqs equivalent,
  `max_tokens_in_batch` / max-batch-tokens equivalent, `block_size`,
  `enable_prefix_caching`, `enable_chunked_prefill`,
  `long_prefill_token_threshold`
- fixed smoke context: model, device, network device, trace max tokens,
  memory-planner mode, GPU memory utilization, non-KV overhead, and dummy
  execution time

For dense `Qwen/Qwen3-32B`, the EP-like knobs stay at `1` in the tiny smoke.
They are present so later MoE configs can be represented without changing the
harness schema.

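The dense-model convention above could be checked mechanically. The sketch below is a hypothetical guardrail, not part of the harness. It assumes that "EP-like knobs" means the two `moe_*` parallelism fields and that a missing knob counts as `1`.

```python
# Assumption: the "EP-like knobs" are the two moe_* parallelism fields.
MOE_KNOBS = ("moe_tensor_parallel_size", "moe_expert_parallel_size")


def dense_knob_violations(knobs):
    """Return the MoE knobs set to something other than 1 (missing counts as 1)."""
    return {name: knobs[name] for name in MOE_KNOBS if knobs.get(name, 1) != 1}
```

An empty result means the config is consistent with a dense model under these assumptions.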
## Run Commands

From `/home/gahow/phd/replayserve`:

```bash
python3 tools/run_frontier_sweep.py \
  --config configs/rs3_tiny_sweep.json \
  --suite-id rs3_tiny_smoke_20260624

python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624
```

The runner refuses to replace an existing selected run directory unless
`--force` is passed. Use `--dry-run` to emit commands/manifests without running
Frontier, and `--only-config` / `--only-fixture` to narrow the selected matrix.

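The selection and overwrite behaviour described above can be sketched as follows. This is a minimal illustration, not the runner's implementation: `select_runs` and `ensure_fresh` are hypothetical names, and each filter is assumed to take a single name (whether the real flags accept several values is not stated here).

```python
from itertools import product
from pathlib import Path


def select_runs(fixtures, config_ids, only_fixture=None, only_config=None):
    """Return the (fixture, config_id) pairs left after the optional filters."""
    return [
        (fixture, config_id)
        for fixture, config_id in product(fixtures, config_ids)
        if only_fixture in (None, fixture) and only_config in (None, config_id)
    ]


def ensure_fresh(run_directory, force=False):
    """Raise if the run directory already exists and force was not requested."""
    if Path(run_directory).exists() and not force:
        raise FileExistsError(f"{run_directory} exists; pass force=True to replace it")
```

Checking for an existing directory before launching means a stale run is never silently overwritten.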
## Frontier Mode

The RS3 tiny smoke uses:

- `frontier.root=/tmp/replayserve-frontier-rs1b`
- `frontier.mode=patched_scratch`
- patch file `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`

The canonical checkout `/tmp/toc-llm-sim-research/Frontier` remains clean and is
not modified by the harness. `summary.csv` records `frontier_dirty=true` for the
patched scratch because the local patch is applied there; that is expected.

To run canonical mode for a safe config, copy the JSON config, set
`frontier.root` to `/tmp/toc-llm-sim-research/Frontier`, change `sim`, and run a
small selected config. Do not use canonical fixed `coder_2000` until the
prefix-cache chunked-prefill bug is fixed upstream.

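The copy-and-edit step for canonical mode might look like the sketch below. It assumes `frontier.root` is stored as a nested `frontier` object with a `root` key, and `derive_canonical` is a hypothetical helper. `frontier.mode` is left untouched because no canonical value is named here. The `coder_2000` guard is a conservative reading of the warning above and may be stricter than intended.

```python
import copy

CANONICAL_ROOT = "/tmp/toc-llm-sim-research/Frontier"


def derive_canonical(config, sim_name):
    """Return a copy of `config` pointed at the canonical Frontier checkout.

    Assumes a nested {"frontier": {"root": ...}} layout; sim_name is chosen
    by the caller so canonical runs land under a different run path.
    """
    # Conservative assumption: reject any config that lists coder_2000.
    if "coder_2000" in config.get("fixtures", []):
        raise ValueError("coder_2000 is not safe on canonical Frontier yet")
    derived = copy.deepcopy(config)  # leave the source config unchanged
    derived["frontier"]["root"] = CANONICAL_ROOT
    derived["sim"] = sim_name
    return derived
```

The derived dictionary can then be written to a new JSON file with `json.dump` and passed to the runner via `--config`.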
## Tiny Smoke Results

Command:

```bash
python3 tools/run_frontier_sweep.py \
  --config configs/rs3_tiny_sweep.json \
  --suite-id rs3_tiny_smoke_20260624
python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624
```

Results:

| config | status | runtime | prefix cache | chunked prefill | Frontier block hit ratio | ReplayServe token hit ratio | preemptions |
|---|---:|---:|---:|---:|---:|---:|---:|
| `fixed_prefix_on` | pass | 8s | on | on | `0.049486618` | `0.049562326` | 0 |
| `prefix_cache_off` | pass | 7s | off | on | n/a | n/a | 0 |

Aggregated files:

- `runs/rs3_tiny_smoke_20260624/summary.csv`
- `runs/rs3_tiny_smoke_20260624/summary.md`

The prefix-off run does not have Frontier cache columns in `request_metrics.csv`;
`summary.csv` records `cache_metrics_available=false` and the missing-column
reason.

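A missing-column check of the kind described above can be sketched as below. This is not the aggregator's code. The caller supplies the column names to look for, because the actual Frontier cache column names are not listed here, and `cache_metrics_status` is a hypothetical helper.

```python
import csv


def cache_metrics_status(csv_file, cache_columns):
    """Return (available, reason) for an open request-metrics CSV file.

    `cache_columns` lists the header names the caller expects. The real
    Frontier column names are not assumed here.
    """
    header = next(csv.reader(csv_file), [])  # empty file -> empty header
    missing = [name for name in cache_columns if name not in header]
    if missing:
        return False, "missing columns: " + ", ".join(missing)
    return True, ""
```

Returning the reason alongside the flag lets the summary state why cache metrics are unavailable.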
TTFT/TPOT/E2E/throughput fields are aggregated from Frontier `system_metrics.json`
when present. In this tiny smoke they are dummy-predictor plumbing outputs, not
performance results.

## Not Yet Run

- No `coder_2000` sweep was run in RS3.
- No TP/DP/EP matrix was swept.
- No batch cap, max batch tokens, block size, chunked-prefill, or threshold
  matrix was swept beyond the two-config smoke.
- No canonical Frontier patched-vs-unpatched comparison was rerun.
- No Vidur or AIConfigurator run is part of this harness yet.

## Next Harness Work

- Add a small checked-in config for a real RS3 candidate grid only after deciding
  the patch/upstream policy.
- Add guardrails for invalid dense/MoE parallelism combinations before launching
  larger matrices.
- Investigate the missing request-level cache fields for `coder_2000` before
  using request-level hit ratio as a headline sweep metric.
- Keep latency/throughput result tables clearly separated by predictor/profile
  mode: dummy smoke, profiled Frontier, or calibrated run.