replaysim/docs/rs3_sweep_harness.md

# RS3 Sweep Harness

RS3 adds a reproducible Frontier sweep harness and a tiny smoke test. It does
not yet cover the full TP/EP/DP/config scan.

## Files

- Config: `configs/rs3_tiny_sweep.json`
- Runner: `tools/run_frontier_sweep.py`
- Aggregator: `tools/aggregate_runs.py`
- Tiny smoke outputs: `runs/rs3_tiny_smoke_20260624/`

The output layout is:

```text
runs/<suite>/<sim>/<fixture>/<config_id>/
  command.txt
  env.txt
  run_manifest.json
  run_status.json
  stdout.log
  stderr.log
  exit_code.txt
  runtime_seconds.txt
  frontier_metrics/...
  postprocess_summary.json
  postprocess_summary.md
runs/<suite>/summary.csv
runs/<suite>/summary.md
```
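As a minimal sketch of the layout above, the per-run directory and the suite-level summaries can be derived from the four path components. The helper names `run_dir` and `suite_summary_paths` are illustrative and are not taken from `tools/run_frontier_sweep.py`.

```python
from pathlib import Path


def run_dir(runs_root, suite, sim, fixture, config_id):
    """Return runs/<suite>/<sim>/<fixture>/<config_id> as a Path (hypothetical helper)."""
    return Path(runs_root) / suite / sim / fixture / config_id


def suite_summary_paths(runs_root, suite):
    """Return the suite-level summary.csv and summary.md paths."""
    base = Path(runs_root) / suite
    return base / "summary.csv", base / "summary.md"
```

Keeping `sim` and `fixture` as separate path levels means runs of the same config against different checkouts or traces never share a directory.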

## Config Scheme

`configs/rs3_tiny_sweep.json` is an intentionally small JSON file with these
top-level keys:

- `suite_id`: output suite under `runs/`.
- `sim`: simulator/mode name used in the run path.
- `frontier`: Frontier checkout metadata. The tiny smoke points at patched
  scratch `/tmp/replayserve-frontier-rs1b`, not canonical Frontier.
- `fixtures`: fixture names under `traces/fixtures/`.
- `defaults`: fixed Frontier knobs shared by each config.
- `configs`: named variants with optional `overrides`.
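
The way `defaults` and per-config `overrides` combine can be sketched as below. This is an assumption about the schema rather than the runner's actual code: it treats `configs` as a list of objects with a hypothetical `id` key, and the example knob values are placeholders.

```python
import copy


def expand_matrix(sweep):
    """Return one entry per (fixture, config) pair with merged knobs.

    Assumes each config is an object with an "id" key (hypothetical)
    and an optional "overrides" mapping.
    """
    runs = []
    for fixture in sweep["fixtures"]:
        for cfg in sweep["configs"]:
            knobs = copy.deepcopy(sweep["defaults"])  # never mutate shared defaults
            knobs.update(cfg.get("overrides", {}))    # overrides win over defaults
            runs.append({"fixture": fixture, "config_id": cfg["id"], "knobs": knobs})
    return runs


# Illustrative input only; the values are not copied from rs3_tiny_sweep.json.
example = {
    "fixtures": ["example_fixture"],
    "defaults": {"block_size": 16, "enable_prefix_caching": True},
    "configs": [
        {"id": "fixed_prefix_on"},
        {"id": "prefix_cache_off", "overrides": {"enable_prefix_caching": False}},
    ],
}
```

Calling `expand_matrix(example)` yields two runs that differ only in `enable_prefix_caching`.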

The exposed Frontier knobs include:

- parallelism: `attn_tensor_parallel_size`, `attn_data_parallel_size`,
  `moe_tensor_parallel_size`, `moe_expert_parallel_size`,
  `num_pipeline_stages`, `num_replicas`
- scheduler: `batch_size_cap` (max-num-seqs equivalent),
  `max_tokens_in_batch` (max-batch-tokens equivalent), `block_size`,
  `enable_prefix_caching`, `enable_chunked_prefill`,
  `long_prefill_token_threshold`
- fixed smoke context: model, device, network device, trace max tokens,
  memory-planner mode, GPU memory utilization, non-KV overhead, and dummy
  execution time

For dense `Qwen/Qwen3-32B`, the EP-like knobs stay at `1` in the tiny smoke.
They are present so later MoE configs can be represented without changing the
harness schema.
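
A check along these lines could enforce that rule for dense models. It is a sketch, not part of the harness: which knobs count as "EP-like" is an assumption here (the two `moe_*` sizes), and the function name is hypothetical.

```python
# Assumption: the "EP-like" knobs are the two MoE parallelism sizes.
ASSUMED_MOE_KEYS = ("moe_tensor_parallel_size", "moe_expert_parallel_size")


def dense_knob_violations(knobs, moe_keys=ASSUMED_MOE_KEYS):
    """Return the MoE knobs that are set to something other than 1.

    A missing knob is treated as 1, so an empty result means the
    config is acceptable for a dense model under this assumption.
    """
    return {key: knobs[key] for key in moe_keys if knobs.get(key, 1) != 1}
```

Returning the offending knobs instead of raising lets a caller report every problem in a config at once.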

## Run Commands

From `/home/gahow/phd/replayserve`:

```bash
python3 tools/run_frontier_sweep.py \
  --config configs/rs3_tiny_sweep.json \
  --suite-id rs3_tiny_smoke_20260624

python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624
```

The runner refuses to replace an existing selected run directory unless
`--force` is passed. Use `--dry-run` to emit commands/manifests without running
Frontier, and `--only-config` / `--only-fixture` to narrow the selected matrix.
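
The selection and overwrite behavior described above can be sketched as follows. The function names and the shape of each run entry (a dict with `config_id` and `fixture`) are illustrative assumptions, not the runner's internals.

```python
from pathlib import Path


def select_runs(runs, only_configs=None, only_fixtures=None):
    """Keep runs matching the optional config and fixture filters."""
    selected = []
    for run in runs:
        if only_configs and run["config_id"] not in only_configs:
            continue
        if only_fixtures and run["fixture"] not in only_fixtures:
            continue
        selected.append(run)
    return selected


def check_overwrite(run_dir, force=False):
    """Raise if run_dir already exists and force was not requested."""
    if Path(run_dir).exists() and not force:
        raise FileExistsError(f"{run_dir} exists; pass --force to replace it")
```

Checking only the selected run directories means a narrowed rerun does not trip over unrelated existing results.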

## Frontier Mode

The RS3 tiny smoke uses:

- `frontier.root=/tmp/replayserve-frontier-rs1b`
- `frontier.mode=patched_scratch`
- patch file `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`

The canonical checkout `/tmp/toc-llm-sim-research/Frontier` remains clean and is
not modified by the harness. `summary.csv` records `frontier_dirty=true` for the
patched scratch because the local patch is applied there; that is expected.

To run canonical mode for a safe config, copy the JSON config, set
`frontier.root` to `/tmp/toc-llm-sim-research/Frontier`, change `sim` (it names
the run path, so a new value keeps canonical outputs separate), and run a small
selected config. Do not use canonical fixed `coder_2000` until the
prefix-cache chunked-prefill bug is fixed upstream.
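
The copy-and-edit step can be sketched in Python. This assumes `frontier` is a nested object with a `root` key, as the `frontier.root` notation suggests; the function name is hypothetical. The sketch changes only the two fields named above, so other metadata such as `frontier.mode` may still need to be updated by hand.

```python
import copy


def derive_canonical_config(patched_cfg, canonical_root, sim_name):
    """Return a copy of the config pointing at the canonical checkout."""
    cfg = copy.deepcopy(patched_cfg)          # leave the original config untouched
    cfg["frontier"]["root"] = canonical_root  # assumed nested layout
    cfg["sim"] = sim_name                     # new run-path component
    return cfg
```

Write the result to a new file with `json.dump` and pass that file to `--config`, rather than editing `configs/rs3_tiny_sweep.json` in place.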

## Tiny Smoke Results

Command:

```bash
python3 tools/run_frontier_sweep.py \
  --config configs/rs3_tiny_sweep.json \
  --suite-id rs3_tiny_smoke_20260624
python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624
```

Results:

| config | status | runtime | prefix cache | chunked prefill | Frontier block hit ratio | ReplayServe token hit ratio | preemptions |
|---|---:|---:|---:|---:|---:|---:|---:|
| `fixed_prefix_on` | pass | 8s | on | on | `0.049486618` | `0.049562326` | 0 |
| `prefix_cache_off` | pass | 7s | off | on | n/a | n/a | 0 |

Aggregated files:

- `runs/rs3_tiny_smoke_20260624/summary.csv`
- `runs/rs3_tiny_smoke_20260624/summary.md`

The prefix-off run does not have Frontier cache columns in `request_metrics.csv`;
`summary.csv` records `cache_metrics_available=false` and the missing-column
reason.
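
A header check of this kind could produce that flag. It is only a sketch: the cache column names and the `reason` key are hypothetical placeholders, since the actual Frontier column names are not listed here. Only `cache_metrics_available` comes from `summary.csv`.

```python
import csv
import io

# Hypothetical column names; substitute the real Frontier cache columns.
ASSUMED_CACHE_COLUMNS = ("prefix_cache_hit_blocks", "prefix_cache_total_blocks")


def cache_metrics_status(csv_text, required=ASSUMED_CACHE_COLUMNS):
    """Report whether the request-metrics header has the cache columns."""
    header = next(csv.reader(io.StringIO(csv_text)), [])  # [] for empty input
    missing = [name for name in required if name not in header]
    if missing:
        return {
            "cache_metrics_available": False,
            "reason": "missing columns: " + ", ".join(missing),
        }
    return {"cache_metrics_available": True, "reason": ""}
```

Recording the reason next to the flag keeps an `n/a` hit ratio distinguishable from a run that failed to write metrics at all.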

TTFT/TPOT/E2E/throughput fields are aggregated from Frontier `system_metrics.json`
when present. In this tiny smoke they are dummy-predictor plumbing outputs, not
performance results.

## Not Yet Run

- No `coder_2000` sweep was run in RS3.
- No TP/DP/EP matrix was swept.
- No batch cap, max batch tokens, block size, chunked-prefill, or threshold
  matrix was swept beyond the two-config smoke.
- No canonical Frontier patched-vs-unpatched comparison was rerun.
- No Vidur or AIConfigurator run is part of this harness yet.

## Next Harness Work

- Add a small checked-in config for a real RS3 candidate grid only after deciding
  the patch/upstream policy.
- Add guardrails for invalid dense/MoE parallelism combinations before launching
  larger matrices.
- Investigate `coder_2000` missing request-level cache fields before using
  request-level hit ratio as a headline sweep metric.
- Keep latency/throughput result tables clearly separated by predictor/profile
  mode: dummy smoke, profiled Frontier, or calibrated run.