RS3 Sweep Harness
RS3 adds a reproducible Frontier sweep harness and a tiny smoke. This is not the full TP/EP/DP/config scan.
Files
- Config: configs/rs3_tiny_sweep.json
- Runner: tools/run_frontier_sweep.py
- Aggregator: tools/aggregate_runs.py
- Tiny smoke outputs: runs/rs3_tiny_smoke_20260624/
The output layout is:
runs/<suite>/<sim>/<fixture>/<config_id>/
    command.txt
    env.txt
    run_manifest.json
    run_status.json
    stdout.log
    stderr.log
    exit_code.txt
    runtime_seconds.txt
    frontier_metrics/...
    postprocess_summary.json
    postprocess_summary.md
runs/<suite>/summary.csv
runs/<suite>/summary.md
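As a minimal sketch of how this layout could be checked after a run, the snippet below builds a per-run directory path and lists which expected files are absent. The helper names `run_dir` and `missing_files` are hypothetical and are not part of the harness; only the file names are taken from the layout above.

```python
from pathlib import Path

# File names copied from the layout above. frontier_metrics/ is a directory
# with unspecified contents, so it is left out of this check.
PER_RUN_FILES = [
    "command.txt", "env.txt", "run_manifest.json", "run_status.json",
    "stdout.log", "stderr.log", "exit_code.txt", "runtime_seconds.txt",
    "postprocess_summary.json", "postprocess_summary.md",
]

def run_dir(runs_root, suite, sim, fixture, config_id):
    """Build <runs_root>/<suite>/<sim>/<fixture>/<config_id> (hypothetical helper)."""
    return Path(runs_root) / suite / sim / fixture / config_id

def missing_files(directory):
    """Return the expected per-run files that are not present in `directory`."""
    return [name for name in PER_RUN_FILES if not (Path(directory) / name).is_file()]
```

A check like this could run before aggregation to flag incomplete run directories.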
Config Scheme
configs/rs3_tiny_sweep.json is intentionally small JSON:
- suite_id: output suite under runs/.
- sim: simulator/mode name used in the run path.
- frontier: Frontier checkout metadata. The tiny smoke points at patched scratch /tmp/replayserve-frontier-rs1b, not canonical Frontier.
- fixtures: fixture names under traces/fixtures/.
- defaults: fixed Frontier knobs shared by each config.
- configs: named variants with optional overrides.
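To illustrate how defaults and per-config overrides might combine, here is a small sketch. The top-level key names come from the scheme above; every value, the `id` field, and the list shape of `configs` are assumptions made for the example and may not match the real file. `expand` is a hypothetical helper, not the runner's code.

```python
import json

# Hypothetical miniature config. Key names follow the scheme above; all values
# and the shape of each "configs" entry (an "id" plus optional "overrides")
# are assumptions for illustration only.
EXAMPLE = json.loads("""
{
  "suite_id": "example_suite",
  "sim": "example_sim",
  "frontier": {"root": "/path/to/frontier"},
  "fixtures": ["example_fixture"],
  "defaults": {"block_size": 16, "enable_prefix_caching": true},
  "configs": [
    {"id": "prefix_on"},
    {"id": "prefix_off", "overrides": {"enable_prefix_caching": false}}
  ]
}
""")

def expand(config):
    """Yield (fixture, config_id, knobs) for every fixture/config pair."""
    for fixture in config["fixtures"]:
        for variant in config["configs"]:
            # Overrides win over defaults; a variant without overrides uses defaults as-is.
            knobs = {**config["defaults"], **variant.get("overrides", {})}
            yield fixture, variant["id"], knobs

for fixture, config_id, knobs in expand(EXAMPLE):
    print(fixture, config_id, knobs)
```

Keeping shared knobs in `defaults` means each variant only states what differs, which keeps a two-config smoke easy to read.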
The exposed Frontier knobs include:
- parallelism: attn_tensor_parallel_size, attn_data_parallel_size, moe_tensor_parallel_size, moe_expert_parallel_size, num_pipeline_stages, num_replicas
- scheduler: batch_size_cap / max-num-seqs equivalent, max_tokens_in_batch / max-batch-tokens equivalent, block_size, enable_prefix_caching, enable_chunked_prefill, long_prefill_token_threshold
- fixed smoke context: model, device, network device, trace max tokens, memory-planner mode, GPU memory utilization, non-KV overhead, and dummy execution time
For dense Qwen/Qwen3-32B, the EP-like knobs stay at 1 in the tiny smoke.
They are present so later MoE configs can be represented without changing the
harness schema.
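A pre-launch check for the dense case could look like the sketch below. It assumes that the "EP-like knobs" are the two `moe_*` parallelism knobs and that a missing knob counts as 1; both are assumptions, and `dense_violations` is a hypothetical helper rather than an existing harness function.

```python
# Assumption: the "EP-like" knobs are the two MoE parallelism knobs listed above.
MOE_KNOBS = ("moe_tensor_parallel_size", "moe_expert_parallel_size")

def dense_violations(knobs):
    """Return the MoE knobs that a dense-model config sets to something other than 1."""
    # A knob that is absent is treated as 1 (an assumption, not harness behaviour).
    return [name for name in MOE_KNOBS if knobs.get(name, 1) != 1]

print(dense_violations({"attn_tensor_parallel_size": 2, "moe_expert_parallel_size": 4}))
```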
Run Commands
From /home/gahow/phd/replayserve:
python3 tools/run_frontier_sweep.py \
--config configs/rs3_tiny_sweep.json \
--suite-id rs3_tiny_smoke_20260624
python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624
The runner refuses to replace an existing selected run directory unless
--force is passed. Use --dry-run to emit commands/manifests without running
Frontier, and --only-config / --only-fixture to narrow the selected matrix.
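The overwrite guard and the matrix narrowing described above can be sketched as follows. This is an illustration of the described behaviour, not the runner's actual code; `needs_replace` and `select` are hypothetical names, and the sketch only decides what to do without deleting anything.

```python
from pathlib import Path

def needs_replace(run_dir, force=False):
    """Return False if run_dir is absent and True if it exists and force is set.

    Raises FileExistsError when the directory exists and force is not set,
    mirroring the "refuses to replace unless --force" rule.
    """
    run_dir = Path(run_dir)
    if not run_dir.exists():
        return False
    if not force:
        raise FileExistsError(f"{run_dir} already exists; pass --force to replace it")
    return True

def select(matrix, only_config=None, only_fixture=None):
    """Filter (fixture, config_id) pairs, as --only-config / --only-fixture would."""
    return [
        (fixture, config_id)
        for fixture, config_id in matrix
        if only_fixture in (None, fixture) and only_config in (None, config_id)
    ]
```

Separating the decision from the deletion keeps the guard easy to test and makes a dry run safe by construction.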
Frontier Mode
The RS3 tiny smoke uses:
- frontier.root=/tmp/replayserve-frontier-rs1b
- frontier.mode=patched_scratch
- patch file patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
The canonical checkout /tmp/toc-llm-sim-research/Frontier remains clean and is
not modified by the harness. summary.csv records frontier_dirty=true for the
patched scratch because the local patch is applied there; that is expected.
To run canonical mode for a safe config, copy the JSON config, set
frontier.root to /tmp/toc-llm-sim-research/Frontier, change sim, and run a
small selected config. Do not use canonical fixed coder_2000 until the
prefix-cache chunked-prefill bug is fixed upstream.
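The copy-and-edit step above could be scripted roughly as below. The sketch assumes `frontier` is a JSON object with a `root` key, as the `frontier.root` notation suggests; `to_canonical` and the new `sim` value are hypothetical, and other `frontier` fields such as `mode` are left untouched because their canonical values are not stated here.

```python
import copy

# Canonical checkout path as given in the text above.
CANONICAL_ROOT = "/tmp/toc-llm-sim-research/Frontier"

def to_canonical(config, sim_name):
    """Return a copy of `config` pointing at the canonical checkout (hypothetical helper)."""
    out = copy.deepcopy(config)  # never mutate the checked-in config in place
    out["frontier"]["root"] = CANONICAL_ROOT
    out["sim"] = sim_name        # a distinct sim name keeps run paths separate
    return out
```

The result would then be written to a new JSON file and passed to the runner with a narrowed selection.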
Tiny Smoke Results
Command:
python3 tools/run_frontier_sweep.py \
--config configs/rs3_tiny_sweep.json \
--suite-id rs3_tiny_smoke_20260624
python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624
Results:
| config | status | runtime | prefix cache | chunked prefill | Frontier block hit ratio | ReplayServe token hit ratio | preemptions |
|---|---|---|---|---|---|---|---|
| fixed_prefix_on | pass | 8s | on | on | 0.049486618 | 0.049562326 | 0 |
| prefix_cache_off | pass | 7s | off | on | n/a | n/a | 0 |
Aggregated files:
- runs/rs3_tiny_smoke_20260624/summary.csv
- runs/rs3_tiny_smoke_20260624/summary.md
The prefix-off run does not have Frontier cache columns in request_metrics.csv;
summary.csv records cache_metrics_available=false and the missing-column
reason.
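One way such a check could work is to inspect the CSV header and report which cache columns are absent. The column names and the `cache_metrics_missing_reason` field below are hypothetical, since the real Frontier column names are not given in this document; only `cache_metrics_available` comes from the text above.

```python
import csv
import io

# Hypothetical column names, used only to make the sketch runnable.
CACHE_COLUMNS = ("cache_hit_blocks", "cache_total_blocks")

def cache_metrics_status(csv_text, required=CACHE_COLUMNS):
    """Report whether the cache columns exist in the CSV header, and why not if absent."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    missing = [name for name in required if name not in header]
    return {
        "cache_metrics_available": not missing,
        # Hypothetical field name for the recorded reason.
        "cache_metrics_missing_reason": "missing columns: " + ", ".join(missing) if missing else "",
    }

print(cache_metrics_status("request_id,num_tokens\n0,128\n"))
```

Recording the reason alongside the flag lets the summary distinguish "prefix cache was off" from a genuinely broken metrics file.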
TTFT/TPOT/E2E/throughput fields are aggregated from Frontier system_metrics.json
when present. In this tiny smoke they are dummy-predictor plumbing outputs, not
performance results.
Not Yet Run
- No coder_2000 sweep was run in RS3.
- No TP/DP/EP matrix was swept.
- No batch cap, max batch tokens, block size, chunked-prefill, or threshold matrix was swept beyond the two-config smoke.
- No canonical Frontier patched-vs-unpatched comparison was rerun.
- No Vidur or AIConfigurator run is part of this harness yet.
Next Harness Work
- Add a small checked-in config for a real RS3 candidate grid only after deciding the patch/upstream policy.
- Add guardrails for invalid dense/MoE parallelism combinations before launching larger matrices.
- Investigate coder_2000 missing request-level cache fields before using request-level hit ratio as a headline sweep metric.
- Keep latency/throughput result tables clearly separated by predictor/profile mode: dummy smoke, profiled Frontier, or calibrated run.