RS3 Sweep Harness

RS3 adds a reproducible Frontier sweep harness and a tiny smoke run. It is not the full TP/EP/DP/config scan.

Files

Config: configs/rs3_tiny_sweep.json
Runner: tools/run_frontier_sweep.py
Aggregator: tools/aggregate_runs.py
Tiny smoke outputs: runs/rs3_tiny_smoke_20260624/

The output layout is:

runs/<suite>/<sim>/<fixture>/<config_id>/
  command.txt
  env.txt
  run_manifest.json
  run_status.json
  stdout.log
  stderr.log
  exit_code.txt
  runtime_seconds.txt
  frontier_metrics/...
  postprocess_summary.json
  postprocess_summary.md
runs/<suite>/summary.csv
runs/<suite>/summary.md
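The layout above can be sketched in Python as a path builder plus a completeness check. This is a minimal sketch, not the runner's code: the helper names `run_dir` and `missing_files` are hypothetical, and it assumes every per-run file listed above is written for every run, which may not hold for failed or dry runs.

```python
from pathlib import Path

# Per-run files named in the layout above. frontier_metrics/ is a directory
# with unspecified contents, so it is left out of this check.
PER_RUN_FILES = [
    "command.txt", "env.txt", "run_manifest.json", "run_status.json",
    "stdout.log", "stderr.log", "exit_code.txt", "runtime_seconds.txt",
    "postprocess_summary.json", "postprocess_summary.md",
]


def run_dir(runs_root, suite, sim, fixture, config_id):
    """Return runs/<suite>/<sim>/<fixture>/<config_id>/ as a Path (hypothetical helper)."""
    return Path(runs_root) / suite / sim / fixture / config_id


def missing_files(directory):
    """List the expected per-run files that are absent from `directory`."""
    directory = Path(directory)
    return [name for name in PER_RUN_FILES if not (directory / name).is_file()]
```

A check like this could run before aggregation to flag incomplete run directories.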

Config Scheme

configs/rs3_tiny_sweep.json is an intentionally small JSON file with these top-level keys:

suite_id: output suite under runs/.
sim: simulator/mode name used in the run path.
frontier: Frontier checkout metadata. The tiny smoke points at the patched scratch checkout /tmp/replayserve-frontier-rs1b, not the canonical Frontier checkout.
fixtures: fixture names under traces/fixtures/.
defaults: fixed Frontier knobs shared by each config.
configs: named variants with optional overrides.
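How `defaults` and per-config overrides might combine can be sketched as follows. The exact JSON shape is an assumption: this sketch treats `configs` as a list of objects with hypothetical `id` and `overrides` fields, and the example values are illustrative rather than copied from configs/rs3_tiny_sweep.json.

```python
import json


def expand_configs(sweep):
    """Yield (config_id, knobs) with each variant's overrides applied on top of defaults.

    Assumes `configs` is a list of {"id": ..., "overrides": {...}} objects;
    the real schema may differ.
    """
    defaults = sweep.get("defaults", {})
    for variant in sweep.get("configs", []):
        knobs = dict(defaults)  # copy so the shared defaults are never mutated
        knobs.update(variant.get("overrides", {}))
        yield variant["id"], knobs


# Illustrative config only; field names other than the top-level keys are assumed.
example = json.loads("""
{
  "suite_id": "rs3_tiny_smoke_20260624",
  "defaults": {"enable_prefix_caching": true, "enable_chunked_prefill": true},
  "configs": [
    {"id": "fixed_prefix_on"},
    {"id": "prefix_cache_off", "overrides": {"enable_prefix_caching": false}}
  ]
}
""")

for config_id, knobs in expand_configs(example):
    print(config_id, knobs)
```

Copying `defaults` per variant keeps one variant's overrides from leaking into the next.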

The exposed Frontier knobs include:

parallelism: attn_tensor_parallel_size, attn_data_parallel_size, moe_tensor_parallel_size, moe_expert_parallel_size, num_pipeline_stages, num_replicas
scheduler: batch_size_cap (max-num-seqs equivalent), max_tokens_in_batch (max-batch-tokens equivalent), block_size, enable_prefix_caching, enable_chunked_prefill, long_prefill_token_threshold
fixed smoke context: model, device, network device, trace max tokens, memory-planner mode, GPU memory utilization, non-KV overhead, and dummy execution time

For dense Qwen/Qwen3-32B, the EP-like knobs stay at 1 in the tiny smoke. They are present so later MoE configs can be represented without changing the harness schema.
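A check for the dense-model rule above might look like the sketch below. It is not part of the harness: the function name is hypothetical, and treating both `moe_*` knobs as the "EP-like" set is an assumption.

```python
# Knob names come from the parallelism list above. Treating both of them as
# the "EP-like" set for dense models is an assumption of this sketch.
MOE_KNOBS = ("moe_tensor_parallel_size", "moe_expert_parallel_size")


def dense_knob_violations(knobs):
    """Return the MoE-only knobs that a dense-model config sets to something other than 1."""
    return {name: knobs[name] for name in MOE_KNOBS if knobs.get(name, 1) != 1}
```

An empty result means the config is consistent with a dense model; a non-empty one names the offending knobs and their values.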

Run Commands

From /home/gahow/phd/replayserve:

python3 tools/run_frontier_sweep.py \
  --config configs/rs3_tiny_sweep.json \
  --suite-id rs3_tiny_smoke_20260624

python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624

The runner refuses to replace an existing selected run directory unless --force is passed. Use --dry-run to emit commands/manifests without running Frontier, and --only-config / --only-fixture to narrow the selected matrix.
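The overwrite guard described above can be sketched as follows. This is an illustration under assumptions, not the code in tools/run_frontier_sweep.py: `prepare_run_dir` is a hypothetical name, and deleting the old directory under --force is a guess at the runner's behavior.

```python
import shutil
from pathlib import Path


def prepare_run_dir(path, force=False):
    """Create a fresh run directory, refusing to replace an existing one unless force is set."""
    path = Path(path)
    if path.exists():
        if not force:
            raise FileExistsError(f"{path} already exists; pass --force to replace it")
        # Assumption: --force discards the previous run's files entirely.
        shutil.rmtree(path)
    path.mkdir(parents=True)
    return path
```

Failing before any file is touched keeps earlier results intact when a suite ID is reused by accident.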

Frontier Mode

The RS3 tiny smoke uses:

frontier.root=/tmp/replayserve-frontier-rs1b
frontier.mode=patched_scratch
patch file patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch

The canonical checkout /tmp/toc-llm-sim-research/Frontier remains clean and is not modified by the harness. summary.csv records frontier_dirty=true for the patched scratch because the local patch is applied there; that is expected.

To run canonical mode for a safe config, copy the JSON config, set frontier.root to /tmp/toc-llm-sim-research/Frontier, change sim, and run a small selected config. Do not use canonical fixed coder_2000 until the prefix-cache chunked-prefill bug is fixed upstream.

Tiny Smoke Results

Command:

python3 tools/run_frontier_sweep.py \
  --config configs/rs3_tiny_sweep.json \
  --suite-id rs3_tiny_smoke_20260624
python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624

Results:

| config | status | runtime | prefix cache | chunked prefill | Frontier block hit ratio | ReplayServe token hit ratio | preemptions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `fixed_prefix_on` | pass | 8s | on | on | `0.049486618` | `0.049562326` | 0 |
| `prefix_cache_off` | pass | 7s | off | on | n/a | n/a | 0 |

Aggregated files:

runs/rs3_tiny_smoke_20260624/summary.csv
runs/rs3_tiny_smoke_20260624/summary.md

The prefix-off run does not have Frontier cache columns in request_metrics.csv; summary.csv records cache_metrics_available=false and the missing-column reason.
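One way to derive a flag like cache_metrics_available is to inspect the CSV header, as in this sketch. The function name is hypothetical, and the cache column names are supplied by the caller because the actual Frontier column names are not given here.

```python
import csv


def cache_metrics_status(csv_path, cache_columns):
    """Return (available, reason) based on whether all cache columns are in the CSV header."""
    with open(csv_path, newline="") as handle:
        # An empty file yields an empty header, so every column counts as missing.
        header = next(csv.reader(handle), [])
    missing = [col for col in cache_columns if col not in header]
    if missing:
        return False, "missing columns: " + ", ".join(missing)
    return True, ""
```

Recording the reason next to the flag lets the summary distinguish a prefix-off run from a run whose metrics were lost.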

TTFT/TPOT/E2E/throughput fields are aggregated from Frontier system_metrics.json when present. In this tiny smoke they are dummy-predictor plumbing outputs, not performance results.

Not Yet Run

No coder_2000 sweep was run in RS3.
No TP/DP/EP matrix was swept.
No batch cap, max batch tokens, block size, chunked-prefill, or threshold matrix was swept beyond the two-config smoke.
No canonical Frontier patched-vs-unpatched comparison was rerun.
No Vidur or AIConfigurator run is part of this harness yet.

Next Harness Work

Add a small checked-in config for a real RS3 candidate grid only after deciding the patch/upstream policy.
Add guardrails for invalid dense/MoE parallelism combinations before launching larger matrices.
Investigate coder_2000 missing request-level cache fields before using request-level hit ratio as a headline sweep metric.
Keep latency/throughput result tables clearly separated by predictor/profile mode: dummy smoke, profiled Frontier, or calibrated run.
