RS3 Sweep Harness

RS3 adds a reproducible Frontier sweep harness and a tiny smoke test. It does not cover the full TP/EP/DP/config scan.

Files

  • Config: configs/rs3_tiny_sweep.json
  • Runner: tools/run_frontier_sweep.py
  • Aggregator: tools/aggregate_runs.py
  • Tiny smoke outputs: runs/rs3_tiny_smoke_20260624/

The output layout is:

runs/<suite>/<sim>/<fixture>/<config_id>/
  command.txt
  env.txt
  run_manifest.json
  run_status.json
  stdout.log
  stderr.log
  exit_code.txt
  runtime_seconds.txt
  frontier_metrics/...
  postprocess_summary.json
  postprocess_summary.md
runs/<suite>/summary.csv
runs/<suite>/summary.md
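
The layout above can be checked mechanically after a run. The following is a minimal sketch, not the harness's own code: the helper names `run_dir` and `missing_artifacts` are hypothetical, and only the file names are taken from the layout listing.

```python
from pathlib import Path

# File names copied from the layout above. frontier_metrics/ is a directory
# and is checked separately.
EXPECTED_FILES = [
    "command.txt",
    "env.txt",
    "run_manifest.json",
    "run_status.json",
    "stdout.log",
    "stderr.log",
    "exit_code.txt",
    "runtime_seconds.txt",
    "postprocess_summary.json",
    "postprocess_summary.md",
]


def run_dir(root: Path, suite: str, sim: str, fixture: str, config_id: str) -> Path:
    """Build <root>/<suite>/<sim>/<fixture>/<config_id> (hypothetical helper)."""
    return root / suite / sim / fixture / config_id


def missing_artifacts(path: Path) -> list[str]:
    """Return the expected per-run artifacts that are absent under `path`."""
    missing = [name for name in EXPECTED_FILES if not (path / name).is_file()]
    if not (path / "frontier_metrics").is_dir():
        missing.append("frontier_metrics/")
    return missing
```

Calling `missing_artifacts(run_dir(Path("runs"), suite, sim, fixture, config_id))` returns an empty list when a run directory is complete, which makes it easy to spot partially written runs before aggregating.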

Config Scheme

configs/rs3_tiny_sweep.json is an intentionally small JSON file with the following top-level keys:

  • suite_id: output suite under runs/.
  • sim: simulator/mode name used in the run path.
  • frontier: Frontier checkout metadata. The tiny smoke points at the patched scratch checkout /tmp/replayserve-frontier-rs1b, not canonical Frontier.
  • fixtures: fixture names under traces/fixtures/.
  • defaults: fixed Frontier knobs shared by each config.
  • configs: named variants with optional overrides.
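
One way to read the scheme above is as a fixtures-by-configs matrix in which each variant's overrides are layered on top of defaults. The sketch below illustrates that reading only. It assumes `configs` is a list of objects with `id` and optional `overrides` keys, which is a guess at the JSON shape, and `expand_matrix` is a hypothetical name rather than a function in tools/run_frontier_sweep.py.

```python
from itertools import product


def expand_matrix(config: dict) -> list[dict]:
    """Expand a sweep config into one entry per (fixture, variant) pair.

    Assumed shape (not confirmed by the harness): config["configs"] is a
    list of {"id": ..., "overrides": {...}} objects.
    """
    runs = []
    for fixture, variant in product(config["fixtures"], config["configs"]):
        # Overrides win over defaults; the defaults dict itself is not mutated.
        knobs = {**config["defaults"], **variant.get("overrides", {})}
        runs.append(
            {
                "suite_id": config["suite_id"],
                "sim": config["sim"],
                "fixture": fixture,
                "config_id": variant["id"],
                "knobs": knobs,
            }
        )
    return runs
```

Under this reading, a variant such as prefix_cache_off would only need to override enable_prefix_caching, with every other knob inherited from defaults.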

The exposed Frontier knobs include:

  • parallelism: attn_tensor_parallel_size, attn_data_parallel_size, moe_tensor_parallel_size, moe_expert_parallel_size, num_pipeline_stages, num_replicas
  • scheduler: batch_size_cap / max-num-seqs equivalent, max_tokens_in_batch / max-batch-tokens equivalent, block_size, enable_prefix_caching, enable_chunked_prefill, long_prefill_token_threshold
  • fixed smoke context: model, device, network device, trace max tokens, memory-planner mode, GPU memory utilization, non-KV overhead, and dummy execution time

For dense Qwen/Qwen3-32B, the EP-like knobs stay at 1 in the tiny smoke. They are present so later MoE configs can be represented without changing the harness schema.
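
The dense-model constraint can be expressed as a small pre-launch check. This is a sketch under assumptions: it treats moe_tensor_parallel_size and moe_expert_parallel_size as the "EP-like" knobs, and `dense_parallelism_errors` is a hypothetical helper, not an existing harness guardrail.

```python
# Assumption: these two knobs are the "EP-like" ones that must stay at 1
# for a dense model. Adjust the tuple if the harness defines it differently.
MOE_KNOBS = ("moe_tensor_parallel_size", "moe_expert_parallel_size")


def dense_parallelism_errors(knobs: dict) -> list[str]:
    """Return one message per MoE knob that is not 1 (missing knobs count as 1)."""
    return [
        f"{name}={knobs[name]} but dense models expect 1"
        for name in MOE_KNOBS
        if knobs.get(name, 1) != 1
    ]
```

Returning a list of messages instead of raising on the first problem lets a caller report every invalid knob in a config at once.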

Run Commands

From /home/gahow/phd/replayserve:

python3 tools/run_frontier_sweep.py \
  --config configs/rs3_tiny_sweep.json \
  --suite-id rs3_tiny_smoke_20260624

python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624

The runner refuses to replace an existing selected run directory unless --force is passed. Use --dry-run to emit commands/manifests without running Frontier, and --only-config / --only-fixture to narrow the selected matrix.
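
The selection and overwrite behavior described above could look roughly like the following. This is an illustrative sketch, not the runner's implementation: the function names are hypothetical, and it assumes that --force deletes and recreates the run directory, which the runner may handle differently.

```python
import shutil
from pathlib import Path


def select_runs(runs, only_config=None, only_fixture=None):
    """Keep runs matching the optional --only-config / --only-fixture filters."""
    return [
        r
        for r in runs
        if (only_config is None or r["config_id"] == only_config)
        and (only_fixture is None or r["fixture"] == only_fixture)
    ]


def prepare_run_dir(path: Path, force: bool = False) -> None:
    """Create a fresh run directory, refusing to replace one unless forced."""
    if path.exists():
        if not force:
            raise FileExistsError(f"{path} exists; pass --force to replace it")
        # Assumption: "replace" means delete and recreate, so stale
        # artifacts from the earlier run cannot leak into the new one.
        shutil.rmtree(path)
    path.mkdir(parents=True)
```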

Frontier Mode

The RS3 tiny smoke uses:

  • frontier.root=/tmp/replayserve-frontier-rs1b
  • frontier.mode=patched_scratch
  • patch file patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch

The canonical checkout /tmp/toc-llm-sim-research/Frontier remains clean and is not modified by the harness. summary.csv records frontier_dirty=true for the patched scratch because the local patch is applied there; that is expected.
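
A dirty flag like frontier_dirty is commonly derived from `git status --porcelain`, which prints one line per changed or untracked path and nothing for a clean tree. The sketch below shows that approach. Whether the harness computes the column this way is an assumption, and both function names are hypothetical.

```python
import subprocess


def is_dirty(porcelain_output: str) -> bool:
    """True if `git status --porcelain` output lists at least one path."""
    return any(line.strip() for line in porcelain_output.splitlines())


def frontier_dirty(root: str) -> bool:
    """Run git in the checkout at `root` and report whether it has local changes."""
    result = subprocess.run(
        ["git", "-C", root, "status", "--porcelain"],
        check=True,
        capture_output=True,
        text=True,
    )
    return is_dirty(result.stdout)
```

With this definition, the patched scratch checkout reports dirty because of the applied patch, while an untouched canonical checkout reports clean.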

To run canonical mode for a safe config, copy the JSON config, set frontier.root to /tmp/toc-llm-sim-research/Frontier, change sim so the canonical outputs land under a separate run path, and run a small selected config. Do not run fixed coder_2000 against canonical Frontier until the prefix-cache chunked-prefill bug is fixed upstream.

Tiny Smoke Results

Command:

python3 tools/run_frontier_sweep.py \
  --config configs/rs3_tiny_sweep.json \
  --suite-id rs3_tiny_smoke_20260624
python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624

Results:

| config | status | runtime | prefix cache | chunked prefill | Frontier block hit ratio | ReplayServe token hit ratio | preemptions |
|---|---|---|---|---|---|---|---|
| fixed_prefix_on | pass | 8s | on | on | 0.049486618 | 0.049562326 | 0 |
| prefix_cache_off | pass | 7s | off | on | n/a | n/a | 0 |

Aggregated files:

  • runs/rs3_tiny_smoke_20260624/summary.csv
  • runs/rs3_tiny_smoke_20260624/summary.md

The prefix-off run does not have Frontier cache columns in request_metrics.csv; summary.csv records cache_metrics_available=false and the missing-column reason.
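
The missing-column handling described above can be sketched as a header check on the CSV. This is an illustration under assumptions: the required column names are passed in by the caller because the actual Frontier cache column names are not listed here, the `reason` key is a hypothetical stand-in for however summary.csv stores the missing-column reason, and `cache_metrics_status` is not a function in tools/aggregate_runs.py.

```python
import csv
import io


def cache_metrics_status(csv_text: str, required: tuple[str, ...]) -> dict:
    """Report whether every required cache column appears in the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []  # fieldnames is None for empty input
    missing = [col for col in required if col not in header]
    return {
        "cache_metrics_available": not missing,
        # Hypothetical key name for the missing-column reason.
        "reason": "" if not missing else "missing columns: " + ", ".join(missing),
    }
```

Recording the reason alongside the boolean keeps an n/a hit ratio distinguishable from a genuine zero when reading the aggregated summary.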

TTFT/TPOT/E2E/throughput fields are aggregated from Frontier system_metrics.json when present. In this tiny smoke they are dummy-predictor plumbing outputs, not performance results.

Not Yet Run

  • No coder_2000 sweep was run in RS3.
  • No TP/DP/EP matrix was swept.
  • No batch cap, max batch tokens, block size, chunked-prefill, or threshold matrix was swept beyond the two-config smoke.
  • No canonical Frontier patched-vs-unpatched comparison was rerun.
  • No Vidur or AIConfigurator run is part of this harness yet.

Next Harness Work

  • Add a small checked-in config for a real RS3 candidate grid only after deciding the patch/upstream policy.
  • Add guardrails for invalid dense/MoE parallelism combinations before launching larger matrices.
  • Investigate coder_2000 missing request-level cache fields before using request-level hit ratio as a headline sweep metric.
  • Keep latency/throughput result tables clearly separated by predictor/profile mode: dummy smoke, profiled Frontier, or calibrated run.