agentic-kvc/microbench/fresh_setup/run_campaign.sh
Gahow Wang 9c105cf05a MB5 PD ablation: controlled-variable reuse/conc redo + campaign tooling
Reuse and concurrency axes redone with proper controlled variables, plus
the orchestration used to run them on dash0:

- run_reuse_fixed.sh: hold REAL prefill work (delta) constant, vary only
  cached prefix -> reuse = C/(C+U), with C = cached prefix tokens and
  U = uncached prefill tokens (delta). Supersedes old fig1 (which held
  input=8192 and sliced prefix out, confounding "more reuse" with "less
  prefill").
- run_conc.sh: agentic-corner config (in=32768, delta=512, reuse=0.984,
  out=128) that exposes PD's structural KV-transfer tax. Supersedes old fig3.
- run_campaign{,2,3}.sh, backfill_d2048o128.sh: serial campaign drivers
  (strictly one driver at a time), out=128 sweeps, PD wall-cap for
  collapse-draining high-reuse arms, and flaked-arm backfill.
- mb5_run_gpu.sh: per-config bring-up / replay / teardown orchestrator.
- plot_pd_crossover.py: render the reuse_compare figures from fig_agg dumps.
- fig_agg.py: tolerate null stats from fully-collapsed arms (an arm with 0
  successes writes the stat keys as null, so `dict.get(k, {})` returns None
  rather than the {} default, because the key is present).
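
The reuse definition in the run_reuse_fixed.sh bullet can be checked with a little arithmetic. This is a minimal sketch, assuming C is the cached-prefix token count and U is the uncached prefill (delta) token count. The function names are hypothetical and are not part of the campaign scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating reuse = C / (C + U).
# Assumption: C = cached prefix tokens, U = uncached prefill tokens (delta).

# Cached prefix needed to hit a target reuse (integer percent) at fixed U.
# Solving r = C / (C + U) for C gives C = U * r / (1 - r).
# Integer division truncates, so the result is rounded down.
cached_prefix_for() {
  local delta="$1" reuse_pct="$2"
  echo $(( delta * reuse_pct / (100 - reuse_pct) ))
}

# Reuse ratio for a given cached prefix C and uncached delta U, 3 decimals.
reuse_ratio() {
  awk -v c="$1" -v u="$2" 'BEGIN { printf "%.3f\n", c / (c + u) }'
}
```

With delta fixed at 2048, `cached_prefix_for 2048 20` gives 512 and `cached_prefix_for 2048 95` gives 38912, so only the cached prefix grows across the sweep while the real prefill work stays the same. `reuse_ratio 32256 512` gives 0.984, which is consistent with the run_conc.sh config if in = C + U (32768 = 32256 + 512).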

Data: fig1_reuse_fixed.json, fig1_reuse_d{1024,2048}_o128.json
Figs: reuse_compare_AB.png, reuse_compare_ABC.png

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 01:03:27 +08:00

#!/usr/bin/env bash
# Unattended serial PD-ablation campaign: reuse sweep -> conc sweep.
# STRICTLY one driver at a time (the hard lesson): each inner driver brings up and
# tears down its own vLLM per config via scripts/mb5_run_gpu.sh, and the two sweeps
# run sequentially (reuse fully finishes and tears down before conc starts). GPU
# memory is printed between sweeps so leftover usage shows up in the log; it is not
# enforced. NO set -e here: a nonzero exit from one sub-sweep must NOT skip the other
# sweep; each rc is captured and reported. Detached launch writes a DONE marker.
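# Without set -e a failed cd would let both sweeps run from the wrong directory.
cd /home/admin/cpfs/wjh/agentic-kv-fresh || { echo "cannot cd to campaign root" >&2; exit 1; }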
export MB5_VENV="${MB5_VENV:-/home/admin/cpfs/wjh/agentic-kv-fresh/.venv_dash0}"
FS=microbench/fresh_setup
echo "=== CAMPAIGN START $(date) ==="
echo "=== [1/2] REUSE SWEEP (fixed real prefill delta=2048, out=256, reuse 20-95%, N=8) $(date) ==="
bash "$FS/run_reuse_fixed.sh"; rc_reuse=$?
echo "=== reuse sweep rc=$rc_reuse $(date) ==="
sleep 15
echo "--- GPU mem after reuse sweep (expect ~0 before conc) ---"
nvidia-smi --query-gpu=index,memory.used --format=csv,noheader | head -8
echo "=== [2/2] CONC SWEEP (in=32768 reuse=0.984, balanced N grid 8 16 32 48 64 96 128) $(date) ==="
NLIST="8 16 32 48 64 96 128" bash "$FS/run_conc.sh"; rc_conc=$?
echo "=== conc sweep rc=$rc_conc $(date) ==="
echo "=== CAMPAIGN DONE reuse_rc=$rc_reuse conc_rc=$rc_conc $(date) ==="
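
The script prints GPU memory between sweeps but leaves the judgement to whoever reads the log. If an automated gate were wanted, one possible shape is sketched below. The function name `gpus_clear` and the 1024 MiB default threshold are assumptions for illustration and do not come from the campaign scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "index, memory.used" CSV rows on stdin (as printed by
# nvidia-smi --query-gpu=index,memory.used --format=csv,noheader) and succeed only
# if every GPU is at or below a threshold in MiB.
gpus_clear() {
  local limit_mib="${1:-1024}"   # assumed threshold, not from the original script
  # "$2 + 0" takes the leading number, so both "3" and "3 MiB" are accepted.
  awk -F', *' -v limit="$limit_mib" '
    NF >= 2 && $2 + 0 > limit {
      printf "GPU %s still holds %d MiB\n", $1, $2
      busy = 1
    }
    END { exit busy ? 1 : 0 }
  '
}
```

A caller could pipe the existing query into it, for example `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader | gpus_clear 1024 || echo "WARN: GPUs not clear before conc"`. Warning instead of aborting keeps the script's stated rule that one sweep's state must not skip the other.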