MB5 PD reuse-centric ablation: tooling, data, Fig 1-3

Three-axis controlled ablation of PD-colo vs PD-disagg on synthetic regular traces (closed-loop, controlled reuse via REPLAY_NO_REALIZED_PREFIX) on the clean stack (e13391e gated off). Axis 1 (Fig 1) -- reuse 6%->94% at N=8, in8192/out256 Axis 2 (Fig 2) -- shape in2048/out2048 -> in32768/out64 at N=8, reuse~70% Axis 3 (Fig 3) -- concurrency N=8/16/32/64 at reuse~71%, in8192/out256 Findings: * APC parity colo=PD at every reuse (5.5/22/44/66/77/82%) -- contamination fix validated. * PD edge erodes 1.57x->1.10x with reuse; prefill GPUs strand 26%->9%. * Shape: PD-best peaks mid-sweep (1.34x at in8192/out512); wrong PD ratio catastrophic at prefill extreme (in32768/out64 pd2 = 378/400, p99 432s). * Concurrency: PD wins N<=32 (1.23-1.29x), TIPS at N=64 -- pd2/pd4 crater (APC 71%->1.4%, TPS -30%) while colo scales cleanly. Infrastructure: * replayer: --max-inflight-sessions, --inter-turn-think, --no-realized-prefix (env-defaulted via REPLAY_MAX_INFLIGHT, REPLAY_INTER_TURN_THINK_S, REPLAY_NO_REALIZED_PREFIX). * mb5_run.sh: writes bench_config.json + gpu_util.csv + run_window.json + instance_apc.txt + metrics.jsonl for bench_report/fig_agg ingest. * fig_agg.py: per-arm GPU role split + producer-side APC; --json mode. * gpu_util_report.py: companion per-GPU util report from gpu_util.csv. * partial_summary.py: stats from in-flight replay_metrics.jsonl (works before metrics.summary.json exists). Data: analysis/mb5_pd_ablation/fig{1,2,3}.json (24 + 20 + 16 rows). Figures: figs/mb5_pd_ablation/fig{1_reuse,2_shape,3_concurrency}_axis.png.
2026-05-31 20:14:46 +08:00
parent a2111b6e18
commit fafc44da79
12 changed files with 389 additions and 9 deletions
--- a/microbench/fresh_setup/mb5_run.sh
+++ b/microbench/fresh_setup/mb5_run.sh
@@ -69,6 +69,13 @@ run_one() {
    source "${VENV}/bin/activate"
    local replay_out="${rundir}/replay_metrics.jsonl"
    mkdir -p "$(dirname "${replay_out}")"
+    # bench_report.py inputs: worker->gpu map (worker i == gpu i for every config;
+    # for PD, workers 0-3 are producers on gpu0-3, 4-7 consumers on gpu4-7).
+    printf '{"base_port":8000,"n_instances":8,"gpu_indices":[0,1,2,3,4,5,6,7]}\n' \
+        > "${rundir}/bench_config.json"
+    # per-GPU utilization timeseries over the replay window (2s sampling)
+    bash "${SCRIPT_DIR}/gpu_monitor.sh" "${rundir}/gpu_util.csv" 2 >/dev/null 2>&1 &
+    local GPU_MON=$!
    local t0
    t0=$(date +%s.%N)
    if ! PYTHONPATH="${FRESH_ROOT}" python -m replayer \
@@ -82,6 +89,7 @@ run_one() {
        t1=$(date +%s.%N)
        local wall=$(python -c "print(${t1} - ${t0})")
        echo "[mb5-run] REPLAY FAILED after ${wall} s; see ${OUT_ROOT}/${config}_rep${rep}_replay.log"
+        kill "${GPU_MON}" 2>/dev/null || true
        bash "${LAUNCH}" stop > /dev/null 2>&1 || true
        return 1
    fi
@@ -91,6 +99,9 @@ run_one() {
    wall_clock_s=$(python -c "print(${t1} - ${t0})")
    echo "[mb5-run] replay done in ${wall_clock_s}s"
    echo "${wall_clock_s}" > "${rundir}/wall_clock_s.txt"
+    kill "${GPU_MON}" 2>/dev/null || true
+    printf '{"t_start_unix":%s,"t_end_unix":%s}\n' "${t0}" "${t1}" > "${rundir}/run_window.json"
+    cp -f "${replay_out}" "${rundir}/metrics.jsonl"   # bench_report.py expects metrics.jsonl

    # Per-instance prefix-cache counters, scraped from each backend BEFORE
    # teardown. For PD this is the only honest reuse signal: producer ports