MB5 PD reuse-centric ablation: tooling, data, Fig 1-3

Three-axis controlled ablation of PD-colo vs PD-disagg on synthetic regular
traces (closed-loop, controlled reuse via REPLAY_NO_REALIZED_PREFIX) on the
clean stack (e13391e gated off).

  Axis 1 (Fig 1) -- reuse 6%->94% at N=8, in8192/out256
  Axis 2 (Fig 2) -- shape in2048/out2048 -> in32768/out64 at N=8, reuse~70%
  Axis 3 (Fig 3) -- concurrency N=8/16/32/64 at reuse~71%, in8192/out256

Findings:
  * APC parity colo=PD at every reuse (5.5/22/44/66/77/82%) -- contamination
    fix validated.
  * PD edge erodes 1.57x->1.10x with reuse; prefill GPUs strand 26%->9%.
  * Shape: PD-best peaks mid-sweep (1.34x at in8192/out512); wrong PD ratio
    catastrophic at prefill extreme (in32768/out64 pd2 = 378/400, p99 432s).
  * Concurrency: PD wins N<=32 (1.23-1.29x), tips at N=64 -- pd2/pd4 crater
    (APC 71%->1.4%, TPS -30%) while colo scales cleanly.

Infrastructure:
  * replayer: --max-inflight-sessions, --inter-turn-think,
    --no-realized-prefix (env-defaulted via REPLAY_MAX_INFLIGHT,
    REPLAY_INTER_TURN_THINK_S, REPLAY_NO_REALIZED_PREFIX).
  * mb5_run.sh: writes bench_config.json + gpu_util.csv + run_window.json +
    instance_apc.txt + metrics.jsonl for bench_report/fig_agg ingest.
  * fig_agg.py: per-arm GPU role split + producer-side APC; --json mode.
  * gpu_util_report.py: companion per-GPU util report from gpu_util.csv.
  * partial_summary.py: stats from in-flight replay_metrics.jsonl (works
    before metrics.summary.json exists).

Data: analysis/mb5_pd_ablation/fig{1,2,3}.json (24 + 20 + 16 rows).
Figures: figs/mb5_pd_ablation/fig{1_reuse,2_shape,3_concurrency}_axis.png.
2026-05-31 20:14:46 +08:00
parent a2111b6e18
commit fafc44da79
12 changed files with 389 additions and 9 deletions
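Note on the env-defaulted replayer flags listed under Infrastructure: for a boolean switch such as --no-realized-prefix, bool() of any non-empty string is True, so a plain truthiness check would treat REPLAY_NO_REALIZED_PREFIX=0 as enabled. A minimal sketch of a stricter parse follows. The env_flag helper, the _FALSY set, and build_parser are hypothetical names used for illustration and are not part of this commit.

```python
import argparse
import os
from typing import Mapping

# Hypothetical set of spellings treated as "off"; not taken from the commit.
_FALSY = {"", "0", "false", "no", "off"}


def env_flag(name: str, environ: Mapping[str, str] = os.environ) -> bool:
    """Return False when the variable is unset or holds a falsy spelling."""
    return environ.get(name, "").strip().lower() not in _FALSY


def build_parser(environ: Mapping[str, str] = os.environ) -> argparse.ArgumentParser:
    """Build a parser whose boolean flag defaults from the environment."""
    p = argparse.ArgumentParser()
    # store_true can only switch the option on, so the env default is the
    # only way to get True without passing the flag on the command line.
    p.add_argument("--no-realized-prefix", action="store_true",
                   default=env_flag("REPLAY_NO_REALIZED_PREFIX", environ))
    return p


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.no_realized_prefix)
```

With this shape, REPLAY_NO_REALIZED_PREFIX=0 leaves the flag off, any other non-empty value turns it on, and passing --no-realized-prefix explicitly always turns it on.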
--- a/replayer/main.py
+++ b/replayer/main.py
@@ -30,12 +30,23 @@ def main() -> None:
                   default=float(_env_think) if _env_think else None,
                   help="Closed-loop think-time (s) after each turn completes; "
                        "ignore absolute trace schedule. Env: REPLAY_INTER_TURN_THINK_S")
+    p.add_argument("--no-realized-prefix",
+                   action="store_true",
+                   default=os.environ.get("REPLAY_NO_REALIZED_PREFIX", "").strip().lower() not in ("", "0", "false", "no", "off"),
+                   help="Controlled-reuse mode: prompt = hash-built tokens only "
+                        "(reuse set by hash_ids). Env: REPLAY_NO_REALIZED_PREFIX")
    p.add_argument("--dispatch-mode", choices=["tracets", "thinktime"],
                   default=os.environ.get("REPLAY_DISPATCH_MODE", "tracets"),
                   help="tracets (Mode 1): absolute trace ts = max(prev_finished, ts). "
                        "thinktime (Mode 2): turn-k at prev_finished + "
                        "time_to_parent_chat. Env: REPLAY_DISPATCH_MODE")
    p.add_argument("--request-timeout", type=float, default=600.0)
+    _env_maxdur = os.environ.get("REPLAY_MAX_DURATION")
+    p.add_argument("--max-duration", type=float,
+                   default=float(_env_maxdur) if _env_maxdur else None,
+                   help="Overall wall-clock deadline (s): cancel in-flight + write "
+                        "summary (un-run turns counted as failures) to bound a "
+                        "collapsed config's drain. Env: REPLAY_MAX_DURATION")
    p.add_argument("--request-limit", type=int, default=None,
                   help="Limit number of requests to replay")
    p.add_argument("-v", "--verbose", action="store_true")
@@ -56,7 +67,9 @@ def main() -> None:
        request_limit=args.request_limit,
        max_inflight_sessions=args.max_inflight_sessions,
        inter_turn_think_s=args.inter_turn_think,
+        no_realized_prefix=args.no_realized_prefix,
        dispatch_mode=args.dispatch_mode,
+        max_duration_s=args.max_duration,
    )

    results = asyncio.run(replay_trace(config))