From fb6d74a18ce373f7994d4a15da1bb82a77c65632 Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Tue, 12 May 2026 22:23:12 +0800
Subject: [PATCH] Document harness v2 rerun criteria

---
 ...-driven-harness-implementation-20260512.md | 23 +++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/docs/harness-ablation/profile-driven-harness-implementation-20260512.md b/docs/harness-ablation/profile-driven-harness-implementation-20260512.md
index 12b886b..661ab9b 100644
--- a/docs/harness-ablation/profile-driven-harness-implementation-20260512.md
+++ b/docs/harness-ablation/profile-driven-harness-implementation-20260512.md
@@ -81,3 +81,26 @@ Started on `dash0` (`11.73.2.172`) at commit `17e9681`.
 - log: `.aituner-tight/logs/qwen27b-profileplanner-harness-20260512.log`
 - status at launch check: `trial-0001` baseline is running under AITuner; no manual intervention in the tuning loop.
 
+## V1 Early Stop
+
+The first profile-planner run was stopped before accepting it as evidence. A read-only replay of its completed baseline probe history showed that the planner would choose `max-num-seqs=64` for iter2. That was a diagnosis bug:
+
+- `slo_pass_rate_unrecoverable` is a binary-search early-stop summary, not a bottleneck class.
+- The harness was counting that summary as an admission/queueing failure.
+- Because this count dominated the real TTFT/TPOT failure counts, the planner selected a concurrency action instead of testing TP.
+
+Fix commit: `e3ed775`.
+
+The fix excludes `slo_pass_rate_unrecoverable` from the admission/queueing failure bucket. With the same baseline probe history, the planner now ranks `ttft_prefill` first and proposes `tensor-parallel-size=2` for iter2.
+
+## V2 Experiment Started
+
+Started on `dash0` (`11.73.2.172`) at commit `e3ed775`.
+
+- tmux session: `qwen27b-profileplanner-v2-harness-20260512`
+- spec: `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-v2-20260512.json`
+- study id: `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-v2-20260512`
+- log: `.aituner-tight/logs/qwen27b-profileplanner-v2-harness-20260512.log`
+- monitor: read-only subagent `Wegener`
+
+Acceptance for this run is based on end-to-end trial results, not unit tests. If the first four trials lag the min-prompt no-harness baseline (`0.0650`, `0.1992`, `0.2696`, then failed/NA), the run should be treated as a failed harness iteration and the harness should be optimized again.
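
Reviewer note on the V1 diagnosis bug: the sketch below illustrates the bucketing mistake and the effect of excluding `slo_pass_rate_unrecoverable`. It is not the harness code from `e3ed775`. The function `rank_bottlenecks`, the reason names `ttft_exceeded`, `tpot_exceeded`, and `queue_timeout`, the bucket names `tpot_decode` and `admission_queueing`, and all counts are hypothetical and chosen only for illustration. Only `slo_pass_rate_unrecoverable` and `ttft_prefill` are names taken from the notes above.

```python
from collections import Counter

# Hypothetical mapping from per-request failure reasons to bottleneck classes.
# Only `ttft_prefill` is a class name taken from the notes; the rest are assumed.
BOTTLENECK_OF = {
    "ttft_exceeded": "ttft_prefill",
    "tpot_exceeded": "tpot_decode",
    "queue_timeout": "admission_queueing",
}

# Probe-level summaries: they record why the binary search stopped,
# not which resource was the bottleneck.
SUMMARY_REASONS = {"slo_pass_rate_unrecoverable"}


def rank_bottlenecks(failure_counts, exclude_summaries=True):
    """Return bottleneck classes ordered by failure count, largest first."""
    buckets = Counter()
    for reason, count in failure_counts.items():
        if reason in SUMMARY_REASONS:
            if not exclude_summaries:
                # V1-style behavior: the summary is folded into admission/queueing.
                buckets["admission_queueing"] += count
            continue
        bucket = BOTTLENECK_OF.get(reason)
        if bucket is not None:
            buckets[bucket] += count
    return [name for name, _ in buckets.most_common()]


# Made-up counts for one baseline probe history (not measured data).
history = {
    "slo_pass_rate_unrecoverable": 40,
    "ttft_exceeded": 12,
    "tpot_exceeded": 5,
    "queue_timeout": 1,
}

v1_ranking = rank_bottlenecks(history, exclude_summaries=False)
v2_ranking = rank_bottlenecks(history, exclude_summaries=True)
```

With these illustrative counts, the V1-style ranking puts `admission_queueing` first because the summary count swamps the TTFT/TPOT counts, while the V2-style ranking puts `ttft_prefill` first. This matches the change in planner behavior described in the notes, under the stated assumptions.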