From f59919e21cc0feb8df85f9a461b72cb165e3af6d Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Thu, 30 Apr 2026 06:52:09 +0800
Subject: [PATCH] Clarify base-relative validation patches

---
 docs/qwen235b-thinking-decode/harness-20260428.md | 43 +++++++++++++++++++++++++++++++++++++++++++------
 src/aituner/harness.py                            |  1 +
 tests/test_core_flow.py                           |  4 ++++
 3 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/docs/qwen235b-thinking-decode/harness-20260428.md b/docs/qwen235b-thinking-decode/harness-20260428.md
index ab997fc..967f07e 100644
--- a/docs/qwen235b-thinking-decode/harness-20260428.md
+++ b/docs/qwen235b-thinking-decode/harness-20260428.md
@@ -49,7 +49,23 @@ The active run is now seeded from the real run5 baseline and continues from `tri
 
 - `proposal-0002`: legal adjacent decode topology move from `TP4/DP2/EP8` to `TP2/DP4/EP8`; no EP-size search and no testcase threshold.
 - `trial-0002`: completed, 0.3767 request/s, 0.0471 request/s/GPU, pass rate 0.9779.
 - `trial-0003`: completed with no feasible point for `TP1/DP8/EP8`.
-- `proposal-0004`: generated a plausible same-topology `max-num-seqs=160` follow-up, but the raw JSON used an object for `observation`; schema validation rejected it and the tuning CLI exited before materializing `trial-0004`.
+- `trial-0004`: completed with no feasible point for `max-num-seqs=160`.
+- Important caveat: `trial-0004` did not actually validate `TP2/DP4/EP8 + max-num-seqs=160`. AITuner applies `config_patch` relative to the study base config, and the proposal only patched `max-num-seqs`, so the launch ran the base topology `TP4/DP2/EP8` with `max-num-seqs=160`. This is not evidence that same-topology refinement around `trial-0002` is exhausted; see the sketch after this list.
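+
+To make the base-relative semantics concrete, here is a minimal illustrative sketch; `apply_config_patch`, the key names, and the values are stand-ins rather than the actual AITuner API:
+
+```python
+def apply_config_patch(base: dict, patch: dict) -> dict:
+    """Toy model of the merge: patch keys override the study base config."""
+    merged = dict(base)
+    merged.update(patch)  # keys absent from the patch silently keep their base values
+    return merged
+
+base = {"tp": 4, "dp": 2, "ep": 8, "max_num_seqs": 256}  # stand-in for the TP4/DP2/EP8 study base
+proposal_0004 = {"max_num_seqs": 160}  # topology fields omitted
+launched = apply_config_patch(base, proposal_0004)
+assert launched["tp"] == 4  # launches TP4/DP2/EP8, not the TP2/DP4/EP8 incumbent
+```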
 
 The `trial-0002` proposal matches the first useful topology direction from the earlier before-harness run, but the new harness-controlled run measured substantially better throughput for that topology.
@@ -60,18 +61,18 @@
 
 Fig-18-style raw throughput table:
 
 | Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
 | --- | ---: | ---: | --- | --- | --- | --- | --- | --- | ---: | --- | --- | --- |
 | before harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |
-| harness request/s | 0.1267 | 0.3767 | infeasible | not run | not run | not run | not run | not run | not run | not run | not run | not run |
+| harness request/s | 0.1267 | 0.3767 | infeasible | infeasible | not run | not run | not run | not run | not run | not run | not run | not run |
 
 Per-GPU throughput table:
 
 | Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
 | --- | ---: | ---: | --- | --- | --- | --- | --- | --- | ---: | --- | --- | --- |
 | before harness req/s/GPU | 0.0158 | 0.0306 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.0352 | infeasible | infeasible | infeasible |
-| harness req/s/GPU | 0.0158 | 0.0471 | infeasible | not run | not run | not run | not run | not run | not run | not run | not run | not run |
+| harness req/s/GPU | 0.0158 | 0.0471 | infeasible | infeasible | not run | not run | not run | not run | not run | not run | not run | not run |
 
 Decision: the harness accelerated convergence on qwen235b decode-only, but this is not a proof of global optimality after one proposal. The before-harness run first reached its best observed throughput at iter 9 with 0.2817 request/s. The harness run exceeded that value at iter 2 with 0.3767 request/s, a 1.34x improvement over the before-harness 12-iter best and a 2.97x improvement over the baseline config.
-The harness did not stop cleanly after finding the strong incumbent. It spent one additional trial on `TP1/DP8/EP8`, which found no feasible point, and then the next LLM proposal failed schema validation before trial materialization. So the performance convergence goal is met, but the tuning loop should be hardened so a strong incumbent causes deterministic stop or a schema-repair retry rather than relying only on prompt instructions.
+The harness did not stop cleanly after finding the strong incumbent. It spent one additional trial on `TP1/DP8/EP8`, which found no feasible point. The next proposal intended same-topology runtime validation but omitted the incumbent topology fields, so the materialized trial validated the base topology instead. The performance convergence goal is therefore met, but local optimality has not yet been proven.
 
 Important interpretation: `trial-0002` should be called the current best observed config, not "proven best". The harness got there quickly because the decode-only harness biases the first proposal toward the most relevant adjacent topology redistribution, `TP4/DP2/EP8 -> TP2/DP4/EP8`, instead of spending trials on prefill-oriented runtime knobs. Later iterations are still needed to validate local optimality by testing nearby topologies and same-topology runtime knobs.
 
 Follow-up implementation after this result:
@@ -84,9 +85,24 @@
 
 After the implementation fix, the previously rejected `proposal-0004` was resumed as a validation trial:
 
-- `trial-0004`: same topology validation with `max-num-seqs=160`.
+- `trial-0004`: intended same-topology validation with `max-num-seqs=160`, but it actually ran on the base topology because the proposal omitted `TP2/DP4/EP8`.
 - Remote tmux: `aituner_qwen235b_decode_harness_validate_20260428`.
-- Status as of 2026-04-28 13:20 UTC on dash0: running; no result has been written yet.
+- Result: completed with no feasible point. This is useful negative evidence for the base topology plus `max-num-seqs=160`, but not for the `trial-0002` incumbent topology; a mechanical guard is sketched below.
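+
+One way to harden this mechanically, instead of relying on prompt rules alone, would be to rebase every validation patch onto the incumbent before launch. A minimal sketch, assuming dict-shaped configs; `rebase_validation_patch` is a hypothetical helper, not current AITuner code:
+
+```python
+def rebase_validation_patch(base: dict, incumbent: dict, patch: dict) -> dict:
+    """Expand an incumbent-relative patch into the base-relative form the launcher expects."""
+    rebased = {k: v for k, v in incumbent.items() if base.get(k) != v}  # carry incumbent deltas
+    rebased.update(patch)  # then layer on the runtime knob under test
+    return rebased
+
+base = {"tp": 4, "dp": 2, "ep": 8}
+incumbent = {"tp": 2, "dp": 4, "ep": 8}  # resolved trial-0002 topology
+patch = rebase_validation_patch(base, incumbent, {"max_num_seqs": 160})
+assert patch == {"tp": 2, "dp": 4, "max_num_seqs": 160}  # topology survives the rebase
+```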
## Follow-up Fix diff --git a/src/aituner/harness.py b/src/aituner/harness.py index 100bbda..10059f9 100644 --- a/src/aituner/harness.py +++ b/src/aituner/harness.py @@ -604,6 +604,7 @@ def _proposal_rules() -> list[str]: "For decode_tpot bottlenecks, prefer legal TP/DP topology redistribution before runtime-only knobs, then tune max-num-seqs or max-num-batched-tokens only from observed SLO headroom/failures.", "Pick at most one primary knob family from knob_harnesses unless the history proves a coupled change is needed.", "Use adjacent legal values around the incumbent; avoid broad exploratory jumps.", + "config_patch is applied to the study base config, not as a delta on the incumbent; when validating an incumbent topology, include all topology flags needed to reconstruct that incumbent plus the runtime knob under test.", "When strong_incumbent.guard_active is true, treat the run as a validation phase, not as proof of optimum: probe only adjacent topology or same-topology runtime knobs that can falsify the incumbent.", "Do not stop solely because a strong incumbent appeared; stop only after nearby validation probes are exhausted, blocked by infeasible evidence, or repeatedly remain near the incumbent.", "During validation, prefer probes that answer whether the incumbent is locally optimal over probes that merely chase another large gain.", diff --git a/tests/test_core_flow.py b/tests/test_core_flow.py index 5c996fa..2e8d5bd 100644 --- a/tests/test_core_flow.py +++ b/tests/test_core_flow.py @@ -536,6 +536,10 @@ class CoreFlowTests(unittest.TestCase): "For decode_only studies, ignore TTFT", "\n".join(context["proposal_rules"]), ) + self.assertIn( + "config_patch is applied to the study base config", + "\n".join(context["proposal_rules"]), + ) def test_harness_uses_prior_infeasible_probe_for_active_bottleneck(self) -> None: with tempfile.TemporaryDirectory() as tmp: