Inherit incumbent topology for runtime validation

2026-04-30 09:33:49 +08:00
parent f59919e21c
commit 9e5394b557
3 changed files with 266 additions and 5 deletions


@@ -51,6 +51,7 @@ The active run is now seeded from the real run5 baseline and continues from `tri
- `trial-0003`: completed with no feasible point for `TP1/DP8/EP8`.
- `trial-0004`: completed with no feasible point for `max-num-seqs=160`.
- Important caveat: `trial-0004` did not actually validate `TP2/DP4/EP8 + max-num-seqs=160`. AITuner applies `config_patch` relative to the study base config, and the proposal only patched `max-num-seqs`. The actual launch therefore used the base topology `TP4/DP2/EP8 + max-num-seqs=160`, so this is not evidence that same-topology refinement around `trial-0002` is exhausted.
- `trial-0005`: corrected same-topology validation, `TP2/DP4/EP8 + max-num-seqs=160`; completed with no feasible point.
The `trial-0002` proposal matches the first useful topology direction from the earlier before-harness run, but the new harness-controlled run measured substantially better throughput for that topology.
@@ -61,20 +62,20 @@ Fig-18-style raw throughput table:
| Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | --- | --- | --- | --- | --- | --- | ---: | --- | --- | --- |
| before harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |
| harness request/s | 0.1267 | 0.3767 | infeasible | infeasible | not run | not run | not run | not run | not run | not run | not run | not run |
| harness request/s | 0.1267 | 0.3767 | infeasible | infeasible | infeasible | not run | not run | not run | not run | not run | not run | not run |
Per-GPU throughput table:
| Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | --- | --- | --- | --- | --- | --- | ---: | --- | --- | --- |
| before harness req/s/GPU | 0.0158 | 0.0306 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.0352 | infeasible | infeasible | infeasible |
| harness req/s/GPU | 0.0158 | 0.0471 | infeasible | infeasible | not run | not run | not run | not run | not run | not run | not run | not run |
| harness req/s/GPU | 0.0158 | 0.0471 | infeasible | infeasible | infeasible | not run | not run | not run | not run | not run | not run | not run |
Decision: the harness accelerated convergence on qwen235b decode-only, but this is not a proof of global optimality after one proposal. The before-harness run first reached its best observed throughput at iter 9 with 0.2817 request/s. The harness run exceeded that value at iter 2 with 0.3767 request/s, a 1.34x improvement over the before-harness 12-iter best and a 2.97x improvement over the baseline config.
The harness did not stop cleanly after finding the strong incumbent. It spent one additional trial on `TP1/DP8/EP8`, which found no feasible point. The next proposal intended same-topology runtime validation, but omitted the incumbent topology fields, so the materialized trial validated the base topology instead. So the performance convergence goal is met, but local optimality has not been fully proven yet.
The harness did not stop cleanly after finding the strong incumbent. It spent one additional trial on `TP1/DP8/EP8`, which found no feasible point. The next proposal intended same-topology runtime validation, but omitted the incumbent topology fields, so the materialized trial validated the base topology instead. This issue was corrected with `trial-0005`.
Important interpretation: `trial-0002` should be called the current best observed config, not "proven best". The harness got there quickly because the decode-only harness biases the first proposal toward the most relevant adjacent topology redistribution, `TP4/DP2/EP8 -> TP2/DP4/EP8`, instead of spending trials on prefill-oriented runtime knobs. Later iterations are still needed to validate local optimality by testing nearby topologies and same-topology runtime knobs.
Important interpretation: `trial-0002` should be called the current best observed config, not a global optimum proof. The harness got there quickly because the decode-only harness biases the first proposal toward the most relevant adjacent topology redistribution, `TP4/DP2/EP8 -> TP2/DP4/EP8`, instead of spending trials on prefill-oriented runtime knobs. Later validation now supports local optimality against the tested adjacent topology and the tested same-topology `max-num-seqs=160` runtime refinement.
Follow-up implementation after this result:
@@ -89,8 +90,38 @@ After the implementation fix, the previously rejected `proposal-0004` was resume
- Remote tmux: `aituner_qwen235b_decode_harness_validate_20260428`.
- Result: completed with no feasible point. This is useful negative evidence for the base topology plus `max-num-seqs=160`, but not for the `trial-0002` incumbent topology.
A second validation trial was then launched with the full incumbent topology in the patch:
- `trial-0005` config: `TP2/DP4/EP8 + max-num-seqs=160`.
- Search range: low `0.017028808593`, high `0.125`, tolerance `0.001`, max probes `6`.
- Result: completed with no feasible point; `trial-0002` remained the best trial.
- Probe outcomes:
| Probe sampling_u | Request/s | Pass rate | Feasible | Early-stop reason |
| ---: | ---: | ---: | --- | --- |
| 0.0710144 | 1.7800 | 0.2818 | no | `slo_pass_rate_unrecoverable` |
| 0.0440216 | 1.0900 | 0.1789 | no | `slo_pass_rate_unrecoverable` |
| 0.0305252 | 0.7050 | 0.3002 | no | `slo_pass_rate_unrecoverable` |
| 0.0237770 | 0.5417 | 0.4092 | no | `slo_pass_rate_unrecoverable` |
| 0.0204029 | 0.4533 | 0.4890 | no | `slo_pass_rate_unrecoverable` |
| 0.0187159 | 0.4117 | 0.5466 | no | `slo_pass_rate_unrecoverable` |
This directly answers the one-iter-to-best concern for this refinement: the harness did not stop after `trial-0002`; it ran a corrected same-topology validation, and every tested point above the incumbent search floor failed the 95% TPOT SLO. Therefore `max-num-seqs=160` does not falsify `trial-0002` as the current best.
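To make the feasibility reading of the table explicit, here is a minimal check over the probe rows above. It assumes the feasibility criterion is an SLO pass rate of at least 0.95, which is how the "95% TPOT SLO" wording is read here; the snippet is illustrative and is not harness code.

```python
# Probe rows copied from the trial-0005 table: (sampling_u, request/s, pass rate).
probes = [
    (0.0710144, 1.7800, 0.2818),
    (0.0440216, 1.0900, 0.1789),
    (0.0305252, 0.7050, 0.3002),
    (0.0237770, 0.5417, 0.4092),
    (0.0204029, 0.4533, 0.4890),
    (0.0187159, 0.4117, 0.5466),
]

SLO_PASS_TARGET = 0.95  # assumed feasibility threshold for the pass-rate column

feasible = [(u, rps) for u, rps, pass_rate in probes if pass_rate >= SLO_PASS_TARGET]
print(feasible)  # [] -> no trial-0005 probe is feasible, so trial-0002 stays the incumbent
```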
## Follow-up Fix
The seeded prompt exposed a generic diagnosis issue: if the best feasible probe had no latency failures, the harness could miss the prior infeasible probe that showed the real bottleneck at higher load. The harness now scans the probe sequence backward and uses the nearest probe with a non-trivial bottleneck before falling back to the best feasible probe. This keeps decode-only runs focused on `decode_tpot` after a feasible low-load point, without adding testcase thresholds.
A second generic diagnosis bug was fixed: non-SLO bookkeeping counts such as `probe_elapsed_s>...` are no longer misattributed to `ttft_prefill` when the TTFT/TPOT/request failure counts are all zero.
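A minimal sketch of the selection rule behind these two fixes, under assumed probe fields (`feasible`, `request_rate`, `bottleneck`, `failure_counts`) and an assumed bookkeeping prefix; the real harness types and names may differ.

```python
from dataclasses import dataclass, field

# Assumed bookkeeping counters that are not SLO failures (e.g. "probe_elapsed_s>...").
_NON_SLO_PREFIXES = ("probe_elapsed_s",)


@dataclass
class Probe:  # assumed shape of one probe record
    feasible: bool
    request_rate: float
    bottleneck: str | None = None  # e.g. "decode_tpot" or "ttft_prefill"
    failure_counts: dict[str, int] = field(default_factory=dict)


def _slo_failures(probe: Probe) -> dict[str, int]:
    # Drop bookkeeping counters so they cannot masquerade as an SLO bottleneck.
    return {
        key: count
        for key, count in probe.failure_counts.items()
        if count and not key.startswith(_NON_SLO_PREFIXES)
    }


def pick_diagnosis_probe(probes: list[Probe]) -> Probe | None:
    # Scan backward: prefer the most recent probe that exposed a real SLO
    # bottleneck (typically an infeasible higher-load probe), even if the best
    # feasible probe itself had no latency failures.
    for probe in reversed(probes):
        if probe.bottleneck and _slo_failures(probe):
            return probe
    # Otherwise fall back to the best feasible probe.
    feasible = [probe for probe in probes if probe.feasible]
    if feasible:
        return max(feasible, key=lambda probe: probe.request_rate)
    return None
```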
## Follow-up Fix, 2026-04-30
The base-relative patch issue is now guarded in code, not only in the LLM prompt. When `StudyStore.materialize_trial` sees a runtime/env-only proposal after a non-base incumbent has been found, it inherits the incumbent topology patch into the trial spec unless the proposal explicitly provides a topology. This keeps same-topology runtime validation on the actual incumbent while preserving the ability to test the base topology by stating it explicitly.
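For the notes, a worked illustration of the inheritance rule with plain dicts (not the actual `StudyStore` call), using this run's own numbers: only incumbent topology flags that differ from the study base are inherited, and the proposal's own keys win on conflict.

```python
base_flags = {"tensor-parallel-size": 4, "data-parallel-size": 2, "expert-parallel-size": 8}
incumbent_flag_patch = {"tensor-parallel-size": 2, "data-parallel-size": 4, "expert-parallel-size": 8}
proposal_flag_patch = {"max-num-seqs": 160}  # runtime-only proposal, no topology keys

topology_keys = {
    "tensor-parallel-size",
    "data-parallel-size",
    "expert-parallel-size",
    "enable-expert-parallel",
}

# Inherit only the topology flags where the incumbent differs from the base config.
inherited = {
    key: value
    for key, value in incumbent_flag_patch.items()
    if key in topology_keys and base_flags.get(key) != value
}
merged = {**inherited, **proposal_flag_patch}
print(merged)
# {'tensor-parallel-size': 2, 'data-parallel-size': 4, 'max-num-seqs': 160}
# expert-parallel-size is not inherited because it already matches the base config.
```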
Local verification: `PYTHONPATH=src python3 -m unittest discover -s tests` passed 68 tests.
## Current Harness Judgment
For qwen235b decode-only, the harness still accelerates convergence: before the harness, the best observed 12-iter result appeared at iter 9 with 0.2817 request/s; with the harness, iter 2 reached 0.3767 request/s, and later validation did not find a better adjacent or same-topology runtime point.
The remaining optimization target is validation cost, not convergence quality. `trial-0005` took a long time because early-stopped decode-only probes still had to wait for in-flight long-output requests to drain unless the engine was restarted after the early stop. Future harness/study templates for long decode-only validation should set, or automatically recommend, `trace.restart_engine_after_early_stop=true` when repeated SLO-unrecoverable probes are expected.
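A sketch of what that template override could look like, assuming the `trace` section of the study spec accepts a nested mapping for the dotted key above; the actual template schema is not shown in this commit, so treat this as illustrative only.

```python
# Hypothetical study-template fragment; the real schema may differ.
trace_overrides = {
    "trace": {
        # Restart the engine after an SLO-unrecoverable early stop so the next
        # probe does not wait for in-flight long-output requests to drain.
        "restart_engine_after_early_stop": True,
    }
}
```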


@@ -5,7 +5,15 @@ from dataclasses import replace
from pathlib import Path
from typing import Any
from .spec import Proposal, StudySpec, StudyState, TrialSpec, TrialSummary, to_jsonable
from .spec import ConfigPatch, Proposal, StudySpec, StudyState, TrialSpec, TrialSummary, to_jsonable
_TOPOLOGY_FLAG_KEYS = {
    "tensor-parallel-size",
    "data-parallel-size",
    "expert-parallel-size",
    "enable-expert-parallel",
}
class StudyStore:
@@ -65,6 +73,11 @@ class StudyStore:
        state: StudyState,
        proposal: Proposal,
    ) -> tuple[TrialSpec, StudyState]:
        proposal = _inherit_incumbent_topology_for_runtime_patch(
            study=study,
            state=state,
            proposal=proposal,
        )
        trial_id = f"trial-{state.next_trial_index:04d}"
        trial_root = self.study_root(study.study_id) / "trials" / trial_id
        trial_root.mkdir(parents=True, exist_ok=True)
@@ -225,6 +238,47 @@ def _parallel_size_for_proposal(*, study: StudySpec, proposal: Proposal) -> int:
    return _parallel_size_for_config(study=study, flag_patch=proposal.config_patch.flag_patch)
def _inherit_incumbent_topology_for_runtime_patch(
    *,
    study: StudySpec,
    state: StudyState,
    proposal: Proposal,
) -> Proposal:
    flag_patch = dict(proposal.config_patch.flag_patch)
    env_patch = dict(proposal.config_patch.env_patch)
    if not flag_patch and not env_patch:
        return proposal
    if _TOPOLOGY_FLAG_KEYS.intersection(flag_patch):
        return proposal
    if not state.best_trial_id:
        return proposal
    incumbent = next(
        (trial for trial in state.trials if trial.trial_id == state.best_trial_id),
        None,
    )
    if incumbent is None or not isinstance(incumbent.config_patch, dict):
        return proposal
    incumbent_patch = incumbent.config_patch.get("flag_patch")
    if not isinstance(incumbent_patch, dict):
        return proposal
    inherited_topology = {
        key: value
        for key, value in incumbent_patch.items()
        if key in _TOPOLOGY_FLAG_KEYS and study.engine.base_flags.get(key) != value
    }
    if not inherited_topology:
        return proposal
    merged_flag_patch = dict(inherited_topology)
    merged_flag_patch.update(flag_patch)
    return replace(
        proposal,
        config_patch=ConfigPatch(
            env_patch=env_patch,
            flag_patch=merged_flag_patch,
        ),
    )
def _parallel_size_for_trial_id(*, study: StudySpec, study_root: Path, trial_id: str) -> int | None:
    trial_spec_path = study_root / "trials" / trial_id / "trial_spec.json"
    if not trial_spec_path.exists():


@@ -1614,6 +1614,182 @@ class CoreFlowTests(unittest.TestCase):
            trial, _ = store.materialize_trial(study=study, state=state, proposal=proposal)
            self.assertEqual(trial.search.low, study.search.low)
    def test_materialize_trial_inherits_incumbent_topology_for_runtime_patch(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            tmp_path = Path(tmp)
            study_path = _write_study_assets(
                tmp_path,
                engine_overrides={
                    "base_flags": {
                        "host": "127.0.0.1",
                        "port": 8000,
                        "enable-expert-parallel": True,
                        "tensor-parallel-size": 4,
                        "data-parallel-size": 2,
                        "expert-parallel-size": 8,
                    },
                    "tunable_flags": [
                        "tensor-parallel-size",
                        "data-parallel-size",
                        "expert-parallel-size",
                        "max-num-seqs",
                    ],
                    "topology_constraints": {
                        "require_tp_dp_product_equals_gpu_count": True,
                        "require_ep_size_leq_tp_dp_product": True,
                        "require_ep_size_divides_tp_dp_product": True,
                        "allowed_tensor_parallel_sizes": [1, 2, 4, 8],
                        "allowed_data_parallel_sizes": [1, 2, 4, 8],
                        "allowed_expert_parallel_sizes": [1, 2, 4, 8],
                    },
                },
            )
            study = load_study_spec(study_path)
            store = StudyStore(tmp_path / ".aituner" / "studies")
            store.init_study(spec_path=study_path, study=study)
            state = StudyState(
                study_id=study.study_id,
                best_trial_id="trial-0002",
                best_parallel_size=8,
                best_sampling_u=0.125,
                best_request_rate=3.0,
                best_request_rate_per_gpu=0.375,
                next_trial_index=3,
                best_by_parallel_size={
                    "8": {
                        "trial_id": "trial-0002",
                        "parallel_size": 8,
                        "best_sampling_u": 0.125,
                        "best_request_rate": 3.0,
                        "best_request_rate_per_gpu": 0.375,
                    }
                },
                trials=[
                    TrialSummary(
                        trial_id="trial-0002",
                        status="completed",
                        parallel_size=8,
                        best_sampling_u=0.125,
                        best_request_rate=3.0,
                        best_request_rate_per_gpu=0.375,
                        config_patch={
                            "env_patch": {},
                            "flag_patch": {
                                "tensor-parallel-size": 2,
                                "data-parallel-size": 4,
                                "expert-parallel-size": 8,
                            },
                        },
                    )
                ],
            )
            proposal = Proposal.from_dict(
                {
                    "observation": "Validate runtime headroom around the incumbent.",
                    "diagnosis": "Try lower concurrency on the current best topology.",
                    "config_patch": {"env_patch": {}, "flag_patch": {"max-num-seqs": 160}},
                    "expected_effects": ["validate incumbent runtime headroom"],
                }
            )
            trial, next_state = store.materialize_trial(study=study, state=state, proposal=proposal)
            self.assertEqual(
                trial.config_patch.flag_patch,
                {
                    "tensor-parallel-size": 2,
                    "data-parallel-size": 4,
                    "max-num-seqs": 160,
                },
            )
            self.assertEqual(trial.search.low, 0.125)
            self.assertEqual(
                next_state.trials[-1].config_patch["flag_patch"],
                {
                    "tensor-parallel-size": 2,
                    "data-parallel-size": 4,
                    "max-num-seqs": 160,
                },
            )
    def test_materialize_trial_keeps_explicit_topology_runtime_patch(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            tmp_path = Path(tmp)
            study_path = _write_study_assets(
                tmp_path,
                engine_overrides={
                    "base_flags": {
                        "host": "127.0.0.1",
                        "port": 8000,
                        "enable-expert-parallel": True,
                        "tensor-parallel-size": 4,
                        "data-parallel-size": 2,
                        "expert-parallel-size": 8,
                    },
                    "tunable_flags": [
                        "tensor-parallel-size",
                        "data-parallel-size",
                        "expert-parallel-size",
                        "max-num-seqs",
                    ],
                    "topology_constraints": {
                        "require_tp_dp_product_equals_gpu_count": True,
                        "require_ep_size_leq_tp_dp_product": True,
                        "require_ep_size_divides_tp_dp_product": True,
                        "allowed_tensor_parallel_sizes": [1, 2, 4, 8],
                        "allowed_data_parallel_sizes": [1, 2, 4, 8],
                        "allowed_expert_parallel_sizes": [1, 2, 4, 8],
                    },
                },
            )
            study = load_study_spec(study_path)
            store = StudyStore(tmp_path / ".aituner" / "studies")
            store.init_study(spec_path=study_path, study=study)
            state = StudyState(
                study_id=study.study_id,
                best_trial_id="trial-0002",
                next_trial_index=3,
                trials=[
                    TrialSummary(
                        trial_id="trial-0002",
                        status="completed",
                        config_patch={
                            "env_patch": {},
                            "flag_patch": {
                                "tensor-parallel-size": 2,
                                "data-parallel-size": 4,
                            },
                        },
                    )
                ],
            )
            proposal = Proposal.from_dict(
                {
                    "observation": "Validate base topology runtime.",
                    "diagnosis": "Explicitly keep base topology and adjust concurrency.",
                    "config_patch": {
                        "env_patch": {},
                        "flag_patch": {
                            "tensor-parallel-size": 4,
                            "data-parallel-size": 2,
                            "max-num-seqs": 160,
                        },
                    },
                    "expected_effects": ["test base topology runtime headroom"],
                }
            )
            trial, _ = store.materialize_trial(study=study, state=state, proposal=proposal)
            self.assertEqual(
                trial.config_patch.flag_patch,
                {
                    "tensor-parallel-size": 4,
                    "data-parallel-size": 2,
                    "max-num-seqs": 160,
                },
            )
    def test_ingest_trial_results_records_failure_reason(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            tmp_path = Path(tmp)