Add infeasible plateau guard to harness

2026-04-25 18:47:32 +08:00
parent 6c04b9dbbc
commit 6bac389aae
3 changed files with 324 additions and 9 deletions


@@ -24,6 +24,8 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
- Adds TP, max-num-seqs, max-num-batched-tokens, chunked-prefill, and memory-utilization harnesses when those knobs are tunable.
- Extracts compact recent trial diagnostics from result JSON files.
- Adds a convergence guard based on recent completed trial performance.
- Adds an infeasible-progress guard: when recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT after changing one knob family, the next proposal must switch primary family or stop.
- Classifies `slo_pass_rate_unrecoverable` by latency failure counts first, and ignores probe-budget markers such as `probe_elapsed_s>` for bottleneck voting, so TTFT-heavy failures stay aligned to prefill/TP or batching harnesses instead of being treated as generic queueing. A minimal sketch of the voting order follows this list.
- Extended `src/aituner/trace.py`.
- `summarize_window` now reports L-C-A features.
- `TraceRequest` now carries optional metadata for `hash_ids`, turn, parent chat id, and trace type.
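A minimal sketch of that voting order, using the failure counts from the new unit test further down (the dict literal is illustrative data; the real logic is `_active_bottleneck` in the diff below):

```python
failed = {
    "ttft_ms>5000.0": 70,          # latency failure: counted
    "tpot_ms>50.0": 5,             # latency failure: counted
    "probe_elapsed_s>240.0": 100,  # probe-budget marker: ignored for voting
}
ttft = sum(v for k, v in failed.items() if k.startswith("ttft"))
tpot = sum(v for k, v in failed.items() if k.startswith("tpot"))
other = sum(
    v for k, v in failed.items()
    if not k.startswith(("ttft", "tpot", "probe_elapsed_s>"))
)
# ttft=70 >= max(tpot=5, other=0), so the trial votes "ttft_prefill"
# even though early_stop_reason is "slo_pass_rate_unrecoverable".
```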
@@ -43,7 +45,7 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
## Local Verification
- `python3 -m compileall -q src tests`: passed.
- `PYTHONPATH=src python3 -m unittest tests.test_core_flow`: passed, 62 tests.
- `pytest -q` and `python3 -m pytest -q`: not runnable locally because `pytest` is not installed.

## Remote Experiment Log
@@ -82,10 +84,39 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
- This is not aligned with the paper's agentic loop, which evaluates the initial configuration first and then searches from measured feedback.
- Action: update `study tune` so LLM-driven studies automatically materialize a baseline empty-patch trial first, unless `--skip-baseline` is passed. This should reduce early bad proposals because the first LLM edit will see real baseline bottleneck diagnostics and an incumbent request_rate_per_gpu.
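A rough sketch of the intended control flow (hypothetical helper and attribute names; only `--skip-baseline` and the empty-patch shape come from this log):

```python
def maybe_enqueue_baseline(study, skip_baseline: bool) -> None:
    # Hypothetical sketch: names other than --skip-baseline are assumptions.
    if skip_baseline or study.trials:
        return
    # An empty patch measures the unmodified engine config first, so the
    # first LLM proposal sees real bottleneck diagnostics and an incumbent.
    study.enqueue_trial(config_patch={"env_patch": {}, "flag_patch": {}})
```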
### 2026-04-25 17:20-18:30 CST
- r3 started with baseline-first enabled, but the full 0-8k run was too slow for fast iteration with raw chat completions. Stopped it before using it as a convergence signal.
- A fast validation using `max_requests_per_probe=160` was invalid: the trace is downsampled before threshold selection, so lower thresholds can end up with `request_count=0` (back-of-envelope after this list). Do not use that result for performance claims.
- Prefill smoke v1 used `completion_tokens_override=1` but kept the TPOT SLO. That made missing-TPOT failures dominate, so it was useful only for checking control flow, not for performance.
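Back-of-envelope for the `request_count=0` failure mode, assuming the threshold acts as a sampling fraction over the already-capped trace (my reading of the bullet above, not verified against the sampler):

```python
max_requests_per_probe = 160
threshold = 0.0078125                      # 1/128, the lowest threshold in smoke v2
print(max_requests_per_probe * threshold)  # 1.25 expected requests
# Integer selection over ~1 expected row easily rounds down to zero,
# so low-threshold probes on the capped trace can report request_count=0.
```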
### 2026-04-25 18:30-20:10 CST
- Prefill smoke v2 used real dash0 internal vLLM, Qwen3.5-27B, the real 0-8k prompt distribution and arrivals, `completion_tokens_override=1`, and `tpot_rule=null`.
- Trial 0001 baseline TP1/DP1:
- sampling `0.0078125`: pass rate 0.270, mean TTFT 2033.9 ms, p95 TTFT 5656.7 ms, p99 TTFT 6832.8 ms.
- Trial 0002 TP1/DP2:
- sampling `0.0078125`: pass rate 0.277, mean TTFT 1766.9 ms, p95 TTFT 4215.3 ms, p99 TTFT 5801.7 ms.
- Trial 0003 TP1/DP4:
- sampling `0.0078125`: pass rate 0.345, mean TTFT 1668.9 ms, p95 TTFT 3818.4 ms, p99 TTFT 5804.9 ms.
- Trial 0004 TP1/DP8:
- sampling `0.0078125`: pass rate 0.345, mean TTFT 1675.7 ms, p95 TTFT 3823.4 ms.
- Interpretation:
- The harness improved directionality: after the measured baseline, proposals followed a consistent scale-out path and avoided random runtime-knob churn.
- The smoke result improved p95 TTFT by about 32% versus baseline at the low sampling threshold and improved pass rate from 0.270 to 0.345 within 3-4 trials (recomputed in the sketch after this list).
- It did not reach the 95% pass-rate SLO in this smoke setting, so this is not a full proof of convergence to a good production config.
- DP8 did not improve over DP4, which exposed a gap: when every trial is infeasible, the prior convergence guard had no feasible incumbent and could not detect a plateau.
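Recomputing the headline numbers from the trial list above:

```python
baseline_p95, dp4_p95 = 5656.7, 3818.4          # trial 0001 vs 0003 at threshold 0.0078125
print((baseline_p95 - dp4_p95) / baseline_p95)  # ~0.325 -> the "about 32%" p95 TTFT gain
print(0.345 - 0.270)                            # +0.075 absolute pass-rate gain
```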
### 2026-04-25 20:10 CST
- Added the all-infeasible plateau guard described above.
- Added unit coverage for:
- TTFT failure classification under `slo_pass_rate_unrecoverable`;
- blocking a repeat of the DP family after DP4 and DP8 show no material improvement at the same sampling threshold.
- Current status: the harness now has the mechanism needed to avoid continuing the exact DP-only direction seen in the smoke v2 plateau. The next real experiment should either switch to a bottleneck-justified mixed TP/DP candidate or return `should_stop=true`.
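Applying the new guard's constants to the DP4 -> DP8 pair from smoke v2 shows why it fires (the `0.01` and `-0.02` cutoffs are the literals in `_infeasible_progress_guard` below):

```python
pass_rate_delta = 0.345 - 0.345               # DP8 vs DP4 at threshold 0.0078125
p95_delta_ratio = (3823.4 - 3818.4) / 3818.4  # ~ +0.0013
plateau = abs(pass_rate_delta) <= 0.01 and p95_delta_ratio >= -0.02
print(plateau)  # True: only data-parallel-size changed, so that family is blocked
```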
Remaining next steps:
1. Push/pull the plateau-guard commit to `dash0`.
2. Re-run the remote unit suite.
3. Start the next real tuning run only after deciding whether to spend a full multi-hour run on the production SLO or a shorter prefill-only confirmation of the new plateau guard.


@@ -267,20 +267,26 @@ def _active_bottleneck(probe: dict[str, Any] | None) -> str:
    if not compact:
        return "unknown"
    early_stop_reason = str(compact.get("early_stop_reason") or "")
    if early_stop_reason.startswith("arrival_lag_s>"):
        return "admission_or_queueing"
    latency_summary = compact.get("latency_summary")
    if not isinstance(latency_summary, dict):
        if early_stop_reason == "slo_pass_rate_unrecoverable":
            return "admission_or_queueing"
        return "unknown"
    failed = latency_summary.get("failed_reason_counts")
    if not isinstance(failed, dict) or not failed:
        if early_stop_reason == "slo_pass_rate_unrecoverable":
            return "admission_or_queueing"
        return "none_obvious"
    ttft_count = sum(int(v) for k, v in failed.items() if str(k).startswith("ttft"))
    tpot_count = sum(int(v) for k, v in failed.items() if str(k).startswith("tpot"))
    request_failed_count = sum(
        int(v)
        for k, v in failed.items()
        if not str(k).startswith("ttft")
        and not str(k).startswith("tpot")
        and not str(k).startswith("probe_elapsed_s>")
    )
    if ttft_count >= max(tpot_count, request_failed_count):
        return "ttft_prefill"
@@ -301,6 +307,7 @@ def _convergence_guard(
    state: StudyState,
    recent_diagnostics: list[dict[str, Any]],
) -> dict[str, Any]:
    infeasible_progress = _infeasible_progress_guard(recent_diagnostics)
    completed = [
        item
        for item in recent_diagnostics
@@ -320,9 +327,14 @@ def _convergence_guard(
reason = "recent_trials_still_show_material_variation" reason = "recent_trials_still_show_material_variation"
elif state.best_request_rate_per_gpu is not None: elif state.best_request_rate_per_gpu is not None:
reason = "need_more_evidence_before_stop" reason = "need_more_evidence_before_stop"
if not should_stop and infeasible_progress["plateau_detected"]:
reason = str(infeasible_progress["reason"])
return { return {
"should_stop_if_no_harness_can_justify_a_new_adjacent_probe": should_stop, "should_stop_if_no_harness_can_justify_a_new_adjacent_probe": (
should_stop or bool(infeasible_progress["stop_if_next_probe_repeats_family"])
),
"reason": reason, "reason": reason,
"infeasible_progress": infeasible_progress,
"incumbent": { "incumbent": {
"trial_id": state.best_trial_id, "trial_id": state.best_trial_id,
"parallel_size": state.best_parallel_size, "parallel_size": state.best_parallel_size,
@@ -333,11 +345,136 @@ def _convergence_guard(
    }


def _infeasible_progress_guard(recent_diagnostics: list[dict[str, Any]]) -> dict[str, Any]:
    points = [
        point
        for item in recent_diagnostics
        if (point := _infeasible_progress_point(item)) is not None
    ]
    default = {
        "plateau_detected": False,
        "stop_if_next_probe_repeats_family": False,
        "reason": "insufficient_infeasible_history",
        "blocked_primary_family": None,
        "recommended_next_action": "continue_adjacent_probe_if_a_harness_justifies_it",
        "comparison": None,
    }
    if len(points) < 2:
        return default
    prev, latest = points[-2], points[-1]
    if not _same_threshold(prev["threshold"], latest["threshold"]):
        return {
            **default,
            "reason": "recent_infeasible_trials_used_different_sampling_thresholds",
        }
    family = _changed_primary_family(prev["config_patch"], latest["config_patch"])
    pass_rate_delta = latest["pass_rate"] - prev["pass_rate"]
    p95_delta_ratio = _relative_delta(latest["ttft_p95_ms"], prev["ttft_p95_ms"])
    p99_delta_ratio = _relative_delta(latest["ttft_p99_ms"], prev["ttft_p99_ms"])
    plateau = (
        family is not None
        and abs(pass_rate_delta) <= 0.01
        and p95_delta_ratio is not None
        and p95_delta_ratio >= -0.02
    )
    comparison = {
        "previous": prev,
        "latest": latest,
        "changed_primary_family": family,
        "pass_rate_delta": pass_rate_delta,
        "ttft_p95_delta_ratio": p95_delta_ratio,
        "ttft_p99_delta_ratio": p99_delta_ratio,
    }
    if not plateau:
        return {
            **default,
            "reason": "recent_infeasible_trials_still_show_material_progress",
            "comparison": comparison,
        }
    return {
        "plateau_detected": True,
        "stop_if_next_probe_repeats_family": True,
        "reason": f"{family}_plateau_on_infeasible_trials",
        "blocked_primary_family": family,
        "recommended_next_action": (
            "switch_primary_family_with_a_specific_bottleneck_rationale_or_return_should_stop"
        ),
        "comparison": comparison,
    }


def _infeasible_progress_point(item: dict[str, Any]) -> dict[str, Any] | None:
    if item.get("status") != "completed" or item.get("best_request_rate") is not None:
        return None
    probe_summary = item.get("probe_summary")
    if not isinstance(probe_summary, dict):
        return None
    probe = probe_summary.get("all_infeasible") or probe_summary.get("last_probe")
    compact = _compact_probe(probe if isinstance(probe, dict) else None)
    if not compact or compact.get("feasible"):
        return None
    latency_summary = compact.get("latency_summary")
    if not isinstance(latency_summary, dict):
        latency_summary = {}
    ttft = latency_summary.get("ttft_ms")
    if not isinstance(ttft, dict):
        ttft = {}
    return {
        "trial_id": item.get("trial_id"),
        "threshold": _as_float(compact.get("threshold")),
        "request_rate": _as_float(compact.get("request_rate")),
        "pass_rate": _as_float(compact.get("pass_rate")),
        "ttft_p95_ms": _optional_float(ttft.get("p95")),
        "ttft_p99_ms": _optional_float(ttft.get("p99")),
        "active_bottleneck": item.get("active_bottleneck"),
        "config_patch": item.get("config_patch") if isinstance(item.get("config_patch"), dict) else {},
    }


def _changed_primary_family(
    previous_patch: dict[str, Any],
    latest_patch: dict[str, Any],
) -> str | None:
    previous_flags = previous_patch.get("flag_patch") if isinstance(previous_patch, dict) else {}
    latest_flags = latest_patch.get("flag_patch") if isinstance(latest_patch, dict) else {}
    if not isinstance(previous_flags, dict) or not isinstance(latest_flags, dict):
        return None
    changed: list[str] = []
    for key, family in (
        ("tensor-parallel-size", "tensor-parallel-size"),
        ("data-parallel-size", "data-parallel-size"),
        ("expert-parallel-size", "expert-parallel-size"),
        ("max-num-seqs", "max-num-seqs"),
        ("max-num-batched-tokens", "max-num-batched-tokens"),
        ("gpu-memory-utilization", "gpu-memory-utilization"),
        ("enable-chunked-prefill", "enable-chunked-prefill"),
    ):
        if previous_flags.get(key) != latest_flags.get(key):
            changed.append(family)
    if len(changed) == 1:
        return changed[0]
    if changed:
        return "mixed:" + ",".join(changed)
    return None


def _same_threshold(left: float, right: float) -> bool:
    return abs(left - right) <= max(abs(left), abs(right), 1.0) * 1e-6


def _relative_delta(new: float | None, old: float | None) -> float | None:
    if new is None or old is None or old <= 0:
        return None
    return (new - old) / old


def _proposal_rules() -> list[str]:
    return [
        "First decide the active bottleneck from recent_trial_diagnostics.",
        "Pick at most one primary knob family from knob_harnesses unless the history proves a coupled change is needed.",
        "Use adjacent legal values around the incumbent; avoid broad exploratory jumps.",
        "If infeasible_progress blocks the last primary knob family, do not continue that family; switch families with direct bottleneck evidence or set should_stop=true.",
        "If a proposed config is likely to reduce request_rate_per_gpu under the active guard, set should_stop=true instead of exploring.",
        "Never repeat an already tested config signature.",
        "Return should_stop=true when the convergence guard fires and no active harness justifies another adjacent probe.",
@@ -388,3 +525,11 @@ def _as_float(value: Any) -> float:
    if isinstance(value, (int, float)):
        return float(value)
    return 0.0


def _optional_float(value: Any) -> float | None:
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    return None
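For reference, a hand-traced example of what the guard contributes to the prompt context when the DP plateau from smoke v2 fires (shape read directly from the return statements above; `comparison` abbreviated):

```python
{
    "plateau_detected": True,
    "stop_if_next_probe_repeats_family": True,
    "reason": "data-parallel-size_plateau_on_infeasible_trials",
    "blocked_primary_family": "data-parallel-size",
    "recommended_next_action": (
        "switch_primary_family_with_a_specific_bottleneck_rationale_or_return_should_stop"
    ),
    # "comparison" carries the previous/latest points plus the pass-rate
    # and TTFT deltas computed in the experiment-log sketches above.
}
```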


@@ -13,6 +13,7 @@ from aituner.compare import load_compare_spec, run_compare
from aituner.engine import build_launch_recipe
from aituner.http_client import _auth_headers, _openai_url, _should_bypass_proxy
from aituner.job import append_job, build_trial_job
from aituner.harness import build_harness_context
from aituner.llm import _extract_response_text, build_prompt, parse_proposal_text, validate_proposal
from aituner.search import ThresholdProbe, binary_search_max_feasible
from aituner.slo import RequestOutcome, evaluate_request, summarize_evaluations
@@ -226,6 +227,144 @@ class CoreFlowTests(unittest.TestCase):
self.assertIn("knob_harnesses", prompt) self.assertIn("knob_harnesses", prompt)
self.assertTrue(study_root.exists()) self.assertTrue(study_root.exists())

    def test_harness_uses_latency_failures_before_generic_unrecoverable(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            tmp_path = Path(tmp)
            study_path = _write_study_assets(tmp_path)
            study = load_study_spec(study_path)
            result_path = tmp_path / "trial-result.json"
            result_path.write_text(
                json.dumps(
                    {
                        "status": "completed",
                        "probes": [
                            {
                                "threshold": 0.25,
                                "feasible": False,
                                "payload": {
                                    "request_count": 100,
                                    "pass_rate": 0.3,
                                    "request_rate": 1.0,
                                    "early_stopped": True,
                                    "early_stop_reason": "slo_pass_rate_unrecoverable",
                                    "latency_summary": {
                                        "failed_reason_counts": {
                                            "ttft_ms>5000.0": 70,
                                            "tpot_ms>50.0": 5,
                                            "probe_elapsed_s>240.0": 100,
                                        },
                                        "ttft_ms": {"p95": 6500.0, "p99": 7200.0},
                                    },
                                },
                            }
                        ],
                    }
                ),
                encoding="utf-8",
            )
            state = StudyState(
                study_id=study.study_id,
                trials=[
                    TrialSummary(
                        trial_id="trial-0001",
                        status="completed",
                        result_path=str(result_path),
                        config_patch={"env_patch": {}, "flag_patch": {}},
                    )
                ],
            )
            context = build_harness_context(
                study=study,
                window_summary={
                    "prompt_tokens_p95": 5000,
                    "prompt_tail_ratio_p95_p50": 3.0,
                },
                state=state,
            )
            self.assertEqual(
                context["recent_trial_diagnostics"][0]["active_bottleneck"],
                "ttft_prefill",
            )

    def test_harness_blocks_repeating_infeasible_plateau_family(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            tmp_path = Path(tmp)
            study_path = _write_study_assets(
                tmp_path,
                engine_overrides={
                    "tunable_flags": [
                        "tensor-parallel-size",
                        "data-parallel-size",
                        "expert-parallel-size",
                    ],
                    "topology_constraints": {
                        "allowed_tensor_parallel_sizes": [1, 2, 4],
                        "allowed_data_parallel_sizes": [1, 2, 4, 8],
                        "allowed_expert_parallel_sizes": [1],
                        "allowed_tp_dp_products": [1, 2, 4, 8],
                    },
                },
            )
            study = load_study_spec(study_path)
            trial_summaries = []
            for index, (dp, pass_rate, p95) in enumerate(
                [(4, 0.345, 3818.4), (8, 0.345, 3823.4)], start=3
            ):
                result_path = tmp_path / f"trial-{index:04d}.json"
                result_path.write_text(
                    json.dumps(
                        {
                            "status": "completed",
                            "best_request_rate": None,
                            "all_infeasible_diagnostics": {
                                "threshold": 0.0078125,
                                "request_count": 148,
                                "request_rate": 0.22,
                                "pass_rate": pass_rate,
                                "early_stopped": True,
                                "early_stop_reason": "elapsed",
                                "latency_summary": {
                                    "failed_reason_counts": {"ttft_ms>5000.0": 97},
                                    "ttft_ms": {"p95": p95, "p99": 5800.0},
                                },
                            },
                        }
                    ),
                    encoding="utf-8",
                )
                trial_summaries.append(
                    TrialSummary(
                        trial_id=f"trial-{index:04d}",
                        status="completed",
                        result_path=str(result_path),
                        config_patch={
                            "env_patch": {},
                            "flag_patch": {
                                "tensor-parallel-size": 1,
                                "data-parallel-size": dp,
                                "expert-parallel-size": 1,
                            },
                        },
                    )
                )
            context = build_harness_context(
                study=study,
                window_summary={
                    "prompt_tokens_p95": 7628,
                    "prompt_tail_ratio_p95_p50": 3.83,
                },
                state=StudyState(study_id=study.study_id, trials=trial_summaries),
            )
            guard = context["convergence_guard"]["infeasible_progress"]
            self.assertTrue(guard["plateau_detected"])
            self.assertTrue(guard["stop_if_next_probe_repeats_family"])
            self.assertEqual(guard["blocked_primary_family"], "data-parallel-size")
            self.assertTrue(
                context["convergence_guard"][
                    "should_stop_if_no_harness_can_justify_a_new_adjacent_probe"
                ]
            )

    def test_trace_input_length_filter_keeps_only_matching_rows(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            tmp_path = Path(tmp)