Add infeasible plateau guard to harness

2026-04-25 18:47:32 +08:00
parent 6c04b9dbbc
commit 6bac389aae
3 changed files with 324 additions and 9 deletions


@@ -24,6 +24,8 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
- Adds TP, max-num-seqs, max-num-batched-tokens, chunked-prefill, and memory-utilization harnesses when those knobs are tunable.
- Extracts compact recent trial diagnostics from result JSON files.
- Adds a convergence guard based on recent completed trial performance.
- Adds an infeasible-progress guard: when recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT after changing one knob family, the next proposal must switch primary family or stop.
- Classifies `slo_pass_rate_unrecoverable` by latency failure counts first, and ignores probe-budget markers such as `probe_elapsed_s>` for bottleneck voting, so TTFT-heavy failures stay aligned to prefill/TP or batching harnesses instead of being treated as generic queueing. A minimal sketch of the voting order follows this list.
- Extended `src/aituner/trace.py`.
- `summarize_window` now reports L-C-A features.
- `TraceRequest` now carries optional metadata for `hash_ids`, turn, parent chat id, and trace type.
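A minimal sketch of that voting order, using the failure counts from the new unit test further down (the dict literal is illustrative data; the real logic is `_active_bottleneck` in the diff below):

```python
failed = {
    "ttft_ms>5000.0": 70,          # latency failure: counted
    "tpot_ms>50.0": 5,             # latency failure: counted
    "probe_elapsed_s>240.0": 100,  # probe-budget marker: ignored for voting
}
ttft = sum(v for k, v in failed.items() if k.startswith("ttft"))
tpot = sum(v for k, v in failed.items() if k.startswith("tpot"))
other = sum(
    v for k, v in failed.items()
    if not k.startswith(("ttft", "tpot", "probe_elapsed_s>"))
)
# ttft=70 >= max(tpot=5, other=0), so the trial votes "ttft_prefill"
# even though early_stop_reason is "slo_pass_rate_unrecoverable".
```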
@@ -43,7 +45,7 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
## Local Verification
- `python3 -m compileall -q src tests`: passed.
- `PYTHONPATH=src python3 -m unittest tests.test_core_flow`: passed, 62 tests.
- `pytest -q` and `python3 -m pytest -q`: not runnable locally because `pytest` is not installed.

## Remote Experiment Log
@@ -82,10 +84,39 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
- This is not aligned with the paper's agentic loop, which evaluates the initial configuration first and then searches from measured feedback.
- Action: update `study tune` so LLM-driven studies automatically materialize a baseline empty-patch trial first, unless `--skip-baseline` is passed. This should reduce early bad proposals because the first LLM edit will see real baseline bottleneck diagnostics and an incumbent request_rate_per_gpu.
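A rough sketch of the intended control flow (hypothetical helper and attribute names; only `--skip-baseline` and the empty-patch shape come from this log):

```python
def maybe_enqueue_baseline(study, skip_baseline: bool) -> None:
    # Hypothetical sketch: names other than --skip-baseline are assumptions.
    if skip_baseline or study.trials:
        return
    # An empty patch measures the unmodified engine config first, so the
    # first LLM proposal sees real bottleneck diagnostics and an incumbent.
    study.enqueue_trial(config_patch={"env_patch": {}, "flag_patch": {}})
```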
### 2026-04-25 17:20-18:30 CST
- r3 started with baseline-first enabled, but the full 0-8k run was too slow for fast iteration with raw chat completions. Stopped it before using it as a convergence signal.
- A fast validation using `max_requests_per_probe=160` was invalid: the trace is downsampled before threshold selection, so lower thresholds can end up with `request_count=0` (back-of-envelope after this list). Do not use that result for performance claims.
- Prefill smoke v1 used `completion_tokens_override=1` but kept the TPOT SLO. That made missing-TPOT failures dominate, so it was useful only for checking control flow, not for performance.
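Back-of-envelope for the `request_count=0` failure mode, assuming the threshold acts as a sampling fraction over the already-capped trace (my reading of the bullet above, not verified against the sampler):

```python
max_requests_per_probe = 160
threshold = 0.0078125                      # 1/128, the lowest threshold in smoke v2
print(max_requests_per_probe * threshold)  # 1.25 expected requests
# Integer selection over ~1 expected row easily rounds down to zero,
# so low-threshold probes on the capped trace can report request_count=0.
```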
### 2026-04-25 18:30-20:10 CST
- Prefill smoke v2 used real dash0 internal vLLM, Qwen3.5-27B, the real 0-8k prompt distribution and arrivals, `completion_tokens_override=1`, and `tpot_rule=null`.
- Trial 0001 baseline TP1/DP1:
- sampling `0.0078125`: pass rate 0.270, mean TTFT 2033.9 ms, p95 TTFT 5656.7 ms, p99 TTFT 6832.8 ms.
- Trial 0002 TP1/DP2:
- sampling `0.0078125`: pass rate 0.277, mean TTFT 1766.9 ms, p95 TTFT 4215.3 ms, p99 TTFT 5801.7 ms.
- Trial 0003 TP1/DP4:
- sampling `0.0078125`: pass rate 0.345, mean TTFT 1668.9 ms, p95 TTFT 3818.4 ms, p99 TTFT 5804.9 ms.
- Trial 0004 TP1/DP8:
- sampling `0.0078125`: pass rate 0.345, mean TTFT 1675.7 ms, p95 TTFT 3823.4 ms.
- Interpretation:
- The harness improved directionality: after the measured baseline, proposals followed a consistent scale-out path and avoided random runtime-knob churn.
- The smoke result improved p95 TTFT by about 32% versus baseline at the low sampling threshold and improved pass rate from 0.270 to 0.345 within 3-4 trials (recomputed in the sketch after this list).
- It did not reach the 95% pass-rate SLO in this smoke setting, so this is not a full proof of convergence to a good production config.
- DP8 did not improve over DP4, which exposed a gap: when every trial is infeasible, the prior convergence guard had no feasible incumbent and could not detect a plateau.
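Recomputing the headline numbers from the trial list above:

```python
baseline_p95, dp4_p95 = 5656.7, 3818.4          # trial 0001 vs 0003 at threshold 0.0078125
print((baseline_p95 - dp4_p95) / baseline_p95)  # ~0.325 -> the "about 32%" p95 TTFT gain
print(0.345 - 0.270)                            # +0.075 absolute pass-rate gain
```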
### 2026-04-25 20:10 CST
- Added the all-infeasible plateau guard described above.
- Added unit coverage for:
- TTFT failure classification under `slo_pass_rate_unrecoverable`;
- blocking a repeat of the DP family after DP4 and DP8 show no material improvement at the same sampling threshold.
- Current status: the harness now has the mechanism needed to avoid continuing the exact DP-only direction seen in the smoke v2 plateau. The next real experiment should either switch to a bottleneck-justified mixed TP/DP candidate or return `should_stop=true`.
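Applying the new guard's constants to the DP4 -> DP8 pair from smoke v2 shows why it fires (the `0.01` and `-0.02` cutoffs are the literals in `_infeasible_progress_guard` below):

```python
pass_rate_delta = 0.345 - 0.345               # DP8 vs DP4 at threshold 0.0078125
p95_delta_ratio = (3823.4 - 3818.4) / 3818.4  # ~ +0.0013
plateau = abs(pass_rate_delta) <= 0.01 and p95_delta_ratio >= -0.02
print(plateau)  # True: only data-parallel-size changed, so that family is blocked
```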
Remaining next steps:
1. Push/pull the plateau-guard commit to `dash0`.
2. Re-run the remote unit suite.
3. Start the next real tuning run only after deciding whether to spend a full multi-hour run on the production SLO or a shorter prefill-only confirmation of the new plateau guard.


@@ -267,20 +267,26 @@ def _active_bottleneck(probe: dict[str, Any] | None) -> str:
    if not compact:
        return "unknown"
    early_stop_reason = str(compact.get("early_stop_reason") or "")
    if early_stop_reason.startswith("arrival_lag_s>"):
        return "admission_or_queueing"
    latency_summary = compact.get("latency_summary")
    if not isinstance(latency_summary, dict):
        if early_stop_reason == "slo_pass_rate_unrecoverable":
            return "admission_or_queueing"
        return "unknown"
    failed = latency_summary.get("failed_reason_counts")
    if not isinstance(failed, dict) or not failed:
        if early_stop_reason == "slo_pass_rate_unrecoverable":
            return "admission_or_queueing"
        return "none_obvious"
    ttft_count = sum(int(v) for k, v in failed.items() if str(k).startswith("ttft"))
    tpot_count = sum(int(v) for k, v in failed.items() if str(k).startswith("tpot"))
    request_failed_count = sum(
        int(v)
        for k, v in failed.items()
        if not str(k).startswith("ttft")
        and not str(k).startswith("tpot")
        and not str(k).startswith("probe_elapsed_s>")
    )
    if ttft_count >= max(tpot_count, request_failed_count):
        return "ttft_prefill"
@@ -301,6 +307,7 @@ def _convergence_guard(
    state: StudyState,
    recent_diagnostics: list[dict[str, Any]],
) -> dict[str, Any]:
    infeasible_progress = _infeasible_progress_guard(recent_diagnostics)
    completed = [
        item
        for item in recent_diagnostics
@@ -320,9 +327,14 @@ def _convergence_guard(
reason = "recent_trials_still_show_material_variation" reason = "recent_trials_still_show_material_variation"
elif state.best_request_rate_per_gpu is not None: elif state.best_request_rate_per_gpu is not None:
reason = "need_more_evidence_before_stop" reason = "need_more_evidence_before_stop"
if not should_stop and infeasible_progress["plateau_detected"]:
reason = str(infeasible_progress["reason"])
return { return {
"should_stop_if_no_harness_can_justify_a_new_adjacent_probe": should_stop, "should_stop_if_no_harness_can_justify_a_new_adjacent_probe": (
should_stop or bool(infeasible_progress["stop_if_next_probe_repeats_family"])
),
"reason": reason, "reason": reason,
"infeasible_progress": infeasible_progress,
"incumbent": { "incumbent": {
"trial_id": state.best_trial_id, "trial_id": state.best_trial_id,
"parallel_size": state.best_parallel_size, "parallel_size": state.best_parallel_size,
@@ -333,11 +345,136 @@ def _convergence_guard(
    }


def _infeasible_progress_guard(recent_diagnostics: list[dict[str, Any]]) -> dict[str, Any]:
    points = [
        point
        for item in recent_diagnostics
        if (point := _infeasible_progress_point(item)) is not None
    ]
    default = {
        "plateau_detected": False,
        "stop_if_next_probe_repeats_family": False,
        "reason": "insufficient_infeasible_history",
        "blocked_primary_family": None,
        "recommended_next_action": "continue_adjacent_probe_if_a_harness_justifies_it",
        "comparison": None,
    }
    if len(points) < 2:
        return default
    prev, latest = points[-2], points[-1]
    if not _same_threshold(prev["threshold"], latest["threshold"]):
        return {
            **default,
            "reason": "recent_infeasible_trials_used_different_sampling_thresholds",
        }
    family = _changed_primary_family(prev["config_patch"], latest["config_patch"])
    pass_rate_delta = latest["pass_rate"] - prev["pass_rate"]
    p95_delta_ratio = _relative_delta(latest["ttft_p95_ms"], prev["ttft_p95_ms"])
    p99_delta_ratio = _relative_delta(latest["ttft_p99_ms"], prev["ttft_p99_ms"])
    plateau = (
        family is not None
        and abs(pass_rate_delta) <= 0.01
        and p95_delta_ratio is not None
        and p95_delta_ratio >= -0.02
    )
    comparison = {
        "previous": prev,
        "latest": latest,
        "changed_primary_family": family,
        "pass_rate_delta": pass_rate_delta,
        "ttft_p95_delta_ratio": p95_delta_ratio,
        "ttft_p99_delta_ratio": p99_delta_ratio,
    }
    if not plateau:
        return {
            **default,
            "reason": "recent_infeasible_trials_still_show_material_progress",
            "comparison": comparison,
        }
    return {
        "plateau_detected": True,
        "stop_if_next_probe_repeats_family": True,
        "reason": f"{family}_plateau_on_infeasible_trials",
        "blocked_primary_family": family,
        "recommended_next_action": (
            "switch_primary_family_with_a_specific_bottleneck_rationale_or_return_should_stop"
        ),
        "comparison": comparison,
    }


def _infeasible_progress_point(item: dict[str, Any]) -> dict[str, Any] | None:
    if item.get("status") != "completed" or item.get("best_request_rate") is not None:
        return None
    probe_summary = item.get("probe_summary")
    if not isinstance(probe_summary, dict):
        return None
    probe = probe_summary.get("all_infeasible") or probe_summary.get("last_probe")
    compact = _compact_probe(probe if isinstance(probe, dict) else None)
    if not compact or compact.get("feasible"):
        return None
    latency_summary = compact.get("latency_summary")
    if not isinstance(latency_summary, dict):
        latency_summary = {}
    ttft = latency_summary.get("ttft_ms")
    if not isinstance(ttft, dict):
        ttft = {}
    return {
        "trial_id": item.get("trial_id"),
        "threshold": _as_float(compact.get("threshold")),
        "request_rate": _as_float(compact.get("request_rate")),
        "pass_rate": _as_float(compact.get("pass_rate")),
        "ttft_p95_ms": _optional_float(ttft.get("p95")),
        "ttft_p99_ms": _optional_float(ttft.get("p99")),
        "active_bottleneck": item.get("active_bottleneck"),
        "config_patch": item.get("config_patch") if isinstance(item.get("config_patch"), dict) else {},
    }


def _changed_primary_family(
    previous_patch: dict[str, Any],
    latest_patch: dict[str, Any],
) -> str | None:
    previous_flags = previous_patch.get("flag_patch") if isinstance(previous_patch, dict) else {}
    latest_flags = latest_patch.get("flag_patch") if isinstance(latest_patch, dict) else {}
    if not isinstance(previous_flags, dict) or not isinstance(latest_flags, dict):
        return None
    changed: list[str] = []
    for key, family in (
        ("tensor-parallel-size", "tensor-parallel-size"),
        ("data-parallel-size", "data-parallel-size"),
        ("expert-parallel-size", "expert-parallel-size"),
        ("max-num-seqs", "max-num-seqs"),
        ("max-num-batched-tokens", "max-num-batched-tokens"),
        ("gpu-memory-utilization", "gpu-memory-utilization"),
        ("enable-chunked-prefill", "enable-chunked-prefill"),
    ):
        if previous_flags.get(key) != latest_flags.get(key):
            changed.append(family)
    if len(changed) == 1:
        return changed[0]
    if changed:
        return "mixed:" + ",".join(changed)
    return None


def _same_threshold(left: float, right: float) -> bool:
    return abs(left - right) <= max(abs(left), abs(right), 1.0) * 1e-6


def _relative_delta(new: float | None, old: float | None) -> float | None:
    if new is None or old is None or old <= 0:
        return None
    return (new - old) / old


def _proposal_rules() -> list[str]:
    return [
        "First decide the active bottleneck from recent_trial_diagnostics.",
        "Pick at most one primary knob family from knob_harnesses unless the history proves a coupled change is needed.",
        "Use adjacent legal values around the incumbent; avoid broad exploratory jumps.",
        "If infeasible_progress blocks the last primary knob family, do not continue that family; switch families with direct bottleneck evidence or set should_stop=true.",
        "If a proposed config is likely to reduce request_rate_per_gpu under the active guard, set should_stop=true instead of exploring.",
        "Never repeat an already tested config signature.",
        "Return should_stop=true when the convergence guard fires and no active harness justifies another adjacent probe.",
@@ -388,3 +525,11 @@ def _as_float(value: Any) -> float:
    if isinstance(value, (int, float)):
        return float(value)
    return 0.0


def _optional_float(value: Any) -> float | None:
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    return None
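For reference, a hand-traced example of what the guard contributes to the prompt context when the DP plateau from smoke v2 fires (shape read directly from the return statements above; `comparison` abbreviated):

```python
{
    "plateau_detected": True,
    "stop_if_next_probe_repeats_family": True,
    "reason": "data-parallel-size_plateau_on_infeasible_trials",
    "blocked_primary_family": "data-parallel-size",
    "recommended_next_action": (
        "switch_primary_family_with_a_specific_bottleneck_rationale_or_return_should_stop"
    ),
    # "comparison" carries the previous/latest points plus the pass-rate
    # and TTFT deltas computed in the experiment-log sketches above.
}
```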


@@ -13,6 +13,7 @@ from aituner.compare import load_compare_spec, run_compare
from aituner.engine import build_launch_recipe
from aituner.http_client import _auth_headers, _openai_url, _should_bypass_proxy
from aituner.job import append_job, build_trial_job
from aituner.harness import build_harness_context
from aituner.llm import _extract_response_text, build_prompt, parse_proposal_text, validate_proposal
from aituner.search import ThresholdProbe, binary_search_max_feasible
from aituner.slo import RequestOutcome, evaluate_request, summarize_evaluations
@@ -226,6 +227,144 @@ class CoreFlowTests(unittest.TestCase):
self.assertIn("knob_harnesses", prompt) self.assertIn("knob_harnesses", prompt)
self.assertTrue(study_root.exists()) self.assertTrue(study_root.exists())

    def test_harness_uses_latency_failures_before_generic_unrecoverable(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            tmp_path = Path(tmp)
            study_path = _write_study_assets(tmp_path)
            study = load_study_spec(study_path)
            result_path = tmp_path / "trial-result.json"
            result_path.write_text(
                json.dumps(
                    {
                        "status": "completed",
                        "probes": [
                            {
                                "threshold": 0.25,
                                "feasible": False,
                                "payload": {
                                    "request_count": 100,
                                    "pass_rate": 0.3,
                                    "request_rate": 1.0,
                                    "early_stopped": True,
                                    "early_stop_reason": "slo_pass_rate_unrecoverable",
                                    "latency_summary": {
                                        "failed_reason_counts": {
                                            "ttft_ms>5000.0": 70,
                                            "tpot_ms>50.0": 5,
                                            "probe_elapsed_s>240.0": 100,
                                        },
                                        "ttft_ms": {"p95": 6500.0, "p99": 7200.0},
                                    },
                                },
                            }
                        ],
                    }
                ),
                encoding="utf-8",
            )
            state = StudyState(
                study_id=study.study_id,
                trials=[
                    TrialSummary(
                        trial_id="trial-0001",
                        status="completed",
                        result_path=str(result_path),
                        config_patch={"env_patch": {}, "flag_patch": {}},
                    )
                ],
            )
            context = build_harness_context(
                study=study,
                window_summary={
                    "prompt_tokens_p95": 5000,
                    "prompt_tail_ratio_p95_p50": 3.0,
                },
                state=state,
            )
            self.assertEqual(
                context["recent_trial_diagnostics"][0]["active_bottleneck"],
                "ttft_prefill",
            )

    def test_harness_blocks_repeating_infeasible_plateau_family(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            tmp_path = Path(tmp)
            study_path = _write_study_assets(
                tmp_path,
                engine_overrides={
                    "tunable_flags": [
                        "tensor-parallel-size",
                        "data-parallel-size",
                        "expert-parallel-size",
                    ],
                    "topology_constraints": {
                        "allowed_tensor_parallel_sizes": [1, 2, 4],
                        "allowed_data_parallel_sizes": [1, 2, 4, 8],
                        "allowed_expert_parallel_sizes": [1],
                        "allowed_tp_dp_products": [1, 2, 4, 8],
                    },
                },
            )
            study = load_study_spec(study_path)
            trial_summaries = []
            for index, (dp, pass_rate, p95) in enumerate(
                [(4, 0.345, 3818.4), (8, 0.345, 3823.4)], start=3
            ):
                result_path = tmp_path / f"trial-{index:04d}.json"
                result_path.write_text(
                    json.dumps(
                        {
                            "status": "completed",
                            "best_request_rate": None,
                            "all_infeasible_diagnostics": {
                                "threshold": 0.0078125,
                                "request_count": 148,
                                "request_rate": 0.22,
                                "pass_rate": pass_rate,
                                "early_stopped": True,
                                "early_stop_reason": "elapsed",
                                "latency_summary": {
                                    "failed_reason_counts": {"ttft_ms>5000.0": 97},
                                    "ttft_ms": {"p95": p95, "p99": 5800.0},
                                },
                            },
                        }
                    ),
                    encoding="utf-8",
                )
                trial_summaries.append(
                    TrialSummary(
                        trial_id=f"trial-{index:04d}",
                        status="completed",
                        result_path=str(result_path),
                        config_patch={
                            "env_patch": {},
                            "flag_patch": {
                                "tensor-parallel-size": 1,
                                "data-parallel-size": dp,
                                "expert-parallel-size": 1,
                            },
                        },
                    )
                )
            context = build_harness_context(
                study=study,
                window_summary={
                    "prompt_tokens_p95": 7628,
                    "prompt_tail_ratio_p95_p50": 3.83,
                },
                state=StudyState(study_id=study.study_id, trials=trial_summaries),
            )
            guard = context["convergence_guard"]["infeasible_progress"]
            self.assertTrue(guard["plateau_detected"])
            self.assertTrue(guard["stop_if_next_probe_repeats_family"])
            self.assertEqual(guard["blocked_primary_family"], "data-parallel-size")
            self.assertTrue(
                context["convergence_guard"][
                    "should_stop_if_no_harness_can_justify_a_new_adjacent_probe"
                ]
            )

    def test_trace_input_length_filter_keeps_only_matching_rows(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            tmp_path = Path(tmp)