Add infeasible plateau guard to harness

Date: 2026-04-25 18:47:32 +08:00
parent 6c04b9dbbc
commit e188de7735
3 changed files with 320 additions and 8 deletions


@@ -24,6 +24,8 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
- Adds TP, max-num-seqs, max-num-batched-tokens, chunked-prefill, and memory-utilization harnesses when those knobs are tunable.
- Extracts compact recent trial diagnostics from result JSON files.
- Adds a convergence guard based on recent completed trial performance.
- Adds an infeasible-progress guard: when the recent trials are all infeasible at the same sampling threshold and, after a change within one knob family, stop improving both pass rate and p95 TTFT, the next proposal must switch its primary knob family or stop.
- Classifies `slo_pass_rate_unrecoverable` by latency failure counts first, so TTFT-heavy failures stay aligned to prefill/TP or batching harnesses instead of being treated as generic queueing.
- Extended `src/aituner/trace.py`.
- `summarize_window` now reports L-C-A features.
- `TraceRequest` now carries optional metadata for `hash_ids`, turn, parent chat id, and trace type.
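The guard logic described above can be sketched in a few lines. This is a hypothetical illustration, not the actual harness code: the `TrialResult` shape, field names, and the improvement thresholds are assumptions chosen to make the rule concrete.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrialResult:
    """Compact diagnostics for one completed trial (hypothetical shape)."""
    knob_family: str          # e.g. "dp_scale_out" -- assumed label
    sampling_threshold: float
    feasible: bool
    pass_rate: float
    p95_ttft_ms: float

def infeasible_plateau(recent: List[TrialResult],
                       min_pass_rate_gain: float = 0.01,
                       min_p95_gain_ms: float = 100.0) -> bool:
    """Return True when recent all-infeasible trials at the same sampling
    threshold, all from one knob family, stopped improving pass rate and
    p95 TTFT. The caller must then switch primary family or stop."""
    if len(recent) < 2:
        return False
    if any(t.feasible for t in recent):
        return False  # a feasible incumbent exists; the normal convergence guard applies
    if len({t.knob_family for t in recent}) != 1:
        return False  # family already switched; no plateau within one family
    if len({t.sampling_threshold for t in recent}) != 1:
        return False  # thresholds differ; metrics are not comparable
    prev, last = recent[-2], recent[-1]
    pass_gain = last.pass_rate - prev.pass_rate
    p95_gain = prev.p95_ttft_ms - last.p95_ttft_ms  # positive = latency improved
    return pass_gain < min_pass_rate_gain and p95_gain < min_p95_gain_ms
```

On the smoke v2 numbers below, DP4 to DP8 (pass rate 0.345 to 0.345, p95 TTFT 3818.4 ms to 3823.4 ms) trips the guard, while DP2 to DP4 does not.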
@@ -43,7 +45,7 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
## Local Verification
- `python3 -m compileall -q src tests`: passed.
- `PYTHONPATH=src python3 -m unittest tests.test_core_flow`: passed, 62 tests (was 59 before this change).
- `pytest -q` and `python3 -m pytest -q`: not runnable locally because `pytest` is not installed.
## Remote Experiment Log
@@ -82,10 +84,39 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
- This is not aligned with the paper's agentic loop, which evaluates the initial configuration first and then searches from measured feedback.
- Action: update `study tune` so LLM-driven studies automatically materialize a baseline empty-patch trial first, unless `--skip-baseline` is passed. This should reduce early bad proposals because the first LLM edit will see real baseline bottleneck diagnostics and an incumbent request_rate_per_gpu.
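The baseline-first behavior can be sketched as follows. This is a hypothetical illustration of the intended `study tune` change; the function name and trial dict shape are assumptions.

```python
from typing import Dict, List

def materialize_trials(llm_driven: bool,
                       skip_baseline: bool,
                       proposals: List[Dict]) -> List[Dict]:
    """Hypothetical sketch: for LLM-driven studies, prepend a baseline
    empty-patch trial unless --skip-baseline was passed, so the first LLM
    edit sees measured baseline diagnostics and an incumbent
    request_rate_per_gpu instead of proposing blind."""
    trials = list(proposals)
    if llm_driven and not skip_baseline:
        trials.insert(0, {"patch": {}, "label": "baseline"})
    return trials
```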
### 2026-04-25 17:20-18:30 CST
- r3 started with baseline-first enabled, but the full 0-8k run was too slow for fast iteration with raw chat completions. Stopped it before using it as a convergence signal.
- A fast validation using `max_requests_per_probe=160` was invalid: the trace is downsampled before threshold selection, so lower thresholds can end up with `request_count=0`. Do not use that result for performance claims.
- Prefill smoke v1 used `completion_tokens_override=1` but kept the TPOT SLO, so missing-TPOT failures dominated; it was useful only for checking control flow, not for performance claims.
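The `max_requests_per_probe` pitfall above can be shown with a small sketch. This is an assumed, simplified model of the downsampling (truncation before the per-threshold split), not the actual trace code: if the low-threshold requests sit past the cap, their bucket ends up with `request_count=0`.

```python
from typing import Dict, List, Optional

def per_threshold_counts(requests: List[Dict],
                         thresholds: List[float],
                         max_requests: Optional[int] = None) -> Dict[float, int]:
    """Hypothetical sketch: count requests per sampling threshold.
    Downsampling BEFORE the per-threshold split can empty rare buckets."""
    if max_requests is not None:
        requests = requests[:max_requests]  # trace is cut before selection
    return {t: sum(1 for r in requests if r["threshold"] == t)
            for t in thresholds}

# 208-request toy trace: the rare low threshold only appears after request 160
trace = [{"threshold": 0.5}] * 200 + [{"threshold": 0.0078125}] * 8
full = per_threshold_counts(trace, [0.5, 0.0078125])
capped = per_threshold_counts(trace, [0.5, 0.0078125], max_requests=160)
```

Here `full` sees 8 low-threshold requests, while `capped` sees 0, so any metric computed for that threshold under the cap is meaningless.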
### 2026-04-25 18:30-20:10 CST
- Prefill smoke v2 used real dash0 internal vLLM, Qwen3.5-27B, the real 0-8k prompt distribution and arrivals, `completion_tokens_override=1`, and `tpot_rule=null`.
- Results at sampling threshold `0.0078125`:

| Trial | Config | Pass rate | Mean TTFT (ms) | p95 TTFT (ms) | p99 TTFT (ms) |
|-------|--------|-----------|----------------|---------------|---------------|
| 0001 (baseline) | TP1/DP1 | 0.270 | 2033.9 | 5656.7 | 6832.8 |
| 0002 | TP1/DP2 | 0.277 | 1766.9 | 4215.3 | 5801.7 |
| 0003 | TP1/DP4 | 0.345 | 1668.9 | 3818.4 | 5804.9 |
| 0004 | TP1/DP8 | 0.345 | 1675.7 | 3823.4 | n/a |
- Interpretation:
- The harness improved directionality: after the measured baseline, proposals followed a consistent scale-out path and avoided random runtime-knob churn.
- The smoke run reduced p95 TTFT by about 32% versus baseline (5656.7 ms to 3818.4 ms) at the low sampling threshold and improved pass rate from 0.270 to 0.345 within 3-4 trials.
- It did not reach the 95% pass-rate SLO in this smoke setting, so this is not a full proof of convergence to a good production config.
- DP8 did not improve over DP4, which exposed a gap: when every trial is infeasible, the prior convergence guard had no feasible incumbent and could not detect plateau.
### 2026-04-25 20:10 CST
- Added the all-infeasible plateau guard described above.
- Added unit coverage for:
- TTFT failure classification under `slo_pass_rate_unrecoverable`;
- blocking a repeat of the DP family after DP4 and DP8 show no material improvement at the same sampling threshold.
- Current status: the harness now has the mechanism needed to avoid continuing the exact DP-only direction seen in the smoke v2 plateau. The next real experiment should either switch to a bottleneck-justified mixed TP/DP candidate or return `should_stop=true`.
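The latency-first classification covered by the new unit tests can be sketched like this. It is a hypothetical illustration, not the harness implementation: the failure-count dict shape and the `"decode"` bucket are assumptions; only the TTFT-heavy vs generic-queueing split is stated in the log.

```python
from typing import Dict

def classify_unrecoverable(failure_counts: Dict[str, int]) -> str:
    """Hypothetical sketch of sub-classifying slo_pass_rate_unrecoverable:
    inspect latency failure counts FIRST, so TTFT-heavy failures route to
    the prefill/TP or batching harnesses instead of generic queueing."""
    ttft = failure_counts.get("ttft", 0)
    tpot = failure_counts.get("tpot", 0)
    other = failure_counts.get("other", 0)
    if ttft > 0 and ttft >= tpot and ttft >= other:
        return "prefill_or_batching"   # TTFT-heavy: prefill/TP or batching knobs
    if tpot > other:
        return "decode"                # assumed bucket for TPOT-heavy failures
    return "generic_queueing"
```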
Remaining next steps:
1. Start a real harness-guided Qwen3.5-27B 0-8k chat tuning run from `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`.
2. Compare the first few iterations against the prior 12-iteration behavior:
- best request rate per GPU should improve or reach the known good region in fewer trials;
- proposals should follow the active bottleneck harness;
- if the incumbent has converged, the LLM should emit `should_stop=true` instead of proposing a weak exploratory config.
3. Push/pull the plateau-guard commit to `dash0`.
4. Re-run the remote unit suite.
5. Start the next real tuning run only after deciding whether to spend a full multi-hour run on the production SLO or a shorter prefill-only confirmation of the new plateau guard.