Add profile-driven harness planner
# Profile-Driven Harness Implementation Log
Date: 2026-05-12
## Goal
The harness should accelerate AITuner as a general tuning system, not as a collection of case-specific rules. The current implementation moves the harness toward a performance-engineering loop:
1. extract a compact profile from each measured trial;
2. rank bottleneck hypotheses from workload and probe evidence;
3. generate generic candidate actions from a knob-effect model;
4. score candidates by expected bottleneck relief, information gain, launch safety, and regression risk;
5. block early stop while a high-value untested candidate remains.
This is intended to apply across qwen3.5-27b chat, qwen3-235b prefill-only, qwen3-235b decode-only, and different SLOs without encoding model names, SLO constants, or known winning configs.
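A minimal structural sketch of the five-step loop follows. The step functions are passed in as parameters because the real harness internals are not reproduced here; every name and the `min_value` threshold are illustrative, not the actual `harness.py` API.

```python
from typing import Any, Callable, Sequence

def plan_next_experiment(
    trials: Sequence[Any],
    extract_profile: Callable[[Any], Any],        # step 1: trial -> compact profile
    rank_hypotheses: Callable[[list], list],      # step 2: profiles -> ranked hypotheses
    generate_candidates: Callable[[list], list],  # step 3: hypotheses -> generic actions
    score: Callable[[Any, list], float],          # step 4: candidate value estimate
    min_value: float = 0.5,                       # illustrative "worth running" threshold
) -> dict:
    """Sketch of the five-step planning loop; not the actual harness implementation."""
    profiles = [extract_profile(t) for t in trials]
    hypotheses = rank_hypotheses(profiles)
    candidates = generate_candidates(hypotheses)
    ranked = sorted(candidates, key=lambda c: score(c, hypotheses), reverse=True)
    if ranked and score(ranked[0], hypotheses) >= min_value:
        # step 5: a high-value untested candidate remains, so early stop is blocked
        return {"action": "run", "candidate": ranked[0]}
    return {"action": "stop_ready"}
```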
## Code Changes
- `src/aituner/harness.py`
- Added `trial_profiles` to normalize trial topology, performance, probe failures, latency quantiles, and launch failure evidence.
- Added `bottleneck_hypotheses`, a ranked list of bottleneck hypotheses rather than a single active bottleneck label (see the sketch after this list).
- Added `candidate_actions`, generated from topology and runtime knob families.
- Added `experiment_plan`, which selects the next highest-scoring candidate or declares stop readiness.
- Updated harness proposal generation to prefer the profile-driven next action before falling back to legacy deterministic proposal code.
- Updated harness stop logic so convergence/validation stop is blocked when the planner still has a high-value untested candidate.
- `tests/test_core_flow.py`
- Added coverage verifying that, with a strong TP=2 incumbent under TTFT pressure, the planner still selects an unmeasured TP=4 topology candidate.
- Added coverage verifying that, under decode-only TPOT pressure at max TP, the planner can prefer lowering `max-num-seqs` over blindly lowering TP.
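As a rough illustration of the shapes these structures might take, here is a sketch with hypothetical field names; it does not reproduce the actual `trial_profiles` / `bottleneck_hypotheses` schema in `harness.py`.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrialProfile:
    """Hypothetical shape of one `trial_profiles` entry; fields are illustrative."""
    topology: dict                       # e.g. {"tp": 2, "dp": 1}
    ttft_p95_ms: Optional[float] = None  # latency quantiles, if measured
    tpot_p95_ms: Optional[float] = None
    probe_failures: list = field(default_factory=list)
    launch_failed: bool = False          # launch failure evidence

@dataclass
class BottleneckHypothesis:
    """One entry of the ranked hypothesis list."""
    kind: str          # e.g. "prefill_compute", "decode_latency", "admission_queueing"
    confidence: float  # 0..1, multiplies expected relief during candidate scoring
```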
## Current Scoring Model
The candidate score is intentionally generic:
```text
score = expected_bottleneck_relief * bottleneck_confidence
        + information_gain
        + launch_safety
        - regression_risk
```
Examples:
- TTFT/prefill bottleneck: candidates that increase TP or prefill batching receive relief score.
- Decode TPOT bottleneck: increasing TP is useful if a higher legal TP exists; if already at high TP, lowering decode concurrency can become the higher-value candidate.
- Admission/queueing bottleneck: candidates that add DP or raise safe concurrency receive relief score.
The scores are not tied to qwen27b/qwen235b or a fixed TPOT/TTFT threshold. They are tied to the measured bottleneck class and legal tunable space.
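As a sketch, the formula above transcribes directly into a small scoring helper. The component values and the illustrative numbers below are assumptions for the decode example, not actual harness output.

```python
def candidate_score(relief: float, confidence: float, information_gain: float,
                    launch_safety: float, regression_risk: float) -> float:
    """Direct transcription of the score formula above."""
    return relief * confidence + information_gain + launch_safety - regression_risk

# Decode TPOT pressure at max TP (illustrative numbers): raising TP gets no relief
# because no legal higher TP exists, so lowering decode concurrency wins.
raise_tp = candidate_score(relief=0.0, confidence=0.8, information_gain=0.1,
                           launch_safety=0.2, regression_risk=0.3)            # 0.0
lower_max_num_seqs = candidate_score(relief=0.7, confidence=0.8, information_gain=0.2,
                                     launch_safety=0.3, regression_risk=0.1)  # 0.96
```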
## Verification
Local:
```bash
python3 -m compileall -q src tests
PYTHONPATH=src python3 -m unittest tests.test_core_flow
```
Result: `93` tests passed.
## Next Experiment
Run the same qwen3.5-27b chat 0-8k setup as the current ablation baseline:
- workload: chat, input length 0-8k
- SLO: TTFT p95 <= 4000ms, TPOT p95 <= 25ms, target pass rate 0.95
- search: full range, `inherit_incumbent_floor=false`
- budget: 12 total tuning iterations
- LLM model: `gpt-5.4`
- variant: harness enabled with profile-driven planner
The no-harness min-prompt baseline is already available and can simply be reused for comparison unless the setup changes.
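For reference, the setup above could be captured as a plain config dict. The key names here are illustrative and do not correspond to an actual AITuner config schema.

```python
# Hypothetical encoding of the next-experiment setup; not a real AITuner config format.
next_experiment = {
    "model": "qwen3.5-27b",
    "workload": {"type": "chat", "input_len_range": "0-8k"},
    "slo": {"ttft_p95_ms": 4000, "tpot_p95_ms": 25, "target_pass_rate": 0.95},
    "search": {"range": "full", "inherit_incumbent_floor": False},
    "budget": {"tuning_iterations": 12},
    "llm_model": "gpt-5.4",
    "variant": {"harness": True, "planner": "profile_driven"},
}
```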