# Profile-Driven Harness Implementation Log

Date: 2026-05-12

## Goal

The harness should accelerate AITuner as a general tuning system, not as a collection of case-specific rules. The current implementation moves the harness toward a performance-engineering loop:

1. extract a compact profile from each measured trial;
2. rank bottleneck hypotheses from workload and probe evidence;
3. generate generic candidate actions from a knob-effect model;
4. score candidates by expected bottleneck relief, information gain, launch safety, and regression risk;
5. block early stop while a high-value untested candidate remains.

This is intended to apply across qwen3.5-27b chat, qwen3-235b prefill-only, qwen3-235b decode-only, and different SLOs without encoding model names, SLO constants, or known winning configs.

## Code Changes

- `src/aituner/harness.py`
  - Added `trial_profiles` to normalize trial topology, performance, probe failures, latency quantiles, and launch failure evidence.
  - Added `bottleneck_hypotheses`, a ranked list instead of a single active bottleneck label.
  - Added `candidate_actions`, generated from topology and runtime knob families.
  - Added `experiment_plan`, which selects the next high-score candidate or declares stop readiness.
  - Updated harness proposal generation to prefer the profile-driven next action before falling back to legacy deterministic proposal code.
  - Updated harness stop logic so convergence/validation stop is blocked when the planner still has a high-value untested candidate.
- `tests/test_core_flow.py`
  - Added coverage that a strong TP=2 incumbent with TTFT pressure still selects an unmeasured TP=4 topology candidate.
  - Added coverage that decode-only TPOT pressure at max TP can prefer lowering `max-num-seqs` instead of blindly lowering TP.

## Current Scoring Model

The candidate score is intentionally generic:

```text
score = expected_bottleneck_relief * bottleneck_confidence + information_gain + launch_safety - regression_risk
```

Examples:

- TTFT/prefill bottleneck: increasing TP and prefill batching candidates receive relief score.
- Decode TPOT bottleneck: increasing TP is useful if a higher legal TP exists; if already at the maximum legal TP, lowering decode concurrency can become the higher-value candidate.
- Admission/queueing bottleneck: more DP or higher safe concurrency receives relief score.

The scores are not tied to qwen27b/qwen235b or a fixed TPOT/TTFT threshold. They are tied to the measured bottleneck class and the legal tunable space.
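As a reference for how these terms combine with the stop rule from step 5 of the loop above, here is a minimal sketch. All names here (`Candidate`, `score`, `blocks_early_stop`, `HIGH_VALUE_THRESHOLD`) are illustrative, not the actual `harness.py` API:

```python
from dataclasses import dataclass

# Illustrative constant; the real harness may derive the threshold
# from budget, history, or candidate statistics.
HIGH_VALUE_THRESHOLD = 0.5

@dataclass
class Candidate:
    action: str                        # e.g. "tensor-parallel-size=4"
    expected_bottleneck_relief: float  # knob-effect estimate for the ranked bottleneck
    information_gain: float            # higher for unmeasured regions of the search space
    launch_safety: float               # legality / launch-risk estimate
    regression_risk: float             # chance of regressing the incumbent
    tested: bool = False

def score(c: Candidate, bottleneck_confidence: float) -> float:
    """Generic candidate score: relief weighted by diagnosis confidence,
    plus exploration value and launch safety, minus regression risk."""
    return (
        c.expected_bottleneck_relief * bottleneck_confidence
        + c.information_gain
        + c.launch_safety
        - c.regression_risk
    )

def blocks_early_stop(candidates: list[Candidate], bottleneck_confidence: float) -> bool:
    """Step 5 of the loop: convergence/validation stop stays blocked
    while a high-value candidate remains untested."""
    return any(
        not c.tested and score(c, bottleneck_confidence) > HIGH_VALUE_THRESHOLD
        for c in candidates
    )
```

The point of the sketch is the invariant, not the exact arithmetic: nothing in the score references a model name or an SLO constant; only the bottleneck class, its confidence, and the legal tunable space feed it.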
## Verification

Local:

```bash
python3 -m compileall -q src tests
PYTHONPATH=src python3 -m unittest tests.test_core_flow
```

Result: 93 tests passed.

## Next Experiment

Run the same qwen3.5-27b chat 0-8k setup as the current ablation baseline:

- workload: chat, input length 0-8k
- SLO: TTFT p95 <= 4000ms, TPOT p95 <= 25ms, target pass rate 0.95
- search: full range, `inherit_incumbent_floor=false`
- budget: 12 total tuning iterations
- LLM model: `gpt-5.4`
- variant: harness enabled with profile-driven planner

The no-harness min-prompt baseline is already available and only needs to be reused for comparison unless the setup changes.

## Experiment Started

Started on `dash0` (`11.73.2.172`) at commit `17e9681`.

- tmux session: `qwen27b-profileplanner-harness-20260512`
- spec: `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-20260512.json`
- study id: `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-20260512`
- log: `.aituner-tight/logs/qwen27b-profileplanner-harness-20260512.log`
- status at launch check: `trial-0001` baseline is running under AITuner; no manual intervention in the tuning loop.

## V1 Early Stop

The first profile-planner run was stopped before its results were accepted as evidence. A read-only replay of its completed baseline probe history showed that the planner would choose `max-num-seqs=64` for iter2. That was a diagnosis bug:

- `slo_pass_rate_unrecoverable` is a binary-search early-stop summary, not a bottleneck class.
- The harness was counting that summary as an admission/queueing failure.
- Because this count dominated the real TTFT/TPOT failure counts, the planner selected a concurrency action instead of testing TP.

Fix commit: `e3ed775`. The fix excludes `slo_pass_rate_unrecoverable` from the admission/queueing failure bucket. With the same baseline probe history, the planner now ranks `ttft_prefill` first and proposes `tensor-parallel-size=2` for iter2.

## V2 Experiment Started

Started on `dash0` (`11.73.2.172`) at commit `e3ed775`.

- tmux session: `qwen27b-profileplanner-v2-harness-20260512`
- spec: `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-v2-20260512.json`
- study id: `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-v2-20260512`
- log: `.aituner-tight/logs/qwen27b-profileplanner-v2-harness-20260512.log`
- monitor: read-only subagent `Wegener`

Acceptance for this run is based on end-to-end trial results, not unit tests. If the first four trials lag the min-prompt no-harness baseline (`0.0650`, `0.1992`, `0.2696`, then failed/NA), the run should be treated as a failed harness iteration and the harness should be optimized again.

## V2 Result And Failure

V2 was stopped early after four trials because it did not improve on the no-harness baseline and made a preventable launch-risk proposal. Raw `request_rate/GPU`:

| Variant | iter1 | iter2 | iter3 | iter4 |
| --- | ---: | ---: | ---: | ---: |
| no-harness min-prompt | 0.0650 | 0.1992 | 0.2696 | 0.2696 |
| harness v2 | 0.0650 | 0.1992 | 0.2696 | failed |

Harness v2 did correctly diagnose the first bottleneck and proposed:

- iter2: `tensor-parallel-size=2`, raw `0.1992 req/s/GPU`;
- iter3: `tensor-parallel-size=4`, raw `0.2696 req/s/GPU`.

However, iter4 proposed `tensor-parallel-size=8` and failed at engine launch. The study's `hardware.gpu_count` is 8, but the launch environment sets `CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7`, which exposes only 7 GPUs. TP=8 therefore should never have been considered launch-safe. This is a general harness bug: topology planning must use the effective visible GPU count from the execution profile, not just the nominal hardware count.

Fix (sketched after the list):

- parse `engine.base_envs.CUDA_VISIBLE_DEVICES`;
- compute the effective GPU count as `min(hardware.gpu_count, visible_device_count)`;
- filter topology candidates and adjacent TP frontier candidates by the effective GPU count.
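A minimal sketch of that computation, with hypothetical helper names (`effective_gpu_count`, `legal_tp_candidates`); in the harness these inputs would come from the execution profile rather than function arguments:

```python
def effective_gpu_count(hardware_gpu_count: int, base_envs: dict[str, str]) -> int:
    """Cap the nominal hardware GPU count by CUDA_VISIBLE_DEVICES when it is set.
    An unset or empty variable means all hardware GPUs are visible."""
    visible = base_envs.get("CUDA_VISIBLE_DEVICES", "").strip()
    if not visible:
        return hardware_gpu_count
    visible_count = len([d for d in visible.split(",") if d.strip()])
    return min(hardware_gpu_count, visible_count)

def legal_tp_candidates(effective_gpus: int, tp_options=(1, 2, 4, 8)) -> list[int]:
    """Drop TP degrees that cannot launch on the effective visible GPU count."""
    return [tp for tp in tp_options if tp <= effective_gpus]

# The v2 failure case: gpu_count=8 but only 7 devices visible -> TP=8 is filtered out.
assert effective_gpu_count(8, {"CUDA_VISIBLE_DEVICES": "0,1,2,4,5,6,7"}) == 7
assert legal_tp_candidates(7) == [1, 2, 4]
```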
## GPU Visibility Correction

On 2026-05-13 we corrected the intended experiment setup: `CUDA_VISIBLE_DEVICES` should be `0,1,2,3,4,5,6,7`, not the previous `0,1,2,4,5,6,7`. This invalidates direct comparison between the old `gpu3skip` runs and the new 8-GPU runs. The old v2 failure was real under the old visible-device profile, but it was not the intended 8-card H20 setup.

New comparable studies:

| Variant | Study ID | Status |
| --- | --- | --- |
| no-harness baseline | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-noharness-minprompt-gpt54-20260513` | running first |
| harness | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-harness-profileplanner-20260513` | queued to run after baseline |

Both specs set:

- `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`
- model endpoint: `gpt-5.4`
- workload: qwen3.5-27b chat 0-8k
- SLO: TTFT p95 <= 4000ms, TPOT p95 <= 25ms, target pass rate 0.95
- search: full range, `inherit_incumbent_floor=false`

The no-harness baseline is running in tmux session `qwen27b-gpu8-noharness-20260513`. The harness run should only be started after the no-harness baseline finishes or reaches a sufficient early comparison point, because both need the full GPU host and should not run concurrently.
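To guard against another setup mismatch, a quick read-only check that the two specs differ only where intended might look like the following. The file paths assume the `.aituner-tight/specs/<study-id>.json` naming used earlier in this log, and the flattened key names are whatever the spec schema actually contains:

```python
import json

# Assumed paths, derived from the study IDs above.
BASELINE = ".aituner-tight/specs/dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-noharness-minprompt-gpt54-20260513.json"
HARNESS = ".aituner-tight/specs/dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-harness-profileplanner-20260513.json"

def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted-key/value pairs for easy diffing."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items

with open(BASELINE) as f:
    base = flatten(json.load(f))
with open(HARNESS) as f:
    harness = flatten(json.load(f))

# Report every key whose value differs; only the harness toggle (and
# study-identity fields like the study id) should show up.
for key in sorted(set(base) | set(harness)):
    if base.get(key) != harness.get(key):
        print(f"{key}: baseline={base.get(key)!r} harness={harness.get(key)!r}")
```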