Profile-Driven Harness Implementation Log
Date: 2026-05-12
Goal
The harness should accelerate AITuner as a general tuning system, not as a collection of case-specific rules. The current implementation moves the harness toward a performance-engineering loop:
- extract a compact profile from each measured trial;
- rank bottleneck hypotheses from workload and probe evidence;
- generate generic candidate actions from a knob-effect model;
- score candidates by expected bottleneck relief, information gain, launch safety, and regression risk;
- block early stop while a high-value untested candidate remains.
This is intended to apply across qwen3.5-27b chat, qwen3-235b prefill-only, qwen3-235b decode-only, and different SLOs without encoding model names, SLO constants, or known winning configs.
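The stop rule in the last bullet can be sketched as follows. This is a minimal illustration, not the code in harness.py: the `Candidate` class, `next_candidate`, `may_stop`, and the `min_score` threshold are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str      # hypothetical label for a generic action
    score: float   # planner score for this candidate
    tested: bool   # True once a trial has measured it


def next_candidate(candidates: List[Candidate], min_score: float) -> Optional[Candidate]:
    """Return the best untested candidate at or above min_score, else None."""
    open_candidates = [c for c in candidates if not c.tested and c.score >= min_score]
    return max(open_candidates, key=lambda c: c.score, default=None)


def may_stop(converged: bool, candidates: List[Candidate], min_score: float) -> bool:
    """Allow a stop only if converged and no high-value untested candidate remains."""
    return converged and next_candidate(candidates, min_score) is None
```

With this shape, a convergence signal alone is not enough to stop: the planner must also have run out of untested candidates above the threshold.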
Code Changes
- src/aituner/harness.py
  - Added trial_profiles to normalize trial topology, performance, probe failures, latency quantiles, and launch failure evidence.
  - Added bottleneck_hypotheses, a ranked list instead of a single active bottleneck label.
  - Added candidate_actions, generated from topology and runtime knob families.
  - Added experiment_plan, which selects the next high-score candidate or declares stop readiness.
  - Updated harness proposal generation to prefer the profile-driven next action before falling back to legacy deterministic proposal code.
  - Updated harness stop logic so convergence/validation stop is blocked when the planner still has a high-value untested candidate.
- tests/test_core_flow.py
  - Added coverage that a strong TP=2 incumbent with TTFT pressure still selects an unmeasured TP=4 topology candidate.
  - Added coverage that decode-only TPOT pressure at max TP can prefer lowering max-num-seqs instead of blindly lowering TP.
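To illustrate what a ranked list of bottleneck hypotheses might look like, here is a small sketch. It assumes a profile dict with hypothetical keys `ttft_p95_ms`, `tpot_p95_ms`, and `queue_fraction`, and it ranks classes by relative SLO pressure. The actual evidence and weighting used in harness.py are not described in this log, so treat the heuristic as a placeholder.

```python
from typing import Dict, List, Tuple


def rank_bottlenecks(profile: Dict[str, float], slo: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank bottleneck classes by normalized pressure (hypothetical heuristic)."""
    pressure = {
        # A ratio above 1.0 means the measured p95 exceeds its SLO limit.
        "prefill_ttft": profile["ttft_p95_ms"] / slo["ttft_p95_ms"],
        "decode_tpot": profile["tpot_p95_ms"] / slo["tpot_p95_ms"],
        # Assumed field: share of request latency spent waiting for admission.
        "admission_queueing": profile["queue_fraction"],
    }
    total = sum(pressure.values())
    if total <= 0:
        return []
    # Normalize so the values sum to 1.0 and can be read as confidences.
    ranked = [(name, value / total) for name, value in pressure.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Returning every class with a confidence, rather than only the top label, lets a later scoring step weigh actions against the second-ranked hypothesis as well.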
Current Scoring Model
The candidate score is intentionally generic:
score = expected_bottleneck_relief * bottleneck_confidence
      + information_gain
      + launch_safety
      - regression_risk
Examples:
- TTFT/prefill bottleneck: increasing TP and prefill batching candidates receive relief score.
- Decode TPOT bottleneck: increasing TP is useful if a higher legal TP exists; if already at high TP, lowering decode concurrency can become the higher-value candidate.
- Admission/queueing bottleneck: more DP or higher safe concurrency receives relief score.
The scores are not tied to qwen27b/qwen235b or a fixed TPOT/TTFT threshold. They are tied to the measured bottleneck class and legal tunable space.
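The formula and the decode TPOT example can be sketched together. The `Action` class, the helper names, and every numeric weight below are assumptions made for illustration; only the score expression itself comes from this log.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    name: str                 # generic action label, no model names
    relief: float             # expected_bottleneck_relief
    information_gain: float
    launch_safety: float
    regression_risk: float


def score(action: Action, bottleneck_confidence: float) -> float:
    """Score formula as written in this log."""
    return (action.relief * bottleneck_confidence
            + action.information_gain
            + action.launch_safety
            - action.regression_risk)


def decode_tpot_actions(tp: int, legal_tps: List[int]) -> List[Action]:
    """Generate candidates for a decode TPOT bottleneck (illustrative weights)."""
    actions = [Action("lower max-num-seqs", 0.5, 0.2, 0.3, 0.1)]
    # Raising TP is only offered when a higher legal TP exists.
    if any(t > tp for t in legal_tps):
        actions.append(Action("raise TP", 0.8, 0.3, 0.2, 0.1))
    return actions


def best_action(actions: List[Action], bottleneck_confidence: float) -> Action:
    """Pick the highest-scoring action for the given hypothesis confidence."""
    return max(actions, key=lambda a: score(a, bottleneck_confidence))
```

Under these made-up weights, a TP=2 trial with legal TPs 1, 2, and 4 selects "raise TP", while a trial already at TP=4 selects "lower max-num-seqs", because the TP action is never generated.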
Verification
Local:
python3 -m compileall -q src tests
PYTHONPATH=src python3 -m unittest tests.test_core_flow
Result: 93 tests passed.
Next Experiment
Run the same qwen3.5-27b chat 0-8k setup as the current ablation baseline:
- workload: chat, input length 0-8k
- SLO: TTFT p95 <= 4000ms, TPOT p95 <= 25ms, target pass rate 0.95
- search: full range, inherit_incumbent_floor=false
- budget: 12 total tuning iterations
- LLM model: gpt-5.4
- variant: harness enabled with profile-driven planner
The no-harness min-prompt baseline is already available and can be reused for comparison unless the setup changes.
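For reference, the latency part of the SLO above could be checked as sketched below. The nearest-rank percentile method and the function names are assumptions; this log does not say how AITuner computes p95, and the target pass rate of 0.95 is not modeled here.

```python
from typing import List


def percentile_nearest_rank(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer in 1..100."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(ttft_ms: List[float], tpot_ms: List[float],
                      ttft_limit_ms: float = 4000.0,
                      tpot_limit_ms: float = 25.0) -> bool:
    """True if both p95 latencies are within their limits."""
    return (percentile_nearest_rank(ttft_ms, 95) <= ttft_limit_ms
            and percentile_nearest_rank(tpot_ms, 95) <= tpot_limit_ms)
```

The default limits mirror the SLO listed for this experiment; a different SLO only changes the two limit arguments.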
Experiment Started
Started on dash0 (11.73.2.172) at commit 17e9681.
- tmux session: qwen27b-profileplanner-harness-20260512
- spec: .aituner-tight/specs/dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-20260512.json
- study id: dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-20260512
- log: .aituner-tight/logs/qwen27b-profileplanner-harness-20260512.log
- status at launch check: trial-0001 baseline is running under AITuner; no manual intervention in the tuning loop.