# Profile-Driven Harness Implementation Log

Date: 2026-05-12

## Goal

The harness should accelerate AITuner as a general tuning system, not as a collection of case-specific rules. The current implementation moves the harness toward a performance-engineering loop:

1. extract a compact profile from each measured trial;
2. rank bottleneck hypotheses from workload and probe evidence;
3. generate generic candidate actions from a knob-effect model;
4. score candidates by expected bottleneck relief, information gain, launch safety, and regression risk;
5. block early stop while a high-value untested candidate remains.

This is intended to apply across qwen3.5-27b chat, qwen3-235b prefill-only, qwen3-235b decode-only, and different SLOs without encoding model names, SLO constants, or known winning configs.

## Code Changes

- `src/aituner/harness.py`
  - Added `trial_profiles` to normalize trial topology, performance, probe failures, latency quantiles, and launch failure evidence.
  - Added `bottleneck_hypotheses`, a ranked list instead of a single active bottleneck label.
  - Added `candidate_actions`, generated from topology and runtime knob families.
  - Added `experiment_plan`, which selects the next high-score candidate or declares stop readiness.
  - Updated harness proposal generation to prefer the profile-driven next action before falling back to legacy deterministic proposal code.
  - Updated harness stop logic so convergence/validation stop is blocked when the planner still has a high-value untested candidate.

- `tests/test_core_flow.py`
  - Added coverage that a strong TP=2 incumbent with TTFT pressure still selects an unmeasured TP=4 topology candidate.
  - Added coverage that decode-only TPOT pressure at max TP can prefer lowering `max-num-seqs` instead of blindly lowering TP.

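
The interaction between `experiment_plan` and the stop logic can be illustrated with a minimal sketch. This is not the code in `src/aituner/harness.py`: the `Candidate` class, the `plan_next` and `may_stop` helpers, the threshold parameter, and all scores are hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate action; field names are illustrative only."""
    name: str
    score: float
    tested: bool


def plan_next(candidates: list[Candidate], high_value_threshold: float) -> Optional[Candidate]:
    """Return the highest-scoring untested candidate at or above the threshold, else None."""
    eligible = [c for c in candidates if not c.tested and c.score >= high_value_threshold]
    return max(eligible, key=lambda c: c.score, default=None)


def may_stop(converged: bool, candidates: list[Candidate], high_value_threshold: float) -> bool:
    """Allow a convergence stop only when no high-value untested candidate remains."""
    return converged and plan_next(candidates, high_value_threshold) is None


# Illustrative state: a measured TP=2 incumbent and an unmeasured TP=4 candidate.
candidates = [
    Candidate("tensor-parallel-size=2", score=0.9, tested=True),
    Candidate("tensor-parallel-size=4", score=0.8, tested=False),
    Candidate("max-num-seqs=64", score=0.3, tested=False),
]
next_action = plan_next(candidates, high_value_threshold=0.5)
stop_allowed = may_stop(True, candidates, high_value_threshold=0.5)
```

With these made-up scores, `next_action` is the TP=4 candidate and `stop_allowed` is `False` even though convergence was reported, which mirrors the first test case added to `tests/test_core_flow.py`.
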
## Current Scoring Model

The candidate score is intentionally generic:

```text
score = expected_bottleneck_relief * bottleneck_confidence
      + information_gain
      + launch_safety
      - regression_risk
```

Examples:

- TTFT/prefill bottleneck: candidates that increase TP or prefill batching receive relief score.
- Decode TPOT bottleneck: increasing TP is useful if a higher legal TP exists; if TP is already high, lowering decode concurrency can become the higher-value candidate.
- Admission/queueing bottleneck: more DP or higher safe concurrency receives relief score.

The scores are not tied to qwen27b/qwen235b or to a fixed TPOT/TTFT threshold. They are tied to the measured bottleneck class and the legal tunable space.

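
As a worked illustration of the score formula, the sketch below evaluates two candidates for a decode TPOT bottleneck at high TP. The `CandidateEstimate` class and `candidate_score` function are hypothetical names, and every number is invented for the example; the log does not specify the scale or range of the individual terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateEstimate:
    """Inputs to the score formula; the value ranges are an assumption."""
    expected_bottleneck_relief: float
    bottleneck_confidence: float
    information_gain: float
    launch_safety: float
    regression_risk: float


def candidate_score(e: CandidateEstimate) -> float:
    """Relief weighted by confidence, plus exploration and safety terms, minus risk."""
    return (
        e.expected_bottleneck_relief * e.bottleneck_confidence
        + e.information_gain
        + e.launch_safety
        - e.regression_risk
    )


# Illustrative numbers only: TP is already high, so lowering decode
# concurrency is assumed to give more relief at lower regression risk.
lower_max_num_seqs = CandidateEstimate(0.5, 0.5, 0.25, 0.25, 0.125)
lower_tp = CandidateEstimate(0.125, 0.5, 0.25, 0.25, 0.5)

ranked = sorted(
    [("lower max-num-seqs", lower_max_num_seqs), ("lower TP", lower_tp)],
    key=lambda item: candidate_score(item[1]),
    reverse=True,
)
```

Under these assumed inputs the concurrency candidate ranks first. Nothing in the function refers to a model name or SLO constant; only the estimates fed into it change between workloads.
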
## Verification

Local:

```bash
python3 -m compileall -q src tests
PYTHONPATH=src python3 -m unittest tests.test_core_flow
```

Result: `93` tests passed.

## Next Experiment

Run the same qwen3.5-27b chat 0-8k setup as the current ablation baseline:

- workload: chat, input length 0-8k
- SLO: TTFT p95 <= 4000ms, TPOT p95 <= 25ms, target pass rate 0.95
- search: full range, `inherit_incumbent_floor=false`
- budget: 12 total tuning iterations
- LLM model: `gpt-5.4`
- variant: harness enabled with profile-driven planner

The no-harness min-prompt baseline is already available and only needs to be reused for comparison unless the setup changes.

## Experiment Started

Started on `dash0` (`11.73.2.172`) at commit `17e9681`.

- tmux session: `qwen27b-profileplanner-harness-20260512`
- spec: `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-20260512.json`
- study id: `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-20260512`
- log: `.aituner-tight/logs/qwen27b-profileplanner-harness-20260512.log`
- status at launch check: `trial-0001` baseline is running under AITuner; no manual intervention in the tuning loop.

## V1 Early Stop

The first profile-planner run was stopped before accepting it as evidence. A read-only replay of its completed baseline probe history showed that the planner would choose `max-num-seqs=64` for iter2. That was a diagnosis bug:

- `slo_pass_rate_unrecoverable` is a binary-search early-stop summary, not a bottleneck class.
- The harness was counting that summary as an admission/queueing failure.
- Because this count dominated the real TTFT/TPOT failure counts, the planner selected a concurrency action instead of testing TP.

Fix commit: `e3ed775`.

The fix excludes `slo_pass_rate_unrecoverable` from the admission/queueing failure bucket. With the same baseline probe history, the planner now ranks `ttft_prefill` first and proposes `tensor-parallel-size=2` for iter2.

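
The effect of the fix can be sketched as a small failure-bucketing routine. This is an illustration under stated assumptions, not the code from commit `e3ed775`: only `slo_pass_rate_unrecoverable` and `ttft_prefill` are names taken from this log, while the other reason strings, the bottleneck labels, and the `rank_bottlenecks` function are hypothetical.

```python
from collections import Counter

# Hypothetical mapping from probe failure reasons to bottleneck classes.
FAILURE_TO_BOTTLENECK = {
    "ttft_slo_miss": "ttft_prefill",
    "tpot_slo_miss": "decode_tpot",
    "queue_timeout": "admission_queueing",
}

# Summaries that describe how the search ended, not why requests failed.
NON_DIAGNOSTIC = frozenset({"slo_pass_rate_unrecoverable"})


def rank_bottlenecks(failure_reasons):
    """Count failures per bottleneck class, skipping non-diagnostic summaries."""
    counts = Counter()
    for reason in failure_reasons:
        if reason in NON_DIAGNOSTIC:
            continue  # an early-stop summary must not vote for any bottleneck
        bottleneck = FAILURE_TO_BOTTLENECK.get(reason)
        if bottleneck is not None:
            counts[bottleneck] += 1
    # most_common() returns (label, count) pairs sorted by descending count
    return counts.most_common()


# Illustrative probe history in which the summary outnumbers the real failures.
history = (
    ["slo_pass_rate_unrecoverable"] * 5
    + ["ttft_slo_miss"] * 3
    + ["tpot_slo_miss"] * 1
)
ranking = rank_bottlenecks(history)
```

In this made-up history the summary appears more often than any real failure, so counting it under admission/queueing would have put that class on top. With the exclusion in place, `ttft_prefill` ranks first.
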
## V2 Experiment Started

Started on `dash0` (`11.73.2.172`) at commit `e3ed775`.

- tmux session: `qwen27b-profileplanner-v2-harness-20260512`
- spec: `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-v2-20260512.json`
- study id: `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-v2-20260512`
- log: `.aituner-tight/logs/qwen27b-profileplanner-v2-harness-20260512.log`
- monitor: read-only subagent `Wegener`

Acceptance for this run is based on end-to-end trial results, not unit tests. If the first four trials lag the min-prompt no-harness baseline (`0.0650`, `0.1992`, `0.2696`, then failed/NA), the run should be treated as a failed harness iteration and the harness should be optimized again.

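
One possible way to make the acceptance check mechanical is sketched below. The baseline values come from this log; everything else is an assumption: the log does not define "lag", so the sketch treats the metric as higher-is-better and compares the best result within the first four trials. The `best_so_far` and `harness_lags` helpers are hypothetical.

```python
from typing import Optional, Sequence

# Baseline results for the first four trials, from this log; None = failed/NA.
BASELINE_FIRST_FOUR = [0.0650, 0.1992, 0.2696, None]


def best_so_far(results: Sequence[Optional[float]]) -> Optional[float]:
    """Best successful result, or None if every trial failed."""
    valid = [r for r in results if r is not None]
    return max(valid) if valid else None


def harness_lags(
    harness: Sequence[Optional[float]],
    baseline: Sequence[Optional[float]] = BASELINE_FIRST_FOUR,
    window: int = 4,
) -> bool:
    """True if the harness run's best result in the window is below the baseline's.

    Assumes a higher-is-better metric; this definition of "lag" is not from the log.
    """
    h = best_so_far(harness[:window])
    b = best_so_far(baseline[:window])
    if h is None:
        return True  # no successful harness trial in the window
    if b is None:
        return False  # baseline has nothing to beat
    return h < b
```

A stricter per-trial comparison would be equally defensible; the point of the sketch is only that the criterion should be fixed before the results are read.
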