
Profile-Driven Harness Implementation Log

Date: 2026-05-12

Goal

The harness should accelerate AITuner as a general tuning system, not as a collection of case-specific rules. The current implementation moves the harness toward a performance-engineering loop:

  1. extract a compact profile from each measured trial;
  2. rank bottleneck hypotheses from workload and probe evidence;
  3. generate generic candidate actions from a knob-effect model;
  4. score candidates by expected bottleneck relief, information gain, launch safety, and regression risk;
  5. block early stop while a high-value untested candidate remains.

This is intended to apply across qwen3.5-27b chat, qwen3-235b prefill-only, qwen3-235b decode-only, and different SLOs without encoding model names, SLO constants, or known winning configs.
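
As a minimal sketch of steps 4 and 5, assuming candidates arrive with precomputed scores; the `Candidate` dataclass, the `next_action` name, and the `stop_score` threshold are all illustrative, not the actual harness API:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    knobs: dict          # e.g. {"tensor-parallel-size": 4}
    score: float         # output of the generic scoring model (see below)
    measured: bool = False

def next_action(candidates: list[Candidate], stop_score: float = 0.5):
    """Steps 4/5: return the best untested candidate, or None when stop-ready.

    While this returns a candidate, convergence/validation early stop
    stays blocked.
    """
    untested = [c for c in candidates if not c.measured]
    best = max(untested, key=lambda c: c.score, default=None)
    return best if best is not None and best.score >= stop_score else None

# A strong measured incumbent does not unblock stop while a high-value
# unmeasured topology candidate remains.
plan = next_action([
    Candidate({"tensor-parallel-size": 2}, score=0.9, measured=True),
    Candidate({"tensor-parallel-size": 4}, score=0.8),
])
assert plan is not None and plan.knobs["tensor-parallel-size"] == 4
```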

Code Changes

  • src/aituner/harness.py

    • Added trial_profiles to normalize trial topology, performance, probe failures, latency quantiles, and launch-failure evidence (see the data-model sketch after this list).
    • Added bottleneck_hypotheses, a ranked list instead of a single active bottleneck label.
    • Added candidate_actions, generated from topology and runtime knob families.
    • Added experiment_plan, which selects the next high-score candidate or declares stop readiness.
    • Updated harness proposal generation to prefer the profile-driven next action before falling back to legacy deterministic proposal code.
    • Updated harness stop logic so convergence/validation stop is blocked when the planner still has a high-value untested candidate.
  • tests/test_core_flow.py

    • Added coverage that a strong TP=2 incumbent under TTFT pressure still selects an unmeasured TP=4 topology candidate.
    • Added coverage that, under decode-only TPOT pressure at max TP, the planner can prefer lowering max-num-seqs rather than blindly lowering TP.
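
For illustration, a sketch of what a normalized per-trial profile might carry; the field names are assumptions read off the trial_profiles description above, not the actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TrialProfile:
    """Compact per-trial evidence record (illustrative field set)."""
    topology: dict                     # e.g. {"tp": 2, "dp": 1}
    performance: dict                  # e.g. {"req_rate_per_gpu": 0.1992}
    probe_failures: list = field(default_factory=list)     # failure labels
    latency_quantiles: dict = field(default_factory=dict)  # {"ttft_p95_ms": ...}
    launch_failed: bool = False        # launch-failure evidence
```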

Current Scoring Model

The candidate score is intentionally generic:

score = expected_bottleneck_relief * bottleneck_confidence
      + information_gain
      + launch_safety
      - regression_risk

Examples:

  • TTFT/prefill bottleneck: candidates that increase TP or improve prefill batching receive relief credit.
  • Decode TPOT bottleneck: increasing TP is useful if a higher legal TP exists; if already at high TP, lowering decode concurrency can become the higher-value candidate.
  • Admission/queueing bottleneck: candidates that add DP or raise safe concurrency receive relief credit.

The scores are not tied to qwen27b/qwen235b or to fixed TPOT/TTFT thresholds; they depend only on the measured bottleneck class and the legal tunable space.
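
To make the shape concrete, a sketch of the score plus a hypothetical relief table keyed by (bottleneck class, knob family); all numeric weights below are invented for illustration and do not come from harness.py:

```python
# Generic score; note that nothing here names a model or a fixed SLO threshold.
def score_candidate(relief: float, confidence: float, info_gain: float,
                    launch_safety: float, regression_risk: float) -> float:
    return relief * confidence + info_gain + launch_safety - regression_risk

# Hypothetical relief weights per (bottleneck class, knob family); the real
# knob-effect model lives in harness.py and its weights are not shown here.
RELIEF = {
    ("ttft_prefill", "increase_tp"): 1.0,
    ("ttft_prefill", "prefill_batching"): 0.7,
    ("decode_tpot", "increase_tp"): 0.8,   # only if a higher legal TP exists
    ("decode_tpot", "lower_decode_concurrency"): 0.6,
    ("admission_queueing", "increase_dp"): 0.8,
    ("admission_queueing", "raise_safe_concurrency"): 0.6,
}

# e.g. a TP-increase candidate under a confident ttft_prefill diagnosis:
s = score_candidate(RELIEF[("ttft_prefill", "increase_tp")],
                    confidence=0.8, info_gain=0.3,
                    launch_safety=0.2, regression_risk=0.1)  # -> 1.2
```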

Verification

Local:

python3 -m compileall -q src tests
PYTHONPATH=src python3 -m unittest tests.test_core_flow

Result: 93 tests passed.

Next Experiment

Run the same qwen3.5-27b chat 0-8k setup as the current ablation baseline:

  • workload: chat, input length 0-8k
  • SLO: TTFT p95 <= 4000ms, TPOT p95 <= 25ms, target pass rate 0.95
  • search: full range, inherit_incumbent_floor=false
  • budget: 12 total tuning iterations
  • LLM model: gpt-5.4
  • variant: harness enabled with profile-driven planner

The no-harness min-prompt baseline is already available and can be reused for comparison as long as the setup is unchanged.

Experiment Started

Started on dash0 (11.73.2.172) at commit 17e9681.

  • tmux session: qwen27b-profileplanner-harness-20260512
  • spec: .aituner-tight/specs/dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-20260512.json
  • study id: dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-20260512
  • log: .aituner-tight/logs/qwen27b-profileplanner-harness-20260512.log
  • status at launch check: trial-0001 baseline is running under AITuner; no manual intervention in the tuning loop.

V1 Early Stop

The first profile-planner run was stopped before its results were accepted as evidence. A read-only replay of its completed baseline probe history showed that the planner would choose max-num-seqs=64 for iter2. That choice traced back to a diagnosis bug:

  • slo_pass_rate_unrecoverable is a binary-search early-stop summary, not a bottleneck class.
  • The harness was counting that summary as an admission/queueing failure.
  • Because this count dominated the real TTFT/TPOT failure counts, the planner selected a concurrency action instead of testing TP.

Fix commit: e3ed775.

The fix excludes slo_pass_rate_unrecoverable from the admission/queueing failure bucket. With the same baseline probe history, the planner now ranks ttft_prefill first and proposes tensor-parallel-size=2 for iter2.
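
A minimal sketch of the corrected bucketing, assuming string failure labels; every label below except slo_pass_rate_unrecoverable is hypothetical:

```python
# Labels that count as genuine admission/queueing evidence. The label
# "slo_pass_rate_unrecoverable" is deliberately absent: it is a
# binary-search early-stop summary, not a bottleneck class.
ADMISSION_QUEUEING_FAILURES = {"queue_timeout", "admission_reject"}

def admission_failure_count(probe_failures: list[str]) -> int:
    """Count queueing evidence without letting search summaries dominate
    the real TTFT/TPOT failure counts."""
    return sum(f in ADMISSION_QUEUEING_FAILURES for f in probe_failures)

assert admission_failure_count(
    ["slo_pass_rate_unrecoverable", "queue_timeout"]) == 1
```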

V2 Experiment Started

Started on dash0 (11.73.2.172) at commit e3ed775.

  • tmux session: qwen27b-profileplanner-v2-harness-20260512
  • spec: .aituner-tight/specs/dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-v2-20260512.json
  • study id: dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-profileplanner-v2-20260512
  • log: .aituner-tight/logs/qwen27b-profileplanner-v2-harness-20260512.log
  • monitor: read-only subagent Wegener

Acceptance for this run is based on end-to-end trial results, not unit tests. If the first four trials lag the min-prompt no-harness baseline (0.0650, 0.1992, 0.2696, then failed/NA), the run should be treated as a failed harness iteration and the harness should be optimized again.

V2 Result And Failure

V2 was stopped early after four trials because it did not improve on the no-harness baseline and made a preventable, launch-unsafe proposal.

Raw request_rate/GPU:

| Variant | iter1 | iter2 | iter3 | iter4 |
| --- | --- | --- | --- | --- |
| no-harness min-prompt | 0.0650 | 0.1992 | 0.2696 | 0.2696 |
| harness v2 | 0.0650 | 0.1992 | 0.2696 | failed |

Harness v2 did correctly diagnose the first bottleneck and proposed:

  • iter2: tensor-parallel-size=2, raw 0.1992 req/s/GPU;
  • iter3: tensor-parallel-size=4, raw 0.2696 req/s/GPU.

However, iter4 proposed tensor-parallel-size=8 and failed at engine launch. The study's hardware.gpu_count is 8, but the launch environment sets CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7, which exposes only 7 GPUs. Therefore TP=8 should not have been considered launch-safe.

This is a general harness bug: topology planning must use the effective visible GPU count from the execution profile, not just the nominal hardware count.

Fix (sketched after this list):

  • parse engine.base_envs.CUDA_VISIBLE_DEVICES;
  • compute effective GPU count as min(hardware.gpu_count, visible_device_count);
  • filter topology candidates and adjacent TP frontier candidates by the effective GPU count.
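
A minimal sketch of that computation (helper names are illustrative; the parsing rule itself is the fix described above):

```python
def effective_gpu_count(hardware_gpu_count: int, base_envs: dict) -> int:
    """min(nominal count, visible devices); base_envs mirrors engine.base_envs."""
    cvd = base_envs.get("CUDA_VISIBLE_DEVICES")
    if cvd is None:                  # unset: all nominal GPUs are visible
        return hardware_gpu_count
    visible = len([d for d in cvd.split(",") if d.strip()])
    return min(hardware_gpu_count, visible)

def legal_tp_candidates(effective_gpus: int, tp_values=(1, 2, 4, 8)):
    """Filter topology and adjacent TP-frontier candidates by effective count."""
    return [tp for tp in tp_values if tp <= effective_gpus]

# The failing v2 environment exposes only 7 GPUs, so TP=8 is filtered out.
assert effective_gpu_count(8, {"CUDA_VISIBLE_DEVICES": "0,1,2,4,5,6,7"}) == 7
assert legal_tp_candidates(7) == [1, 2, 4]
```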

GPU Visibility Correction

On 2026-05-13 we corrected the intended experiment setup: CUDA_VISIBLE_DEVICES should be 0,1,2,3,4,5,6,7, not the previous 0,1,2,4,5,6,7.

This invalidates direct comparison between the old gpu3skip runs and new 8-GPU runs. The old v2 failure was real under the old visible-device profile, but it was not the intended 8-card H20 setup.

New comparable studies:

| Variant | Study ID | Status |
| --- | --- | --- |
| no-harness baseline | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-noharness-minprompt-gpt54-20260513 | running first |
| harness | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-harness-profileplanner-20260513 | queued to run after baseline |

Both specs set:

  • CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  • model endpoint: gpt-5.4
  • workload: qwen3.5-27b chat 0-8k
  • SLO: TTFT p95 <= 4000ms, TPOT p95 <= 25ms, target pass rate 0.95
  • search: full range, inherit_incumbent_floor=false

The no-harness baseline is running in tmux session qwen27b-gpu8-noharness-20260513. The harness run should be started only after the no-harness baseline finishes or reaches a sufficient early comparison point; both need the full GPU host and must not run concurrently.