Harness-Guided AITuner Progress

Goal

Improve AITuner convergence for the dash0 internal vLLM + Qwen3.5-27B 0-8k chat study. In the prior 12-iteration run, the tuner could still propose worse configs after it had already found good ones. The new harness should make config proposals bottleneck-directed and stop spending GPU trials once no adjacent harness-guided probe is justified.

Paper Alignment

  • Prompt structure now includes an explicit [Harnesses] section aligned with paper Figure 12.
  • The harness uses the paper's L-C-A workload model (a sketch of the feature computation follows this list):
    • L: prompt length percentiles and tail ratio.
    • C: prefix/KV-cache reuse estimated from repeated hash_ids blocks when available.
    • A: request rate, 1-second QPS burst ratio, and interarrival CV.
  • Knob rules follow the paper's Figure 13 style:
    • map active bottleneck to a knob family;
    • probe adjacent legal choices;
    • enforce guard conditions to avoid harmful side effects;
    • prefer stopping over weak exploratory proposals after convergence.
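
The L-C-A features can be derived with plain trace statistics. The sketch below is not the actual summarize_window code: it assumes each request record exposes prompt_tokens, an arrival timestamp, and optional hash_ids blocks, and the helper and key names are illustrative.

```python
import statistics

def pct(sorted_vals, p):
    """Nearest-rank percentile over pre-sorted values."""
    return sorted_vals[min(int(p * (len(sorted_vals) - 1)), len(sorted_vals) - 1)]

def lca_features(requests, bucket_s=1.0):
    """Illustrative L-C-A summary over a trace window (not the real summarize_window)."""
    # L: prompt-length percentiles and tail ratio (p95 / p50).
    lengths = sorted(r["prompt_tokens"] for r in requests)
    L = {"p50": pct(lengths, 0.50), "p95": pct(lengths, 0.95), "p99": pct(lengths, 0.99),
         "tail_ratio": pct(lengths, 0.95) / max(pct(lengths, 0.50), 1)}

    # C: prefix/KV-cache reuse, estimated from repeated hash_ids blocks when present.
    seen, repeated, total = set(), 0, 0
    for r in requests:
        for block in r.get("hash_ids") or []:
            total += 1
            repeated += block in seen
            seen.add(block)
    C = {"repeated_block_ratio": repeated / total if total else 0.0}

    # A: mean request rate, p95 of 1-second QPS, burst ratio, and interarrival CV.
    ts = sorted(r["timestamp"] for r in requests)
    span = max(ts[-1] - ts[0], 1e-9)
    buckets = {}
    for t in ts:
        buckets[int((t - ts[0]) / bucket_s)] = buckets.get(int((t - ts[0]) / bucket_s), 0) + 1
    qps = sorted(buckets.values())
    gaps = [b - a for a, b in zip(ts, ts[1:])] or [0.0]
    rate = len(ts) / span
    A = {"request_rate": rate,
         "p95_qps_1s": pct(qps, 0.95),
         "burst_ratio": pct(qps, 0.95) / rate,
         "interarrival_cv": statistics.pstdev(gaps) / statistics.mean(gaps) if statistics.mean(gaps) else 0.0}
    return {"L": L, "C": C, "A": A}
```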

Local Implementation Log

  • Added src/aituner/harness.py.
    • Builds structured harness context for prompt injection.
    • Adds TP, max-num-seqs, max-num-batched-tokens, chunked-prefill, and memory-utilization harnesses when those knobs are tunable.
    • Extracts compact recent trial diagnostics from result JSON files.
    • Adds a convergence guard based on recent completed trial performance.
    • Adds an infeasible-progress guard: when recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT after changing one knob family, the next proposal must switch primary family or stop.
    • Classifies slo_pass_rate_unrecoverable by latency failure counts first, and ignores probe-budget markers such as probe_elapsed_s> for bottleneck voting, so TTFT-heavy failures stay aligned to prefill/TP or batching harnesses instead of being treated as generic queueing.
  • Extended src/aituner/trace.py.
    • summarize_window now reports L-C-A features.
    • TraceRequest now carries optional metadata for hash_ids, turn, parent chat id, and trace type.
  • Extended src/aituner/llm.py.
    • Prompt now includes tested config signatures and the structured harness section.
    • Prompt schema now asks for should_stop.
  • Extended src/aituner/spec.py.
    • Proposal accepts optional should_stop.
  • Extended src/aituner/cli.py.
    • study tune honors should_stop=true by recording the proposal and not launching another GPU trial.
  • Extended tests/test_core_flow.py.
    • Prompt includes harness context.
    • Trace summary includes new L-C-A fields.
    • Proposal parsing accepts should_stop.
    • CLI does not launch a trial for a stop proposal.
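
For orientation, the should_stop plumbing described above reduces to roughly the sketch below. Proposal is a simplified stand-in for the real spec.Proposal, and study.record_proposal / study.launch_trial are hypothetical helper names, not the actual cli.py API.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    """Simplified stand-in for the real spec.Proposal."""
    patch: dict = field(default_factory=dict)
    rationale: str = ""
    should_stop: bool = False  # optional; absent means keep searching

def handle_proposal(proposal, study):
    """Sketch of the study tune decision: record a stop proposal instead of launching a trial."""
    study.record_proposal(proposal)
    if proposal.should_stop:
        # No harness can justify a new adjacent probe, so do not spend another GPU trial.
        return None
    return study.launch_trial(proposal.patch)
```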

Local Verification

  • python3 -m compileall -q src tests: passed.
  • PYTHONPATH=src python3 -m unittest tests.test_core_flow: passed, 62 tests.
  • pytest -q and python3 -m pytest -q: not runnable locally because pytest is not installed.

Remote Experiment Log

2026-04-25 16:30-16:45 CST

  • Pushed commit 2c5e9af to origin/main and pulled it on dash0.
  • Remote prompt check command:
    • PYTHONPATH=src python3 -m aituner.cli study prompt --study-root /tmp/aituner-harness-prompt-check/dash0-qwen27b-tight-slo-10min-run4-chat-0-8k --store-root /tmp/aituner-harness-prompt-check --prompt-name harness-check
  • Harness profile for chat_w20260311_1000, after applying the 0-8k filter:
    • L: p50 1992, p95 7628, p99 8102, tail ratio 3.83, regime moderate_tail_prefill_sensitive.
    • C: repeated token ratio estimate 0.191, repeated block ratio 0.189, multi-turn ratio 0.160, regime low_prefix_reuse.
    • A: request rate 29.52 req/s, p95 1s QPS 40, burst ratio 1.36, regime smooth.
    • Active harnesses: tensor-parallel-size and max-num-batched-tokens, which matches a TTFT/prefill-sensitive 0-8k chat workload.
  • Remote compileall passed.
  • Remote unittest discover initially exposed two pre-existing path-sensitive tests that hardcoded /home/gahow/phd/aituner; fixed them to derive REPO_ROOT from the test file path.
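
For reference, the structured context injected into the prompt for this profile has roughly the shape below. The key names are illustrative rather than the exact harness.py output; the numbers are the measured values listed above.

```python
# Approximate shape of the [Harnesses] context for chat_w20260311_1000 (0-8k filter).
# Key names are illustrative; values are the measured profile reported above.
harness_context = {
    "workload": {
        "L": {"p50": 1992, "p95": 7628, "p99": 8102, "tail_ratio": 3.83,
              "regime": "moderate_tail_prefill_sensitive"},
        "C": {"repeated_token_ratio": 0.191, "repeated_block_ratio": 0.189,
              "multi_turn_ratio": 0.160, "regime": "low_prefix_reuse"},
        "A": {"request_rate": 29.52, "p95_qps_1s": 40, "burst_ratio": 1.36,
              "regime": "smooth"},
    },
    "active_harnesses": ["tensor-parallel-size", "max-num-batched-tokens"],
}
```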

2026-04-25 16:38-16:58 CST

  • Started real run in tmux session aituner_harness_qwen27b_0_8k_20260425.
  • Store root: .aituner/harness-studies-20260425.
  • First proposal followed the harness:
    • proposal: tensor-parallel-size: 2;
    • rationale: L profile is prefill-sensitive, prefix reuse is low, arrivals are smooth, so probe adjacent TP before runtime batching knobs.
  • First high-load probe at sampling_u=0.03125 was infeasible:
    • request rate 0.895 req/s;
    • pass rate 0.145;
    • p95 TTFT 4063 ms and p95 TPOT 113 ms;
    • failed reasons included tpot_ms>50.0 and slo_pass_rate_unrecoverable.
  • Important implementation issue found: after an early-stopped probe, the worker returned while its in-flight HTTP requests were still occupying the engine, which could stall or pollute the next binary-search probe.
  • Action: stopped the run and freed the GPUs; update worker._replay_requests to drain in-flight requests after an early stop, before the next probe starts.
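
A minimal sketch of the intended drain behavior, assuming the replay loop tracks in-flight requests as asyncio tasks; the real worker._replay_requests differs in detail (arrival pacing and response handling are omitted).

```python
import asyncio

async def replay_requests(requests, send_one, should_stop):
    """Sketch: replay requests, but drain in-flight work before returning on early stop."""
    in_flight = set()
    try:
        for req in requests:
            if should_stop():
                break  # probe early-stopped; launch no further requests
            task = asyncio.create_task(send_one(req))
            in_flight.add(task)
            task.add_done_callback(in_flight.discard)
    finally:
        # Drain: wait for outstanding HTTP requests so they cannot keep occupying
        # the engine and stall or pollute the next binary-search probe.
        if in_flight:
            await asyncio.gather(*in_flight, return_exceptions=True)
```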

2026-04-25 17:00-17:12 CST

  • r2 confirmed that draining avoids immediate cross-probe pollution, but the first LLM trial still started from a speculative TP=2 edit without a measured incumbent.
  • This is not aligned with the paper's agentic loop, which evaluates the initial configuration first and then searches from measured feedback.
  • Action: update study tune so LLM-driven studies automatically materialize a baseline empty-patch trial first, unless --skip-baseline is passed. This should reduce early bad proposals because the first LLM edit will see real baseline bottleneck diagnostics and an incumbent request_rate_per_gpu.
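
A sketch of the intended baseline-first behavior; --skip-baseline is the flag described above, while ensure_baseline_trial, has_completed_trials, and launch_trial are hypothetical names used only for illustration.

```python
def ensure_baseline_trial(study, skip_baseline=False):
    """Sketch: materialize an empty-patch baseline trial before the first LLM proposal."""
    if skip_baseline or study.has_completed_trials():
        return None
    # An empty patch runs the initial configuration unchanged, so the first LLM edit
    # sees measured bottleneck diagnostics and an incumbent request_rate_per_gpu.
    return study.launch_trial(patch={})
```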

2026-04-25 17:20-18:30 CST

  • r3 started with baseline-first enabled, but the full 0-8k run with raw chat completions was too slow for fast iteration; it was stopped before it could serve as a convergence signal.
  • A fast validation using max_requests_per_probe=160 was invalid: the trace is downsampled before threshold selection, so lower thresholds can end up with request_count=0. Do not use that result for performance claims.
  • Prefill smoke v1 used completion_tokens_override=1 but kept the TPOT SLO. That made missing-TPOT failures dominate, so it was useful only for checking control flow, not for performance.

2026-04-25 18:30-20:10 CST

  • Prefill smoke v2 used the real dash0 internal vLLM with Qwen3.5-27B, the real 0-8k prompt distribution and arrivals, completion_tokens_override=1, and tpot_rule=null.
  • Trial 0001 baseline TP1/DP1:
    • sampling 0.0078125: pass rate 0.270, mean TTFT 2033.9 ms, p95 TTFT 5656.7 ms, p99 TTFT 6832.8 ms.
  • Trial 0002 TP1/DP2:
    • sampling 0.0078125: pass rate 0.277, mean TTFT 1766.9 ms, p95 TTFT 4215.3 ms, p99 TTFT 5801.7 ms.
  • Trial 0003 TP1/DP4:
    • sampling 0.0078125: pass rate 0.345, mean TTFT 1668.9 ms, p95 TTFT 3818.4 ms, p99 TTFT 5804.9 ms.
  • Trial 0004 TP1/DP8:
    • sampling 0.0078125: pass rate 0.345, mean TTFT 1675.7 ms, p95 TTFT 3823.4 ms.
  • Interpretation:
    • The harness improved directionality: after the measured baseline, proposals followed a consistent scale-out path and avoided random runtime-knob churn.
    • The smoke result improved p95 TTFT by about 32% versus baseline at the low sampling threshold and improved pass rate from 0.270 to 0.345 within 3-4 trials.
    • It did not reach the 95% pass-rate SLO in this smoke setting, so this is not a full proof of convergence to a good production config.
    • DP8 did not improve over DP4, which exposed a gap: when every trial is infeasible, the prior convergence guard had no feasible incumbent and could not detect the plateau.

2026-04-25 20:10 CST

  • Added the all-infeasible plateau guard described above.
  • Added unit coverage for:
    • TTFT failure classification under slo_pass_rate_unrecoverable;
    • blocking a repeat of the DP family after DP4 and DP8 show no material improvement at the same sampling threshold.
  • Pulled the commit on dash0 and reran remote verification:
    • python3 -m compileall -q src tests: passed.
    • PYTHONPATH=src python3 -m unittest discover -s tests -p "test_*.py": passed, 62 tests.
  • Regenerated a prompt against the real smoke v2 history:
    • convergence_guard.reason: data-parallel-size_plateau_on_infeasible_trials.
    • should_stop_if_no_harness_can_justify_a_new_adjacent_probe: true.
    • blocked primary family: data-parallel-size.
    • latest two active bottlenecks after ignoring probe_elapsed_s> for voting: ttft_prefill, ttft_prefill.
  • Current status: the harness now has the mechanism needed to avoid continuing the exact DP-only direction seen in the smoke v2 plateau. The next real experiment should either switch to a bottleneck-justified mixed TP/DP candidate or return should_stop=true.
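
The all-infeasible plateau guard reduces to a check of roughly the following shape; the window size, thresholds, and field names are assumptions for illustration, not the exact harness.py implementation.

```python
def infeasible_plateau(trials, min_gain_pass=0.01, min_gain_ttft_ms=100.0):
    """Sketch: detect a plateau across recent all-infeasible trials at one sampling threshold.

    Returns the knob family to block (forcing a family switch or should_stop) when
    changing only that family stopped improving pass rate and p95 TTFT.
    """
    recent = [t for t in trials if t["feasible"] is False][-3:]
    if len(recent) < 2:
        return None
    if len({t["sampling_threshold"] for t in recent}) != 1:
        return None  # only compare probes taken at the same sampling threshold
    families = {t["primary_family"] for t in recent}
    if len(families) != 1:
        return None  # more than one knob family changed; not a single-family plateau
    prev, last = recent[-2], recent[-1]
    pass_gain = last["pass_rate"] - prev["pass_rate"]
    ttft_gain = prev["p95_ttft_ms"] - last["p95_ttft_ms"]
    if pass_gain < min_gain_pass and ttft_gain < min_gain_ttft_ms:
        return families.pop()  # e.g. "data-parallel-size" after the DP4 -> DP8 stall
    return None
```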

Remaining next steps:

  1. Start the next real tuning run only after deciding whether to spend a full multi-hour run on the production SLO or a shorter prefill-only confirmation of the new plateau guard.
  2. If the LLM proposes another DP-only change after this guard fires, tighten validation to reject proposals that repeat convergence_guard.infeasible_progress.blocked_primary_family.
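
If step 2 is adopted, the validation could be as small as the sketch below; it assumes proposal patches are keyed by knob family and that the guard exposes infeasible_progress.blocked_primary_family as in the prompt dump above, both of which would need to match the real structures.

```python
def reject_blocked_family(proposal_patch, convergence_guard):
    """Sketch: refuse proposals that only touch the plateaued knob family again."""
    blocked = (convergence_guard.get("infeasible_progress") or {}).get("blocked_primary_family")
    if blocked and set(proposal_patch) == {blocked}:
        raise ValueError(f"proposal repeats blocked primary family: {blocked}")
```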