# Harness-Guided AITuner Progress

## Goal

Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k chat study. The prior 12-iteration run can still propose worse configs after finding good ones. The new harness should make config proposals bottleneck-directed and stop spending GPU trials once no adjacent harness-guided probe is justified.

## Paper Alignment

- Prompt structure now includes an explicit `[Harnesses]` section aligned with paper Figure 12.
- The harness uses the paper's L-C-A workload model:
  - L: prompt length percentiles and tail ratio.
  - C: prefix/KV-cache reuse estimated from repeated `hash_ids` blocks when available.
  - A: request rate, 1-second QPS burst ratio, and interarrival CV.
- Knob rules follow the paper's Figure 13 style:
  - map the active bottleneck to a knob family;
  - probe adjacent legal choices;
  - enforce guard conditions to avoid harmful side effects;
  - prefer stopping over weak exploratory proposals after convergence.

## Local Implementation Log

- Added `src/aituner/harness.py`.
  - Builds structured harness context for prompt injection.
  - Adds TP, max-num-seqs, max-num-batched-tokens, chunked-prefill, and memory-utilization harnesses when those knobs are tunable.
  - Extracts compact recent trial diagnostics from result JSON files.
  - Adds a convergence guard based on recent completed trial performance.
  - Adds an infeasible-progress guard: when recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT after changing one knob family, the next proposal must switch primary family or stop.
  - Classifies `slo_pass_rate_unrecoverable` by latency failure counts first, and ignores probe-budget markers such as `probe_elapsed_s>` for bottleneck voting, so TTFT-heavy failures stay aligned to prefill/TP or batching harnesses instead of being treated as generic queueing.
- Extended `src/aituner/trace.py`.
  - `summarize_window` now reports L-C-A features.
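As a minimal sketch of the Figure-13-style knob rules described under Paper Alignment: map an active bottleneck to a knob family, probe only adjacent legal choices, and enforce a guard condition. The bottleneck labels, family map, and guard below are illustrative assumptions, not the actual aituner rule set.

```python
# Hypothetical bottleneck-to-knob-family map (illustrative only).
BOTTLENECK_TO_FAMILY = {
    "prefill_ttft": ["tensor-parallel-size", "max-num-batched-tokens"],
    "decode_tpot": ["max-num-seqs"],
    "kv_cache_pressure": ["gpu-memory-utilization", "max-num-seqs"],
}

def adjacent_choices(legal_values, current):
    """Return the legal values immediately below and above `current`."""
    idx = legal_values.index(current)
    neighbors = []
    if idx > 0:
        neighbors.append(legal_values[idx - 1])
    if idx + 1 < len(legal_values):
        neighbors.append(legal_values[idx + 1])
    return neighbors

def propose_probes(bottleneck, config, legal):
    """Yield (knob, value) probes for the active bottleneck's family."""
    for knob in BOTTLENECK_TO_FAMILY.get(bottleneck, []):
        for value in adjacent_choices(legal[knob], config[knob]):
            # Guard-condition example (assumed): never shrink the
            # batched-token budget below the p95 prompt length, or long
            # prompts would queue pathologically.
            if knob == "max-num-batched-tokens" and value < config["p95_prompt_len"]:
                continue
            yield knob, value
```

The point of the shape is that a prefill-sensitive bottleneck can only ever produce a handful of adjacent, guard-checked probes, rather than arbitrary runtime-knob churn.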
- `TraceRequest` now carries optional metadata for `hash_ids`, turn, parent chat id, and trace type.
- Extended `src/aituner/llm.py`.
  - Prompt now includes tested config signatures and the structured harness section.
  - Prompt schema now asks for `should_stop`.
- Extended `src/aituner/spec.py`.
  - `Proposal` accepts optional `should_stop`.
- Extended `src/aituner/cli.py`.
  - `study tune` honors `should_stop=true` by recording the proposal and not launching another GPU trial.
- Extended `tests/test_core_flow.py`.
  - Prompt includes harness context.
  - Trace summary includes new L-C-A fields.
  - Proposal parsing accepts `should_stop`.
  - CLI does not launch a trial for a stop proposal.

## Local Verification

- `python3 -m compileall -q src tests`: passed.
- `PYTHONPATH=src python3 -m unittest tests.test_core_flow`: passed, 62 tests.
- `pytest -q` and `python3 -m pytest -q`: not runnable locally because `pytest` is not installed.

## Remote Experiment Log

### 2026-04-25 16:30-16:45 CST

- Pushed commit `2c5e9af` to `origin/main` and pulled it on `dash0`.
- Remote prompt check command:
  - `PYTHONPATH=src python3 -m aituner.cli study prompt --study-root /tmp/aituner-harness-prompt-check/dash0-qwen27b-tight-slo-10min-run4-chat-0-8k --store-root /tmp/aituner-harness-prompt-check --prompt-name harness-check`
- Harness profile for `chat_w20260311_1000`, after applying the 0-8k filter:
  - L: p50 1992, p95 7628, p99 8102, tail ratio 3.83, regime `moderate_tail_prefill_sensitive`.
  - C: repeated token ratio estimate 0.191, repeated block ratio 0.189, multi-turn ratio 0.160, regime `low_prefix_reuse`.
  - A: request rate 29.52 req/s, p95 1s QPS 40, burst ratio 1.36, regime `smooth`.
- Active harnesses: `tensor-parallel-size` and `max-num-batched-tokens`, which matches a TTFT/prefill-sensitive 0-8k chat workload.
- Remote `compileall` passed.
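For reference, the L tail ratio and A burst ratio in the harness profile above are consistent with a p95/p50 prompt-length ratio (7628 / 1992 ≈ 3.83) and a p95-of-1s-QPS over mean-rate ratio (40 / 29.52 ≈ 1.36). A minimal sketch of that derivation, assuming a nearest-rank percentile and a simple field layout (not the actual `summarize_window` signature):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile over a sorted copy of `values`."""
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def lca_features(prompt_lens, arrival_s):
    """L tail ratio (p95/p50 prompt length) and A burst ratio
    (p95 of 1-second QPS over the mean request rate)."""
    tail_ratio = percentile(prompt_lens, 95) / percentile(prompt_lens, 50)
    # Bucket arrivals into 1-second windows to get per-second QPS.
    per_second = {}
    for t in arrival_s:
        per_second[int(t)] = per_second.get(int(t), 0) + 1
    duration = max(arrival_s) - min(arrival_s)
    mean_rate = len(arrival_s) / max(duration, 1e-9)
    burst_ratio = percentile(list(per_second.values()), 95) / mean_rate
    return {"tail_ratio": tail_ratio, "burst_ratio": burst_ratio}
```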
- Remote `unittest discover` initially exposed two pre-existing path-sensitive tests that hardcoded `/home/gahow/phd/aituner`; fixed them to derive `REPO_ROOT` from the test file path.

### 2026-04-25 16:38-16:58 CST

- Started real run in tmux session `aituner_harness_qwen27b_0_8k_20260425`.
- Store root: `.aituner/harness-studies-20260425`.
- First proposal followed the harness:
  - proposal: `tensor-parallel-size: 2`;
  - rationale: the L profile is prefill-sensitive, prefix reuse is low, and arrivals are smooth, so probe adjacent TP before runtime batching knobs.
- First high-load probe at `sampling_u=0.03125` was infeasible:
  - request rate 0.895 req/s;
  - pass rate 0.145;
  - p95 TTFT 4063 ms and p95 TPOT 113 ms;
  - failed reasons included `tpot_ms>50.0` and `slo_pass_rate_unrecoverable`.
- Important implementation issue found: after an early-stopped probe, the worker returned while in-flight HTTP requests could still occupy the engine, stalling or polluting the next binary-search probe.
- Action: stopped the run and freed GPUs. Updated `worker._replay_requests` to drain in-flight requests after an early stop, before the next probe starts.

### 2026-04-25 17:00-17:12 CST

- r2 confirmed that draining avoids immediate cross-probe pollution, but the first LLM trial still started from a speculative TP=2 edit without a measured incumbent.
- This is not aligned with the paper's agentic loop, which evaluates the initial configuration first and then searches from measured feedback.
- Action: update `study tune` so LLM-driven studies automatically materialize a baseline empty-patch trial first, unless `--skip-baseline` is passed. This should reduce early bad proposals because the first LLM edit will see real baseline bottleneck diagnostics and an incumbent `request_rate_per_gpu`.

### 2026-04-25 17:20-18:30 CST

- r3 started with baseline-first enabled, but the full 0-8k run was too slow for fast iteration with raw chat completions. Stopped it before using it as a convergence signal.
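The drain behavior added to `worker._replay_requests` can be sketched as follows. This is an illustrative asyncio shape under assumed interfaces (`send` coroutine, `early_stop` event, `(delay_s, payload)` request pairs), not the actual worker code:

```python
import asyncio

async def replay_requests(requests, send, early_stop):
    """Replay (delay_s, payload) pairs against the engine, stopping when
    `early_stop` is set and draining in-flight work before returning."""
    inflight = []
    for delay_s, payload in requests:
        await asyncio.sleep(delay_s)   # honor trace interarrival gaps
        if early_stop.is_set():
            break                      # probe budget hit: stop issuing
        inflight.append(asyncio.create_task(send(payload)))
    if early_stop.is_set():
        # The fix: cancel whatever is still running, so stray requests
        # cannot keep occupying the engine and pollute the next probe.
        for task in inflight:
            task.cancel()
    # Wait for every task to finish or unwind before returning.
    await asyncio.gather(*inflight, return_exceptions=True)
```

The key design point is that the worker only returns after `gather` has resolved every in-flight task, so the next binary-search probe starts against an idle engine.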
- A fast validation using `max_requests_per_probe=160` was invalid: the trace is downsampled before threshold selection, so lower thresholds can end up with `request_count=0`. Do not use that result for performance claims.
- Prefill smoke v1 used `completion_tokens_override=1` but kept the TPOT SLO. That made missing-TPOT failures dominate, so it was useful only for checking control flow, not for performance.

### 2026-04-25 18:30-20:10 CST

- Prefill smoke v2 used the real dash0 internal vLLM, Qwen3.5-27B, the real 0-8k prompt distribution and arrivals, `completion_tokens_override=1`, and `tpot_rule=null`.
- Results at sampling `0.0078125`:

| Trial | Config | Pass rate | Mean TTFT (ms) | p95 TTFT (ms) | p99 TTFT (ms) |
|---|---|---|---|---|---|
| 0001 | TP1/DP1 (baseline) | 0.270 | 2033.9 | 5656.7 | 6832.8 |
| 0002 | TP1/DP2 | 0.277 | 1766.9 | 4215.3 | 5801.7 |
| 0003 | TP1/DP4 | 0.345 | 1668.9 | 3818.4 | 5804.9 |
| 0004 | TP1/DP8 | 0.345 | 1675.7 | 3823.4 | n/a |

- Interpretation:
  - The harness improved directionality: after the measured baseline, proposals followed a consistent scale-out path and avoided random runtime-knob churn.
  - The smoke result improved p95 TTFT by about 32% versus baseline at the low sampling threshold and improved pass rate from 0.270 to 0.345 within 3-4 trials.
  - It did not reach the 95% pass-rate SLO in this smoke setting, so this is not full proof of convergence to a good production config.
  - DP8 did not improve over DP4, which exposed a gap: when every trial is infeasible, the prior convergence guard had no feasible incumbent and could not detect a plateau.

### 2026-04-25 20:10 CST

- Added the all-infeasible plateau guard described above.
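A minimal sketch of what such an all-infeasible plateau guard can look like: when two consecutive infeasible trials in the same knob family at the same sampling threshold show no material gain in pass rate or p95 TTFT, the next proposal must switch family or stop. The trial-record field names and the two improvement thresholds below are assumptions, not the actual aituner schema.

```python
def no_material_improvement(prev, curr, min_pass_gain=0.01, min_ttft_gain_ms=50.0):
    """True when neither pass rate nor p95 TTFT improved materially."""
    pass_gain = curr["pass_rate"] - prev["pass_rate"]
    ttft_gain_ms = prev["p95_ttft_ms"] - curr["p95_ttft_ms"]
    return pass_gain < min_pass_gain and ttft_gain_ms < min_ttft_gain_ms

def must_switch_family(trials):
    """Guard decision over the two most recent trial records."""
    if len(trials) < 2:
        return False
    prev, curr = trials[-2], trials[-1]
    same_setting = (prev["family"] == curr["family"]
                    and prev["sampling_u"] == curr["sampling_u"])
    all_infeasible = not prev["feasible"] and not curr["feasible"]
    return same_setting and all_infeasible and no_material_improvement(prev, curr)
```

With the smoke v2 numbers (DP4: pass 0.345 / p95 TTFT 3818.4 ms; DP8: pass 0.345 / p95 TTFT 3823.4 ms, both infeasible at the same threshold), a guard of this shape fires and forces a family switch or stop, while the DP2 to DP4 step (pass 0.277 to 0.345) does not trigger it.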
- Added unit coverage for:
  - TTFT failure classification under `slo_pass_rate_unrecoverable`;
  - blocking a repeat of the DP family after DP4 and DP8 show no material improvement at the same sampling threshold.
- Current status: the harness now has the mechanism needed to avoid continuing the exact DP-only direction seen in the smoke v2 plateau. The next real experiment should either switch to a bottleneck-justified mixed TP/DP candidate or return `should_stop=true`.

Remaining next steps:

1. Push/pull the plateau-guard commit to `dash0`.
2. Re-run the remote unit suite.
3. Start the next real tuning run only after deciding whether to spend a full multi-hour run on the production SLO or a shorter prefill-only confirmation of the new plateau guard.
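For reference, the TTFT-first classification rule exercised by the new unit tests can be sketched as below. The reason-string formats are assumptions extrapolated from the failure reasons logged earlier (e.g. `tpot_ms>50.0`), not the actual aituner result schema:

```python
def classify_unrecoverable(failed_reason_counts):
    """Vote on a bottleneck from latency failure counts only.

    Probe-budget markers (reasons starting with 'probe_elapsed_s>') are
    excluded from the vote, so hitting the probe time budget does not
    masquerade as a latency bottleneck.
    """
    votes = {reason: n for reason, n in failed_reason_counts.items()
             if not reason.startswith("probe_elapsed_s>")}
    ttft = sum(n for reason, n in votes.items() if reason.startswith("ttft_ms>"))
    tpot = sum(n for reason, n in votes.items() if reason.startswith("tpot_ms>"))
    # TTFT-heavy failures route to the prefill/TP or batching harnesses
    # rather than being treated as generic queueing.
    return "prefill" if ttft >= tpot else "decode"
```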