# Harness-Guided AITuner Progress

## Goal

Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k chat study. The prior 12-iteration run can still propose worse configs after finding good ones. The new harness should make config proposals bottleneck-directed and stop spending GPU trials once no adjacent harness-guided probe is justified.

## Paper Alignment

- Prompt structure now includes an explicit `[Harnesses]` section aligned with paper Figure 12.
- The harness uses the paper's L-C-A workload model:
  - L: prompt length percentiles and tail ratio.
  - C: prefix/KV-cache reuse estimated from repeated `hash_ids` blocks when available.
  - A: request rate, 1-second QPS burst ratio, and interarrival CV.
- Knob rules follow the paper's Figure 13 style:
  - map active bottleneck to a knob family;
  - probe adjacent legal choices;
  - enforce guard conditions to avoid harmful side effects;
  - prefer stopping over weak exploratory proposals after convergence.

## Local Implementation Log

- Added `src/aituner/harness.py`.
  - Builds structured harness context for prompt injection.
  - Adds TP, max-num-seqs, max-num-batched-tokens, chunked-prefill, and memory-utilization harnesses when those knobs are tunable.
  - Extracts compact recent trial diagnostics from result JSON files.
  - Adds a convergence guard based on recent completed trial performance.
- Extended `src/aituner/trace.py`.
  - `summarize_window` now reports L-C-A features.
  - `TraceRequest` now carries optional metadata for `hash_ids`, turn, parent chat id, and trace type.
- Extended `src/aituner/llm.py`.
  - Prompt now includes tested config signatures and the structured harness section.
  - Prompt schema now asks for `should_stop`.
- Extended `src/aituner/spec.py`.
  - `Proposal` accepts optional `should_stop`.
- Extended `src/aituner/cli.py`.
  - `study tune` honors `should_stop=true` by recording the proposal and not launching another GPU trial.
- Extended `tests/test_core_flow.py`.
  - Prompt includes harness context.
  - Trace summary includes new L-C-A fields.
  - Proposal parsing accepts `should_stop`.
  - CLI does not launch a trial for a stop proposal.

## Local Verification

- `python3 -m compileall -q src tests`: passed.
- `PYTHONPATH=src python3 -m unittest tests.test_core_flow`: passed, 59 tests.
- `pytest -q` and `python3 -m pytest -q`: not runnable locally because `pytest` is not installed.

## Remote Experiment Log

### 2026-04-25 16:30-16:45 CST

- Pushed commit `2c5e9af` to `origin/main` and pulled it on `dash0`.
- Remote prompt check command:
  - `PYTHONPATH=src python3 -m aituner.cli study prompt --study-root /tmp/aituner-harness-prompt-check/dash0-qwen27b-tight-slo-10min-run4-chat-0-8k --store-root /tmp/aituner-harness-prompt-check --prompt-name harness-check`
- Harness profile for `chat_w20260311_1000`, after applying the 0-8k filter:
  - L: p50 1992, p95 7628, p99 8102, tail ratio 3.83, regime `moderate_tail_prefill_sensitive`.
  - C: repeated token ratio estimate 0.191, repeated block ratio 0.189, multi-turn ratio 0.160, regime `low_prefix_reuse`.
  - A: request rate 29.52 req/s, p95 1s QPS 40, burst ratio 1.36, regime `smooth`.
- Active harnesses: `tensor-parallel-size` and `max-num-batched-tokens`, which matches a TTFT/prefill-sensitive 0-8k chat workload.
- Remote `compileall` passed.
- Remote `unittest discover` initially exposed two pre-existing path-sensitive tests that hardcoded `/home/gahow/phd/aituner`; fixed them to derive `REPO_ROOT` from the test file path.

Remaining next steps:

1. Start a real harness-guided Qwen3.5-27B 0-8k chat tuning run from `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`.
2. Compare the first few iterations against the prior 12-iteration behavior:
   - best request rate per GPU should improve or reach the known good region in fewer trials;
   - proposals should follow the active bottleneck harness;
   - if the incumbent has converged, the LLM should emit `should_stop=true` instead of proposing a weak exploratory config.
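
As a sanity check on the harness profile reported in the remote experiment log, the L and A figures are consistent with tail ratio = p95 / p50 and burst ratio = p95 1s QPS / mean request rate. The sketch below reproduces those two numbers and shows one common way to compute an interarrival CV. The formulas are inferred from the reported values, not taken from `summarize_window`, and the function names are hypothetical.

```python
import statistics


def tail_ratio(p50: float, p95: float) -> float:
    """Assumed definition: p95 prompt length divided by p50 prompt length."""
    return p95 / p50


def burst_ratio(p95_qps_1s: float, mean_rate: float) -> float:
    """Assumed definition: p95 of 1-second QPS divided by mean request rate."""
    return p95_qps_1s / mean_rate


def interarrival_cv(arrival_times: list[float]) -> float:
    """Coefficient of variation (population stdev / mean) of interarrival gaps.

    Returns 0.0 when there are fewer than two gaps or the mean gap is zero.
    """
    ordered = sorted(arrival_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return 0.0
    mean_gap = statistics.fmean(gaps)
    if mean_gap == 0:
        return 0.0
    return statistics.pstdev(gaps) / mean_gap


if __name__ == "__main__":
    # Values copied from the harness profile for chat_w20260311_1000.
    print(round(tail_ratio(1992, 7628), 2))    # 3.83
    print(round(burst_ratio(40, 29.52), 2))    # 1.36
    # Evenly spaced arrivals have no interarrival variation.
    print(interarrival_cv([0.0, 1.0, 2.0, 3.0]))  # 0.0
```

If the real implementation uses different percentile interpolation or a sample rather than population standard deviation, small differences from these values would be expected.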