Harness-Guided AITuner Progress
Goal
Improve AITuner convergence for the dash0 internal vLLM + Qwen3.5-27B 0-8k chat study. The prior 12-iteration run can still propose worse configs after finding good ones. The new harness should make config proposals bottleneck-directed and stop spending GPU trials once no adjacent harness-guided probe is justified.
Paper Alignment
- Prompt structure now includes an explicit `[Harnesses]` section aligned with the paper's Figure 12.
- The harness uses the paper's L-C-A workload model (a feature-extraction sketch follows this list):
  - L: prompt length percentiles and tail ratio.
  - C: prefix/KV-cache reuse, estimated from repeated `hash_ids` blocks when available.
  - A: request rate, 1-second QPS burst ratio, and interarrival CV.
- Knob rules follow the paper's Figure 13 style:
- map active bottleneck to a knob family;
- probe adjacent legal choices;
- enforce guard conditions to avoid harmful side effects;
- prefer stopping over weak exploratory proposals after convergence.
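As a concrete anchor for the L-C-A bullets above, here is a minimal feature-extraction sketch. It assumes each trace request exposes `prompt_len`, `arrival_ts`, and optional `hash_ids`; these field names and formulas are illustrative, not the actual `summarize_window` code. Regime labels such as `moderate_tail_prefill_sensitive` would then come from thresholding these features.

```python
# Sketch: L-C-A workload features from a request trace. Field names
# (prompt_len, arrival_ts, hash_ids) and all formulas are illustrative
# assumptions, not the actual summarize_window implementation.
from collections import Counter
from statistics import mean, pstdev


def percentile(sorted_vals, q):
    """Nearest-rank style percentile over a pre-sorted, non-empty list."""
    idx = min(len(sorted_vals) - 1, round(q / 100 * (len(sorted_vals) - 1)))
    return sorted_vals[idx]


def lca_features(requests, window_s=1.0):
    lens = sorted(r["prompt_len"] for r in requests)
    arrivals = sorted(r["arrival_ts"] for r in requests)

    # L: prompt-length percentiles; tail ratio here is p95 / p50.
    p50, p95, p99 = (percentile(lens, q) for q in (50, 95, 99))

    # C: repeated-block ratio from hash_ids when the trace carries them.
    blocks = Counter(h for r in requests for h in (r.get("hash_ids") or []))
    total = sum(blocks.values())
    repeats = sum(c - 1 for c in blocks.values() if c > 1)

    # A: mean request rate, p95 1-second QPS burst ratio, interarrival CV.
    span = max(arrivals[-1] - arrivals[0], 1e-9)
    rate = len(arrivals) / span
    qps = sorted(Counter(int(t // window_s) for t in arrivals).values())
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]

    return {
        "L": {"p50": p50, "p95": p95, "p99": p99,
              "tail_ratio": p95 / max(p50, 1)},
        "C": {"repeated_block_ratio": repeats / total if total else 0.0},
        "A": {"rate": rate,
              "burst_ratio": percentile(qps, 95) / max(rate * window_s, 1e-9),
              "interarrival_cv": (pstdev(gaps) / max(mean(gaps), 1e-9)
                                  if gaps else 0.0)},
    }
```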
Local Implementation Log
- Added `src/aituner/harness.py`.
  - Builds structured harness context for prompt injection.
  - Adds TP, max-num-seqs, max-num-batched-tokens, chunked-prefill, and memory-utilization harnesses when those knobs are tunable.
  - Extracts compact recent trial diagnostics from result JSON files.
  - Adds a convergence guard based on recent completed trial performance (see the sketch after this log).
- Extended `src/aituner/trace.py`.
  - `summarize_window` now reports L-C-A features.
  - `TraceRequest` now carries optional metadata for `hash_ids`, turn, parent chat id, and trace type.
- Extended `src/aituner/llm.py`.
  - Prompt now includes tested config signatures and the structured harness section.
  - Prompt schema now asks for `should_stop`.
- Extended `src/aituner/spec.py`.
  - `Proposal` accepts an optional `should_stop`.
- Extended `src/aituner/cli.py`.
  - `study tune` honors `should_stop=true` by recording the proposal and not launching another GPU trial.
- Extended `tests/test_core_flow.py`.
  - Prompt includes harness context.
  - Trace summary includes the new L-C-A fields.
  - Proposal parsing accepts `should_stop`.
  - CLI does not launch a trial for a stop proposal.
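Taken together, these changes form a small stop path: either the LLM emits `should_stop` or the guard fires. A condensed sketch of one plausible shape, assuming a simplified `Proposal`, trial records with `completed` and `request_rate_per_gpu` fields, and illustrative guard thresholds; it is not the real `spec.py`/`cli.py` code.

```python
# Sketch: optional should_stop on proposals plus a convergence guard.
# The dataclass shape, window, and min_gain are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class Proposal:
    config_patch: dict = field(default_factory=dict)
    rationale: str = ""
    should_stop: bool = False  # absent in the LLM reply means False


def converged(trials, window=3, min_gain=0.02):
    """True if the best request_rate_per_gpu gained less than min_gain
    (relative) over the last `window` completed trials."""
    rates = [t["request_rate_per_gpu"] for t in trials if t.get("completed")]
    if len(rates) <= window:
        return False
    return max(rates) < max(rates[:-window]) * (1 + min_gain)


def tune_step(proposal, trials, launch_trial, record):
    # `study tune` records a stop proposal instead of launching a GPU
    # trial; the convergence guard is a backstop against weak
    # exploratory proposals after the incumbent has converged.
    if proposal.should_stop or converged(trials):
        record(proposal)
        return None
    return launch_trial(proposal.config_patch)
```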
Local Verification
- `python3 -m compileall -q src tests`: passed.
- `PYTHONPATH=src python3 -m unittest tests.test_core_flow`: passed, 59 tests.
- `pytest -q` and `python3 -m pytest -q`: not runnable locally because `pytest` is not installed.
Remote Experiment Log
2026-04-25 16:30-16:45 CST
- Pushed commit `2c5e9af` to `origin/main` and pulled it on `dash0`.
- Remote prompt check command:
  `PYTHONPATH=src python3 -m aituner.cli study prompt --study-root /tmp/aituner-harness-prompt-check/dash0-qwen27b-tight-slo-10min-run4-chat-0-8k --store-root /tmp/aituner-harness-prompt-check --prompt-name harness-check`
- Harness profile for `chat_w20260311_1000`, after applying the 0-8k filter:
  - L: p50 1992, p95 7628, p99 8102, tail ratio 3.83, regime `moderate_tail_prefill_sensitive`.
  - C: repeated token ratio estimate 0.191, repeated block ratio 0.189, multi-turn ratio 0.160, regime `low_prefix_reuse`.
  - A: request rate 29.52 req/s, p95 1s QPS 40, burst ratio 1.36, regime `smooth`.
  - Active harnesses: `tensor-parallel-size` and `max-num-batched-tokens`, which matches a TTFT/prefill-sensitive 0-8k chat workload.
- Remote `compileall` passed.
- Remote `unittest discover` initially exposed two pre-existing path-sensitive tests that hardcoded `/home/gahow/phd/aituner`; fixed them to derive `REPO_ROOT` from the test file path (sketch below).
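The path fix is essentially a one-liner; a sketch, assuming the test file sits one directory below the repository root:

```python
from pathlib import Path

# Derive the repo root from the test file's own location instead of the
# hardcoded /home/gahow/phd/aituner. parents[1] assumes tests/ sits
# directly under the repository root.
REPO_ROOT = Path(__file__).resolve().parents[1]
```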
2026-04-25 16:38-16:58 CST
- Started real run in tmux session `aituner_harness_qwen27b_0_8k_20260425`.
- Store root: `.aituner/harness-studies-20260425`.
- First proposal followed the harness:
  - proposal: `tensor-parallel-size: 2`;
  - rationale: the L profile is prefill-sensitive, prefix reuse is low, and arrivals are smooth, so probe adjacent TP before runtime batching knobs.
- First high-load probe at `sampling_u=0.03125` was infeasible:
  - request rate 0.895 req/s;
  - pass rate 0.145;
  - p95 TTFT 4063 ms and p95 TPOT 113 ms;
  - failed reasons included `tpot_ms>50.0` and `slo_pass_rate_unrecoverable`.
- Important implementation issue found: after an early-stopped probe, the worker returned while in-flight HTTP requests could still occupy the engine, stalling or polluting the next binary-search probe.
- Action: stopped the run and freed the GPUs. Updating `worker._replay_requests` to drain in-flight requests after an early stop, before the next probe starts (sketch below).
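A minimal asyncio sketch of the intended drain, assuming `send_one(req)` issues one HTTP request to the engine, `req["gap_s"]` carries interarrival pacing, and a `stop_event` flags the early stop; the real `worker._replay_requests` signature and pacing may differ.

```python
import asyncio


async def _replay_requests(requests, send_one, stop_event):
    """Replay paced requests until stop_event is set, then drain.

    Sketch only: without the final drain, early-stopped probes leave
    requests occupying the engine into the next binary-search probe.
    """
    in_flight = set()
    for req in requests:
        if stop_event.is_set():  # early stop decided by the probe logic
            break
        task = asyncio.create_task(send_one(req))
        in_flight.add(task)
        task.add_done_callback(in_flight.discard)
        await asyncio.sleep(req.get("gap_s", 0.0))  # pace the replay

    # Drain: cancel anything still in flight and wait for it to settle,
    # so no request keeps occupying engine slots into the next probe.
    pending = list(in_flight)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
```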
2026-04-25 17:00-17:12 CST
- r2 confirmed that draining avoids immediate cross-probe pollution, but the first LLM trial still started from a speculative TP=2 edit without a measured incumbent.
- This is not aligned with the paper's agentic loop, which evaluates the initial configuration first and then searches from measured feedback.
- Action: update `study tune` so LLM-driven studies automatically materialize a baseline empty-patch trial first, unless `--skip-baseline` is passed (sketch below). This should reduce early bad proposals because the first LLM edit will see real baseline bottleneck diagnostics and an incumbent `request_rate_per_gpu`.
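A sketch of the intended loop, with hypothetical helper names (`llm_propose`, `run_trial`); the empty patch stands in for the unmodified initial configuration.

```python
def run_study(study, llm_propose, run_trial, max_trials=12,
              skip_baseline=False):
    """Sketch of the planned `study tune` loop; helper names are
    hypothetical assumptions, not the real CLI internals."""
    trials = list(study.completed_trials())
    if not skip_baseline and not trials:
        # Baseline empty-patch trial: gives the first LLM edit real
        # bottleneck diagnostics and an incumbent request_rate_per_gpu.
        trials.append(run_trial(config_patch={}))
    while len(trials) < max_trials:
        proposal = llm_propose(study, trials)
        if proposal.should_stop:
            study.record(proposal)  # record the stop, skip the GPU trial
            break
        trials.append(run_trial(proposal.config_patch))
    return trials
```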
Remaining next steps:
- Start a real harness-guided Qwen3.5-27B 0-8k chat tuning run from `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`.
- Compare the first few iterations against the prior 12-iteration behavior (see the comparison sketch below):
  - best request rate per GPU should improve, or reach the known good region in fewer trials;
  - proposals should follow the active bottleneck harness;
  - if the incumbent has converged, the LLM should emit `should_stop=true` instead of proposing a weak exploratory config.
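For the first comparison bullet, a small helper sketch; the result-JSON layout and the `result*.json` filename pattern are assumptions, and only `request_rate_per_gpu` comes from the study metric above.

```python
import json
from pathlib import Path


def best_so_far(store_root):
    """Best-so-far request_rate_per_gpu per completed trial, in trial
    order. Assumes one result*.json per trial under store_root; the
    layout and filename pattern are assumptions."""
    curve = []
    for path in sorted(Path(store_root).glob("**/result*.json")):
        rate = json.loads(path.read_text()).get("request_rate_per_gpu")
        if rate is not None:
            curve.append(max(rate, curve[-1]) if curve else rate)
    return curve


# e.g. compare the harness-guided run against the prior 12-iteration run:
# print(best_so_far(".aituner/harness-studies-20260425"))
```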