Harness-Guided AITuner Progress

Goal

Improve AITuner convergence for the dash0 internal vLLM + Qwen3.5-27B 0-8k chat study. In the prior 12-iteration run, the tuner still proposed worse configs after it had already found good ones. The new harness should make config proposals bottleneck-directed and stop spending GPU trials once no adjacent harness-guided probe is justified.

Paper Alignment

  • Prompt structure now includes an explicit [Harnesses] section aligned with paper Figure 12.
  • The harness uses the paper's L-C-A workload model:
    • L: prompt length percentiles and tail ratio.
    • C: prefix/KV-cache reuse estimated from repeated hash_ids blocks when available.
    • A: request rate, 1-second QPS burst ratio, and interarrival CV.
  • Knob rules follow the paper's Figure 13 style:
    • map active bottleneck to a knob family;
    • probe adjacent legal choices;
    • enforce guard conditions to avoid harmful side effects;
    • prefer stopping over weak exploratory proposals after convergence.
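
To make the L-C-A features above concrete, here is a minimal sketch of how they could be computed for one trace window. It is not the code in src/aituner/trace.py: the function names, the request dict shape ('arrival_s', 'prompt_tokens', 'hash_ids'), and the exact definitions (nearest-rank percentiles, tail ratio as p99/p50, burst ratio as peak 1-second count over mean rate) are assumptions chosen for illustration.

```python
import math
import statistics
from collections import Counter


def percentile(sorted_vals, q):
    """Nearest-rank percentile of a pre-sorted, non-empty list (q in (0, 100])."""
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def lca_features(requests, window_s):
    """Summarize one trace window (hypothetical input shape).

    Each request is a dict with 'arrival_s', 'prompt_tokens', and an
    optional 'hash_ids' list of block hashes.
    """
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    ordered = sorted(requests, key=lambda r: r["arrival_s"])

    # L: prompt length percentiles and tail ratio (assumed here: p99 / p50).
    lengths = sorted(r["prompt_tokens"] for r in ordered)
    p50, p99 = percentile(lengths, 50), percentile(lengths, 99)

    # C: share of hash blocks that already appeared in an earlier request.
    seen, total, reused = set(), 0, 0
    for r in ordered:
        blocks = r.get("hash_ids") or []
        total += len(blocks)
        reused += sum(1 for h in blocks if h in seen)
        seen.update(blocks)

    # A: mean rate, peak 1-second count over mean rate, interarrival CV.
    arrivals = [r["arrival_s"] for r in ordered]
    rate = len(arrivals) / window_s
    peak_1s = max(Counter(math.floor(t) for t in arrivals).values())
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    mean_gap = statistics.mean(gaps) if gaps else 0.0
    cv = statistics.pstdev(gaps) / mean_gap if mean_gap > 0 else None

    return {
        "prompt_p50": p50,
        "prompt_p99": p99,
        "tail_ratio": p99 / p50 if p50 else None,
        "cache_reuse": reused / total if total else None,
        "rate_qps": rate,
        "burst_ratio": peak_1s / rate,
        "interarrival_cv": cv,
    }
```

Features that cannot be estimated (no hash_ids, fewer than two arrivals) come back as None in this sketch, which matches the "when available" qualifier on the cache-reuse estimate.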

Local Implementation Log

  • Added src/aituner/harness.py.
    • Builds structured harness context for prompt injection.
    • Adds TP, max-num-seqs, max-num-batched-tokens, chunked-prefill, and memory-utilization harnesses when those knobs are tunable.
    • Extracts compact recent trial diagnostics from result JSON files.
    • Adds a convergence guard based on recent completed trial performance.
  • Extended src/aituner/trace.py.
    • summarize_window now reports L-C-A features.
    • TraceRequest now carries optional metadata for hash_ids, turn, parent chat id, and trace type.
  • Extended src/aituner/llm.py.
    • Prompt now includes tested config signatures and the structured harness section.
    • Prompt schema now asks for should_stop.
  • Extended src/aituner/spec.py.
    • Proposal accepts optional should_stop.
  • Extended src/aituner/cli.py.
    • study tune honors should_stop=true by recording the proposal and not launching another GPU trial.
  • Extended tests/test_core_flow.py.
    • Prompt includes harness context.
    • Trace summary includes new L-C-A fields.
    • Proposal parsing accepts should_stop.
    • CLI does not launch a trial for a stop proposal.
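
The pieces logged above (adjacent probes, the convergence guard, and the should_stop path through the CLI) fit together roughly as in the sketch below. This is an illustration under stated assumptions, not the contents of harness.py, spec.py, or cli.py: every name here (Proposal, signature, adjacent_choices, has_converged, next_proposal, tune), the 3-trial window, and the 2% gain threshold are hypothetical, and the per-knob guard conditions are omitted. Scores are assumed to be positive and higher-is-better, such as request rate per GPU.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    config: dict
    should_stop: bool = False  # optional field; default means "keep tuning"


def signature(config):
    """Hashable identity of a config, used to skip already-tested ones."""
    return tuple(sorted(config.items()))


def adjacent_choices(legal, current):
    """Legal values immediately below and above `current` (legal is sorted)."""
    i = legal.index(current)
    return [legal[j] for j in (i - 1, i + 1) if 0 <= j < len(legal)]


def has_converged(scores, window=3, min_rel_gain=0.02):
    """True when the last `window` trials did not beat the earlier best
    by at least `min_rel_gain` (thresholds are illustrative assumptions)."""
    if len(scores) <= window:
        return False
    best_before = max(scores[:-window])
    return max(scores[-window:]) < best_before * (1 + min_rel_gain)


def next_proposal(incumbent, knob, legal, tested, scores):
    """Probe an untested adjacent value of `knob`, or ask to stop."""
    if not has_converged(scores):
        for value in adjacent_choices(legal, incumbent[knob]):
            candidate = {**incumbent, knob: value}
            if signature(candidate) not in tested:
                return Proposal(candidate)
    return Proposal(incumbent, should_stop=True)


def tune(propose, launch_trial, max_iters):
    """Record every proposal, but launch no trial for a stop proposal."""
    recorded = []
    for _ in range(max_iters):
        proposal = propose()
        recorded.append(proposal)
        if proposal.should_stop:
            break
        launch_trial(proposal.config)
    return recorded
```

In this sketch a stop proposal is still appended to the record before the loop exits, mirroring the logged behavior of recording the proposal without launching another GPU trial.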

Local Verification

  • python3 -m compileall -q src tests: passed.
  • PYTHONPATH=src python3 -m unittest tests.test_core_flow: passed, 59 tests.
  • pytest -q and python3 -m pytest -q: not runnable locally because pytest is not installed.

Remote Experiment Log

Pending. Next steps:

  1. Commit and push the harness implementation.
  2. Pull on dash0 in /home/admin/cpfs/wjh/aituner/aituner.
  3. Start a real harness-guided Qwen3.5-27B 0-8k chat tuning run from configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json.
  4. Compare the first few iterations against the prior 12-iteration behavior:
    • best request rate per GPU should improve or reach the known good region in fewer trials;
    • proposals should follow the active bottleneck harness;
    • if the incumbent has converged, the LLM should emit should_stop=true instead of proposing a weak exploratory config.
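
The comparison in step 4 could be scored with two simple counters, sketched below. The function names and the idea of counting "regressions" as trials that score below the running best are assumptions made for illustration; the rates in the usage lines are made-up placeholder numbers, not results from any run.

```python
def trials_to_reach(rates, target):
    """1-based index of the first trial whose rate per GPU reaches `target`,
    or None if no trial reaches it."""
    for i, rate in enumerate(rates, start=1):
        if rate >= target:
            return i
    return None


def regressions_after_best(rates):
    """Count trials that scored below the best rate seen before them."""
    best, count = None, 0
    for rate in rates:
        if best is not None and rate < best:
            count += 1
        best = rate if best is None else max(best, rate)
    return count


# Placeholder per-trial rates (hypothetical values, for illustration only).
example_rates = [1.0, 1.4, 1.2, 1.5, 1.3]
print(trials_to_reach(example_rates, target=1.4))
print(regressions_after_best(example_rates))
```

A harness-guided run that behaves as intended would show a smaller trials_to_reach value for the known good region and fewer regressions than the prior run, with the trial list ending at the stop proposal.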