Add harness-guided tuning prompts
59
docs/harness-tuning-progress.md
Normal file
@@ -0,0 +1,59 @@
# Harness-Guided AITuner Progress
## Goal
Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k chat study. In the prior 12-iteration run, the tuner could still propose worse configs after it had already found good ones. The new harness should make config proposals bottleneck-directed and stop spending GPU trials once no adjacent harness-guided probe is justified.
## Paper Alignment
- Prompt structure now includes an explicit `[Harnesses]` section aligned with paper Figure 12.
- The harness uses the paper's L-C-A workload model:
  - L: prompt length percentiles and tail ratio.
  - C: prefix/KV-cache reuse estimated from repeated `hash_ids` blocks when available.
  - A: request rate, 1-second QPS burst ratio, and interarrival CV.
- Knob rules follow the paper's Figure 13 style:
  - map active bottleneck to a knob family;
  - probe adjacent legal choices;
  - enforce guard conditions to avoid harmful side effects;
  - prefer stopping over weak exploratory proposals after convergence.
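
The knob rules above can be illustrated with a minimal sketch. It is not the code in `src/aituner/harness.py`: the bottleneck labels, the `LEGAL_CHOICES` table, and the helper names `signature`, `adjacent_choices`, and `propose_probes` are assumptions made for illustration. Only the knob names correspond to vLLM engine arguments.

```python
from typing import Callable, Dict, List, Set, Tuple

Config = Dict[str, int]
Signature = Tuple[Tuple[str, int], ...]

# Hypothetical legal choices per knob; real values come from the study spec.
LEGAL_CHOICES: Dict[str, List[int]] = {
    "tensor_parallel_size": [1, 2, 4, 8],
    "max_num_seqs": [32, 64, 128, 256],
    "max_num_batched_tokens": [2048, 4096, 8192, 16384],
}

# Hypothetical mapping from a diagnosed bottleneck to the knob family to probe.
BOTTLENECK_TO_KNOB: Dict[str, str] = {
    "decode_queueing": "max_num_seqs",
    "prefill_latency": "max_num_batched_tokens",
    "compute_bound": "tensor_parallel_size",
}


def signature(config: Config) -> Signature:
    """Order-independent key used to recognize already-tested configs."""
    return tuple(sorted(config.items()))


def adjacent_choices(knob: str, current: int) -> List[int]:
    """Return the legal values directly below and above the current one."""
    choices = LEGAL_CHOICES[knob]
    if current not in choices:
        return []
    i = choices.index(current)
    return [choices[j] for j in (i - 1, i + 1) if 0 <= j < len(choices)]


def propose_probes(
    bottleneck: str,
    config: Config,
    tested: Set[Signature],
    guard: Callable[[Config], bool],
) -> List[Config]:
    """Adjacent, untested, guard-approved probes; an empty list means stop."""
    knob = BOTTLENECK_TO_KNOB.get(bottleneck)
    if knob is None:
        return []
    probes = []
    for value in adjacent_choices(knob, config[knob]):
        candidate = {**config, knob: value}
        if signature(candidate) in tested or not guard(candidate):
            continue
        probes.append(candidate)
    return probes


incumbent = {
    "tensor_parallel_size": 2,
    "max_num_seqs": 64,
    "max_num_batched_tokens": 8192,
}
probes = propose_probes("decode_queueing", incumbent, set(), lambda c: True)
print([p["max_num_seqs"] for p in probes])  # -> [32, 128]
```

In this sketch an empty probe list is the signal to stop rather than fall back to a weak exploratory proposal, which mirrors the last rule in the list.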
## Local Implementation Log
- Added `src/aituner/harness.py`.
  - Builds structured harness context for prompt injection.
  - Adds TP, max-num-seqs, max-num-batched-tokens, chunked-prefill, and memory-utilization harnesses when those knobs are tunable.
  - Extracts compact recent trial diagnostics from result JSON files.
  - Adds a convergence guard based on recent completed trial performance.
- Extended `src/aituner/trace.py`.
  - `summarize_window` now reports L-C-A features.
  - `TraceRequest` now carries optional metadata for `hash_ids`, turn, parent chat id, and trace type.
- Extended `src/aituner/llm.py`.
  - Prompt now includes tested config signatures and the structured harness section.
  - Prompt schema now asks for `should_stop`.
- Extended `src/aituner/spec.py`.
  - `Proposal` accepts optional `should_stop`.
- Extended `src/aituner/cli.py`.
  - `study tune` honors `should_stop=true` by recording the proposal and not launching another GPU trial.
- Extended `tests/test_core_flow.py`.
  - Prompt includes harness context.
  - Trace summary includes new L-C-A fields.
  - Proposal parsing accepts `should_stop`.
  - CLI does not launch a trial for a stop proposal.
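
To make the L-C-A features reported by `summarize_window` more concrete, here is a rough sketch of how such features could be computed. It does not reproduce `src/aituner/trace.py`: the `Request` dataclass, the `summarize_lca` function, the output key names, the nearest-rank percentile, and the exact definitions of reuse fraction and burst ratio are all assumptions.

```python
import math
import statistics
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical stand-in for a trace request."""
    arrival_s: float
    prompt_tokens: int
    hash_ids: List[int] = field(default_factory=list)


def percentile(sorted_values: List[int], q: float) -> int:
    """Nearest-rank percentile over an already sorted list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_lca(requests: List[Request]) -> Dict[str, float]:
    if not requests:
        raise ValueError("window is empty")
    reqs = sorted(requests, key=lambda r: r.arrival_s)

    # L: prompt length percentiles and tail ratio (p99 / p50).
    lengths = sorted(r.prompt_tokens for r in reqs)
    p50, p99 = percentile(lengths, 50), percentile(lengths, 99)

    # C: share of hash_ids blocks already seen earlier in the window.
    seen, total, reused = set(), 0, 0
    for r in reqs:
        for block in r.hash_ids:
            total += 1
            if block in seen:
                reused += 1
            seen.add(block)

    # A: request rate, 1-second burst ratio, and interarrival CV.
    duration = reqs[-1].arrival_s - reqs[0].arrival_s
    per_second = Counter(math.floor(r.arrival_s) for r in reqs)
    window_s = max(per_second) - min(per_second) + 1
    mean_per_second = len(reqs) / window_s
    gaps = [b.arrival_s - a.arrival_s for a, b in zip(reqs, reqs[1:])]
    mean_gap = statistics.fmean(gaps) if gaps else 0.0

    return {
        "prompt_p50": p50,
        "prompt_p99": p99,
        "tail_ratio": p99 / p50 if p50 else 0.0,
        "cache_reuse": reused / total if total else 0.0,
        "request_rate": len(reqs) / duration if duration > 0 else 0.0,
        "burst_ratio": max(per_second.values()) / mean_per_second,
        "interarrival_cv": statistics.pstdev(gaps) / mean_gap if mean_gap > 0 else 0.0,
    }
```

When a trace carries no `hash_ids`, this sketch reports a reuse fraction of 0.0; a real implementation may prefer to mark the feature as unavailable instead.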
## Local Verification
- `python3 -m compileall -q src tests`: passed.
- `PYTHONPATH=src python3 -m unittest tests.test_core_flow`: passed, 59 tests.
- `pytest -q` and `python3 -m pytest -q`: not runnable locally because `pytest` is not installed.
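
For readers without the repository at hand, the stop-proposal behavior that the suite checks can be sketched in a self-contained `unittest` case. The `Proposal` dataclass, `parse_proposal`, and `tune_step` below are hypothetical stand-ins, not the definitions in `src/aituner/spec.py` or `src/aituner/cli.py`; the JSON shape is also an assumption.

```python
import json
import unittest
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Proposal:
    """Hypothetical proposal with an optional stop flag."""
    config: Dict[str, Any] = field(default_factory=dict)
    should_stop: bool = False


def parse_proposal(text: str) -> Proposal:
    """Parse an LLM reply; a missing should_stop defaults to False."""
    data = json.loads(text)
    return Proposal(
        config=data.get("config", {}),
        should_stop=bool(data.get("should_stop", False)),
    )


def tune_step(
    proposal: Proposal,
    history: List[Proposal],
    launch_trial: Callable[[Dict[str, Any]], None],
) -> bool:
    """Record the proposal; launch a trial only if it is not a stop proposal."""
    history.append(proposal)
    if proposal.should_stop:
        return False
    launch_trial(proposal.config)
    return True


class StopProposalTest(unittest.TestCase):
    def test_stop_proposal_is_recorded_but_not_launched(self):
        history, launched = [], []
        proposal = parse_proposal('{"should_stop": true}')
        self.assertFalse(tune_step(proposal, history, launched.append))
        self.assertEqual(len(history), 1)
        self.assertEqual(launched, [])

    def test_regular_proposal_launches_a_trial(self):
        history, launched = [], []
        proposal = parse_proposal('{"config": {"max_num_seqs": 128}}')
        self.assertTrue(tune_step(proposal, history, launched.append))
        self.assertEqual(launched, [{"max_num_seqs": 128}])
```

Saved as a module, this runs under the standard-library runner (`python3 -m unittest <module>`), so it does not depend on `pytest` being installed.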
## Remote Experiment Log
Pending. Next steps:
1. Commit and push the harness implementation.
2. Pull on `dash0` in `/home/admin/cpfs/wjh/aituner/aituner`.
3. Start a real harness-guided Qwen3.5-27B 0-8k chat tuning run from `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`.
4. Compare the first few iterations against the prior 12-iteration behavior:
   - best request rate per GPU should improve or reach the known good region in fewer trials;
   - proposals should follow the active bottleneck harness;
   - if the incumbent has converged, the LLM should emit `should_stop=true` instead of proposing a weak exploratory config.
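
One possible way to quantify the comparison in step 4 is sketched below. The helper names `trials_to_reach` and `regressions_after_best` are assumptions, and the numbers in the example are made-up placeholders, not measurements from either run.

```python
from typing import List, Optional


def trials_to_reach(rates_per_gpu: List[float], target: float) -> Optional[int]:
    """1-based index of the first trial whose rate per GPU meets the target."""
    for i, rate in enumerate(rates_per_gpu, start=1):
        if rate >= target:
            return i
    return None


def regressions_after_best(rates_per_gpu: List[float]) -> int:
    """Count trials that score below the best rate seen before them."""
    best = float("-inf")
    count = 0
    for rate in rates_per_gpu:
        if rate < best:
            count += 1
        else:
            best = rate
    return count


# Placeholder per-trial request rates per GPU (illustrative only).
example_run = [1.0, 1.4, 1.2, 1.6, 1.3]
print(trials_to_reach(example_run, target=1.5))   # -> 4
print(regressions_after_best(example_run))        # -> 2
```

A harness-guided run would be expected to show a smaller `trials_to_reach` for the known good region and fewer regressions than the prior run, though the threshold for "known good" has to be chosen from the prior results.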