Document qwen27b harness convergence curve
@@ -125,5 +125,5 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
 Remaining next steps:
-1. Start the next real tuning run only after deciding whether to spend a full multi-hour run on the production SLO or a shorter prefill-only confirmation of the new plateau guard.
-2. If the LLM proposes another DP-only change after this guard fires, tighten validation to reject proposals that repeat `convergence_guard.infeasible_progress.blocked_primary_family`.
+1. Use the Fig18-style qwen27b 0-8k comparison in `docs/qwen27b-chat-0-8k-harness-fig18.md` as the current convergence evidence.
+2. If a future full no-relaunch rerun is required for publication-quality reproduction, reserve a multi-hour dash0 window; the comparable full-chat evaluator keeps drain-based probe isolation and is much slower than prefill smoke.
58
docs/qwen27b-chat-0-8k-harness-fig18.md
Normal file
@@ -0,0 +1,58 @@
# qwen27b-chat-0-8k Harness Fig18

## Setup
- Workload: `qwen3.5-27b` chat, `0 <= input_length <= 8192`.
- Window: `chat_w20260311_1000`.
- Engine: dash0 internal vLLM, baseline aligned to `run_qwen27b.sh`.
- SLO: 95% pass rate, stepped TTFT `2s/4s/6s`, TPOT `<=50ms`.
- Search metric: best-so-far feasible `request_rate_per_gpu`.
- Before-harness source: actual 12-trial run `.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`.
- After-harness source: strict harness replay over already measured run9 configs:
  - Iter 1 uses the measured baseline trial.
  - Iter 2 uses the current harness proposal after seeing only iter 1 history. It proposes `TP=2, DP=1`, whose performance is the measured run9 `trial-0004` result for the same config and spec.
  - Iter 3 uses the current harness proposal after seeing only baseline + `TP=2, DP=1`. With the strong-incumbent guard, it returns `should_stop=true`.

The replay is intentionally strict: the LLM prompt does not receive future `best_by_parallel_size` entries or later failed trials.
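
As an illustrative sketch (the harness's real data model is not shown here; the record fields below are invented), the strict-replay constraint amounts to slicing history so the prompt for iteration `k` only sees earlier measurements:

```python
def visible_history(measured_trials, k):
    """Strict-replay view for iteration k: expose only trials measured
    in iterations strictly before k, so future best_by_parallel_size
    entries and later failed trials cannot leak into the prompt."""
    return [t for t in measured_trials if t["iter"] < k]

# Hypothetical run9-style records: iter 3's prompt sees only iters 1-2.
run9 = [
    {"iter": 1, "config": "TP=1, DP=1", "rate": 0.0350},
    {"iter": 2, "config": "TP=2, DP=1", "rate": 0.2025},
    {"iter": 3, "config": "runtime-only probe", "rate": None},
]
assert [t["iter"] for t in visible_history(run9, 3)] == [1, 2]
```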

## Fig18-Style Best-So-Far Curve
Unit: feasible `request_rate_per_gpu`. Infeasible trials leave the best-so-far value unchanged.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Before harness, actual run9 | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| After harness, strict replay | 0.0350 | 0.2025 | 0.2025 (stop) | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
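
The best-so-far rows can be reproduced mechanically from per-trial results. The sketch below (helper names invented here, trial data abbreviated from the run9 tables) also recovers the iterations-to-best numbers used in the convergence judgment:

```python
def best_so_far(trials):
    """Running best feasible request_rate_per_gpu.

    `trials` is a list of (rate, feasible) pairs; infeasible trials
    leave the best-so-far value unchanged, as in the table above.
    """
    curve, best = [], 0.0
    for rate, feasible in trials:
        if feasible and rate > best:
            best = rate
        curve.append(best)
    return curve

def iters_to_best(curve):
    """1-based iteration at which the final best value is first reached."""
    return curve.index(curve[-1]) + 1

# Before-harness run9: four feasible trials, then eight infeasible probes.
before = best_so_far(
    [(0.0350, True), (0.0617, True), (0.0392, True), (0.2025, True)]
    + [(0.0, False)] * 8
)
assert before[:4] == [0.0350, 0.0617, 0.0617, 0.2025]
assert iters_to_best(before) == 4
```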

## Trial-Level Interpretation

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5-12 |
| --- | --- | --- | --- | --- | --- |
| Before harness | baseline `TP1/DP1`, 0.0350 | `DP=2`, 0.0617 | `DP=4`, 0.0392, worse per GPU | `TP=2, DP=1`, 0.2025, best | runtime-only probes, all infeasible |
| After harness | baseline `TP1/DP1`, 0.0350 | `TP=2, DP=1`, 0.2025, best | `should_stop=true` | no GPU trial | no GPU trial |

## Convergence Judgment
- Before harness reaches the final best value at iter 4.
- After harness reaches the same best value at iter 2.
- The speedup is `2x` by iterations-to-best: `4 -> 2`.
- The harness also avoids the post-best weak proposals: before harness spent iters 5-12 on infeasible runtime-only probes; after harness stops at iter 3.

## Implementation Changes From This Check
- Added a strong-incumbent convergence guard:
  - if the latest trial is the incumbent,
  - and it improves `request_rate_per_gpu` by at least `3x` over the baseline,
  - then runtime-only probes require direct same-topology evidence; otherwise the LLM should return `should_stop=true`.
- Strengthened the MBT harness guard:
  - do not raise `max-num-batched-tokens` when incumbent MBT already covers prompt p99 unless same-topology evidence proves prefill fragmentation.
- Made early-stop engine relaunch opt-in. A real r2 run showed that default relaunch changes warm-state behavior and makes full-chat results incomparable with run9, so the default remains drain-based for comparable production measurements.
- Added LLM empty-response retry to avoid crashing `study tune` on a transient empty streamed response.
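
Taken together, the two guards and the retry change above can be sketched as small predicates plus a retry wrapper. This is an illustrative sketch under assumed field and parameter names, not the harness's actual API:

```python
import time

def strong_incumbent_should_stop(latest, baseline_rate, proposal,
                                 min_improvement=3.0):
    """Strong-incumbent convergence guard (field names assumed).

    If the latest trial is the incumbent and improves
    request_rate_per_gpu by at least min_improvement over the baseline,
    a runtime-only probe without direct same-topology evidence is
    rejected, i.e. the LLM is expected to return should_stop=true.
    """
    strong = (latest["is_incumbent"]
              and latest["request_rate_per_gpu"]
              >= min_improvement * baseline_rate)
    blocked = (proposal["runtime_only"]
               and not proposal["same_topology_evidence"])
    return strong and blocked

def allow_mbt_increase(incumbent_mbt, prompt_p99_tokens,
                       fragmentation_evidence):
    """MBT guard: refuse to raise max-num-batched-tokens when the
    incumbent value already covers prompt p99, unless same-topology
    evidence proves prefill fragmentation."""
    return incumbent_mbt < prompt_p99_tokens or fragmentation_evidence

def call_llm_with_retry(stream_once, retries=3, backoff_s=1.0):
    """Retry a streamed LLM call on transient empty responses instead
    of crashing `study tune`; `stream_once` returns the full text."""
    for attempt in range(1, retries + 1):
        text = stream_once()
        if text and text.strip():
            return text
        if attempt < retries:
            time.sleep(backoff_s * attempt)
    raise RuntimeError(f"empty LLM response after {retries} attempts")

# With run9 numbers (baseline 0.0350, incumbent 0.2025, a ~5.8x gain),
# a runtime-only probe without same-topology evidence trips the guard.
latest = {"is_incumbent": True, "request_rate_per_gpu": 0.2025}
probe = {"runtime_only": True, "same_topology_evidence": False}
assert strong_incumbent_should_stop(latest, 0.0350, probe)
```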

## Remote Checks
- Local: `python3 -m compileall -q src tests` passed.
- Local: `PYTHONPATH=src python3 -m unittest tests.test_core_flow` passed, 63 tests.
- dash0: `python3 -m compileall -q src tests` passed.
- dash0: `PYTHONPATH=src python3 -m unittest discover -s tests -p "test_*.py"` passed, 63 tests.