# qwen27b-chat-0-8k Current-Config Fig18 Plan
## Question
The earlier tables used best-so-far throughput. That is useful for deciding the
best deployable incumbent, but it hides bad proposals because the curve is
monotonic by construction. To judge whether the harness makes tuning more
directional, the primary curve must be each iteration's measured current-config
performance.
## Why Final Performance Can Be Close
Harness and no-harness can converge to similar final throughput when the search
space contains one dominant simple family. In this setup the dominant family is
`TP=2, DP=1` over the `run_qwen27b.sh` baseline. The no-harness LLM can still
eventually discover that family within 12 iterations, so final best performance
can be close.
The harness is not necessarily expected to improve the final 12-iter maximum.
It should improve:
- iterations-to-first-good-config;
- the number of worse or infeasible proposals after an incumbent is found;
- oscillation in the measured-current curve;
- early-stop behavior once adjacent harness probes no longer justify more GPU
trials.
## Metrics
- `measured-current`: each trial's own feasible `request_rate_per_gpu`.
Failed or no-feasible-point trials are recorded as `NA`.
- `accepted-incumbent`: best deployable value after each trial. This is the
standard best-so-far curve and is monotonic by definition.
- `iters-to-best`: first iteration where the final best value or equivalent
config family appears.
- `wasted-trials-after-best`: trials after first best that are worse, infeasible,
or no-feasible-point.
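For concreteness, a minimal sketch of how these four metrics can be read off a
list of per-trial results. The input shape and names are illustrative, not the
study's actual schema, and the config-family-equivalence part of `iters-to-best`
is omitted here.

```python
from typing import Optional

def curves(rates: list[Optional[float]]):
    """rates[i] is trial i+1's feasible request_rate_per_gpu, or None for NA."""
    # measured-current: the raw per-trial values, NA kept as None.
    measured_current = rates
    # accepted-incumbent: running best feasible value, monotonic by construction.
    accepted_incumbent, best = [], None
    for r in rates:
        if r is not None and (best is None or r > best):
            best = r
        accepted_incumbent.append(best)
    # iters-to-best: first 1-based iteration whose measured value equals the
    # final best (family equivalence is ignored in this sketch).
    iters_to_best = next(
        (i for i, r in enumerate(rates, start=1) if r is not None and r == best),
        None,
    )
    # wasted-trials-after-best: later trials that are worse, infeasible, or NA.
    wasted = 0
    if iters_to_best is not None:
        wasted = sum(1 for r in rates[iters_to_best:] if r is None or r < best)
    return measured_current, accepted_incumbent, iters_to_best, wasted
```

Applied to the run9 measured-current row in the next section, this returns
`iters_to_best=4` and `wasted=8`.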
## Historical Run9 Re-Read
Source:
`.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`
on dash0.
| Variant | Curve | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA | NA | NA | NA | NA |
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
Interpretation: the no-harness current-config curve already has a regression at
iter 3 and then spends many iterations on no-feasible-point runtime probes. The
monotonic curve only shows the incumbent policy, not proposal quality.
## New Paired Test Plan
Run on dash0 with internal vLLM and the real `chat_w20260311_1000` 0-8k replay:
- Base spec: `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`.
- Model path:
`/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
- Naming note: local configs and dash0 model directories expose this setup as
Qwen3.5-27B/Qwen35-27B, not `qwen32b`.
- Engine: `/usr/local/bin/vllm`, baseline aligned with `~/run_qwen27b.sh`.
- SLO: 95% pass, stepped TTFT `2s/4s/6s`, TPOT `<=50ms`.
- Search: `low=0`, `high=0.0625`, `max_probes=6`, `tolerance=0.001` (see the
sketch at the end of this section).
- no-harness study:
`.aituner-tight/dash0-qwen27b-tight-slo-10min-run10-chat-0-8k-current-noharness`.
- harness study:
`.aituner-tight/dash0-qwen27b-tight-slo-10min-run10-chat-0-8k-current-harness`.
The result table will report both curves. The harness is considered successful
only if it reaches the same or better incumbent in fewer iterations and reduces
the measured-current regressions or replaces them with an explicit harness stop.
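For context on how the `search` block drives each trial, here is a minimal
sketch under the assumption that the search is a plain bisection on per-GPU
rate; `is_feasible` is a hypothetical callback standing in for one replay probe
against the SLO, and the real search logic may differ.

```python
def search_rate(is_feasible, low=0.0, high=0.0625, max_probes=6, tolerance=0.001):
    """Bisect for the highest feasible request rate per GPU (illustrative only).

    is_feasible(rate) is assumed to run one replay probe at `rate` and return
    True when the probe meets the 95% pass-rate SLO.
    """
    best = None
    for _ in range(max_probes):
        if high - low <= tolerance:
            break                      # resolution reached; stop probing
        mid = (low + high) / 2
        if is_feasible(mid):
            best, low = mid, mid       # feasible: push the lower bound up
        else:
            high = mid                 # infeasible: pull the ceiling down
    return best

# If every probe is feasible, six probes end at 0.0615234375, i.e. one
# bisection step (0.0625 / 2**6 = 0.0009765625) below search.high.
print(search_rate(lambda rate: True))  # -> 0.0615234375
```

The same arithmetic shows up later when `trial-0004` saturates at
`sampling_u=0.0615234375`.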
## Run Status
- 2026-05-06 07:05 CST: dash0 checked, 8 H20 GPUs idle.
- 2026-05-06 07:05 CST: generated paired specs under
`.aituner-tight/specs/`.
- 2026-05-06 07:05 CST: started no-harness full 12-iter run in tmux session
`qwen27b_run10_noharness_20260506`.
- 2026-05-06 07:18 CST: stopped the duplicate fresh no-harness run before
completion. Reason: run9 is already a completed real 12-iter no-harness run
for the same internal vLLM 0-8k setup, while the fresh full-chat run would
spend a multi-hour dash0 slot duplicating that curve.
- 2026-05-06 07:20 CST: seeded the harness study with the real run9 baseline
measurement as `trial-0001`, then started the harness run with
`--skip-baseline` in tmux session `qwen27b_run10_harness_skipbase_20260506`.
- 2026-05-06 07:20 CST: harness generated deterministic `trial-0002`:
`{"tensor-parallel-size": 2}`.
- 2026-05-06 08:11 CST: harness `trial-0002` completed:
`TP=2`, `0.2142 request_rate_per_gpu`.
- 2026-05-06 08:19 CST: harness `trial-0003` failed at engine launch.
Root cause: the old runtime refinement coupled `gpu-memory-utilization=0.95`
with a larger `max-num-batched-tokens`, causing speculative-sampler warmup OOM.
This is a generic harness safety bug; fixed locally by removing the automatic
memory-utilization bump from runtime refinement (the safer refinement is
sketched after this log).
- 2026-05-06 09:24 CST: harness `trial-0004` completed:
`TP=4`, `0.4429 request_rate_per_gpu`. All six probes were feasible up to
`sampling_u=0.0615234375`, so this study is near the configured
`search.high=0.0625` ceiling.
- 2026-05-06 09:25 CST: old harness repeated the same unsafe runtime refinement
for TP4 and `trial-0005` failed at engine launch for the same OOM reason. The
old process was stopped before continuing.
- 2026-05-06 09:37 CST: pulled commit `5d96689` on dash0 and resumed. The
runtime-refinement OOM was fixed, but the stop guard was still too strict: it
did not treat a feasible high-edge probe with a small number of SLO failures
as saturation, even though the probe already met the 95% pass-rate target.
- 2026-05-06 09:50 CST: stopped the unnecessary product-8 validation. The queued
`trial-0006`/`trial-0007` are not used for convergence claims.
- 2026-05-06 09:56 CST: pulled commit `f653af0` on dash0. The fixed high-edge
stop guard produced `harness-stop-0008` without launching another GPU trial.
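A minimal sketch of the safer runtime refinement mentioned in the
`trial-0003`/`trial-0005` entries. Only the two vLLM flags named above come
from the log; the helper name, scale factor, and default value are illustrative
assumptions, not the harness's actual code.

```python
def refine_runtime_args(engine_args: dict, mbt_scale: float = 2.0) -> dict:
    """Build a runtime-refinement candidate from the incumbent's engine args.

    Old behaviour (removed): also forced gpu-memory-utilization up to 0.95,
    which together with a larger max-num-batched-tokens caused warmup OOM.
    New behaviour: grow batching headroom only; leave memory pressure alone.
    """
    refined = dict(engine_args)
    current_mbt = refined.get("max-num-batched-tokens", 8192)  # illustrative default
    refined["max-num-batched-tokens"] = int(current_mbt * mbt_scale)
    # Deliberately no change to "gpu-memory-utilization" here.
    return refined
```

The point of the sketch is the design choice: batching headroom and memory
pressure are decoupled, so a refinement proposal can no longer trigger the
warmup OOM path by itself.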
## Current Results
Unit: feasible `request_rate_per_gpu`. `NA` means the current trial did not
produce a feasible deployable config.
| Variant | Curve | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA | NA | NA | NA | NA |
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| harness run10 | measured-current | 0.0350 | 0.2142 | NA | 0.4429 | NA | skipped | skipped | stop | | | | |
| harness run10 | accepted-incumbent | 0.0350 | 0.2142 | 0.2142 | 0.4429 | 0.4429 | 0.4429 | 0.4429 | 0.4429 (stop) | | | | |
The harness result is stronger than the earlier strict replay. It did not merely
reach the same TP2 region earlier; it then used the bottleneck/topology evidence
to validate TP4 and found a current config with much higher throughput.
## Interpretation
- Why both variants can look close when only best-so-far is shown: no-harness can
eventually find a good simple topology, and best-so-far hides every bad
proposal after that point.
- What the current-config curve shows: no-harness regresses at iter 3 and then
spends many iterations on no-feasible-point runtime probes. Harness reaches a
stronger TP2 config at iter 2 and a stronger TP4 config at iter 4.
- Why harness helped: the baseline diagnostics identify TTFT/prefill as the
active bottleneck on low-prefix-reuse long prompts. The harness maps that to
adjacent TP validation before DP/runtime exploration. The no-harness LLM chose
DP2 then DP4 first, which diluted per-GPU throughput and delayed TP.
- Defect fixed during the run: runtime refinement was too aggressive because it
combined a larger `max-num-batched-tokens` with higher memory utilization. It
now changes batching headroom without also raising memory pressure.
- Stop defect fixed during the run: high-edge probes can have a few individual
latency failures and still be feasible under the configured pass-rate SLO. The
stop guard now keys on `feasible=true` near `search.high`, not on an empty
failed-reason map (see the sketch after this list).
- Search-high implication: TP4 reached `sampling_u=0.0615234375` with
`search.high=0.0625`, so the current spec is saturated for this topology. A
higher `search.high` would be needed to determine whether TP4 can go even
higher in absolute throughput; it is not needed to show that the harness
converged faster than no-harness under this spec.
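A minimal sketch of the corrected high-edge stop guard described above. Names
are illustrative, and the saturation threshold is assumed to be one bisection
step of this spec's search range.

```python
def should_stop(probe_feasible: bool, probe_rate: float,
                search_high: float = 0.0625, max_probes: int = 6) -> bool:
    """Stop further GPU trials once the incumbent's probe saturates search.high.

    Old guard: only stopped on an empty failed-reason map, so a feasible probe
    with a handful of individual latency failures never triggered the stop.
    New guard: key on feasibility under the pass-rate SLO plus proximity to
    search.high within one binary-search resolution.
    """
    resolution = search_high / (2 ** max_probes)   # 0.0009765625 for this spec
    return probe_feasible and (search_high - probe_rate) <= resolution

# The TP4 probe: feasible at 0.0615234375, exactly one resolution below 0.0625.
print(should_stop(True, 0.0615234375))  # -> True
```

With a guard like this, the TP4 result at iter 4 is enough to emit
`harness-stop-0008` without launching another GPU trial.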
## Mechanism
The harness contributes structured information that is not specific to this
test case:
- Workload features: long-prompt 0-8k distribution, low prefix reuse, and smooth
arrivals.
- Bottleneck diagnosis from probes: baseline failures are TTFT/prefill-heavy, so
topology changes that reduce long-prefill latency should be tried before DP or
runtime batching.
- Topology adjacency: validate TP1 -> TP2 -> TP4 rather than jumping randomly or
repeating a failing runtime family (see the sketch at the end of this note).
- Stop condition: once the incumbent's feasible probe is within one binary-search
resolution of `search.high`, stop instead of spending more GPU trials.
Without the harness, the LLM in run9 proposed DP2 and then DP4 before TP2. That
temporarily improved total request rate but reduced per-GPU efficiency, so the
measured-current curve dipped at iter 3 and did not reach its best config until
iter 4. With the harness, the LLM receives the bottleneck/topology frame and
chooses TP-oriented validation; TP2 is reached at iter 2 and TP4 at iter 4.
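To make the ordering frame concrete, a small sketch of the adjacency rule
described in this section, assuming a hypothetical `next_topology_proposal`
helper and a TP ladder capped at the node's 8 GPUs; the real harness interface
may differ.

```python
# Adjacent tensor-parallel steps on the 8-GPU dash0 node.
TP_LADDER = [1, 2, 4, 8]

def next_topology_proposal(current_tp: int, bottleneck: str) -> dict:
    """Pick the next config family to validate (illustrative, not the real API).

    While the diagnosed bottleneck is TTFT/prefill, step TP up one adjacency
    (TP1 -> TP2 -> TP4) before exploring DP or runtime batching families.
    """
    if bottleneck == "ttft_prefill" and current_tp in TP_LADDER:
        idx = TP_LADDER.index(current_tp)
        if idx + 1 < len(TP_LADDER):
            return {"tensor-parallel-size": TP_LADDER[idx + 1]}
    # TP saturated or a different bottleneck: fall through to the next family.
    return {"data-parallel-size": 2}   # placeholder for DP / runtime exploration

print(next_topology_proposal(1, "ttft_prefill"))  # -> {'tensor-parallel-size': 2}
print(next_topology_proposal(2, "ttft_prefill"))  # -> {'tensor-parallel-size': 4}
```

In run9 the same decision point was taken in the DP direction first, which is
exactly the regression the current-config curve exposes.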