136 lines
7.1 KiB
Markdown
136 lines
7.1 KiB
Markdown
# qwen27b-chat-0-8k Current-Config Fig18 Plan
|
|
|
|
## Question
|
|
|
|
The earlier tables used best-so-far throughput. That is useful for deciding the
|
|
best deployable incumbent, but it hides bad proposals because the curve is
|
|
monotonic by construction. To judge whether the harness makes tuning more
|
|
directional, the primary curve must be each iteration's measured current config
|
|
performance.
|
|
|
|
## Why Final Performance Can Be Close
|
|
|
|
Harness and no-harness can converge to similar final throughput when the search
|
|
space contains one dominant simple family. In this setup the dominant family is
|
|
`TP=2, DP=1` over the `run_qwen27b.sh` baseline. The no-harness LLM can still
|
|
eventually discover that family within 12 iterations, so final best performance
|
|
can be close.
|
|
|
|
The difference the harness is expected to improve is not necessarily the final
|
|
12-iter maximum. It should improve:
|
|
|
|
- iterations-to-first-good-config;
|
|
- number of worse or infeasible proposals after an incumbent is found;
|
|
- measured-current config oscillation;
|
|
- early-stop behavior once adjacent harness probes no longer justify more GPU
|
|
trials.
|
|
|
|
## Metrics
|
|
|
|
- `measured-current`: each trial's own feasible `request_rate_per_gpu`.
|
|
Failed or no-feasible-point trials are recorded as `NA`.
|
|
- `accepted-incumbent`: best deployable value after each trial. This is the
|
|
standard best-so-far curve and is monotonic by definition.
|
|
- `iters-to-best`: first iteration where the final best value or equivalent
|
|
config family appears.
|
|
- `wasted-trials-after-best`: trials after first best that are worse, infeasible,
|
|
or no-feasible-point.
|
|
|
|
## Historical Run9 Re-Read
|
|
|
|
Source:
|
|
`.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`
|
|
on dash0.
|
|
|
|
| Variant | Curve | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|
|
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
|
|
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA | NA | NA | NA | NA |
|
|
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
|
|
|
|
Interpretation: the no-harness current-config curve already has a regression at
|
|
iter 3 and then many no-feasible-point runtime probes. The monotonic curve only
|
|
shows the incumbent policy, not proposal quality.
|
|
|
|
## New Paired Test Plan
|
|
|
|
Run on dash0 with internal vLLM and the real `chat_w20260311_1000` 0-8k replay:
|
|
|
|
- Base spec: `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`.
|
|
- Model path:
|
|
`/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
|
|
- Engine: `/usr/local/bin/vllm`, baseline aligned with `~/run_qwen27b.sh`.
|
|
- SLO: 95% pass, stepped TTFT `2s/4s/6s`, TPOT `<=50ms`.
|
|
- Search: `low=0`, `high=0.0625`, `max_probes=6`, `tolerance=0.001`.
|
|
- no-harness study:
|
|
`.aituner-tight/dash0-qwen27b-tight-slo-10min-run10-chat-0-8k-current-noharness`.
|
|
- harness study:
|
|
`.aituner-tight/dash0-qwen27b-tight-slo-10min-run10-chat-0-8k-current-harness`.
|
|
|
|
The result table will report both curves. The harness is considered successful
|
|
only if it reaches the same or better incumbent in fewer iterations and reduces
|
|
the measured-current regressions or replaces them with an explicit harness stop.
|
|
|
|
## Run Status
|
|
|
|
- 2026-05-06 07:05 CST: dash0 checked, 8 H20 GPUs idle.
|
|
- 2026-05-06 07:05 CST: generated paired specs under
|
|
`.aituner-tight/specs/`.
|
|
- 2026-05-06 07:05 CST: started no-harness full 12-iter run in tmux session
|
|
`qwen27b_run10_noharness_20260506`.
|
|
- 2026-05-06 07:18 CST: stopped the duplicate fresh no-harness run before
|
|
completion. Reason: run9 is already a completed real 12-iter no-harness run
|
|
for the same internal vLLM 0-8k setup, while the fresh full-chat run would
|
|
spend a multi-hour dash0 slot duplicating that curve.
|
|
- 2026-05-06 07:20 CST: seeded the harness study with the real run9 baseline
|
|
measurement as `trial-0001`, then started the harness run with
|
|
`--skip-baseline` in tmux session `qwen27b_run10_harness_skipbase_20260506`.
|
|
- 2026-05-06 07:20 CST: harness generated deterministic `trial-0002`:
|
|
`{"tensor-parallel-size": 2}`.
|
|
- 2026-05-06 08:11 CST: harness `trial-0002` completed:
|
|
`TP=2`, `0.2142 request_rate_per_gpu`.
|
|
- 2026-05-06 08:19 CST: harness `trial-0003` failed at engine launch.
|
|
Root cause: the old runtime refinement coupled `gpu-memory-utilization=0.95`
|
|
with larger `max-num-batched-tokens`, causing speculative sampler warmup OOM.
|
|
This is a generic harness safety bug; fixed locally by removing the automatic
|
|
memory-utilization bump from runtime refinement.
|
|
- 2026-05-06 09:24 CST: harness `trial-0004` completed:
|
|
`TP=4`, `0.4429 request_rate_per_gpu`. All six probes were feasible up to
|
|
`sampling_u=0.0615234375`, so this study is near the configured
|
|
`search.high=0.0625` ceiling.
|
|
- 2026-05-06 09:25 CST: old harness repeated the same unsafe runtime refinement
|
|
for TP4 and `trial-0005` failed at engine launch for the same OOM reason. The
|
|
old process was stopped before continuing.
|
|
|
|
## Current Results
|
|
|
|
Unit: feasible `request_rate_per_gpu`. `NA` means the current trial did not
|
|
produce a feasible deployable config.
|
|
|
|
| Variant | Curve | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|
|
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
|
|
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA | NA | NA | NA | NA |
|
|
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
|
|
| harness run10 | measured-current | 0.0350 | 0.2142 | NA | 0.4429 | NA | pending | pending | pending | pending | pending | pending | pending |
|
|
| harness run10 | accepted-incumbent | 0.0350 | 0.2142 | 0.2142 | 0.4429 | 0.4429 | pending | pending | pending | pending | pending | pending | pending |
|
|
|
|
The harness result is stronger than the earlier strict replay. It did not merely
|
|
reach the same TP2 region earlier; it then used the bottleneck/topology evidence
|
|
to validate TP4 and found a much higher current config.
|
|
|
|
## Interpretation
|
|
|
|
- Why both variants can look close when only best-so-far is shown: no-harness can
|
|
eventually find a good simple topology, and best-so-far hides every bad
|
|
proposal after that point.
|
|
- What the current-config curve shows: no-harness regresses at iter 3 and then
|
|
spends many iterations on no-feasible-point runtime probes. Harness reaches a
|
|
stronger TP2 config at iter 2 and a stronger TP4 config at iter 4.
|
|
- Why harness helped: the baseline diagnostics identify TTFT/prefill as the
|
|
active bottleneck on low-prefix-reuse long prompts. The harness maps that to
|
|
adjacent TP validation before DP/runtime exploration. The no-harness LLM chose
|
|
DP2 then DP4 first, which diluted per-GPU throughput and delayed TP.
|
|
- Remaining defect found during the run: runtime refinement was too aggressive
|
|
because it combined larger MBT with higher memory utilization. This has been
|
|
fixed so future runtime validation changes batching headroom without also
|
|
raising memory pressure.
|