Document qwen27b current config harness curve

This commit is contained in:
2026-05-06 18:00:43 +08:00
parent f653af09a8
commit 98cd6dd81a


@@ -58,6 +58,8 @@ Run on dash0 with internal vLLM and the real `chat_w20260311_1000` 0-8k replay:
- Base spec: `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`.
- Model path:
`/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
- Naming note: local configs and dash0 model directories expose this setup as
Qwen3.5-27B/Qwen35-27B, not `qwen32b`.
- Engine: `/usr/local/bin/vllm`, baseline aligned with `~/run_qwen27b.sh`.
- SLO: 95% pass, stepped TTFT `2s/4s/6s`, TPOT `<=50ms`.
- Search: `low=0`, `high=0.0625`, `max_probes=6`, `tolerance=0.001`.
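The SLO and search bullets above amount to a per-probe feasibility check. A minimal Python sketch, assuming the stepped TTFT limits are keyed by prompt-length buckets; only the thresholds (2s/4s/6s, `<=50ms` TPOT, 95% pass) come from the spec, while the bucket boundaries and all names are illustrative:

```python
# Minimal sketch of the pass-rate SLO check implied by the bullets above.
# Only the thresholds come from the spec; the prompt-length bucket
# boundaries and all names here are assumptions.

TTFT_STEPS = [(2000, 2.0), (5000, 4.0), (8000, 6.0)]  # (max tokens, TTFT limit s)
TPOT_LIMIT_MS = 50.0
PASS_TARGET = 0.95

def ttft_limit(prompt_tokens):
    """Stepped TTFT limit for a prompt length (illustrative buckets)."""
    for upper, limit in TTFT_STEPS:
        if prompt_tokens <= upper:
            return limit
    return TTFT_STEPS[-1][1]

def feasible(requests):
    """requests: iterable of (prompt_tokens, ttft_s, tpot_ms).
    Feasible when >= 95% of requests meet both latency limits,
    even if a few individual requests fail."""
    reqs = list(requests)
    passed = sum(1 for p, ttft, tpot in reqs
                 if ttft <= ttft_limit(p) and tpot <= TPOT_LIMIT_MS)
    return passed / len(reqs) >= PASS_TARGET
```

Note the pass-rate framing: a probe with 19 of 20 requests passing is feasible at the 95% target, which is exactly the case the original stop guard mishandled.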
@@ -100,6 +102,14 @@ the measured-current regressions or replaces them with an explicit harness stop.
- 2026-05-06 09:25 CST: old harness repeated the same unsafe runtime refinement
for TP4 and `trial-0005` failed at engine launch for the same OOM reason. The
old process was stopped before continuing.
- 2026-05-06 09:37 CST: pulled commit `5d96689` on dash0 and resumed. The
runtime-refinement OOM was fixed, but the stop guard was still too strict: it
did not treat a feasible high-edge probe with a small number of SLO failures
as saturation, even though the probe already met the 95% pass-rate target.
- 2026-05-06 09:50 CST: stopped the unnecessary product-8 validation. The queued
`trial-0006`/`trial-0007` are not used for convergence claims.
- 2026-05-06 09:56 CST: pulled commit `f653af0` on dash0. The fixed high-edge
stop guard produced `harness-stop-0008` without launching another GPU trial.
## Current Results
@@ -110,8 +120,8 @@ produce a feasible deployable config.
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA | NA | NA | NA | NA |
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| harness run10 | measured-current | 0.0350 | 0.2142 | NA | 0.4429 | NA | skipped | skipped | stop | | | | |
| harness run10 | accepted-incumbent | 0.0350 | 0.2142 | 0.2142 | 0.4429 | 0.4429 | 0.4429 | 0.4429 | 0.4429 stop | | | | |
The harness result is stronger than the earlier strict replay. It did not merely
reach the same TP2 region earlier; it then used the bottleneck/topology evidence
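For readers reconstructing the table above, the accepted-incumbent rows are a running maximum over the measured-current values, carried forward through NA/skipped iterations. A minimal sketch (not the harness's actual code):

```python
# Sketch: derive an accepted-incumbent row from a measured-current row.
# The incumbent only moves up, and is carried forward through iterations
# where no new measurement exists (NA/skipped, modeled as None).

def accepted_incumbent(measured):
    """measured: per-iteration values; None for NA/skipped trials."""
    best = None
    out = []
    for v in measured:
        if v is not None and (best is None or v > best):
            best = v
        out.append(best)
    return out

# harness run10, iters 0-4 from the table:
print(accepted_incumbent([0.0350, 0.2142, None, 0.4429, None]))
# [0.035, 0.2142, 0.2142, 0.4429, 0.4429]
```

The same rule reproduces the run9 row: measured `0.0350, 0.0617, 0.0392, 0.2025` yields accepted `0.0350, 0.0617, 0.0617, 0.2025`.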
@@ -129,7 +139,35 @@ to validate TP4 and found a much higher current config.
active bottleneck on low-prefix-reuse long prompts. The harness maps that to
adjacent TP validation before DP/runtime exploration. The no-harness LLM chose
DP2 then DP4 first, which diluted per-GPU throughput and delayed TP.
- Defect fixed during the run: runtime refinement was too aggressive because it
combined larger MBT with higher memory utilization. It now changes batching
headroom without also raising memory pressure.
- Stop defect fixed during the run: high-edge probes can have a few individual
latency failures and still be feasible under the configured pass-rate SLO. The
stop guard now keys on `feasible=true` near `search.high`, not on an empty
failed-reason map.
- Search-high implication: TP4 reached `sampling_u=0.0615234375` with
`search.high=0.0625`, so the current spec is saturated for this topology. A
higher `search.high` would be required to distinguish whether TP4 can go even
higher in absolute throughput; it is not needed to show that harness converged
faster than no-harness under this spec.
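The stop-guard fix described above can be sketched concretely. Probe fields, the failure count, and the function name are hypothetical; the point is keying on feasibility (pass-rate SLO met) near `search.high` rather than on an empty failed-reason map:

```python
# Sketch of the fixed high-edge stop guard. All names are hypothetical.

def should_stop_high_edge(probe, search_high, tolerance=0.001):
    near_high = (search_high - probe["sampling_u"]) <= tolerance
    # Old, too-strict guard (kept for contrast): demanded zero individual
    # SLO failures, so a feasible 95%+ probe with a few slow requests
    # triggered more GPU trials instead of a harness stop.
    # old_guard = near_high and not probe["failed_reasons"]
    return near_high and probe["feasible"]

# A TP4-style high-edge probe: feasible despite a few TTFT failures
# (the failure count here is illustrative).
probe = {"sampling_u": 0.0615234375, "feasible": True,
         "failed_reasons": {"ttft": 3}}
print(should_stop_high_edge(probe, search_high=0.0625))  # True
```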
## Mechanism
The harness contributes structured, non-testcase-specific information:
- Workload features: long-prompt 0-8k distribution, low prefix reuse, and smooth
arrivals.
- Bottleneck diagnosis from probes: baseline failures are TTFT/prefill-heavy, so
topology changes that reduce long-prefill latency should be tried before DP or
runtime batching.
- Topology adjacency: validate TP1 -> TP2 -> TP4 rather than jumping randomly or
repeating a failing runtime family.
- Stop condition: once the incumbent's feasible probe is within one binary-search
resolution of `search.high`, stop instead of spending more GPU trials.
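The stop condition in the last bullet can be checked numerically against the Search bullet's parameters (`low=0`, `high=0.0625`, `max_probes=6`, `tolerance=0.001`). Assuming every midpoint probe is feasible, bisection leaves the incumbent exactly one resolution step below `search.high`, matching the reported TP4 value; the harness's real loop may differ from this sketch:

```python
# Numeric check of the binary-search saturation claim.

low, high = 0.0, 0.0625
for _ in range(6):               # max_probes = 6
    mid = (low + high) / 2
    probe_feasible = True        # assume the probe at `mid` passes the SLO
    if probe_feasible:
        low = mid                # incumbent moves up toward search.high
    else:
        high = mid

print(low)                       # 0.0615234375, the TP4 value reported above
print(high - low)                # 0.0009765625, just under tolerance=0.001
```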
Without the harness, the LLM response in run9 chose DP2 and DP4 before TP2. That
temporarily improved total request rate but reduced per-GPU efficiency, so the
measured-current curve dipped at iter 3 and reached the old best only at iter 4.
With the harness, the LLM receives the bottleneck/topology frame and chooses
TP-oriented validation; TP2 is reached at iter 2 and TP4 at iter 4.
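The run9-versus-run10 ordering contrast above can be summarized as a candidate-ordering rule: under a prefill-heavy (TTFT-dominated) bottleneck, try the adjacent TP step before DP or runtime-batching candidates. A hypothetical sketch; the names, the TP cap, and the DP/runtime candidates are illustrative, not harness API:

```python
# Hypothetical candidate-ordering rule; not the harness's actual API.

def next_candidates(current_tp, bottleneck):
    tp_step = [("tp", current_tp * 2)] if current_tp < 8 else []
    dp_runtime = [("dp", 2), ("runtime", "larger-mbt")]
    if bottleneck == "prefill":   # harness path in run10: TP1 -> TP2 -> TP4
        return tp_step + dp_runtime
    return dp_runtime + tp_step   # run9-style ordering: DP first, TP delayed

print(next_candidates(2, "prefill"))
# [('tp', 4), ('dp', 2), ('runtime', 'larger-mbt')]
```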