Document qwen27b current config harness curve
@@ -58,6 +58,8 @@ Run on dash0 with internal vLLM and the real `chat_w20260311_1000` 0-8k replay:
 - Base spec: `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`.
 - Model path:
   `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
+- Naming note: local configs and dash0 model directories expose this setup as
+  Qwen3.5-27B/Qwen35-27B, not `qwen32b`.
 - Engine: `/usr/local/bin/vllm`, baseline aligned with `~/run_qwen27b.sh`.
 - SLO: 95% pass, stepped TTFT `2s/4s/6s`, TPOT `<=50ms`.
 - Search: `low=0`, `high=0.0625`, `max_probes=6`, `tolerance=0.001`.
@@ -100,6 +102,14 @@ the measured-current regressions or replaces them with an explicit harness stop.
 - 2026-05-06 09:25 CST: old harness repeated the same unsafe runtime refinement
   for TP4 and `trial-0005` failed at engine launch for the same OOM reason. The
   old process was stopped before continuing.
+- 2026-05-06 09:37 CST: pulled commit `5d96689` on dash0 and resumed. The
+  runtime-refinement OOM was fixed, but the stop guard was still too strict: it
+  did not treat a feasible high-edge probe with a small number of SLO failures
+  as saturation, even though the probe already met the 95% pass-rate target.
+- 2026-05-06 09:50 CST: stopped the unnecessary product-8 validation. The queued
+  `trial-0006`/`trial-0007` are not used for convergence claims.
+- 2026-05-06 09:56 CST: pulled commit `f653af0` on dash0. The fixed high-edge
+  stop guard produced `harness-stop-0008` without launching another GPU trial.
 
 ## Current Results
 
@@ -110,8 +120,8 @@ produce a feasible deployable config.
 | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
 | no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA | NA | NA | NA | NA |
 | no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
-| harness run10 | measured-current | 0.0350 | 0.2142 | NA | 0.4429 | NA | pending | pending | pending | pending | pending | pending | pending |
-| harness run10 | accepted-incumbent | 0.0350 | 0.2142 | 0.2142 | 0.4429 | 0.4429 | pending | pending | pending | pending | pending | pending | pending |
+| harness run10 | measured-current | 0.0350 | 0.2142 | NA | 0.4429 | NA | skipped | skipped | stop | | | | |
+| harness run10 | accepted-incumbent | 0.0350 | 0.2142 | 0.2142 | 0.4429 | 0.4429 | 0.4429 | 0.4429 | 0.4429 stop | | | | |
 
 The harness result is stronger than the earlier strict replay. It did not merely
 reach the same TP2 region earlier; it then used the bottleneck/topology evidence
@@ -129,7 +139,35 @@ to validate TP4 and found a much higher current config.
   active bottleneck on low-prefix-reuse long prompts. The harness maps that to
   adjacent TP validation before DP/runtime exploration. The no-harness LLM chose
   DP2 then DP4 first, which diluted per-GPU throughput and delayed TP.
-- Remaining defect found during the run: runtime refinement was too aggressive
-  because it combined larger MBT with higher memory utilization. This has been
-  fixed so future runtime validation changes batching headroom without also
-  raising memory pressure.
+- Defect fixed during the run: runtime refinement was too aggressive because it
+  combined larger MBT with higher memory utilization. It now changes batching
+  headroom without also raising memory pressure.
+- Stop defect fixed during the run: high-edge probes can have a few individual
+  latency failures and still be feasible under the configured pass-rate SLO. The
+  stop guard now keys on `feasible=true` near `search.high`, not on an empty
+  failed-reason map.
+- Search-high implication: TP4 reached `sampling_u=0.0615234375` with
+  `search.high=0.0625`, so the current spec is saturated for this topology. A
+  higher `search.high` would be required to distinguish whether TP4 can go even
+  higher in absolute throughput; it is not needed to show that harness converged
+  faster than no-harness under this spec.
+
+## Mechanism
+
+The harness contributes structured, non-testcase-specific information:
+
+- Workload features: long-prompt 0-8k distribution, low prefix reuse, and smooth
+  arrivals.
+- Bottleneck diagnosis from probes: baseline failures are TTFT/prefill-heavy, so
+  topology changes that reduce long-prefill latency should be tried before DP or
+  runtime batching.
+- Topology adjacency: validate TP1 -> TP2 -> TP4 rather than jumping randomly or
+  repeating a failing runtime family.
+- Stop condition: once the incumbent's feasible probe is within one binary-search
+  resolution of `search.high`, stop instead of spending more GPU trials.
+
+Without the harness, the LLM response in run9 chose DP2 and DP4 before TP2. That
+temporarily improved total request rate but reduced per-GPU efficiency, so the
+measured-current curve dipped at iter 3 and reached the old best only at iter 4.
+With the harness, the LLM receives the bottleneck/topology frame and chooses
+TP-oriented validation; TP2 is reached at iter 2 and TP4 at iter 4.
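
Note on the stop condition added in the last hunk: the sketch below is one possible reading of it, not the harness's actual code. It assumes that "one binary-search resolution" means `(high - low) / 2**max_probes` and that feasibility is simply the pass rate meeting the 95% target. The names `Probe`, `is_feasible`, `search_resolution`, and `should_stop` are hypothetical, and so are the request counts in the usage lines. Under that assumption the resolution for `low=0`, `high=0.0625`, `max_probes=6` is `0.0009765625`, and the reported `sampling_u=0.0615234375` sits exactly one step below `search.high`.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One load probe (hypothetical structure, not the harness's own type)."""
    sampling_u: float  # load level that was probed
    passed: int        # requests that met the latency SLO
    total: int         # requests replayed in this probe


def is_feasible(probe: Probe, target_pass_rate: float = 0.95) -> bool:
    # A probe may contain a few individual latency failures and still be
    # feasible, as long as the overall pass rate meets the target.
    return probe.total > 0 and probe.passed / probe.total >= target_pass_rate


def search_resolution(low: float, high: float, max_probes: int) -> float:
    # Assumption: each probe halves the interval, so the finest step the
    # search can resolve is the interval width divided by 2**max_probes.
    return (high - low) / 2 ** max_probes


def should_stop(probe: Probe, low: float, high: float, max_probes: int,
                target_pass_rate: float = 0.95) -> bool:
    # Stop only when the probe is feasible AND already within one resolution
    # step of the upper search bound; the guard keys on feasibility, not on
    # the absence of individual failures.
    if not is_feasible(probe, target_pass_rate):
        return False
    return high - probe.sampling_u <= search_resolution(low, high, max_probes)


# Search settings from the spec; request counts are made up for illustration.
tp4_probe = Probe(sampling_u=0.0615234375, passed=97, total=100)
print(search_resolution(0.0, 0.0625, 6))
print(should_stop(tp4_probe, low=0.0, high=0.0625, max_probes=6))
```

Keying the guard on the pass rate rather than on a zero-failure probe is what lets a high-edge probe with a handful of SLO misses count as saturation, which matches the behavior the 09:56 log entry describes.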