qwen27b-chat-0-8k Setup and Result Audit
Purpose
This note audits the 2026-05-06 qwen27b chat 0-8k harness result because the
new best 0.4429 request_rate_per_gpu is much higher than the previous
no-harness best 0.2025.
Setup
- Host: dash0.
- Hardware: 8 NVIDIA H20 GPUs.
- Engine: internal vLLM at /usr/local/bin/vllm.
- Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal.
- Served model name: qwen35-27b-aituner.
- Workload window: chat_w20260311_1000.
- Trace file source: trace_windows/windows.json.
- Request mode: chat.
- Input bucket: 0 <= input_length <= 8192.
- Replay scale: 1.0.
- Max concurrency: 32.
- Max requests per probe: unset, so the full selected trace subset is replayed.
- Search field: sampling_u.
- Search range: low=0.0, high=0.0625.
- Search probes: max_probes=6, tolerance=0.001.
- Sampling seed: 20260325 (a config sketch follows this list).
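For concreteness, the setup above could be expressed as a harness config roughly like the following. This is a hedged sketch: the dictionary layout and key names are assumptions for illustration, not the harness's actual schema.

```python
# Hypothetical config mirroring the setup list above; key names are
# illustrative, not the harness's real schema.
setup = {
    "host": "dash0",
    "served_model_name": "qwen35-27b-aituner",
    "workload_window": "chat_w20260311_1000",
    "trace_file": "trace_windows/windows.json",
    "request_mode": "chat",
    "input_bucket": (0, 8192),       # 0 <= input_length <= 8192
    "replay_scale": 1.0,
    "max_concurrency": 32,
    "max_requests_per_probe": None,  # unset: replay the full trace subset
    "search": {
        "field": "sampling_u",
        "low": 0.0,
        "high": 0.0625,
        "max_probes": 6,
        "tolerance": 0.001,
    },
    "sampling_seed": 20260325,
}
```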
The local configs and dash0 model directories name this setup Qwen3.5-27B /
Qwen35-27B. I did not find a qwen32b model/config for this internal chat
0-8k setup.
SLO
- Target pass rate: 0.95.
- TTFT rule: stepped by input length.

| Input tokens | TTFT threshold |
|---|---|
| <=4096 | 2000 ms |
| <=32768 | 4000 ms |
| otherwise | 6000 ms |

- TPOT rule: fixed <=50 ms.
A probe is feasible when its pass rate is at least 0.95. Individual requests
may still fail TTFT/TPOT while the whole probe remains feasible.
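As a sanity check on that definition, here is a minimal sketch of the SLO rules in Python. The per-request record shape (input_tokens, ttft_ms, tpot_ms) is an assumed representation, not the harness's actual data model.

```python
def ttft_threshold_ms(input_tokens: int) -> float:
    """Stepped TTFT threshold from the SLO table above."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0

TPOT_THRESHOLD_MS = 50.0
TARGET_PASS_RATE = 0.95

def probe_is_feasible(requests: list[tuple[int, float, float]]) -> bool:
    """requests: (input_tokens, ttft_ms, tpot_ms) per completed request.

    Individual requests may miss TTFT/TPOT; the probe is feasible as
    long as the aggregate pass rate meets the 0.95 target.
    """
    passed = sum(
        ttft <= ttft_threshold_ms(tokens) and tpot <= TPOT_THRESHOLD_MS
        for tokens, ttft, tpot in requests
    )
    return passed / len(requests) >= TARGET_PASS_RATE
```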
Compared Studies
| Variant | Study root | Notes |
|---|---|---|
| no-harness | .aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology | completed 12-trial historical run |
| harness | .aituner-tight/dash0-qwen27b-tight-slo-10min-run10-chat-0-8k-current-harness | seeded with run9 baseline, then ran real harness trials |
The harness run reused the real run9 baseline as trial-0001 to avoid
duplicating a multi-hour cold-start baseline measurement. Later harness trials
were real dash0 runs.
Metric
The reported metric is request_rate_per_gpu:
request_rate_per_gpu = best_feasible_request_rate / parallel_size
parallel_size = tensor_parallel_size * data_parallel_size
The result JSON stores best_request_rate; StudyStore.ingest_trial_results
derives best_request_rate_per_gpu from the trial spec topology.
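Under that definition the derivation is a one-line division. A minimal sketch (the function name is illustrative; StudyStore.ingest_trial_results is only described here, not reproduced):

```python
def request_rate_per_gpu(best_request_rate: float,
                         tensor_parallel_size: int,
                         data_parallel_size: int) -> float:
    """Per-GPU throughput as defined above: raw best feasible request
    rate divided by the trial's parallel size (TP * DP)."""
    parallel_size = tensor_parallel_size * data_parallel_size
    return best_request_rate / parallel_size

# run10 trial-0004 (TP=4, DP=1): 1.7716666667 / 4 ~= 0.4429
assert round(request_rate_per_gpu(1.7716666667, 4, 1), 4) == 0.4429
```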
Result Table
Unit: feasible request_rate_per_gpu.
| Variant | Curve | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 |
|---|---|---|---|---|---|---|---|---|---|
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA |
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| harness run10 | measured-current | 0.0350 | 0.2142 | NA | 0.4429 | NA | skipped | skipped | stop |
| harness run10 | accepted-incumbent | 0.0350 | 0.2142 | 0.2142 | 0.4429 | 0.4429 | 0.4429 | 0.4429 | 0.4429 stop |
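The accepted-incumbent rows are consistent with a running maximum over the measured-current values, with NA/skipped iterations carrying the prior incumbent forward. A minimal sketch of that reading, using the run9 row as input:

```python
def incumbent_curve(measured):
    """Running max of measured values; None marks NA/skipped iterations,
    which keep the previous incumbent."""
    best = None
    curve = []
    for value in measured:
        if value is not None and (best is None or value > best):
            best = value
        curve.append(best)
    return curve

run9_measured = [0.0350, 0.0617, 0.0392, 0.2025, None, None, None, None]
print(incumbent_curve(run9_measured))
# [0.035, 0.0617, 0.0617, 0.2025, 0.2025, 0.2025, 0.2025, 0.2025]
```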
Why 0.4429 Is Plausible
The new value is not a case of the old TP2 config suddenly doubling; the comparable TP2 results are close:

| Study | Trial | Config | Best request rate | Parallel size | request_rate_per_gpu |
|---|---|---|---|---|---|
| run9 | trial-0004 | TP=2, DP=1 | 0.4050 | 2 | 0.2025 |
| run10 | trial-0002 | TP=2 | 0.4283 | 2 | 0.2142 |
The large jump comes from a new topology that run9 did not evaluate:
| Study | Trial | Config | Best request rate | Parallel size | request_rate_per_gpu |
|---|---|---|---|---|---|
| run10 | trial-0004 | TP=4 | 1.7717 | 4 | 0.4429 |
At the winning TP4 probe:
- sampling_u = 0.0615234375;
- request count: 1063;
- request rate: 1.7717 req/s;
- pass rate: 0.9680;
- p95 TTFT: 1476.9 ms;
- p95 TPOT: 44.4 ms.
This satisfies the configured SLO and is within one binary-search resolution of
search.high=0.0625.
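Assuming a standard bisection over [low, high] (the exact probe schedule is an assumption), six probes give a finest step of (high - low) / 2^6 = 0.0009765625, just under the 0.001 tolerance, and the winning sampling_u sits exactly one step below the high edge:

```python
low, high, max_probes = 0.0, 0.0625, 6

# Finest bisection step after max_probes halvings of the interval.
resolution = (high - low) / 2 ** max_probes
print(resolution)         # 0.0009765625, just under tolerance=0.001
print(high - resolution)  # 0.0615234375, the winning sampling_u
```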
Correctness Audit
The following fields match between run9 and run10, except for intentionally
different identity fields such as study id and port:
- model path and served model name;
- internal vLLM executable;
- base launch flags other than port;
- trace window chat_w20260311_1000;
- input-length filter 0-8192;
- replay scale 1.0;
- max concurrency 32;
- full selected trace replay, no max_requests_per_probe;
- SLO target and TTFT/TPOT thresholds;
- search high=0.0625, max_probes=6, tolerance=0.001, seed 20260325;
- metric definition best_request_rate / (TP * DP).
Checked differences and their impact:
- Port differs: run9 used 18087, run10 used 18082; this should not affect
  measured throughput.
- run10 has explicit restart_engine_after_early_stop=false; chat studies
  default to the same behavior.
- run10 has explicit completion_tokens_override=null; equivalent to run9's
  absent field.
- run9 trial-0004's search floor was 0.00390625 because it reused the
  incumbent for the same parallel-size group; run10 trial-0004's search floor
  was 0.0 because pure TP=4 had not been tried. Both have the same high and
  probe budget, so this does not explain the higher result.
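The field comparison above amounts to diffing the two trial configs while ignoring the fields expected to differ. A hypothetical helper illustrating that check (field names are assumptions, not the harness's actual config keys):

```python
# Fields that are allowed to differ between the two runs.
IDENTITY_FIELDS = {"study_id", "port"}

def config_diff(run_a: dict, run_b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every non-identity field
    where the two run configs disagree. An empty result supports the
    claim that both runs measured the same workload and search."""
    keys = (run_a.keys() | run_b.keys()) - IDENTITY_FIELDS
    return {
        k: (run_a.get(k), run_b.get(k))
        for k in keys
        if run_a.get(k) != run_b.get(k)
    }
```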
No metric-code logic error was found in the audit. The result JSONs store raw
request rate, and the state computes per-GPU throughput by dividing by
TP*DP. For run10 TP4, 1.7716666667 / 4 = 0.4429166667.
Issues Found During The Test
Two harness bugs were found and fixed:
- Runtime refinement coupled larger max-num-batched-tokens with
  gpu-memory-utilization=0.95, which caused launch-time OOM. Fixed in commit
  5d96689.
- The search-high stop guard incorrectly required no individual SLO failures
  at a feasible high-edge probe. Fixed in commit f653af0; feasibility already
  means the probe passed the configured pass-rate SLO.
The queued product-8 trial-0006 and trial-0007 were stopped after the
stop-guard fix and are not used in the convergence claim.
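A sketch of the corrected stop guard, with assumed field names; before the fix, the guard additionally required zero individual SLO failures:

```python
def should_stop_at_search_high(probe) -> bool:
    """Corrected stop guard (probe field names are assumptions).

    Feasibility already means the probe met the configured pass-rate
    SLO, so individual request failures must not block the stop.
    """
    # Pre-fix (wrong): ... and probe.num_slo_failures == 0
    return probe.at_search_high and probe.is_feasible
```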
Conclusion
The 0.4429 result is compared under the same workload, SLO, search range, and
metric definition as the previous 0.2025 result. It is higher because
no-harness run9 never evaluated pure TP=4; the harness guided the search from
the TTFT/prefill bottleneck to adjacent TP validation and found that topology
by iter 4.
Because TP4 nearly saturates the configured search.high, a follow-up run with
a higher search.high is needed to measure the absolute ceiling. That follow-up
is separate from the current convergence comparison.