qwen27b-chat-0-8k Setup and Result Audit

Purpose

This note audits the 2026-05-06 qwen27b chat 0-8k harness result because the new best request_rate_per_gpu of 0.4429 is much higher than the previous no-harness best of 0.2025.

Setup

  • Host: dash0.
  • Hardware: 8 NVIDIA H20 GPUs.
  • Engine: internal vLLM at /usr/local/bin/vllm.
  • Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal.
  • Served model name: qwen35-27b-aituner.
  • Workload window: chat_w20260311_1000.
  • Trace file source: trace_windows/windows.json.
  • Request mode: chat.
  • Input bucket: 0 <= input_length <= 8192.
  • Replay scale: 1.0.
  • Max concurrency: 32.
  • Max requests per probe: unset, so the full selected trace subset is replayed.
  • Search field: sampling_u.
  • Search range: low=0.0, high=0.0625.
  • Search probes: max_probes=6, tolerance=0.001.
  • Sampling seed: 20260325.

The local configs and dash0 model directories name this setup Qwen3.5-27B / Qwen35-27B. I did not find a qwen32b model/config for this internal chat 0-8k setup.
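
For reference, the setup above maps onto a study config roughly like the following sketch. This is illustrative only; the field names are assumptions, not the harness's actual schema.

```python
# Hypothetical study config mirroring the setup list above.
# Field names are illustrative; the real harness schema may differ.
study_config = {
    "host": "dash0",
    "engine": "/usr/local/bin/vllm",
    "model": "/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal",
    "served_model_name": "qwen35-27b-aituner",
    "workload": {
        "window": "chat_w20260311_1000",
        "trace_file": "trace_windows/windows.json",
        "request_mode": "chat",
        "input_length_range": (0, 8192),
        "replay_scale": 1.0,
        "max_concurrency": 32,
        "max_requests_per_probe": None,  # unset: replay the full selected subset
    },
    "search": {
        "field": "sampling_u",
        "low": 0.0,
        "high": 0.0625,
        "max_probes": 6,
        "tolerance": 0.001,
        "seed": 20260325,
    },
}
```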

SLO

  • Target pass rate: 0.95.
  • TTFT rule: stepped by input length.
| Input tokens | TTFT threshold |
| --- | --- |
| <=4096 | 2000 ms |
| <=32768 | 4000 ms |
| otherwise | 6000 ms |
  • TPOT rule: fixed <=50 ms.

A probe is feasible when its pass rate is at least 0.95. Individual requests may still fail TTFT/TPOT while the whole probe remains feasible.
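
As a concrete illustration, the stepped TTFT rule, the fixed TPOT rule, and the probe-level pass-rate check could be expressed as below. This is a minimal sketch, assuming per-request (input_tokens, ttft_ms, tpot_ms) measurements; the function names are not the harness's.

```python
def ttft_threshold_ms(input_tokens: int) -> float:
    """Stepped TTFT threshold from the SLO table above."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0


def request_passes(input_tokens: int, ttft_ms: float, tpot_ms: float) -> bool:
    # A request passes only if it meets both the stepped TTFT rule
    # and the fixed TPOT rule (<= 50 ms).
    return ttft_ms <= ttft_threshold_ms(input_tokens) and tpot_ms <= 50.0


def probe_is_feasible(requests, target_pass_rate: float = 0.95) -> bool:
    # requests: iterable of (input_tokens, ttft_ms, tpot_ms) tuples.
    # The probe is feasible when the pass rate reaches the target,
    # even if some individual requests fail their thresholds.
    results = [request_passes(*r) for r in requests]
    return sum(results) / len(results) >= target_pass_rate
```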

Compared Studies

| Variant | Study root | Notes |
| --- | --- | --- |
| no-harness | .aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology | completed 12-trial historical run |
| harness | .aituner-tight/dash0-qwen27b-tight-slo-10min-run10-chat-0-8k-current-harness | seeded with run9 baseline, then ran real harness trials |

The harness run reused the real run9 baseline as trial-0001 to avoid duplicating a multi-hour cold-start baseline measurement. Later harness trials were real dash0 runs.

Metric

The reported metric is request_rate_per_gpu:

```
request_rate_per_gpu = best_feasible_request_rate / parallel_size
parallel_size = tensor_parallel_size * data_parallel_size
```

The result JSON stores best_request_rate; StudyStore.ingest_trial_results derives best_request_rate_per_gpu from the trial spec topology.
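
A sketch of that derivation, assuming the trial spec exposes the topology as plain fields (the names here are illustrative, not the real schema):

```python
def derive_request_rate_per_gpu(result: dict, trial_spec: dict) -> float:
    # The result JSON stores the raw best_request_rate; per-GPU throughput
    # divides by the trial topology. Field names are illustrative.
    parallel_size = (trial_spec["tensor_parallel_size"]
                     * trial_spec["data_parallel_size"])
    return result["best_request_rate"] / parallel_size


# run10 trial-0004: TP=4, DP=1
print(derive_request_rate_per_gpu(
    {"best_request_rate": 1.7716666667},
    {"tensor_parallel_size": 4, "data_parallel_size": 1},
))  # 0.4429166667 -> reported as 0.4429
```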

Result Table

Unit: feasible request_rate_per_gpu.

| Variant | Curve | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA |
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| harness run10 | measured-current | 0.0350 | 0.2142 | NA | 0.4429 | NA | skipped | skipped | stop |
| harness run10 | accepted-incumbent | 0.0350 | 0.2142 | 0.2142 | 0.4429 | 0.4429 | 0.4429 | 0.4429 | 0.4429 (stop) |

Why 0.4429 Is Plausible

The new value is not a case of the old TP2 config suddenly doubling; the comparable TP2 results are close:

| Study | Trial | Config | Best request rate | Parallel size | request_rate_per_gpu |
| --- | --- | --- | --- | --- | --- |
| run9 | trial-0004 | TP=2, DP=1 | 0.4050 | 2 | 0.2025 |
| run10 | trial-0002 | TP=2 | 0.4283 | 2 | 0.2142 |

The large jump comes from a new topology that run9 did not evaluate:

| Study | Trial | Config | Best request rate | Parallel size | request_rate_per_gpu |
| --- | --- | --- | --- | --- | --- |
| run10 | trial-0004 | TP=4 | 1.7717 | 4 | 0.4429 |

At the winning TP4 probe:

  • sampling_u=0.0615234375;
  • request count 1063;
  • request rate 1.7717 req/s;
  • pass rate 0.9680;
  • p95 TTFT 1476.9 ms;
  • p95 TPOT 44.4 ms.

This satisfies the configured SLO and is within one binary-search resolution of search.high=0.0625.
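
For intuition, the probe search is a bounded bisection over sampling_u. Below is a minimal sketch under the configured low/high/max_probes/tolerance, with probe_feasible standing in for a real trace replay; this is an assumption about the search shape, not the harness's actual code.

```python
def search_max_feasible(probe_feasible, low=0.0, high=0.0625,
                        max_probes=6, tolerance=0.001):
    """Bisect for the largest feasible sampling_u in [low, high].

    probe_feasible(u) replays the trace at sampling_u=u and reports
    whether the probe met the 0.95 pass-rate SLO.
    """
    best = None
    probes = 0
    while probes < max_probes and (high - low) > tolerance:
        mid = (low + high) / 2
        probes += 1
        if probe_feasible(mid):
            best, low = mid, mid   # feasible: push the search upward
        else:
            high = mid             # infeasible: back off
    return best
```

With all six probes feasible from low=0.0, this bisection lands exactly at sampling_u=0.0615234375, the winning probe value, one resolution step (0.0625/64) below search.high.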

Correctness Audit

The following fields match between run9 and run10, except for intentionally different identity fields such as study id and port (a comparison sketch follows this list):

  • model path and served model name;
  • internal vLLM executable;
  • base launch flags other than port;
  • trace window chat_w20260311_1000;
  • input-length filter 0-8192;
  • replay scale 1.0;
  • max concurrency 32;
  • full selected trace replay, no max_requests_per_probe;
  • SLO target and TTFT/TPOT thresholds;
  • search high=0.0625, max_probes=6, tolerance=0.001, seed 20260325;
  • metric definition best_request_rate / (TP * DP).
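
A sketch of the field-by-field comparison behind this checklist, with the intentionally-different identity fields excluded; the field names and the set of identity fields are illustrative:

```python
# Identity fields that are allowed to differ between the two studies.
IDENTITY_FIELDS = {"study_id", "port"}


def spec_diffs(run9_spec: dict, run10_spec: dict) -> dict:
    """Return the non-identity fields whose values differ."""
    keys = set(run9_spec) | set(run10_spec)
    return {
        key: (run9_spec.get(key), run10_spec.get(key))
        for key in keys
        if key not in IDENTITY_FIELDS
        and run9_spec.get(key) != run10_spec.get(key)
    }
```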

Checked differences and their impact:

  • Port differs: run9 used 18087, run10 used 18082; this should not affect measured throughput.
  • run10 has explicit restart_engine_after_early_stop=false; chat studies default to the same behavior.
  • run10 has explicit completion_tokens_override=null; equivalent to run9's absent field.
  • run9 trial-0004 search floor was 0.00390625 because it reused the incumbent for the same parallel-size group. run10 trial-0004 search floor was 0.0 because pure TP=4 had not been tried. Both have the same high and probe budget; this does not explain the higher result.

No metric-code logic error was found in the audit. The result JSONs store raw request rate, and the state computes per-GPU throughput by dividing by TP*DP. For run10 TP4, 1.7716666667 / 4 = 0.4429166667.

Issues Found During The Test

Two harness bugs were found and fixed:

  • Runtime refinement coupled larger max-num-batched-tokens with gpu-memory-utilization=0.95, which caused launch-time OOM. Fixed in commit 5d96689.
  • The search-high stop guard incorrectly required no individual SLO failures at a feasible high-edge probe. Fixed in commit f653af0; feasibility already means the probe passed the configured pass-rate SLO. A sketch of the change follows this list.
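
A hedged sketch of the guard change; the real logic in commit f653af0 may be shaped differently, and the probe attributes here are illustrative:

```python
def should_stop_at_search_high(probe) -> bool:
    # Before the fix, the guard also required probe.failed_requests == 0,
    # which wrongly demanded zero individual TTFT/TPOT failures even
    # though the SLO is defined on the probe's pass rate.
    # After the fix, feasibility (pass rate >= 0.95) at the high edge
    # is sufficient to stop.
    return probe.at_search_high and probe.feasible
```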

The queued product-8 trial-0006 and trial-0007 were stopped after the stop-guard fix and are not used in the convergence claim.

Conclusion

The 0.4429 result was measured under the same workload, SLO, search range, and metric definition as the previous 0.2025 result. It is higher because no-harness run9 never evaluated pure TP=4; the harness steered the search from the TTFT/prefill bottleneck to adjacent TP validation and found that topology by iteration 4.

Because TP4 nearly saturates the configured search.high, a follow-up run with a higher search.high is needed to measure the absolute ceiling. That follow-up is separate from the current convergence comparison.