qwen27b-chat-0-8k Setup and Result Audit

Purpose

This note audits the 2026-05-06 qwen27b chat 0-8k harness result because the new best request_rate_per_gpu of 0.4429 is much higher than the previous no-harness best of 0.2025.

Setup

  • Host: dash0.
  • Hardware: 8 NVIDIA H20 GPUs.
  • Engine: internal vLLM at /usr/local/bin/vllm.
  • Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal.
  • Served model name: qwen35-27b-aituner.
  • Workload window: chat_w20260311_1000.
  • Trace file source: trace_windows/windows.json.
  • Request mode: chat.
  • Input bucket: 0 <= input_length <= 8192.
  • Replay scale: 1.0.
  • Max concurrency: 32.
  • Max requests per probe: unset, so the full selected trace subset is replayed.
  • Search field: sampling_u.
  • Search range: low=0.0, high=0.0625.
  • Search probes: max_probes=6, tolerance=0.001.
  • Sampling seed: 20260325.

The local configs and dash0 model directories name this setup Qwen3.5-27B / Qwen35-27B. I did not find a qwen32b model/config for this internal chat 0-8k setup.
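
For reference, the setup above maps onto a study config roughly like the following sketch. This is illustrative only; the field names are assumptions, not the harness's actual schema.

```python
# Hypothetical study config mirroring the setup list above.
# Field names are illustrative; the real harness schema may differ.
study_config = {
    "host": "dash0",
    "engine": "/usr/local/bin/vllm",
    "model": "/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal",
    "served_model_name": "qwen35-27b-aituner",
    "workload": {
        "window": "chat_w20260311_1000",
        "trace_file": "trace_windows/windows.json",
        "request_mode": "chat",
        "input_length_range": (0, 8192),
        "replay_scale": 1.0,
        "max_concurrency": 32,
        "max_requests_per_probe": None,  # unset: replay the full selected subset
    },
    "search": {
        "field": "sampling_u",
        "low": 0.0,
        "high": 0.0625,
        "max_probes": 6,
        "tolerance": 0.001,
        "seed": 20260325,
    },
}
```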

SLO

  • Target pass rate: 0.95.
  • TTFT rule: stepped by input length.
| Input tokens | TTFT threshold |
| --- | --- |
| <=4096 | 2000 ms |
| <=32768 | 4000 ms |
| otherwise | 6000 ms |
  • TPOT rule: fixed <=50 ms.

A probe is feasible when its pass rate is at least 0.95. Individual requests may still fail TTFT/TPOT while the whole probe remains feasible.
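
As a concrete illustration, the stepped TTFT rule, the fixed TPOT rule, and the probe-level pass-rate check could be expressed as below. This is a minimal sketch, assuming per-request (input_tokens, ttft_ms, tpot_ms) measurements; the function names are not the harness's.

```python
def ttft_threshold_ms(input_tokens: int) -> float:
    """Stepped TTFT threshold from the SLO table above."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0


def request_passes(input_tokens: int, ttft_ms: float, tpot_ms: float) -> bool:
    # A request passes only if it meets both the stepped TTFT rule
    # and the fixed TPOT rule (<= 50 ms).
    return ttft_ms <= ttft_threshold_ms(input_tokens) and tpot_ms <= 50.0


def probe_is_feasible(requests, target_pass_rate: float = 0.95) -> bool:
    # requests: iterable of (input_tokens, ttft_ms, tpot_ms) tuples.
    # The probe is feasible when the pass rate reaches the target,
    # even if some individual requests fail their thresholds.
    results = [request_passes(*r) for r in requests]
    return sum(results) / len(results) >= target_pass_rate
```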

Compared Studies

| Variant | Study root | Notes |
| --- | --- | --- |
| no-harness | .aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology | completed 12-trial historical run |
| harness | .aituner-tight/dash0-qwen27b-tight-slo-10min-run10-chat-0-8k-current-harness | seeded with run9 baseline, then ran real harness trials |

The harness run reused the real run9 baseline as trial-0001 to avoid duplicating a multi-hour cold-start baseline measurement. Later harness trials were real dash0 runs.

Metric

The reported metric is request_rate_per_gpu:

```
request_rate_per_gpu = best_feasible_request_rate / parallel_size
parallel_size = tensor_parallel_size * data_parallel_size
```

The result JSON stores best_request_rate; StudyStore.ingest_trial_results derives best_request_rate_per_gpu from the trial spec topology.
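
A sketch of that derivation, assuming the trial spec exposes the topology as plain fields (the names here are illustrative, not the real schema):

```python
def derive_request_rate_per_gpu(result: dict, trial_spec: dict) -> float:
    # The result JSON stores the raw best_request_rate; per-GPU throughput
    # divides by the trial topology. Field names are illustrative.
    parallel_size = (trial_spec["tensor_parallel_size"]
                     * trial_spec["data_parallel_size"])
    return result["best_request_rate"] / parallel_size


# run10 trial-0004: TP=4, DP=1
print(derive_request_rate_per_gpu(
    {"best_request_rate": 1.7716666667},
    {"tensor_parallel_size": 4, "data_parallel_size": 1},
))  # 0.4429166667 -> reported as 0.4429
```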

Result Table

Unit: feasible request_rate_per_gpu.

| Variant | Curve | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA |
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| harness run10 | measured-current | 0.0350 | 0.2142 | NA | 0.4429 | NA | skipped | skipped | stop |
| harness run10 | accepted-incumbent | 0.0350 | 0.2142 | 0.2142 | 0.4429 | 0.4429 | 0.4429 | 0.4429 | 0.4429 (stop) |

Why 0.4429 Is Plausible

The new value is not a case of the old TP2 config suddenly doubling; the comparable TP2 results are close:

| Study | Trial | Config | Best request rate | Parallel size | request_rate_per_gpu |
| --- | --- | --- | --- | --- | --- |
| run9 | trial-0004 | TP=2, DP=1 | 0.4050 | 2 | 0.2025 |
| run10 | trial-0002 | TP=2 | 0.4283 | 2 | 0.2142 |

The large jump comes from a new topology that run9 did not evaluate:

| Study | Trial | Config | Best request rate | Parallel size | request_rate_per_gpu |
| --- | --- | --- | --- | --- | --- |
| run10 | trial-0004 | TP=4 | 1.7717 | 4 | 0.4429 |

At the winning TP4 probe:

  • sampling_u=0.0615234375;
  • request count 1063;
  • request rate 1.7717 req/s;
  • pass rate 0.9680;
  • p95 TTFT 1476.9 ms;
  • p95 TPOT 44.4 ms.

This satisfies the configured SLO and is within one binary-search resolution of search.high=0.0625.
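
For intuition, the probe search is a bounded bisection over sampling_u. Below is a minimal sketch under the configured low/high/max_probes/tolerance, with probe_feasible standing in for a real trace replay; this is an assumption about the search shape, not the harness's actual code.

```python
def search_max_feasible(probe_feasible, low=0.0, high=0.0625,
                        max_probes=6, tolerance=0.001):
    """Bisect for the largest feasible sampling_u in [low, high].

    probe_feasible(u) replays the trace at sampling_u=u and reports
    whether the probe met the 0.95 pass-rate SLO.
    """
    best = None
    probes = 0
    while probes < max_probes and (high - low) > tolerance:
        mid = (low + high) / 2
        probes += 1
        if probe_feasible(mid):
            best, low = mid, mid   # feasible: push the search upward
        else:
            high = mid             # infeasible: back off
    return best
```

With all six probes feasible from low=0.0, this bisection lands exactly at sampling_u=0.0615234375, the winning probe value, one resolution step (0.0625/64) below search.high.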

Correctness Audit

The following fields match between run9 and run10, except for intentionally different identity fields such as study id and port (a comparison sketch follows this list):

  • model path and served model name;
  • internal vLLM executable;
  • base launch flags other than port;
  • trace window chat_w20260311_1000;
  • input-length filter 0-8192;
  • replay scale 1.0;
  • max concurrency 32;
  • full selected trace replay, no max_requests_per_probe;
  • SLO target and TTFT/TPOT thresholds;
  • search high=0.0625, max_probes=6, tolerance=0.001, seed 20260325;
  • metric definition best_request_rate / (TP * DP).
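
A sketch of the field-by-field comparison behind this checklist, with the intentionally-different identity fields excluded; the field names and the set of identity fields are illustrative:

```python
# Identity fields that are allowed to differ between the two studies.
IDENTITY_FIELDS = {"study_id", "port"}


def spec_diffs(run9_spec: dict, run10_spec: dict) -> dict:
    """Return the non-identity fields whose values differ."""
    keys = set(run9_spec) | set(run10_spec)
    return {
        key: (run9_spec.get(key), run10_spec.get(key))
        for key in keys
        if key not in IDENTITY_FIELDS
        and run9_spec.get(key) != run10_spec.get(key)
    }
```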

Checked differences and their impact:

  • Port differs: run9 used 18087, run10 used 18082; this should not affect measured throughput.
  • run10 has explicit restart_engine_after_early_stop=false; chat studies default to the same behavior.
  • run10 has explicit completion_tokens_override=null; equivalent to run9's absent field.
  • run9 trial-0004 search floor was 0.00390625 because it reused the incumbent for the same parallel-size group. run10 trial-0004 search floor was 0.0 because pure TP=4 had not been tried. Both have the same high and probe budget; this does not explain the higher result.

No metric-code logic error was found in the audit. The result JSONs store raw request rate, and the state computes per-GPU throughput by dividing by TP*DP. For run10 TP4, 1.7716666667 / 4 = 0.4429166667.

Issues Found During The Test

Two harness bugs were found and fixed:

  • Runtime refinement coupled larger max-num-batched-tokens with gpu-memory-utilization=0.95, which caused launch-time OOM. Fixed in commit 5d96689.
  • The search-high stop guard incorrectly required no individual SLO failures at a feasible high-edge probe. Fixed in commit f653af0; feasibility already means the probe passed the configured pass-rate SLO. A sketch of the change follows this list.
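
A hedged sketch of the guard change; the real logic in commit f653af0 may be shaped differently, and the probe attributes here are illustrative:

```python
def should_stop_at_search_high(probe) -> bool:
    # Before the fix, the guard also required probe.failed_requests == 0,
    # which wrongly demanded zero individual TTFT/TPOT failures even
    # though the SLO is defined on the probe's pass rate.
    # After the fix, feasibility (pass rate >= 0.95) at the high edge
    # is sufficient to stop.
    return probe.at_search_high and probe.feasible
```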

The queued product-8 trial-0006 and trial-0007 were stopped after the stop-guard fix and are not used in the convergence claim.

Conclusion

The 0.4429 result was measured under the same workload, SLO, search range, and metric definition as the previous 0.2025 result. It is higher because no-harness run9 never evaluated pure TP=4; the harness steered the search from the TTFT/prefill bottleneck to adjacent TP validation and found that topology by iteration 4.

Because TP4 nearly saturates the configured search.high, a follow-up run with a higher search.high is needed to measure the absolute ceiling. That follow-up is separate from the current convergence comparison.