From 871c4cfc0272ce66652ca314f98bad819dd63204 Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Wed, 6 May 2026 20:32:09 +0800
Subject: [PATCH] Document qwen27b chat setup audit

---
 .../qwen27b-chat-0-8k-setup-audit-20260506.md | 169 ++++++++++++++++++
 1 file changed, 169 insertions(+)
 create mode 100644 docs/qwen27b-chat-0-8k-setup-audit-20260506.md

diff --git a/docs/qwen27b-chat-0-8k-setup-audit-20260506.md b/docs/qwen27b-chat-0-8k-setup-audit-20260506.md
new file mode 100644
index 0000000..94cae53
--- /dev/null
+++ b/docs/qwen27b-chat-0-8k-setup-audit-20260506.md
@@ -0,0 +1,169 @@
# qwen27b-chat-0-8k Setup and Result Audit

## Purpose

This note audits the 2026-05-06 qwen27b chat 0-8k harness result because the
new best `request_rate_per_gpu` of `0.4429` is much higher than the previous
no-harness best of `0.2025`.

## Setup

- Host: `dash0`.
- Hardware: 8 NVIDIA H20 GPUs.
- Engine: internal vLLM at `/usr/local/bin/vllm`.
- Model:
  `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
- Served model name: `qwen35-27b-aituner`.
- Workload window: `chat_w20260311_1000`.
- Trace file source: `trace_windows/windows.json`.
- Request mode: `chat`.
- Input bucket: `0 <= input_length <= 8192`.
- Replay scale: `1.0`.
- Max concurrency: `32`.
- Max requests per probe: unset, so the full selected trace subset is replayed.
- Search field: `sampling_u`.
- Search range: `low=0.0`, `high=0.0625`.
- Search probes: `max_probes=6`, `tolerance=0.001`.
- Sampling seed: `20260325`.

The local configs and dash0 model directories name this setup Qwen3.5-27B /
Qwen35-27B. I did not find a `qwen32b` model/config for this internal chat
0-8k setup.

## SLO

- Target pass rate: `0.95`.
- TTFT rule: stepped by input length.

| Input tokens | TTFT threshold |
| ---: | ---: |
| `<=4096` | `2000 ms` |
| `<=32768` | `4000 ms` |
| otherwise | `6000 ms` |

- TPOT rule: fixed `<=50 ms`.

A probe is feasible when its pass rate is at least `0.95`. Individual requests
may still fail their TTFT/TPOT thresholds while the probe as a whole remains
feasible.

## Compared Studies

| Variant | Study root | Notes |
| --- | --- | --- |
| no-harness | `.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology` | completed 12-trial historical run |
| harness | `.aituner-tight/dash0-qwen27b-tight-slo-10min-run10-chat-0-8k-current-harness` | seeded with run9 baseline, then ran real harness trials |

The harness run reused the real run9 baseline as `trial-0001` to avoid
repeating a multi-hour cold-start baseline measurement. All later harness
trials were real runs on dash0.

## Metric

The reported metric is `request_rate_per_gpu`:

```text
request_rate_per_gpu = best_feasible_request_rate / parallel_size
parallel_size = tensor_parallel_size * data_parallel_size
```

The result JSON stores `best_request_rate`; `StudyStore.ingest_trial_results`
derives `best_request_rate_per_gpu` from the trial spec topology.
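To make the division concrete, the sketch below recomputes the per-GPU values
quoted in the tables later in this note from the raw `best_request_rate`
figures. The helper name and argument layout are illustrative only; this is
not the actual `StudyStore.ingest_trial_results` code.

```python
# Hypothetical helper mirroring the metric definition above; not the real
# StudyStore.ingest_trial_results implementation.
def request_rate_per_gpu(best_request_rate: float,
                         tensor_parallel_size: int,
                         data_parallel_size: int = 1) -> float:
    parallel_size = tensor_parallel_size * data_parallel_size
    return best_request_rate / parallel_size

# Figures taken from the trial tables below.
print(request_rate_per_gpu(0.4050, 2))        # run9  trial-0004, TP=2 -> 0.2025
print(request_rate_per_gpu(0.4283, 2))        # run10 trial-0002, TP=2 -> 0.21415 (reported 0.2142)
print(request_rate_per_gpu(1.7716666667, 4))  # run10 trial-0004, TP=4 -> ~0.4429166667
```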
## Result Table

Unit: feasible `request_rate_per_gpu`.

| Variant | Curve | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA |
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| harness run10 | measured-current | 0.0350 | 0.2142 | NA | 0.4429 | NA | skipped | skipped | stop |
| harness run10 | accepted-incumbent | 0.0350 | 0.2142 | 0.2142 | 0.4429 | 0.4429 | 0.4429 | 0.4429 | 0.4429 (stop) |

## Why `0.4429` Is Plausible

The new value is not the old TP2 config suddenly doubling; the comparable TP2
results are close:

| Study | Trial | Config | Best request rate | Parallel size | request_rate_per_gpu |
| --- | --- | --- | ---: | ---: | ---: |
| run9 | `trial-0004` | `TP=2, DP=1` | 0.4050 | 2 | 0.2025 |
| run10 | `trial-0002` | `TP=2` | 0.4283 | 2 | 0.2142 |

The large jump comes from a new topology that run9 did not evaluate:

| Study | Trial | Config | Best request rate | Parallel size | request_rate_per_gpu |
| --- | --- | --- | ---: | ---: | ---: |
| run10 | `trial-0004` | `TP=4` | 1.7717 | 4 | 0.4429 |

At the winning TP4 probe:

- `sampling_u=0.0615234375`;
- request count `1063`;
- request rate `1.7717 req/s`;
- pass rate `0.9680`;
- p95 TTFT `1476.9 ms`;
- p95 TPOT `44.4 ms`.

This satisfies the configured SLO, and the winning `sampling_u` is within one
binary-search resolution of `search.high=0.0625`.

## Correctness Audit

The following fields match between run9 and run10; only intentionally
different identity fields such as the study id and port differ:

- model path and served model name;
- internal vLLM executable;
- base launch flags other than port;
- trace window `chat_w20260311_1000`;
- input-length filter `0-8192`;
- replay scale `1.0`;
- max concurrency `32`;
- full selected trace replay, no `max_requests_per_probe`;
- SLO target and TTFT/TPOT thresholds;
- search `high=0.0625`, `max_probes=6`, `tolerance=0.001`, seed `20260325`;
- metric definition `best_request_rate / (TP * DP)`.

Checked differences and their impact:

- Port differs: run9 used `18087`, run10 used `18082`; this should not affect
  measured throughput.
- run10 has explicit `restart_engine_after_early_stop=false`; chat studies
  default to the same behavior.
- run10 has explicit `completion_tokens_override=null`; equivalent to run9's
  absent field.
- run9 `trial-0004` used a search floor of `0.00390625` because it reused the
  incumbent for the same parallel-size group. run10 `trial-0004` used a floor
  of `0.0` because pure `TP=4` had not been tried. Both had the same high and
  probe budget, so this difference does not explain the higher result.

No metric-code logic error was found in the audit. The result JSONs store the
raw request rate, and the study state computes per-GPU throughput by dividing
by `TP*DP`. For run10 TP4, `1.7716666667 / 4 = 0.4429166667`.

## Issues Found During The Test

Two harness bugs were found and fixed:

- Runtime refinement coupled larger `max-num-batched-tokens` with
  `gpu-memory-utilization=0.95`, which caused launch-time OOM. Fixed in commit
  `5d96689`.
- The search-high stop guard incorrectly required no individual SLO failures at
  a feasible high-edge probe. Fixed in commit `f653af0`; feasibility already
  means the probe passed the configured pass-rate SLO. A sketch of the
  corrected guard follows this list.
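The sketch below only illustrates the distinction the second fix relies on:
probe-level feasibility (pass rate at or above the target) versus zero
individual SLO failures. The `Probe` type and function names are hypothetical
and mirror the behavior described above; the actual change is in commit
`f653af0`.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    pass_rate: float    # fraction of requests meeting their TTFT/TPOT thresholds
    slo_failures: int   # count of individual requests that missed a threshold

TARGET_PASS_RATE = 0.95

def feasible(probe: Probe) -> bool:
    # Feasibility is defined by the configured pass-rate SLO alone.
    return probe.pass_rate >= TARGET_PASS_RATE

def should_stop_at_high_edge(high_edge_probe: Probe) -> bool:
    # Buggy guard: feasible(high_edge_probe) and high_edge_probe.slo_failures == 0
    # Fixed guard: a feasible high-edge probe is enough to stop, even when some
    # individual requests failed TTFT/TPOT (e.g. pass rate 0.9680 at TP4).
    return feasible(high_edge_probe)
```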
The queued product-8 `trial-0006` and `trial-0007` were stopped after the stop
guard fix and are not used in the convergence claim.

## Conclusion

The `0.4429` result is compared under the same workload, SLO, search range,
and metric definition as the previous `0.2025` result. It is higher because
no-harness run9 never evaluated pure `TP=4`; the harness guided the search
from the TTFT/prefill bottleneck to adjacent TP validation and found that
topology by iter 4.

Because TP4 nearly saturates the configured `search.high`, a follow-up run
with a higher `search.high` is needed to measure the absolute ceiling (the
short calculation below shows how close the winning probe already is to that
edge). That follow-up is separate from the current convergence comparison.
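Supporting arithmetic for the "nearly saturates" claim, assuming a plain
bisection over `[low, high]` with the configured probe budget (the bisection
shape is an assumption about the search, not harness code):

```python
low, high = 0.0, 0.0625
max_probes = 6

# Interval resolution after max_probes halvings; it lands just under the
# configured tolerance of 0.001.
resolution = (high - low) / 2 ** max_probes
print(resolution)                        # 0.0009765625

# The winning probe from run10 trial-0004 sits exactly one step below the edge.
winning_u = 0.0615234375
print(high - winning_u)                  # 0.0009765625
print(high - winning_u == resolution)    # True
```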