Document qwen27b chat setup audit
`docs/qwen27b-chat-0-8k-setup-audit-20260506.md`

# qwen27b-chat-0-8k Setup and Result Audit

## Purpose

This note audits the 2026-05-06 qwen27b chat 0-8k harness result because the
new best `0.4429 request_rate_per_gpu` is much higher than the previous
no-harness best `0.2025`.

## Setup

- Host: `dash0`.
- Hardware: 8 NVIDIA H20 GPUs.
- Engine: internal vLLM at `/usr/local/bin/vllm`.
- Model:
  `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
- Served model name: `qwen35-27b-aituner`.
- Workload window: `chat_w20260311_1000`.
- Trace file source: `trace_windows/windows.json`.
- Request mode: `chat`.
- Input bucket: `0 <= input_length <= 8192`.
- Replay scale: `1.0`.
- Max concurrency: `32`.
- Max requests per probe: unset, so the full selected trace subset is replayed.
- Search field: `sampling_u`.
- Search range: `low=0.0`, `high=0.0625`.
- Search probes: `max_probes=6`, `tolerance=0.001`.
- Sampling seed: `20260325`.

The local configs and dash0 model directories name this setup Qwen3.5-27B /
Qwen35-27B. I did not find a `qwen32b` model/config for this internal chat
0-8k setup.
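
The search settings above imply a simple bounded bisection over `sampling_u`. A minimal sketch, assuming the harness bisects `[low, high]` and keeps the last feasible midpoint; the `probe` callable and this exact update rule are illustrative, not the harness's actual code:

```python
def search_sampling_u(probe, low=0.0, high=0.0625, max_probes=6, tolerance=0.001):
    """Bisect [low, high] for the largest feasible sampling_u.

    `probe(u)` stands in for one trace replay at sampling_u=u and
    returns True when that probe meets the pass-rate SLO.
    """
    best = None
    for _ in range(max_probes):
        if high - low < tolerance:
            break  # interval already tighter than the configured tolerance
        mid = (low + high) / 2
        if probe(mid):
            best, low = mid, mid  # feasible: push toward high
        else:
            high = mid  # infeasible: back off
    return best
```

Under this sketch, six all-feasible probes end at `0.0625 * 63 / 64 = 0.0615234375`, which is consistent with the winning probe's reported `sampling_u`.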

## SLO

- Target pass rate: `0.95`.
- TTFT rule: stepped by input length.

| Input tokens | TTFT threshold |
| ---: | ---: |
| `<=4096` | `2000 ms` |
| `<=32768` | `4000 ms` |
| otherwise | `6000 ms` |

- TPOT rule: fixed `<=50 ms`.

A probe is feasible when its pass rate is at least `0.95`. Individual requests
may still fail TTFT/TPOT while the whole probe remains feasible.
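
The stepped TTFT rule, fixed TPOT rule, and pass-rate target can be expressed as a small checker. A hedged sketch — the `(input_tokens, ttft_ms, tpot_ms)` tuple layout is an assumption, not the harness's actual record format:

```python
def ttft_threshold_ms(input_tokens):
    """Stepped TTFT threshold from the table above."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0

def probe_is_feasible(requests, target_pass_rate=0.95, tpot_limit_ms=50.0):
    """A probe is feasible when at least 95% of requests meet both rules.

    `requests`: iterable of (input_tokens, ttft_ms, tpot_ms) tuples.
    """
    passed = sum(
        1 for tokens, ttft, tpot in requests
        if ttft <= ttft_threshold_ms(tokens) and tpot <= tpot_limit_ms
    )
    return passed / len(requests) >= target_pass_rate
```

For example, a probe where 19 of 20 requests pass is feasible (pass rate exactly `0.95`) even though one request missed its TTFT threshold.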

## Compared Studies

| Variant | Study root | Notes |
| --- | --- | --- |
| no-harness | `.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology` | completed 12-trial historical run |
| harness | `.aituner-tight/dash0-qwen27b-tight-slo-10min-run10-chat-0-8k-current-harness` | seeded with run9 baseline, then ran real harness trials |

The harness run reused the real run9 baseline as `trial-0001` to avoid
duplicating a multi-hour cold-start baseline measurement. Later harness trials
were real dash0 runs.

## Metric

The reported metric is `request_rate_per_gpu`:

```text
request_rate_per_gpu = best_feasible_request_rate / parallel_size
parallel_size = tensor_parallel_size * data_parallel_size
```

The result JSON stores `best_request_rate`; `StudyStore.ingest_trial_results`
derives `best_request_rate_per_gpu` from the trial spec topology.
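
The division can be spelled out against the two headline trials. The function and argument names below are illustrative, not the actual `StudyStore` API:

```python
def request_rate_per_gpu(best_request_rate, tensor_parallel_size, data_parallel_size=1):
    """Per-GPU throughput: raw feasible request rate over TP * DP."""
    return best_request_rate / (tensor_parallel_size * data_parallel_size)

# run9 trial-0004 (TP=2, DP=1) and run10 trial-0004 (TP=4):
run9_per_gpu = request_rate_per_gpu(0.4050, 2)   # 0.2025
run10_per_gpu = request_rate_per_gpu(1.7717, 4)  # ~0.4429
```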

## Result Table

Unit: feasible `request_rate_per_gpu`.

| Variant | Curve | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA |
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| harness run10 | measured-current | 0.0350 | 0.2142 | NA | 0.4429 | NA | skipped | skipped | stop |
| harness run10 | accepted-incumbent | 0.0350 | 0.2142 | 0.2142 | 0.4429 | 0.4429 | 0.4429 | 0.4429 | 0.4429 stop |

## Why `0.4429` Is Plausible

The new value is not the old TP2 config suddenly doubling. The comparable TP2
results are close:

| Study | Trial | Config | Best request rate | Parallel size | request_rate_per_gpu |
| --- | --- | --- | ---: | ---: | ---: |
| run9 | `trial-0004` | `TP=2, DP=1` | 0.4050 | 2 | 0.2025 |
| run10 | `trial-0002` | `TP=2` | 0.4283 | 2 | 0.2142 |

The large jump comes from a new topology that run9 did not evaluate:

| Study | Trial | Config | Best request rate | Parallel size | request_rate_per_gpu |
| --- | --- | --- | ---: | ---: | ---: |
| run10 | `trial-0004` | `TP=4` | 1.7717 | 4 | 0.4429 |

At the winning TP4 probe:

- `sampling_u=0.0615234375`;
- request count `1063`;
- request rate `1.7717 req/s`;
- pass rate `0.9680`;
- p95 TTFT `1476.9 ms`;
- p95 TPOT `44.4 ms`.

This satisfies the configured SLO and is within one binary-search resolution of
`search.high=0.0625`.
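
The "one binary-search resolution" claim checks out arithmetically, assuming six halvings of the initial `[0.0, 0.0625]` interval:

```python
high = 0.0625
winning_u = 0.0615234375
final_step = high / 2 ** 6  # interval width after max_probes=6 halvings
assert high - winning_u == final_step  # both are exactly 0.0009765625
```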

## Correctness Audit

The following fields match between run9 and run10 except for intentionally
different identity fields such as study id and port:

- model path and served model name;
- internal vLLM executable;
- base launch flags other than port;
- trace window `chat_w20260311_1000`;
- input-length filter `0-8192`;
- replay scale `1.0`;
- max concurrency `32`;
- full selected trace replay, no `max_requests_per_probe`;
- SLO target and TTFT/TPOT thresholds;
- search `high=0.0625`, `max_probes=6`, `tolerance=0.001`, seed `20260325`;
- metric definition `best_request_rate / (TP * DP)`.

Checked differences and their impact:

- Port differs: run9 used `18087`, run10 used `18082`; this should not affect
  measured throughput.
- run10 has explicit `restart_engine_after_early_stop=false`; chat studies
  default to the same behavior.
- run10 has explicit `completion_tokens_override=null`; equivalent to run9's
  absent field.
- run9 `trial-0004` search floor was `0.00390625` because it reused the
  incumbent for the same parallel-size group. run10 `trial-0004` search floor
  was `0.0` because pure `TP=4` had not been tried. Both have the same high and
  probe budget; this does not explain the higher result.
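
The floor difference above follows from a simple reuse rule. A hypothetical sketch — the mapping and function name are mine, not harness code:

```python
def search_low_for_trial(parallel_size, incumbents):
    """Reuse the incumbent feasible value for the same parallel-size
    group as the search floor; start fresh at 0.0 otherwise."""
    return incumbents.get(parallel_size, 0.0)

# run9 trial-0004 (parallel size 2) had a same-group incumbent;
# run10 trial-0004 (parallel size 4) did not, so its floor fell back to 0.0.
incumbents = {2: 0.00390625}
assert search_low_for_trial(2, incumbents) == 0.00390625
assert search_low_for_trial(4, incumbents) == 0.0
```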

No metric-code logic error was found in the audit. The result JSONs store raw
request rate, and the state computes per-GPU throughput by dividing by
`TP*DP`. For run10 TP4, `1.7716666667 / 4 = 0.4429166667`.

## Issues Found During The Test

Two harness bugs were found and fixed:

- Runtime refinement coupled larger `max-num-batched-tokens` with
  `gpu-memory-utilization=0.95`, which caused launch-time OOM. Fixed in commit
  `5d96689`.
- The search-high stop guard incorrectly required no individual SLO failures at
  a feasible high-edge probe. Fixed in commit `f653af0`; feasibility already
  means the probe passed the configured pass-rate SLO.

The queued product-8 `trial-0006` and `trial-0007` were stopped after the stop
guard fix and are not used in the convergence claim.

## Conclusion

The `0.4429` result is compared under the same workload, SLO, search range, and
metric definition as the previous `0.2025` result. The reason it is higher is
that no-harness run9 did not evaluate pure `TP=4`; the harness guided the
search from the TTFT/prefill bottleneck to adjacent TP validation and found
that topology by iter 4.


Because TP4 nearly saturates the configured `search.high`, a follow-up run with
a higher `search.high` is needed to measure the absolute ceiling. That follow-up
is separate from the current convergence comparison.