# Qwen235B Thinking Decode-Only Harness Run, 2026-04-28

## Goal
Run the qwen235b thinking decode-only tuning with the same harness-guided workflow used for the prefill-only test, while keeping the harness generic. The harness must use workload mode, configured SLOs, legal topology constraints, and measured trial history rather than testcase-specific throughput thresholds.
## Baseline Reference

The before-harness comparison run is `dash0-qwen235b-decode-thinking-run5-tpot40-topology`:
| Iter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| before harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |
Before harness, the best feasible config appeared at iter 9 with 0.2817 request/s.
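The table above can be summarized mechanically. A minimal sketch (names are illustrative, values copied from the table; `None` stands for infeasible or launch-fail iterations):

```python
# Hypothetical sketch: pick the best feasible iteration from the
# before-harness run history. Infeasible / launch-fail entries carry
# no throughput and are encoded as None.
results = {
    1: 0.1267, 2: 0.2450, 3: None, 4: None, 5: None, 6: None,
    7: None, 8: None, 9: 0.2817, 10: None, 11: None, 12: None,
}

def best_feasible(history):
    """Return (iteration, request/s) of the best feasible trial, or None."""
    feasible = {i: r for i, r in history.items() if r is not None}
    if not feasible:
        return None
    it = max(feasible, key=feasible.get)
    return it, feasible[it]

print(best_feasible(results))  # -> (9, 0.2817)
```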
## Harness Change
The decode-only harness now defaults to decode_tpot when trace.request_mode=decode_only and a TPOT SLO is configured. This avoids treating long decode-only prompt hints as a TTFT-prefill workload.
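The selection rule can be sketched as follows; function and key names (`primary_slo_metric`, the `slos` dict keys) are illustrative assumptions, not the harness's actual API:

```python
# Hypothetical sketch of the metric-selection rule: decode-only traces
# with a configured TPOT SLO are judged on decode_tpot, never routed to
# ttft_prefill just because the prompts are long.
def primary_slo_metric(request_mode, slos):
    """Pick the SLO metric the harness optimizes against."""
    if request_mode == "decode_only" and "tpot" in slos:
        return "decode_tpot"
    if "ttft" in slos:
        return "ttft_prefill"
    return next(iter(slos), None)

print(primary_slo_metric("decode_only", {"tpot": 40}))    # -> decode_tpot
print(primary_slo_metric("prefill_only", {"ttft": 300}))  # -> ttft_prefill
```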
Active decode harness families are generic:
- `tensor-parallel-size`: legal TP/DP redistribution, judged by configured SLO pass rate and `request_rate_per_gpu`.
- `data-parallel-size`: legal replica topology changes for decode/admission bottlenecks.
- `max-num-seqs`: concurrency adjustment from observed TPOT failures or SLO headroom.
- `max-num-batched-tokens`: decode batching adjustment after topology is stable.
- `expert-parallel`: preserve known-valid EP topology, but change EP size only with EP-specific evidence.
No qwen235b-specific threshold or testcase-specific rule was added.
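The "EP size only with EP-specific evidence" constraint can be sketched as a proposal guard. This is a minimal illustration, assuming a fixed GPU count and dict-shaped topologies; names are hypothetical:

```python
# Hypothetical sketch of the proposal guard: TP/DP may be redistributed
# freely within legal bounds, but an EP-size change must be backed by
# expert-parallel evidence.
def is_legal_proposal(current, proposed, evidence):
    # EP-size change requires EP-specific evidence
    if proposed["ep"] != current["ep"] and "expert_parallel" not in evidence:
        return False
    # decode topology must keep TP * DP equal to the GPU count
    return proposed["tp"] * proposed["dp"] == current["tp"] * current["dp"]

cur = {"tp": 4, "dp": 2, "ep": 8}
print(is_legal_proposal(cur, {"tp": 2, "dp": 4, "ep": 8}, set()))  # True
print(is_legal_proposal(cur, {"tp": 2, "dp": 4, "ep": 4}, set()))  # False
```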
## Current Run
Started on dash0, 8x H20.
- Remote spec: `.aituner/harness-qwen235b-decode-20260428/dash0_qwen235b_decode_thinking_harness_20260428.json`
- Remote store: `.aituner/harness-qwen235b-decode-20260428/dash0-qwen235b-decode-thinking-harness-20260428`
- Remote tmux: `aituner_qwen235b_decode_harness_20260428`
- Remote log: `logs/qwen235b_decode_harness_20260428.log`
- Code commit: `39aa47f`
- Verification: local and dash0 both passed `PYTHONPATH=src python3 -m unittest discover -s tests`.
The first attempt started a duplicate trial-0001 baseline. Because the identical baseline was already measured in run5 and the decode probe can run for many minutes, that duplicate run was stopped and GPUs were freed.
The active run is now seeded from the real run5 baseline and continues from trial-0002:
- Remote spec: `.aituner/harness-qwen235b-decode-20260428-seeded/dash0_qwen235b_decode_thinking_harness_seeded_20260428.json`
- Remote store: `.aituner/harness-qwen235b-decode-20260428-seeded/dash0-qwen235b-decode-thinking-harness-seeded-20260428`
- Seeded trial-0001: 0.1267 request/s, 0.0158 request/s/GPU, pass rate 0.9868.
- proposal-0002: legal adjacent decode topology move from TP4/DP2/EP8 to TP2/DP4/EP8; no EP-size search and no testcase threshold.
- trial-0002 status: running on dash0 in tmux session `aituner_qwen235b_decode_harness_seeded_20260428`.
The trial-0002 proposal matches the first useful topology direction from the earlier before-harness run, where the same effective config reached 0.2450 request/s at iter 2. The current run is still executing to verify this under the new harness-controlled study state before claiming final convergence data.
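The "legal adjacent" move space on 8 GPUs is small enough to enumerate. A minimal sketch under the assumption that decode topology requires TP × DP = 8 with power-of-two TP (function names are illustrative):

```python
# Hypothetical sketch: enumerate legal decode topologies on 8 GPUs
# (TP * DP == 8) and the single-step "adjacent" moves from a given TP.
def legal_topologies(gpus=8):
    return [(tp, gpus // tp) for tp in (1, 2, 4, 8) if gpus % tp == 0]

def adjacent_moves(tp, gpus=8):
    tps = [t for t, _ in legal_topologies(gpus)]
    i = tps.index(tp)
    return [(tps[j], gpus // tps[j]) for j in (i - 1, i + 1) if 0 <= j < len(tps)]

print(legal_topologies())  # [(1, 8), (2, 4), (4, 2), (8, 1)]
print(adjacent_moves(4))   # [(2, 4), (8, 1)]
```

Under this framing, TP2/DP4 is one of only two adjacent moves from TP4/DP2, which is why the harness proposal coincides with the before-harness iter-2 direction.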
## Follow-up Fix
The seeded prompt exposed a generic diagnosis issue: if the best feasible probe had no latency failures, the harness could miss the prior infeasible probe that showed the real bottleneck at higher load. The harness now scans the probe sequence backward and uses the nearest non-trivial bottleneck before falling back to the best feasible probe. This keeps decode-only runs focused on decode_tpot after a feasible low-load point, without adding testcase thresholds.
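The backward scan can be sketched as below; the probe shape and field names (`bottleneck`, `load`) are illustrative assumptions:

```python
# Hypothetical sketch of the backward probe scan: prefer the nearest
# earlier probe whose bottleneck is non-trivial; fall back to the best
# feasible probe only if no such probe exists.
def pick_diagnosis_probe(probes, best_feasible):
    """probes: list in measurement order; each has a 'bottleneck' field."""
    for probe in reversed(probes):
        if probe.get("bottleneck") not in (None, "none"):
            return probe
    return best_feasible

probes = [
    {"load": 0.5, "bottleneck": None},           # feasible low-load point
    {"load": 1.0, "bottleneck": "decode_tpot"},  # infeasible at higher load
]
print(pick_diagnosis_probe(probes, probes[0]))  # -> the decode_tpot probe
```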
A second generic diagnosis bug was fixed: non-SLO bookkeeping counts such as `probe_elapsed_s>...` no longer collapse to `ttft_prefill` when TTFT/TPOT/request failure counts are all zero.
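A minimal sketch of the fixed classification order; the function name and labels are illustrative, not the harness's actual API:

```python
# Hypothetical sketch of the fixed classification: when TTFT/TPOT/request
# failure counts are all zero, non-SLO bookkeeping notes (e.g. a probe
# wall-clock budget being exceeded) are labeled as bookkeeping, not
# misattributed to ttft_prefill.
def classify_bottleneck(ttft_fails, tpot_fails, request_fails, notes):
    if tpot_fails:
        return "decode_tpot"
    if ttft_fails:
        return "ttft_prefill"
    if request_fails:
        return "request"
    if notes:  # bookkeeping only, e.g. an elapsed-time budget note
        return "bookkeeping"
    return "none"

print(classify_bottleneck(0, 0, 0, ["probe elapsed budget exceeded"]))  # bookkeeping
```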