Qwen235B Thinking Decode-Only Harness Run, 2026-04-28
Goal
Run the qwen235b thinking decode-only tuning with the same harness-guided workflow used for the prefill-only test, while keeping the harness generic. The harness must use workload mode, configured SLOs, legal topology constraints, and measured trial history rather than testcase-specific throughput thresholds.
Baseline Reference
The before-harness comparison run is dash0-qwen235b-decode-thinking-run5-tpot40-topology:
| Iter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| before harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |
Before harness, the best feasible config appeared at iter 9 with 0.2817 request/s.
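The best-feasible selection over this trial history can be reproduced with a short sketch. The `history` dictionary is transcribed from the table above. Using `None` for infeasible and launch-fail trials, and the helper name `best_feasible`, are assumptions made for illustration and not the harness's own data model.

```python
from typing import Dict, Optional, Tuple

# Request rates (request/s) per iteration, transcribed from the baseline table.
# None marks a trial that was infeasible or failed to launch (illustrative encoding).
history: Dict[int, Optional[float]] = {
    1: 0.1267, 2: 0.2450, 3: None, 4: None, 5: None, 6: None,
    7: None, 8: None, 9: 0.2817, 10: None, 11: None, 12: None,
}


def best_feasible(trials: Dict[int, Optional[float]]) -> Optional[Tuple[int, float]]:
    """Return (iteration, request rate) of the best feasible trial, or None."""
    feasible = {it: rate for it, rate in trials.items() if rate is not None}
    if not feasible:
        return None
    best_iter = max(feasible, key=feasible.get)
    return best_iter, feasible[best_iter]


print(best_feasible(history))  # (9, 0.2817)
```

Only 3 of the 12 baseline trials were feasible, which is the gap the harness-guided run is meant to narrow.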
Harness Change
The decode-only harness now defaults to decode_tpot when trace.request_mode=decode_only and a TPOT SLO is configured. This avoids treating long decode-only prompt hints as a TTFT-prefill workload.
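A minimal sketch of that default, assuming a hypothetical `select_harness_mode` function and a hypothetical `tpot_ms` key for the configured TPOT SLO. Only the `decode_only` request mode and the `decode_tpot` mode name come from this note; the real harness's function names, config keys, and fallback behavior may differ.

```python
from typing import Mapping, Optional


def select_harness_mode(request_mode: str, slo: Mapping[str, float]) -> Optional[str]:
    """Pick a default harness mode from the workload mode and configured SLOs.

    Returns "decode_tpot" for decode-only traces with a TPOT SLO. Returns None
    otherwise, meaning this sketch makes no decision and the caller's existing
    mode selection applies.
    """
    has_tpot_slo = slo.get("tpot_ms") is not None  # "tpot_ms" is a hypothetical key
    if request_mode == "decode_only" and has_tpot_slo:
        return "decode_tpot"
    return None


# SLO value below is illustrative only.
print(select_harness_mode("decode_only", {"tpot_ms": 40.0}))  # decode_tpot
```

The decision keys on workload mode and the presence of an SLO, not on prompt length, so a long prompt hint in a decode-only trace cannot steer it toward a TTFT-oriented mode.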
Active decode harness families are generic:
- tensor-parallel-size: legal TP/DP redistribution, judged by configured SLO pass rate and request_rate_per_gpu.
- data-parallel-size: legal replica topology changes for decode/admission bottlenecks.
- max-num-seqs: concurrency adjustment from observed TPOT failures or SLO headroom.
- max-num-batched-tokens: decode batching adjustment after topology is stable.
- expert-parallel: preserve known-valid EP topology, but change EP size only with EP-specific evidence.
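As a rough illustration of the topology families, the sketch below enumerates TP/DP splits for a fixed GPU count and normalizes throughput per GPU. It assumes a split is legal only when `tp * dp` equals the GPU count exactly. The real harness may apply further constraints (for example attention-head divisibility or EP layout) that are not modeled here, and both function names are hypothetical.

```python
from typing import List, Tuple


def legal_tp_dp(num_gpus: int) -> List[Tuple[int, int]]:
    """Enumerate (tp, dp) pairs with tp * dp == num_gpus (assumed legality rule)."""
    return [(tp, num_gpus // tp) for tp in range(1, num_gpus + 1) if num_gpus % tp == 0]


def request_rate_per_gpu(request_rate: float, tp: int, dp: int) -> float:
    """Normalize a measured request rate by the number of GPUs in the topology."""
    return request_rate / (tp * dp)


print(legal_tp_dp(8))  # [(1, 8), (2, 4), (4, 2), (8, 1)]
```

Normalizing by GPU count lets candidates with different topologies be compared on one axis, while the configured SLO pass rate remains the gate that decides whether a candidate counts at all.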
No qwen235b-specific threshold or testcase-specific rule was added.
Current Run
Pending. The next run will use dash0, 8x H20, and store results under .aituner/harness-qwen235b-decode-20260428.