# Qwen235B Thinking Decode-Only Harness Run, 2026-04-28

## Goal

Run the qwen235b thinking decode-only tuning with the same harness-guided workflow used for the prefill-only test, while keeping the harness generic. The harness must rely on workload mode, configured SLOs, legal topology constraints, and measured trial history rather than testcase-specific throughput thresholds.

## Baseline Reference

The before-harness comparison run is `dash0-qwen235b-decode-thinking-run5-tpot40-topology`:

| Iter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| before-harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |

Before the harness, the best feasible config appeared at iter 9 with 0.2817 request/s.

## Harness Change

The decode-only harness now defaults to `decode_tpot` when `trace.request_mode=decode_only` and a TPOT SLO is configured. This avoids treating long decode-only prompt hints as a TTFT-prefill workload.

The active decode harness families are generic:

- `tensor-parallel-size`: legal TP/DP redistribution, judged by configured SLO pass rate and request_rate_per_gpu.
- `data-parallel-size`: legal replica-topology changes for decode or admission bottlenecks.
- `max-num-seqs`: concurrency adjustment driven by observed TPOT failures or SLO headroom.
- `max-num-batched-tokens`: decode batching adjustment once the topology is stable.
- `expert-parallel`: preserve the known-valid EP topology; change EP size only with EP-specific evidence.

No qwen235b-specific threshold or testcase-specific rule was added.

## Current Run

Pending. The next run will use dash0, 8x H20, and store results under `.aituner/harness-qwen235b-decode-20260428`.
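
## Appendix: Workload-Mode Defaulting Sketch

The workload-mode defaulting rule described in "Harness Change" can be sketched as follows. This is a minimal illustration, not the harness's actual API: the names `TraceConfig`, `select_workload_mode`, and the mode strings `"decode_tpot"`, `"ttft_prefill"`, and `"throughput"` are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceConfig:
    """Illustrative stand-in for the harness trace configuration."""
    request_mode: str                 # e.g. "decode_only", "prefill_only"
    tpot_slo_ms: Optional[float]      # configured TPOT SLO, if any
    ttft_slo_ms: Optional[float]      # configured TTFT SLO, if any


def select_workload_mode(trace: TraceConfig) -> str:
    # A decode-only trace with a TPOT SLO is tuned as a decode_tpot workload,
    # so long decode-only prompt hints are not mistaken for a TTFT-prefill case.
    if trace.request_mode == "decode_only" and trace.tpot_slo_ms is not None:
        return "decode_tpot"
    # Otherwise fall back to TTFT-driven prefill tuning when a TTFT SLO exists.
    if trace.ttft_slo_ms is not None:
        return "ttft_prefill"
    return "throughput"


# The tpot40 baseline name suggests a 40 ms TPOT SLO for this testcase.
print(select_workload_mode(TraceConfig("decode_only", 40.0, None)))  # decode_tpot
```

The key point is that the branch order keeps the rule generic: no model name or throughput threshold appears in the decision, only workload mode and configured SLOs.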
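
## Appendix: request_rate_per_gpu Normalization Sketch

The `tensor-parallel-size` family judges TP/DP redistribution partly by request_rate_per_gpu. A plausible sketch of that normalization is below; the function name and the `gpus = tp * dp` GPU count are assumptions for illustration (the real harness may count GPUs differently, for example when EP spans extra devices).

```python
def request_rate_per_gpu(request_rate: float,
                         tensor_parallel: int,
                         data_parallel: int) -> float:
    """Normalize measured request/s by the GPUs a config occupies,
    assuming each replica uses tensor_parallel GPUs (an assumption
    of this sketch, not a documented harness formula)."""
    gpus = tensor_parallel * data_parallel
    return request_rate / gpus


# Baseline best (iter 9, 0.2817 request/s) on all 8 H20s as a single
# TP=8, DP=1 replica -- a hypothetical topology for illustration.
print(round(request_rate_per_gpu(0.2817, 8, 1), 6))  # 0.035213
```

Normalizing per GPU is what lets the harness compare, say, one TP=8 replica against two TP=4 replicas without a testcase-specific threshold: both are scored on the same per-device scale.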