aituner/docs/qwen235b-thinking-decode/harness-20260428.md

# Qwen235B Thinking Decode-Only Harness Run, 2026-04-28

## Goal

Run the qwen235b thinking decode-only tuning with the same harness-guided workflow used for the prefill-only test, while keeping the harness generic: it must rely on the workload mode, the configured SLOs, legal topology constraints, and measured trial history, not on testcase-specific throughput thresholds.
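The "generic" requirement can be made concrete as a configuration shape. A minimal sketch, assuming a dict-based config; every key name here is illustrative, not the actual aituner schema:

```python
# Hypothetical harness config sketch -- key names are illustrative,
# not the real aituner schema.
harness_config = {
    "workload_mode": "decode_only",          # derived from the trace's request mode
    "slos": {"tpot_ms": 40},                 # configured SLO, e.g. TPOT <= 40 ms
    "topology": {"gpus": 8, "legal_tp": [1, 2, 4, 8]},  # legal topology constraints
    "history": [],                           # measured trial history, appended per iter
}

def is_generic(cfg: dict) -> bool:
    """A generic harness config carries no testcase-specific throughput threshold."""
    return "throughput_threshold" not in cfg

assert is_generic(harness_config)
```

The point of the `is_generic` check is only to state the invariant: nothing in the config encodes a qwen235b-specific pass/fail number.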

## Baseline Reference

The before-harness comparison run is `dash0-qwen235b-decode-thinking-run5-tpot40-topology`:

| Iter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Before harness (request/s) | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |

Before the harness, the best feasible config appeared at iteration 9, at 0.2817 request/s.

## Harness Change

The decode-only harness now defaults to `decode_tpot` when `trace.request_mode=decode_only` and a TPOT SLO is configured. This avoids treating long decode-only prompt hints as a TTFT-prefill workload.
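The default-selection rule can be sketched as a small function. This is an illustrative reconstruction of the behavior described above, not aituner's actual API; the function name, `slos` field, and fallback objective are all assumptions:

```python
# Sketch of the objective-selection rule described above; function and
# field names are illustrative, not aituner's real API.
def pick_objective(request_mode: str, slos: dict) -> str:
    """Default to decode_tpot for decode-only traces with a TPOT SLO,
    so long decode-only prompt hints are not misread as a TTFT/prefill
    workload."""
    if request_mode == "decode_only" and "tpot_ms" in slos:
        return "decode_tpot"
    return "ttft_prefill"  # hypothetical fallback for prefill-style traces

assert pick_objective("decode_only", {"tpot_ms": 40}) == "decode_tpot"
assert pick_objective("prefill_only", {"ttft_ms": 500}) == "ttft_prefill"
```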

Active decode harness families are generic:

- `tensor-parallel-size`: legal TP/DP redistribution, judged by configured SLO pass rate and request_rate_per_gpu.
- `data-parallel-size`: legal replica topology changes for decode/admission bottlenecks.
- `max-num-seqs`: concurrency adjustment from observed TPOT failures or SLO headroom.
- `max-num-batched-tokens`: decode batching adjustment after topology is stable.
- `expert-parallel`: preserve known-valid EP topology, but change EP size only with EP-specific evidence.
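The judging criterion shared by these families (SLO pass rate plus request_rate_per_gpu, no model-specific threshold) can be sketched as follows; the function signature and the 0.99 default pass rate are assumptions for illustration:

```python
# Sketch of how a family's trial might be judged generically, by the
# configured SLO pass rate and per-GPU request rate rather than any
# qwen235b-specific threshold. Names and defaults are illustrative.
def judge_trial(slo_pass_rate: float, request_rate: float, num_gpus: int,
                min_pass_rate: float = 0.99) -> tuple[bool, float]:
    """Return (feasible, request_rate_per_gpu) for one measured trial."""
    feasible = slo_pass_rate >= min_pass_rate
    return feasible, request_rate / num_gpus

# Example: a trial sustaining 2.2536 request/s on 8 GPUs with all
# requests inside the TPOT SLO.
feasible, rps_per_gpu = judge_trial(slo_pass_rate=1.0,
                                    request_rate=2.2536, num_gpus=8)
```

Because the criterion is expressed per GPU, the same comparison remains meaningful when a family changes the TP/DP topology and with it the GPU count per replica.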

No qwen235b-specific threshold or testcase-specific rule was added.

## Current Run

Pending. The next run will use `dash0`, 8x H20, and store results under `.aituner/harness-qwen235b-decode-20260428`.