# Qwen235B Thinking Decode-Only Harness Run, 2026-04-28

## Goal
Run the qwen235b thinking decode-only tuning with the same harness-guided workflow used for the prefill-only test, while keeping the harness generic. The harness must use workload mode, configured SLOs, legal topology constraints, and measured trial history rather than testcase-specific throughput thresholds.
## Baseline Reference

The before-harness comparison run is `dash0-qwen235b-decode-thinking-run5-tpot40-topology`:
| Iter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| before harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |
Before harness, the best feasible config appeared at iter 9 with 0.2817 request/s.
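The table above can be summarized mechanically. A minimal sketch (names are illustrative, values copied from the table; `None` stands for infeasible or launch-fail iterations):

```python
# Hypothetical sketch: pick the best feasible iteration from the
# before-harness run history. Infeasible / launch-fail entries carry
# no throughput and are encoded as None.
results = {
    1: 0.1267, 2: 0.2450, 3: None, 4: None, 5: None, 6: None,
    7: None, 8: None, 9: 0.2817, 10: None, 11: None, 12: None,
}

def best_feasible(history):
    """Return (iteration, request/s) of the best feasible trial, or None."""
    feasible = {i: r for i, r in history.items() if r is not None}
    if not feasible:
        return None
    it = max(feasible, key=feasible.get)
    return it, feasible[it]

print(best_feasible(results))  # -> (9, 0.2817)
```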
## Harness Change
The decode-only harness now defaults to decode_tpot when trace.request_mode=decode_only and a TPOT SLO is configured. This avoids treating long decode-only prompt hints as a TTFT-prefill workload.
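The selection rule can be sketched as follows; function and key names (`primary_slo_metric`, the `slos` dict keys) are illustrative assumptions, not the harness's actual API:

```python
# Hypothetical sketch of the metric-selection rule: decode-only traces
# with a configured TPOT SLO are judged on decode_tpot, never routed to
# ttft_prefill just because the prompts are long.
def primary_slo_metric(request_mode, slos):
    """Pick the SLO metric the harness optimizes against."""
    if request_mode == "decode_only" and "tpot" in slos:
        return "decode_tpot"
    if "ttft" in slos:
        return "ttft_prefill"
    return next(iter(slos), None)

print(primary_slo_metric("decode_only", {"tpot": 40}))    # -> decode_tpot
print(primary_slo_metric("prefill_only", {"ttft": 300}))  # -> ttft_prefill
```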
Active decode harness families are generic:
- `tensor-parallel-size`: legal TP/DP redistribution, judged by configured SLO pass rate and `request_rate_per_gpu`.
- `data-parallel-size`: legal replica topology changes for decode/admission bottlenecks.
- `max-num-seqs`: concurrency adjustment from observed TPOT failures or SLO headroom.
- `max-num-batched-tokens`: decode batching adjustment after topology is stable.
- `expert-parallel`: preserve known-valid EP topology, but change EP size only with EP-specific evidence.
No qwen235b-specific threshold or testcase-specific rule was added.
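The "EP size only with EP-specific evidence" constraint can be sketched as a proposal guard. This is a minimal illustration, assuming a fixed GPU count and dict-shaped topologies; names are hypothetical:

```python
# Hypothetical sketch of the proposal guard: TP/DP may be redistributed
# freely within legal bounds, but an EP-size change must be backed by
# expert-parallel evidence.
def is_legal_proposal(current, proposed, evidence):
    # EP-size change requires EP-specific evidence
    if proposed["ep"] != current["ep"] and "expert_parallel" not in evidence:
        return False
    # decode topology must keep TP * DP equal to the GPU count
    return proposed["tp"] * proposed["dp"] == current["tp"] * current["dp"]

cur = {"tp": 4, "dp": 2, "ep": 8}
print(is_legal_proposal(cur, {"tp": 2, "dp": 4, "ep": 8}, set()))  # True
print(is_legal_proposal(cur, {"tp": 2, "dp": 4, "ep": 4}, set()))  # False
```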
## Current Run
Started on dash0, 8x H20.
- Remote spec: `.aituner/harness-qwen235b-decode-20260428/dash0_qwen235b_decode_thinking_harness_20260428.json`
- Remote store: `.aituner/harness-qwen235b-decode-20260428/dash0-qwen235b-decode-thinking-harness-20260428`
- Remote tmux: `aituner_qwen235b_decode_harness_20260428`
- Remote log: `logs/qwen235b_decode_harness_20260428.log`
- Code commit: `39aa47f`
- Verification: local and dash0 both passed `PYTHONPATH=src python3 -m unittest discover -s tests`.
The first attempt started a duplicate trial-0001 baseline. Because the identical baseline was already measured in run5 and the decode probe can run for many minutes, that duplicate run was stopped and GPUs were freed.
The active run is now seeded from the real run5 baseline and continues from trial-0002:
- Remote spec: `.aituner/harness-qwen235b-decode-20260428-seeded/dash0_qwen235b_decode_thinking_harness_seeded_20260428.json`
- Remote store: `.aituner/harness-qwen235b-decode-20260428-seeded/dash0-qwen235b-decode-thinking-harness-seeded-20260428`
- Seeded trial-0001: 0.1267 request/s, 0.0158 request/s/GPU, pass rate 0.9868.
- proposal-0002: legal adjacent decode topology move from TP4/DP2/EP8 to TP2/DP4/EP8; no EP-size search and no testcase threshold.
- trial-0002 status: running on dash0 in tmux session `aituner_qwen235b_decode_harness_seeded_20260428`.
The trial-0002 proposal matches the first useful topology direction from the earlier before-harness run, where the same effective config reached 0.2450 request/s at iter 2. The current run is still executing to verify this under the new harness-controlled study state before claiming final convergence data.
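The "legal adjacent" move space on 8 GPUs is small enough to enumerate. A minimal sketch under the assumption that decode topology requires TP × DP = 8 with power-of-two TP (function names are illustrative):

```python
# Hypothetical sketch: enumerate legal decode topologies on 8 GPUs
# (TP * DP == 8) and the single-step "adjacent" moves from a given TP.
def legal_topologies(gpus=8):
    return [(tp, gpus // tp) for tp in (1, 2, 4, 8) if gpus % tp == 0]

def adjacent_moves(tp, gpus=8):
    tps = [t for t, _ in legal_topologies(gpus)]
    i = tps.index(tp)
    return [(tps[j], gpus // tps[j]) for j in (i - 1, i + 1) if 0 <= j < len(tps)]

print(legal_topologies())  # [(1, 8), (2, 4), (4, 2), (8, 1)]
print(adjacent_moves(4))   # [(2, 4), (8, 1)]
```

Under this framing, TP2/DP4 is one of only two adjacent moves from TP4/DP2, which is why the harness proposal coincides with the before-harness iter-2 direction.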
## Follow-up Fix
The seeded prompt exposed a generic diagnosis issue: if the best feasible probe had no latency failures, the harness could miss the prior infeasible probe that showed the real bottleneck at higher load. The harness now scans the probe sequence backward and uses the nearest non-trivial bottleneck before falling back to the best feasible probe. This keeps decode-only runs focused on decode_tpot after a feasible low-load point, without adding testcase thresholds.
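The backward scan can be sketched as below; the probe shape and field names (`bottleneck`, `load`) are illustrative assumptions:

```python
# Hypothetical sketch of the backward probe scan: prefer the nearest
# earlier probe whose bottleneck is non-trivial; fall back to the best
# feasible probe only if no such probe exists.
def pick_diagnosis_probe(probes, best_feasible):
    """probes: list in measurement order; each has a 'bottleneck' field."""
    for probe in reversed(probes):
        if probe.get("bottleneck") not in (None, "none"):
            return probe
    return best_feasible

probes = [
    {"load": 0.5, "bottleneck": None},           # feasible low-load point
    {"load": 1.0, "bottleneck": "decode_tpot"},  # infeasible at higher load
]
print(pick_diagnosis_probe(probes, probes[0]))  # -> the decode_tpot probe
```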
A second generic diagnosis bug was fixed: non-SLO bookkeeping counts such as `probe_elapsed_s>...` no longer collapse to `ttft_prefill` when TTFT/TPOT/request failure counts are all zero.
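A minimal sketch of the fixed classification order; the function name and labels are illustrative, not the harness's actual API:

```python
# Hypothetical sketch of the fixed classification: when TTFT/TPOT/request
# failure counts are all zero, non-SLO bookkeeping notes (e.g. a probe
# wall-clock budget being exceeded) are labeled as bookkeeping, not
# misattributed to ttft_prefill.
def classify_bottleneck(ttft_fails, tpot_fails, request_fails, notes):
    if tpot_fails:
        return "decode_tpot"
    if ttft_fails:
        return "ttft_prefill"
    if request_fails:
        return "request"
    if notes:  # bookkeeping only, e.g. an elapsed-time budget note
        return "bookkeeping"
    return "none"

print(classify_bottleneck(0, 0, 0, ["probe elapsed budget exceeded"]))  # bookkeeping
```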