# Qwen235B Thinking Decode-Only Harness Run, 2026-04-28

## Goal

Run the qwen235b thinking decode-only tuning with the same harness-guided workflow used for the prefill-only test, while keeping the harness generic. The harness must use workload mode, configured SLOs, legal topology constraints, and measured trial history rather than testcase-specific throughput thresholds.

## Baseline Reference

The before-harness comparison run is `dash0-qwen235b-decode-thinking-run5-tpot40-topology`:

| Iter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| before harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |

Before harness, the best feasible config appeared at iter 9 with 0.2817 request/s.

## Harness Change

The decode-only harness now defaults to `decode_tpot` when `trace.request_mode=decode_only` and a TPOT SLO is configured. This avoids treating long decode-only prompt hints as a TTFT-prefill workload.

Active decode harness families are generic:

- `tensor-parallel-size`: legal TP/DP redistribution, judged by configured SLO pass rate and request_rate_per_gpu.
- `data-parallel-size`: legal replica topology changes for decode/admission bottlenecks.
- `max-num-seqs`: concurrency adjustment from observed TPOT failures or SLO headroom.
- `max-num-batched-tokens`: decode batching adjustment after topology is stable.
- `expert-parallel`: preserve known-valid EP topology, but change EP size only with EP-specific evidence.

No qwen235b-specific threshold or testcase-specific rule was added.

## Current Run

Started on dash0, 8x H20.

- Remote spec: `.aituner/harness-qwen235b-decode-20260428/dash0_qwen235b_decode_thinking_harness_20260428.json`
- Remote store: `.aituner/harness-qwen235b-decode-20260428/dash0-qwen235b-decode-thinking-harness-20260428`
- Remote tmux: `aituner_qwen235b_decode_harness_20260428`
- Remote log: `logs/qwen235b_decode_harness_20260428.log`
- Code commit: `39aa47f`
- Verification: local and dash0 both passed `PYTHONPATH=src python3 -m unittest discover -s tests`.

The first attempt started a duplicate `trial-0001` baseline. Because the identical baseline was already measured in run5 and the decode probe can run for many minutes, that duplicate run was stopped and GPUs were freed. The active run is now seeded from the real run5 baseline and continues from `trial-0002`:

- Remote spec: `.aituner/harness-qwen235b-decode-20260428-seeded/dash0_qwen235b_decode_thinking_harness_seeded_20260428.json`
- Remote store: `.aituner/harness-qwen235b-decode-20260428-seeded/dash0-qwen235b-decode-thinking-harness-seeded-20260428`
- Seeded `trial-0001`: 0.1267 request/s, 0.0158 request/s/GPU, pass rate 0.9868.
- `proposal-0002`: legal adjacent decode topology move from `TP4/DP2/EP8` to `TP2/DP4/EP8`; no EP-size search and no testcase threshold.
- `trial-0002`: completed, 0.3767 request/s, 0.0471 request/s/GPU, pass rate 0.9779.
- `trial-0003`: completed with no feasible point for `TP1/DP8/EP8`.
- `proposal-0004`: generated a plausible same-topology `max-num-seqs=160` follow-up, but the raw JSON used an object for `observation`; schema validation rejected it and the tuning CLI exited before materializing `trial-0004`.

The `trial-0002` proposal matches the first useful topology direction from the earlier before-harness run, but the new harness-controlled run measured substantially better throughput for that topology.
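For context on the `proposal-0004` rejection, here is a minimal sketch of the kind of shape check that rejects an `observation` object, assuming the proposal schema requires `observation` to be a plain string. The `validate_proposal` helper and the example payload contents are hypothetical, not the harness's real schema or API.

```python
from typing import Any


def validate_proposal(raw: dict[str, Any]) -> list[str]:
    """Hypothetical shape check: return schema errors; an empty list means accepted.

    Only the `observation` rule that tripped proposal-0004 is modelled here.
    """
    errors: list[str] = []
    observation = raw.get("observation")
    if not isinstance(observation, str):
        errors.append(f"observation: expected string, got {type(observation).__name__}")
    return errors


# proposal-0004 used an object for `observation`, so a check like this rejects it;
# the same max-num-seqs follow-up with a string observation would pass.
# Both payloads below are hypothetical.
rejected = {"max-num-seqs": 160, "observation": {"note": "tpot headroom"}}
accepted = {"max-num-seqs": 160, "observation": "tpot headroom at current topology"}
assert validate_proposal(rejected)
assert not validate_proposal(accepted)
```

A schema-repair retry (discussed under Result Judgment below) would feed errors like these back to the proposer instead of exiting the CLI.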
## Result Judgment

Fig-18-style raw throughput table:

| Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| before harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |
| harness request/s | 0.1267 | 0.3767 | infeasible | not run | not run | not run | not run | not run | not run | not run | not run | not run |

Per-GPU throughput table:

| Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| before harness req/s/GPU | 0.0158 | 0.0306 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.0352 | infeasible | infeasible | infeasible |
| harness req/s/GPU | 0.0158 | 0.0471 | infeasible | not run | not run | not run | not run | not run | not run | not run | not run | not run |

Decision: the harness accelerated convergence on qwen235b decode-only. The before-harness run first reached its best observed throughput at iter 9 with 0.2817 request/s. The harness run exceeded that value at iter 2 with 0.3767 request/s, a 1.34x improvement over the before-harness 12-iter best and a 2.97x improvement over the baseline config.

The harness did not stop cleanly after finding the strong incumbent. It spent one additional trial on `TP1/DP8/EP8`, which found no feasible point, and then the next LLM proposal failed schema validation before trial materialization. So the performance convergence goal is met, but the tuning loop should be hardened so a strong incumbent causes a deterministic stop or a schema-repair retry rather than relying only on prompt instructions.

## Follow-up Fix

The seeded prompt exposed a generic diagnosis issue: if the best feasible probe had no latency failures, the harness could miss the prior infeasible probe that showed the real bottleneck at higher load. The harness now scans the probe sequence backward and uses the nearest non-trivial bottleneck before falling back to the best feasible probe. This keeps decode-only runs focused on `decode_tpot` after a feasible low-load point, without adding testcase thresholds.

A second generic diagnosis bug was fixed: non-SLO bookkeeping counts such as `probe_elapsed_s>...` no longer collapse to `ttft_prefill` when TTFT/TPOT/request failure counts are all zero.
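To make the two diagnosis fixes concrete, here is a minimal sketch of the probe-selection logic, assuming a hypothetical `Probe` record with per-SLO failure counts and a probe sequence ordered by increasing load; `Probe`, `has_nontrivial_bottleneck`, and `select_diagnosis_probe` are illustrative names, not the harness's real data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    load: float            # offered request rate at this probe step
    feasible: bool         # configured SLO pass rate met at this load
    ttft_fail: int = 0     # requests failing the TTFT SLO
    tpot_fail: int = 0     # requests failing the TPOT SLO
    request_fail: int = 0  # hard request failures
    bookkeeping: int = 0   # non-SLO counts, e.g. probe_elapsed_s overruns


def has_nontrivial_bottleneck(probe: Probe) -> bool:
    # Second fix: bookkeeping-only counts are not a bottleneck signal, so a probe
    # with zero TTFT/TPOT/request failures must not collapse to ttft_prefill.
    return (probe.ttft_fail + probe.tpot_fail + probe.request_fail) > 0


def select_diagnosis_probe(probes: list[Probe]) -> Optional[Probe]:
    # First fix: walk the sequence backward (highest load first) and diagnose from
    # the nearest probe with a real bottleneck, so an infeasible high-load probe is
    # not hidden behind a clean low-load best feasible probe.
    for probe in reversed(probes):
        if has_nontrivial_bottleneck(probe):
            return probe
    # Fallback: no probe showed a non-trivial bottleneck, use the best feasible probe.
    feasible = [p for p in probes if p.feasible]
    return max(feasible, key=lambda p: p.load) if feasible else None
```

Under this selection, a decode-only run like the one above stays focused on `decode_tpot` once a higher-load probe shows TPOT failures, even when the best feasible probe itself has no latency failures.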