# Qwen235B Thinking Decode-Only Harness Run, 2026-04-28

## Goal

Run the qwen235b thinking decode-only tuning with the same harness-guided workflow used for the prefill-only test, while keeping the harness generic. The harness must use the workload mode, configured SLOs, legal topology constraints, and measured trial history rather than testcase-specific throughput thresholds.

## Baseline Reference

The before-harness comparison run is `dash0-qwen235b-decode-thinking-run5-tpot40-topology`:

| Iter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| before-harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |

Before the harness, the best feasible config appeared at iter 9 with 0.2817 request/s.

## Harness Change

The decode-only harness now defaults to `decode_tpot` when `trace.request_mode=decode_only` and a TPOT SLO is configured. This avoids treating long decode-only prompt hints as a TTFT-prefill workload (a minimal sketch of this default appears at the end of this note). The active decode harness families are generic:

- `tensor-parallel-size`: legal TP/DP redistribution, judged by configured SLO pass rate and request_rate_per_gpu.
- `data-parallel-size`: legal replica topology changes for decode/admission bottlenecks.
- `max-num-seqs`: concurrency adjustment from observed TPOT failures or SLO headroom.
- `max-num-batched-tokens`: decode batching adjustment after the topology is stable.
- `expert-parallel`: preserve the known-valid EP topology, but change the EP size only with EP-specific evidence.

No qwen235b-specific threshold or testcase-specific rule was added.

## Current Run

Started on dash0 with 8x H20 GPUs.

- Remote spec: `.aituner/harness-qwen235b-decode-20260428/dash0_qwen235b_decode_thinking_harness_20260428.json`
- Remote store: `.aituner/harness-qwen235b-decode-20260428/dash0-qwen235b-decode-thinking-harness-20260428`
- Remote tmux: `aituner_qwen235b_decode_harness_20260428`
- Remote log: `logs/qwen235b_decode_harness_20260428.log`
- Code commit: `39aa47f`
- Verification: local and dash0 both passed `PYTHONPATH=src python3 -m unittest discover -s tests`.

The first attempt started a duplicate `trial-0001` baseline. Because the identical baseline was already measured in run5 and a decode probe can run for many minutes, that duplicate run was stopped and its GPUs were freed. The active run is now seeded from the real run5 baseline and continues from `trial-0002`:

- Remote spec: `.aituner/harness-qwen235b-decode-20260428-seeded/dash0_qwen235b_decode_thinking_harness_seeded_20260428.json`
- Remote store: `.aituner/harness-qwen235b-decode-20260428-seeded/dash0-qwen235b-decode-thinking-harness-seeded-20260428`
- Seeded `trial-0001`: 0.1267 request/s, 0.0158 request/s/GPU, pass rate 0.9868.

## Follow-up Fix

The seeded prompt exposed a generic diagnosis issue: if the best feasible probe had no latency failures, the harness could miss the prior infeasible probe that showed the real bottleneck at higher load. The harness now scans the probe sequence backward and uses the nearest non-trivial bottleneck before falling back to the best feasible probe. This keeps decode-only runs focused on `decode_tpot` after a feasible low-load point, without adding testcase thresholds. A sketch of this backward scan follows below.
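To make the backward scan concrete, here is a minimal Python sketch. The `Probe` record, its field names, and ranking the feasible fallback by measured request rate are all assumptions for illustration, not the harness's actual schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Probe:
    """One measured probe from the trial history (hypothetical shape)."""
    trial_id: str
    feasible: bool
    request_rate: float        # measured request/s; 0.0 if infeasible
    bottleneck: Optional[str]  # e.g. "decode_tpot"; None if no failures

def pick_diagnosis_probe(probes: List[Probe]) -> Optional[Probe]:
    """Walk the probe sequence backward and return the nearest probe that
    carries a non-trivial bottleneck; otherwise fall back to the best
    feasible probe (here ranked by measured request rate)."""
    for probe in reversed(probes):
        if probe.bottleneck is not None:
            return probe
    feasible = [p for p in probes if p.feasible]
    return max(feasible, key=lambda p: p.request_rate, default=None)
```

With a hypothetical history that mirrors the issue above, the old behavior would have diagnosed from the failure-free feasible probe, while the backward scan surfaces the infeasible probe's bottleneck:

```python
history = [
    Probe("trial-0001", True, 0.13, None),           # feasible, no failures
    Probe("trial-0002", False, 0.0, "decode_tpot"),  # infeasible at higher load
]
assert pick_diagnosis_probe(history).bottleneck == "decode_tpot"
```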
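And the workload-mode default from the Harness Change section, as a minimal sketch. The function name, the SLO dictionary shape, and the `"ttft_prefill"` fallback label are illustrative assumptions; only `decode_tpot` and `trace.request_mode=decode_only` come from the actual change:

```python
def default_latency_focus(request_mode: str, slos: dict) -> str:
    """Pick the latency objective a trace should be judged by.

    Hypothetical helper: decode-only traces with a configured TPOT SLO
    default to "decode_tpot", so long decode-only prompt hints are not
    misread as a TTFT-prefill workload.
    """
    if request_mode == "decode_only" and slos.get("tpot") is not None:
        return "decode_tpot"
    return "ttft_prefill"  # illustrative fallback label

# e.g., with the run5-style TPOT SLO of 40 ms configured:
assert default_latency_focus("decode_only", {"tpot": 40}) == "decode_tpot"
```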