aituner/docs/qwen235b-thinking-decode/harness-20260428.md

# Qwen235B Thinking Decode-Only Harness Run, 2026-04-28

## Goal

Run the qwen235b thinking decode-only tuning with the same harness-guided workflow used for the prefill-only test, while keeping the harness generic: it must rely on the workload mode, the configured SLOs, legal topology constraints, and measured trial history, not on testcase-specific throughput thresholds.
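The "generic" requirement can be made concrete as a configuration shape. A minimal sketch, assuming a dict-based config; every key name here is illustrative, not the actual aituner schema:

```python
# Hypothetical harness config sketch -- key names are illustrative,
# not the real aituner schema.
harness_config = {
    "workload_mode": "decode_only",          # derived from the trace's request mode
    "slos": {"tpot_ms": 40},                 # configured SLO, e.g. TPOT <= 40 ms
    "topology": {"gpus": 8, "legal_tp": [1, 2, 4, 8]},  # legal topology constraints
    "history": [],                           # measured trial history, appended per iter
}

def is_generic(cfg: dict) -> bool:
    """A generic harness config carries no testcase-specific throughput threshold."""
    return "throughput_threshold" not in cfg

assert is_generic(harness_config)
```

The point of the `is_generic` check is only to state the invariant: nothing in the config encodes a qwen235b-specific pass/fail number.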

## Baseline Reference

The before-harness comparison run is `dash0-qwen235b-decode-thinking-run5-tpot40-topology`:

| Iter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Before harness (request/s) | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |

Before the harness, the best feasible config appeared at iteration 9, at 0.2817 request/s.

## Harness Change

The decode-only harness now defaults to `decode_tpot` when `trace.request_mode=decode_only` and a TPOT SLO is configured. This avoids treating long decode-only prompt hints as a TTFT-prefill workload.
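The default-selection rule can be sketched as a small function. This is an illustrative reconstruction of the behavior described above, not aituner's actual API; the function name, `slos` field, and fallback objective are all assumptions:

```python
# Sketch of the objective-selection rule described above; function and
# field names are illustrative, not aituner's real API.
def pick_objective(request_mode: str, slos: dict) -> str:
    """Default to decode_tpot for decode-only traces with a TPOT SLO,
    so long decode-only prompt hints are not misread as a TTFT/prefill
    workload."""
    if request_mode == "decode_only" and "tpot_ms" in slos:
        return "decode_tpot"
    return "ttft_prefill"  # hypothetical fallback for prefill-style traces

assert pick_objective("decode_only", {"tpot_ms": 40}) == "decode_tpot"
assert pick_objective("prefill_only", {"ttft_ms": 500}) == "ttft_prefill"
```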

Active decode harness families are generic:

- `tensor-parallel-size`: legal TP/DP redistribution, judged by configured SLO pass rate and request_rate_per_gpu.
- `data-parallel-size`: legal replica topology changes for decode/admission bottlenecks.
- `max-num-seqs`: concurrency adjustment from observed TPOT failures or SLO headroom.
- `max-num-batched-tokens`: decode batching adjustment after topology is stable.
- `expert-parallel`: preserve known-valid EP topology, but change EP size only with EP-specific evidence.
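The judging criterion shared by these families (SLO pass rate plus request_rate_per_gpu, no model-specific threshold) can be sketched as follows; the function signature and the 0.99 default pass rate are assumptions for illustration:

```python
# Sketch of how a family's trial might be judged generically, by the
# configured SLO pass rate and per-GPU request rate rather than any
# qwen235b-specific threshold. Names and defaults are illustrative.
def judge_trial(slo_pass_rate: float, request_rate: float, num_gpus: int,
                min_pass_rate: float = 0.99) -> tuple[bool, float]:
    """Return (feasible, request_rate_per_gpu) for one measured trial."""
    feasible = slo_pass_rate >= min_pass_rate
    return feasible, request_rate / num_gpus

# Example: a trial sustaining 2.2536 request/s on 8 GPUs with all
# requests inside the TPOT SLO.
feasible, rps_per_gpu = judge_trial(slo_pass_rate=1.0,
                                    request_rate=2.2536, num_gpus=8)
```

Because the criterion is expressed per GPU, the same comparison remains meaningful when a family changes the TP/DP topology and with it the GPU count per replica.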

No qwen235b-specific threshold or testcase-specific rule was added.

## Current Run

Pending. The next run will use `dash0`, 8x H20, and store results under `.aituner/harness-qwen235b-decode-20260428`.