# Qwen235B Thinking Decode-Only Harness Run, 2026-04-28
## Goal
Run the qwen235b thinking decode-only tuning with the same harness-guided workflow used for the prefill-only test, while keeping the harness generic. The harness must use workload mode, configured SLOs, legal topology constraints, and measured trial history rather than testcase-specific throughput thresholds.
## Baseline Reference
The before-harness comparison run is `dash0-qwen235b-decode-thinking-run5-tpot40-topology`:

| Iter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| before harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |

Before harness, the best feasible config appeared at iter 9 with 0.2817 request/s.
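
The best feasible iteration can be re-derived from the table with a few lines of Python. This is a minimal sketch: the values are copied from the table above, and the helper name `best_feasible` is hypothetical, not part of aituner.

```python
from typing import List, Optional, Tuple

# Before-harness results copied from the table. Non-numeric entries mark
# trials that produced no feasible throughput.
BEFORE_HARNESS = [
    "0.1267", "0.2450", "infeasible", "launch fail", "infeasible", "infeasible",
    "infeasible", "infeasible", "0.2817", "infeasible", "infeasible", "infeasible",
]


def best_feasible(results: List[str]) -> Optional[Tuple[int, float]]:
    """Return (1-based iteration, request/s) of the best feasible trial, or None."""
    best: Optional[Tuple[int, float]] = None
    for iteration, raw in enumerate(results, start=1):
        try:
            rate = float(raw)
        except ValueError:
            continue  # "infeasible" or "launch fail"
        if best is None or rate > best[1]:
            best = (iteration, rate)
    return best


print(best_feasible(BEFORE_HARNESS))  # (9, 0.2817)
```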
## Harness Change
The decode-only harness now defaults to `decode_tpot` when `trace.request_mode=decode_only` and a TPOT SLO is configured. This keeps the long prompt hints carried by decode-only traces from being treated as a TTFT-bound prefill workload.
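
The defaulting rule can be sketched as a small function. This is an illustration under stated assumptions, not the harness code: the function name `default_decode_mode`, the `slos` mapping, and its `"tpot"` key are hypothetical, and only `decode_only` and `decode_tpot` come from the description above.

```python
from typing import Mapping, Optional


def default_decode_mode(request_mode: str, slos: Mapping[str, float]) -> Optional[str]:
    """Return "decode_tpot" for a decode-only trace with a TPOT SLO configured.

    Returns None otherwise, leaving the decision to whatever fallback
    logic the harness already has (not modeled here).
    """
    # Assumption: a configured TPOT SLO shows up as a non-None "tpot" entry.
    if request_mode == "decode_only" and slos.get("tpot") is not None:
        return "decode_tpot"
    return None
```

With this sketch, `default_decode_mode("decode_only", {"tpot": 40.0})` returns `"decode_tpot"`, while a decode-only trace with no TPOT SLO returns `None`. The value `40.0` is only a placeholder.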
Active decode harness families are generic:
- `tensor-parallel-size`: legal TP/DP redistribution, judged by configured SLO pass rate and request_rate_per_gpu.
- `data-parallel-size`: legal replica topology changes for decode/admission bottlenecks.
- `max-num-seqs`: concurrency adjustment from observed TPOT failures or SLO headroom.
- `max-num-batched-tokens`: decode batching adjustment after topology is stable.
- `expert-parallel`: preserve known-valid EP topology, but change EP size only with EP-specific evidence.

No qwen235b-specific threshold or testcase-specific rule was added.
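
One way to read "judged by configured SLO pass rate and request_rate_per_gpu" is as a two-step comparison over measured trials: filter by the configured pass rate, then rank by per-GPU request rate. The sketch below assumes that reading. The `Trial` fields, the `pick_best` helper, and the `min_pass_rate` parameter are hypothetical names, and the numbers in the usage lines are made up for illustration, not measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    name: str
    slo_pass_rate: float  # fraction of requests meeting the configured SLOs
    request_rate: float   # sustained request/s for the whole deployment
    num_gpus: int

    @property
    def request_rate_per_gpu(self) -> float:
        return self.request_rate / self.num_gpus


def pick_best(trials: List[Trial], min_pass_rate: float) -> Optional[Trial]:
    """Pick the trial with the highest per-GPU rate among those passing the SLO.

    min_pass_rate is assumed to come from configuration, so no
    model-specific throughput threshold is involved.
    """
    feasible = [t for t in trials if t.slo_pass_rate >= min_pass_rate]
    if not feasible:
        return None
    return max(feasible, key=lambda t: t.request_rate_per_gpu)


# Illustrative values only.
trials = [
    Trial("a", slo_pass_rate=0.99, request_rate=0.4, num_gpus=8),
    Trial("b", slo_pass_rate=0.80, request_rate=0.8, num_gpus=8),
    Trial("c", slo_pass_rate=0.97, request_rate=0.3, num_gpus=4),
]
print(pick_best(trials, min_pass_rate=0.95).name)  # c
```

Trial `b` has the highest raw rate but fails the pass-rate filter, and `c` beats `a` because the comparison is normalized per GPU.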
## Current Run
Pending. The next run will use dash0 with 8x H20 GPUs and store results under `.aituner/harness-qwen235b-decode-20260428`.