# Qwen235B Thinking Decode-Only Harness Run, 2026-04-28
## Goal
Run the qwen235b thinking decode-only tuning with the same harness-guided workflow used for the prefill-only test, while keeping the harness generic. The harness must use workload mode, configured SLOs, legal topology constraints, and measured trial history rather than testcase-specific throughput thresholds.
## Baseline Reference
The before-harness comparison run is `dash0-qwen235b-decode-thinking-run5-tpot40-topology`:
| Iter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Before harness (request/s) | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |
Before the harness, only three of the twelve iterations (1, 2, and 9) produced a feasible config. The best was iter 9 at 0.2817 request/s.
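
The baseline pick can be reproduced from the table row with a few lines of Python. This is only a sketch for checking the arithmetic; the list literal is copied from the table, and `best_feasible` is a hypothetical helper, not part of the harness.

```python
# Before-harness request/s per iteration, copied from the baseline table.
# Non-numeric outcomes ("infeasible", "launch fail") are kept as strings.
before_harness = [
    0.1267, 0.2450, "infeasible", "launch fail", "infeasible", "infeasible",
    "infeasible", "infeasible", 0.2817, "infeasible", "infeasible", "infeasible",
]


def best_feasible(results):
    """Return (iteration, request_per_s) for the highest feasible rate, or None."""
    feasible = [
        (i, rate) for i, rate in enumerate(results, start=1)
        if isinstance(rate, float)
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda pair: pair[1])


best_iter, best_rate = best_feasible(before_harness)
print(best_iter, best_rate)  # 9 0.2817
```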
## Harness Change
The decode-only harness now defaults to `decode_tpot` when `trace.request_mode=decode_only` and a TPOT SLO is configured. This avoids treating long decode-only prompt hints as a TTFT-prefill workload.
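
A minimal sketch of that default, assuming the config is available as a nested dict. Only `trace.request_mode`, `decode_only`, and `decode_tpot` come from this note; the function name, the `slo.tpot_ms` key, the value 40.0, and the `ttft_prefill` fallback label are hypothetical placeholders, not the harness's real identifiers.

```python
from typing import Any, Dict


def default_harness_mode(config: Dict[str, Any]) -> str:
    """Pick the harness mode from the request mode and the configured SLOs.

    Hypothetical helper: key names other than trace.request_mode are assumed.
    """
    request_mode = config.get("trace", {}).get("request_mode")
    tpot_slo = config.get("slo", {}).get("tpot_ms")  # assumed key name

    # Decode-only traces with a TPOT SLO are tuned against TPOT, even when
    # the trace carries long prompt hints that would otherwise look like a
    # prefill-heavy workload.
    if request_mode == "decode_only" and tpot_slo is not None:
        return "decode_tpot"
    return "ttft_prefill"  # assumed fallback label


decode_cfg = {"trace": {"request_mode": "decode_only"}, "slo": {"tpot_ms": 40.0}}
print(default_harness_mode(decode_cfg))  # decode_tpot
```

Keying the decision on the request mode and the presence of an SLO, rather than on prompt length, keeps the rule independent of any particular model or testcase.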
Active decode harness families are generic:
- `tensor-parallel-size`: legal TP/DP redistribution, judged by configured SLO pass rate and `request_rate_per_gpu`.
- `data-parallel-size`: legal replica topology changes for decode/admission bottlenecks.
- `max-num-seqs`: concurrency adjustment from observed TPOT failures or SLO headroom.
- `max-num-batched-tokens`: decode batching adjustment after topology is stable.
- `expert-parallel`: preserve the known-valid EP topology; change EP size only with EP-specific evidence.
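
The judging rule for topology changes (SLO pass rate first, then `request_rate_per_gpu`) could look roughly like the sketch below. The `Trial` fields, the `pick_best` helper, the pass-rate threshold, and all numbers are illustrative assumptions, not measured results or the harness's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Trial:
    name: str
    request_rate: float   # requests per second across the whole deployment
    num_gpus: int         # GPUs used by this topology
    slo_pass_rate: float  # fraction of requests meeting the configured SLO

    @property
    def request_rate_per_gpu(self) -> float:
        return self.request_rate / self.num_gpus


def pick_best(trials: List[Trial], min_pass_rate: float) -> Optional[Trial]:
    """Drop trials below the SLO pass-rate bar, then rank by per-GPU rate."""
    feasible = [t for t in trials if t.slo_pass_rate >= min_pass_rate]
    if not feasible:
        return None
    return max(feasible, key=lambda t: t.request_rate_per_gpu)


# Synthetic trial history, for illustration only.
history = [
    Trial("a", request_rate=0.8, num_gpus=8, slo_pass_rate=0.99),
    Trial("b", request_rate=1.2, num_gpus=8, slo_pass_rate=0.80),
    Trial("c", request_rate=0.5, num_gpus=4, slo_pass_rate=0.97),
]
best = pick_best(history, min_pass_rate=0.95)
print(best.name if best else "no feasible trial")  # c
```

Trial `b` has the highest raw throughput but misses the pass-rate bar, and `c` beats `a` once the rate is normalized by GPU count. No model-specific throughput threshold appears in the comparison.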
No qwen235b-specific threshold or testcase-specific rule was added.
## Current Run
Pending. The next run will use dash0, 8x H20, and store results under `.aituner/harness-qwen235b-decode-20260428`.