Qwen235B Thinking Decode-Only Harness Run, 2026-04-28

Goal

Run the qwen235b thinking decode-only tuning with the same harness-guided workflow used for the prefill-only test, while keeping the harness generic. The harness must use workload mode, configured SLOs, legal topology constraints, and measured trial history rather than testcase-specific throughput thresholds.

Baseline Reference

The before-harness comparison run is dash0-qwen235b-decode-thinking-run5-tpot40-topology:

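| Iter | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| before harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |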

Before harness, the best feasible config appeared at iter 9 with 0.2817 request/s.

Harness Change

The decode-only harness now defaults to decode_tpot when trace.request_mode=decode_only and a TPOT SLO is configured. This avoids treating long decode-only prompt hints as a TTFT-prefill workload.
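
The defaulting rule above can be sketched as a small function. This is a minimal illustration, not the harness code: the function name, the argument names, and the `ttft_prefill` fallback for every other case are assumptions made for the example.

```python
from typing import Optional


def default_bottleneck_family(request_mode: str, tpot_slo_ms: Optional[float]) -> str:
    """Pick the bottleneck family the harness reasons about by default.

    Hypothetical helper: only the decode_only + TPOT SLO -> decode_tpot rule
    comes from the text; the fallback label is an assumption.
    """
    if request_mode == "decode_only" and tpot_slo_ms is not None:
        return "decode_tpot"
    return "ttft_prefill"


# Illustrative values only; the SLO number is not taken from the study spec.
decode_choice = default_bottleneck_family("decode_only", 40.0)
no_slo_choice = default_bottleneck_family("decode_only", None)
```

The point of the rule is that the workload mode and the configured SLO decide the default, not the prompt length hints carried by the trace.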

Active decode harness families are generic:

  • tensor-parallel-size: legal TP/DP redistribution, judged by configured SLO pass rate and request_rate_per_gpu.
  • data-parallel-size: legal replica topology changes for decode/admission bottlenecks.
  • max-num-seqs: concurrency adjustment from observed TPOT failures or SLO headroom.
  • max-num-batched-tokens: decode batching adjustment after topology is stable.
  • expert-parallel: preserve known-valid EP topology, but change EP size only with EP-specific evidence.

No qwen235b-specific threshold or testcase-specific rule was added.

Current Run

Started on dash0, 8x H20.

  • Remote spec: .aituner/harness-qwen235b-decode-20260428/dash0_qwen235b_decode_thinking_harness_20260428.json
  • Remote store: .aituner/harness-qwen235b-decode-20260428/dash0-qwen235b-decode-thinking-harness-20260428
  • Remote tmux: aituner_qwen235b_decode_harness_20260428
  • Remote log: logs/qwen235b_decode_harness_20260428.log
  • Code commit: 39aa47f
  • Verification: local and dash0 both passed PYTHONPATH=src python3 -m unittest discover -s tests.

The first attempt started a duplicate trial-0001 baseline. Because the identical baseline was already measured in run5 and the decode probe can run for many minutes, that duplicate run was stopped and GPUs were freed.

The active run is now seeded from the real run5 baseline and continues from trial-0002:

  • Remote spec: .aituner/harness-qwen235b-decode-20260428-seeded/dash0_qwen235b_decode_thinking_harness_seeded_20260428.json
  • Remote store: .aituner/harness-qwen235b-decode-20260428-seeded/dash0-qwen235b-decode-thinking-harness-seeded-20260428
  • Seeded trial-0001: 0.1267 request/s, 0.0158 request/s/GPU, pass rate 0.9868.
  • proposal-0002: legal adjacent decode topology move from TP4/DP2/EP8 to TP2/DP4/EP8; no EP-size search and no testcase threshold.
  • trial-0002: completed, 0.3767 request/s, 0.0471 request/s/GPU, pass rate 0.9779.
  • trial-0003: completed with no feasible point for TP1/DP8/EP8.
  • trial-0004: completed with no feasible point for max-num-seqs=160.
  • Important caveat: trial-0004 did not actually validate TP2/DP4/EP8 + max-num-seqs=160. AITuner applies config_patch relative to the study base config, and the proposal only patched max-num-seqs. The actual launch therefore used the base topology TP4/DP2/EP8 + max-num-seqs=160, so this is not evidence that same-topology refinement around trial-0002 is exhausted.
  • trial-0005: corrected same-topology validation, TP2/DP4/EP8 + max-num-seqs=160; completed with no feasible point.
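
The trial-0004 caveat comes down to patch semantics. The sketch below assumes a plain dictionary merge against the study base config; the `materialize` helper and the `expert-parallel-size` key name are hypothetical stand-ins, not AITuner's actual code or schema.

```python
def materialize(base_config: dict, config_patch: dict) -> dict:
    """Apply a patch relative to the study base config (assumed merge semantics)."""
    merged = dict(base_config)
    merged.update(config_patch)
    return merged


# Base topology from the run: TP4/DP2/EP8. Key names are illustrative.
base = {"tensor-parallel-size": 4, "data-parallel-size": 2, "expert-parallel-size": 8}

# proposal-0004 patched only max-num-seqs, so the base topology was launched.
trial_0004 = materialize(base, {"max-num-seqs": 160})

# trial-0005 restated the incumbent topology in the patch.
trial_0005 = materialize(
    base,
    {"tensor-parallel-size": 2, "data-parallel-size": 4, "max-num-seqs": 160},
)
```

Under these assumptions `trial_0004` keeps TP4/DP2, which is why it says nothing about refinement around the TP2/DP4 incumbent.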

The trial-0002 proposal matches the first useful topology direction from the earlier before-harness run, but the new harness-controlled run measured substantially better throughput for that topology.

Result Judgment

Fig-18-style raw throughput table:

| Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| before harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |
| harness request/s | 0.1267 | 0.3767 | infeasible | infeasible | infeasible | not run | not run | not run | not run | not run | not run | not run |

Per-GPU throughput table:

| Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| before harness req/s/GPU | 0.0158 | 0.0306 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.0352 | infeasible | infeasible | infeasible |
| harness req/s/GPU | 0.0158 | 0.0471 | infeasible | infeasible | infeasible | not run | not run | not run | not run | not run | not run | not run |

Decision: the harness accelerated convergence on qwen235b decode-only, but this is not a proof of global optimality after one proposal. The before-harness run first reached its best observed throughput at iter 9 with 0.2817 request/s. The harness run exceeded that value at iter 2 with 0.3767 request/s, a 1.34x improvement over the before-harness 12-iter best and a 2.97x improvement over the baseline config.
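
The ratios quoted above follow directly from the table values. A short check, using only numbers stated in this report (8 GPUs on dash0):

```python
NUM_GPUS = 8  # dash0, 8x H20

baseline = 0.1267      # iter 1, TP4/DP2/EP8
before_best = 0.2817   # before-harness best, iter 9
harness_best = 0.3767  # harness trial-0002, iter 2

vs_before = harness_best / before_best
vs_baseline = harness_best / baseline
per_gpu = harness_best / NUM_GPUS

print(f"{vs_before:.2f}x over before-harness best")  # 1.34x
print(f"{vs_baseline:.2f}x over baseline")           # 2.97x
print(f"{per_gpu:.4f} request/s/GPU")                # 0.0471
```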

The harness did not stop cleanly after finding the strong incumbent. It spent one additional trial on TP1/DP8/EP8, which found no feasible point. The next proposal intended same-topology runtime validation, but omitted the incumbent topology fields, so the materialized trial validated the base topology instead. This issue was corrected with trial-0005.

Important interpretation: trial-0002 should be called the current best observed config, not a proven global optimum. The harness got there quickly because the decode-only harness biases the first proposal toward the most relevant adjacent topology redistribution, TP4/DP2/EP8 -> TP2/DP4/EP8, instead of spending trials on prefill-oriented runtime knobs. Later validation supports local optimality only against what was actually tested: the adjacent topology TP1/DP8/EP8 and the same-topology runtime refinement max-num-seqs=160.

Follow-up implementation after this result:

  • strong_incumbent.guard_active no longer directly contributes to should_stop_if_no_harness_can_justify_a_new_adjacent_probe.
  • A strong incumbent now means "enter validation phase": run adjacent topology or same-topology runtime probes that could falsify the incumbent.
  • The proposal rules now explicitly say not to stop solely because a strong incumbent appeared.
  • Proposal parsing now accepts structured observation/diagnosis by converting them to text, so a usable validation proposal is not dropped only because the LLM used an object instead of a string.
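
The parsing change in the last bullet can be illustrated as follows. This is a sketch under assumptions: the helper names and the choice of JSON serialization are hypothetical, and the real parser may convert structured values differently.

```python
import json
from typing import Any


def coerce_to_text(value: Any) -> str:
    """Turn a structured observation/diagnosis into text instead of rejecting it."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value
    # Assumed conversion: stable JSON text for objects and lists.
    return json.dumps(value, sort_keys=True, ensure_ascii=False)


def normalize_proposal(raw: dict) -> dict:
    """Return a copy of the proposal with text-valued observation/diagnosis."""
    normalized = dict(raw)
    for field in ("observation", "diagnosis"):
        normalized[field] = coerce_to_text(raw.get(field))
    return normalized


proposal = normalize_proposal(
    {
        "observation": {"tpot_failures": 3},  # the LLM returned an object here
        "diagnosis": "decode_tpot",
        "config_patch": {"max-num-seqs": 160},
    }
)
```

The proposal survives with `observation` as the string `{"tpot_failures": 3}`, so the validation trial is not dropped over a formatting difference.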

After the implementation fix, the previously rejected proposal-0004 was resumed as a validation trial:

  • trial-0004: intended same-topology validation with max-num-seqs=160, but actually ran on base topology because the proposal omitted TP2/DP4/EP8.
  • Remote tmux: aituner_qwen235b_decode_harness_validate_20260428.
  • Result: completed with no feasible point. This is useful negative evidence for the base topology plus max-num-seqs=160, but not for the trial-0002 incumbent topology.

A second validation trial was then launched with the full incumbent topology in the patch:

  • trial-0005 config: TP2/DP4/EP8 + max-num-seqs=160.
  • Search range: low 0.017028808593, high 0.125, tolerance 0.001, max probes 6.
  • Result: completed with no feasible point; trial-0002 remained the best trial.
  • Probe outcomes:

| Probe sampling_u | Request/s | Pass rate | Feasible | Early-stop reason |
| --- | --- | --- | --- | --- |
| 0.0710144 | 1.7800 | 0.2818 | no | slo_pass_rate_unrecoverable |
| 0.0440216 | 1.0900 | 0.1789 | no | slo_pass_rate_unrecoverable |
| 0.0305252 | 0.7050 | 0.3002 | no | slo_pass_rate_unrecoverable |
| 0.0237770 | 0.5417 | 0.4092 | no | slo_pass_rate_unrecoverable |
| 0.0204029 | 0.4533 | 0.4890 | no | slo_pass_rate_unrecoverable |
| 0.0187159 | 0.4117 | 0.5466 | no | slo_pass_rate_unrecoverable |
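
The probe sampling_u values are consistent with a plain bisection over the stated search range in which every infeasible probe lowers the upper bound. That reading is inferred from the numbers rather than confirmed against the AITuner source, and the function below is a hypothetical sketch of it.

```python
from typing import Callable, List


def probe_sequence(
    low: float,
    high: float,
    tolerance: float,
    max_probes: int,
    is_feasible: Callable[[float], bool],
) -> List[float]:
    """Bisect [low, high]; feasible probes raise the floor, infeasible ones lower the ceiling."""
    probes: List[float] = []
    while high - low > tolerance and len(probes) < max_probes:
        mid = (low + high) / 2.0
        probes.append(mid)
        if is_feasible(mid):
            low = mid
        else:
            high = mid
    return probes


# trial-0005 search settings; every probe was infeasible.
probes = probe_sequence(0.017028808593, 0.125, 0.001, 6, lambda u: False)
print([f"{u:.7f}" for u in probes])
# ['0.0710144', '0.0440216', '0.0305252', '0.0237770', '0.0204029', '0.0187159']
```

After six probes the remaining interval is still wider than the 0.001 tolerance, so under this reading the search ended on the probe budget rather than on convergence.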

This directly answers the one-iter-to-best concern for this refinement: the harness did not stop after trial-0002; it ran a corrected same-topology validation, and every tested point above the incumbent search floor failed the 95% TPOT SLO. Therefore max-num-seqs=160 does not falsify trial-0002 as the current best.

Follow-up Fix

The seeded prompt exposed a generic diagnosis issue: if the best feasible probe had no latency failures, the harness could miss the prior infeasible probe that showed the real bottleneck at higher load. The harness now scans the probe sequence backward and uses the nearest non-trivial bottleneck before falling back to the best feasible probe. This keeps decode-only runs focused on decode_tpot after a feasible low-load point, without adding testcase thresholds.

A second generic diagnosis bug was fixed: non-SLO bookkeeping counts such as probe_elapsed_s>... no longer collapse to ttft_prefill when TTFT/TPOT/request failure counts are all zero.
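
Both diagnosis fixes can be sketched together. The probe record layout, the count key names, and the `request_failure` label are assumptions for illustration; only the backward scan and the rule that bookkeeping-only counts must not map to `ttft_prefill` come from the text.

```python
from typing import List, Optional, Tuple


def classify(failure_counts: dict) -> Optional[str]:
    """Label a probe's bottleneck from SLO failure counts only.

    Bookkeeping entries such as "probe_elapsed_s>..." are ignored, so a probe
    with zero TTFT/TPOT/request failures yields no label at all.
    """
    slo_counts = {
        "ttft_prefill": failure_counts.get("ttft", 0),
        "decode_tpot": failure_counts.get("tpot", 0),
        "request_failure": failure_counts.get("request_failure", 0),
    }
    if not any(slo_counts.values()):
        return None
    return max(slo_counts, key=slo_counts.get)


def diagnose(probes: List[dict], best_feasible_index: int) -> Tuple[Optional[str], int]:
    """Scan probes from last to first; fall back to the best feasible probe."""
    for index in range(len(probes) - 1, -1, -1):
        label = classify(probes[index]["failure_counts"])
        if label is not None:
            return label, index
    return None, best_feasible_index


history = [
    {"feasible": False, "failure_counts": {"tpot": 12}},                # higher load
    {"feasible": True, "failure_counts": {"probe_elapsed_s>...": 1}},   # low load
]
label, source_index = diagnose(history, best_feasible_index=1)
```

Here the feasible low-load probe carries no SLO signal, so the diagnosis comes from the earlier infeasible probe and stays on `decode_tpot`.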

Follow-up Fix, 2026-04-30

The base-relative patch issue is now guarded in code, not only in the LLM prompt. When StudyStore.materialize_trial sees a runtime/env-only proposal after a non-base incumbent has been found, it inherits the incumbent topology patch into the trial spec unless the proposal explicitly provides a topology. This keeps same-topology runtime validation on the actual incumbent while preserving the ability to test the base topology by stating it explicitly.
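
The guard described above might look roughly like this. It is not the actual `StudyStore.materialize_trial` code: the function name, the topology key set, and the `expert-parallel-size` key are hypothetical.

```python
# Assumed set of keys that count as "topology" for the purpose of the guard.
TOPOLOGY_KEYS = ("tensor-parallel-size", "data-parallel-size", "expert-parallel-size")


def effective_patch(proposal_patch: dict, incumbent_patch: dict) -> dict:
    """Inherit the incumbent topology unless the proposal states a topology itself."""
    if any(key in proposal_patch for key in TOPOLOGY_KEYS):
        return dict(proposal_patch)  # explicit topology wins, including the base one
    inherited = {k: v for k, v in incumbent_patch.items() if k in TOPOLOGY_KEYS}
    inherited.update(proposal_patch)
    return inherited


incumbent = {"tensor-parallel-size": 2, "data-parallel-size": 4}

# Runtime-only proposal: stays on the incumbent topology.
runtime_only = effective_patch({"max-num-seqs": 160}, incumbent)

# Explicit base topology: still testable by stating it.
explicit_base = effective_patch(
    {"tensor-parallel-size": 4, "data-parallel-size": 2, "max-num-seqs": 160},
    incumbent,
)
```

With a guard of this shape, a proposal like the original proposal-0004 would have been materialized on TP2/DP4 rather than on the base topology.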

Local verification: PYTHONPATH=src python3 -m unittest discover -s tests passed 68 tests.

Current Harness Judgment

For qwen235b decode-only, the harness still accelerates convergence: before harness, the best observed 12-iter result appeared at iter 9 with 0.2817 request/s; with harness, iter 2 reached 0.3767 request/s and later validation did not find a better adjacent or same-topology runtime point.

The remaining optimization target is validation cost, not convergence quality. trial-0005 took a long time because each early-stopped decode-only probe still had to wait for in-flight long-output requests to finish; restarting the engine after early stop avoids that wait. Future harness/study templates for long decode-only validation should set, or automatically recommend, trace.restart_engine_after_early_stop=true when repeated SLO-unrecoverable probes are expected.
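
One way such a recommendation could be wired in is sketched below. Everything here is hypothetical: the helper names, the `min_repeats` parameter, and the assumption that the dotted name `trace.restart_engine_after_early_stop` maps to a nested `trace` object in the study spec.

```python
from typing import List


def should_recommend_restart(early_stop_reasons: List[str], min_repeats: int = 2) -> bool:
    """True when SLO-unrecoverable early stops repeat within one trial."""
    repeats = sum(1 for reason in early_stop_reasons if reason == "slo_pass_rate_unrecoverable")
    return repeats >= min_repeats


def with_restart_flag(spec: dict) -> dict:
    """Return a copy of the spec with trace.restart_engine_after_early_stop enabled."""
    patched = dict(spec)
    trace = dict(patched.get("trace", {}))
    trace["restart_engine_after_early_stop"] = True
    patched["trace"] = trace
    return patched


# trial-0005: all six probes stopped early for the same reason.
reasons = ["slo_pass_rate_unrecoverable"] * 6
spec = {"trace": {"request_mode": "decode_only"}}
next_spec = with_restart_flag(spec) if should_recommend_restart(reasons) else spec
```

The repeat count is a property of the probe history, not of the testcase, which keeps the rule in line with the no-testcase-threshold constraint stated in the goal.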