diff --git a/docs/qwen235b-thinking-decode/harness-20260428.md b/docs/qwen235b-thinking-decode/harness-20260428.md
index 6426182..427100b 100644
--- a/docs/qwen235b-thinking-decode/harness-20260428.md
+++ b/docs/qwen235b-thinking-decode/harness-20260428.md
@@ -47,9 +47,31 @@ The active run is now seeded from the real run5 baseline and continues from `tri
 
 - Remote store: `.aituner/harness-qwen235b-decode-20260428-seeded/dash0-qwen235b-decode-thinking-harness-seeded-20260428`
 - Seeded `trial-0001`: 0.1267 request/s, 0.0158 request/s/GPU, pass rate 0.9868.
 - `proposal-0002`: legal adjacent decode topology move from `TP4/DP2/EP8` to `TP2/DP4/EP8`; no EP-size search and no testcase threshold.
-- `trial-0002` status: running on dash0 in `tmux` session `aituner_qwen235b_decode_harness_seeded_20260428`.
+- `trial-0002`: completed, 0.3767 request/s, 0.0471 request/s/GPU, pass rate 0.9779.
+- `trial-0003`: completed with no feasible point for `TP1/DP8/EP8`.
+- `proposal-0004`: generated a plausible same-topology `max-num-seqs=160` follow-up, but the raw JSON used an object for `observation`; schema validation rejected it and the tuning CLI exited before materializing `trial-0004`.
 
-The `trial-0002` proposal matches the first useful topology direction from the earlier before-harness run, where the same effective config reached 0.2450 request/s at iter 2. The current run is still executing to verify this under the new harness-controlled study state before claiming final convergence data.
+The `trial-0002` proposal matches the first useful topology direction from the earlier before-harness run, but the new harness-controlled run measured substantially better throughput for that topology: 0.3767 request/s versus the 0.2450 request/s the same effective config reached at iter 2 before the harness.
+
+## Result Judgment
+
+Fig-18-style raw throughput table:
+
+| Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
+| --- | ---: | ---: | --- | --- | --- | --- | --- | --- | ---: | --- | --- | --- |
+| before harness request/s | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.2817 | infeasible | infeasible | infeasible |
+| harness request/s | 0.1267 | 0.3767 | infeasible | not run | not run | not run | not run | not run | not run | not run | not run | not run |
+
+Per-GPU throughput table:
+
+| Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
+| --- | ---: | ---: | --- | --- | --- | --- | --- | --- | ---: | --- | --- | --- |
+| before harness req/s/GPU | 0.0158 | 0.0306 | infeasible | launch fail | infeasible | infeasible | infeasible | infeasible | 0.0352 | infeasible | infeasible | infeasible |
+| harness req/s/GPU | 0.0158 | 0.0471 | infeasible | not run | not run | not run | not run | not run | not run | not run | not run | not run |
+
+Decision: the harness accelerated convergence on qwen235b decode-only. The before-harness run first reached its best observed throughput at iter 9 with 0.2817 request/s. The harness run exceeded that value at iter 2 with 0.3767 request/s, a 1.34x improvement over the before-harness 12-iter best and a 2.97x improvement over the baseline config.
+
+The harness did not stop cleanly after finding the strong incumbent. It spent one additional trial on `TP1/DP8/EP8`, which found no feasible point, and then the next LLM proposal failed schema validation before trial materialization. So the performance convergence goal is met, but the tuning loop should be hardened so a strong incumbent causes a deterministic stop or a schema-repair retry rather than relying only on prompt instructions.
 
 ## Follow-up Fix
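
The hardening recommended in the patch above (a schema-repair retry for malformed proposal JSON, plus a deterministic stop once a strong incumbent is found) can be sketched as follows. This is a minimal illustration, not the tuning CLI's actual contract: the field names (`config`, `observation`), the single-key-unwrap repair heuristic, and the stop rule are all assumptions chosen to mirror the `proposal-0004` failure mode, where `observation` arrived as an object instead of a bare number.

```python
import json

# Hypothetical proposal contract: "config" is an object of knob settings
# (e.g. {"max-num-seqs": 160}), "observation" is a bare throughput number.

def validate(proposal: dict) -> list:
    """Return a list of schema violations (empty list means valid)."""
    errors = []
    if not isinstance(proposal.get("config"), dict):
        errors.append("config must be an object of knob settings")
    obs = proposal.get("observation")
    if isinstance(obs, bool) or not isinstance(obs, (int, float)):
        errors.append("observation must be a bare number, not an object")
    return errors

def repair(proposal: dict) -> dict:
    """One mechanical repair pass: unwrap a single-key object like
    {"request_per_s": 0.3767} into the bare number the schema expects."""
    fixed = dict(proposal)
    obs = fixed.get("observation")
    if isinstance(obs, dict) and len(obs) == 1:
        (value,) = obs.values()
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            fixed["observation"] = float(value)
    return fixed

def accept_proposal(raw_json: str):
    """Parse, validate, retry once after repair; return a proposal dict,
    or None if it is still invalid (instead of exiting the whole loop)."""
    proposal = json.loads(raw_json)
    if not validate(proposal):
        return proposal
    repaired = repair(proposal)
    return repaired if not validate(repaired) else None

def should_stop(incumbent_req_s: float, reference_best_req_s: float) -> bool:
    """Deterministic stop: halt once the incumbent beats the reference best,
    rather than relying on prompt instructions to the proposing LLM."""
    return incumbent_req_s > reference_best_req_s
```

Under this sketch, the `proposal-0004` failure would be repaired and accepted rather than aborting the run, and `should_stop(0.3767, 0.2817)` would end the harness run at iter 2 instead of spending further trials.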