# qwen235b Decode Harness One-Shot Mechanism and Ablation, 2026-05-02

## Question

The harness run reached its best observed qwen235b decode-only config at iter 2: `TP4/DP2/EP8 -> TP2/DP4/EP8`. This document explains why that happened, what information the harness added to the LLM prompt, what the LLM did with that information, and what the non-harness ablation shows.

## Short Answer

The iter-2 result is not magic and should not be described as a global-optimum proof. It is a local topology sweet spot for this decode-only workload:

- Baseline `TP4/DP2/EP8` has only 2 data-parallel replicas and pays tensor-parallel communication on every decode step.
- `TP2/DP4/EP8` halves the tensor-parallel width and doubles the number of independent decode replicas while preserving the known-good EP8 MoE sharding.
- `TP1/DP8/EP8` goes too far: it was tested next and produced no feasible point.
- The same-topology variant `TP2/DP4/EP8 + max-num-seqs=160` was also tested later and produced no feasible point.

So the harness run's iter 2 is the best observed point because it hit the nearby topology balance point early, and the follow-up validation probes did not falsify it.

## Mechanism

The workload is `decode_only` with `TPOT <= 40ms` and no TTFT objective. In this regime, the critical cost is steady-state token generation rather than prompt prefill latency. For a large MoE decode stack:

- Higher TP can reduce the per-GPU model shard size, but it also adds tensor-parallel collectives to every decode step.
- Higher DP gives more independent serving replicas and absorbs bursty arrivals better, but each replica has less tensor parallelism.
- EP should not be changed without EP-specific evidence, because MoE expert sharding affects launch safety, memory layout, and expert dispatch.

The baseline shape `TP4/DP2/EP8` is therefore not obviously optimal for decode. The adjacent legal move `TP2/DP4/EP8` is the natural first test: it reduces repeated per-token TP communication and increases the replica count while keeping all 8 GPUs in use and EP8 fixed.
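The legality constraint behind these moves can be sketched in a few lines (the names `legal_topologies` and `adjacent_moves` are illustrative, not the harness API):

```python
# Hedged sketch: enumerate launch-plausible decode topologies on a fixed
# 8-GPU budget, keeping EP pinned at 8 per the harness rule. This is an
# illustration of the constraint, not the actual candidate generator.
NUM_GPUS = 8
EP_FIXED = 8

def legal_topologies(num_gpus=NUM_GPUS, ep=EP_FIXED):
    """Return (tp, dp, ep) shapes with tp * dp == num_gpus and EP held fixed."""
    out = []
    tp = 1
    while tp <= num_gpus:
        if num_gpus % tp == 0:
            out.append((tp, num_gpus // tp, ep))
        tp *= 2
    return out

def adjacent_moves(tp, dp, ep, num_gpus=NUM_GPUS):
    """One-step TP/DP trades from the incumbent shape, EP unchanged."""
    candidates = []
    for new_tp in (tp // 2, tp * 2):
        if 1 <= new_tp <= num_gpus and num_gpus % new_tp == 0:
            candidates.append((new_tp, num_gpus // new_tp, ep))
    return candidates
```

From the baseline `TP4/DP2/EP8`, the one-step moves under this constraint are `TP2/DP4/EP8` (the shape tested at iter 2) and `TP8/DP1/EP8`; `TP1/DP8/EP8` is one further step in the same direction.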
The measured data supports this:

| Run | Config | Best sampling_u | Request/s | Pass rate |
| --- | --- | ---: | ---: | ---: |
| harness trial-0001 | `TP4/DP2/EP8` | 0.0058594 | 0.1267 | 0.9868 |
| harness trial-0002 | `TP2/DP4/EP8` | 0.0170288 | 0.3767 | 0.9779 |
| harness trial-0003 | `TP1/DP8/EP8` | none | infeasible | none |
| harness trial-0005 | `TP2/DP4/EP8 + max-num-seqs=160` | none | infeasible | none |

## Harness Information Added

The harness prompt added structured context that the non-harness prompt did not have:

| Harness field | Concrete value in this run | How it affected the proposal |
| --- | --- | --- |
| workload mode | `request_mode=decode_only`; TTFT not an objective | Avoid prefill-first reasoning; optimize TPOT and decode throughput. |
| active bottleneck | `decode_tpot` | Make TP/DP redistribution and decode batching relevant, not TTFT knobs. |
| L-C-A profile | prompt p50 1491, p95 19670, p99 29961; prefix reuse about 0.41; burst ratio about 1.40 | Treat the workload as long-tail, moderately cache-reused, moderately bursty decode. |
| current best | baseline request/s/GPU 0.0158, pass rate 0.9868 | Require proposals to improve per-GPU throughput under the SLO. |
| legal topology candidates | TP/DP products constrained to 8 GPUs; candidates include `TP2/DP4/EP8` and `TP1/DP8/EP8` | Restrict the search to launch-plausible adjacent topologies. |
| knob harness rules | topology-first for `decode_tpot`; keep EP fixed without EP-specific evidence | Pick `TP2/DP4/EP8`, not EP changes or runtime-only knobs first. |
| tested signatures | only the baseline tested at iter 2 | Avoid repeating the baseline; choose the first adjacent topology. |

The relevant LLM response for harness `proposal-0002` followed this structure:

```json
{
  "diagnosis": "Follow the topology-first harness for decode_tpot.
Because the incumbent already satisfies the TPOT SLO, the next justified adjacent probe is to trade some tensor parallelism for more data-parallel replicas, while keeping expert parallel fixed to avoid introducing an EP-specific variable without evidence. The adjacent legal move from TP4/DP2 is TP2/DP4 with EP8 preserved.",
  "config_patch": {
    "flag_patch": {
      "tensor-parallel-size": 2,
      "data-parallel-size": 4,
      "expert-parallel-size": 8
    }
  }
}
```

The important behavior is not just "choose TP2/DP4"; it is "choose the adjacent topology, keep EP fixed, judge by request_rate_per_gpu and the TPOT SLO, then validate nearby alternatives."

## Non-Harness Ablation

The before-harness run is `dash0-qwen235b-decode-thinking-run5-tpot40-topology`. It is a useful ablation because it used the same trace, same model family, same baseline topology, same TPOT SLO, and same 12-trial budget, but did not include the structured harness context.

| Iter | Non-harness proposal | Result |
| ---: | --- | --- |
| 1 | baseline `TP4/DP2/EP8` | 0.1267 request/s |
| 2 | `TP2/DP4` | 0.2450 request/s |
| 3 | `TP1/DP8/EP8` | infeasible |
| 4 | `TP2/DP4/EP4` | launch fail |
| 5 | `gpu-memory-utilization=0.8`, `max-num-seqs=256` | infeasible |
| 6 | `max-num-seqs=128` | infeasible |
| 7 | `block-size=128` | infeasible |
| 8 | `max-num-batched-tokens=384` | infeasible |
| 9 | `TP2/DP4/EP8 + max-num-seqs=128 + max-num-batched-tokens=256` | 0.2817 request/s |
| 10 | trial 9 + `block-size=128` | infeasible |
| 11 | `TP1/DP8/EP8 + max-num-seqs=128 + max-num-batched-tokens=256` | infeasible |
| 12 | `TP2/DP4/EP8 + max-num-seqs=96 + max-num-batched-tokens=192` | infeasible |

The non-harness LLM also found `TP2/DP4` at iter 2, so we should not claim that the harness uniquely discovered the direction. The difference is that the non-harness prompt left the model to reason from raw history and raw topology candidates.
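The size of the gap between the two runs can be restated per GPU with a quick arithmetic check using only the measured request/s values above (variable names are illustrative; both shapes occupy all 8 GPUs):

```python
# Hedged arithmetic check from the measured tables in this document.
# Every config here uses 8 GPUs, so per-GPU throughput is request/s / 8.
NUM_GPUS = 8

baseline = 0.1267          # harness trial-0001, TP4/DP2/EP8
harness_best = 0.3767      # harness trial-0002, TP2/DP4/EP8
non_harness_best = 0.2817  # non-harness iter 9

baseline_per_gpu = baseline / NUM_GPUS          # about 0.0158, matching the
                                                # "current best" harness field
harness_gain = harness_best / baseline          # roughly 2.97x over baseline
ablation_gap = harness_best / non_harness_best  # roughly 1.34x over the
                                                # non-harness best observed
```

This is the per-GPU framing (`request_rate_per_gpu`) the harness asked the model to judge proposals by.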
After iter 2 it spent trials on EP changes, memory/concurrency changes, block size, and batch-token variants before finding its best observed point at iter 9.

## Full Harness Ablation

| Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Best observed |
| --- | ---: | ---: | --- | --- | --- | ---: |
| non-harness | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | 0.2817 at iter 9 |
| harness | 0.1267 | 0.3767 | infeasible | infeasible | infeasible | 0.3767 at iter 2 |

The harness accelerated convergence in two ways:

- It made the first post-baseline trial a structured adjacent-topology test with EP fixed.
- It converted later iterations into validation of local alternatives rather than broad, weakly justified search.

The measured validation points after iter 2:

| Trial | Config | Outcome |
| --- | --- | --- |
| trial-0003 | `TP1/DP8/EP8` | no feasible point |
| trial-0004 | intended `max-num-seqs=160`, but actually the base topology due to an old base-relative patch issue | no feasible point; not valid incumbent validation |
| trial-0005 | `TP2/DP4/EP8 + max-num-seqs=160` | no feasible point |

For `trial-0005`, every probe above the incumbent floor failed:

| Probe sampling_u | Request/s | Pass rate | Feasible |
| ---: | ---: | ---: | --- |
| 0.0710144 | 1.7800 | 0.2818 | no |
| 0.0440216 | 1.0900 | 0.1789 | no |
| 0.0305252 | 0.7050 | 0.3002 | no |
| 0.0237770 | 0.5417 | 0.4092 | no |
| 0.0204029 | 0.4533 | 0.4890 | no |
| 0.0187159 | 0.4117 | 0.5466 | no |

This is the evidence that iter 2 was not just a premature stop. The harness continued probing nearby alternatives, and those alternatives did not beat the incumbent.

## Implementation Update

Long decode-only validation exposed a cost issue: once a probe became SLO-unrecoverable, the worker still waited for in-flight long-output requests unless the study explicitly enabled engine restart after early stop.
The implementation now makes this the default for decode-only studies:

- `trace.request_mode=decode_only` with no explicit `restart_engine_after_early_stop` means `restart_engine_after_early_stop=true`.
- An explicit `restart_engine_after_early_stop=false` is still honored.
- Chat/prefill studies keep the old default of `false`.
- The LLM prompt now includes `early_stop_max_lag_s`, `early_stop_max_elapsed_s`, and `restart_engine_after_early_stop` in the trace block.
- The qwen235b decode example specs now explicitly set `restart_engine_after_early_stop=true`.

This change does not alter the SLO decision for a probe. It changes the cost model after an already-unrecoverable probe: cancel in-flight requests, restart the engine cleanly, and move to the next probe instead of waiting out long decode tails.

## Interpretation

The correct claim is: the harness did not prove global optimality in one iteration. It made the first post-baseline proposal land in the correct local topology neighborhood, and follow-up harness validation failed to find a better adjacent topology or a better same-topology runtime refinement. On this workload, that was enough for iter 2 to remain the best observed configuration.

The non-harness ablation shows that the model could guess the same topology direction, but without harness structure it spent the remaining budget exploring less controlled directions and only reached its best observed result at iter 9.
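As a closing detail, the default-resolution rule from the Implementation Update can be sketched as follows; `resolve_restart_default` is an illustrative name, not the actual implementation:

```python
# Hedged sketch of the restart-after-early-stop default described in the
# Implementation Update. The function name and signature are illustrative.
def resolve_restart_default(request_mode, explicit=None):
    """Return the effective restart_engine_after_early_stop value.

    - An explicit setting (True or False) is always honored.
    - decode_only studies otherwise default to True.
    - Chat/prefill studies keep the old default of False.
    """
    if explicit is not None:
        return explicit
    return request_mode == "decode_only"

# decode_only with no explicit setting -> True
# decode_only with an explicit False   -> False (still honored)
# a chat study with no explicit setting -> False (old default)
```

The explicit-override check comes first so that a study author's `restart_engine_after_early_stop=false` survives the new decode-only default.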