aituner/docs/qwen235b-thinking-decode/one-shot-mechanism-ablation-20260502.md

qwen235b Decode Harness One-Shot Mechanism and Ablation, 2026-05-02

Question

The harness run reached its best observed qwen235b decode-only config at iter 2:

TP4/DP2/EP8 -> TP2/DP4/EP8

This document explains why that happened, what information the harness added to the LLM prompt, what the LLM did with that information, and what the non-harness ablation shows.

Short Answer

The iter-2 result is not magic and should not be described as a global optimum proof. It is a local topology sweet spot for this decode-only workload:

  • Baseline TP4/DP2/EP8 has only 2 data-parallel replicas and pays tensor-parallel communication on every decode step.
  • TP2/DP4/EP8 halves tensor-parallel width and doubles independent decode replicas while preserving the known-good EP8 MoE sharding.
  • TP1/DP8/EP8 goes too far: it was tested next and produced no feasible point.
  • Same-topology TP2/DP4/EP8 + max-num-seqs=160 was also tested later and produced no feasible point.

So the harness run's iter 2 is the best observed configuration because it hit the nearby topology balance point early, and the follow-up validation probes did not falsify it.

Mechanism

The workload is decode_only with TPOT <= 40ms and no TTFT objective. In this regime, the critical cost is steady-state token generation rather than prompt prefill latency.

For a large MoE decode stack:

  • Higher TP can reduce per-GPU model shard size, but it also adds tensor-parallel collectives to every decode step.
  • Higher DP gives more independent serving replicas and absorbs bursty arrivals better, but each replica has less tensor parallelism.
  • EP should not be changed without EP-specific evidence, because MoE expert sharding affects launch safety, memory layout, and expert dispatch.

The baseline shape TP4/DP2/EP8 is therefore not obviously optimal for decode. The adjacent legal move TP2/DP4/EP8 is the natural first test: reduce repeated per-token TP communication and increase replica count while keeping all 8 GPUs and EP8 fixed.
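The "adjacent legal move" logic can be sketched in a few lines. This is an illustrative reconstruction, not the harness's actual code: it enumerates the launch-plausible (TP, DP) pairs whose product covers all 8 GPUs, holds EP fixed at 8, and takes the adjacent move from the baseline by halving TP and doubling DP.

```python
# Sketch: legal decode topologies on 8 GPUs with EP held at 8, and the
# adjacent move from a given shape. Function names are illustrative.
NUM_GPUS = 8
EP = 8  # keep MoE expert sharding fixed without EP-specific evidence

def legal_topologies(num_gpus: int = NUM_GPUS):
    # all (tp, dp) with tp * dp == num_gpus
    return [(tp, num_gpus // tp) for tp in (1, 2, 4, 8) if num_gpus % tp == 0]

def adjacent_move(tp: int, dp: int):
    # trade tensor parallelism for data-parallel replicas; None at the TP=1 edge
    return (tp // 2, dp * 2) if tp > 1 else None

# Baseline TP4/DP2 -> adjacent move TP2/DP4, EP8 preserved.
candidates = legal_topologies()          # [(1, 8), (2, 4), (4, 2), (8, 1)]
next_probe = adjacent_move(4, 2)         # (2, 4)
```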

The measured data supports this:

| Run | Config | Best sampling_u | Request/s | Pass rate |
|---|---|---|---|---|
| harness trial-0001 | TP4/DP2/EP8 | 0.0058594 | 0.1267 | 0.9868 |
| harness trial-0002 | TP2/DP4/EP8 | 0.0170288 | 0.3767 | 0.9779 |
| harness trial-0003 | TP1/DP8/EP8 | none | infeasible | none |
| harness trial-0005 | TP2/DP4/EP8 + max-num-seqs=160 | none | infeasible | none |
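The per-GPU numbers behind the comparison above are simple arithmetic; this check confirms that the harness's reported baseline figure of 0.0158 request/s/GPU is just the baseline throughput divided across 8 GPUs, and that the iter-2 config is roughly a 3x throughput improvement.

```python
# Arithmetic behind the throughput comparison in the table above.
baseline_rps = 0.1267   # TP4/DP2/EP8 at its best feasible sampling_u
best_rps = 0.3767       # TP2/DP4/EP8 at its best feasible sampling_u
gpus = 8

baseline_per_gpu = baseline_rps / gpus   # ~0.0158, matches the harness field
best_per_gpu = best_rps / gpus           # ~0.0471
speedup = best_rps / baseline_rps        # ~2.97x
```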

Harness Information Added

The harness prompt added structured context that the non-harness prompt did not have:

| Harness field | Concrete value in this run | How it affected the proposal |
|---|---|---|
| workload mode | request_mode=decode_only; TTFT not an objective | Avoid prefill-first reasoning; optimize TPOT and decode throughput. |
| active bottleneck | decode_tpot | Make TP/DP redistribution and decode batching relevant, not TTFT knobs. |
| L-C-A profile | prompt p50 1491, p95 19670, p99 29961; prefix reuse about 0.41; burst ratio about 1.40 | Treat the workload as long-tail, moderately cache-reused, moderately bursty decode. |
| current best | baseline request/s/GPU 0.0158, pass rate 0.9868 | Require proposals to improve per-GPU throughput under SLO. |
| legal topology candidates | TP/DP products constrained to 8 GPUs; candidates include TP2/DP4/EP8 and TP1/DP8/EP8 | Restrict search to launch-plausible adjacent topologies. |
| knob harness rules | topology-first for decode_tpot; keep EP fixed without EP-specific evidence | Pick TP2/DP4/EP8, not EP changes or runtime-only knobs first. |
| tested signatures | only baseline tested at iter 2 | Avoid repeating baseline; choose the first adjacent topology. |
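As a concrete illustration, the structured context could be serialized into the prompt as a dict like the following. The field names and exact schema here are hypothetical; only the values are taken from the run above.

```python
# Hypothetical shape of the structured harness context added to the LLM
# prompt, populated with this run's values. Field names are illustrative;
# the source does not specify the exact schema.
harness_context = {
    "request_mode": "decode_only",          # TTFT not an objective
    "active_bottleneck": "decode_tpot",
    "lca_profile": {
        "prompt_p50": 1491, "prompt_p95": 19670, "prompt_p99": 29961,
        "prefix_reuse": 0.41, "burst_ratio": 1.40,
    },
    "current_best": {"request_s_per_gpu": 0.0158, "pass_rate": 0.9868},
    "legal_topologies": ["TP2/DP4/EP8", "TP1/DP8/EP8"],
    "knob_rules": [
        "topology-first for decode_tpot",
        "keep EP fixed without EP-specific evidence",
    ],
    "tested_signatures": ["TP4/DP2/EP8"],   # only baseline at iter 2
}
```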

The relevant LLM response for harness proposal-0002 followed this structure:

```json
{
  "diagnosis": "Follow the topology-first harness for decode_tpot. Because the incumbent already satisfies the TPOT SLO, the next justified adjacent probe is to trade some tensor parallelism for more data-parallel replicas, while keeping expert parallel fixed to avoid introducing an EP-specific variable without evidence. The adjacent legal move from TP4/DP2 is TP2/DP4 with EP8 preserved.",
  "config_patch": {
    "flag_patch": {
      "tensor-parallel-size": 2,
      "data-parallel-size": 4,
      "expert-parallel-size": 8
    }
  }
}
```

The important behavior is not just "choose TP2/DP4"; it is "choose the adjacent topology, keep EP fixed, judge by request_rate_per_gpu and TPOT SLO, then validate nearby alternatives."
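The validate-then-keep behavior can be sketched as a small loop: after the incumbent lands, probe nearby alternatives, and retain the incumbent unless some probe is both feasible and strictly better. The function name and signature here are illustrative, not harness API.

```python
# Sketch of incumbent validation, assuming "no feasible point" is encoded
# as None for a probe's request/s.
def validate_incumbent(incumbent_rps, probes):
    """probes: list of (config, request_s or None); None = no feasible point."""
    best_cfg, best_rps = None, incumbent_rps
    for cfg, rps in probes:
        if rps is not None and rps > best_rps:
            best_cfg, best_rps = cfg, rps
    return best_cfg  # None means the incumbent survives validation

# This run's post-iter-2 probes both produced no feasible point,
# so TP2/DP4/EP8 at 0.3767 request/s remains the incumbent.
survivor = validate_incumbent(0.3767, [
    ("TP1/DP8/EP8", None),
    ("TP2/DP4/EP8 + max-num-seqs=160", None),
])
```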

Non-Harness Ablation

The before-harness run is dash0-qwen235b-decode-thinking-run5-tpot40-topology.

It is a useful ablation because it used the same trace, same model family, same baseline topology, same TPOT SLO, and same 12-trial budget, but did not include the structured harness context.

| Iter | Non-harness proposal | Result |
|---|---|---|
| 1 | baseline TP4/DP2/EP8 | 0.1267 request/s |
| 2 | TP2/DP4 | 0.2450 request/s |
| 3 | TP1/DP8/EP8 | infeasible |
| 4 | TP2/DP4/EP4 | launch fail |
| 5 | gpu-memory-utilization=0.8, max-num-seqs=256 | infeasible |
| 6 | max-num-seqs=128 | infeasible |
| 7 | block-size=128 | infeasible |
| 8 | max-num-batched-tokens=384 | infeasible |
| 9 | TP2/DP4/EP8 + max-num-seqs=128 + max-num-batched-tokens=256 | 0.2817 request/s |
| 10 | trial 9 + block-size=128 | infeasible |
| 11 | TP1/DP8/EP8 + max-num-seqs=128 + max-num-batched-tokens=256 | infeasible |
| 12 | TP2/DP4/EP8 + max-num-seqs=96 + max-num-batched-tokens=192 | infeasible |

The non-harness LLM also found TP2/DP4 at iter 2, so we should not claim that the harness uniquely discovered the direction. The difference is that the non-harness prompt left the model to reason from raw history and raw topology candidates. After iter 2 it spent trials on EP changes, memory/concurrency changes, block size, and batch-token variants before finding its best observed point at iter 9.

Full Harness Ablation

| Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Best observed |
|---|---|---|---|---|---|---|
| non-harness | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | 0.2817 at iter 9 |
| harness | 0.1267 | 0.3767 | infeasible | infeasible | infeasible | 0.3767 at iter 2 |

The harness accelerated convergence in two ways:

  • It made the first post-baseline trial a structured adjacent topology test with EP fixed.
  • It converted later iterations into validation of local alternatives rather than broad, weakly justified search.

The measured validation points after iter 2:

| Trial | Config | Outcome |
|---|---|---|
| trial-0003 | TP1/DP8/EP8 | no feasible point |
| trial-0004 | intended max-num-seqs=160, but actually ran the base topology due to an old base-relative patch issue | no feasible point; not valid incumbent validation |
| trial-0005 | TP2/DP4/EP8 + max-num-seqs=160 | no feasible point |

For trial-0005, every probe above the incumbent floor failed:

| Probe sampling_u | Request/s | Pass rate | Feasible |
|---|---|---|---|
| 0.0710144 | 1.7800 | 0.2818 | no |
| 0.0440216 | 1.0900 | 0.1789 | no |
| 0.0305252 | 0.7050 | 0.3002 | no |
| 0.0237770 | 0.5417 | 0.4092 | no |
| 0.0204029 | 0.4533 | 0.4890 | no |
| 0.0187159 | 0.4117 | 0.5466 | no |
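The probe ladder above reduces to one check: every sampled sampling_u produced a pass rate far below any plausible feasibility gate, so the candidate has no feasible point. The gate value in this sketch is an assumption; the source reports only pass rates and the infeasible verdict.

```python
# Probe ladder for trial-0005 in code form. Values copied from the table;
# the pass-rate gate is an assumed threshold for illustration.
probes = [  # (sampling_u, pass_rate)
    (0.0710144, 0.2818), (0.0440216, 0.1789), (0.0305252, 0.3002),
    (0.0237770, 0.4092), (0.0204029, 0.4890), (0.0187159, 0.5466),
]
PASS_GATE = 0.95  # assumed feasibility threshold for this sketch

feasible = [u for u, pr in probes if pr >= PASS_GATE]
# feasible == [] -> trial-0005 yields no feasible point above the incumbent floor
```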

This is the evidence that iter 2 was not just a premature stop. The harness continued probing nearby alternatives, and those alternatives did not beat the incumbent.

Implementation Update

Long decode-only validation exposed a cost issue: once a probe became SLO-unrecoverable, the worker still waited for in-flight long-output requests unless the study explicitly enabled engine restart after early stop.

The implementation now makes this the default for decode-only studies:

  • trace.request_mode=decode_only and no explicit restart_engine_after_early_stop means restart_engine_after_early_stop=true.
  • An explicit restart_engine_after_early_stop=false is still honored.
  • Chat/prefill studies keep the old default false.
  • The LLM prompt now includes early_stop_max_lag_s, early_stop_max_elapsed_s, and restart_engine_after_early_stop in the trace block.
  • qwen235b decode example specs now explicitly set restart_engine_after_early_stop=true.

This change does not alter the SLO decision for a probe. It changes the cost model after an already-unrecoverable probe: cancel in-flight requests, restart the engine cleanly, and move to the next probe instead of waiting for long decode tails.
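The default-resolution rule from the bullets above fits in one function. The function and parameter names here are illustrative, not the implementation's actual API; only the rule itself is from the source.

```python
# Sketch of the new default: decode-only studies restart the engine after an
# SLO-unrecoverable early stop unless the study spec says otherwise.
from typing import Optional

def resolve_restart_default(request_mode: str,
                            explicit: Optional[bool] = None) -> bool:
    if explicit is not None:
        return explicit                     # explicit spec value always wins
    return request_mode == "decode_only"    # new default; chat/prefill stay False
```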

Interpretation

The correct claim is:

The harness did not prove global optimality in one iteration. It made the first post-baseline proposal land on the correct local topology neighborhood, and follow-up harness validation found neither a better adjacent topology nor a better same-topology runtime refinement. On this workload, that was enough for iter 2 to remain the best observed configuration.

The non-harness ablation shows that the model could guess the same topology direction, but without harness structure it spent the remaining budget exploring less controlled directions and only reached its best observed result at iter 9.