aituner/docs/qwen235b-thinking-decode/one-shot-mechanism-ablation-20260502.md

qwen235b Decode Harness One-Shot Mechanism and Ablation, 2026-05-02

Question

The harness run reached its best observed qwen235b decode-only config at iter 2:

TP4/DP2/EP8 -> TP2/DP4/EP8

This document explains why that happened, what information the harness added to the LLM prompt, what the LLM did with that information, and what the non-harness ablation shows.

Short Answer

The iter-2 result is not magic and should not be described as a global optimum proof. It is a local topology sweet spot for this decode-only workload:

  • Baseline TP4/DP2/EP8 has only 2 data-parallel replicas and pays tensor-parallel communication on every decode step.
  • TP2/DP4/EP8 halves tensor-parallel width and doubles independent decode replicas while preserving the known-good EP8 MoE sharding.
  • TP1/DP8/EP8 goes too far: it was tested next and produced no feasible point.
  • Same-topology TP2/DP4/EP8 + max-num-seqs=160 was also tested later and produced no feasible point.

So the harness run's iter 2 is the best observed configuration because it hit the nearby topology balance point early, and the follow-up validation probes did not falsify it.

Mechanism

The workload is decode_only with TPOT <= 40ms and no TTFT objective. In this regime, the critical cost is steady-state token generation rather than prompt prefill latency.

For a large MoE decode stack:

  • Higher TP can reduce per-GPU model shard size, but it also adds tensor-parallel collectives to every decode step.
  • Higher DP gives more independent serving replicas and absorbs bursty arrivals better, but each replica has less tensor parallelism.
  • EP should not be changed without EP-specific evidence, because MoE expert sharding affects launch safety, memory layout, and expert dispatch.

The baseline shape TP4/DP2/EP8 is therefore not obviously optimal for decode. The adjacent legal move TP2/DP4/EP8 is the natural first test: reduce repeated per-token TP communication and increase replica count while keeping all 8 GPUs and EP8 fixed.
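The "adjacent legal move" logic can be sketched in a few lines. This is an illustrative reconstruction, not the harness's actual code: it enumerates the launch-plausible (TP, DP) pairs whose product covers all 8 GPUs, holds EP fixed at 8, and takes the adjacent move from the baseline by halving TP and doubling DP.

```python
# Sketch: legal decode topologies on 8 GPUs with EP held at 8, and the
# adjacent move from a given shape. Function names are illustrative.
NUM_GPUS = 8
EP = 8  # keep MoE expert sharding fixed without EP-specific evidence

def legal_topologies(num_gpus: int = NUM_GPUS):
    # all (tp, dp) with tp * dp == num_gpus
    return [(tp, num_gpus // tp) for tp in (1, 2, 4, 8) if num_gpus % tp == 0]

def adjacent_move(tp: int, dp: int):
    # trade tensor parallelism for data-parallel replicas; None at the TP=1 edge
    return (tp // 2, dp * 2) if tp > 1 else None

# Baseline TP4/DP2 -> adjacent move TP2/DP4, EP8 preserved.
candidates = legal_topologies()          # [(1, 8), (2, 4), (4, 2), (8, 1)]
next_probe = adjacent_move(4, 2)         # (2, 4)
```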

The measured data supports this:

| Run | Config | Best sampling_u | Request/s | Pass rate |
|---|---|---|---|---|
| harness trial-0001 | TP4/DP2/EP8 | 0.0058594 | 0.1267 | 0.9868 |
| harness trial-0002 | TP2/DP4/EP8 | 0.0170288 | 0.3767 | 0.9779 |
| harness trial-0003 | TP1/DP8/EP8 | none | infeasible | none |
| harness trial-0005 | TP2/DP4/EP8 + max-num-seqs=160 | none | infeasible | none |
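The per-GPU numbers behind the comparison above are simple arithmetic; this check confirms that the harness's reported baseline figure of 0.0158 request/s/GPU is just the baseline throughput divided across 8 GPUs, and that the iter-2 config is roughly a 3x throughput improvement.

```python
# Arithmetic behind the throughput comparison in the table above.
baseline_rps = 0.1267   # TP4/DP2/EP8 at its best feasible sampling_u
best_rps = 0.3767       # TP2/DP4/EP8 at its best feasible sampling_u
gpus = 8

baseline_per_gpu = baseline_rps / gpus   # ~0.0158, matches the harness field
best_per_gpu = best_rps / gpus           # ~0.0471
speedup = best_rps / baseline_rps        # ~2.97x
```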

Harness Information Added

The harness prompt added structured context that the non-harness prompt did not have:

| Harness field | Concrete value in this run | How it affected the proposal |
|---|---|---|
| workload mode | request_mode=decode_only; TTFT not an objective | Avoid prefill-first reasoning; optimize TPOT and decode throughput. |
| active bottleneck | decode_tpot | Make TP/DP redistribution and decode batching relevant, not TTFT knobs. |
| L-C-A profile | prompt p50 1491, p95 19670, p99 29961; prefix reuse about 0.41; burst ratio about 1.40 | Treat the workload as long-tail, moderately cache-reused, moderately bursty decode. |
| current best | baseline request/s/GPU 0.0158, pass rate 0.9868 | Require proposals to improve per-GPU throughput under SLO. |
| legal topology candidates | TP/DP products constrained to 8 GPUs; candidates include TP2/DP4/EP8 and TP1/DP8/EP8 | Restrict search to launch-plausible adjacent topologies. |
| knob harness rules | topology-first for decode_tpot; keep EP fixed without EP-specific evidence | Pick TP2/DP4/EP8, not EP changes or runtime-only knobs first. |
| tested signatures | only baseline tested at iter 2 | Avoid repeating baseline; choose the first adjacent topology. |
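As a concrete illustration, the structured context could be serialized into the prompt as a dict like the following. The field names and exact schema here are hypothetical; only the values are taken from the run above.

```python
# Hypothetical shape of the structured harness context added to the LLM
# prompt, populated with this run's values. Field names are illustrative;
# the source does not specify the exact schema.
harness_context = {
    "request_mode": "decode_only",          # TTFT not an objective
    "active_bottleneck": "decode_tpot",
    "lca_profile": {
        "prompt_p50": 1491, "prompt_p95": 19670, "prompt_p99": 29961,
        "prefix_reuse": 0.41, "burst_ratio": 1.40,
    },
    "current_best": {"request_s_per_gpu": 0.0158, "pass_rate": 0.9868},
    "legal_topologies": ["TP2/DP4/EP8", "TP1/DP8/EP8"],
    "knob_rules": [
        "topology-first for decode_tpot",
        "keep EP fixed without EP-specific evidence",
    ],
    "tested_signatures": ["TP4/DP2/EP8"],   # only baseline at iter 2
}
```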

The relevant LLM response for harness proposal-0002 followed this structure:

```json
{
  "diagnosis": "Follow the topology-first harness for decode_tpot. Because the incumbent already satisfies the TPOT SLO, the next justified adjacent probe is to trade some tensor parallelism for more data-parallel replicas, while keeping expert parallel fixed to avoid introducing an EP-specific variable without evidence. The adjacent legal move from TP4/DP2 is TP2/DP4 with EP8 preserved.",
  "config_patch": {
    "flag_patch": {
      "tensor-parallel-size": 2,
      "data-parallel-size": 4,
      "expert-parallel-size": 8
    }
  }
}
```

The important behavior is not just "choose TP2/DP4"; it is "choose the adjacent topology, keep EP fixed, judge by request_rate_per_gpu and TPOT SLO, then validate nearby alternatives."
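The validate-then-keep behavior can be sketched as a small loop: after the incumbent lands, probe nearby alternatives, and retain the incumbent unless some probe is both feasible and strictly better. The function name and signature here are illustrative, not harness API.

```python
# Sketch of incumbent validation, assuming "no feasible point" is encoded
# as None for a probe's request/s.
def validate_incumbent(incumbent_rps, probes):
    """probes: list of (config, request_s or None); None = no feasible point."""
    best_cfg, best_rps = None, incumbent_rps
    for cfg, rps in probes:
        if rps is not None and rps > best_rps:
            best_cfg, best_rps = cfg, rps
    return best_cfg  # None means the incumbent survives validation

# This run's post-iter-2 probes both produced no feasible point,
# so TP2/DP4/EP8 at 0.3767 request/s remains the incumbent.
survivor = validate_incumbent(0.3767, [
    ("TP1/DP8/EP8", None),
    ("TP2/DP4/EP8 + max-num-seqs=160", None),
])
```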

Non-Harness Ablation

The before-harness run is dash0-qwen235b-decode-thinking-run5-tpot40-topology.

It is a useful ablation because it used the same trace, same model family, same baseline topology, same TPOT SLO, and same 12-trial budget, but did not include the structured harness context.

| Iter | Non-harness proposal | Result |
|---|---|---|
| 1 | baseline TP4/DP2/EP8 | 0.1267 request/s |
| 2 | TP2/DP4 | 0.2450 request/s |
| 3 | TP1/DP8/EP8 | infeasible |
| 4 | TP2/DP4/EP4 | launch fail |
| 5 | gpu-memory-utilization=0.8, max-num-seqs=256 | infeasible |
| 6 | max-num-seqs=128 | infeasible |
| 7 | block-size=128 | infeasible |
| 8 | max-num-batched-tokens=384 | infeasible |
| 9 | TP2/DP4/EP8 + max-num-seqs=128 + max-num-batched-tokens=256 | 0.2817 request/s |
| 10 | trial 9 + block-size=128 | infeasible |
| 11 | TP1/DP8/EP8 + max-num-seqs=128 + max-num-batched-tokens=256 | infeasible |
| 12 | TP2/DP4/EP8 + max-num-seqs=96 + max-num-batched-tokens=192 | infeasible |

The non-harness LLM also found TP2/DP4 at iter 2, so we should not claim that the harness uniquely discovered the direction. The difference is that the non-harness prompt left the model to reason from raw history and raw topology candidates. After iter 2 it spent trials on EP changes, memory/concurrency changes, block size, and batch-token variants before finding its best observed point at iter 9.

Full Harness Ablation

| Run | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Best observed |
|---|---|---|---|---|---|---|
| non-harness | 0.1267 | 0.2450 | infeasible | launch fail | infeasible | 0.2817 at iter 9 |
| harness | 0.1267 | 0.3767 | infeasible | infeasible | infeasible | 0.3767 at iter 2 |

The harness accelerated convergence in two ways:

  • It made the first post-baseline trial a structured adjacent topology test with EP fixed.
  • It converted later iterations into validation of local alternatives rather than broad, weakly justified search.

The measured validation points after iter 2:

| Trial | Config | Outcome |
|---|---|---|
| trial-0003 | TP1/DP8/EP8 | no feasible point |
| trial-0004 | intended max-num-seqs=160, but actually ran the base topology due to an old base-relative patch issue | no feasible point; not valid incumbent validation |
| trial-0005 | TP2/DP4/EP8 + max-num-seqs=160 | no feasible point |

For trial-0005, every probe above the incumbent floor failed:

| Probe sampling_u | Request/s | Pass rate | Feasible |
|---|---|---|---|
| 0.0710144 | 1.7800 | 0.2818 | no |
| 0.0440216 | 1.0900 | 0.1789 | no |
| 0.0305252 | 0.7050 | 0.3002 | no |
| 0.0237770 | 0.5417 | 0.4092 | no |
| 0.0204029 | 0.4533 | 0.4890 | no |
| 0.0187159 | 0.4117 | 0.5466 | no |
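The probe ladder above reduces to one check: every sampled sampling_u produced a pass rate far below any plausible feasibility gate, so the candidate has no feasible point. The gate value in this sketch is an assumption; the source reports only pass rates and the infeasible verdict.

```python
# Probe ladder for trial-0005 in code form. Values copied from the table;
# the pass-rate gate is an assumed threshold for illustration.
probes = [  # (sampling_u, pass_rate)
    (0.0710144, 0.2818), (0.0440216, 0.1789), (0.0305252, 0.3002),
    (0.0237770, 0.4092), (0.0204029, 0.4890), (0.0187159, 0.5466),
]
PASS_GATE = 0.95  # assumed feasibility threshold for this sketch

feasible = [u for u, pr in probes if pr >= PASS_GATE]
# feasible == [] -> trial-0005 yields no feasible point above the incumbent floor
```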

This is the evidence that iter 2 was not just a premature stop. The harness continued probing nearby alternatives, and those alternatives did not beat the incumbent.

Implementation Update

Long decode-only validation exposed a cost issue: once a probe became SLO-unrecoverable, the worker still waited for in-flight long-output requests unless the study explicitly enabled engine restart after early stop.

The implementation now makes this the default for decode-only studies:

  • trace.request_mode=decode_only and no explicit restart_engine_after_early_stop means restart_engine_after_early_stop=true.
  • An explicit restart_engine_after_early_stop=false is still honored.
  • Chat/prefill studies keep the old default false.
  • The LLM prompt now includes early_stop_max_lag_s, early_stop_max_elapsed_s, and restart_engine_after_early_stop in the trace block.
  • qwen235b decode example specs now explicitly set restart_engine_after_early_stop=true.

This change does not alter the SLO decision for a probe. It changes the cost model after an already-unrecoverable probe: cancel in-flight requests, restart the engine cleanly, and move to the next probe instead of waiting for long decode tails.
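The default-resolution rule from the bullets above fits in one function. The function and parameter names here are illustrative, not the implementation's actual API; only the rule itself is from the source.

```python
# Sketch of the new default: decode-only studies restart the engine after an
# SLO-unrecoverable early stop unless the study spec says otherwise.
from typing import Optional

def resolve_restart_default(request_mode: str,
                            explicit: Optional[bool] = None) -> bool:
    if explicit is not None:
        return explicit                     # explicit spec value always wins
    return request_mode == "decode_only"    # new default; chat/prefill stay False
```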

Interpretation

The correct claim is:

The harness did not prove global optimality in one iteration. It made the first post-baseline proposal land on the correct local topology neighborhood, and follow-up harness validation found neither a better adjacent topology nor a better same-topology runtime refinement. On this workload, that was enough for iter 2 to remain the best observed configuration.

The non-harness ablation shows that the model could guess the same topology direction, but without harness structure it spent the remaining budget exploring less controlled directions and only reached its best observed result at iter 9.