# Qwen3-30B-A3B Community vLLM Harness Ablation, 2026-05-02

## Goal

Run a fresh dash0 experiment on the latest community vLLM release with the local community model:

`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`

The comparison is:

| Variant | Spec | Harness |
|---|---|---|
| no-harness | `configs/examples/dash0_qwen30b_a3b_community_vllm020_noharness.json` | disabled via `llm.use_harness=false` |
| harness | `configs/examples/dash0_qwen30b_a3b_community_vllm020_harness.json` | enabled, including deterministic stop proposal |

Both specs start from the same base vLLM configuration. The base contains only serving access fields: host, port, and served-model-name. It does not set performance flags such as TP, DP, EP, max model length, prefix cache, chunked prefill, max-num-seqs, max-num-batched-tokens, or gpu-memory-utilization. The first trial therefore measures community vLLM defaults for this model.

The launch environment sets HOME=/tmp/wjh and XDG_CACHE_HOME=/tmp/wjh/.cache so vLLM, torch.compile, and FlashInfer build caches land on dash0 local storage instead of CPFS. This is a startup/cache placement choice, not a vLLM performance flag.
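
As a sketch of what the launch reduces to under these constraints (the flag names are standard vLLM CLI flags; the host, port, and served-model-name values are illustrative placeholders, not the spec's actual values):

```python
import os
import subprocess

# Cache placement from the launch environment: keep vLLM, torch.compile, and
# FlashInfer caches on dash0 local storage instead of CPFS.
env = dict(os.environ, HOME="/tmp/wjh", XDG_CACHE_HOME="/tmp/wjh/.cache")

# The base spec carries only serving access fields, so the first trial
# measures community vLLM defaults for this model.
cmd = [
    "vllm", "serve", "/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B",
    "--host", "0.0.0.0",                      # illustrative value
    "--port", "8000",                         # illustrative value
    "--served-model-name", "qwen3-30b-a3b",   # illustrative value
    # Deliberately absent: --tensor-parallel-size, --data-parallel-size,
    # --max-model-len, --max-num-seqs, --max-num-batched-tokens,
    # --gpu-memory-utilization, and prefix-cache / chunked-prefill flags.
]
subprocess.Popen(cmd, env=env)
```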

## vLLM Install

PyPI reports vllm==0.20.0 as the current community release (checked 2026-05-02). The dash0 runtime venv lives on the local rootfs rather than CPFS, because installing torch/CUDA wheels into CPFS was I/O-bound:

`/tmp/wjh/venvs/vllm-0.20.0-cu129`

The first plain `pip install vllm==0.20.0` smoke test pulled torch 2.11.0+cu130 and failed on dash0's driver (570.133.20, CUDA 12.9). The active install uses the vLLM 0.20.0 GitHub release +cu129 wheel and the PyTorch CUDA 12.9 index, matching the vLLM documented CUDA 12.9 install path for this driver.
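
A quick way to catch this class of wheel/driver mismatch before a full serve attempt, as a hedged sketch:

```python
import torch

# A +cu130 torch build cannot initialize on dash0's 570.133.20 / CUDA 12.9
# driver stack; a matching +cu129 build should pass this check.
print(torch.__version__, torch.version.cuda)   # expect a cu129 build here
assert torch.cuda.is_available(), "torch CUDA build does not match the driver"
print(torch.cuda.get_device_name(0))
```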

Install log:

`/home/admin/cpfs/wjh/aituner/aituner/logs/install_vllm_0.20.0_20260502.log`

## Workload

The experiment reuses the 0-8k chat window from the earlier qwen27b harness work:

| Field | Value |
|---|---|
| window | `chat_w20260311_1000` |
| source rows | 32606 |
| input filter | 0 to 8192 tokens |
| completion tokens | fixed 128 via `trace.completion_tokens_override` |
| max requests per probe | 512 |
| replay time scale | 0.1 |
| target pass rate | 0.95 |
| TTFT SLO | 2 s up to 4k, 4 s up to 32k, 6 s above |
| TPOT SLO | 50 ms |
| search high | 0.125 `sampling_u` |
| max probes per trial | 4 |

The `max_requests_per_probe=512` cap keeps the fresh community-vLLM ablation practical while preserving a real trace-shaped replay, SLO scoring, and a binary-search threshold probe. A trace-only count check gives 31 to 65 selected requests across the six binary-search thresholds, avoiding the invalid low-cap case where early thresholds can select zero requests.
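
For intuition, here is the midpoint ladder such a search walks, as a sketch that assumes plain bisection and all-feasible probes (the actual aituner probe scheduler is not reproduced here):

```python
# Bisection over sampling_u between low and search.high; a feasible probe
# moves the search up, an infeasible one moves it down.
def probe_thresholds(low=0.0, high=0.125, max_probes=4, feasible=lambda u: True):
    visited = []
    for _ in range(max_probes):
        mid = (low + high) / 2
        visited.append(mid)
        if feasible(mid):
            low = mid       # feasible: search the upper half
        else:
            high = mid      # infeasible: search the lower half
    return visited

print(probe_thresholds())
# [0.0625, 0.09375, 0.109375, 0.1171875]
# The all-feasible ladder ends at 0.1171875 = 0.125 - 0.125/16, exactly one
# binary-search resolution below search.high (see Results below).
```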

The first full-output attempt showed why a bounded replay is needed for a 12-iteration ablation: at the first threshold (0.0625), the 31 selected requests contained 14,849 output tokens with out_max=2981, which made a single probe too slow for a full no-harness/harness pair to finish. The first out128 attempt with `replay_time_scale=1.0` was still bounded by real window time, so each probe waited close to the original window duration.

The active ablation therefore fixes output length at 128 tokens, uses `replay_time_scale=0.1`, and limits each trial to four binary-search probes. `load_trace_requests` scales both request arrivals and the window duration, so reported request rates are the actual compressed replay request rates. This changes the load/decode mix, so the result should be interpreted as a community-vLLM harness convergence test under a bounded, time-compressed chat replay, not as a full-output production benchmark.
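
A minimal sketch of the bounded replay shaping described above, with illustrative request fields rather than aituner's actual `load_trace_requests` signature:

```python
# Illustrative names only; the real request schema is not reproduced here.
def shape_replay(requests, time_scale=0.1, out_tokens=128, cap=512):
    """Apply the bounded-replay knobs from the workload table."""
    shaped = []
    for req in requests[:cap]:                           # max_requests_per_probe=512
        shaped.append({
            "arrival_s": req["arrival_s"] * time_scale,  # compress arrivals
            "prompt": req["prompt"],
            "max_tokens": out_tokens,                    # completion_tokens_override=128
        })
    # The window duration is scaled by the same factor, so request rate is
    # measured against the compressed window, not the original one.
    return shaped
```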

## Harness Update Under Test

This run tests a stricter early-stop harness:

- The harness still injects L-C-A workload features, recent trial diagnostics, the active bottleneck, legal topology candidates, tested signatures, and knob-family rules.
- A strong incumbent no longer means immediate stop. It means "validate nearby alternatives".
- Deterministic stop is allowed only after completed validation evidence says continuing is unlikely to be useful:
  - the incumbent beats baseline by a generic large-gain ratio,
  - at least two post-incumbent validation trials have run,
  - those validation trials did not produce a feasible per-GPU improvement, and
  - the validation covered both the topology and runtime families, or accumulated at least three post-incumbent validation attempts.
- If the stop guard fires, `study tune` writes `harness-stop-XXXX` and exits without spending another GPU trial or asking the LLM for another proposal.
- A single-family all-infeasible plateau is not enough to stop deterministically. It only blocks repeating that family; the LLM must either justify a different family or later satisfy the validation/convergence stop rule.
- A search-high saturation guard stops immediately when the incumbent's highest measured probe is feasible and within the configured binary-search resolution of `search.high`. A feasible probe may still contain individual SLO failures as long as it meets the configured pass-rate target. In that case the current study cannot measure a better config without increasing the workload search range, so further config proposals only waste tuning iterations.

This is a generic harness rule, not a testcase-specific threshold. It does not depend on qwen27b, qwen235b, qwen30b, a fixed TP/DP value, or a hardcoded SLO number.
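
A minimal sketch of the validation-exhaustion stop guard, assuming illustrative field names (`rpg` for request_rate_per_gpu, `family`, `feasible`, `completed`) and an illustrative large-gain ratio; the real guard lives in the aituner harness:

```python
def harness_should_stop(incumbent_rpg, baseline_rpg, validations,
                        large_gain_ratio=1.5):  # ratio value is illustrative
    """Deterministic stop only after post-incumbent validation evidence."""
    if incumbent_rpg < large_gain_ratio * baseline_rpg:
        return False                    # no strong incumbent yet
    done = [v for v in validations if v["completed"]]
    if len(done) < 2:
        return False                    # must validate nearby alternatives first
    if any(v["feasible"] and v["rpg"] > incumbent_rpg for v in done):
        return False                    # a validation found a feasible improvement
    covered = {"topology", "runtime"} <= {v["family"] for v in done}
    return covered or len(done) >= 3    # family coverage, or three exhausted attempts
```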

## Unit Tests

Local test command:

```
PYTHONPATH=src python3 -m unittest tests.test_core_flow -q
```

Result at the time of this note: passed. The current repository test count may be higher; use the command above as the source of truth.

The added coverage checks:

| Test | Purpose |
|---|---|
| `test_harness_does_not_stop_immediately_after_strong_incumbent` | a strong incumbent requires validation first |
| `test_harness_stop_after_post_incumbent_validation_is_exhausted` | deterministic stop after validation exhaustion |
| `test_cli_tune_uses_harness_stop_before_llm` | `study tune` can stop without calling the LLM or launching another GPU trial |
| `test_prompt_can_disable_harness_for_ablation` | the no-harness prompt removes structured harness context |
| `test_harness_stop_when_incumbent_saturates_search_high` | deterministic stop when the incumbent saturates the configured workload search high |
| `test_harness_guided_first_tp_probe_for_latency_bottleneck` | deterministic first TP probe after baseline latency-bottleneck evidence |
| `test_harness_guided_runtime_seed_preserves_tp_incumbent` | deterministic same-topology runtime refinement after a TP incumbent |

## Experiment Tracking

Completed dash0 runs:

| Variant | tmux session | Log | Study root |
|---|---|---|---|
| no-harness | `qwen30b_vllm020_noharness_out128_scale01_20260502` | `logs/qwen30b_vllm020_noharness_out128_scale01_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-noharness` |
| harness | `qwen30b_vllm020_harness_highstop_gpu4_7_20260502` | `logs/qwen30b_vllm020_harness_highstop_gpu4_7_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-harness-highstop-gpu4-7` |

The harness run should be judged by best-so-far request_rate_per_gpu per tuning iteration, plus whether it stops only after validation evidence. The no-harness run should use the same trial budget so the ablation exposes whether the early-stop harness saves iterations without hiding a later better point.
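
The best-so-far series in the Results tables can be derived from per-iteration outcomes with a fold like this (a sketch; aituner's actual reporting code is not reproduced here):

```python
# 'fail' and 'stop' iterations carry the previous best forward, which is how
# the two variants stay comparable on the same 12-iteration axis.
def best_so_far(per_iteration_outcomes):
    best, series = None, []
    for outcome in per_iteration_outcomes:
        if isinstance(outcome, float):
            best = outcome if best is None else max(best, outcome)
        series.append(best)
    return series

print(best_so_far([1.0333, 0.5167] + ["fail"] * 10))  # stays at 1.0333
```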

## Results

Metric: best-so-far `request_rate_per_gpu` under the bounded, time-compressed replay.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 |
| harness | 1.0333 | 1.0333 (stop) | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 |

Actual per-iteration outcomes:

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness | 1.0333 | 0.5167 | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail |
| harness | 1.0333 | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop |

Interpretation:

- The best config is the default community vLLM config for this bounded replay. It reaches the configured search high: the last baseline probe at `sampling_u=0.1171875` is feasible, has pass rate 1.0, and has no TTFT/TPOT SLO failures. With `search.high=0.125` and `max_probes=4`, this is exactly one binary-search resolution below the configured high.
- The harness stops at iter 2 without calling the LLM or launching another GPU trial. This is not a claim that the engine is globally optimal; it is a claim that the current study cannot measure an improvement until `search.high` is increased.
- No-harness spends all 12 tuning iterations anyway. Iter 2 changes to DP=2 and halves per-GPU throughput (0.5167). Iters 3-12 are launch failures from unguided or weakly guided proposals.
- The harness therefore reaches the best measured config in one executed GPU trial and saves 11 tuning iterations on this setup.

Operational note:

- The no-harness run left driver-side orphan GPU memory on GPU0/1 after repeated launch failures. An earlier pre-high-stop harness attempt left the same kind of residue on GPU2/3. The final harness run was executed on dash0 GPU4-7 via a runtime-derived spec to avoid this contamination. Its executed GPU trial used a single H20, matching the no-harness best trial's single-GPU default configuration.
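
One way to screen for this kind of residue before choosing GPUs, as an illustrative sketch (the policy and threshold are hypothetical; the actual run simply moved to GPU4-7):

```python
import subprocess

def clean_gpus(max_used_mib=1024):
    """Return indices of GPUs without significant orphan memory."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        text=True)
    clean = []
    for line in out.strip().splitlines():
        idx, used = (int(v) for v in line.split(","))
        if used <= max_used_mib:          # skip GPUs holding orphan memory
            clean.append(idx)
    return clean

# e.g. set CUDA_VISIBLE_DEVICES=4,5,6,7 once GPUs 4-7 come back clean
```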

## High=1.0 Rerun

The `search.high=0.125` run answered only "can this config handle up to about 1.08 req/s in the compressed replay?" It could not answer "which config is best?" because the default config already reached the measurement ceiling.

Trace request counts after raising `search.high` show the difference:

| search.high | Near-top selected requests | Near-top request rate |
|---|---|---|
| 0.125 | 65 | 1.0833 req/s |
| 0.25 | 141 | 2.3500 req/s |
| 0.5 | 269 | 4.4833 req/s |
| 1.0 | 502 | 8.3667 req/s |
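
A consistency check on this table: every row implies the same ~60 s compressed window (0.1 x the original window duration). The window length is inferred from the numbers here, not stated elsewhere in this note:

```python
# Selected requests divided by near-top rate recovers the replay window.
for n_requests, rate in [(65, 1.0833), (141, 2.3500),
                         (269, 4.4833), (502, 8.3667)]:
    print(round(n_requests / rate, 1))   # ~60.0 s in every row
```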

The high=1.0 run used the same bounded replay (`completion_tokens_override=128`, `replay_time_scale=0.1`, `max_requests_per_probe=512`) but set `search.high=1.0` and `max_probes=6`.

Completed dash0 high=1.0 runs:

| Variant | tmux session | Study root |
|---|---|---|
| no-harness | `qwen30b_vllm020_noharness_high1_20260506` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-high1-noharness` |
| harness-guided-v2 | `qwen30b_vllm020_harness_high1_guided_v2_20260506` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-high1-harness-guided-v2` |

Metric: best-so-far `request_rate_per_gpu`.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness | 2.2000 | 3.2583 | 3.2583 | 3.2583 | 3.2583 | 3.3000 | 3.3500 | 3.3500 | 3.3500 | 3.3500 | 3.3500 | 3.3500 |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | 3.3000 (stop) | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 |

Actual per-iteration outcomes:

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iters 8-12 |
|---|---|---|---|---|---|---|---|---|
| no-harness | 2.2000 | 3.2583 | launch fail | infeasible | 1.1042 | 3.3000 | 3.3500 | infeasible |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | stop | stop | stop | stop |

Interpretation:

- Raising `search.high` was necessary. The default config was not optimal under the expanded workload range; TP=2 immediately improved per-GPU throughput from about 2.2 to 3.2583.
- The updated harness now provides deterministic proposals, not just early stop (see the sketch after this list):
  - iter 2: adjacent TP probe (`tensor-parallel-size=2`),
  - iter 3: same-topology runtime seed (`gpu-memory-utilization=0.95`, chunked prefill, `max-num-batched-tokens=16384`),
  - iter 4: controlled MBT growth to 24576.
- No-harness reached the same config family at iter 7, after an EP launch failure, an infeasible DP probe, and a poor TP/DP probe, followed by runtime refinement.
- Harness reached the same config family at iter 4 and stopped at iter 5. Its measured best was 3.3000, while no-harness measured 3.3500 for the same TP=2 + MBT=24576 family; the 1.5% gap is within the observed boundary/noise of repeated high-load replay. The convergence claim is therefore "same config family in fewer iterations", not an exact higher single-run number.
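
For reference, the guided ladder as plain data (a sketch: the flag names are real vLLM knobs taken from the proposals above, but the dict-per-iteration structure is illustrative, not aituner's proposal schema):

```python
# Deterministic proposal ladder observed in the harness-guided-v2 run.
GUIDED_LADDER = [
    {"tensor-parallel-size": 2},                      # iter 2: adjacent TP probe
    {"tensor-parallel-size": 2,                       # iter 3: runtime seed
     "gpu-memory-utilization": 0.95,
     "enable-chunked-prefill": True,
     "max-num-batched-tokens": 16384},
    {"tensor-parallel-size": 2,                       # iter 4: controlled MBT growth
     "gpu-memory-utilization": 0.95,
     "enable-chunked-prefill": True,
     "max-num-batched-tokens": 24576},
]
```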