Qwen3-30B-A3B Community vLLM Harness Ablation, 2026-05-02

Goal

Run a fresh dash0 experiment on the latest community vLLM release with the local community model:

/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B

The comparison is:

| Variant | Spec | Harness |
| --- | --- | --- |
| no-harness | configs/examples/dash0_qwen30b_a3b_community_vllm020_noharness.json | disabled via llm.use_harness=false |
| harness | configs/examples/dash0_qwen30b_a3b_community_vllm020_harness.json | enabled, including deterministic stop proposal |

Both specs start from the same base vLLM configuration. The base contains only serving access fields: host, port, and served-model-name. It does not set performance flags such as TP, DP, EP, max model length, prefix cache, chunked prefill, max-num-seqs, max-num-batched-tokens, or gpu-memory-utilization. The first trial therefore measures community vLLM defaults for this model.

The launch environment sets HOME=/tmp/wjh and XDG_CACHE_HOME=/tmp/wjh/.cache so vLLM, torch.compile, and FlashInfer build caches land on dash0 local storage instead of CPFS. This is a startup/cache placement choice, not a vLLM performance flag.
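
A minimal sketch of that cache redirection, assuming the server is launched as a child process from Python. The helper name build_launch_env is hypothetical and is not part of aituner; only the two variable values come from the text above.

```python
import os


def build_launch_env(cache_root: str = "/tmp/wjh") -> dict:
    """Return a copy of the current environment with caches redirected.

    Hypothetical helper: only HOME and XDG_CACHE_HOME are overridden, so no
    vLLM performance flag is touched.
    """
    env = dict(os.environ)
    env["HOME"] = cache_root
    env["XDG_CACHE_HOME"] = os.path.join(cache_root, ".cache")
    return env


launch_env = build_launch_env()
print(launch_env["HOME"], launch_env["XDG_CACHE_HOME"])
```

The returned mapping could then be passed as the env argument of subprocess.Popen when starting the server, which leaves the parent process environment unchanged.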

vLLM Install

PyPI reports vllm==0.20.0 as the current community release checked on 2026-05-02. The dash0 runtime venv is on local rootfs rather than CPFS, because installing torch/CUDA wheels into CPFS was I/O-bound:

/tmp/wjh/venvs/vllm-0.20.0-cu129

The first smoke test, a plain pip install vllm==0.20.0, pulled torch 2.11.0+cu130 and failed on dash0's driver (570.133.20, CUDA 12.9). The active install uses the +cu129 wheel from the vLLM 0.20.0 GitHub release together with the PyTorch CUDA 12.9 index, matching vLLM's documented CUDA 12.9 install path for this driver.

Install log:

/home/admin/cpfs/wjh/aituner/aituner/logs/install_vllm_0.20.0_20260502.log

Workload

The experiment reuses the 0-8k chat window that has already been used for qwen27b harness work:

| Field | Value |
| --- | --- |
| window | chat_w20260311_1000 |
| source rows | 32606 |
| input filter | 0 to 8192 tokens |
| completion tokens | fixed 128 via trace.completion_tokens_override |
| max requests per probe | 512 |
| replay time scale | 0.1 |
| target pass rate | 0.95 |
| TTFT SLO | 2s up to 4k, 4s up to 32k, 6s above |
| TPOT SLO | 50ms |
| search high | 0.125 sampling_u |
| max probes per trial | 4 |

The max_requests_per_probe=512 cap keeps the fresh community-vLLM ablation practical while preserving a real trace-shaped replay, SLO scoring, and binary-search threshold probe. A trace-only count check gives 31 to 65 selected requests across the six binary-search thresholds, avoiding the invalid low-cap case where early thresholds can select zero requests.
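
The SLO scoring in the workload settings can be sketched as follows. This is an illustration, not the aituner implementation. It assumes that "4k" and "32k" mean 4096 and 32768 input tokens, that tier boundaries are inclusive, and that a request passes only when it meets both the TTFT and the TPOT limit. The names RequestResult, ttft_limit_s, and probe_passes are hypothetical.

```python
from dataclasses import dataclass

TPOT_LIMIT_S = 0.050      # TPOT SLO: 50ms
TARGET_PASS_RATE = 0.95   # target pass rate


@dataclass
class RequestResult:
    input_tokens: int
    ttft_s: float   # time to first token, seconds
    tpot_s: float   # time per output token, seconds


def ttft_limit_s(input_tokens: int) -> float:
    """Tiered TTFT limit: 2s up to 4k, 4s up to 32k, 6s above (assumed inclusive)."""
    if input_tokens <= 4096:
        return 2.0
    if input_tokens <= 32768:
        return 4.0
    return 6.0


def probe_passes(results: list) -> bool:
    """A probe passes when enough requests meet both SLOs."""
    if not results:
        # Zero selected requests is the invalid low-cap case, not a pass.
        return False
    ok = sum(
        1
        for r in results
        if r.ttft_s <= ttft_limit_s(r.input_tokens) and r.tpot_s <= TPOT_LIMIT_S
    )
    return ok / len(results) >= TARGET_PASS_RATE
```

Treating an empty probe as a failure mirrors the concern above about thresholds that select zero requests: such a probe carries no evidence and should not count as meeting the SLO.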

The first full-output attempt showed why a bounded replay is needed for a 12-iteration ablation: at the first threshold (0.0625), the 31 selected requests contained 14,849 output tokens, with out_max=2981. A single probe at that output length is too slow to finish a full no-harness/harness pair. The first fixed-output (out128) attempt with replay_time_scale=1.0 was still bounded by real window time, so each probe waited close to the original window duration.

The active ablation therefore fixes output length at 128 tokens, uses replay_time_scale=0.1, and limits each trial to four binary-search probes. load_trace_requests scales both the request arrivals and the window duration, so reported request rates are the actual rates of the compressed replay. These settings change the load/decode mix, so the result should be interpreted as a community-vLLM harness convergence test under a bounded, time-compressed chat replay, not as a full-output production benchmark.
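
The effect of time compression on the reported request rate can be sketched as below. This is not the body of load_trace_requests; compress_trace and request_rate are hypothetical helpers, and the arrival times and window length in the usage lines are made-up illustration values, not trace data.

```python
def compress_trace(arrivals_s, window_s, time_scale):
    """Scale arrival offsets and the window duration by the same factor."""
    scaled_arrivals = [t * time_scale for t in arrivals_s]
    scaled_window_s = window_s * time_scale
    return scaled_arrivals, scaled_window_s


def request_rate(num_requests, window_s):
    """Requests per second over the replay window."""
    return num_requests / window_s


# Illustration only: 4 requests spread over a 100-second window.
arrivals = [0.0, 20.0, 50.0, 90.0]
scaled, window = compress_trace(arrivals, 100.0, 0.1)
original_rate = request_rate(len(arrivals), 100.0)
compressed_rate = request_rate(len(scaled), window)
print(round(compressed_rate / original_rate, 6))
```

Because arrivals and window shrink together, a time scale of 0.1 multiplies the request rate by 10 while the number of requests stays the same.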

Harness Update Under Test

This run tests a stricter early-stop harness:

  • The harness still injects L-C-A workload features, recent trial diagnostics, active bottleneck, legal topology candidates, tested signatures, and knob-family rules.
  • A strong incumbent no longer means immediate stop. It means "validate nearby alternatives".
  • Deterministic stop is allowed only after completed validation evidence says continuing is unlikely to be useful:
    • the incumbent beats baseline by a generic large-gain ratio,
    • at least two post-incumbent validation trials have run,
    • those validation trials did not produce a feasible per-GPU improvement,
    • the validation covered topology and runtime families, or accumulated at least three post-incumbent validation attempts.
  • If the stop guard fires, study tune writes harness-stop-XXXX and exits without spending another GPU trial or asking the LLM for another proposal.
  • A single-family all-infeasible plateau is not enough to stop deterministically. It only blocks repeating that family; the LLM must either justify a different family or later satisfy the validation/convergence stop rule.

This is a generic harness rule, not a testcase-specific threshold. It does not depend on qwen27b, qwen235b, qwen30b, a fixed TP/DP value, or a hardcoded SLO number.
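
The stop conditions listed above can be sketched as a single guard function. This is an illustrative reading of the rule, not the aituner code: ValidationTrial, should_stop, the family labels "topology" and "runtime", and the default large_gain_ratio of 1.5 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ValidationTrial:
    family: str          # knob family, e.g. "topology" or "runtime" (assumed labels)
    feasible: bool       # whether the trial met the SLO target
    rate_per_gpu: float  # request rate per GPU reached by the trial


def should_stop(baseline_rate, incumbent_rate, validations, large_gain_ratio=1.5):
    """Return True only when every deterministic-stop condition holds.

    large_gain_ratio is a placeholder; the source only says the ratio is generic.
    """
    # 1. The incumbent must beat the baseline by a large-gain ratio.
    if baseline_rate <= 0 or incumbent_rate < baseline_rate * large_gain_ratio:
        return False
    # 2. At least two post-incumbent validation trials must have run.
    if len(validations) < 2:
        return False
    # 3. No validation trial may be a feasible per-GPU improvement.
    if any(v.feasible and v.rate_per_gpu > incumbent_rate for v in validations):
        return False
    # 4. Validation covered both families, or reached three attempts.
    families = {v.family for v in validations}
    covered_both = {"topology", "runtime"} <= families
    return covered_both or len(validations) >= 3
```

Each early return corresponds to one bullet, so a strong incumbent with no completed validation trials returns False, matching the "validate nearby alternatives" behavior.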

Unit Tests

Local test command:

PYTHONPATH=src python3 -m unittest tests.test_core_flow -q

Result: all 74 tests passed.

The added coverage checks:

| Test | Purpose |
| --- | --- |
| test_harness_does_not_stop_immediately_after_strong_incumbent | strong incumbent requires validation first |
| test_harness_stop_after_post_incumbent_validation_is_exhausted | deterministic stop after validation exhaustion |
| test_cli_tune_uses_harness_stop_before_llm | study tune can stop without calling the LLM or launching another GPU trial |
| test_prompt_can_disable_harness_for_ablation | no-harness prompt removes structured harness context |

Experiment Tracking

Pending dash0 runs:

| Variant | tmux session | Log | Study root |
| --- | --- | --- | --- |
| no-harness | qwen30b_vllm020_noharness_out128_scale01_20260502 | logs/qwen30b_vllm020_noharness_out128_scale01_20260502.log | .aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-noharness |
| harness | qwen30b_vllm020_harness_out128_scale01_20260502 | logs/qwen30b_vllm020_harness_out128_scale01_20260502.log | .aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-harness |

The harness run should be judged by best-so-far request_rate_per_gpu per tuning iteration, plus whether it stops only after validation evidence. The no-harness run should use the same trial budget so the ablation exposes whether the early-stop harness saves iterations without hiding a later better point.
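
The best-so-far curve used for that comparison can be computed as a running maximum over per-iteration results. The sketch below assumes each iteration yields either a feasible request_rate_per_gpu or None for an infeasible trial. The function name best_so_far is hypothetical and the sample values are placeholders, not measurements from this experiment.

```python
def best_so_far(rates_per_iteration):
    """Running maximum of feasible request_rate_per_gpu values.

    None marks an iteration with no feasible result; the curve stays at the
    previous best (or None if nothing feasible has been seen yet).
    """
    best = None
    curve = []
    for rate in rates_per_iteration:
        if rate is not None and (best is None or rate > best):
            best = rate
        curve.append(best)
    return curve


# Placeholder values for illustration only.
example = [None, 1.0, 0.5, 2.0, None]
print(best_so_far(example))
```

Plotting this curve for both variants over the same trial budget shows where an early stop would have ended the harness run and whether the no-harness run later found a better point.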

Results

Pending. This section will be filled after the dash0 experiments finish.