Qwen3-30B-A3B Community vLLM Harness Ablation, 2026-05-02

Goal

Run a fresh dash0 experiment on the latest community vLLM release, using the local copy of the community model:

/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B

The comparison is:

| Variant | Spec | Harness |
| --- | --- | --- |
| no-harness | configs/examples/dash0_qwen30b_a3b_community_vllm020_noharness.json | disabled via llm.use_harness=false |
| harness | configs/examples/dash0_qwen30b_a3b_community_vllm020_harness.json | enabled, including the deterministic stop proposal |

Both specs start from the same base vLLM configuration. The base contains only serving-access fields: host, port, and served-model-name. It does not set performance flags such as tensor/data/expert parallelism (TP/DP/EP), max model length, prefix cache, chunked prefill, max-num-seqs, max-num-batched-tokens, or gpu-memory-utilization. The first trial therefore measures community vLLM defaults for this model.
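As a rough illustration, the shared base block amounts to nothing more than the serving-access fields. The dict below is a hypothetical sketch; key names and values are placeholders, and the authoritative schema is in the two specs listed above.

```python
# Hypothetical sketch of the shared base vLLM block; keys and the
# host/port/model-name values are placeholders, not aituner's actual schema.
base_vllm = {
    "host": "0.0.0.0",
    "port": 8000,
    "served_model_name": "qwen3-30b-a3b",
    # Deliberately absent, so the first trial measures community vLLM defaults:
    # TP/DP/EP, max model length, prefix cache, chunked prefill,
    # max-num-seqs, max-num-batched-tokens, gpu-memory-utilization.
}
```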

The launch environment sets HOME=/tmp/wjh and XDG_CACHE_HOME=/tmp/wjh/.cache so vLLM, torch.compile, and FlashInfer build caches land on dash0 local storage instead of CPFS. This is a startup/cache placement choice, not a vLLM performance flag.
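A minimal launch sketch under those assumptions: only the cache-placement environment and the serving-access flags are set, so everything else falls back to vLLM defaults. The host, port, and served model name here are placeholders, not the values from the specs.

```python
import os
import subprocess

# Cache placement only: point HOME/XDG_CACHE_HOME at dash0 local storage so
# vLLM, torch.compile, and FlashInfer build caches stay off CPFS.
env = dict(os.environ)
env["HOME"] = "/tmp/wjh"
env["XDG_CACHE_HOME"] = "/tmp/wjh/.cache"

# Serving-access fields only; no performance flags, so trial 1 measures
# community vLLM defaults for this model. Values are placeholders.
subprocess.run(
    [
        "vllm", "serve", "/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B",
        "--host", "0.0.0.0",
        "--port", "8000",
        "--served-model-name", "qwen3-30b-a3b",
    ],
    env=env,
    check=True,
)
```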

vLLM Install

PyPI reports vllm==0.20.0 as the current community release (checked 2026-05-02). The dash0 runtime venv lives on the local rootfs rather than CPFS, because installing torch/CUDA wheels into CPFS was I/O-bound:

/tmp/wjh/venvs/vllm-0.20.0-cu129

The first plain pip install vllm==0.20.0 smoke test pulled torch 2.11.0+cu130 and failed against dash0's driver (570.133.20, CUDA 12.9). The active install uses the vLLM 0.20.0 GitHub release +cu129 wheel together with the PyTorch CUDA 12.9 index, matching vLLM's documented CUDA 12.9 install path for this driver.
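A quick sanity check of the kind that catches the cu130 mismatch, run inside the venv above. This is a generic torch check, not part of the harness:

```python
# Verify the installed torch wheel was built for CUDA 12.9 and that the
# dash0 driver (570.133.20) can actually initialize it.
import torch

print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
assert torch.version.cuda and torch.version.cuda.startswith("12.9"), \
    "wrong CUDA build (e.g. a +cu130 wheel) for this driver"
assert torch.cuda.is_available(), "CUDA runtime/driver mismatch"
```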

Install log:

/home/admin/cpfs/wjh/aituner/aituner/logs/install_vllm_0.20.0_20260502.log

Workload

The experiment reuses the 0-8k chat window already used for the qwen27b harness work:

| Field | Value |
| --- | --- |
| window | chat_w20260311_1000 |
| source rows | 32606 |
| input filter | 0 to 8192 tokens |
| max requests per probe | 2048 |
| target pass rate | 0.95 |
| TTFT SLO | 2 s for inputs up to 4k tokens, 4 s up to 32k, 6 s above |
| TPOT SLO | 50 ms |
| search high | 0.125 sampling_u |
| max probes per trial | 6 |
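SLO scoring reduces to a per-request check against the tiered TTFT limit and the flat TPOT limit, plus a pass-rate threshold per probe. A minimal sketch with hypothetical function names:

```python
def request_meets_slo(input_tokens: int, ttft_s: float, tpot_ms: float) -> bool:
    # Tiered TTFT limit by input length, flat 50 ms TPOT limit.
    if input_tokens <= 4 * 1024:
        ttft_limit = 2.0
    elif input_tokens <= 32 * 1024:
        ttft_limit = 4.0
    else:
        ttft_limit = 6.0
    return ttft_s <= ttft_limit and tpot_ms <= 50.0


def probe_passes(results: list, target_pass_rate: float = 0.95) -> bool:
    # results: (input_tokens, ttft_s, tpot_ms) per replayed request in one probe.
    passed = sum(request_meets_slo(*r) for r in results)
    return passed / max(len(results), 1) >= target_pass_rate
```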

The max_requests_per_probe=2048 cap keeps the fresh community-vLLM ablation practical while preserving a real trace-shaped replay, SLO scoring, and binary-search threshold probe.
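The threshold probe itself is a bounded binary search over sampling_u, capped at six probes per trial. A sketch under the same assumptions, where run_probe is a hypothetical callable that replays up to 2048 trace-shaped requests at the given sampling level and reports whether the probe met the 0.95 pass rate:

```python
def find_feasible_sampling_u(run_probe, high: float = 0.125, max_probes: int = 6) -> float:
    # Binary-search the largest sampling_u whose probe still meets the SLO
    # pass rate; returns 0.0 if no probe passes within the budget.
    lo, hi, best = 0.0, high, 0.0
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if run_probe(mid):
            best, lo = mid, mid   # feasible: push higher
        else:
            hi = mid              # infeasible: back off
    return best
```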

Harness Update Under Test

This run tests a stricter early-stop harness:

  • The harness still injects L-C-A workload features, recent trial diagnostics, active bottleneck, legal topology candidates, tested signatures, and knob-family rules.
  • A strong incumbent no longer means immediate stop. It means "validate nearby alternatives".
  • Deterministic stop is allowed only after completed validation evidence says continuing is unlikely to be useful:
    • the incumbent beats baseline by a generic large-gain ratio,
    • at least two post-incumbent validation trials have run,
    • those validation trials did not produce a feasible per-GPU improvement,
    • the validation covered topology and runtime families, or accumulated at least three post-incumbent validation attempts.
  • If the stop guard fires, study tune writes harness-stop-XXXX and exits without spending another GPU trial or asking the LLM for another proposal.

This is a generic harness rule, not a testcase-specific threshold. It does not depend on qwen27b, qwen235b, qwen30b, a fixed TP/DP value, or a hardcoded SLO number.
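Read as a predicate, the stop guard fires only when all of the validation conditions above hold. The sketch below is illustrative only; the ValidationSummary shape, field names, and the large-gain ratio value are assumptions, not aituner's internal API:

```python
from dataclasses import dataclass


@dataclass
class ValidationSummary:
    incumbent_gain_over_baseline: float  # incumbent / baseline request_rate_per_gpu
    post_incumbent_trials: int           # completed validation trials after the incumbent
    found_feasible_per_gpu_improvement: bool
    covered_topology_family: bool
    covered_runtime_family: bool


def harness_should_stop(v: ValidationSummary, large_gain_ratio: float = 1.5) -> bool:
    # 1) incumbent beats baseline by a generic large-gain ratio
    if v.incumbent_gain_over_baseline < large_gain_ratio:
        return False
    # 2) at least two completed post-incumbent validation trials, none of which
    #    produced a feasible per-GPU improvement
    if v.post_incumbent_trials < 2 or v.found_feasible_per_gpu_improvement:
        return False
    # 3) validation covered topology and runtime families, or accumulated at
    #    least three post-incumbent attempts
    covered_both = v.covered_topology_family and v.covered_runtime_family
    return covered_both or v.post_incumbent_trials >= 3
```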

Unit Tests

Local test command:

PYTHONPATH=src python3 -m unittest tests.test_core_flow -q

Result: all 74 tests passed.

The added coverage checks:

| Test | Purpose |
| --- | --- |
| test_harness_does_not_stop_immediately_after_strong_incumbent | a strong incumbent requires validation first |
| test_harness_stop_after_post_incumbent_validation_is_exhausted | deterministic stop after validation is exhausted |
| test_cli_tune_uses_harness_stop_before_llm | study tune can stop without calling the LLM or launching another GPU trial |
| test_prompt_can_disable_harness_for_ablation | the no-harness prompt removes structured harness context |

Experiment Tracking

Pending dash0 runs:

| Variant | tmux session | Log | Study root |
| --- | --- | --- | --- |
| no-harness | qwen30b_vllm020_noharness_20260502 | logs/qwen30b_vllm020_noharness_20260502.log | .aituner-community-vllm020/studies/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-noharness |
| harness | qwen30b_vllm020_harness_20260502 | logs/qwen30b_vllm020_harness_20260502.log | .aituner-community-vllm020/studies/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-harness |

The harness run should be judged on best-so-far request_rate_per_gpu per tuning iteration, plus whether it stops only after validation evidence. The no-harness run should use the same trial budget, so the ablation shows whether the early-stop harness saves iterations without hiding a later, better point.
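For that comparison a best-so-far curve per run is enough. A small post-processing sketch, assuming a per-trial list of request_rate_per_gpu values in tuning order:

```python
def best_so_far(rate_per_gpu_by_trial: list[float]) -> list[float]:
    # Monotone best-so-far request_rate_per_gpu per tuning iteration, used to
    # compare the harness and no-harness runs on the same trial budget.
    best, curve = float("-inf"), []
    for rate in rate_per_gpu_by_trial:
        best = max(best, rate)
        curve.append(best)
    return curve
```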

Results

Pending. This section will be filled after the dash0 experiments finish.