# Qwen3-30B-A3B Community vLLM Harness Ablation, 2026-05-02

## Goal

Run a fresh dash0 experiment on the latest community vLLM release with the local community model:

`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`

The comparison is:

| Variant | Spec | Harness |
|---|---|---|
| no-harness | `configs/examples/dash0_qwen30b_a3b_community_vllm020_noharness.json` | disabled via `llm.use_harness=false` |
| harness | `configs/examples/dash0_qwen30b_a3b_community_vllm020_harness.json` | enabled, including deterministic stop proposal |

Both specs start from the same base vLLM configuration. The base contains only serving access fields: host, port, and served-model-name. It does not set performance flags such as TP, DP, EP, max model length, prefix cache, chunked prefill, max-num-seqs, max-num-batched-tokens, or gpu-memory-utilization. The first trial therefore measures community vLLM defaults for this model.

The launch environment sets HOME=/tmp/wjh and XDG_CACHE_HOME=/tmp/wjh/.cache so vLLM, torch.compile, and FlashInfer build caches land on dash0 local storage instead of CPFS. This is a startup/cache placement choice, not a vLLM performance flag.
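
As a sketch of what the launch reduces to under these constraints (the flag names are standard vLLM CLI flags; the host, port, and served-model-name values are illustrative placeholders, not the spec's actual values):

```python
import os
import subprocess

# Cache placement from the launch environment: keep vLLM, torch.compile, and
# FlashInfer caches on dash0 local storage instead of CPFS.
env = dict(os.environ, HOME="/tmp/wjh", XDG_CACHE_HOME="/tmp/wjh/.cache")

# The base spec carries only serving access fields, so the first trial
# measures community vLLM defaults for this model.
cmd = [
    "vllm", "serve", "/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B",
    "--host", "0.0.0.0",                      # illustrative value
    "--port", "8000",                         # illustrative value
    "--served-model-name", "qwen3-30b-a3b",   # illustrative value
    # Deliberately absent: --tensor-parallel-size, --data-parallel-size,
    # --max-model-len, --max-num-seqs, --max-num-batched-tokens,
    # --gpu-memory-utilization, and prefix-cache / chunked-prefill flags.
]
subprocess.Popen(cmd, env=env)
```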

## vLLM Install

PyPI reports vllm==0.20.0 as the current community release (checked 2026-05-02). The dash0 runtime venv lives on the local rootfs rather than CPFS, because installing torch/CUDA wheels into CPFS was I/O-bound:

`/tmp/wjh/venvs/vllm-0.20.0-cu129`

The first plain `pip install vllm==0.20.0` smoke test pulled torch 2.11.0+cu130 and failed on dash0's driver (570.133.20, CUDA 12.9). The active install uses the vLLM 0.20.0 GitHub release +cu129 wheel and the PyTorch CUDA 12.9 index, matching the vLLM documented CUDA 12.9 install path for this driver.
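
A quick way to catch this class of wheel/driver mismatch before a full serve attempt, as a hedged sketch:

```python
import torch

# A +cu130 torch build cannot initialize on dash0's 570.133.20 / CUDA 12.9
# driver stack; a matching +cu129 build should pass this check.
print(torch.__version__, torch.version.cuda)   # expect a cu129 build here
assert torch.cuda.is_available(), "torch CUDA build does not match the driver"
print(torch.cuda.get_device_name(0))
```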

Install log:

`/home/admin/cpfs/wjh/aituner/aituner/logs/install_vllm_0.20.0_20260502.log`

## Workload

The experiment reuses the 0-8k chat window from the earlier qwen27b harness work:

| Field | Value |
|---|---|
| window | `chat_w20260311_1000` |
| source rows | 32606 |
| input filter | 0 to 8192 tokens |
| completion tokens | fixed 128 via `trace.completion_tokens_override` |
| max requests per probe | 512 |
| replay time scale | 0.1 |
| target pass rate | 0.95 |
| TTFT SLO | 2 s up to 4k, 4 s up to 32k, 6 s above |
| TPOT SLO | 50 ms |
| search high | 0.125 `sampling_u` |
| max probes per trial | 4 |

The `max_requests_per_probe=512` cap keeps the fresh community-vLLM ablation practical while preserving a real trace-shaped replay, SLO scoring, and a binary-search threshold probe. A trace-only count check gives 31 to 65 selected requests across the six binary-search thresholds, avoiding the invalid low-cap case where early thresholds can select zero requests.
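
For intuition, here is the midpoint ladder such a search walks, as a sketch that assumes plain bisection and all-feasible probes (the actual aituner probe scheduler is not reproduced here):

```python
# Bisection over sampling_u between low and search.high; a feasible probe
# moves the search up, an infeasible one moves it down.
def probe_thresholds(low=0.0, high=0.125, max_probes=4, feasible=lambda u: True):
    visited = []
    for _ in range(max_probes):
        mid = (low + high) / 2
        visited.append(mid)
        if feasible(mid):
            low = mid       # feasible: search the upper half
        else:
            high = mid      # infeasible: search the lower half
    return visited

print(probe_thresholds())
# [0.0625, 0.09375, 0.109375, 0.1171875]
# The all-feasible ladder ends at 0.1171875 = 0.125 - 0.125/16, exactly one
# binary-search resolution below search.high (see Results below).
```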

The first full-output attempt showed why a bounded replay is needed for a 12-iteration ablation: at the first threshold (0.0625), the 31 selected requests contained 14,849 output tokens with out_max=2981, which made a single probe too slow for a full no-harness/harness pair to finish. The first out128 attempt with `replay_time_scale=1.0` was still bounded by real window time, so each probe waited close to the original window duration.

The active ablation therefore fixes output length at 128 tokens, uses `replay_time_scale=0.1`, and limits each trial to four binary-search probes. `load_trace_requests` scales both request arrivals and the window duration, so reported request rates are the actual compressed replay request rates. This changes the load/decode mix, so the result should be interpreted as a community-vLLM harness convergence test under a bounded, time-compressed chat replay, not as a full-output production benchmark.
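
A minimal sketch of the bounded replay shaping described above, with illustrative request fields rather than aituner's actual `load_trace_requests` signature:

```python
# Illustrative names only; the real request schema is not reproduced here.
def shape_replay(requests, time_scale=0.1, out_tokens=128, cap=512):
    """Apply the bounded-replay knobs from the workload table."""
    shaped = []
    for req in requests[:cap]:                           # max_requests_per_probe=512
        shaped.append({
            "arrival_s": req["arrival_s"] * time_scale,  # compress arrivals
            "prompt": req["prompt"],
            "max_tokens": out_tokens,                    # completion_tokens_override=128
        })
    # The window duration is scaled by the same factor, so request rate is
    # measured against the compressed window, not the original one.
    return shaped
```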

## Harness Update Under Test

This run tests a stricter early-stop harness:

- The harness still injects L-C-A workload features, recent trial diagnostics, the active bottleneck, legal topology candidates, tested signatures, and knob-family rules.
- A strong incumbent no longer means immediate stop. It means "validate nearby alternatives".
- Deterministic stop is allowed only after completed validation evidence says continuing is unlikely to be useful:
  - the incumbent beats baseline by a generic large-gain ratio,
  - at least two post-incumbent validation trials have run,
  - those validation trials did not produce a feasible per-GPU improvement, and
  - the validation covered both the topology and runtime families, or accumulated at least three post-incumbent validation attempts.
- If the stop guard fires, `study tune` writes `harness-stop-XXXX` and exits without spending another GPU trial or asking the LLM for another proposal.
- A single-family all-infeasible plateau is not enough to stop deterministically. It only blocks repeating that family; the LLM must either justify a different family or later satisfy the validation/convergence stop rule.
- A search-high saturation guard stops immediately when the incumbent's highest measured probe is feasible and within the configured binary-search resolution of `search.high`. A feasible probe may still contain individual SLO failures as long as it meets the configured pass-rate target. In that case the current study cannot measure a better config without increasing the workload search range, so further config proposals only waste tuning iterations.

This is a generic harness rule, not a testcase-specific threshold. It does not depend on qwen27b, qwen235b, qwen30b, a fixed TP/DP value, or a hardcoded SLO number.
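
A minimal sketch of the validation-exhaustion stop guard, assuming illustrative field names (`rpg` for request_rate_per_gpu, `family`, `feasible`, `completed`) and an illustrative large-gain ratio; the real guard lives in the aituner harness:

```python
def harness_should_stop(incumbent_rpg, baseline_rpg, validations,
                        large_gain_ratio=1.5):  # ratio value is illustrative
    """Deterministic stop only after post-incumbent validation evidence."""
    if incumbent_rpg < large_gain_ratio * baseline_rpg:
        return False                    # no strong incumbent yet
    done = [v for v in validations if v["completed"]]
    if len(done) < 2:
        return False                    # must validate nearby alternatives first
    if any(v["feasible"] and v["rpg"] > incumbent_rpg for v in done):
        return False                    # a validation found a feasible improvement
    covered = {"topology", "runtime"} <= {v["family"] for v in done}
    return covered or len(done) >= 3    # family coverage, or three exhausted attempts
```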

## Unit Tests

Local test command:

```
PYTHONPATH=src python3 -m unittest tests.test_core_flow -q
```

Result at the time of this note: passed. The current repository test count may be higher; use the command above as the source of truth.

The added coverage checks:

| Test | Purpose |
|---|---|
| `test_harness_does_not_stop_immediately_after_strong_incumbent` | a strong incumbent requires validation first |
| `test_harness_stop_after_post_incumbent_validation_is_exhausted` | deterministic stop after validation exhaustion |
| `test_cli_tune_uses_harness_stop_before_llm` | `study tune` can stop without calling the LLM or launching another GPU trial |
| `test_prompt_can_disable_harness_for_ablation` | the no-harness prompt removes structured harness context |
| `test_harness_stop_when_incumbent_saturates_search_high` | deterministic stop when the incumbent saturates the configured workload search high |
| `test_harness_guided_first_tp_probe_for_latency_bottleneck` | deterministic first TP probe after baseline latency-bottleneck evidence |
| `test_harness_guided_runtime_seed_preserves_tp_incumbent` | deterministic same-topology runtime refinement after a TP incumbent |

## Experiment Tracking

Completed dash0 runs:

| Variant | tmux session | Log | Study root |
|---|---|---|---|
| no-harness | `qwen30b_vllm020_noharness_out128_scale01_20260502` | `logs/qwen30b_vllm020_noharness_out128_scale01_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-noharness` |
| harness | `qwen30b_vllm020_harness_highstop_gpu4_7_20260502` | `logs/qwen30b_vllm020_harness_highstop_gpu4_7_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-harness-highstop-gpu4-7` |

The harness run should be judged by best-so-far request_rate_per_gpu per tuning iteration, plus whether it stops only after validation evidence. The no-harness run should use the same trial budget so the ablation exposes whether the early-stop harness saves iterations without hiding a later better point.
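
The best-so-far series in the Results tables can be derived from per-iteration outcomes with a fold like this (a sketch; aituner's actual reporting code is not reproduced here):

```python
# 'fail' and 'stop' iterations carry the previous best forward, which is how
# the two variants stay comparable on the same 12-iteration axis.
def best_so_far(per_iteration_outcomes):
    best, series = None, []
    for outcome in per_iteration_outcomes:
        if isinstance(outcome, float):
            best = outcome if best is None else max(best, outcome)
        series.append(best)
    return series

print(best_so_far([1.0333, 0.5167] + ["fail"] * 10))  # stays at 1.0333
```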

## Results

Metric: best-so-far `request_rate_per_gpu` under the bounded, time-compressed replay.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 |
| harness | 1.0333 | 1.0333 (stop) | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 |

Actual per-iteration outcomes:

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness | 1.0333 | 0.5167 | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail |
| harness | 1.0333 | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop |

Interpretation:

- The best config is the default community vLLM config for this bounded replay. It reaches the configured search high: the last baseline probe at `sampling_u=0.1171875` is feasible, has pass rate 1.0, and has no TTFT/TPOT SLO failures. With `search.high=0.125` and `max_probes=4`, this is exactly one binary-search resolution below the configured high.
- The harness stops at iter 2 without calling the LLM or launching another GPU trial. This is not a claim that the engine is globally optimal; it is a claim that the current study cannot measure an improvement until `search.high` is increased.
- No-harness spends all 12 tuning iterations anyway. Iter 2 changes to DP=2 and halves per-GPU throughput (0.5167). Iters 3-12 are launch failures from unguided or weakly guided proposals.
- The harness therefore reaches the best measured config in one executed GPU trial and saves 11 tuning iterations on this setup.

Operational note:

- The no-harness run left driver-side orphan GPU memory on GPU0/1 after repeated launch failures. An earlier pre-high-stop harness attempt left the same kind of residue on GPU2/3. The final harness run was executed on dash0 GPU4-7 via a runtime-derived spec to avoid this contamination. Its executed GPU trial used a single H20, matching the no-harness best trial's single-GPU default configuration.
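
One way to screen for this kind of residue before choosing GPUs, as an illustrative sketch (the policy and threshold are hypothetical; the actual run simply moved to GPU4-7):

```python
import subprocess

def clean_gpus(max_used_mib=1024):
    """Return indices of GPUs without significant orphan memory."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        text=True)
    clean = []
    for line in out.strip().splitlines():
        idx, used = (int(v) for v in line.split(","))
        if used <= max_used_mib:          # skip GPUs holding orphan memory
            clean.append(idx)
    return clean

# e.g. set CUDA_VISIBLE_DEVICES=4,5,6,7 once GPUs 4-7 come back clean
```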

## High=1.0 Rerun

The `search.high=0.125` run answered only "can this config handle up to about 1.08 req/s in the compressed replay?" It could not answer "which config is best?" because the default config already reached the measurement ceiling.

Trace request counts after raising `search.high` show the difference:

| search.high | Near-top selected requests | Near-top request rate |
|---|---|---|
| 0.125 | 65 | 1.0833 req/s |
| 0.25 | 141 | 2.3500 req/s |
| 0.5 | 269 | 4.4833 req/s |
| 1.0 | 502 | 8.3667 req/s |
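
A consistency check on this table: every row implies the same ~60 s compressed window (0.1 x the original window duration). The window length is inferred from the numbers here, not stated elsewhere in this note:

```python
# Selected requests divided by near-top rate recovers the replay window.
for n_requests, rate in [(65, 1.0833), (141, 2.3500),
                         (269, 4.4833), (502, 8.3667)]:
    print(round(n_requests / rate, 1))   # ~60.0 s in every row
```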

The high=1.0 run used the same bounded replay (`completion_tokens_override=128`, `replay_time_scale=0.1`, `max_requests_per_probe=512`) but set `search.high=1.0` and `max_probes=6`.

Completed dash0 high=1.0 runs:

| Variant | tmux session | Study root |
|---|---|---|
| no-harness | `qwen30b_vllm020_noharness_high1_20260506` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-high1-noharness` |
| harness-guided-v2 | `qwen30b_vllm020_harness_high1_guided_v2_20260506` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-high1-harness-guided-v2` |

Metric: best-so-far `request_rate_per_gpu`.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness | 2.2000 | 3.2583 | 3.2583 | 3.2583 | 3.2583 | 3.3000 | 3.3500 | 3.3500 | 3.3500 | 3.3500 | 3.3500 | 3.3500 |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | 3.3000 (stop) | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 |

Actual per-iteration outcomes:

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iters 8-12 |
|---|---|---|---|---|---|---|---|---|
| no-harness | 2.2000 | 3.2583 | launch fail | infeasible | 1.1042 | 3.3000 | 3.3500 | infeasible |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | stop | stop | stop | stop |

Interpretation:

- Raising `search.high` was necessary. The default config was not optimal under the expanded workload range; TP=2 immediately improved per-GPU throughput from about 2.2 to 3.2583.
- The updated harness now provides deterministic proposals, not just early stop (see the sketch after this list):
  - iter 2: adjacent TP probe (`tensor-parallel-size=2`),
  - iter 3: same-topology runtime seed (`gpu-memory-utilization=0.95`, chunked prefill, `max-num-batched-tokens=16384`),
  - iter 4: controlled MBT growth to 24576.
- No-harness reached the same config family at iter 7, after an EP launch failure, an infeasible DP probe, and a poor TP/DP probe, followed by runtime refinement.
- Harness reached the same config family at iter 4 and stopped at iter 5. Its measured best was 3.3000, while no-harness measured 3.3500 for the same TP=2 + MBT=24576 family; the 1.5% gap is within the observed boundary/noise of repeated high-load replay. The convergence claim is therefore "same config family in fewer iterations", not an exact higher single-run number.
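
For reference, the guided ladder as plain data (a sketch: the flag names are real vLLM knobs taken from the proposals above, but the dict-per-iteration structure is illustrative, not aituner's proposal schema):

```python
# Deterministic proposal ladder observed in the harness-guided-v2 run.
GUIDED_LADDER = [
    {"tensor-parallel-size": 2},                      # iter 2: adjacent TP probe
    {"tensor-parallel-size": 2,                       # iter 3: runtime seed
     "gpu-memory-utilization": 0.95,
     "enable-chunked-prefill": True,
     "max-num-batched-tokens": 16384},
    {"tensor-parallel-size": 2,                       # iter 4: controlled MBT growth
     "gpu-memory-utilization": 0.95,
     "enable-chunked-prefill": True,
     "max-num-batched-tokens": 24576},
]
```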