# Qwen3-30B-A3B Community vLLM Harness Ablation, 2026-05-02

## Goal

Run a fresh dash0 experiment on the latest community vLLM release with the local community model:

`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`

The comparison is:

| Variant | Spec | Harness |
| --- | --- | --- |
| no-harness | `configs/examples/dash0_qwen30b_a3b_community_vllm020_noharness.json` | disabled via `llm.use_harness=false` |
| harness | `configs/examples/dash0_qwen30b_a3b_community_vllm020_harness.json` | enabled, including the deterministic stop proposal |

Both specs start from the same base vLLM configuration. The base contains only serving-access fields: `host`, `port`, and `served-model-name`. It does not set performance flags such as TP, DP, EP, max model length, prefix cache, chunked prefill, `max-num-seqs`, `max-num-batched-tokens`, or `gpu-memory-utilization`. The first trial therefore measures community vLLM defaults for this model.

## vLLM Install

PyPI reports `vllm==0.20.0` as the current community release (checked on 2026-05-02). The dash0 runtime venv lives on local rootfs rather than CPFS, because installing torch/CUDA wheels into CPFS was I/O-bound:

`/tmp/wjh/venvs/vllm-0.20.0-auto`

A first plain `pip install vllm==0.20.0` smoke test pulled `torch 2.11.0+cu130` and failed on dash0's driver (`570.133.20`, CUDA 12.9). The active install uses the vLLM-documented `uv pip install vllm==0.20.0 --torch-backend=auto` path, so uv selects a CUDA backend compatible with the installed driver.
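The failure mode above can be illustrated with a small sketch: a wheel built against a newer CUDA toolkit than the driver supports will not load, which is what `--torch-backend=auto` avoids. The helper below is an illustrative rule of thumb only, not uv's or torch's actual backend-selection logic.

```python
def wheel_loads_on_driver(wheel_cuda: str, driver_cuda: str) -> bool:
    """Illustrative compatibility rule (assumption, not uv/torch internals):
    a CUDA wheel generally needs a driver whose supported CUDA version is
    at least the wheel's build version."""
    def _parse(v: str) -> tuple:
        return tuple(int(x) for x in v.split("."))
    return _parse(wheel_cuda) <= _parse(driver_cuda)

# torch 2.11.0+cu130 is built for CUDA 13.0; dash0's driver exposes CUDA 12.9,
# so the plain pip install failed on this rule of thumb:
print(wheel_loads_on_driver("13.0", "12.9"))  # False
print(wheel_loads_on_driver("12.8", "12.9"))  # True
```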
Install log: `/home/admin/cpfs/wjh/aituner/aituner/logs/install_vllm_0.20.0_20260502.log`

## Workload

The experiment reuses the 0-8k chat window already used for the qwen27b harness work:

| Field | Value |
| --- | --- |
| window | `chat_w20260311_1000` |
| source rows | 32606 |
| input filter | 0 to 8192 tokens |
| max requests per probe | 2048 |
| target pass rate | 0.95 |
| TTFT SLO | 2s up to 4k, 4s up to 32k, 6s above |
| TPOT SLO | 50ms |
| search high | 0.125 sampling_u |
| max probes per trial | 6 |

The `max_requests_per_probe=2048` cap keeps the fresh community-vLLM ablation practical while preserving a real trace-shaped replay, SLO scoring, and the binary-search threshold probe.

## Harness Update Under Test

This run tests a stricter early-stop harness:

- The harness still injects L-C-A workload features, recent trial diagnostics, the active bottleneck, legal topology candidates, tested signatures, and knob-family rules.
- A strong incumbent no longer means an immediate stop. It means "validate nearby alternatives".
- A deterministic stop is allowed only after completed validation evidence says continuing is unlikely to be useful:
  - the incumbent beats the baseline by a generic large-gain ratio,
  - at least two post-incumbent validation trials have run,
  - those validation trials did not produce a feasible per-GPU improvement,
  - the validation covered the topology and runtime families, or accumulated at least three post-incumbent validation attempts.
- If the stop guard fires, `study tune` writes `harness-stop-XXXX` and exits without spending another GPU trial or asking the LLM for another proposal.

This is a generic harness rule, not a testcase-specific threshold. It does not depend on qwen27b, qwen235b, qwen30b, a fixed TP/DP value, or a hardcoded SLO number.

## Unit Tests

Local test command:

```bash
PYTHONPATH=src python3 -m unittest tests.test_core_flow -q
```

Result: passed, 74 tests.
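The tiered TTFT SLO from the workload table (2s up to 4k input tokens, 4s up to 32k, 6s above, plus the 50ms TPOT SLO) can be expressed as a small helper. This is a minimal sketch for clarity, not aituner's real scoring code; the function names are illustrative.

```python
def ttft_slo_seconds(input_tokens: int) -> float:
    """Tiered TTFT SLO from the workload table (illustrative helper)."""
    if input_tokens <= 4 * 1024:
        return 2.0
    if input_tokens <= 32 * 1024:
        return 4.0
    return 6.0

def request_meets_slo(ttft_s: float, tpot_ms: float, input_tokens: int) -> bool:
    """A request passes only if both TTFT and TPOT (50 ms) meet their SLOs."""
    return ttft_s <= ttft_slo_seconds(input_tokens) and tpot_ms <= 50.0
```

Under this shape, the probe's pass rate is the fraction of replayed requests for which `request_meets_slo` is true, compared against the 0.95 target.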
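The stop-guard conditions these tests exercise can be sketched as follows. The function name, trial-record fields, and the 1.5x gain threshold are illustrative assumptions for this note, not aituner's real API.

```python
def harness_should_stop(incumbent_gain_ratio: float,
                        post_incumbent_validations: list,
                        large_gain_ratio: float = 1.5) -> bool:
    """Sketch of the deterministic stop guard (illustrative, not the real code).
    Each validation record is assumed to look like:
      {"family": "topology" | "runtime" | ..., "feasible_per_gpu_improvement": bool}
    """
    # 1) the incumbent must beat the baseline by a generic large-gain ratio
    if incumbent_gain_ratio < large_gain_ratio:
        return False
    # 2) at least two post-incumbent validation trials must have completed
    if len(post_incumbent_validations) < 2:
        return False
    # 3) none of them may have produced a feasible per-GPU improvement
    if any(v["feasible_per_gpu_improvement"] for v in post_incumbent_validations):
        return False
    # 4) validation covered both topology and runtime families,
    #    or accumulated at least three post-incumbent attempts
    families = {v["family"] for v in post_incumbent_validations}
    return ({"topology", "runtime"} <= families
            or len(post_incumbent_validations) >= 3)
```

Note how a strong incumbent alone (condition 1) never fires the guard: conditions 2-4 encode the "validate nearby alternatives first" requirement.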
The added coverage checks:

| Test | Purpose |
| --- | --- |
| `test_harness_does_not_stop_immediately_after_strong_incumbent` | a strong incumbent requires validation first |
| `test_harness_stop_after_post_incumbent_validation_is_exhausted` | deterministic stop after validation is exhausted |
| `test_cli_tune_uses_harness_stop_before_llm` | `study tune` can stop without calling the LLM or launching another GPU trial |
| `test_prompt_can_disable_harness_for_ablation` | the no-harness prompt removes the structured harness context |

## Experiment Tracking

Pending dash0 runs:

| Variant | tmux session | Log | Study root |
| --- | --- | --- | --- |
| no-harness | `qwen30b_vllm020_noharness_20260502` | `logs/qwen30b_vllm020_noharness_20260502.log` | `.aituner-community-vllm020/studies/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-noharness` |
| harness | `qwen30b_vllm020_harness_20260502` | `logs/qwen30b_vllm020_harness_20260502.log` | `.aituner-community-vllm020/studies/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-harness` |

The harness run should be judged by best-so-far `request_rate_per_gpu` per tuning iteration, plus whether it stops only after validation evidence. The no-harness run should use the same trial budget, so the ablation exposes whether the early-stop harness saves iterations without hiding a later, better point.

## Results

Pending. This section will be filled in after the dash0 experiments finish.
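The judging metric described under Experiment Tracking is a running maximum; a minimal sketch, assuming each tuning iteration reports one `request_rate_per_gpu` value (helper name is illustrative):

```python
def best_so_far(rates_per_iteration: list) -> list:
    """Running best request_rate_per_gpu per tuning iteration,
    for comparing the harness and no-harness curves."""
    best = float("-inf")
    out = []
    for rate in rates_per_iteration:
        best = max(best, rate)
        out.append(best)
    return out

# e.g. a run that improves, regresses on one trial, then improves again:
print(best_so_far([1.2, 1.8, 1.5, 2.1]))  # [1.2, 1.8, 1.8, 2.1]
```

Plotting both runs' `best_so_far` curves over the same trial budget makes it visible whether the early-stop harness reaches its plateau in fewer iterations and whether the no-harness run ever passes it later.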