# Qwen3-30B-A3B Community vLLM Harness Ablation, 2026-05-02
## Goal
Run a fresh dash0 experiment on the latest community vLLM release with the local community model:
`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
The comparison is:
| Variant | Spec | Harness |
| --- | --- | --- |
| no-harness | `configs/examples/dash0_qwen30b_a3b_community_vllm020_noharness.json` | disabled via `llm.use_harness=false` |
| harness | `configs/examples/dash0_qwen30b_a3b_community_vllm020_harness.json` | enabled, including deterministic stop proposal |
Both specs start from the same base vLLM configuration. The base contains only serving access fields: `host`, `port`, and `served-model-name`. It does not set performance flags such as TP, DP, EP, max model length, prefix cache, chunked prefill, max-num-seqs, max-num-batched-tokens, or gpu-memory-utilization. The first trial therefore measures community vLLM defaults for this model.
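As a sketch, the serving/performance split looks like the following. The `host` and `port` values are assumed for illustration, not taken from the actual spec; the flag names are standard vLLM engine arguments.

```python
# Hypothetical sketch of the shared base spec: only serving access fields are
# set. The performance flags below are deliberately absent, so the first
# trial measures community vLLM defaults. host/port values are assumed.
base_serving_args = {
    "host": "0.0.0.0",
    "port": 8000,
    "served-model-name": "Qwen3-30B-A3B",
}

# Performance flags the base intentionally does NOT set.
unset_perf_flags = {
    "tensor-parallel-size", "data-parallel-size", "enable-expert-parallel",
    "max-model-len", "enable-prefix-caching", "enable-chunked-prefill",
    "max-num-seqs", "max-num-batched-tokens", "gpu-memory-utilization",
}

# The two sets must stay disjoint for trial 1 to be a true defaults baseline.
assert not unset_perf_flags & set(base_serving_args)
```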
The launch environment sets `HOME=/tmp/wjh` and `XDG_CACHE_HOME=/tmp/wjh/.cache` so vLLM, torch.compile, and FlashInfer build caches land on dash0 local storage instead of CPFS. This is a startup/cache placement choice, not a vLLM performance flag.
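A minimal sketch of that environment override, assuming the server is launched from a Python wrapper (the subprocess command is illustrative):

```python
import os

# Cache placement from the note: vLLM, torch.compile, and FlashInfer build
# caches follow HOME/XDG_CACHE_HOME, so pointing both at /tmp keeps them on
# dash0 local storage instead of CPFS.
launch_env = dict(os.environ, HOME="/tmp/wjh", XDG_CACHE_HOME="/tmp/wjh/.cache")

# e.g. subprocess.Popen(["vllm", "serve", ...], env=launch_env)
```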
## vLLM Install
PyPI reported `vllm==0.20.0` as the current community release when checked on 2026-05-02. The dash0 runtime venv lives on local rootfs rather than CPFS, because installing torch/CUDA wheels into CPFS was I/O-bound:
`/tmp/wjh/venvs/vllm-0.20.0-cu129`
The first plain `pip install vllm==0.20.0` smoke test pulled `torch 2.11.0+cu130` and failed on dash0's driver (`570.133.20`, CUDA 12.9). The active install instead uses the vLLM 0.20.0 GitHub release `+cu129` wheel together with the PyTorch CUDA 12.9 index, matching vLLM's documented CUDA 12.9 install path for this driver.
Install log:
`/home/admin/cpfs/wjh/aituner/aituner/logs/install_vllm_0.20.0_20260502.log`
## Workload
The experiment reuses the 0-8k chat window that has already been used for qwen27b harness work:
| Field | Value |
| --- | --- |
| window | `chat_w20260311_1000` |
| source rows | 32606 |
| input filter | 0 to 8192 tokens |
| completion tokens | fixed 128 via `trace.completion_tokens_override` |
| max requests per probe | 512 |
| replay time scale | 0.1 |
| target pass rate | 0.95 |
| TTFT SLO | 2s up to 4k, 4s up to 32k, 6s above |
| TPOT SLO | 50ms |
| search high | 0.125 sampling_u |
| max probes per trial | 4 |
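The tiered TTFT SLO can be read as a step function of input length. A hypothetical helper (the 4k/32k breakpoints are taken as 4096/32768 tokens; the real harness may express this differently):

```python
def ttft_slo_seconds(input_tokens: int) -> float:
    """Tiered TTFT SLO from the workload table: 2 s up to 4k input
    tokens, 4 s up to 32k, 6 s above. Illustrative helper only."""
    if input_tokens <= 4096:
        return 2.0
    if input_tokens <= 32768:
        return 4.0
    return 6.0
```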
The `max_requests_per_probe=512` cap keeps the fresh community-vLLM ablation practical while preserving a real trace-shaped replay, SLO scoring, and binary-search threshold probe. A trace-only count check gives 31 to 65 selected requests across the six binary-search thresholds, avoiding the invalid low-cap case where early thresholds can select zero requests.
The first full-output attempt showed why a bounded replay is needed for a 12-iteration ablation: at the first threshold (`0.0625`), the 31 selected requests contained 14,849 output tokens with `out_max=2981`, making a single probe too slow for a full no-harness/harness pair to finish. A first out128 attempt with `replay_time_scale=1.0` was still bounded by real window time, so each probe waited close to the original window duration. The active ablation therefore fixes output length at 128 tokens, uses `replay_time_scale=0.1`, and limits each trial to four binary-search probes. `load_trace_requests` scales both request arrivals and the window duration, so the reported request rates are the actual compressed-replay request rates. This changes the load/decode mix, so the result should be read as a community-vLLM harness convergence test under a bounded, time-compressed chat replay, not as a full-output production benchmark.
## Harness Update Under Test
This run tests a stricter early-stop harness:
- The harness still injects L-C-A workload features, recent trial diagnostics, active bottleneck, legal topology candidates, tested signatures, and knob-family rules.
- A strong incumbent no longer means immediate stop. It means "validate nearby alternatives".
- Deterministic stop is allowed only after completed validation evidence says continuing is unlikely to be useful:
  - the incumbent beats baseline by a generic large-gain ratio,
  - at least two post-incumbent validation trials have run,
  - those validation trials did not produce a feasible per-GPU improvement,
  - the validation covered topology and runtime families, or accumulated at least three post-incumbent validation attempts.
- If the stop guard fires, `study tune` writes `harness-stop-XXXX` and exits without spending another GPU trial or asking the LLM for another proposal.
- A single-family all-infeasible plateau is not enough to stop deterministically. It only blocks repeating that family; the LLM must either justify a different family or later satisfy the validation/convergence stop rule.
- A search-high saturation guard stops immediately when the incumbent's highest measured probe is feasible and is within the configured binary-search resolution of `search.high`. A feasible probe may still contain individual SLO failures as long as it meets the configured pass-rate target. In that case the current study cannot measure a better config without increasing the workload search range, so more config proposals only waste tuning iterations.
This is a generic harness rule, not a testcase-specific threshold. It does not depend on qwen27b, qwen235b, qwen30b, a fixed TP/DP value, or a hardcoded SLO number.
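As a sketch (illustrative names, not the harness's actual API), the validation/convergence stop rule reduces to a predicate like:

```python
def validation_stop(gain_vs_baseline: float,
                    large_gain_ratio: float,
                    post_incumbent_trials: int,
                    any_feasible_improvement: bool,
                    covered_topology_and_runtime: bool,
                    min_validation_trials: int = 2,
                    exhaustion_attempts: int = 3) -> bool:
    """Sketch of the deterministic stop rule described above."""
    if gain_vs_baseline < large_gain_ratio:
        return False  # incumbent gain over baseline is not large enough
    if post_incumbent_trials < min_validation_trials:
        return False  # not enough post-incumbent validation evidence yet
    if any_feasible_improvement:
        return False  # validation found a better feasible per-GPU point
    # coverage: both knob families probed, or validation attempts exhausted
    return covered_topology_and_runtime or post_incumbent_trials >= exhaustion_attempts
```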
## Unit Tests
Local test command:
```bash
PYTHONPATH=src python3 -m unittest tests.test_core_flow -q
```
Result at the time of this note: passed. The current repository test count may be higher; use the command above as the source of truth.
The added coverage checks:
| Test | Purpose |
| --- | --- |
| `test_harness_does_not_stop_immediately_after_strong_incumbent` | strong incumbent requires validation first |
| `test_harness_stop_after_post_incumbent_validation_is_exhausted` | deterministic stop after validation exhaustion |
| `test_cli_tune_uses_harness_stop_before_llm` | `study tune` can stop without calling the LLM or launching another GPU trial |
| `test_prompt_can_disable_harness_for_ablation` | no-harness prompt removes structured harness context |
| `test_harness_stop_when_incumbent_saturates_search_high` | deterministic stop when the incumbent saturates the configured workload search high |
| `test_harness_guided_first_tp_probe_for_latency_bottleneck` | deterministic first TP probe after baseline latency bottleneck evidence |
| `test_harness_guided_runtime_seed_preserves_tp_incumbent` | deterministic same-topology runtime refinement after a TP incumbent |
## Experiment Tracking
Completed dash0 runs:
| Variant | tmux session | Log | Study root |
| --- | --- | --- | --- |
| no-harness | `qwen30b_vllm020_noharness_out128_scale01_20260502` | `logs/qwen30b_vllm020_noharness_out128_scale01_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-noharness` |
| harness | `qwen30b_vllm020_harness_highstop_gpu4_7_20260502` | `logs/qwen30b_vllm020_harness_highstop_gpu4_7_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-harness-highstop-gpu4-7` |
The harness run should be judged by best-so-far `request_rate_per_gpu` per tuning iteration, plus whether it stops only after validation evidence. The no-harness run should use the same trial budget so the ablation exposes whether the early-stop harness saves iterations without hiding a later better point.
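"Best-so-far" is just a running maximum over measured per-GPU rates, with failed launches contributing no new measurement. A minimal sketch using the no-harness per-iteration numbers from this note:

```python
from itertools import accumulate

# request_rate_per_gpu measured per iteration; None marks a launch failure
# (no new measurement, so the running best is unchanged).
measured = [1.0333, 0.5167] + [None] * 10
best_so_far = list(accumulate((m if m is not None else 0.0 for m in measured), max))
# Every iteration's best stays at the iter-1 value of 1.0333.
```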
## Results
Metric: best-so-far `request_rate_per_gpu` under the bounded, time-compressed replay.
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 |
| harness | 1.0333 | 1.0333 stop | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 |
Actual per-iteration outcomes:
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 1.0333 | 0.5167 | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail |
| harness | 1.0333 | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop |
Interpretation:
- The best config is the default community vLLM config for this bounded replay. It reaches the configured search high: the last baseline probe at `sampling_u=0.1171875` is feasible, has pass rate `1.0`, and has no TTFT/TPOT SLO failures. With `search.high=0.125` and `max_probes=4`, this is exactly one binary-search resolution below the configured high.
- The harness stops at iter 2 without calling the LLM or launching another GPU trial. This is not a claim that the engine is globally optimal; it is a claim that the current study cannot measure an improvement until `search.high` is increased.
- No-harness spends all 12 tuning iterations anyway. Iter 2 changes to DP=2 and halves per-GPU throughput (`0.5167`). Iters 3-12 are launch failures from unguided or weakly guided proposals.
- The harness therefore reaches the best measured config in one executed GPU trial and saves 11 tuning iterations on this setup.
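The saturation arithmetic can be reproduced directly: with `search.high=0.125` and four probes that all come back feasible, the binary search visits exactly the thresholds named, and the resolution is `high / 2**max_probes`. A sketch of just the arithmetic, not the tuner's real probe loop:

```python
def feasible_probe_sequence(high: float, max_probes: int) -> list[float]:
    """Midpoints visited when every probe is feasible (low keeps rising)."""
    low, seq = 0.0, []
    for _ in range(max_probes):
        mid = (low + high) / 2
        seq.append(mid)
        low = mid  # feasible -> raise the lower bound
    return seq

probes = feasible_probe_sequence(0.125, 4)
resolution = 0.125 / 2 ** 4
# probes[-1] == 0.1171875: exactly one resolution (0.0078125) below the high,
# which is what trips the search-high saturation guard.
```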
Operational note:
- The no-harness run left driver-side orphan GPU memory on GPU0/1 after repeated launch failures. An earlier pre-high-stop harness attempt left the same kind of residue on GPU2/3. The final harness run was executed on dash0 GPU4-7 via a runtime-derived spec to avoid this contamination. Its executed GPU trial used a single H20, matching the no-harness best trial's single-GPU default configuration.
## High=1.0 Rerun
The `search.high=0.125` run answered only "can this config handle up to about 1.08 req/s in the compressed replay?" It could not answer "which config is best?" because the default config already reached the measurement ceiling.
Trace request counts after raising `search.high` show the difference:
| search.high | Near-top selected requests | Near-top request rate |
| ---: | ---: | ---: |
| 0.125 | 65 | 1.0833 req/s |
| 0.25 | 141 | 2.3500 req/s |
| 0.5 | 269 | 4.4833 req/s |
| 1.0 | 502 | 8.3667 req/s |
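These rates are consistent with a compressed replay window of 60 s, i.e. a ~600 s source window at `replay_time_scale=0.1`. The 600 s figure is inferred from the table, not stated elsewhere in this note:

```python
# Near-top request rate = selected requests / compressed window length.
window_s = 600 * 0.1  # source window duration inferred from the table
selected = {0.125: 65, 0.25: 141, 0.5: 269, 1.0: 502}
rates = {u: round(n / window_s, 4) for u, n in selected.items()}
# Reproduces the table: 1.0833, 2.35, 4.4833, 8.3667 req/s.
```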
The high=1.0 run used the same bounded replay (`completion_tokens_override=128`, `replay_time_scale=0.1`, `max_requests_per_probe=512`) but set `search.high=1.0` and `max_probes=6`.
Completed dash0 high=1.0 runs:
| Variant | tmux session | Study root |
| --- | --- | --- |
| no-harness | `qwen30b_vllm020_noharness_high1_20260506` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-high1-noharness` |
| harness-guided-v2 | `qwen30b_vllm020_harness_high1_guided_v2_20260506` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-high1-harness-guided-v2` |
Metric: best-so-far `request_rate_per_gpu`.
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 2.2000 | 3.2583 | 3.2583 | 3.2583 | 3.2583 | 3.3000 | 3.3500 | 3.3500 | 3.3500 | 3.3500 | 3.3500 | 3.3500 |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | 3.3000 stop | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 |
Actual per-iteration outcomes:
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8-12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| no-harness | 2.2000 | 3.2583 | launch fail | infeasible | 1.1042 | 3.3000 | 3.3500 | infeasible |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | stop | stop | stop | stop |
Interpretation:
- Raising `search.high` was necessary. The default config was not optimal under the expanded workload range; `TP=2` immediately improved per-GPU throughput from about `2.2` to `3.2583`.
- The updated harness now provides deterministic proposals, not just early stop:
  - iter 2: adjacent TP probe (`tensor-parallel-size=2`),
  - iter 3: same-topology runtime seed (`gpu-memory-utilization=0.95`, chunked prefill, `max-num-batched-tokens=16384`),
  - iter 4: controlled MBT growth to `24576`.
- No-harness reached the same config family at iter 7, after an EP launch failure, an infeasible DP probe, a poor TP/DP probe, and then runtime refinement.
- Harness reached the same config family at iter 4 and stopped at iter 5. Its measured best was `3.3000`, while no-harness measured `3.3500` for the same `TP=2 + MBT=24576` family; the 1.5% gap is within the observed boundary/noise of repeated high-load replay. The convergence claim is therefore "same config family in fewer iterations", not an exact higher single-run number.
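For reference, the quoted 1.5% is just the relative gap between the two measured bests:

```python
# Relative gap between the no-harness and harness best measured rates.
no_harness_best, harness_best = 3.3500, 3.3000
gap = (no_harness_best - harness_best) / no_harness_best
# gap rounds to 0.0149, i.e. about 1.5%
```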