qwen27b-chat-0-8k Harness Fig18

Setup

  • Workload: qwen3.5-27b chat, 0 <= input_length <= 8192.
  • Window: chat_w20260311_1000.
  • Engine: dash0 internal vLLM, baseline aligned to run_qwen27b.sh.
  • SLO: 95% pass rate, stepped TTFT 2s/4s/6s, TPOT <=50ms (see the feasibility sketch after this list).
  • Search metric: best-so-far feasible request_rate_per_gpu.
  • Before-harness source: actual 12-trial run .aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology.
  • After-harness source: strict harness replay over already measured run9 configs:
    • Iter 1 uses the measured baseline trial.
    • Iter 2 uses the current harness proposal after seeing only iter 1 history. It proposes TP=2, DP=1; its performance is taken from the measured run9 trial-0004 result for the same config and spec.
    • Iter 3 uses the current harness proposal after seeing only baseline + TP=2, DP=1. With the strong-incumbent guard, it returns should_stop=true.
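
A feasibility check in the spirit of this SLO might look like the sketch below. The 2048/4096 bucket boundaries are assumptions for illustration; the spec above only fixes the 2s/4s/6s TTFT steps, the TPOT <=50ms bound, and the 95% pass rate.

```python
# Illustrative feasibility check for the stepped SLO. The input-length
# bucket boundaries (2048/4096) are assumptions for this sketch.
def ttft_limit_s(input_len: int) -> float:
    if input_len <= 2048:
        return 2.0
    if input_len <= 4096:
        return 4.0
    return 6.0

def trial_is_feasible(requests: list[dict]) -> bool:
    # requests: [{"input_len": int, "ttft_s": float, "tpot_ms": float}, ...]
    passed = sum(
        r["ttft_s"] <= ttft_limit_s(r["input_len"]) and r["tpot_ms"] <= 50.0
        for r in requests
    )
    # A trial is feasible when at least 95% of its requests meet both bounds.
    return passed >= 0.95 * len(requests)
```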

The replay is intentionally strict: the LLM prompt does not receive future best_by_parallel_size entries or later failed trials.
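
A minimal sketch of this history-truncated replay loop; strict_replay, propose, and the measured-rate table are illustrative names, not the harness's actual API.

```python
# Hypothetical replay driver: at each iteration the proposer sees only the
# trials measured so far, never future run9 results.
from typing import Callable

def strict_replay(
    measured: dict[str, float],             # config key -> measured request_rate_per_gpu
    propose: Callable[[list[dict]], dict],  # LLM proposal, given only past history
    baseline_key: str,
    max_iters: int = 12,
) -> list[dict]:
    history = [{"config": baseline_key, "rate": measured[baseline_key]}]
    for _ in range(1, max_iters):
        proposal = propose(list(history))   # no future best_by_parallel_size, no later failures
        if proposal.get("should_stop"):
            break
        key = proposal["config"]
        if key not in measured:
            # Strictness: only configs already measured in run9 can be replayed.
            raise KeyError(f"config {key} was never measured in run9")
        history.append({"config": key, "rate": measured[key]})
    return history
```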

Fig18-Style Best-So-Far Curve

Unit: feasible request_rate_per_gpu. Infeasible trials leave the best-so-far value unchanged.
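
For reference, a small sketch of how the best-so-far curve is computed; infeasible trials contribute None and leave the running best unchanged.

```python
def best_so_far(rates: list[float | None]) -> list[float]:
    # rates[i]: feasible request_rate_per_gpu of trial i, or None if infeasible.
    best, curve = 0.0, []
    for rate in rates:
        if rate is not None:
            best = max(best, rate)
        curve.append(best)
    return curve

# Reproduces the before-harness row: infeasible iters 5-12 keep the curve flat.
assert best_so_far([0.0350, 0.0617, 0.0392, 0.2025] + [None] * 8) == (
    [0.0350, 0.0617, 0.0617] + [0.2025] * 9
)
```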

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Before harness, actual run9 | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| After harness, strict replay | 0.0350 | 0.2025 | 0.2025 (stop) | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |

Trial-Level Interpretation

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5-12 |
| --- | --- | --- | --- | --- | --- |
| Before harness | baseline TP=1, DP=1, 0.0350 | DP=2, 0.0617 | DP=4, 0.0392, worse per GPU | TP=2, DP=1, 0.2025, best | runtime-only probes, all infeasible |
| After harness | baseline TP=1, DP=1, 0.0350 | TP=2, DP=1, 0.2025, best | should_stop=true | no GPU trial | no GPU trial |

Convergence Judgment

  • Before harness reaches the final best value at iter 4.
  • After harness reaches the same best value at iter 2.
  • The speedup is 2x by iterations-to-best: 4 -> 2.
  • The harness also avoids the post-best weak proposals: the before-harness run spent iters 5-12 on infeasible runtime-only probes, while the after-harness replay stops at iter 3.
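
The 4 -> 2 comparison can be checked mechanically. A self-contained sketch (it re-derives the best-so-far fold from the earlier snippet):

```python
def iters_to_best(rates: list[float | None]) -> int:
    # 1-indexed iteration at which best-so-far first reaches its final value;
    # None marks an infeasible trial or a trial-free stop iteration.
    best, curve = 0.0, []
    for rate in rates:
        best = max(best, rate or 0.0)
        curve.append(best)
    return next(i for i, v in enumerate(curve, 1) if v == curve[-1])

before = [0.0350, 0.0617, 0.0392, 0.2025] + [None] * 8  # actual run9
after = [0.0350, 0.2025, None]                          # replay; iter 3 is the stop decision
assert (iters_to_best(before), iters_to_best(after)) == (4, 2)
```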

Implementation Changes From This Check

  • Added a strong-incumbent convergence guard (see the sketch after this list):
    • if the latest trial is the incumbent,
    • and it improves request_rate_per_gpu by at least 3x over the baseline,
    • then runtime-only probes require direct same-topology evidence; otherwise the LLM should return should_stop=true.
  • Strengthened the MBT harness guard:
    • do not raise max-num-batched-tokens when incumbent MBT already covers prompt p99 unless same-topology evidence proves prefill fragmentation.
  • Made early-stop engine relaunch opt-in. A real r2 run showed that default relaunch changes warm-state behavior and makes full-chat results incomparable with run9, so the default remains drain-based for comparable production measurements.
  • Added LLM empty-response retry to avoid crashing study tune on a transient empty streamed response.
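
A hypothetical condensation of the strong-incumbent guard above; the function shape and field names are illustrative, not the harness's actual schema.

```python
def strong_incumbent_should_stop(
    latest_config: str,
    incumbent_config: str,
    incumbent_rate: float,
    baseline_rate: float,
    has_same_topology_evidence: bool,
) -> bool:
    # Guard applies only when the latest trial is itself the incumbent and the
    # incumbent beats the baseline by at least 3x on request_rate_per_gpu.
    strong = latest_config == incumbent_config and incumbent_rate >= 3.0 * baseline_rate
    # Runtime-only probes against a strong incumbent need direct same-topology
    # evidence; without it, the LLM should return should_stop=true.
    return strong and not has_same_topology_evidence

# In the run9 replay, 0.2025 / 0.0350 is roughly 5.8x >= 3x, so iter 3 stops.
assert strong_incumbent_should_stop("tp2_dp1", "tp2_dp1", 0.2025, 0.0350, False)
```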

Local and Remote Checks

  • Local: python3 -m compileall -q src tests passed.
  • Local: PYTHONPATH=src python3 -m unittest tests.test_core_flow passed, 63 tests.
  • dash0: python3 -m compileall -q src tests passed.
  • dash0: PYTHONPATH=src python3 -m unittest discover -s tests -p "test_*.py" passed, 63 tests.