The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Stop-B end-to-end validation (Phase 5) — 2026-06-15

Branch `feat/two-stop`. Real agentic loop on dash0: Qwen3-30B-A3B / vLLM 0.11.1 /
8×H20, `gpt-5.4` (via codex/prism) proposing configs, Stop-A enabled to accelerate
each evaluation, `use_harness=True` so the Stop-B deterministic validator + LLM-stop
veto are active. Config `dash0_qwen30b_a3b_stopB_e2e.json`, `search.high=1.0`,
`max_probes=6`, `--max-trials 8`.

## Two stop paths exercised

**Run A (`search.high=0.125`)**: the default config already saturates the offered-load
search range, so Stop-B fired immediately via the **search-high-saturation** path:
`stop_authorized_by: validator`, reason *"the incumbent's highest measured probe is
feasible and within the binary-search resolution of search.high."* This is the correct
measurement-ceiling stop: there is no point proposing configs when the load range, not
the config, is the bound.

**Run B (`search.high=1.0`)** forces a real multi-iteration search:

| trial | TP | DP | EP | feasible | raw req/s | **req/s/GPU** | source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0001 | 1 | 1 | 1 | ✓ | 2.90 | **2.900** | baseline |
| 0002 | 2 | 1 | 1 | ✓ | 4.42 | 2.208 | harness TP-seed |
| 0003 | 2 | 1 | 2 | ✗ launch-fail | n/a | n/a | harness (EP) |
| 0004 | 1 | 2 | 1 | ✓ | 4.42 | 2.208 | LLM (after veto) |
| 0005 | 2 | 2 | 1 | ✓ | 8.37 | 2.092 | harness |
| 0006 | 2 | 2 | 2 | ✗ launch-fail | n/a | n/a | harness (EP) |
| 0007 | 4 | 1 | 1 | ✓ | 8.37 | 2.092 | LLM |
| (0008) | n/a | n/a | n/a | **STOP** | n/a | n/a | LLM stop, honored after veto budget |

Incumbent: **trial-0001 (TP1), 2.90 req/s/GPU, never beaten.**

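
As a minimal sketch of how the per-GPU column and the incumbent follow from the raw rates in the Run B table: the `Trial` class and `incumbent` helper are hypothetical names, not the harness's API, and the sketch assumes the GPU count is TP × DP (EP is assumed not to add GPUs). The table's 2.208 presumably comes from an unrounded raw rate, so dividing the rounded 4.42 gives 2.21 here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    """Hypothetical record of one Run B trial (not the harness's own type)."""
    name: str
    tp: int
    dp: int
    raw_req_s: Optional[float]  # None when the config failed to launch

    @property
    def gpus(self) -> int:
        # Assumption: a config occupies TP x DP GPUs.
        return self.tp * self.dp

    @property
    def req_s_per_gpu(self) -> Optional[float]:
        if self.raw_req_s is None:
            return None
        return self.raw_req_s / self.gpus


# Raw rates copied from the Run B table; launch failures carry None.
TRIALS = [
    Trial("0001", 1, 1, 2.90),
    Trial("0002", 2, 1, 4.42),
    Trial("0003", 2, 1, None),
    Trial("0004", 1, 2, 4.42),
    Trial("0005", 2, 2, 8.37),
    Trial("0006", 2, 2, None),
    Trial("0007", 4, 1, 8.37),
]


def incumbent(trials):
    """Return the feasible trial with the highest per-GPU request rate."""
    feasible = [t for t in trials if t.req_s_per_gpu is not None]
    return max(feasible, key=lambda t: t.req_s_per_gpu)


best = incumbent(TRIALS)
print(best.name, round(best.req_s_per_gpu, 2))
```

Under these assumptions the baseline `0001` stays the incumbent even though trials 0005 and 0007 have almost three times its raw throughput.
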
> **⚠️ The per-GPU trajectory above is NOT a valid benchmark: it validates only
> the Stop-B *mechanics*.** Two confounds:
>
> 1. **Trace-ceiling saturation.** TP2·DP2 and TP4 reached `best_sampling_u≈0.98`
>    (still feasible after consuming almost the whole window), so their *true* peak
>    per-GPU rate is higher than the 2.09 shown: we ran out of offered load to push
>    them to their boundary. Only TP1 (u=0.31), TP2 (u=0.48) and DP2 (u=0.48)
>    found real boundaries. The `sampling_u` axis maxes out at the full trace, so any
>    config that sustains more than the window's offered rate cannot be measured.
> 2. **Smoke regime.** This run inherited `replay_time_scale=0.1` +
>    `max_requests_per_probe=512` (README: convergence test, *not* a benchmark);
>    compressed arrivals distort A and the 512 cap imposes a ~8.4 req/s ceiling.
>
> The below-ceiling TP1 (2.90) > TP2 (2.21) ordering *may* be real for this model
> (Qwen3-30B-A3B is an MoE with ~3B active params → little compute per token → TP
> adds all-reduce overhead with little benefit), which differs from the dense
> Qwen3.5-27B where TP2 wins. But this run cannot establish it. A valid benchmark
> needs `scale=1.0`, no cap, and enough offered-load headroom that strong configs
> are not trace-saturated; see the 27B TP A/B follow-up.

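
The trace-ceiling confound can be made mechanical with a small check. The sketch below is illustrative only: `SATURATION_U` is a hypothetical cutoff chosen for this example (the run itself does not define one), and the `best_sampling_u` values are the ones quoted in the warning above.

```python
# Hypothetical cutoff: treat a config as trace-saturated when its best
# feasible probe consumed at least this fraction of the trace window.
SATURATION_U = 0.95

# best_sampling_u per config, as quoted in the warning above.
best_sampling_u = {
    "TP1": 0.31,
    "TP2": 0.48,
    "DP2": 0.48,
    "TP2-DP2": 0.98,
    "TP4": 0.98,
}


def is_trace_saturated(u: float, threshold: float = SATURATION_U) -> bool:
    """True when the measured peak is only a lower bound on the real peak."""
    return u >= threshold


lower_bound_only = sorted(
    name for name, u in best_sampling_u.items() if is_trace_saturated(u)
)
print(lower_bound_only)
```

Any config this flags has a reported per-GPU rate that should be read as "at least", not "equal to", which is why the TP2·DP2 and TP4 rows cannot be ranked against TP1.
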
## Phase-5 acceptance

- **No regression.** The primary metric `request_rate_per_gpu` stayed 2.90 the whole
  run. Scaling TP/DP raised *raw* throughput (4.42, 8.37) but lowered per-GPU
  efficiency (2.21, 2.09); the loop correctly kept the TP1 baseline as incumbent and
  never adopted a worse-per-GPU config. (Matches the paper: long-prompt, low-reuse
  chat prefers small TP for per-GPU efficiency.)
- **Stop-B authority validated live.** At trial 4 the LLM tried to stop
  (`should_stop=true`); the deterministic validator **vetoed** it
  (`validator_did_not_authorize_stop`, `continue_harness_guided_search`), forcing one
  more confirmation (DP2, which also failed to beat baseline). After the budget, the
  LLM's later, well-justified stop was honored (`stop_authorized_by:
  llm_after_veto_budget`). The bounded veto behaved exactly as designed.
- **Auditable reason chain.** Every stop/veto carries a diagnosis grounded in the
  measured evidence (e.g. *"increasing TP 1→2 lowers per-GPU efficiency even though
  token latency improves … EP is explicitly blocked by launch-failure evidence"*).
- **Launch-failure robustness.** Two EP configs (trial-0003, 0006) failed to launch
  under vLLM 0.11.1; the harness recorded them as hard-negative evidence and the LLM
  explicitly stopped proposing EP.

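
The bounded-veto behaviour described in the acceptance list can be sketched as a small arbiter. This is a hedged reconstruction, not the harness's code: the `StopArbiter` class, its `decide` method, and the default budget of one veto are assumptions made for illustration (the run shows exactly one veto before the LLM stop was honored), and it assumes the validator can authorize a stop on its own, as in Run A. Only the string labels are taken from the run's logs.

```python
from dataclasses import dataclass


@dataclass
class StopArbiter:
    """Hypothetical arbiter between the LLM's stop request and the validator."""
    veto_budget: int = 1   # assumption: one veto, as observed in this run
    vetoes_used: int = 0

    def decide(self, llm_should_stop: bool, validator_authorizes: bool) -> dict:
        # A deterministic validator stop always wins.
        if validator_authorizes:
            return {"stop": True, "stop_authorized_by": "validator"}
        # Nobody asked to stop: keep searching.
        if not llm_should_stop:
            return {"stop": False, "action": "continue"}
        # The LLM wants to stop without validator backing: veto while budget remains.
        if self.vetoes_used < self.veto_budget:
            self.vetoes_used += 1
            return {
                "stop": False,
                "veto": "validator_did_not_authorize_stop",
                "action": "continue_harness_guided_search",
            }
        # Budget exhausted: defer to the LLM.
        return {"stop": True, "stop_authorized_by": "llm_after_veto_budget"}


arbiter = StopArbiter()
first = arbiter.decide(llm_should_stop=True, validator_authorizes=False)
second = arbiter.decide(llm_should_stop=True, validator_authorizes=False)
print(first["action"], second["stop_authorized_by"])
```

Keeping the budget finite is the design point: the validator can force extra confirmation trials, but it cannot hold the loop open indefinitely when its own deterministic conditions never fire.
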
## Notes / limitations

- For this workload the baseline (TP1) is already per-GPU optimal, so
  iterations-to-*best* = 1; the remaining trials are the loop *confirming* that no
  config beats baseline before stopping. A workload with an under-tuned default would
  show an improving trajectory; this run validates the stop/no-regression machinery,
  not a tuning win.
- The final stop came via `llm_after_veto_budget` (validator vetoed once, then
  deferred), not a pure deterministic validator stop, because the deterministic
  conditions (3-within-2%, saturation, validation-exhausted) did not cleanly fire
  when every trial was a distinct config with a distinct per-GPU rate. The validator
  acted as the *guard* (preventing premature stop), which is its designed role.
- 7 trials > the paper's 3–6 average, inflated by the wider search range, 2 EP
  launch-failures, and the veto. Acceptable for a validation run.
- LLM token: the non-interactive shell lacks `OPENAI_API_KEY`; export it from the
  codex `auth.json` (`~/.codex/auth.json`) before the run.
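
To illustrate why the 3-within-2% condition mentioned in the notes above did not fire, here is one possible reading of it, stated as an assumption: the condition holds when the three most recent feasible per-GPU rates all lie within 2% of the best rate seen so far. The function name and this exact definition are hypothetical; the validator's real rule may differ.

```python
def three_within_two_percent(per_gpu_rates, window=3, tolerance=0.02):
    """Assumed plateau test: the last `window` rates are within `tolerance` of the best."""
    if len(per_gpu_rates) < window:
        return False
    best = max(per_gpu_rates)
    recent = per_gpu_rates[-window:]
    return all((best - rate) / best <= tolerance for rate in recent)


# Feasible per-GPU rates from Run B, in trial order (launch failures omitted).
run_b = [2.900, 2.208, 2.208, 2.092, 2.092]
print(three_within_two_percent(run_b))
```

With the Run B rates the recent trials sit roughly 24% to 28% below the baseline, so under this reading the plateau test stays false and the stop has to come from another path.
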