Document high search rerun

2026-05-06 03:19:51 +08:00
parent 0622e23817
commit cf2e741550
2 changed files with 69 additions and 1 deletion

View File

@@ -63,6 +63,10 @@ The speedup comes from reducing wasted proposal families, not from changing the
- If the incumbent's highest measured probe is feasible, has no SLO failures, and is within the configured binary-search resolution of `search.high`, the harness stops before asking the LLM for another proposal.
- This is not a model-specific threshold. It means the workload search range, not the engine config, is currently the limiting measurement bound.
6. Deterministic first probes
- After the baseline run shows a latency bottleneck, the harness can propose the adjacent legal TP increase before asking the LLM.
- After a TP incumbent improves per-GPU throughput, the harness keeps that topology and applies a same-topology runtime seed before trying DP/EP or broad runtime changes (a minimal sketch of these rules follows this list).
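The stop rule and the two deterministic first probes can be summarized in a short sketch. This is an illustration only: the helper names (`Probe`, `should_stop`, `first_tp_probe`, `runtime_seed_for_tp_incumbent`) and the doubling step for the "adjacent legal TP increase" are assumptions, not the harness's real API; only the runtime-seed values come from the measured runs documented below.
```
# Illustrative sketch of the stop rule and deterministic first probes; all
# names and signatures are assumptions, not the harness's actual API.
from dataclasses import dataclass

@dataclass
class Probe:
    request_rate: float   # measured request rate of the probe
    feasible: bool        # probe completed without engine failure
    slo_failures: int     # SLO violations observed during the probe

def should_stop(best_probe: Probe, search_high: float, resolution: float) -> bool:
    """Stop before the next LLM proposal when the incumbent's highest measured
    probe is feasible, clean, and within one binary-search step of the
    configured workload ceiling."""
    return (best_probe.feasible
            and best_probe.slo_failures == 0
            and search_high - best_probe.request_rate <= resolution)

def first_tp_probe(bottleneck: str, tp: int, gpus_available: int):
    """After a baseline latency bottleneck, propose the adjacent legal TP
    increase (assumed here to be a doubling) before asking the LLM."""
    if bottleneck == "latency" and tp * 2 <= gpus_available:
        return {"tensor-parallel-size": tp * 2}
    return None

def runtime_seed_for_tp_incumbent(incumbent_args: dict) -> dict:
    """Keep the incumbent topology and seed same-topology runtime knobs
    before trying DP/EP or broad runtime changes."""
    seed = dict(incumbent_args)
    seed.update({"gpu-memory-utilization": 0.95,
                 "enable-chunked-prefill": True,
                 "max-num-batched-tokens": 16384})
    return seed
```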
## qwen27b 0-8k Evidence
Source: `docs/qwen27b-chat-0-8k-harness-fig18.md`.
@@ -93,6 +97,8 @@ Source: `docs/qwen30b-community-vllm020/harness-early-stop-ablation-20260502.md`
Metric: best-so-far feasible `request_rate_per_gpu` on the bounded 0-8k chat replay with 128 output tokens and `replay_time_scale=0.1`.
Initial `search.high=0.125` result:
| Variant | Iter 1 | Iter 2 | Iter 3-12 |
| --- | ---: | ---: | ---: |
| no-harness | 1.0333 | 1.0333 | 1.0333 |
@@ -103,3 +109,17 @@ Result:
- Both variants found the same best measured config: the default community vLLM launch.
- Harness stopped at iter 2 because the incumbent saturated `search.high`; no LLM proposal or GPU trial was needed after baseline.
- No-harness spent the full 12-iteration budget: iter 2 was worse per GPU, and iter 3-12 were launch failures.
- This was a measurement-ceiling result, not proof of global optimality.
High `search.high=1.0` rerun:
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8-12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 2.2000 | 3.2583 | 3.2583 | 3.2583 | 3.2583 | 3.3000 | 3.3500 | 3.3500 |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | 3.3000 stop | 3.3000 | 3.3000 | 3.3000 |
Result:
- Raising `search.high` showed that the default vLLM launch was not actually optimal; the prior run was capped by the workload search range, not by the engine config.
- Harness reached the same TP2/runtime config family in 4 iterations instead of 7 by making deterministic first TP and same-topology runtime proposals.
- The single-run best value differs by about 1.5% (`3.3000` vs `3.3500`) for the same config family, so this should be interpreted as faster convergence to the same region, not an exact single-run throughput win.
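The 1.5% figure is simply the relative gap between the two quoted single-run bests; a one-line check:
```
# Relative gap between the two single-run bests quoted above.
no_harness_best, harness_best = 3.3500, 3.3000
print(f"{(no_harness_best - harness_best) / harness_best:.1%}")  # 1.5%
```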

View File

@@ -76,7 +76,7 @@ Local test command:
```
PYTHONPATH=src python3 -m unittest tests.test_core_flow -q
```
Result: passed, 75 tests.
Result: passed, 77 tests.
The added coverage checks:
@@ -87,6 +87,8 @@ The added coverage checks:
| `test_cli_tune_uses_harness_stop_before_llm` | `study tune` can stop without calling the LLM or launching another GPU trial |
| `test_prompt_can_disable_harness_for_ablation` | no-harness prompt removes structured harness context |
| `test_harness_stop_when_incumbent_saturates_search_high` | deterministic stop when the incumbent saturates the configured workload search high |
| `test_harness_guided_first_tp_probe_for_latency_bottleneck` | deterministic first TP probe after baseline latency bottleneck evidence |
| `test_harness_guided_runtime_seed_preserves_tp_incumbent` | deterministic same-topology runtime refinement after a TP incumbent |
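For illustration, the saturation-stop coverage has roughly the following shape; the test body, values, and the inline stop condition are assumptions for this sketch, not the repository's actual test code.
```
# Rough shape of the saturation-stop check; names and values are illustrative.
import unittest

class TestHarnessStop(unittest.TestCase):
    def test_stop_when_incumbent_saturates_search_high(self):
        search_high, resolution = 1.0, 0.05
        best_probe = {"request_rate": 0.98, "feasible": True, "slo_failures": 0}
        # Incumbent's highest probe is feasible, clean, and within one
        # binary-search step of search.high, so the harness should stop
        # without another LLM proposal or GPU trial.
        stop = (best_probe["feasible"]
                and best_probe["slo_failures"] == 0
                and search_high - best_probe["request_rate"] <= resolution)
        self.assertTrue(stop)

if __name__ == "__main__":
    unittest.main()
```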
## Experiment Tracking
@@ -125,3 +127,49 @@ Interpretation:
Operational note:
- The no-harness run left driver-side orphan GPU memory on GPU0/1 after repeated launch failures. An earlier pre-high-stop harness attempt left the same kind of residue on GPU2/3. The final harness run was executed on dash0 GPU4-7 via a runtime-derived spec to avoid this contamination. Its executed GPU trial used a single H20, matching the no-harness best trial's single-GPU default configuration.
## High=1.0 Rerun
The `search.high=0.125` run answered only "can this config handle up to about 1.08 req/s in the compressed replay?" It could not answer "which config is best?" because the default config already reached the measurement ceiling.
Trace request counts after raising `search.high` show the difference:
| search.high | Near-top selected requests | Near-top request rate |
| ---: | ---: | ---: |
| 0.125 | 65 | 1.0833 req/s |
| 0.25 | 141 | 2.3500 req/s |
| 0.5 | 269 | 4.4833 req/s |
| 1.0 | 502 | 8.3667 req/s |
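The rate column is the selected request count divided by the near-top probe window; the ratios in the table are consistent with a roughly 60-second window, though that window length is inferred from the numbers rather than stated here.
```
# Reproduce the rate column from the request counts. The 60 s near-top window
# is an inference from the table's ratios, not documented in this commit.
counts = {0.125: 65, 0.25: 141, 0.5: 269, 1.0: 502}
window_s = 60.0  # assumption
for high, n in counts.items():
    print(f"search.high={high}: {n / window_s:.4f} req/s")
# 1.0833, 2.3500, 4.4833, 8.3667 -- matching the table
```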
The high=1.0 run used the same bounded replay (`completion_tokens_override=128`, `replay_time_scale=0.1`, `max_requests_per_probe=512`) but set `search.high=1.0` and `max_probes=6`.
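Collected in one place, the rerun's settings look roughly like the following; the grouping and key layout are illustrative, and only the values come from the text above.
```
# High=1.0 rerun settings quoted above; layout and nesting are illustrative.
high1_rerun = {
    "completion_tokens_override": 128,  # fixed output length per request
    "replay_time_scale": 0.1,           # compress the replay timeline 10x
    "max_requests_per_probe": 512,
    "search": {"high": 1.0},            # raised from 0.125
    "max_probes": 6,
}
```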
Completed dash0 high=1.0 runs:
| Variant | tmux session | Study root |
| --- | --- | --- |
| no-harness | `qwen30b_vllm020_noharness_high1_20260506` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-high1-noharness` |
| harness-guided-v2 | `qwen30b_vllm020_harness_high1_guided_v2_20260506` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-high1-harness-guided-v2` |
Metric: best-so-far `request_rate_per_gpu`.
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 2.2000 | 3.2583 | 3.2583 | 3.2583 | 3.2583 | 3.3000 | 3.3500 | 3.3500 | 3.3500 | 3.3500 | 3.3500 | 3.3500 |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | 3.3000 stop | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 | 3.3000 |
Actual per-iteration outcomes:
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8-12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| no-harness | 2.2000 | 3.2583 | launch fail | infeasible | 1.1042 | 3.3000 | 3.3500 | infeasible |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | stop | stop | stop | stop |
Interpretation:
- Raising `search.high` was necessary. The default config was not optimal under the expanded workload range; `TP=2` immediately improved per-GPU throughput from about `2.2` to `3.2583`.
- The updated harness now provides deterministic proposals, not just early stop:
- iter2: adjacent TP probe (`tensor-parallel-size=2`),
- iter3: same-topology runtime seed (`gpu-memory-utilization=0.95`, chunked prefill, `max-num-batched-tokens=16384`),
- iter4: controlled MBT growth to `24576`.
- No-harness reached the same config family at iter7, after an EP launch failure, an infeasible DP probe, a poor TP/DP probe, and then runtime refinement.
- Harness reached the same config family at iter4 and stopped at iter5. Its measured best was `3.3000`, while no-harness measured `3.3500` for the same `TP=2 + MBT=24576` family; the 1.5% gap is within the observed search-boundary resolution and run-to-run noise of repeated high-load replay. The convergence claim is therefore "same config family in fewer iterations", not a higher single-run throughput number.
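Written as flag deltas on top of the baseline launch, the harness's deterministic sequence is sketched below; the flag spellings follow the text above, while the baseline launch and the exact CLI serialization are not reproduced in this commit, so treat this as an illustration of the ordering only.
```
# Deterministic proposal sequence (iter 2-4) as deltas over the baseline launch.
proposals = [
    {"tensor-parallel-size": 2},          # iter 2: adjacent TP probe
    {"gpu-memory-utilization": 0.95,      # iter 3: same-topology runtime seed
     "enable-chunked-prefill": True,
     "max-num-batched-tokens": 16384},
    {"max-num-batched-tokens": 24576},    # iter 4: controlled MBT growth
]
```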