Evaluate baseline before LLM tuning

2026-04-25 17:14:05 +08:00
parent 2d7ebe50ee
commit 6c04b9dbbc
3 changed files with 97 additions and 2 deletions
--- a/docs/harness-tuning-progress.md
+++ b/docs/harness-tuning-progress.md
@@ -76,6 +76,12 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
 - Important implementation issue found: after an early-stopped probe, the worker returned while in-flight HTTP requests could continue occupying the engine, stalling/polluting the next binary-search probe.
 - Action: stopped the run and freed GPUs. Updating `worker._replay_requests` to drain in-flight requests after early stop before the next probe starts.

+### 2026-04-25 17:00-17:12 CST
+
+- r2 confirmed that draining avoids immediate cross-probe pollution, but the first LLM trial still started from a speculative TP=2 edit without a measured incumbent.
+- This is not aligned with the paper's agentic loop, which evaluates the initial configuration first and then searches from measured feedback.
+- Action: update `study tune` so LLM-driven studies automatically materialize a baseline empty-patch trial first, unless `--skip-baseline` is passed. This should reduce early bad proposals because the first LLM edit will see real baseline bottleneck diagnostics and an incumbent request_rate_per_gpu.
+
 Remaining next steps:

 1. Start a real harness-guided Qwen3.5-27B 0-8k chat tuning run from `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`.