Drain inflight requests after early stop

2026-04-25 16:57:01 +08:00
parent 2dc2815620
commit 2d7ebe50ee
3 changed files with 59 additions and 12 deletions

@@ -61,6 +61,21 @@ Improve AITuner convergence for the `dash0` internal vLLM + Qwen3.5-27B 0-8k cha
- Remote `compileall` passed.
- Remote `unittest discover` initially exposed two pre-existing path-sensitive tests that hardcoded `/home/gahow/phd/aituner`; fixed them to derive `REPO_ROOT` from the test file path.
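A minimal sketch of that kind of fix, assuming the test files sit two levels below the repository root (the exact `parents[...]` depth shown here is an assumption, not the actual test code):

```python
from pathlib import Path

# Derive the repository root from the test file's own location instead of
# hardcoding /home/gahow/phd/aituner. The parents[...] index depends on how
# deep the test file sits in the tree (assumed to be two levels here).
REPO_ROOT = Path(__file__).resolve().parents[2]
```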
### 2026-04-25 16:38-16:58 CST
- Started real run in tmux session `aituner_harness_qwen27b_0_8k_20260425`.
- Store root: `.aituner/harness-studies-20260425`.
- First proposal followed the harness:
- proposal: `tensor-parallel-size: 2`;
- rationale: the L profile is prefill-sensitive, prefix reuse is low, and arrivals are smooth, so probe the adjacent TP setting before touching runtime batching knobs.
- First high-load probe at `sampling_u=0.03125` was infeasible:
- request rate 0.895 req/s;
- pass rate 0.145;
- p95 TTFT 4063 ms and p95 TPOT 113 ms;
- failure reasons included `tpot_ms>50.0` and `slo_pass_rate_unrecoverable`.
- Important implementation issue found: after an early-stopped probe, the worker returned while HTTP requests were still in flight, so those requests could keep occupying the engine and stall or pollute the next binary-search probe.
- Action: stopped the run and freed the GPUs. Updating `worker._replay_requests` to drain in-flight requests after an early stop, before the next probe starts (see the sketch below).
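A minimal sketch of what that drain could look like, assuming the worker tracks its in-flight requests as asyncio tasks (the helper name `_drain_inflight`, the `drain_timeout_s` budget, and the task-set shape are all assumptions, not the actual `worker._replay_requests` code):

```python
import asyncio

async def _drain_inflight(pending_tasks: set[asyncio.Task], drain_timeout_s: float = 30.0) -> None:
    """After an early stop, wait for in-flight request tasks to finish (or cancel
    them) so they cannot keep occupying the engine during the next probe."""
    if not pending_tasks:
        return
    # Give already-sent requests a bounded window to complete naturally.
    done, still_pending = await asyncio.wait(pending_tasks, timeout=drain_timeout_s)
    # Anything still running after the window is cancelled outright.
    for task in still_pending:
        task.cancel()
    # Await the cancellations so the event loop is quiescent before the next probe.
    await asyncio.gather(*still_pending, return_exceptions=True)
```

The property that matters is that the probe does not return until every request it launched has either completed or been cancelled, so the engine is idle when the next binary-search probe starts.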
Remaining next steps:
1. Start a real harness-guided Qwen3.5-27B 0-8k chat tuning run from `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`.