diff --git a/docs/aituner-harness-summary.md b/docs/aituner-harness-summary.md
new file mode 100644
index 0000000..b454c80
--- /dev/null
+++ b/docs/aituner-harness-summary.md
@@ -0,0 +1,83 @@
+# AITuner Harness Summary
+
+## What The Harness Adds
+
+The harness turns each LLM proposal from open-ended config search into a bottleneck-directed decision.
+
+1. Workload profile
+   - Extracts L-C-A features from the trace window:
+     - L: prompt length percentiles and tail ratio.
+     - C: prefix/cache reuse estimates from `hash_ids` when available.
+     - A: request rate, burst ratio, and interarrival variation.
+   - These features are injected into the prompt as a structured `Harnesses` section.
+
+2. Trial diagnostics
+   - Reads recent trial result JSON.
+   - Summarizes feasible probes, all-infeasible probes, pass rates, request rates, latency percentiles, and failed SLO reason counts.
+   - Classifies the active bottleneck as `ttft_prefill`, `decode_tpot`, `admission_or_queueing`, `launch_or_memory`, or unknown.
+
+3. Knob-family harnesses
+   - Maps bottlenecks to a small number of plausible knob families.
+   - Current harness families:
+     - `tensor-parallel-size`: long-prompt TTFT/prefill bottlenecks.
+     - `max-num-batched-tokens`: prefill batching or fragmentation, with trust-region guards.
+     - `max-num-seqs`: cache-heavy or admission-limited workloads.
+     - `enable-chunked-prefill`: long-tail prompt blocking.
+     - `gpu-memory-utilization`: memory headroom after topology and batching are stable.
+   - Each family has `use_when`, `procedure`, `guards`, and `active_now` fields.
+
+4. Proposal discipline
+   - The prompt requires the LLM to choose at most one primary knob family unless history proves a coupled change is needed.
+   - It must use adjacent legal topology choices and stay inside topology constraints.
+   - It receives tested config signatures, so it should not repeat already-tried configs.
+   - It can return `should_stop=true` when no adjacent harness-guided probe is justified.
+
+5. Baseline-first loop
+   - LLM-driven `study tune` now evaluates the initial engine config first unless `--skip-baseline` is passed.
+   - This aligns the loop with evaluate-then-search: the first LLM proposal sees measured bottleneck evidence rather than guessing from static config.
+
+## What Accelerates Convergence
+
+The speedup comes from reducing wasted proposal families, not from changing the benchmark metric.
+
+1. Topology-before-runtime on prefill bottlenecks
+   - For long-prompt, low-cache-reuse windows, the harness activates the `tensor-parallel-size` (TP) harness before trying speculative runtime knobs.
+   - Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`.
+
+2. Guarded stop after a strong incumbent
+   - If the newest trial is the incumbent and improves per-GPU throughput by at least `3x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
+   - Without that guard, the LLM still proposed weak `max-num-batched-tokens` (MBT) trials after finding the best qwen27b config.
+   - With the guard, it emits `should_stop=true`.
+
+3. All-infeasible plateau detection
+   - When recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT, the harness blocks repeating the same primary knob family.
+   - This prevents continuing in a direction such as DP-only scale-out after `DP=4` and `DP=8` have plateaued.
+
+4. Cleaner early-stop handling
+   - Early-stopped probes no longer leave in-flight requests polluting the next probe.
+   - By default, in-flight requests are drained so that production runs remain comparable.
+   - Engine relaunch after early stop is available as opt-in for faster smoke studies, but it is not the default because it can change warm-state comparability.
+
+## qwen27b 0-8k Evidence
+
+Source: `docs/qwen27b-chat-0-8k-harness-fig18.md`.
+
+Metric: best-so-far feasible `request_rate_per_gpu`.
+
+| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5-12 |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| Before harness | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 |
+| After harness strict replay | 0.0350 | 0.2025 | 0.2025 (stop) | 0.2025 | 0.2025 |
+
+Result:
+
+- Before harness reached the best value at iter 4.
+- After harness reached the same value at iter 2 and stopped at iter 3.
+- Iterations-to-best improved from `4` to `2`, a `2x` convergence speedup on this case.
+- The harness also avoided eight post-best infeasible runtime-only probes.
+
+## Current Risks
+
+- The harness is prompt-guided, not a hard verifier for every rule. If future LLM outputs ignore a fired guard, proposal validation should reject the blocked family explicitly.
+- Strong-incumbent stopping is deliberately conservative for the qwen27b pattern. Workloads with narrow runtime sweet spots, such as qwen235b thinking prefill-only, may need a weaker stop threshold or a "continue local refinement" exception.
+- Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.