diff --git a/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md b/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
new file mode 100644
index 0000000..dab994c
--- /dev/null
+++ b/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
@@ -0,0 +1,251 @@
+# Qwen27B Tight-SLO 2x2 Harness Ablation - 2026-06-23
+
+This note organizes the aggregate report generated at:
+
+```text
+.aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md
+```
+
+The experiment is a 2x2 ablation: model strength crossed with `use_harness`.
+It asks whether the harness supplies reusable search structure beyond a stronger
+LLM's free-form tuning proposals.
+
+## Experiment Design
+
+Case: `qwen27b-tight-slo-2x2-aggregate`.
+
+Substrate:
+
+- Model served: `qwen3.5-27b-256k-0223-internal`.
+- Hardware: H20, up to 8 GPUs.
+- Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
+  `replay_time_scale=1.0`, `max_concurrency=32`.
+- SLO: pass rate >= 0.95, TTFT step rule of 2s for <=4096 input tokens,
+  4s for <=32768 input tokens, 6s above that, and TPOT <= 50 ms.
+- Search: `sampling_u` in `[0, 0.0625]`, tolerance 0.001, max 6 probes.
+- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
+- Tunable flags: `tensor-parallel-size`, `data-parallel-size`,
+  `expert-parallel-size`, `gpu-memory-utilization`, `block-size`,
+  `max-num-batched-tokens`, `max-num-seqs`, `enable-prefix-caching`,
+  `enable-chunked-prefill`.
+- Topology constraints: TP and DP in `{1,2,4,8}`, allowed TP*DP products in
+  `{1,2,4,8}`, EP fixed to 1 for this case.
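+
+As a worked example of the topology constraint above, the legal `(TP, DP)` pairs
+can be enumerated by hand. This is a sketch derived only from the two stated
+sets; it assumes no further per-pair restrictions apply.
+
+```text
+TP=1: DP in {1,2,4,8}   -> products 1, 2, 4, 8
+TP=2: DP in {1,2,4}     -> products 2, 4, 8
+TP=4: DP in {1,2}       -> products 4, 8
+TP=8: DP in {1}         -> product  8
+legal (TP, DP) pairs: 4 + 3 + 2 + 1 = 10
+```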
+
+Arms:
+
+| Arm | Tuner model | Harness | Trial budget used |
+| --- | --- | --- | ---: |
+| `gpt55_harness` | `gpt-5.5` | on | 2 |
+| `gpt55_naive` | `gpt-5.5` | off | 10 |
+| `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
+| `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |
+
+Within each model pair, the only intended difference is `use_harness`. The
+aggregate then tests whether the weaker model plus harness can match or exceed
+the stronger model without harness.
+
+## Aggregate Result
+
+Reference best: `0.4429 req/s/GPU`.
+Target threshold for convergence comparisons: 95% of reference, or
+`0.4208 req/s/GPU`.
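+
+The threshold follows directly from the two numbers above; the arithmetic, using
+the rounded reference value as printed, is:
+
+```text
+0.95 * 0.4429 = 0.420755  ->  0.4208 req/s/GPU (rounded to 4 decimals)
+```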
+
+| Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
+| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| `gpt55_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
+| `gpt55_naive` | naive | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
+| `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
+| `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
+
+Harness wins both harness-vs-naive checks:
+
+| Harness arm | Final vs best naive | AUC vs best naive | Pass |
+| --- | ---: | ---: | --- |
+| `gpt55_harness` | 16.2290x | 16.1296x | true |
+| `gpt54mini_harness` | 16.2290x | 16.0720x | true |
+
+The strongest ablation observation is that `gpt-5.4-mini + harness` matches
+`gpt-5.5 + harness` in both final throughput and trials-to-target, while both
+naive arms finish more than 16x lower than the harness arms in final per-GPU
+throughput and in AUC.
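+
+As a sanity check, the ratios can be approximately reproduced from the rounded
+table values. The small differences from the reported `16.2290x`, `16.1296x`,
+and `16.0720x` are consistent with the report dividing unrounded values, which
+is an assumption here rather than something the report states.
+
+```text
+final/ref, gpt55_naive      : 0.0273 / 0.4429 ~= 0.0616
+final/ref, gpt54mini_naive  : 0.0231 / 0.4429 ~= 0.0522
+final vs best naive         : 0.4429 / 0.0273 ~= 16.22
+AUC vs best naive, gpt55    : 0.9484 / 0.0588 ~= 16.13
+AUC vs best naive, gpt54mini: 0.9450 / 0.0588 ~= 16.07
+```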
+
+## What The Harness Actually Did
+
+The harness did not perform generic "better prompting". It inserted a measured,
+structured decision protocol between trial results and the next proposal.
+
+Formally, after each trial `t`, AITuner observes:
+
+```text
+o_t = (config_t, probe_history_t, pass_rate_t, latency_slo_failures_t,
+       request_rate_t, parallel_size_t, launch_status_t)
+```
+
+and optimizes:
+
+```text
+J(config_t) = request_rate_t / parallel_size_t
+subject to pass_rate_t >= 0.95.
+```
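+
+To make the per-GPU objective concrete, the reported values can be converted
+back to aggregate request rates. This sketch assumes `parallel_size = TP * DP`
+(EP is fixed to 1 in this case); the report does not spell out that definition.
+
+```text
+TP=4, DP=1: parallel_size = 4, J = 0.4429 -> request_rate ~= 4 * 0.4429 = 1.7716 req/s
+TP=1, DP=8: parallel_size = 8, J = 0.0273 -> request_rate ~= 8 * 0.0273 = 0.2184 req/s
+```
+
+Under that assumption, the TP4 configuration serves about 8x the aggregate rate
+of the best naive configuration on half as many GPUs.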
+
+The harness maps the observation into:
+
+```text
+b_t = ranked_bottleneck(o_t)
+A_t = candidate_knob_families(b_t, topology_constraints, prior_failures)
+score(a) = expected_bottleneck_relief(a)
+         + information_gain(a)
+         + launch_safety(a)
+         - regression_risk(a)
+         - measurement_cost(a)
+```
+
+For this workload, the ranked bottleneck was `ttft_prefill`: long, heavy-tailed
+prompts and a tight TTFT SLO made single-request prefill service time the
+active limiter. Under that bottleneck, the high-value candidate family is a
+legal TP frontier probe, because increasing TP can reduce prefill compute
+latency for one request. DP-only scaling adds replicas but does not shorten the
+single-request prefill path, so it can improve aggregate admission while still
+violating the per-request TTFT SLO and scoring poorly on the per-GPU objective.
+
+The actual harness trajectory was:
+
+| Arm | Trial | Patch | req/s/GPU | Pass rate | Diagnosis |
+| --- | ---: | --- | ---: | ---: | --- |
+| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | TTFT/prefill; adjacent TP increase should reduce long-prefill latency. |
+| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | Ranked bottleneck is `ttft_prefill`; compare TP4 vs TP2 to distinguish compute-latency relief from replica/admission effects. |
+| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | TTFT/prefill; adjacent TP increase is the safest throughput-improving probe. |
+| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | Same `ttft_prefill` topology test as the stronger model. |
+
+The stop was also harness-mediated. Both harness arms stopped after trial 2
+because the validator authorized `harness_stop` with:
+
+```text
+search_high_saturated_by_incumbent
+```
+
+The recorded stop diagnosis was:
+
+```text
+The incumbent's highest measured probe is feasible and is within the configured
+binary-search resolution of search.high.
+```
+
+So the loop did not stop because an LLM guessed that tuning was done. It stopped
+because the incumbent saturated the configured search interval under the SLO
+within binary-search tolerance.
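+
+The search settings are consistent with that stop rule. Assuming plain bisection
+in which each probe halves the remaining `sampling_u` interval (the report does
+not describe the probe schedule), six probes resolve the interval to just under
+the configured tolerance:
+
+```text
+interval width  = 0.0625 - 0 = 0.0625
+after 6 halvings: 0.0625 / 2^6 = 0.0625 / 64 = 0.0009765625
+0.0009765625 < 0.001 (tolerance)
+```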
+
+## Which Knobs Were Tuned
+
+The winning harness configuration only changed topology:
+
+```text
+base config + tensor-parallel-size=4, data-parallel-size=1
+```
+
+The harness did not tune local scheduler/cache/memory knobs in the winning path.
+It deliberately tested topology before local runtime knobs because the active
+bottleneck was single-request TTFT/prefill service time.
+
+The naive arms tuned a different knob family:
+
+| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
+| --- | --- | --- | ---: |
+| `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
+| `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |
+
+The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that
+horizontal data parallelism should maximize request rate because the model fits
+per GPU and TP would add communication overhead. Subsequent naive proposals kept
+that DP-heavy topology and searched scheduler/cache/memory details around it.
+Across 20 naive trial slots total, neither model entered the TP2/TP4 topology
+frontier that solved the bottleneck.
+
+## Why This Beats Baseline
+
+The naive baseline failed because it optimized the wrong causal path.
+
+For a TTFT/prefill-bound workload, the relevant service-time term is the latency
+of one request's prefill path. A DP-heavy topology can run more independent
+replicas, but each replica still handles a long prompt with TP1 compute latency.
+Under a tight per-request TTFT SLO, those replicas do not unlock a much higher
+feasible `sampling_u`, and the objective divides request rate by GPU count. This
+is why `TP=1, DP=8` stayed near `0.02-0.027 req/s/GPU` despite using all GPUs.
+
+The harness changed the optimization direction:
+
+```text
+observed SLO pressure -> classify as TTFT/prefill -> prefer legal TP frontier
+-> measure per-GPU feasible rate under the same SLO -> stop when search.high is saturated
+```
+
+That sequence is measurable and falsifiable. If TP4 had improved raw latency but
+materially regressed `request_rate_per_gpu`, the harness proposal called for
+rejecting the hypothesis. If the bottleneck had been admission/queueing with healthy
+TTFT/TPOT service times, the same knob-effect model would have favored DP or
+`max-num-seqs` instead. The decision was not "Qwen27B needs TP4"; it was
+"`ttft_prefill` evidence makes TP frontier the next highest-information probe
+under current constraints."
+
+This is also why the weak-model arm matters. The weaker `gpt-5.4-mini` with the
+harness converged to exactly the same TP frontier and final throughput as
+`gpt-5.5 + harness`, while the stronger `gpt-5.5` without harness stayed in the
+wrong DP-heavy family for its whole budget. The ablation therefore attributes the
+gain to the structured harness state and validators, not merely to a stronger
+language model or a more verbose prompt.
+
+## Evidence Boundary
+
+This report supports the harness mechanism on the Qwen27B tight-SLO case and
+the model-strength ablation, but it covers a single case and should not be read
+as universal proof. The correct generalization claim is narrower:
+
+- In this case, the harness improved final quality, convergence speed, AUC, and
+  stop discipline.
+- The harness made a weaker model match the stronger harnessed model and beat
+  the stronger naive model by more than 16x.
+- The successful decision was expressed in generic terms: SLO-derived
+  bottleneck classification, topology constraints, knob-effect scoring,
+  per-GPU objective, and validator-authorized stop.
+- Additional cases are still needed to show the same mechanism across different
+  bottlenecks, for example prefill scheduler pressure, decode TPOT pressure,
+  memory/KV pressure, and admission/queueing pressure.
+
+## Original Aggregate Report
+
+```text
+# qwen27b-tight-2x2-aggregate-20260623T005838Z
+
+## Aggregate
+
+- Cases: `1`
+- Harness-vs-naive pass/checks: `2`/`2`
+- Winner counts: `{"final_best": {"gpt55_harness": 1}, "fastest_to_target": {"gpt55_harness": 1}, "normalized_auc": {"gpt55_harness": 1}}`
+
+## By Kind
+
+| Kind | Arms | Mean final/ref | Mean AUC | Target reached |
+| --- | ---: | ---: | ---: | ---: |
+| `harness` | 2 | 1.0000 | 0.9467 | 2 |
+| `naive` | 2 | 0.0569 | 0.0543 | 0 |
+
+## Cases
+
+### qwen27b-tight-slo-2x2-aggregate
+
+- Reference best req/s/GPU: `0.4429`
+- Target fraction: `0.95`
+- Winners: `{"final_best": "gpt55_harness", "fastest_to_target": "gpt55_harness", "normalized_auc": "gpt55_harness"}`
+
+| Arm | Kind | Trials | Final/GPU | Final/ref | TTT | AUC | Failed | No feasible |
+| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| `gpt55_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
+| `gpt55_naive` | `naive` | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
+| `gpt54mini_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
+| `gpt54mini_naive` | `naive` | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
+
+| Harness | Final vs best naive | Target speedup | AUC vs best naive | Pass |
+| --- | ---: | ---: | ---: | --- |
+| `gpt55_harness` | 16.2290 | - | 16.1296 | `True` |
+| `gpt54mini_harness` | 16.2290 | - | 16.0720 | `True` |
+```