From b0325ecfd9b9d098ec545a0981a5b7b33a583406 Mon Sep 17 00:00:00 2001
From: Gahow Wang <gahow.wang@gmail.com>
Date: Sun, 10 May 2026 14:21:49 +0800
Subject: [PATCH] Clarify qwen235b raw per-iteration performance

---
 .../qwen235b-thinking-prefill-ttft-20260510.md | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/docs/harness-ablation/qwen235b-thinking-prefill-ttft-20260510.md b/docs/harness-ablation/qwen235b-thinking-prefill-ttft-20260510.md
index 7eb5fed..c7fd2f5 100644
--- a/docs/harness-ablation/qwen235b-thinking-prefill-ttft-20260510.md
+++ b/docs/harness-ablation/qwen235b-thinking-prefill-ttft-20260510.md
@@ -24,20 +24,33 @@ Both runs were launched through `python3 -m aituner.cli study tune`; no proposal
 
 ## Result
 
-Throughput is `best_request_rate_per_gpu` for each trial. `-` means the trial did not produce a feasible point.
+The table below reports the raw per-iteration performance for a Fig18-style plot. Use its values as `perf[i]`; do not replace missing points with the best-so-far value `max(perf[:i+1])`.
+
+Metric: `best_request_rate_per_gpu` from that trial's own `result.json`. `NA` means the proposed config did not produce a feasible point under the SLO, either because the engine/probe failed or because every sampled probe was infeasible.
+
+| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| no-harness raw `perf[i]` | 0.2029 | NA | NA | 0.3863 | NA | NA | NA | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
+| harness raw `perf[i]` | 0.2029 | NA | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop |
+
+The raw no-harness curve is therefore not monotonic: iterations 2, 3, and 5 through 7 produced no feasible point. The apparently monotonic 12-iteration sequence comes only from plotting best-so-far rather than the measured performance of each proposal.
+
+Per-trial details (`-` has the same meaning as `NA` above: the trial did not produce a feasible point):
 
 | Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
 | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
 | no-harness, per-trial | 0.2029 | - | - | 0.3863 | - | - | - | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
 | harness, per-trial | 0.2029 | - | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop |
 
-Best-so-far curve:
+Best-so-far curve, shown only to explain final incumbent selection:
 
 | Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
 | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
 | no-harness | 0.2029 | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
 | harness | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 |
 
+For plotting raw `perf[i]`, the failed/infeasible points should stay missing or be rendered as invalid trials. If a plotting script requires numeric values, use `0` only with an explicit label that this means "no feasible configuration under the configured SLO"; do not forward-fill from the incumbent.
+
 Final best:
 
 | Variant | GPU trials spent | Best trial | Best config summary | Best req/s | Best req/s/GPU | Final vs no-harness |
@@ -70,4 +83,3 @@ The harness context also made the LLM response more directed after failure:
 Harness accelerated convergence mainly through early stopping, not by finding a much better final config on this setup. It reduced GPU trials from 12 to 3 while preserving 99.0% of the no-harness final throughput. It also reached the first strong TP8 point one trial earlier than no-harness.
 
 The limitation is that the generic search-high stop guard stopped before local runtime tuning of `max-num-batched-tokens`, which no-harness used to recover a small additional `0.97%`. For this setup, that tradeoff is acceptable if the goal is fast convergence under a fixed measurement ceiling; if the goal is exact final throughput, the next study should raise `search.high` or disable search-high early stop for a local-polish phase.
-