Document 8-GPU harness ablation results for qwen27b and qwen235b prefill

Add completed experiment results from dash0 runs after 2026-05-13:
- qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU)
- qwen235b prefill TTFT 3s/6s/9s: harness +76.8% over no-harness (0.3921 vs 0.2217 req/s/GPU)

Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation log with completed run status.
2026-05-16 21:23:16 +08:00
parent d0c89dac48
commit 984eb1f325
5 changed files with 223 additions and 3 deletions
--- a/docs/harness-ablation/profile-driven-harness-implementation-20260512.md
+++ b/docs/harness-ablation/profile-driven-harness-implementation-20260512.md
@@ -141,8 +141,8 @@ New comparable studies:
 | Variant | Study ID | Status |
 | --- | --- | --- |
-| no-harness baseline | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-noharness-minprompt-gpt54-20260513` | running first |
+| no-harness baseline | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-noharness-minprompt-gpt54-20260513` | completed |
-| harness | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-harness-profileplanner-20260513` | queued to run after baseline |
+| harness | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-harness-profileplanner-20260513` | completed |
 Both specs set:
@@ -152,4 +152,4 @@ Both specs set:
 - SLO: TTFT p95 <= 4000ms, TPOT p95 <= 25ms, target pass rate 0.95
 - search: full range, `inherit_incumbent_floor=false`
-The no-harness baseline is running in tmux session `qwen27b-gpu8-noharness-20260513`. The harness run should only be started after the no-harness baseline finishes or reaches a sufficient early comparison point, because both need the full GPU host and should not run concurrently.
+Results: harness best `0.2696 req/s/GPU` (TP=4, MBT=7680) vs no-harness best `0.1233 req/s/GPU` (prefix-caching=false), a **+118.6%** improvement. Full analysis in `qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-20260513.md`.
--- a/docs/harness-ablation/qwen235b-thinking-prefill-ttft-20260510.md
+++ b/docs/harness-ablation/qwen235b-thinking-prefill-ttft-20260510.md
@@ -1,5 +1,7 @@
 # qwen235b Thinking Prefill Harness Ablation, 2026-05-10
 **Superseded** by `qwen235b-thinking-prefill-ttft-3s6s9s-20260514.md` (updated SLO thresholds, 8-GPU setup). This document is retained for reference only.
 ## Setup
 - Host: `dash0`
--- a/docs/harness-ablation/qwen235b-thinking-prefill-ttft-3s6s9s-20260514.md
+++ b/docs/harness-ablation/qwen235b-thinking-prefill-ttft-3s6s9s-20260514.md
@@ -0,0 +1,117 @@
 # qwen235b Thinking Prefill Harness Ablation (TTFT 3s/6s/9s)
 Date: 2026-05-14 / 2026-05-15
 Supersedes: `qwen235b-thinking-prefill-ttft-20260510.md` (different SLO thresholds).
 ## Setup
 - Host: `dash0`
 - Engine: internal vLLM at `/usr/local/bin/vllm`
 - Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
 - Trace window: `thinking_w20260327_1000`
 - Request mode: chat, with `completion_tokens_override=1` for prefill-only measurement
 - SLO: TTFT-only stepped p95 pass target, target pass rate `0.95`
  - input tokens `<=4096`: `3000 ms`
  - input tokens `<=32768`: `6000 ms`
  - otherwise: `9000 ms`
 - GPU env: `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7` (8x H20)
 - Baseline topology: `TP=4`
 - LLM: `gpt-5.4`
 - Code: profile-driven harness planner, post GPU-visibility fix (`5c2958e`+)
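The stepped TTFT SLO in the setup list can be sketched as a small lookup plus a pass-rate check. This is a minimal illustration, not the harness's code: the function names and the `(input_tokens, ttft_ms)` sample shape are assumptions, and it assumes the pass rate is the fraction of requests that meet their own threshold.

```python
from typing import Iterable, Tuple


def ttft_threshold_ms(input_tokens: int) -> int:
    """Return the TTFT limit for a request, following the stepped SLO above."""
    if input_tokens <= 4096:
        return 3000
    if input_tokens <= 32768:
        return 6000
    return 9000


def meets_slo(samples: Iterable[Tuple[int, float]],
              target_pass_rate: float = 0.95) -> bool:
    """Check a trial against the SLO.

    `samples` holds hypothetical (input_tokens, ttft_ms) pairs; the real
    result schema is not shown in this document.
    """
    samples = list(samples)
    if not samples:
        return False
    passed = sum(1 for tokens, ttft in samples
                 if ttft <= ttft_threshold_ms(tokens))
    return passed / len(samples) >= target_pass_rate
```

A 20k-token prompt answered in 5.5 s passes under this reading (limit 6000 ms), while a 2k-token prompt answered in 3.2 s does not (limit 3000 ms).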
 ## Studies
 | Variant | Study ID | search.high |
 | --- | --- | ---: |
 | no-harness | `dash0-qwen235b-prefill-thinking-ttft-3s6s9s-12iter-noharness-minprompt-gpt54-20260514` | 0.125 |
 | harness | `dash0-qwen235b-prefill-thinking-ttft-3s6s9s-12iter-harness-profileplanner-gpt54-20260514` | 0.125 |
 | harness (high=0.25) | `dash0-qwen235b-prefill-thinking-ttft-3s6s9s-high025-12iter-harness-profileplanner-gpt54-20260515` | 0.25 |
 The `harness (high=0.25)` run was added to test whether raising `search.high` lets the harness find a better runtime config after reaching the search ceiling at `0.125`.
 ## Result
 Raw per-iteration performance for Fig18-style plot. Metric: `best_request_rate_per_gpu`. `NA` means the proposed config did not produce a feasible point. `fail` means engine launch failure. `stop` means harness stopped before launching another trial.
 | Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
 | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
 | no-harness raw `perf[i]` | 0.1804 | fail | 0.1892 | fail | 0.1892 | 0.1804 | 0.2217 | 0.2029 | 0.2029 | 0.2029 | 0.1892 | 0.1804 |
 | harness raw `perf[i]` | 0.2029 | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop |
 | harness (high=0.25) raw `perf[i]` | 0.2029 | 0.3921 | 0.3442 | 0.3921 | 0.3821 | 0.3821 | 0.3821 | 0.3688 | 0.3821 | 0.3821 | 0.3821 | 0.3821 |
 | Variant | GPU trials | Best iter | Best req/s | Best req/s/GPU | Best config summary |
 | --- | ---: | ---: | ---: | ---: | --- |
 | no-harness | 12 | 7 | 0.8867 | 0.2217 | TP=4, MNS=112, MBT=7168 |
 | harness | 2 (stop) | 2 | 3.0900 | 0.3863 | TP=8 |
 | harness (high=0.25) | 12 | 2 | 3.1367 | **0.3921** | TP=8 |
 With `search.high=0.125`, the harness reached `0.3863 req/s/GPU` at iter 2, **+74.2%** over the no-harness best of `0.2217 req/s/GPU`. With `search.high=0.25`, the harness found `0.3921 req/s/GPU` (+76.8%).
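The percentages follow from the best per-GPU rates in the table above. A minimal sketch of the arithmetic, where `improvement_pct` is a helper name introduced here for illustration only:

```python
def improvement_pct(candidate: float, baseline: float) -> float:
    """Relative improvement of `candidate` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate / baseline - 1.0) * 100.0


no_harness_best = 0.2217    # req/s/GPU, no-harness iter 7
harness_best = 0.3863       # req/s/GPU, harness iter 2 (search.high=0.125)
harness_high_best = 0.3921  # req/s/GPU, harness iter 2 (search.high=0.25)

print(f"{improvement_pct(harness_best, no_harness_best):.2f}%")
print(f"{improvement_pct(harness_high_best, no_harness_best):.2f}%")
```

Because the inputs here are the 4-decimal table values, the last digit may differ slightly from a figure computed on unrounded rates.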
 ## Incumbent Curve
 Best-so-far request rate per GPU after each iteration.
 | Variant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
 | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
 | no-harness | 0.1804 | 0.1804 | 0.1892 | 0.1892 | 0.1892 | 0.1892 | 0.2217 | 0.2217 | 0.2217 | 0.2217 | 0.2217 | 0.2217 |
 | harness | 0.2029 | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop |
 | harness (high=0.25) | 0.2029 | 0.3921 | 0.3921 | 0.3921 | 0.3921 | 0.3921 | 0.3921 | 0.3921 | 0.3921 | 0.3921 | 0.3921 | 0.3921 |
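The incumbent rows are a running maximum over the raw per-iteration values, with failed launches carrying the previous incumbent forward. A small sketch that reproduces the rows above from the raw `perf[i]` rows; the list encoding (`"fail"`, `"stop"` as strings) is an assumption made for this example:

```python
from typing import List, Optional, Union

Cell = Union[float, str]


def incumbent_curve(raw: List[Cell]) -> List[Optional[Cell]]:
    """Best-so-far value after each iteration.

    Numeric entries may raise the incumbent; "fail"/"NA" keep it unchanged;
    "stop" is passed through because no trial was launched.
    """
    best: Optional[float] = None
    curve: List[Optional[Cell]] = []
    for value in raw:
        if value == "stop":
            curve.append("stop")
            continue
        if isinstance(value, (int, float)) and (best is None or value > best):
            best = float(value)
        curve.append(best)
    return curve


no_harness_raw = [0.1804, "fail", 0.1892, "fail", 0.1892, 0.1804,
                  0.2217, 0.2029, 0.2029, 0.2029, 0.1892, 0.1804]
print(incumbent_curve(no_harness_raw))
```

The incumbent only moves on a strict improvement, so a later trial that ties the best value does not change the reported best iteration.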
 ## Trial Details
 No-harness:
 | Iter | Result / GPU | Incumbent / GPU | Status | Config summary |
 | ---: | ---: | ---: | --- | --- |
 | 1 | 0.1804 | 0.1804 | completed | baseline (TP=4) |
 | 2 | - | 0.1804 | launch fail | TP=4, EP=4, MNS=128 |
 | 3 | 0.1892 | 0.1892 | completed | MNS=96 |
 | 4 | - | 0.1892 | launch fail | TP=4, DP=2, EP off, MNS=96 |
 | 5 | 0.1892 | 0.1892 | completed | MNS=112 |
 | 6 | 0.1804 | 0.1892 | completed | MNS=112, MBT=9216 |
 | 7 | 0.2217 | 0.2217 | completed | MNS=112, MBT=7168 |
 | 8 | 0.2029 | 0.2217 | completed | MNS=112, MBT=6144 |
 | 9 | 0.2029 | 0.2217 | completed | MNS=120, MBT=7168 |
 | 10 | 0.2029 | 0.2217 | completed | TP=4, DP=1, EP off, MNS=108, MBT=7168 |
 | 11 | 0.1892 | 0.2217 | completed | MNS=112, MBT=7680 |
 | 12 | 0.1804 | 0.2217 | completed | MNS=112, MBT=6912 |
 Harness (`search.high=0.125`):
 | Iter | Result / GPU | Incumbent / GPU | Status | Config summary |
 | ---: | ---: | ---: | --- | --- |
 | 1 | 0.2029 | 0.2029 | completed | baseline (TP=4) |
 | 2 | 0.3863 | 0.3863 | completed | TP=8 |
 | 3 | - | - | harness stop | search-high saturation (`sampling_u=0.123` vs `search.high=0.125`) |
 Harness (`search.high=0.25`):
 | Iter | Result / GPU | Incumbent / GPU | Status | Config summary |
 | ---: | ---: | ---: | --- | --- |
 | 1 | 0.2029 | 0.2029 | completed | baseline (TP=4) |
 | 2 | 0.3921 | 0.3921 | completed | TP=8 |
 | 3 | 0.3442 | 0.3921 | completed | TP=8, chunked-prefill, MBT=32768 |
 | 4 | 0.3921 | 0.3921 | completed | TP=8, MBT=12288 |
 | 5 | 0.3821 | 0.3921 | completed | TP=8, EP off, MBT=16384 |
 | 6 | 0.3821 | 0.3921 | completed | TP=8, EP off, MBT=14336 |
 | 7 | 0.3821 | 0.3921 | completed | TP=8, EP off, MBT=10240 |
 | 8 | 0.3688 | 0.3921 | completed | TP=8, EP off, MBT=11776 |
 | 9 | 0.3821 | 0.3921 | completed | TP=8, EP off, MBT=13312 |
 | 10 | 0.3821 | 0.3921 | completed | TP=8, EP off, MBT=7168 |
 | 11 | 0.3821 | 0.3921 | completed | TP=8, EP off, MBT=12032 |
 | 12 | 0.3821 | 0.3921 | completed | TP=8, EP off, MBT=12800 |
 ## Interpretation
 No-harness never attempted TP=8. It stayed on the TP=4 baseline, encountered two launch failures (EP=4 and DP=2), and spent all remaining trials on runtime knob tuning within the TP=4 family. Its best finding was `MNS=112, MBT=7168` at iter 7 (`0.2217 req/s/GPU`).
 Harness identified `ttft_prefill` as the dominant bottleneck from the baseline trial and immediately proposed TP=8 as the first topology move. This is the correct direction for a prefill-only workload with heavy-tail prompts (p95 ~19.7k tokens, p99 ~30k tokens).
 With `search.high=0.125`, the harness stopped after iter 2 because the incumbent's best feasible `sampling_u=0.123` was within one search resolution of `search.high`. With `search.high=0.25`, the harness continued for 12 trials but the best remained iter 2 (`TP=8, default MBT`). The additional 10 trials explored chunked prefill, EP off, and MBT variations on TP=8, but none improved per-GPU throughput. This suggests the 2-trial harness result was already at or near the local optimum.
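The saturation stop described above can be sketched as a single comparison. The function name and the `resolution` value are hypothetical; the source only states that `sampling_u=0.123` was within one search resolution of `search.high=0.125`, not what that resolution is.

```python
def search_high_saturated(best_sampling_u: float,
                          search_high: float,
                          resolution: float) -> bool:
    """True when the incumbent sits within one resolution step of the ceiling."""
    return search_high - best_sampling_u <= resolution


RESOLUTION = 0.0025  # hypothetical value, chosen only for illustration

# Ceiling at 0.125: the incumbent is effectively at the top of the range.
print(search_high_saturated(0.123, 0.125, RESOLUTION))
# Ceiling raised to 0.25: the same incumbent leaves headroom, so tuning continues.
print(search_high_saturated(0.123, 0.25, RESOLUTION))
```

Under this reading, a saturated incumbent means further trials cannot register a higher rate until the ceiling is raised, which is what the `high=0.25` study tests.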
 The gap between harness and no-harness (`+76.8%`) comes entirely from topology: TP=8 shards each request's prefill across twice as many GPUs as TP=4, roughly doubling the compute applied per request (per-GPU compute is unchanged). This directly reduces TTFT and allows higher admitted request rates under the stepped TTFT SLO.
 ## Comparison with Previous Run (2026-05-10)
 The 2026-05-10 run used different SLO thresholds and is documented in `qwen235b-thinking-prefill-ttft-20260510.md`. The core finding is consistent: harness finds TP=8 at iter 2-3 while no-harness gets stuck on TP=4 runtime tuning.
--- a/docs/harness-ablation/qwen27b-chat-0-8k-ttft4s-tpot25-20260510.md
+++ b/docs/harness-ablation/qwen27b-chat-0-8k-ttft4s-tpot25-20260510.md
@@ -2,6 +2,8 @@
 Date: 2026-05-10
 **Superseded** by `qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-20260513.md` (corrected 8-GPU setup). This document used `CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7` (7 GPUs) and is retained for reference only.
 ## Setup
 - Host: `dash0` (`172.27.114.84`)
--- a/docs/harness-ablation/qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-20260513.md
+++ b/docs/harness-ablation/qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-20260513.md
@@ -0,0 +1,99 @@
 # Qwen27B Chat 0-8k Harness Ablation (8-GPU)
 Date: 2026-05-13
 Supersedes: `qwen27b-chat-0-8k-ttft4s-tpot25-20260510.md` (7-GPU / gpu3skip setup).
 ## Setup
 - Host: `dash0`
 - Model: `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`
 - Workload: `chat_w20260311_1000`, chat, 0-8k input window
 - SLO: TTFT <= 4000ms and TPOT <= 25ms, target pass rate = 0.95
 - Trial budget: 12 total tuning iterations per study
 - Search: `sampling_u` in `[0, 0.0625]`, tolerance `0.001`, max probes `6`
 - Execution: `python3 -m aituner.cli study tune ... --max-trials 12`
 - GPU env: `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7` (8x H20)
 - Baseline topology: `TP=1`
 - LLM: `gpt-5.4`
 - Code: profile-driven harness planner, post GPU-visibility fix (`5c2958e`+)
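The search settings above (`sampling_u` in `[0, 0.0625]`, tolerance `0.001`, max probes `6`) are consistent with a bisection over the load factor. The sketch below assumes bisection and a monotone feasibility check; the actual `aituner` search procedure is not shown in this document, and `find_max_feasible` and `is_feasible` are names introduced for illustration.

```python
from typing import Callable, Optional, Tuple


def find_max_feasible(is_feasible: Callable[[float], bool],
                      low: float = 0.0,
                      high: float = 0.0625,
                      tolerance: float = 0.001,
                      max_probes: int = 6) -> Tuple[Optional[float], int]:
    """Bisect for the largest sampling_u that still meets the SLO.

    Returns (best feasible sampling_u or None, number of probes used).
    """
    best: Optional[float] = None
    probes = 0
    while probes < max_probes and high - low > tolerance:
        mid = (low + high) / 2
        probes += 1
        if is_feasible(mid):
            best = mid   # feasible: keep it and search higher
            low = mid
        else:
            high = mid   # infeasible: search lower
    return best, probes


# Hypothetical workload that meets the SLO up to sampling_u = 0.04.
best_u, used = find_max_feasible(lambda u: u <= 0.04)
print(best_u, used)
```

With these defaults, six halvings shrink the interval to `0.0625 / 64`, just under the `0.001` tolerance, so the probe budget and the tolerance run out together.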
 ## Studies
 | Variant | Study ID |
 | --- | --- |
 | no-harness | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-noharness-minprompt-gpt54-20260513` |
 | harness | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-harness-profileplanner-20260513` |
 ## Result
 Raw per-iteration performance for Fig18-style plot. Metric: `best_request_rate_per_gpu` from that trial's own `result.json`. `NA` means the proposed config did not produce a feasible point. `fail` means engine launch failure.
 | Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
 | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
 | no-harness raw `perf[i]` | 0.0650 | fail | fail | 0.0617 | 0.0650 | 0.1233 | 0.1050 | 0.1233 | 0.0650 | 0.0650 | 0.0617 | 0.1233 |
 | harness raw `perf[i]` | 0.0650 | 0.1992 | 0.2621 | 0.2056 | 0.1544 | 0.2696 | 0.2621 | 0.2621 | 0.2696 | 0.2621 | 0.2621 | 0.2621 |
 | Variant | Best iter | Best request rate | Best request rate / GPU | Best config summary |
 | --- | ---: | ---: | ---: | --- |
 | no-harness | 6 | 0.1233 | 0.1233 | `enable-prefix-caching=false` |
 | harness | 6 | 1.0783 | **0.2696** | `tensor-parallel-size=4`, `max-num-batched-tokens=7680` |
 Harness final best is **+118.6%** over no-harness.
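The per-GPU column in the summary table matches the raw request rate divided by the GPUs the best config occupies. A minimal sketch, assuming the GPU count equals the tensor-parallel size of that config (1 for the no-harness best, 4 for the harness best); `per_gpu_rate` is a helper name introduced here:

```python
def per_gpu_rate(request_rate: float, gpus_used: int) -> float:
    """Normalize a request rate by the number of GPUs the config occupies."""
    if gpus_used <= 0:
        raise ValueError("gpus_used must be positive")
    return request_rate / gpus_used


no_harness = per_gpu_rate(0.1233, gpus_used=1)  # TP=1 baseline topology
harness = per_gpu_rate(1.0783, gpus_used=4)     # tensor-parallel-size=4

print(f"no-harness: {no_harness:.4f} req/s/GPU")
print(f"harness:    {harness:.4f} req/s/GPU")
```

Normalizing per GPU is what makes a TP=4 config comparable to a TP=1 config: the harness best serves a higher raw rate but is charged for four GPUs.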
 ## Incumbent Curve
 Best-so-far request rate per GPU after each iteration.
 | Variant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
 | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
 | no-harness | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.1233 | 0.1233 | 0.1233 | 0.1233 | 0.1233 | 0.1233 | 0.1233 |
 | harness | 0.0650 | 0.1992 | 0.2621 | 0.2621 | 0.2621 | 0.2696 | 0.2696 | 0.2696 | 0.2696 | 0.2696 | 0.2696 | 0.2696 |
 ## Trial Details
 No-harness:
 | Iter | Result / GPU | Incumbent / GPU | Status | Config summary |
 | ---: | ---: | ---: | --- | --- |
 | 1 | 0.0650 | 0.0650 | completed | baseline |
 | 2 | - | 0.0650 | launch fail | `gpu-memory-utilization=0.94`, `max-num-batched-tokens=16384` |
 | 3 | - | 0.0650 | launch fail | `enable-chunked-prefill=false` |
 | 4 | 0.0617 | 0.0650 | completed | `data-parallel-size=2` |
 | 5 | 0.0650 | 0.0650 | completed | `block-size=32` |
 | 6 | 0.1233 | 0.1233 | completed | `enable-prefix-caching=false` |
 | 7 | 0.1050 | 0.1233 | completed | `enable-prefix-caching=false`, `block-size=32` |
 | 8 | 0.1233 | 0.1233 | completed | `enable-prefix-caching=false`, `max-num-seqs=32` |
 | 9 | 0.0650 | 0.1233 | completed | `enable-prefix-caching=false`, `max-num-batched-tokens=4096` |
 | 10 | 0.0650 | 0.1233 | completed | `enable-prefix-caching=false`, `max-num-seqs=16` |
 | 11 | 0.0617 | 0.1233 | completed | `data-parallel-size=2`, `enable-prefix-caching=false` |
 | 12 | 0.1233 | 0.1233 | completed | `enable-prefix-caching=false` (+ torch compile off) |
 Harness:
 | Iter | Result / GPU | Incumbent / GPU | Status | Config summary |
 | ---: | ---: | ---: | --- | --- |
 | 1 | 0.0650 | 0.0650 | completed | baseline |
 | 2 | 0.1992 | 0.1992 | completed | `tensor-parallel-size=2` |
 | 3 | 0.2621 | 0.2621 | completed | `tensor-parallel-size=4` |
 | 4 | 0.2056 | 0.2621 | completed | `tensor-parallel-size=8` |
 | 5 | 0.1544 | 0.2621 | completed | `tensor-parallel-size=4`, `data-parallel-size=2` |
 | 6 | 0.2696 | 0.2696 | completed | `tensor-parallel-size=4`, `max-num-batched-tokens=7680` |
 | 7 | 0.2621 | 0.2696 | completed | `tensor-parallel-size=4`, `enable-chunked-prefill=true`, `max-num-batched-tokens=12288` |
 | 8 | 0.2621 | 0.2696 | completed | `tensor-parallel-size=4`, `max-num-batched-tokens=7424` |
 | 9 | 0.2696 | 0.2696 | completed | `tensor-parallel-size=4`, `max-num-batched-tokens=7680`, `max-num-seqs=64` |
 | 10 | 0.2621 | 0.2696 | completed | `tensor-parallel-size=4`, `max-num-batched-tokens=7680`, `max-num-seqs=56` |
 | 11 | 0.2621 | 0.2696 | completed | `tensor-parallel-size=4`, `max-num-batched-tokens=7680`, `max-num-seqs=60` |
 | 12 | 0.2621 | 0.2696 | completed | `tensor-parallel-size=4`, `max-num-batched-tokens=7680`, `max-num-seqs=63` |
 ## Interpretation
 No-harness never tested any TP change in 12 trials. It started from TP=1, encountered two early launch failures, then spent the remaining budget on `data-parallel-size=2` and runtime knobs (`enable-prefix-caching`, `block-size`, `max-num-seqs`, `max-num-batched-tokens`). Its best discovery was disabling prefix caching at iter 6, reaching only `0.1233 req/s/GPU`.
 Harness systematically explored the TP frontier: iter 2 tested TP=2, iter 3 tested TP=4, iter 4 tested TP=8. The profile-driven planner identified `ttft_prefill` as the ranked bottleneck and proposed increasing TP as the primary relief action. After TP=4 proved best per-GPU, the harness tested TP=4/DP=2 (worse) then shifted to runtime refinement within the TP=4 family, settling on `max-num-batched-tokens=7680` as the marginal improvement.
 The result indicates that topology exploration is critical for this workload: the no-harness LLM never tried a TP>1 configuration, while the harness reached the best-observed TP=4 topology by iter 3 and refined it by iter 6.
 ## Comparison with Previous 7-GPU Run
 The 7-GPU (`gpu3skip`) run from 2026-05-10 used `CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7` and is not directly comparable. The harness result on 7-GPU was `0.2742 req/s/GPU` (TP=4, chunked-prefill, MBT=16384). On 8-GPU, the harness found a similar TP=4 optimum at `0.2696 req/s/GPU` with slightly different runtime tuning. The core finding is consistent: harness accelerates topology discovery and significantly outperforms no-harness.