Document 27B TP sweep: per-GPU throughput rises sharply with TP (dense), opposite of MoE

Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput in req/s/GPU: TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated, so a lower bound). TP1 is TPOT-bound (one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move TP1, confirming TPOT is the binding constraint. This is the opposite of the MoE 30B-A3B, where TP1 was best per-GPU. The non-saturated TP1/TP2 results validate that the harness + length-aware SLO produce meaningful measurements.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 01:54:40 +08:00
parent b1b74318f6
commit 93ce339d61
1 changed files with 51 additions and 0 deletions
--- a/docs/harness-ablation/qwen27b-tp-sweep-20260616.md
+++ b/docs/harness-ablation/qwen27b-tp-sweep-20260616.md
@@ -0,0 +1,51 @@
+# Qwen3.5-27B TP sweep under length-aware TTFT SLO — 2026-06-16
+
+Branch `feat/two-stop`. Deterministic ground-truth A/B (proposal files, no LLM):
+TP1 vs TP2 vs TP4 on the dense Qwen3.5-27B (internal 256k, fp8, spec-decode) at
+0–8k chat, vLLM 0.11.1, H20, `replay_time_scale=1.0` (no smoke), Stop-A enabled,
+pinned to GPUs 2–7.
+
+**SLO**: TTFT ≤ `4000 + 0.125·L_in` ms (= 4s + L_in/8k), TPOT ≤ 50 ms, pass ≥ 95%.
+
+## Result
+
+| config | best_u | raw req/s | req/s/GPU | pass | saturated |
+| --- | --- | --- | --- | --- | --- |
+| TP1 | 0.00195 | 0.065 | **0.065** | 1.00 | no |
+| TP2 | 0.0195 | 0.585 | **0.2925** | 0.96 | no |
+| TP4 | 0.123 | 3.63 | **≥0.908** | 0.98 | **yes (best_u≈high=0.125)** |
+
+- **Per-GPU throughput rises sharply with TP for the dense 27B**: TP2 = 4.5× TP1,
+  TP4 ≥ 14× TP1. This is the opposite of the MoE Qwen3-30B-A3B, where TP1 was best
+  per-GPU, and confirms the dense-vs-MoE distinction.
+- **Mechanism**: TP1 is TPOT-bound. One H20 cannot decode a 27B model under
+  50 ms/token once the batch grows, so it saturates at ~0.065 req/s/GPU. Loosening
+  TTFT (2 s → 4–5 s, depending on input length) did *not* change TP1 (still 0.065),
+  confirming that TPOT, not TTFT, is TP1's binding constraint. Each TP doubling speeds
+  up decode and prefill enough to more than offset the added GPUs.
+- **TP4 saturated** the offered-load ceiling (`best_u=0.123 ≈ search.high=0.125`): it
+  was still feasible after nearly the whole trace, so 0.908 req/s/GPU is a lower bound.
+  Measuring the true peak (and TP8) requires a raised `search.high`.
+
+## Process findings (fed back into the harness)
+
+- **Bug fixed**: a request exceeding `request_timeout_s` raised a raw `TimeoutError`
+  mid-stream that escaped `_run_one_request` and crashed the whole trial; now wrapped
+  as `HttpClientError` (failed request, not failed trial). Commit `2fcaf80`.
+- **Open gap**: killing a `study tune` run orphans the `VLLM::EngineCore` workers
+  (SIGTERM/SIGKILL of the loop doesn't tear down the engine). This twice leaked GPU
+  memory on GPUs 0/1: dead PIDs kept pinning KV, clearable only via root
+  `nvidia-smi --gpu-reset`. Proposed fix: a SIGTERM handler in the CLI loop, plus making
+  `_terminate_process_tree` match `EngineCore` workers, not just `vllm serve`.
+- **Experiment hygiene**: `replay_time_scale=1.0` makes each probe take real arrival
+  time; `search.high` must bracket the config's boundary (too wide wastes probes on a
+  low-capacity config; too low saturates a high-capacity one), and `request_timeout_s`
+  must be modest so overloaded probes drain fast.
+
+## Next
+
+- Re-measure TP4 (and TP8) with `search.high` raised (e.g. 0.5) to find the true peak
+  per-GPU and the TP knee.
+- Run the Stop-B agentic loop on this 27B stack: unlike the 30B (baseline already
+  optimal), here the loop should climb TP1→TP2→TP4 and stop — a real improving
+  trajectory (the original Phase-5 "A" goal).
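
As a rough illustration of the SLO and the per-GPU normalization behind the Result table, here is a minimal Python sketch. The helper names (`ttft_limit_ms`, `request_passes`, `trial_passes`, `per_gpu`) are hypothetical and not taken from the harness. It also assumes that "pass ≥ 95%" means the fraction of requests meeting both the TTFT and TPOT limits. Only the thresholds and the raw req/s figures come from the report.

```python
# Hypothetical helpers: names are illustrative, not the harness's API.
TPOT_LIMIT_MS = 50.0   # TPOT SLO from the report
PASS_RATE_MIN = 0.95   # required pass rate from the report


def ttft_limit_ms(input_tokens: int) -> float:
    """Length-aware TTFT budget: 4000 ms plus 0.125 ms per input token."""
    return 4000.0 + 0.125 * input_tokens


def request_passes(input_tokens: int, ttft_ms: float, tpot_ms: float) -> bool:
    """Assumption: a request passes only if it meets both latency limits."""
    return ttft_ms <= ttft_limit_ms(input_tokens) and tpot_ms <= TPOT_LIMIT_MS


def trial_passes(results: list[bool]) -> bool:
    """A trial is feasible when at least 95% of its requests pass."""
    return bool(results) and sum(results) / len(results) >= PASS_RATE_MIN


def per_gpu(raw_req_per_s: float, tp: int) -> float:
    """Normalize raw throughput by the number of GPUs in the TP group."""
    return raw_req_per_s / tp


# Raw req/s per TP degree, copied from the Result table.
RAW = {1: 0.065, 2: 0.585, 4: 3.63}
PER_GPU = {tp: per_gpu(raw, tp) for tp, raw in RAW.items()}
SPEEDUP = {tp: PER_GPU[tp] / PER_GPU[1] for tp in PER_GPU}

if __name__ == "__main__":
    for tp in sorted(PER_GPU):
        print(f"TP{tp}: {PER_GPU[tp]:.4f} req/s/GPU, {SPEEDUP[tp]:.2f}x TP1")
```

For the 0–8k chat workload, `ttft_limit_ms` ranges from 4000 ms (empty input) to 5000 ms (8000 input tokens), which is the "4–5 s" budget mentioned under Mechanism. The TP4 entry is computed from a saturated probe, so its per-GPU value and speedup should be read as lower bounds.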