Document 27B TP sweep: per-GPU throughput rises sharply with TP (dense), opposite of MoE

Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput:
TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound
(one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move
TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was
best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated
measurements (TP1/TP2). TP4 saturated -> lower bound.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 01:54:40 +08:00
parent b1b74318f6
commit 93ce339d61


@@ -0,0 +1,51 @@
# Qwen3.5-27B TP sweep under length-aware TTFT SLO — 2026-06-16
Branch `feat/two-stop`. Deterministic ground-truth A/B (proposal files, no LLM):
TP1 vs TP2 vs TP4 on the dense Qwen3.5-27B (internal 256k, fp8, spec-decode) at
0-8k chat, vLLM 0.11.1, H20, `replay_time_scale=1.0` (no smoke), Stop-A enabled,
pinned to GPUs 2-7.

**SLO**: TTFT ≤ `4000 + 0.125·L_in` ms (= 4s + L_in/8k), TPOT ≤ 50 ms, pass ≥ 95%.
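The SLO above can be sketched as a per-request check plus a trial-level pass rate. This is a minimal illustration, not the harness's code: `RequestResult`, `meets_slo`, `pass_rate`, and `trial_feasible` are hypothetical names, and it assumes a request passes only if it meets both the TTFT and the TPOT limit.

```python
from dataclasses import dataclass

TTFT_BASE_MS = 4000.0      # fixed part of the TTFT budget (4 s)
TTFT_PER_TOKEN_MS = 0.125  # extra budget per input token (L_in / 8k seconds)
TPOT_LIMIT_MS = 50.0
PASS_THRESHOLD = 0.95


@dataclass
class RequestResult:
    """Hypothetical per-request record; field names are assumptions."""
    input_tokens: int
    ttft_ms: float
    tpot_ms: float


def ttft_limit_ms(input_tokens: int) -> float:
    """Length-aware TTFT budget in milliseconds."""
    return TTFT_BASE_MS + TTFT_PER_TOKEN_MS * input_tokens


def meets_slo(r: RequestResult) -> bool:
    # Assumption: a request must satisfy both limits to count as a pass.
    return r.ttft_ms <= ttft_limit_ms(r.input_tokens) and r.tpot_ms <= TPOT_LIMIT_MS


def pass_rate(results: list[RequestResult]) -> float:
    """Fraction of requests that meet the SLO (0.0 for an empty trial)."""
    if not results:
        return 0.0
    return sum(meets_slo(r) for r in results) / len(results)


def trial_feasible(results: list[RequestResult]) -> bool:
    return pass_rate(results) >= PASS_THRESHOLD
```

For an 8000-token prompt the budget is 4000 + 0.125 × 8000 = 5000 ms, so longer prompts get proportionally more time to first token while the TPOT limit stays fixed.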
## Result
| config | best_u | raw req/s | req/s/GPU | pass | saturated |
| --- | --- | --- | --- | --- | --- |
| TP1 | 0.00195 | 0.065 | **0.065** | 1.00 | no |
| TP2 | 0.0195 | 0.585 | **0.2925** | 0.96 | no |
| TP4 | 0.123 | 3.63 | **≥0.908** | 0.98 | **yes (best_u≈high=0.125)** |
- **Per-GPU throughput rises sharply with TP for the dense 27B**: TP2 = 4.5× TP1,
TP4 ≥ 14× TP1. Opposite of the MoE Qwen3-30B-A3B (TP1 best per-GPU) — confirms the
dense-vs-MoE distinction.
- **Mechanism**: TP1 is TPOT-bound — one H20 cannot decode a 27B under 50 ms/token
once the batch grows, so it saturates at ~0.065 req/s/GPU. Loosening TTFT (2s→4-5s)
did *not* change TP1 (still 0.065), confirming TPOT — not TTFT — is TP1's binding
constraint. Each TP doubling speeds up decode and prefill enough to more than offset
the added GPUs, so per-GPU throughput still rises.
- **TP4 saturated** the offered-load ceiling (`best_u=0.123`, against `search.high=0.125`):
it was still feasible after roughly the whole trace, so 0.908 is a lower bound. Measuring
the true peak (and TP8) needs a raised `search.high`.
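The per-GPU column and the speedup ratios quoted above follow directly from the raw request rates in the table. A short sketch of that arithmetic, with the table values hard-coded (the dictionary layout is only for illustration):

```python
# (tensor-parallel degree, raw req/s) taken from the Result table.
rows = {
    "TP1": (1, 0.065),
    "TP2": (2, 0.585),
    "TP4": (4, 3.63),  # saturated run, so this is a lower bound
}

# Normalize by the number of GPUs each config occupies.
per_gpu = {name: raw / tp for name, (tp, raw) in rows.items()}

baseline = per_gpu["TP1"]
speedup = {name: rate / baseline for name, rate in per_gpu.items()}

for name in rows:
    print(f"{name}: {per_gpu[name]:.4f} req/s/GPU, {speedup[name]:.1f}x TP1")
```

TP4's 3.63 / 4 = 0.9075 rounds to the 0.908 in the table, and 0.9075 / 0.065 is just under 14, which is why the ratio is reported as "≥ 14×".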
## Process findings (fed back into the harness)
- **Bug fixed**: a request exceeding `request_timeout_s` raised a raw `TimeoutError`
mid-stream that escaped `_run_one_request` and crashed the whole trial; now wrapped
as `HttpClientError` (failed request, not failed trial). Commit `2fcaf80`.
- **Open gap**: killing a `study tune` run orphans the `VLLM::EngineCore` workers
(SIGTERM/SIGKILL of the loop doesn't tear down the engine), which twice left leaked
GPU memory on GPUs 0/1 (dead PIDs still pinning KV, only clearable via root
`nvidia-smi --gpu-reset`). Proposed fix: add a SIGTERM handler to the CLI loop and make
`_terminate_process_tree` match `EngineCore` workers, not just `vllm serve`.
- **Experiment hygiene**: `replay_time_scale=1.0` makes each probe run in real arrival
time; `search.high` must bracket the config's boundary (too wide wastes probes on a
low-capacity config; too low saturates a high-capacity one), and `request_timeout_s`
must be modest so overloaded probes drain fast.
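The timeout fix in the first finding amounts to converting a per-request timeout into a per-request failure so the trial can keep going. The sketch below shows that pattern with `asyncio`; it is not the harness's code. `HttpClientError` is named in the notes, but its definition here is a stand-in, and `_send`, `run_one_request`, and `run_trial` are hypothetical.

```python
import asyncio


class HttpClientError(Exception):
    """Stand-in for the harness's per-request failure type."""


async def _send(delay_s: float) -> str:
    # Hypothetical request: sleeps to simulate a slow streaming response.
    await asyncio.sleep(delay_s)
    return "ok"


async def run_one_request(delay_s: float, request_timeout_s: float) -> str:
    try:
        return await asyncio.wait_for(_send(delay_s), timeout=request_timeout_s)
    except (asyncio.TimeoutError, TimeoutError) as exc:
        # Wrap the raw timeout so callers see a failed request, not a crash.
        raise HttpClientError(f"request exceeded {request_timeout_s}s") from exc


async def run_trial(delays: list[float], request_timeout_s: float) -> tuple[int, int]:
    """Return (succeeded, failed) counts; one slow request does not abort the rest."""
    outcomes = await asyncio.gather(
        *(run_one_request(d, request_timeout_s) for d in delays),
        return_exceptions=True,
    )
    failed = sum(isinstance(o, HttpClientError) for o in outcomes)
    return len(outcomes) - failed, failed


if __name__ == "__main__":
    print(asyncio.run(run_trial([0.0, 0.2], request_timeout_s=0.05)))
```

Note that `return_exceptions=True` also captures unrelated exceptions; a real harness would likely re-raise anything that is not an `HttpClientError`.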
## Next
- Re-measure TP4 (and TP8) with `search.high` raised (e.g. 0.5) to find the true peak
per-GPU and the TP knee.
- Run the Stop-B agentic loop on this 27B stack: unlike the 30B (baseline already
optimal), here the loop should climb TP1→TP2→TP4 and stop — a real improving
trajectory (the original Phase-5 "A" goal).
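Why `search.high` has to be raised for the TP4 re-measurement can be illustrated with a plain bisection over offered load `u`. This is an assumed model of the search, not the harness's implementation: `find_best_u`, `SearchResult`, the probe count, and the saturation tolerance are all hypothetical.

```python
from typing import Callable, NamedTuple


class SearchResult(NamedTuple):
    best_u: float
    saturated: bool


def find_best_u(
    is_feasible: Callable[[float], bool],
    low: float,
    high: float,
    probes: int = 6,
    saturation_tol: float = 0.05,
) -> SearchResult:
    """Bisect for the highest feasible load in [low, high].

    Assumes feasibility is monotone in u and that `low` itself is feasible.
    """
    best, lo, hi = low, low, high
    for _ in range(probes):
        mid = (lo + hi) / 2
        if is_feasible(mid):
            best, lo = mid, mid  # feasible: push the lower edge up
        else:
            hi = mid             # infeasible: pull the upper edge down
    # If the best point sits right under the ceiling, the ceiling (not the
    # config) ended the search, so best_u is only a lower bound.
    saturated = (high - best) <= saturation_tol * high
    return SearchResult(best, saturated)


if __name__ == "__main__":
    print(find_best_u(lambda u: True, 0.0, 0.125))       # never infeasible
    print(find_best_u(lambda u: u <= 0.02, 0.0, 0.125))  # boundary inside range
```

Under these assumptions, a config that never fails ends at 0.125 × 63/64 ≈ 0.123 after six probes, the same value as the reported TP4 `best_u`, though the harness's actual search may differ. Raising `high` to 0.5 would let the same procedure find an infeasible probe and bracket the real boundary.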