Document 27B TP sweep: per-GPU rises sharply with TP (dense), opposite of MoE
Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput: TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound (one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated measurements (TP1/TP2). TP4 saturated -> lower bound.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
51
docs/harness-ablation/qwen27b-tp-sweep-20260616.md
Normal file
@@ -0,0 +1,51 @@
# Qwen3.5-27B TP sweep under length-aware TTFT SLO (2026-06-16)

Branch `feat/two-stop`. Deterministic ground-truth A/B (proposal files, no LLM):
TP1 vs TP2 vs TP4 on the dense Qwen3.5-27B (internal 256k, fp8, spec-decode) at
0–8k chat, vLLM 0.11.1, H20, `replay_time_scale=1.0` (no smoke), Stop-A enabled,
pinned to GPUs 2–7.

**SLO**: TTFT ≤ `4000 + 0.125·L_in` ms (= 4s + L_in/8k), TPOT ≤ 50 ms, pass ≥ 95%.
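
The SLO above can be read as a per-request check plus a pass-rate threshold. The following is a minimal sketch of that reading, not the harness's actual code: the `RequestResult` record, the function names, and the choice to require both TTFT and TPOT per request are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

TTFT_BASE_MS = 4000.0       # fixed TTFT allowance (4 s)
TTFT_PER_TOKEN_MS = 0.125   # extra allowance per input token (1 s per 8k tokens)
TPOT_LIMIT_MS = 50.0        # per-output-token latency limit
MIN_PASS_RATE = 0.95        # fraction of requests that must meet the SLO


@dataclass
class RequestResult:
    """Hypothetical per-request measurement record."""
    input_tokens: int
    ttft_ms: float
    tpot_ms: float


def ttft_budget_ms(input_tokens: int) -> float:
    """Length-aware TTFT budget: 4000 + 0.125 * L_in milliseconds."""
    return TTFT_BASE_MS + TTFT_PER_TOKEN_MS * input_tokens


def request_passes(r: RequestResult) -> bool:
    """Assumption: a request passes only if it meets both TTFT and TPOT."""
    return r.ttft_ms <= ttft_budget_ms(r.input_tokens) and r.tpot_ms <= TPOT_LIMIT_MS


def trial_passes(results: Sequence[RequestResult]) -> bool:
    """A trial is feasible when at least 95% of its requests pass."""
    if not results:
        return False
    passed = sum(request_passes(r) for r in results)
    return passed / len(results) >= MIN_PASS_RATE
```

Under this reading an 8,000-token prompt gets a 5,000 ms TTFT budget, while an empty prompt gets the 4,000 ms base.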

## Result

| config | best_u | raw req/s | req/s/GPU | pass | saturated |
| --- | --- | --- | --- | --- | --- |
| TP1 | 0.00195 | 0.065 | **0.065** | 1.00 | no |
| TP2 | 0.0195 | 0.585 | **0.2925** | 0.96 | no |
| TP4 | 0.123 | 3.63 | **≥0.908** | 0.98 | **yes (best_u≈high=0.125)** |
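
The per-GPU column and the speedups quoted for it follow from dividing raw req/s by the TP degree and normalizing to TP1. A short sketch of that arithmetic, using only the numbers in the table (the helper names are illustrative):

```python
# (config, tensor-parallel degree, raw req/s) copied from the result table
ROWS = [("TP1", 1, 0.065), ("TP2", 2, 0.585), ("TP4", 4, 3.63)]


def per_gpu(raw_req_s: float, tp: int) -> float:
    """Throughput normalized by the number of GPUs the config occupies."""
    return raw_req_s / tp


def summarize(rows):
    """Map config name -> (req/s/GPU, speedup relative to the first row)."""
    base = per_gpu(rows[0][2], rows[0][1])
    out = {}
    for name, tp, raw in rows:
        value = per_gpu(raw, tp)
        out[name] = (value, value / base)
    return out


if __name__ == "__main__":
    for name, (value, speedup) in summarize(ROWS).items():
        print(f"{name}: {value:.4f} req/s/GPU, {speedup:.1f}x TP1")
```

This prints 0.2925 req/s/GPU (4.5x) for TP2 and 0.9075 req/s/GPU (about 14x) for TP4. Because the TP4 figure is a lower bound, its ratio is a lower bound as well.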

- **Per-GPU throughput rises sharply with TP for the dense 27B**: TP2 = 4.5× TP1,
  TP4 ≥ 14× TP1. This is the opposite of the MoE Qwen3-30B-A3B (TP1 best per-GPU),
  confirming the dense-vs-MoE distinction.
- **Mechanism**: TP1 is TPOT-bound. One H20 cannot decode a 27B under 50 ms/token
  once the batch grows, so it saturates at ~0.065 req/s/GPU. Loosening TTFT (2s→4-5s)
  did *not* change TP1 (still 0.065), confirming that TPOT, not TTFT, is TP1's binding
  constraint. In this sweep, each TP doubling speeds up decode and prefill by enough
  to more than pay for the added GPUs.
- **TP4 saturated** the offered-load ceiling (`best_u=0.123 ≈ 0.125`): it was still
  feasible at the highest load the search could offer, so 0.908 req/s/GPU is a lower
  bound. Measuring the true peak (and TP8) needs a raised `search.high`.
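
To illustrate why a `best_u` close to `search.high` is only a lower bound, here is a minimal bisection sketch. It assumes the harness searches the offered load `u` by halving the interval `[low, high]`, which this document does not state. The `is_feasible` callback, the function name, and `steps=6` are hypothetical. The reported `best_u` values do coincide with what six halvings of `[0, 0.125]` would return, but that is an inference.

```python
from typing import Callable, Tuple


def search_max_feasible(
    is_feasible: Callable[[float], bool],
    low: float,
    high: float,
    steps: int = 6,
) -> Tuple[float, bool]:
    """Bisect [low, high] for the largest feasible offered load.

    Returns (best, saturated). `saturated` is True when `best` sits within
    one search resolution of `high`, meaning the true limit may lie above
    the range that was searched.
    """
    best = low
    lo, hi = low, high
    for _ in range(steps):
        mid = (lo + hi) / 2
        if is_feasible(mid):
            best = mid   # feasible: remember it and search higher
            lo = mid
        else:
            hi = mid     # infeasible: search lower
    resolution = (high - low) / 2 ** steps
    saturated = (high - best) <= resolution * (1 + 1e-9)
    return best, saturated
```

With a config that passes every probe, `search_max_feasible(lambda u: True, 0.0, 0.125)` returns a best of 0.123046875 and flags it as saturated. The search never observed a failure, so it cannot say where the real boundary is.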

## Process findings (fed back into the harness)

- **Bug fixed**: a request exceeding `request_timeout_s` raised a raw `TimeoutError`
  mid-stream that escaped `_run_one_request` and crashed the whole trial; it is now
  wrapped as `HttpClientError` (failed request, not failed trial). Commit `2fcaf80`.
- **Open gap**: killing a `study tune` run orphans the `VLLM::EngineCore` workers
  (SIGTERM/SIGKILL of the loop doesn't tear down the engine). This twice left leaked
  GPU memory on GPUs 0/1 (dead PIDs still pinning KV, only clearable via root
  `nvidia-smi --gpu-reset`). Fix: add a SIGTERM handler to the CLI loop and make
  `_terminate_process_tree` match `EngineCore` workers, not just `vllm serve`.
- **Experiment hygiene**: with `replay_time_scale=1.0` each probe takes real arrival
  time, so `search.high` must bracket the config's boundary (too wide wastes probes
  on a low-capacity config; too low saturates a high-capacity one), and
  `request_timeout_s` must be modest so overloaded probes drain fast.

## Next

- Re-measure TP4 (and TP8) with `search.high` raised (e.g. 0.5) to find the true peak
  per-GPU throughput and the TP knee.
- Run the Stop-B agentic loop on this 27B stack: unlike the 30B (baseline already
  optimal), here the loop should climb TP1→TP2→TP4 and stop, giving a real improving
  trajectory (the original Phase-5 "A" goal).