Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput: TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound (one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated measurements (TP1/TP2). TP4 saturated -> lower bound. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Qwen3.5-27B TP sweep under length-aware TTFT SLO — 2026-06-16
Branch feat/two-stop. Deterministic ground-truth A/B (proposal files, no LLM):
TP1 vs TP2 vs TP4 on the dense Qwen3.5-27B (internal 256k, fp8, spec-decode) at
0–8k chat, vLLM 0.11.1, H20, replay_time_scale=1.0 (no smoke), Stop-A enabled,
pinned to GPUs 2–7.
SLO: TTFT ≤ 4000 + 0.125·L_in ms (= 4s + L_in/8k), TPOT ≤ 50 ms, pass ≥ 95%.
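The SLO above can be written out as a small per-request check. This is a minimal sketch, not the harness's own code: the `RequestResult` type and the function names are hypothetical, and only the constants (4000 ms, 0.125 ms/token, 50 ms, 95%) come from the SLO line.

```python
from dataclasses import dataclass
from typing import Sequence

# Constants taken from the SLO line above.
TTFT_BASE_MS = 4000.0        # fixed TTFT allowance
TTFT_PER_TOKEN_MS = 0.125    # extra allowance per input token (L_in / 8k seconds)
TPOT_LIMIT_MS = 50.0         # per-output-token latency limit
PASS_THRESHOLD = 0.95        # fraction of requests that must meet the SLO


@dataclass(frozen=True)
class RequestResult:
    """Hypothetical per-request record; the harness's real type is not shown."""
    input_tokens: int
    ttft_ms: float
    tpot_ms: float


def ttft_budget_ms(input_tokens: int) -> float:
    """Length-aware TTFT budget in milliseconds."""
    return TTFT_BASE_MS + TTFT_PER_TOKEN_MS * input_tokens


def meets_slo(r: RequestResult) -> bool:
    """A request passes only if both TTFT and TPOT are within budget."""
    return r.ttft_ms <= ttft_budget_ms(r.input_tokens) and r.tpot_ms <= TPOT_LIMIT_MS


def pass_rate(results: Sequence[RequestResult]) -> float:
    """Fraction of requests meeting the SLO (0.0 for an empty probe)."""
    if not results:
        return 0.0
    return sum(meets_slo(r) for r in results) / len(results)


def probe_passes(results: Sequence[RequestResult]) -> bool:
    """A probe is feasible when at least 95% of its requests pass."""
    return pass_rate(results) >= PASS_THRESHOLD
```

For an 8k-token prompt the budget works out to 4000 + 0.125 × 8000 = 5000 ms, which is where the "4–5 s" range for 0–8k chat comes from.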
Result
| config | best_u | raw req/s | req/s/GPU | pass | saturated |
|---|---|---|---|---|---|
| TP1 | 0.00195 | 0.065 | 0.065 | 1.00 | no |
| TP2 | 0.0195 | 0.585 | 0.2925 | 0.96 | no |
| TP4 | 0.123 | 3.63 | ≥0.908 | 0.98 | yes (best_u≈high=0.125) |
- Per-GPU throughput rises sharply with TP for the dense 27B: TP2 = 4.5× TP1, TP4 ≥ 14× TP1. Opposite of the MoE Qwen3-30B-A3B (TP1 best per-GPU) — confirms the dense-vs-MoE distinction.
- Mechanism: TP1 is TPOT-bound — one H20 cannot decode a 27B under 50 ms/token once the batch grows, so it saturates at ~0.065 req/s/GPU. Loosening TTFT (2s→4-5s) did not change TP1 (still 0.065), confirming TPOT — not TTFT — is TP1's binding constraint. Each TP doubling speeds decode+prefill enough to more than recover the added GPUs.
- TP4 saturated the offered-load ceiling (`best_u=0.123 ≈ 0.125`): still feasible after ~the whole trace, so 0.908 is a lower bound. True peak (and TP8) need a raised `search.high` to measure.
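The per-GPU figures and ratios in the bullets above follow directly from the table. The sketch below recomputes them from the raw req/s column. The `SweepRow` type, the function names, and the 5% ceiling tolerance are illustrative assumptions. Only TP4's `search.high` (0.125) is stated in the table, so the ceiling check is applied to TP4 alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepRow:
    """One row of the result table (hypothetical container type)."""
    config: str
    gpus: int
    raw_req_s: float
    best_u: float


# Values copied from the result table.
ROWS = [
    SweepRow("TP1", 1, 0.065, 0.00195),
    SweepRow("TP2", 2, 0.585, 0.0195),
    SweepRow("TP4", 4, 3.63, 0.123),
]


def per_gpu_throughput(row: SweepRow) -> float:
    """Raw throughput divided by the number of GPUs the config occupies."""
    return row.raw_req_s / row.gpus


def speedup_vs(row: SweepRow, baseline: SweepRow) -> float:
    """Per-GPU throughput ratio relative to a baseline config."""
    return per_gpu_throughput(row) / per_gpu_throughput(baseline)


def hit_ceiling(best_u: float, search_high: float, rel_tol: float = 0.05) -> bool:
    """Treat a result as ceiling-saturated when best_u lands within rel_tol
    of search.high. The 5% tolerance is an assumption, not a harness setting."""
    return best_u >= search_high * (1.0 - rel_tol)


if __name__ == "__main__":
    tp1 = ROWS[0]
    for row in ROWS:
        print(row.config, per_gpu_throughput(row), speedup_vs(row, tp1))
    print("TP4 at ceiling:", hit_ceiling(ROWS[2].best_u, 0.125))
```

Because TP4 is flagged as being at the ceiling, its per-GPU value and its ratio to TP1 should both be read as lower bounds.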
Process findings (fed back into the harness)
- Bug fixed: a request exceeding `request_timeout_s` raised a raw `TimeoutError` mid-stream that escaped `_run_one_request` and crashed the whole trial; now wrapped as `HttpClientError` (failed request, not failed trial). Commit `2fcaf80`.
- Open gap: killing a `study tune` run orphans the `VLLM::EngineCore` workers (SIGTERM/SIGKILL of the loop doesn't tear down the engine), which twice left leaked GPU memory on GPUs 0/1 (dead PIDs still pinning KV, only clearable via root `nvidia-smi --gpu-reset`). Fix: SIGTERM handler in the CLI loop + make `_terminate_process_tree` match `EngineCore` workers, not just `vllm serve`.
- Experiment hygiene: scale=1.0 makes each probe take real arrival time; `search.high` must bracket the config's boundary (too wide wastes probes on a low-capacity config; too low saturates a high-capacity one), and `request_timeout_s` must be modest so overloaded probes drain fast.
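The timeout fix in the first finding comes down to converting a raw timeout into a per-request failure that the trial loop already handles. The sketch below illustrates that pattern with `asyncio`. It is not the code from commit `2fcaf80`: `HttpClientError` here is a stand-in class, and `run_one_request` and `run_trial` are hypothetical names for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, Tuple


class HttpClientError(Exception):
    """Stand-in for the harness's error type; the real class is not shown."""


async def run_one_request(
    send: Callable[[], Awaitable[str]], request_timeout_s: float
) -> str:
    """Run one request and convert a timeout into HttpClientError so that it
    counts as a failed request rather than propagating out of the trial."""
    try:
        return await asyncio.wait_for(send(), timeout=request_timeout_s)
    except (asyncio.TimeoutError, TimeoutError) as exc:
        # asyncio.TimeoutError is an alias of TimeoutError on Python 3.11+;
        # catching both keeps the sketch working on older versions.
        raise HttpClientError(f"request exceeded {request_timeout_s}s") from exc


async def run_trial(
    senders: Sequence[Callable[[], Awaitable[str]]], request_timeout_s: float
) -> Tuple[int, int]:
    """Return (succeeded, failed); one slow request no longer aborts the trial."""
    ok = failed = 0
    for send in senders:
        try:
            await run_one_request(send, request_timeout_s)
            ok += 1
        except HttpClientError:
            failed += 1
    return ok, failed
```

With this shape, a timed-out request lowers the probe's pass rate but the remaining requests still run and are still counted.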
Next
- Re-measure TP4 (and TP8) with `search.high` raised (e.g. 0.5) to find the true peak per-GPU and the TP knee.
- Run the Stop-B agentic loop on this 27B stack. Unlike the 30B (baseline already optimal), here the loop should climb TP1→TP2→TP4 and stop, a real improving trajectory (the original Phase-5 "A" goal).
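Why raising `search.high` matters for the TP4 re-measurement can be seen with a toy search. The sketch below assumes a plain bisection over offered load, which the notes do not confirm is the harness's actual search strategy. The `feasible` callback, the step count, the 5% tolerance, and the 0.3 boundary in the usage example are all made up for illustration and are not measurements.

```python
from typing import Callable, Optional, Tuple


def search_best_u(
    feasible: Callable[[float], bool],
    low: float,
    high: float,
    steps: int = 6,
    rel_tol: float = 0.05,
) -> Tuple[Optional[float], bool]:
    """Bisect [low, high] for the largest offered load u where feasible(u) holds.

    Returns (best_u, saturated). `saturated` is True when best_u ends within
    rel_tol of `high`, meaning the true boundary may lie above the range.
    """
    best: Optional[float] = None
    lo, hi = low, high
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if feasible(mid):
            best = mid   # mid passes the SLO; look for a higher load
            lo = mid
        else:
            hi = mid     # mid fails the SLO; look for a lower load
    saturated = best is not None and best >= high * (1.0 - rel_tol)
    return best, saturated


if __name__ == "__main__":
    # Toy config whose true boundary (0.3) is an invented number.
    toy = lambda u: u <= 0.3
    print(search_best_u(toy, 0.0, 0.125))  # ceiling too low: every probe passes
    print(search_best_u(toy, 0.0, 0.5))    # ceiling brackets the boundary
```

When every probe passes, the search can only converge toward `high`, so the reported value reflects the search range rather than the config. A wider range resolves the boundary at the cost of spending some probes on infeasible loads.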