aituner/docs/harness-ablation/qwen27b-tp-sweep-20260616.md
Gahow Wang 93ce339d61 Document 27B TP sweep: per-GPU rises sharply with TP (dense), opposite of MoE
Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput:
TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound
(one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move
TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was
best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated
measurements (TP1/TP2). TP4 saturated -> lower bound.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 01:54:40 +08:00


Qwen3.5-27B TP sweep under length-aware TTFT SLO — 2026-06-16

Branch feat/two-stop. Deterministic ground-truth A/B (proposal files, no LLM): TP1 vs TP2 vs TP4 on the dense Qwen3.5-27B (internal 256k, fp8, spec-decode) at 0-8k chat, vLLM 0.11.1, H20, replay_time_scale=1.0 (no smoke), Stop-A enabled, pinned to GPUs 2-7.

SLO: TTFT ≤ 4000 + 0.125·L_in ms (= 4s + L_in/8k), TPOT ≤ 50 ms, pass ≥ 95%.
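
As a minimal sketch of how this SLO could be evaluated per probe: the thresholds come from the line above, while `RequestResult`, its field names, and the helper functions are hypothetical and not taken from the harness.

```python
from dataclasses import dataclass

TTFT_BASE_MS = 4000.0       # fixed 4 s allowance
TTFT_PER_TOKEN_MS = 0.125   # 0.125 ms per input token, i.e. 1 s per 8k tokens
TPOT_LIMIT_MS = 50.0        # per-output-token latency limit
PASS_THRESHOLD = 0.95       # fraction of requests that must meet the SLO


@dataclass
class RequestResult:
    # Hypothetical per-request record; the harness's real schema is not shown.
    input_tokens: int
    ttft_ms: float
    tpot_ms: float


def ttft_limit_ms(input_tokens: int) -> float:
    """Length-aware TTFT limit: 4000 + 0.125 * L_in milliseconds."""
    return TTFT_BASE_MS + TTFT_PER_TOKEN_MS * input_tokens


def meets_slo(r: RequestResult) -> bool:
    """A request passes only if both its TTFT and TPOT limits hold."""
    return r.ttft_ms <= ttft_limit_ms(r.input_tokens) and r.tpot_ms <= TPOT_LIMIT_MS


def pass_rate(results: list[RequestResult]) -> float:
    """Fraction of requests meeting the SLO (0.0 for an empty probe)."""
    if not results:
        return 0.0
    return sum(meets_slo(r) for r in results) / len(results)


def probe_feasible(results: list[RequestResult]) -> bool:
    """A probe is feasible when at least 95% of its requests pass."""
    return pass_rate(results) >= PASS_THRESHOLD
```

For an 8,000-token prompt the TTFT limit works out to 5,000 ms, so a request with a 4.9 s TTFT passes only if its TPOT also stays at or below 50 ms.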

Result

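| config | best_u | raw req/s | req/s/GPU | pass | saturated |
|--------|--------|-----------|-----------|------|-----------|
| TP1 | 0.00195 | 0.065 | 0.065 | 1.00 | no |
| TP2 | 0.0195 | 0.585 | 0.2925 | 0.96 | no |
| TP4 | 0.123 | 3.63 | ≥0.908 | 0.98 | yes (best_u≈high=0.125) |
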
  • Per-GPU throughput rises sharply with TP for the dense 27B: TP2 = 4.5× TP1, TP4 ≥ 14× TP1. Opposite of the MoE Qwen3-30B-A3B (TP1 best per-GPU) — confirms the dense-vs-MoE distinction.
  • Mechanism: TP1 is TPOT-bound. One H20 cannot decode a 27B model under 50 ms/token once the batch grows, so it saturates at ~0.065 req/s/GPU. Loosening TTFT (2s→4-5s) did not change TP1 (still 0.065), confirming that TPOT, not TTFT, is TP1's binding constraint. Each TP doubling speeds up decode and prefill by more than enough to offset the added GPUs.
  • TP4 saturated the offered-load ceiling (best_u=0.123 ≈ search.high=0.125): it was still feasible after roughly the whole trace, so 0.908 req/s/GPU is a lower bound. Measuring the true peak (and TP8) requires a raised search.high.
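
The per-GPU figures and ratios above follow directly from the raw throughput numbers. A short sketch of the arithmetic, using only the values reported in the result table (the variable names are illustrative):

```python
# (config, tensor-parallel size, raw req/s) as reported in the result table.
measurements = [("TP1", 1, 0.065), ("TP2", 2, 0.585), ("TP4", 4, 3.63)]

# Per-GPU throughput is raw throughput divided by the number of GPUs used.
per_gpu = {name: raw / tp for name, tp, raw in measurements}

# Speedup of each config's per-GPU throughput relative to TP1.
baseline = per_gpu["TP1"]
speedup = {name: value / baseline for name, value in per_gpu.items()}

for name in per_gpu:
    # TP4 is ceiling-saturated, so its figures are lower bounds.
    print(f"{name}: {per_gpu[name]:.4f} req/s/GPU, {speedup[name]:.1f}x TP1")
```

The TP2 ratio comes out at 4.5 and the TP4 ratio at just under 14, matching the figures quoted above.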

Process findings (fed back into the harness)

  • Bug fixed: a request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped _run_one_request and crashed the whole trial; now wrapped as HttpClientError (failed request, not failed trial). Commit 2fcaf80.
  • Open gap: killing a study tune run orphans the VLLM::EngineCore workers (SIGTERM/SIGKILL of the loop doesn't tear down the engine). This twice left leaked GPU memory on GPUs 0/1: dead PIDs were still pinning KV cache, clearable only via root nvidia-smi --gpu-reset. Proposed fix: add a SIGTERM handler to the CLI loop and make _terminate_process_tree match EngineCore workers, not just vllm serve.
  • Experiment hygiene: scale=1.0 makes each probe take real arrival time; search.high must bracket the config's boundary (too wide wastes probes on a low-capacity config; too low saturates a high-capacity one), and request_timeout_s must be modest so overloaded probes drain fast.
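
The timeout fix in the first finding can be pictured with the sketch below. It is not the harness's code: `HttpClientError` here is a stand-in class, and `RequestOutcome`, `run_one_request`, `run_trial`, and the two demo streams are hypothetical names chosen for illustration. The point is only that a mid-stream `TimeoutError` is converted into a failed request rather than propagating out of the trial.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


class HttpClientError(Exception):
    """Stand-in for the harness's client error type (real definition not shown)."""


@dataclass
class RequestOutcome:
    ok: bool
    tokens: int = 0
    error: Optional[str] = None


def _consume_stream(stream: Callable[[], Iterable[str]]) -> int:
    """Count streamed tokens, wrapping a raw timeout as a client error."""
    try:
        return sum(1 for _ in stream())
    except TimeoutError as exc:
        raise HttpClientError(f"request timed out mid-stream: {exc}") from exc


def run_one_request(stream: Callable[[], Iterable[str]]) -> RequestOutcome:
    """Run one request; a client error marks the request failed, nothing more."""
    try:
        return RequestOutcome(ok=True, tokens=_consume_stream(stream))
    except HttpClientError as exc:
        return RequestOutcome(ok=False, error=str(exc))


def run_trial(streams: list[Callable[[], Iterable[str]]]) -> list[RequestOutcome]:
    """The trial keeps going even when individual requests fail."""
    return [run_one_request(s) for s in streams]


def healthy_stream() -> Iterator[str]:
    yield from ["a", "b", "c"]


def stalled_stream() -> Iterator[str]:
    yield "a"
    raise TimeoutError("exceeded request_timeout_s")  # simulated overload


outcomes = run_trial([healthy_stream, stalled_stream])
```

In this sketch the stalled request is recorded with `ok=False` and counts against the pass rate, while the healthy request in the same trial still completes.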

Next

  • Re-measure TP4 (and TP8) with search.high raised (e.g. 0.5) to find the true peak per-GPU and the TP knee.
  • Run the Stop-B agentic loop on this 27B stack: unlike the 30B (baseline already optimal), here the loop should climb TP1→TP2→TP4 and stop — a real improving trajectory (the original Phase-5 "A" goal).
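
To illustrate why a raised search.high is needed, here is a sketch of a bracketing search over offered load with a saturation flag. The source does not describe the harness's actual search procedure, so the bisection, the step count, the tolerance, and all names here (`search_best_u`, `SearchResult`, `saturation_tol`) are assumptions.

```python
from typing import Callable, NamedTuple


class SearchResult(NamedTuple):
    best_u: float     # highest offered load found feasible
    saturated: bool   # True when best_u sits at the search ceiling


def search_best_u(
    is_feasible: Callable[[float], bool],
    low: float = 0.0,
    high: float = 0.125,
    steps: int = 6,
    saturation_tol: float = 0.05,
) -> SearchResult:
    """Bisect offered load u in [low, high] for the largest feasible value.

    Assumes feasibility is monotone: if u passes, every smaller u passes.
    """
    ceiling = high
    best = low
    for _ in range(steps):
        mid = (low + high) / 2
        if is_feasible(mid):
            best = low = mid   # feasible: move the lower bound up
        else:
            high = mid         # infeasible: move the upper bound down
    # If best_u ended up within tolerance of the ceiling, the boundary was
    # never bracketed and best_u is only a lower bound.
    saturated = best >= ceiling * (1 - saturation_tol)
    return SearchResult(best, saturated)


# A config that is feasible everywhere below the ceiling saturates the search.
always_ok = search_best_u(lambda u: True)

# With a raised ceiling and a made-up boundary at u = 0.2, it is bracketed.
bracketed = search_best_u(lambda u: u <= 0.2, high=0.5)
```

When every probe passes, `best_u` converges on the ceiling and the result is flagged as saturated; raising `high` until an infeasible probe appears is what turns the lower bound into a measurement.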