4.X Load-Dependent Optimal Tensor Parallelism under a Fixed GPU Budget

A central question in our setting is why the preferred tensor parallelism (TP) changes with workload intensity, even when the model and hardware remain unchanged. Our key observation is that TP controls two fundamentally different quantities at the same time: the runtime of a single prefill batch, and the number of replicas that can be deployed under a fixed GPU budget. This creates a regime-dependent tradeoff between per-request latency and cluster-level service capacity.

Setup

We consider a cluster with a fixed total GPU budget G. Choosing tensor parallelism degree t implies that each model replica consumes t GPUs, so the number of deployable replicas is


m_t = \frac{G}{t},

assuming for simplicity that t divides G.

We focus on the prefill stage, since it dominates TTFT in our traces and is the primary source of the performance shift observed in Figure~X. For a prefill batch b executed on one replica, let \mathcal{B}_b denote the set of requests in the batch, and let x_i be the prompt length of request i \in \mathcal{B}_b. The total number of prefill tokens in the batch is


Z_b = \sum_{i \in \mathcal{B}_b} x_i.

The TTFT of a request is determined not by an isolated request-level service time, but by the runtime of the prefill batch it belongs to and the waiting time before that batch starts. Accordingly, we model the system at the batch level.

Batch Runtime Model

Let T_b(t) denote the runtime of prefill batch b under TP degree t. We decompose it as


T_b(t) = T_b^{\mathrm{comp}}(t) + T_b^{\mathrm{comm}}(t) + T_b^{\mathrm{rt}}(t),

where T_b^{\mathrm{comp}}(t) is the operator compute time, T_b^{\mathrm{comm}}(t) is the TP communication cost, and T_b^{\mathrm{rt}}(t) captures remaining runtime overheads such as launch gaps and executor overhead.

For decoder-only Transformers, the prefill FLOPs of a request with prompt length x can be approximated as


F(x) = a x d^2 + b x^2 d,

where d is the hidden dimension, the a x d^2 term captures dense projections and MLPs, and the b x^2 d term captures self-attention. For a batch b, the total FLOPs are therefore


F_b = \sum_{i \in \mathcal{B}_b} F(x_i)
= a d^2 \sum_{i \in \mathcal{B}_b} x_i + b d \sum_{i \in \mathcal{B}_b} x_i^2.

This form is important because the batch cost depends not only on the total token count Z_b, but also on the second moment of request lengths. Using


\sum_{i \in \mathcal{B}_b} x_i^2
= n_b \left( \bar{x}_b^2 + \mathrm{Var}_b(x) \right),

where n_b = |\mathcal{B}_b| and \bar{x}_b is the batch mean prompt length, we see that higher within-batch length variance directly increases the physical prefill cost, even when the mean length is fixed.
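To make the variance effect concrete, the sketch below evaluates F_b both directly and through the identity above. The constants a, b, d are illustrative stand-ins (roughly a 32-layer, 4096-hidden model), not fitted values.

```python
import statistics

def batch_prefill_flops(lengths, a=768.0, b=128.0, d=4096):
    """F_b = a * d^2 * sum(x_i) + b * d * sum(x_i^2).

    a ~ 24L and b ~ 4L for an L-layer model are rough illustrative
    choices, not fitted constants.
    """
    s1 = sum(lengths)                 # Z_b: total prefill tokens
    s2 = sum(x * x for x in lengths)  # second moment of lengths
    return a * d ** 2 * s1 + b * d * s2

# Two batches with the same total token count Z_b = 4096 ...
uniform = [1024] * 4          # zero within-batch length variance
skewed = [64, 64, 64, 3904]   # same Z_b, high length variance
assert sum(uniform) == sum(skewed)

# ... but the skewed batch costs strictly more FLOPs, because the
# attention term scales with sum(x_i^2) = n_b * (mean^2 + variance).
print(batch_prefill_flops(skewed) / batch_prefill_flops(uniform))  # > 1

# Numerical check of the identity (population variance).
n, mean, pvar = len(skewed), statistics.fmean(skewed), statistics.pvariance(skewed)
assert abs(sum(x * x for x in skewed) - n * (mean ** 2 + pvar)) < 1e-6
```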

We then express compute time using a roofline-style model:


T_b^{\mathrm{comp}}(t) = \max \left\{ \frac{F_b}{t \Pi_1 \eta_t^{\mathrm{comp}}}, \; \frac{Q_b}{t B_1 \eta_t^{\mathrm{mem}}} \right\},

where \Pi_1 and B_1 are the effective single-GPU compute throughput and memory bandwidth, Q_b is the batch memory traffic, and \eta_t^{\mathrm{comp}}, \eta_t^{\mathrm{mem}} \le 1 are TP efficiency terms. In the ideal case, compute time would scale as 1/t; in practice, the speedup is sublinear because larger TP degrades kernel efficiency and yields less favorable operator shapes.
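A minimal sketch of this roofline term follows; the hardware constants (standing in for \Pi_1 and B_1) are A100-class placeholders, and the 0.95^{t-1} efficiency decay for \eta_t is an assumed curve, not a measurement.

```python
def compute_time(flops, bytes_moved, t,
                 peak_flops=312e12,  # Pi_1: ~A100 BF16 peak, illustrative
                 peak_bw=1.6e12):    # B_1: HBM bytes/s, illustrative
    """Roofline T_b^comp(t): max of compute- and memory-bound times.

    The 0.95 ** (t - 1) decay models eta_t <= 1 (kernel efficiency
    degrading with TP); it is an assumption for illustration only.
    """
    eta = 0.95 ** (t - 1)
    return max(flops / (t * peak_flops * eta),
               bytes_moved / (t * peak_bw * eta))
```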

TP also introduces collective communication overhead. Approximating each collective with a ring all-reduce over payload size n, its cost is


T_{\mathrm{AR}}(n,t) = 2 (t-1)\alpha + 2 \, \frac{t-1}{t} \, \frac{n}{\beta},

where \alpha is the per-hop latency and \beta is the effective link bandwidth. If each layer performs, on average, k such collectives and the activation payload is proportional to batch token volume, n_b \approx q d Z_b, then the communication term becomes


T_b^{\mathrm{comm}}(t) \approx L k \left( 2 (t-1)\alpha + 2 \, \frac{t-1}{t} \, \frac{q d Z_b}{\beta} \right),

where L is the number of layers and q is the number of bytes per element. Unlike compute, this term does not scale down with 1/t; instead, it grows with both the TP degree and the batch token volume Z_b.
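A matching sketch of the communication term, assuming Megatron-style TP with k = 2 collectives per layer and illustrative NVLink-class values for \alpha and \beta:

```python
def allreduce_time(n_bytes, t, alpha=5e-6, beta=600e9):
    """Ring all-reduce: 2(t-1)*alpha + 2*((t-1)/t)*n/beta.

    alpha (per-hop latency) and beta (effective link bandwidth) are
    illustrative NVLink-class values, not measurements.
    """
    if t == 1:
        return 0.0  # a single GPU needs no collective
    return 2 * (t - 1) * alpha + 2 * (t - 1) / t * n_bytes / beta

def comm_time(z_b, t, L=32, k=2, q=2, d=4096):
    """T_b^comm(t): k collectives per layer over payload n_b ~ q*d*Z_b."""
    n_bytes = q * d * z_b  # activation payload per collective, in bytes
    return L * k * allreduce_time(n_bytes, t)
```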

Combining the two, the prefill batch runtime is


T_b(t) = \max \left\{ \frac{a d^2 \sum_i x_i + b d \sum_i x_i^2}{t \Pi_1 \eta_t^{\mathrm{comp}}}, \; \frac{Q_b}{t B_1 \eta_t^{\mathrm{mem}}} \right\} + L k \left( 2 (t-1)\alpha + 2 \, \frac{t-1}{t} \, \frac{q d Z_b}{\beta} \right) + T_b^{\mathrm{rt}}(t).

This equation makes the TP tradeoff explicit: larger TP reduces the compute-dominated component of a single batch, but it also introduces collective cost and provides only sublinear speedup.
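Composing the pieces gives a toy end-to-end model. This sketch reuses batch_prefill_flops, compute_time, and comm_time from above; weight_bytes is a placeholder for Q_b (weight traffic of a ~7B model in 2-byte precision) and t_rt a constant stand-in for T_b^{\mathrm{rt}}, both assumptions.

```python
def batch_runtime(lengths, t, weight_bytes=14e9, t_rt=2e-3):
    """T_b(t) = roofline compute + TP communication + runtime overhead.

    weight_bytes approximates Q_b; t_rt lumps launch and executor
    overheads into a constant. Both are placeholder assumptions.
    """
    flops = batch_prefill_flops(lengths)
    return (compute_time(flops, weight_bytes, t)
            + comm_time(sum(lengths), t)
            + t_rt)

# Larger TP shrinks the compute term but pays a growing collective
# cost, so the per-batch speedup over TP=1 is sublinear in t.
for t in (1, 2, 4, 8):
    print(t, round(batch_runtime([512, 1024, 2048, 512], t) * 1e3, 1), "ms")
```

Under these toy constants the per-batch speedup from TP=1 to TP=8 is roughly 4x on 8x the GPUs, which is exactly the sublinearity the capacity argument below turns on.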

From Batch Runtime to TTFT

For a request i assigned to replica r, let b(i) denote the prefill batch that eventually serves it. Its TTFT can be written as


\mathrm{TTFT}_i(t) = W_{q,i}(t) + T_{b(i)}(t),

where W_{q,i}(t) is the waiting time before the batch starts. More precisely,


W_{q,i}(t) = R_i(t) + \sum_{u \in \mathcal{H}_i} T_u(t),

where R_i(t) is the residual runtime of the currently executing batch at arrival, and \mathcal{H}_i is the set of batches queued ahead of request i on the same replica.

Under a renewal approximation, the expected residual time is


\mathbb{E}[R(t)] = \frac{\mathbb{E}[T_b(t)^2]}{2 \, \mathbb{E}[T_b(t)]}.

Thus, tail TTFT is affected not only by the mean batch runtime, but also by its second moment. This is crucial for heterogeneous workloads: higher length variance increases both T_b(t) and \mathbb{E}[T_b(t)^2], which amplifies queueing tails.
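The sketch below illustrates the effect with two synthetic batch-runtime samples of roughly equal mean but different variance; the lognormal parameters are arbitrary and purely illustrative.

```python
import random

def mean_residual(runtimes):
    """Renewal approximation: E[R] = E[T^2] / (2 * E[T])."""
    m1 = sum(runtimes) / len(runtimes)
    m2 = sum(x * x for x in runtimes) / len(runtimes)
    return m2 / (2 * m1)

random.seed(0)
# Two synthetic batch-runtime samples with mean ~50 ms each
# (lognormal mean = exp(mu + sigma^2 / 2)); parameters are arbitrary.
low_var = [random.lognormvariate(-3.0, 0.1) for _ in range(100_000)]
high_var = [random.lognormvariate(-3.5, 1.0) for _ in range(100_000)]

# Same mean runtime, yet the high-variance sample inflates the
# expected residual wait by roughly 2.7x (analytically E[T^2]/2E[T]).
print(f"residual (low var):  {mean_residual(low_var) * 1e3:.1f} ms")
print(f"residual (high var): {mean_residual(high_var) * 1e3:.1f} ms")
```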

Cluster Capacity under a Fixed GPU Budget

The key systems-level quantity is not the latency of one replica, but the total service capacity of the cluster under fixed G. Let


\mu_{t,w} = \mathbb{E} \left[ \frac{Z_b}{T_b(t)} \;\middle|\; w \right]

denote the average prefill token throughput of a single replica under workload window w. Since the cluster can host only m_t = G/t replicas, the aggregate prefill capacity is


\Lambda_{t,w} = \frac{G}{t} \mu_{t,w}.

This expression exposes the central tradeoff. Increasing TP may improve single-replica throughput \mu_{t,w} by reducing batch runtime, but it simultaneously reduces the replica count by a factor of t. Therefore, larger TP improves cluster-wide capacity only if


\mu_{t,w} > t \mu_{1,w},

i.e., if the per-replica throughput gain is superlinear in t. In practice, this is rarely achievable: compute speedup is at best linear in t, while communication and runtime overheads make the end-to-end speedup strictly sublinear. As a result, under saturation, larger TP usually cannot improve total cluster capacity and often reduces it.
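With the batch_runtime sketch from above, this condition can be tabulated directly for a fixed batch mix; under those toy constants the speedup stays well below t for every TP degree, so capacity is maximized at TP = 1.

```python
G = 8                           # total GPU budget
batch = [512, 1024, 2048, 512]  # a representative prefill batch
z_b = sum(batch)

# Per-replica prefill throughput mu_t = Z_b / T_b(t), illustrative.
mu = {t: z_b / batch_runtime(batch, t) for t in (1, 2, 4, 8)}

for t in (1, 2, 4, 8):
    cap = (G / t) * mu[t]       # Lambda_{t,w} = (G/t) * mu_{t,w}
    # TP degree t beats TP=1 on capacity only if mu_t > t * mu_1.
    print(f"t={t}: mu={mu[t]:8.0f} tok/s  Lambda={cap:8.0f} tok/s  "
          f"superlinear={mu[t] > t * mu[1]}")
```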

Regime Shift: Why Large TP Helps at Low Load but Hurts at High Load

This model immediately explains the phase transition in our measurements. Let \lambda_w denote the offered prefill-token arrival rate of workload window w. The observed goodput per GPU is approximately


g_{t,w}^{\mathrm{obs}} \approx \frac{1}{G} \min \{ \lambda_w, \Lambda_{t,w} \}.

In the light-load regime, where


\lambda_w \ll \Lambda_{t,w},

all TP choices have sufficient cluster capacity, so the observed goodput per GPU is nearly identical across TP settings. In this regime, TTFT is dominated by the runtime of a single prefill batch, and larger TP is beneficial because it reduces T_b(t).

In contrast, in the high-load regime, where


\lambda_w \approx \Lambda_{t,w},

the system becomes capacity- and queueing-limited. Since larger TP typically lowers \Lambda_{t,w}, it pushes the system closer to saturation, increasing queue depth and amplifying tail TTFT. Consequently, smaller or intermediate TP becomes preferable because it provides more replicas, higher aggregate concurrency, and lower queueing pressure.
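Sweeping the offered load against these capacities reproduces the regime shift; the sketch below reuses mu and G from the previous snippet, and the two load points are arbitrary.

```python
def goodput_per_gpu(lam, t):
    """g_{t,w}^obs ~ min(lambda_w, Lambda_{t,w}) / G."""
    return min(lam, (G / t) * mu[t]) / G

# Light load (5k tok/s): every TP degree serves the full offered load,
# so per-GPU goodput ties and the lower T_b(t) of large TP wins TTFT.
# Heavy load (170k tok/s): for larger t the min() saturates at
# Lambda_{t,w}, which shrinks with t, so smaller TP wins on capacity.
for lam in (5_000, 170_000):
    print(lam, {t: round(goodput_per_gpu(lam, t)) for t in (1, 2, 4, 8)})
```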

The model also explains why the effect is stronger for heterogeneous workloads such as coder traces. Because batch cost depends on \sum_i x_i^2 rather than only \sum_i x_i, workloads with larger prompt-length variance induce larger batch runtime variance, which increases both execution time and residual waiting time. Such workloads therefore enter the queueing-dominated regime earlier, causing the optimal TP to shift toward smaller values at lower offered load.

Implication

The above analysis suggests that TP should not be tuned as a static model-specific constant. Instead, the preferred TP is workload-regime dependent: larger TP is favored in service-time-dominated regimes, while smaller or intermediate TP is favored in queueing-dominated regimes. This observation is the basis for our tuner design: rather than searching TP blindly, we first infer which regime the workload belongs to, and then restrict the TP search to the region consistent with the predicted latency-capacity tradeoff.
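To make the last point concrete, here is a minimal sketch of such a regime gate; the headroom threshold, the candidate set, and the reuse of the capacity model above are expository assumptions, not a description of the tuner's actual interface.

```python
def restrict_tp_search(offered_load, tp_options=(1, 2, 4, 8), headroom=0.8):
    """Gate the TP search by predicted regime (hypothetical sketch).

    A TP degree is 'safe' if the offered load stays below a headroom
    fraction of its cluster capacity Lambda_{t,w}; headroom=0.8 is an
    arbitrary illustrative threshold.
    """
    safe = [t for t in tp_options
            if offered_load < headroom * (G / t) * mu[t]]
    if safe:
        # Service-time-dominated regime: capacity is ample for every
        # safe choice, so search them favoring larger TP (lower
        # per-batch prefill runtime).
        return sorted(safe, reverse=True)
    # Queueing-dominated regime: no choice has headroom; search from
    # the highest-capacity (smallest-TP, most-replicas) end instead.
    return sorted(tp_options, key=lambda t: (G / t) * mu[t], reverse=True)

print(restrict_tp_search(5_000))    # light load -> try TP=8 first
print(restrict_tp_search(170_000))  # heavy load -> try TP=1 first
```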