Initial commit: obsidian to gitea

This commit is contained in:
2026-05-07 15:04:41 +08:00
commit a57afa86b4

### 4.X Load-Dependent Optimal Tensor Parallelism under a Fixed GPU Budget
A central question in our setting is why the preferred tensor parallelism (TP) changes with workload intensity, even when the model and hardware remain unchanged. Our key observation is that TP controls two fundamentally different quantities at the same time: the runtime of a single prefill batch, and the number of replicas that can be deployed under a fixed GPU budget. This creates a regime-dependent tradeoff between per-request latency and cluster-level service capacity.
#### Setup
We consider a cluster with a fixed total GPU budget $G$. Choosing tensor parallelism degree $t$ implies that each model replica consumes $t$ GPUs, so the number of deployable replicas is
$$
m_t = \left\lfloor \frac{G}{t} \right\rfloor,
$$
which reduces to $G/t$ when $t$ divides $G$; we assume this divisibility throughout for simplicity.
We focus on the prefill stage, since it dominates TTFT in our traces and is the primary source of the performance shift observed in Figure~X. For a prefill batch $b$ executed on one replica, let $\mathcal{B}_b$ denote the set of requests in the batch, and let $x_i$ be the prompt length of request $i \in \mathcal{B}_b$. The total number of prefill tokens in the batch is
$$
Z_b = \sum_{i \in \mathcal{B}_b} x_i.
$$
The TTFT of a request is determined not by an isolated request-level service time, but by the runtime of the prefill batch it belongs to and the waiting time before that batch starts. Accordingly, we model the system at the batch level.
#### Batch Runtime Model
Let $T_b(t)$ denote the runtime of prefill batch $b$ under TP degree $t$. We decompose it as
$$
T_b(t) = T_b^{\mathrm{comp}}(t) + T_b^{\mathrm{comm}}(t) + T_b^{\mathrm{rt}}(t),
$$
where $T_b^{\mathrm{comp}}(t)$ is the operator compute time, $T_b^{\mathrm{comm}}(t)$ is the TP communication cost, and $T_b^{\mathrm{rt}}(t)$ captures remaining runtime overheads such as launch gaps and executor overhead.
For decoder-only Transformers, the prefill FLOPs of a request with prompt length $x$ can be approximated as
$$
F(x) = a x d^2 + b x^2 d,
$$
where $d$ is the hidden dimension, the $a x d^2$ term captures dense projections and MLPs, and the $b x^2 d$ term captures self-attention. For a batch $b$, the total FLOPs are therefore
$$
F_b = \sum_{i \in \mathcal{B}_b} F(x_i)
= a d^2 \sum_{i \in \mathcal{B}_b} x_i + b d \sum_{i \in \mathcal{B}_b} x_i^2.
$$
This form is important because the batch cost depends not only on the total token count $Z_b$, but also on the second moment of request lengths. Using
$$
\sum_{i \in \mathcal{B}_b} x_i^2
= n_b \left( \bar{x}_b^2 + \mathrm{Var}_b(x) \right),
$$
where $n_b = |\mathcal{B}_b|$ and $\bar{x}_b$ is the batch mean prompt length, we see that higher within-batch length variance directly increases the physical prefill cost, even when the mean length is fixed.
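The dependence of batch cost on the second moment can be checked numerically. The sketch below uses hypothetical coefficients `a` and `b` (real values depend on the exact architecture) and compares two batches with identical total token count but different within-batch length variance:

```python
# Sketch of the prefill FLOPs model F(x) = a*x*d^2 + b*x^2*d.
# The coefficients a and b are illustrative placeholders, not measured values.

def request_flops(x, d, a=24.0, b=4.0):
    """FLOPs for one request with prompt length x and hidden dimension d."""
    return a * x * d**2 + b * x**2 * d

def batch_flops(lengths, d, a=24.0, b=4.0):
    """Total FLOPs of a prefill batch: depends on sum(x) AND sum(x^2)."""
    s1 = sum(lengths)                    # Z_b, total token count
    s2 = sum(x * x for x in lengths)     # second moment of lengths
    return a * d**2 * s1 + b * d * s2

d = 4096
uniform = [1024, 1024, 1024, 1024]       # zero variance
skewed  = [64, 64, 64, 3904]             # same total tokens, high variance
assert sum(uniform) == sum(skewed)
# The skewed batch is strictly more expensive because sum(x^2) is larger,
# even though Z_b and the mean length are identical.
```

This is exactly the variance effect in the equation above: at fixed $Z_b$, increasing $\mathrm{Var}_b(x)$ increases $\sum_i x_i^2$ and hence the physical prefill cost.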
We then express compute time using a roofline-style model:
$$
T_b^{\mathrm{comp}}(t)
=
\max
\left\{
\frac{F_b}{t \Pi_1 \eta_t^{\mathrm{comp}}},
\frac{Q_b}{t B_1 \eta_t^{\mathrm{mem}}}
\right\},
$$
where $\Pi_1$ and $B_1$ are the effective single-GPU compute and memory bandwidth, $Q_b$ is the batch memory traffic, and $\eta_t^{\mathrm{comp}}, \eta_t^{\mathrm{mem}} \le 1$ are TP efficiency terms. In the ideal case, compute time would decrease as $1/t$; in practice, the decrease is only sublinear due to degraded kernel efficiency and shape effects.
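A minimal numerical sketch of this roofline model follows; the peak rates $\Pi_1$, $B_1$ and the efficiency curves $\eta_t$ are illustrative assumptions, not measurements of any particular GPU:

```python
# Roofline-style compute-time sketch: max of compute-bound and memory-bound
# terms. Pi1 (FLOP/s), B1 (bytes/s), and the eta curves are hypothetical.

def t_comp(F_b, Q_b, t, Pi1=3e14, B1=2e12, eta_comp=None, eta_mem=None):
    """Runtime of one prefill batch on t GPUs under tensor parallelism."""
    # Assumed sublinear efficiency: sharded kernels lose efficiency as
    # per-GPU shapes shrink, so eta_t < 1 for t > 1.
    eta_c = eta_comp if eta_comp is not None else 1.0 / (1.0 + 0.1 * (t - 1))
    eta_m = eta_mem  if eta_mem  is not None else 1.0 / (1.0 + 0.05 * (t - 1))
    return max(F_b / (t * Pi1 * eta_c), Q_b / (t * B1 * eta_m))
```

With these placeholder efficiencies, going from $t=1$ to $t=2$ yields less than a $2\times$ speedup, which is the sublinearity the text describes.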
TP also introduces collective communication overhead. Approximating each collective with a ring all-reduce over payload size $n$, its cost is
$$
T_{\mathrm{AR}}(n,t)
=
2 (t-1)\alpha
+
2 \frac{t-1}{t} \frac{n}{\beta},
$$
where $\alpha$ is the per-hop latency and $\beta$ is the effective link bandwidth. If each layer performs, on average, $k$ such collectives and the activation payload is proportional to batch token volume, $n_b \approx q d Z_b$, then the communication term becomes
$$
T_b^{\mathrm{comm}}(t)
\approx
L k
\left(
2 (t-1)\alpha
+
2 \frac{t-1}{t} \frac{q d Z_b}{\beta}
\right),
$$
where $L$ is the number of layers and $q$ is the bytes per element. Unlike compute, this term does not scale down with $1/t$: the latency term grows with the TP degree $t$, and the bandwidth term grows with the batch token volume $Z_b$.
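The communication model can be sketched directly; the interconnect parameters $\alpha$, $\beta$ and the layer/collective counts below are placeholder values, not measurements of any particular fabric:

```python
# Ring all-reduce cost model: 2(t-1)*alpha latency term plus
# 2*((t-1)/t)*n/beta bandwidth term. alpha and beta are hypothetical.

def t_allreduce(n_bytes, t, alpha=5e-6, beta=2e11):
    """Cost of one ring all-reduce over n_bytes across t ranks."""
    if t == 1:
        return 0.0  # no collective needed on a single GPU
    return 2 * (t - 1) * alpha + 2 * (t - 1) / t * n_bytes / beta

def t_comm(n_bytes, t, L=32, k=2, alpha=5e-6, beta=2e11):
    """Per-batch TP communication: L layers, k collectives per layer."""
    return L * k * t_allreduce(n_bytes, t, alpha, beta)
```

Note the asymmetry: the bandwidth factor $(t-1)/t$ saturates as $t$ grows, but the latency term $2(t-1)\alpha$ keeps increasing, so total communication cost is monotone in $t$.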
Combining the two, the prefill batch runtime is
$$
T_b(t)
=
\max
\left\{
\frac{
a d^2 \sum_i x_i + b d \sum_i x_i^2
}{
t \Pi_1 \eta_t^{\mathrm{comp}}
},
\frac{Q_b}{t B_1 \eta_t^{\mathrm{mem}}}
\right\}
+
L k
\left(
2 (t-1)\alpha
+
2 \frac{t-1}{t} \frac{q d Z_b}{\beta}
\right)
+
T_b^{\mathrm{rt}}(t).
$$
This equation makes the TP tradeoff explicit: larger TP reduces the compute-dominated component of a single batch, but it also introduces collective cost and provides only sublinear speedup.
#### From Batch Runtime to TTFT
For a request $i$ assigned to replica $r$, let $b(i)$ denote the prefill batch that eventually serves it. Its TTFT can be written as
$$
\mathrm{TTFT}_i(t) = W_{q,i}(t) + T_{b(i)}(t),
$$
where $W_{q,i}(t)$ is the waiting time before the batch starts. More precisely,
$$
W_{q,i}(t) = R_i(t) + \sum_{u \in \mathcal{H}_i} T_u(t),
$$
where $R_i(t)$ is the residual runtime of the currently executing batch at arrival, and $\mathcal{H}_i$ is the set of batches queued ahead of request $i$ on the same replica.
Under a renewal approximation, the expected residual runtime seen by a random arrival is
$$
\mathbb{E}[R_i(t)] = \frac{\mathbb{E}[T_b(t)^2]}{2\, \mathbb{E}[T_b(t)]}.
$$
Thus, tail TTFT is affected not only by the mean batch runtime, but also by its second moment. This is crucial for heterogeneous workloads: higher length variance increases both $T_b(t)$ and $\mathbb{E}[T_b(t)^2]$, which amplifies queueing tails.
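The variance effect on residual waiting time is easy to verify numerically. The sketch below applies the renewal formula to two illustrative batch-runtime distributions with the same mean:

```python
# Numerical check of the renewal residual-time formula
# E[R] = E[T^2] / (2 E[T]) on hypothetical batch-runtime samples.

def expected_residual(runtimes):
    """Residual time: second moment over twice the first moment."""
    m1 = sum(runtimes) / len(runtimes)           # E[T]
    m2 = sum(x * x for x in runtimes) / len(runtimes)  # E[T^2]
    return m2 / (2 * m1)

homogeneous = [1.0, 1.0, 1.0, 1.0]   # constant batch runtime, mean 1.0
bursty      = [0.1, 0.1, 0.1, 3.7]   # same mean 1.0, high variance
# Equal-mean workloads: the high-variance one has a much larger
# expected residual, hence worse waiting-time tails.
```

For the constant distribution the residual is exactly half a batch runtime; the bursty distribution with the same mean yields a residual several times larger, which is the tail-amplification mechanism described above.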
#### Cluster Capacity under a Fixed GPU Budget
The key systems-level quantity is not the latency of one replica, but the total service capacity of the cluster under fixed $G$. Let
$$
\mu_{t,w}
=
\mathbb{E}
\left[
\frac{Z_b}{T_b(t)}
\mid w
\right]
$$
denote the average prefill token throughput of a single replica under workload window $w$. Since the cluster can host only $m_t = G/t$ replicas, the aggregate prefill capacity is
$$
\Lambda_{t,w} = \frac{G}{t} \mu_{t,w}.
$$
This expression exposes the central tradeoff. Increasing TP may improve single-replica throughput $\mu_{t,w}$ by reducing batch runtime, but it simultaneously reduces the replica count by a factor of $t$. Therefore, larger TP improves cluster-wide capacity only if
$$
\mu_{t,w} > t \mu_{1,w},
$$
i.e., if the per-replica throughput gain is superlinear in $t$. In practice, this is rarely achievable: compute speedup is at best linear, while communication and runtime overheads make it strictly sublinear. As a result, under saturation, larger TP usually cannot improve total cluster capacity and often reduces it.
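A small numerical example makes the capacity tradeoff concrete. The per-replica throughputs below are hypothetical, chosen only to reflect the sublinear scaling argued above:

```python
# Illustration of the fixed-budget capacity condition mu_t > t * mu_1.
# The per-replica throughputs (tokens/s) are hypothetical numbers.

def cluster_capacity(G, t, mu_t):
    """Aggregate prefill capacity: (G // t) replicas, each at mu_t tokens/s."""
    return (G // t) * mu_t

G = 8
mu = {1: 10_000, 2: 17_000, 4: 26_000}   # sublinear per-replica scaling
caps = {t: cluster_capacity(G, t, m) for t, m in mu.items()}
# No mu_t here is superlinear in t (17_000 < 2*10_000, 26_000 < 4*10_000),
# so TP=1 maximizes aggregate cluster capacity under the fixed budget.
```

Even though TP=4 delivers the fastest single replica, it fields only two replicas out of eight GPUs, and its aggregate capacity is the lowest of the three configurations.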
#### Regime Shift: Why Large TP Helps at Low Load but Hurts at High Load
This model immediately explains the phase transition in our measurements. Let $\lambda_w$ denote the offered prefill-token arrival rate of workload window $w$. The observed goodput per GPU is approximately
$$
g_{t,w}^{\mathrm{obs}}
\approx
\frac{1}{G}
\min \{ \lambda_w, \Lambda_{t,w} \}.
$$
In the light-load regime, where
$$
\lambda_w \ll \Lambda_{t,w},
$$
all TP choices have sufficient cluster capacity, so the observed goodput per GPU is nearly identical across TP settings. In this regime, TTFT is dominated by the runtime of a single prefill batch, and larger TP is beneficial because it reduces $T_b(t)$.
In contrast, in the high-load regime, where
$$
\lambda_w \approx \Lambda_{t,w},
$$
the system becomes capacity- and queueing-limited. Since larger TP typically lowers $\Lambda_{t,w}$, it pushes the system closer to saturation, increasing queue depth and amplifying tail TTFT. Consequently, smaller or intermediate TP becomes preferable because it provides more replicas, higher aggregate concurrency, and lower queueing pressure.
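The two regimes fall directly out of the goodput expression. The sketch below uses hypothetical capacities for two TP settings, with the larger TP having lower aggregate capacity as argued above:

```python
# Sketch of observed per-GPU goodput g = min(lambda, Lambda) / G.
# The capacities and offered rates are hypothetical illustrative numbers.

def goodput_per_gpu(offered_rate, capacity, G):
    """Served prefill tokens per second per GPU, clipped by capacity."""
    return min(offered_rate, capacity) / G

G = 8
Lambda = {2: 70_000, 4: 50_000}   # assume larger TP has lower capacity
light, heavy = 20_000, 90_000     # offered prefill-token arrival rates
# Light load: both TP settings serve the full offered rate, so per-GPU
# goodput is identical and the choice is decided by batch latency alone.
# Heavy load: goodput is capacity-clipped, so the lower-TP (higher-capacity)
# configuration wins.
```

This reproduces the phase transition: TP choice is latency-driven below saturation and capacity-driven near it.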
The model also explains why the effect is stronger for heterogeneous workloads such as coder traces. Because batch cost depends on $\sum_i x_i^2$ rather than only $\sum_i x_i$, workloads with larger prompt-length variance induce larger batch runtime variance, which increases both execution time and residual waiting time. Such workloads therefore enter the queueing-dominated regime earlier, causing the optimal TP to shift toward smaller values at lower offered load.
#### Implication
The above analysis suggests that TP should not be tuned as a static model-specific constant. Instead, the preferred TP is workload-regime dependent: larger TP is favored in service-time-dominated regimes, while smaller or intermediate TP is favored in queueing-dominated regimes. This observation is the basis for our tuner design: rather than searching TP blindly, we first infer which regime the workload belongs to, and then restrict the TP search to the region consistent with the predicted latency-capacity tradeoff.