## First, tighten the core thread of the paper into a rigorous version

Your current intuition is right, but the original phrasing has a key logical gap:

> You cannot simply say "hand $hardware \times engine \times model$ to AITuner, and $workload \times SLO$ to principles."

The problem with that phrasing is that **the principles cannot hold independently of $hardware \times engine \times model$ either**.

A more accurate statement is:

> **$hardware \times engine \times model$ determines the shape of the performance surface and the set of feasible configurations; $workload \times SLO$ determines which operating regime the system currently falls in, and which tradeoffs are dominating the optimal configuration.**

In other words, the two groups of variables play different roles. It is not that one matters and the other does not; rather:

* **$hardware, engine, model$ are platform-defining axes**
* **$workload, SLO$ are regime-defining axes**

This is the structure the paper most needs to hold on to.

---
## I. Formalize the problem: what the paper is actually solving

Let:

* $h$ denote hardware
* $e$ denote engine
* $m$ denote model
* $w$ denote the workload signature
* $s$ denote the SLO profile
* $\theta$ denote the serving configuration

Then the optimal configuration is:

$$
\theta^*(h,e,m,w,s) =
\arg\max_{\theta \in \Theta(h,e,m)} G(\theta; h,e,m,w)
\quad
\text{s.t.} \quad L(\theta; h,e,m,w) \le s
$$

where:

* $\Theta(h,e,m)$ is **the configuration space that is legal and deployable on a given platform**
* $G(\cdot)$ is a throughput / goodput objective
* $L(\cdot)$ is a latency tail, e.g. $p95$ TTFT

This formulation directly establishes two things:

### **1. Principles cannot ignore $h,e,m$**

Because these determine:

* which knobs exist
* which combinations are legal
* the benefit and cost of moving each knob
* where the crossover points lie

### **2. A simulator/emulator cannot ignore $w,s$**

Because these determine:

* which bottleneck is activated
* whether latency headroom is tight
* whether the current regime is service-time-dominated or queueing-dominated
* which tradeoff is "the one that matters right now"
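The constrained objective can be made concrete with a minimal sketch. Everything here is a hypothetical illustration, not the paper's method: the configuration names, the measured numbers, and the helper `best_config` are all placeholders for "enumerate the platform-legal set $\Theta(h,e,m)$, keep the best SLO-feasible point."

```python
# Minimal sketch of the constrained argmax: maximize goodput over the
# platform-legal configuration set, subject to the latency SLO.
# All names, configurations, and numbers are hypothetical placeholders.

def best_config(theta_space, goodput, p95_ttft, slo_ms):
    """Return the feasible config with the highest goodput, or None.

    theta_space: candidate configurations, already filtered for legality
                 on the given hardware/engine/model.
    goodput, p95_ttft: callables theta -> measured or estimated metric.
    slo_ms: the latency budget s (here a p95 TTFT bound).
    """
    best, best_g = None, float("-inf")
    for theta in theta_space:
        if p95_ttft(theta) > slo_ms:   # violates L(theta) <= s
            continue
        g = goodput(theta)
        if g > best_g:
            best, best_g = theta, g
    return best                        # None means "infeasible regime"

# Toy example: three TP settings with made-up measurements.
metrics = {
    ("tp", 2): {"goodput": 900.0, "p95": 450.0},
    ("tp", 4): {"goodput": 700.0, "p95": 250.0},
    ("tp", 8): {"goodput": 500.0, "p95": 150.0},
}
winner = best_config(
    metrics.keys(),
    goodput=lambda t: metrics[t]["goodput"],
    p95_ttft=lambda t: metrics[t]["p95"],
    slo_ms=300.0,
)
# Under the tight 300 ms SLO, TP=2 is infeasible and TP=4 wins on goodput;
# relaxing the SLO to 500 ms would flip the winner to TP=2.
```

Note how the SLO, not the goodput ranking alone, decides the winner: this is exactly the $w,s$ dependence the formulation encodes.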
---
## II. The thesis you actually want should be rewritten as follows

## **Refined thesis**

> **A full predictive emulator over $hardware \times engine \times model \times workload \times SLO$ is brittle in the face of rapid engine/model evolution, changing knob semantics, and shifting legality constraints. However, the entire space is not equally hard: $hardware \times engine \times model$ primarily shapes the response surface, while $workload \times SLO$ determines which tradeoff regime is active. We therefore extract directional principles over workload--SLO regimes, and use an online AITuner to instantiate and calibrate those principles for each concrete hardware--engine--model setting.**

This statement is far more rigorous than "hand the former to AITuner and the latter to principles," because it makes explicit that:

* the principles are not replacing HEM
* AITuner is not replacing the WS principles either
* the two are in a **layered collaboration**

---
## III. Why the "full-space emulator" route is unstable

It is not enough to say "because engines and models evolve quickly."

A more rigorous argument has four layers.

### **(1) The configuration space itself is changing**

As engines and models evolve:

* new knobs are introduced
* the semantics of old knobs change
* the constraints between knobs change
* some combinations go from legal to illegal, or the reverse

So you are not predicting over a fixed $\Theta$, but over a constantly shifting $\Theta(h,e,m)$.

---

### **(2) The mechanism coefficients are changing**

Even when knob names stay the same, the performance response changes. For example:

* the engine changes its scheduler
* the collective backend changes
* the CUDA graph path changes
* the model changes its MoE routing / attention kernel / KV layout

This directly changes:

* the benefit curve of TP
* the communication cost of EP
* the queueing behavior of batching knobs
* the marginal benefit of runtime knobs

In other words, the "geometry" of the performance surface is changing.

---

### **(3) You need not just ranking, but feasibility**

For serving, the crucial question is not just "which is faster," but:

> **which configurations are still feasible under a given SLO**

And the feasibility boundary is often the most fragile and hardest part to simulate, because it is highly sensitive to:

* tail latency
* burstiness
* queueing
* runtime jitter

---

### **(4) The maintenance cost of a simulator/emulator keeps rising**

Even if an emulator works on one generation of HEM, it must then keep chasing:

* new models
* new kernels
* new engine releases
* new hardware interconnects
* new serving paths

So the question is not "can an emulator be built," but:

> **can you continuously maintain an emulator that remains trustworthy on a rapidly evolving stack**

This is one of the key motivations of the paper.

---
## IV. The paper's real contribution is not "no model," but "a different abstraction level"

You are not saying:

* no modeling
* no white-box analysis
* hand everything to online search

What you should actually say is:

> **We give up brittle, precise response prediction over the full five-dimensional space, and instead distill more stable, operating-regime-oriented directional principles.**

This is an important shift of abstraction level:

### **Not**

predicting the entire function:

$$
(h,e,m,w,s,\theta) \mapsto \text{performance}
$$

### **But rather**

distilling a more stable set of rules:

$$
(w,s) \mapsto \text{which tradeoff regime is active}
$$

and then letting AITuner determine, for a given $(h,e,m)$:

* the crossover points
* the feasible boundary
* the exact winner

So AITuner's role is not "brute-force fallback," but:

> **calibrating the boundaries of the principles on the concrete platform, and carrying out the final stretch of precise search.**
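The $(w,s) \mapsto \text{regime}$ mapping can be sketched as a small classifier. The feature names and threshold values below are invented for illustration; on the paper's own terms, only the *direction* of each rule is claimed to be stable, while the exact cutoffs are platform-dependent and would be calibrated online by AITuner.

```python
# Illustrative sketch of the (w, s) -> regime mapping.
# Feature names and thresholds are hypothetical, not from the paper.

def active_regime(workload, slo):
    """Classify which tradeoff regime currently dominates.

    workload: dict of workload-signature features (utilization,
              request-length CV, share of MoE expert tokens, ...).
    slo: dict with the latency budget expressed as a headroom ratio
         (budget / current single-replica latency).
    """
    if slo["headroom_ratio"] < 1.2:          # budget barely above service time
        return "latency-headroom-limited"
    if workload["utilization"] > 0.85:       # near saturation
        return "capacity-limited"
    if workload["length_cv"] > 1.5:          # heavy-tailed request lengths
        return "queueing-tail-limited"
    if workload["moe_token_share"] > 0.5:    # expert traffic dominates
        return "routing-communication-limited"
    return "unconstrained"

regime = active_regime(
    {"utilization": 0.9, "length_cv": 0.8, "moe_token_share": 0.1},
    {"headroom_ratio": 2.0},
)
# -> "capacity-limited": a relaxed SLO plus high load favors smaller TP
#    and more replicas for aggregate concurrency.
```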
---
## V. The single most important distinction in this story

This is the one sentence I suggest you state explicitly in the paper:

## **Platform shapes the surface; regime selects the active tradeoff.**

Expanded a bit:

* **$hardware \times engine \times model$**
  determines whether a configuration is legal, the local response coefficients of each knob, and where the boundaries between tradeoffs lie.

* **$workload \times SLO$**
  determines which kind of regime the system currently resembles:

  * latency-headroom-limited
  * capacity-limited
  * queueing-tail-limited
  * routing-communication-limited

This sentence works well as the opening of the principles section, and as a one-line summary in the intro.

---
## VI. Compress the paper's main line into five logical claims

The whole argument can be compressed into the following five claims.

### **Claim 1: The full tuning problem is five-dimensional**

The optimal configuration depends on:

$$
(h,e,m,w,s)
$$

Any approach that treats some of these dimensions as constants can only hold locally.

---

### **Claim 2: The five dimensions play different roles**

Not every dimension is equally suitable for being "simulated in advance."

* $h,e,m$: define the platform, determine mechanism details, change quickly
* $w,s$: define the regime, determine which tradeoffs are activated, and are more stable in their operational semantics

---

### **Claim 3: Full-stack emulation is brittle**

Because it must simultaneously track:

* evolving legality
* evolving semantics
* evolving coefficients
* evolving feasibility boundaries

---

### **Claim 4: Regime-level principles are more stable and more actionable**

We do not try to predict the precise performance at every point; we predict:

* which knobs are most worth tuning in the current regime
* which direction is more likely to help
* which regions are simply infeasible

---

### **Claim 5: AITuner uses principles as structured search priors**

That is, the principles are not a display of conclusions; they are the tuner's priors.

They determine:

* which knobs to search
* which directions to search first
* which combinations can be pruned
* when to stop searching and report infeasible

With these five claims linked together, the story is complete.
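Claim 5 can be sketched concretely: a prior is a knob-family ordering plus pruning predicates per regime. The tables and predicates below are hypothetical illustrations of the idea, not AITuner's actual rules.

```python
# Sketch of "principles as structured search priors": per-regime knob
# ordering and pruning, applied before any configuration is measured.
# The priority table and prune rules are hypothetical illustrations.

KNOB_PRIORITY = {
    # regime -> knob families to search first
    "latency-headroom-limited":      ["tp", "runtime_overhead", "batching"],
    "capacity-limited":              ["tp", "batching", "ep"],
    "queueing-tail-limited":         ["batching", "tp"],
    "routing-communication-limited": ["ep", "tp"],
}

PRUNE = {
    # regime -> predicate that discards configs before measurement
    "latency-headroom-limited": lambda cfg: cfg.get("tp", 1) < 4,
    "capacity-limited":         lambda cfg: cfg.get("tp", 1) > 4,
}

def plan_search(regime, candidates):
    """Order knob families by the prior and drop prior-pruned configs."""
    order = KNOB_PRIORITY.get(regime, ["tp"])
    prune = PRUNE.get(regime, lambda cfg: False)
    survivors = [c for c in candidates if not prune(c)]
    return order, survivors   # empty survivors => report infeasible early

order, survivors = plan_search(
    "capacity-limited",
    [{"tp": 2}, {"tp": 4}, {"tp": 8}],
)
# In the capacity-limited regime, {"tp": 8} is pruned because large TP
# sacrifices aggregate concurrency when the SLO is not tight.
```

The point of the sketch is the shape of the interface: the principles contribute the ordering and the pruning direction; the remaining measurements resolve the exact winner on the concrete platform.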
---
# VII. Under this main line, how the principles section should unfold

This section should no longer be called **workload-to-configuration principles**; it should be called:

## **Regime-to-Configuration Principles**

or, more fully:

## **Configuration Principles under Joint Workload--SLO Regimes**

I recommend the former: short and forceful.

---

## What this section is responsible for

This section is not meant to give a full-space predictive model; it answers:

1. **why the optimal configuration must jointly depend on workload and SLO**
2. **which core tradeoffs these joint regimes activate**
3. **how these tradeoffs map to different knob families**
4. **how AITuner turns these principles into a structured search prior**

---
# VIII. Recommended structure for the principles section

## **Section X: Regime-to-Configuration Principles**

### **X.1 From full-space tuning to regime-guided search**

This is the framing subsection for the whole section.

It does three things:

* states the five-dimensional problem definition
* explains why we do not build a full-space emulator
* explains why the principles focus on $w \times s$

The most important sentence in this subsection is:

> We do not model away hardware, engine, or model diversity; instead, we let AITuner resolve them online, while using workload--SLO principles to identify which tradeoff regime is active and which parts of the configuration space are worth exploring.

This sentence separates the paper from simulator/emulator work.

---

### **X.2 Why workload-only principles are insufficient**

This subsection opens with the two figures you just made.

The core observation is:

* same trace window
* only the SLO profile changes
* the winning TP changes
* it can even go from feasible to none

So you cannot just write:

> high load $\rightarrow$ small TP

You must instead write:

> under a given load and heterogeneity profile, the preferred TP still depends on SLO tightness.

This subsection's job is to bring the **SLO into the principles story**.

---
### **X.3 Principle I: TP trades latency headroom for aggregate concurrency**

This is the TP subsection.

It is no longer just a workload principle; it is a regime principle:

* **tight SLO** favors large TP, because it needs lower single-replica latency
* **relaxed SLO + high load** favors small/medium TP, because it needs higher aggregate concurrency
* **heterogeneity** makes the crossover happen earlier

This subsection should keep only one minimal mechanism: with $G$ GPUs and tensor-parallel degree $t$,

$$
\text{replica count} = \frac{G}{t}
$$

and then use the figures to discuss:

* latency headroom
* queue buildup
* infeasible regions
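The mechanism can be checked with a toy calculation (a hypothetical 8-GPU deployment; the numbers are illustrative only):

```python
# Toy check of Principle I's mechanism: with G GPUs and tensor-parallel
# degree t, replica count = G / t. Larger t buys single-replica latency
# headroom but shrinks aggregate concurrency. Numbers are hypothetical.
G = 8  # total GPUs in the deployment
replicas = {t: G // t for t in (2, 4, 8)}
# -> {2: 4, 4: 2, 8: 1}: TP=2 gives 4 replicas (max aggregate concurrency),
#    TP=8 gives the lowest single-replica latency but only 1 replica, so a
#    tight SLO pushes toward large TP while a relaxed, high-load regime
#    pushes toward small TP -- the crossover the figures should exhibit.
```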
---
### **X.4 Principle II: EP helps only when expert traffic amortizes routing cost under the target SLO**

This is the EP subsection.

The suggested structure is:

* workload signal:

  * MoE token volume
  * expert skew
  * prefill/decode mix

* SLO signal:

  * under a strict SLO, routing jitter is more dangerous
  * under a relaxed SLO, trading communication for throughput is more tolerable

* principle:

  * EP is worthwhile only when expert-side compute is large enough and the routing/communication cost can be amortized
  * otherwise, leaving EP off is the safer choice

This forms a parallel structure with the TP subsection.

---

### **X.5 Principle III: Batching knobs reshape queueing tails under heterogeneity and SLO pressure**

This is the batching subsection.

Suggested focus:

* workload signal:

  * length CV
  * long-request fraction
  * burstiness

* SLO signal:

  * a strict tail SLO does not tolerate long and short requests dragging each other down

* principle:

  * under a strict SLO, batching should usually be more conservative
  * under a relaxed SLO, packing can be more aggressive to chase throughput

This subsection is a natural bridge to the later queueing story.

---
### **X.6 Principle IV: Runtime-overhead knobs matter only when latency headroom is scarce**

This subsection covers knobs such as CUDA graphs, launch amortization, and capture sizes.

The core points:

* these are not first-order knobs in all regimes
* they mainly decide feasibility in:

  * short requests
  * small batches
  * strict SLO
  * overhead-sensitive regimes

* in heavy-prefill or communication-dominated regimes, they are usually not the first concern

---

### **X.7 Summary: Principles as structured search priors**

This subsection is critical.

It must collapse the preceding principles into:

* which regime signals determine which knob family to tune first
* which regions can be pruned outright
* which regimes should report infeasible instead of continuing to search

This subsection is best accompanied by a summary table.

---
# IX. A unified template for this section

To make the TP, EP, batching, and runtime subsections read as the same kind of thing, each should follow the same template strictly.

| Component | Content |
| --- | --- |
| **Observation** | which regime shift the figure shows |
| **Mechanism** | the single core tradeoff, with no long derivation |
| **Regime dependence** | how the workload feature and the SLO feature each act |
| **Implication for tuning** | how to shrink the search space / detect infeasibility |

This way the TP subsection does not balloon into a standalone paper, and the EP/batching/runtime subsections stay stylistically consistent.

---
# X. The one global formula worth keeping in this section

In the main text I suggest keeping only one global objective formula, used to unify the whole section.

$$
\theta^*(h,e,m,w,s)
=
\arg\max_{\theta \in \Theta(h,e,m)}
\; G(\theta; h,e,m,w)
\quad
\text{s.t.}
\quad
L(\theta; h,e,m,w) \le s
$$

Immediately follow it with a one-line explanation:

* $h,e,m$ define the feasible space and the local response surface
* $w,s$ select the currently active regime
* the principles act on the latter; AITuner calibrates the former

That is enough. Move all other formulas to the appendix as far as possible.

---
# XI. Recommended LaTeX skeleton for the whole section

```latex
\section{Regime-to-Configuration Principles}
\label{sec:principles}

Serving performance depends on the joint space of hardware, engine, model,
workload, and SLO. The optimal configuration is therefore
\[
\theta^*(h,e,m,w,s)
=
\arg\max_{\theta \in \Theta(h,e,m)}
\; G(\theta; h,e,m,w)
\quad
\text{s.t.}
\quad
L(\theta; h,e,m,w) \le s .
\]
Rather than building a brittle full-stack emulator over this entire space, we
separate the problem into two roles. Hardware, engine, and model determine the
feasible configuration set and shape the local performance surface. Workload
and SLO determine which operating regime is active, and thus which tradeoff is
most likely to govern the optimum. We therefore extract regime-to-configuration
principles over workload--SLO regimes, and let AITuner instantiate them online
for each concrete hardware--engine--model setting.

\subsection{Why workload-only principles are insufficient}
\label{sec:principles-why-not-workload-only}
% \TODO{Use the multi-SLO TP figure.}
% \TODO{Quantify how often the winner changes as the SLO changes.}

\subsection{Principle I: TP trades latency headroom for aggregate concurrency}
\label{sec:principles-tp}
% \TODO{Use TP winner heatmap and one supporting line chart.}

\subsection{Principle II: EP helps only when expert traffic amortizes routing cost under the target SLO}
\label{sec:principles-ep}
% \TODO{Insert EP figure.}

\subsection{Principle III: Batching knobs reshape queueing tails under heterogeneity and SLO pressure}
\label{sec:principles-batching}
% \TODO{Insert batching figure.}

\subsection{Principle IV: Runtime-overhead knobs matter only when latency headroom is scarce}
\label{sec:principles-runtime}
% \TODO{Insert runtime-overhead figure.}

\subsection{Summary: principles as structured search priors}
\label{sec:principles-summary}
% \TODO{Insert summary table mapping regime signals to knob priorities,
% candidate directions, and infeasibility actions.}
```
---
# XII. The table to put in the summary subsection

This table is powerful because it connects the principles directly to the tuner design.

| Regime signal | Tight SLO effect | Dominant bottleneck | Preferred knob direction | Tuner action |
| --- | --- | --- | --- | --- |
| Low load, low queueing | Headroom scarce | Single-replica latency | Larger TP | Search larger TP first |
| High load, near saturation | Headroom less important | Aggregate concurrency | Smaller / intermediate TP | Search smaller TP first |
| High expert traffic, low skew | Tight SLO may penalize routing jitter | Expert compute | Consider EP only if amortizable | Probe EP boundary |
| High length heterogeneity | Tight SLO amplifies tails | HOL blocking / queueing | More conservative batching | Reduce batch aggressiveness |
| Small batches, short requests | Tight SLO exposes launch overhead | Runtime overhead | Tune graph/capture knobs | Prioritize overhead knobs |
| No feasible config | Any | Budget-limited regime | None | Scale out / relax SLO / shed load |

The last row matters a great deal: it legitimizes `none` as a system output rather than an "experiment failure."

---
# XIII. The single most important paper-level message

I suggest condensing the paper's main line into this one statement:

> **We do not attempt to emulate the full five-dimensional tuning space. Instead, we exploit the asymmetry between platform axes and operating-regime axes: hardware, engine, and model determine the local response surface, while workload and SLO determine which tradeoff regime is active. This lets us replace brittle full-space prediction with regime-guided search, instantiated online by AITuner.**

This statement is strong enough to serve as:

* the core sentence of the abstract
* the problem statement at the end of the intro
* the framing sentence of the principles section

---
# XIV. One final crucial caution

The story is now close to right, but be careful not to overclaim:

> workload--SLO principles are universal across all hardware, engines, and models

The safer statement is:

> **the directional principles are stable, while the exact crossover points and feasibility boundaries are platform-dependent and are resolved online by AITuner.**

This sentence matters a great deal. It preserves the principle story while leaving no opening for a reviewer to puncture it with "this may not hold on a different engine."

---

The most natural next step is to turn this whole section into **publication-ready prose**: write **X.1 From full-space tuning to regime-guided search** first, then **X.2 Why workload-only principles are insufficient**, so that the opening of the section truly stands.