obsidian/projects/auto-tuner/eval setup.md

## TODO
- [x] What peak/valley means: add a chart of one day's workload fluctuation
- [ ] Need cross-hour (+1h) trace similarity
- [ ] "How long do we need to tune (validity of the prefix trace window)": this part needs a 30-min experiment
- [ ] Add a valley-setup comparison to evaluator-reliability-compare [5/10]
```bash
# probe qwen235b's trace sample threshold
# [x] cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle && ./start_qwen235b_tp4dp1_threshold_dash0123_tmux.sh
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_threshold_refresh_step1_dash0123_tmux.sh
# after all of step1 finishes:
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_threshold_experiment_step2_dash0123_tmux.sh
# the qwen235b 10-parallel-configs experiment above has finished; observations are summarized in paper/workload_pattern_to_config_principles.md
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen30b_0311_peak_valley_19parallel_threshold_chat_dash0123_tmux.sh
# merge once it finishes:
bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/tmp/qwen30b_0311_peak_valley_19parallel_threshold_chat_dash0123_ts100/merge_results_v2_trace_tables.sh
# TODO: have codex run the threshold version of synthetic/semi-real
# done
# Ongoing
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_0311_peak_threshold_batching_tp4dp1_epoff19_dash0123_tmux.sh
# decode-only
# TBD
MODEL=qwen30b bash ./start_decode_peak_parallel_sweep_dash0123_tmux.sh
# ongoing
MODEL=qwen235b bash ./start_decode_peak_parallel_sweep_dash0123_tmux.sh
# Ongoing
# [x] bash ./start_decode_peak_thresholds_dash0123_tmux.sh
# MODEL=qwen30b bash ./search_decode_peak_thresholds.sh
# MODEL=qwen235b bash ./search_decode_peak_thresholds.sh
#### TBD
# smoke test of qwen3-coder's 19 parallel configs
cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle
tmux new -s codernext_parallel_smoke 'bash ./start_qwen_coder_next_internal_parallel_only_smoke.sh'
```
qwen30b:
- chat: 19 parallel configs x 5 days
- coder: 19 parallel configs x 5 days

qwen235b:
- chat: 10 parallel configs x 5 days
- coder: 10 parallel configs x 5 days

- TP4DP1EPOFF, TP8DP1EPOFF
- TP4DP2EPOFF
- L is now 3-dimensional
  - log_mean_raw_len
  - log_p95_over_mean_raw_len
  - cv_raw_len
- C is now 4-dimensional
  - log_mean_hit_len
  - log_p95_over_mean_hit_len
  - cv_hit_len
  - cache_saving_ratio
- A is now 3-dimensional
  - log_qps
  - cv_interarrival
  - log_fano_1s_request_counts
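As a sketch of how these ten dimensions could be extracted from one window of requests. The exact definitions of `cache_saving_ratio` and the Fano factor are my assumptions, and `hit_lens` here stands for per-request cache-hit lengths; adjust to match the real extractor.

```python
import numpy as np

def laca_features(raw_lens, hit_lens, timestamps):
    """One window -> the 10-dim (L, C, A) signature listed above (sketch)."""
    raw = np.asarray(raw_lens, dtype=float)
    hit = np.asarray(hit_lens, dtype=float)
    ts = np.sort(np.asarray(timestamps, dtype=float))

    def shape_feats(x):
        # log-mean length, log(p95/mean), coefficient of variation
        m = x.mean()
        return [np.log(m), np.log(np.quantile(x, 0.95) / m), x.std() / m]

    L = shape_feats(raw)
    # cache_saving_ratio as the hit/raw token ratio is an assumption
    C = shape_feats(hit) + [hit.sum() / raw.sum()]

    duration = ts[-1] - ts[0]
    inter = np.diff(ts)
    # per-1s request counts for the Fano factor (variance / mean of counts)
    counts = np.bincount(((ts - ts[0]) // 1.0).astype(int))
    fano = counts.var() / counts.mean()
    A = [np.log(len(ts) / duration), inter.std() / inter.mean(), np.log(fano)]
    return np.array(L + C + A)
```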
## One-line principles
- For prefill, DP=1 should always be used.
- For prefill, EP off beats EP on in almost all cases; need to find and analyze the cases where EP on > EP off.
## prefill node
- Show that config has a large impact on performance and that different workloads have different optima: `data/qwen30b-config-performance-spread-v1.csv`
  - qwen30b
  - 361 configs (19 parallel x 19 batching (LPT=20480/32768))
  - chat/coder 0311 10:00~10:03
  - timescale/GPU=0.5
  - linear SLO 0.001L + 1.0
- Cross-day workload similarity: `data/weekday-workload-similarity.csv`
  - 0311~0317 5 weekdays
  - chat/coder, peak/valley (10:00~10:30/22:00~22:30)
- Similarity between the tuned top-5 configs and future days: `data/qwen30b-high-perf-configs-jaccard.csv`
  - qwen30b
  - 19 configs (parallel)
  - 0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
  - timescale/GPU=0.5
  - SLO: 0;8k;32k;=2s;4s;6s
- Performance the tuned best config achieves on future days relative to the oracle: `data/qwen30b-tuned-best-config-perf-across-5days.csv`
  - qwen30b
  - 19 configs (parallel)
  - 0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
  - timescale/GPU=0.5
  - SLO: 0;8k;32k;=2s;4s;6s
- synthetic/semi-real/real comparison: `data/qwen30b-evaluator-reliability-compare.csv`
  - qwen30b
  - 19 configs (parallel)
  - 0311~0317 5 weekdays, chat/coder, peak (10:00~10:10)
  - timescale/GPU=0.5
  - SLO: 0;8k;32k;=2s;4s;6s
- How long we need to tune (validity of the prefix trace window): `data/qwen30b-prefix-trace-window-stability.csv`
  - qwen30b
  - 19 configs (parallel)
  - 0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
  - timescale/GPU=0.5
  - SLO: 0;8k;32k;=1s;2s;4s
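The linear SLO referenced in the config-performance-spread setup (`0.001L + 1.0`) can be made concrete. The per-request budget function follows directly from the note; the attainment metric below is an assumed illustration, not the actual harness code.

```python
def linear_slo_seconds(input_len: int, slope: float = 0.001, intercept: float = 1.0) -> float:
    """Per-request latency budget under the linear SLO 0.001 * L + 1.0 (seconds)."""
    return slope * input_len + intercept

def slo_attainment(latencies, input_lens, slope=0.001, intercept=1.0):
    """Fraction of requests whose latency fits their budget (assumed metric)."""
    met = sum(1 for lat, L in zip(latencies, input_lens)
              if lat <= slope * L + intercept)
    return met / len(input_lens)
```

An 8k-token request, for example, gets a 9 s budget under this line.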
## decode node
TBD
## Similarity computation
For the algorithm behind this figure, read these Python files directly:
- main script: [plot_similarity_heatmap_custom_windows.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py)
- normalization definition: [compute_signatures.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py)
- trace ordering and catalog: [trace_catalog.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/trace_catalog.py)

For the `Similarity: 10:00-10:30 / 22:00-22:30` figure specifically, the core algorithm lives in:
- feature extraction: [plot_similarity_heatmap_custom_windows.py:132](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L132)
- global robust normalization: [compute_signatures.py:64](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py#L64)
- similarity matrix: [plot_similarity_heatmap_custom_windows.py:247](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L247)
**How it is actually computed**
1. Enumerate all traces
   From [trace_catalog.py:77](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/trace_catalog.py#L77), which collects:
   - `chat peak`
   - `chat valley`
   - `coder peak`
   - `coder valley`
   and sorts them by `trace_family, date, day_part`.
2. Cut the specified window out of each trace
   In [plot_similarity_heatmap_custom_windows.py:179](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L179).
   Peak traces use the `10:00-10:30` window you pass in; valley traces use `22:00-22:30`.
3. Compute 5 raw features for each window
   In [plot_similarity_heatmap_custom_windows.py:168](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L168)
   - `load_tokens_per_s = total_input_tokens / total_duration_seconds`
   - `mean_input_length = mean(input_lengths)`
   - `p95_input_length = quantile(input_lengths, 0.95)`
   - `input_length_cv = std(input_lengths) / mean(input_lengths)`
   - `burstiness = std(inter_arrivals) / mean(inter_arrivals)`
4. Pool all windows together and apply per-dimension global robust normalization
   In [compute_signatures.py:64](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py#L64)
   - For each dimension `x`:
     - compute `median, q1, q3` across all windows
     - `iqr = q3 - q1`
     - `global_z_x = (x - median) / iqr`
     - if `iqr <= 0`, it is forced to `1.0`
5. Each window yields a 5-dimensional normalized vector
   In [plot_similarity_heatmap_custom_windows.py:248](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L248)
```python
[
    global_z_load_tokens_per_s,
    global_z_mean_input_length,
    global_z_p95_input_length,
    global_z_input_length_cv,
    global_z_burstiness,
]
```
6. Compute pairwise Euclidean distances, then map them to similarities
   In [plot_similarity_heatmap_custom_windows.py:247](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L247)
   - `distance(i, j) = ||v_i - v_j||_2`
   - `similarity(i, j) = exp(-distance(i, j))`
So the figure's mathematical definition is:
```text
G(w) = [
    z(load_tokens_per_s),
    z(mean_input_length),
    z(p95_input_length),
    z(input_length_cv),
    z(burstiness),
]
z(x) = (x - median(x_all_windows)) / IQR(x_all_windows)
d(a,b) = ||G(a) - G(b)||_2
sim(a,b) = exp(-d(a,b))
```
Here `x_all_windows` means all windows that go into this plot, not the values inside a single window.
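The normalization and similarity math above, in runnable form. This is a numpy sketch of the definitions, not the actual scripts:

```python
import numpy as np

def global_robust_z(features):
    """Per-dimension robust z over all windows pooled together."""
    x = np.asarray(features, dtype=float)      # (n_windows, n_dims)
    med = np.median(x, axis=0)
    q1 = np.quantile(x, 0.25, axis=0)
    q3 = np.quantile(x, 0.75, axis=0)
    iqr = np.where(q3 - q1 > 0, q3 - q1, 1.0)  # iqr <= 0 forced to 1.0
    return (x - med) / iqr

def similarity_matrix(vectors):
    """Pairwise exp(-L2 distance); identical windows score 1.0."""
    v = np.asarray(vectors, dtype=float)
    dist = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)
    return np.exp(-dist)
```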
**Pseudocode**
```python
specs = build_trace_catalog()  # chat/coder x peak/valley x all days
rows = []
for spec in specs:
    if spec.day_part == "peak":
        window = peak_window      # 10:00-10:30
    else:
        window = valley_window    # 22:00-22:30
    reqs = load_requests_in_window(spec.trace_path, window)
    input_lengths = [r.input_length for r in reqs]
    inter_arrivals = diff([r.timestamp for r in reqs])
    rows.append({
        "trace_family": spec.trace_family,
        "day_part": spec.day_part,
        "date": spec.date,
        "load_tokens_per_s": sum(input_lengths) / window_duration_sec,
        "mean_input_length": mean(input_lengths),
        "p95_input_length": p95(input_lengths),
        "input_length_cv": std(input_lengths) / mean(input_lengths),
        "burstiness": std(inter_arrivals) / mean(inter_arrivals),
    })
frame = DataFrame(rows)
feature_cols = [
    "load_tokens_per_s",
    "mean_input_length",
    "p95_input_length",
    "input_length_cv",
    "burstiness",
]
for col in feature_cols:
    med = median(frame[col])
    iqr = p75(frame[col]) - p25(frame[col])
    if iqr <= 0:
        iqr = 1.0
    frame[f"global_z_{col}"] = (frame[col] - med) / iqr
vectors = frame[[f"global_z_{c}" for c in feature_cols]].to_numpy()
for i in range(len(vectors)):
    for j in range(len(vectors)):
        dist[i, j] = l2_norm(vectors[i] - vectors[j])
        sim[i, j] = exp(-dist[i, j])
plot_heatmap(sim)
```
**Two additional notes**
- This figure is not drawn with `SIGNATURE_WEIGHTS`. The `0.35/0.2/...` weights are only used for `signature_score`, not for the heatmap's pairwise similarity.
- The figure uses `global_robust_scale`; it does not scale `chat/coder` separately, nor does each window scale itself internally.
## Semi-real definition
- Arrival process: generate Poisson arrivals at the source trace's mean req_rate
  Code: prepare_figure08_evaluator_assets.py (line 115), prepare_figure08_evaluator_assets.py (line 537)
- Request count: not fixed; it depends on how many arrivals the Poisson process actually generates
- Length distribution: for each generated arrival, randomly sample (with replacement) one source request from the source real trace's request list
  Code: prepare_figure08_evaluator_assets.py (line 272)
- Inherited fields: input_length, output_length, turn/type
- Original request identity is not preserved: hash_ids are regenerated, not the original real trace's hash_ids
  Code: prepare_figure08_evaluator_assets.py (line 169)
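The construction above can be sketched as follows. Field and helper names are illustrative, not the actual prepare_figure08_evaluator_assets.py code:

```python
import random

def make_semi_real_trace(source_requests, duration_s, rng=None):
    """Semi-real trace sketch: Poisson arrivals at the source's mean rate,
    per-arrival lengths sampled with replacement from the source requests."""
    rng = rng or random.Random(0)
    rate = len(source_requests) / duration_s       # mean req_rate of the source
    t, out = 0.0, []
    while True:
        t += rng.expovariate(rate)                 # exponential gaps -> Poisson arrivals
        if t >= duration_s:
            break
        src = rng.choice(source_requests)          # sample with replacement
        out.append({
            "timestamp": t,
            "input_length": src["input_length"],   # inherited
            "output_length": src["output_length"], # inherited
            "hash_ids": None,                      # regenerated downstream, not inherited
        })
    return out
```

Note that the number of emitted requests varies run to run, matching the "request count is not fixed" point above.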