Initial commit: obsidian to gitea

2026-05-07 15:04:41 +08:00
commit a57afa86b4
323 changed files with 42569 additions and 0 deletions

## TODO
- [x] What counts as peak/valley: add a plot of one day's workload fluctuation
- [ ] Need cross-hour (+1h) trace similarity
- [ ] "How long do we need to tune (validity of the prefix trace window)": this part needs a 30-min experiment run
- [ ] Add a valley-setup comparison to evaluator-reliability-compare [5/10]
```bash
# probe qwen235b's trace sample threshold
# [x] cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle && ./start_qwen235b_tp4dp1_threshold_dash0123_tmux.sh
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_threshold_refresh_step1_dash0123_tmux.sh
# after all of step1 finishes:
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_threshold_experiment_step2_dash0123_tmux.sh
# the qwen235b 10-parallel-configs experiment above has finished; observations are summarized in paper/workload_pattern_to_config_principles.md
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen30b_0311_peak_valley_19parallel_threshold_chat_dash0123_tmux.sh
# once it finishes, merge:
bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/tmp/qwen30b_0311_peak_valley_19parallel_threshold_chat_dash0123_ts100/merge_results_v2_trace_tables.sh
# TODO: have codex run the threshold version of synthetic/semi-real
# done
# Ongoing
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_0311_peak_threshold_batching_tp4dp1_epoff19_dash0123_tmux.sh
# decode-only
# TBD
MODEL=qwen30b bash ./start_decode_peak_parallel_sweep_dash0123_tmux.sh
# Ongoing
MODEL=qwen235b bash ./start_decode_peak_parallel_sweep_dash0123_tmux.sh
# Ongoing
# [x] bash ./start_decode_peak_thresholds_dash0123_tmux.sh
# MODEL=qwen30b bash ./search_decode_peak_thresholds.sh
# MODEL=qwen235b bash ./search_decode_peak_thresholds.sh
#### TBD
# qwen3-coder 19-parallel-configs smoke test
cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle
tmux new -s codernext_parallel_smoke 'bash ./start_qwen_coder_next_internal_parallel_only_smoke.sh'
```
qwen30b:
- chat: 19 parallel configs x 5 days
- coder: 19 parallel configs x 5 days

qwen235b:
- chat: 10 parallel configs x 5 days
- coder: 10 parallel configs x 5 days

TP4DP1EPOFF, TP8DP1EPOFF
TP4DP2EPOFF
- L is now 3-dimensional
    - log_mean_raw_len
    - log_p95_over_mean_raw_len
    - cv_raw_len
- C is now 4-dimensional
    - log_mean_hit_len
    - log_p95_over_mean_hit_len
    - cv_hit_len
    - cache_saving_ratio
- A is now 3-dimensional
    - log_qps
    - cv_interarrival
    - log_fano_1s_request_counts
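As a sanity check, the L and A features above can be computed with a small numpy sketch (the function names, natural-log base, and population std are my assumptions; the C features are the same three length features over cache-hit lengths, plus `cache_saving_ratio`):

```python
import numpy as np

def length_features(lengths, prefix="raw_len"):
    # 3-dim length signature: log mean, log(p95/mean), coefficient of variation.
    lengths = np.asarray(lengths, dtype=float)
    mean = lengths.mean()
    p95 = np.quantile(lengths, 0.95)
    return {
        f"log_mean_{prefix}": np.log(mean),
        f"log_p95_over_mean_{prefix}": np.log(p95 / mean),
        f"cv_{prefix}": lengths.std() / mean,
    }

def arrival_features(timestamps_s, window_s):
    # 3-dim arrival signature: log QPS, inter-arrival CV, and the log Fano
    # factor (variance / mean) of request counts bucketed into 1s bins.
    ts = np.sort(np.asarray(timestamps_s, dtype=float))
    gaps = np.diff(ts)
    counts = np.bincount(ts.astype(int), minlength=int(window_s))
    return {
        "log_qps": np.log(len(ts) / window_s),
        "cv_interarrival": gaps.std() / gaps.mean(),
        "log_fano_1s_request_counts": np.log(counts.var() / counts.mean()),
    }
```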
## One-line principles
For prefill, always use DP=1
For prefill, EP off beats EP on in almost all cases; need to find and analyze the cases where EP on > EP off
## prefill node
- Show that config strongly affects performance and that the optimum differs across workloads: `data/qwen30b-config-performance-spread-v1.csv`
    - qwen30b
    - 361 configs (19 parallel x 19 batching (LPT=20480/32768))
    - chat/coder 0311 10:00~10:03
    - timescale/GPU=0.5
    - linear SLO 0.001L + 1.0
- Cross-day workload similarity: `data/weekday-workload-similarity.csv`
    - 0311~0317 5 weekdays
    - chat/coder, peak/valley (10:00~10:30/22:00~22:30)
- Similarity of tuned top-5 configs to future days: `data/qwen30b-high-perf-configs-jaccard.csv`
    - qwen30b
    - 19 configs (parallel)
    - 0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
    - timescale/GPU=0.5
    - SLO: 0;8k;32k;=2s;4s;6s
- Tuned best configs reach near-oracle performance on future days: `data/qwen30b-tuned-best-config-perf-across-5days.csv`
    - qwen30b
    - 19 configs (parallel)
    - 0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
    - timescale/GPU=0.5
    - SLO: 0;8k;32k;=2s;4s;6s
- synthetic/semi-real/real comparison: `data/qwen30b-evaluator-reliability-compare.csv`
    - qwen30b
    - 19 configs (parallel)
    - 0311~0317 5 weekdays, chat/coder, peak (10:00~10:10)
    - timescale/GPU=0.5
    - SLO: 0;8k;32k;=2s;4s;6s
- How long to tune (validity of the prefix trace window): `data/qwen30b-prefix-trace-window-stability.csv`
    - qwen30b
    - 19 configs (parallel)
    - 0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
    - timescale/GPU=0.5
    - SLO: 0;8k;32k;=1s;2s;4s
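A tiny sketch of the linear SLO used above (treating `L` as the request's input length in tokens is my assumption, not stated in the notes):

```python
def linear_slo_seconds(input_len_tokens: float) -> float:
    # "linear SLO 0.001L + 1.0": a per-request latency target that grows
    # linearly with input length; L = input length in tokens is assumed.
    return 0.001 * input_len_tokens + 1.0
```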
## decode node
TBD
## Similarity computation
For the algorithm behind this figure, look at these Python files:
- Main script: [plot_similarity_heatmap_custom_windows.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py)
- Normalization definition: [compute_signatures.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py)
- Trace ordering and catalog: [trace_catalog.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/trace_catalog.py)

For the `Similarity: 10:00-10:30 / 22:00-22:30` figure specifically, the core algorithm is in:
- Metric extraction: [plot_similarity_heatmap_custom_windows.py:132](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L132)
- Global robust normalization: [compute_signatures.py:64](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py#L64)
- Similarity matrix: [plot_similarity_heatmap_custom_windows.py:247](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L247)
**How it is actually computed**
1. Enumerate all traces
From [trace_catalog.py:77](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/trace_catalog.py#L77), which collects:
- `chat peak`
- `chat valley`
- `coder peak`
- `coder valley`
sorted by `trace_family, date, day_part`.
2. For each trace, slice the specified window
In [plot_similarity_heatmap_custom_windows.py:179](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L179).
Peak traces use the `10:00-10:30` you pass in; valley traces use `22:00-22:30`.
3. Compute 5 raw features per window
In [plot_similarity_heatmap_custom_windows.py:168](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L168):
- `load_tokens_per_s = total_input_tokens / total_duration_seconds`
- `mean_input_length = mean(input_lengths)`
- `p95_input_length = quantile(input_lengths, 0.95)`
- `input_length_cv = std(input_lengths) / mean(input_lengths)`
- `burstiness = std(inter_arrivals) / mean(inter_arrivals)`
4. Pool all windows and apply global robust normalization per dimension
In [compute_signatures.py:64](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py#L64):
- for each dimension `x`:
- compute `median, q1, q3` over all windows
- `iqr = q3 - q1`
- `global_z_x = (x - median) / iqr`
- if `iqr <= 0`, it is forced to `1.0`
5. Each window yields a 5-dimensional normalized vector
In [plot_similarity_heatmap_custom_windows.py:248](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L248):
```python
[
global_z_load_tokens_per_s,
global_z_mean_input_length,
global_z_p95_input_length,
global_z_input_length_cv,
global_z_burstiness,
]
```
6. Compute pairwise Euclidean distances, then map them to similarities
In [plot_similarity_heatmap_custom_windows.py:247](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L247):
- `distance(i, j) = ||v_i - v_j||_2`
- `similarity(i, j) = exp(-distance(i, j))`
So the mathematical definition of this figure is:
```text
G(w) = [
z(load_tokens_per_s),
z(mean_input_length),
z(p95_input_length),
z(input_length_cv),
z(burstiness),
]
z(x) = (x - median(x_all_windows)) / IQR(x_all_windows)
d(a,b) = ||G(a) - G(b)||_2
sim(a,b) = exp(-d(a,b))
```
Here `x_all_windows` means all windows included in this plot, not the values within a single window.
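The definition above can be reproduced with a short numpy sketch (the function names `robust_z` and `similarity_matrix` are mine, not from the scripts):

```python
import numpy as np

def robust_z(X):
    # Global robust scaling per column: (x - median) / IQR computed over
    # ALL windows, with a degenerate IQR (<= 0) forced to 1.0 (step 4).
    med = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    iqr[iqr <= 0] = 1.0
    return (X - med) / iqr

def similarity_matrix(X):
    # sim(a, b) = exp(-||G(a) - G(b)||_2) over robust-scaled feature rows.
    Z = robust_z(X)
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    return np.exp(-dist)
```

By construction the diagonal is exactly 1, the matrix is symmetric, and values decay toward 0 as windows diverge.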
**Pseudocode**
```python
specs = build_trace_catalog() # chat/coder x peak/valley x all days
rows = []
for spec in specs:
if spec.day_part == "peak":
window = peak_window # 10:00-10:30
else:
window = valley_window # 22:00-22:30
reqs = load_requests_in_window(spec.trace_path, window)
input_lengths = [r.input_length for r in reqs]
inter_arrivals = diff([r.timestamp for r in reqs])
row = {
"trace_family": spec.trace_family,
"day_part": spec.day_part,
"date": spec.date,
"load_tokens_per_s": sum(input_lengths) / window_duration_sec,
"mean_input_length": mean(input_lengths),
"p95_input_length": p95(input_lengths),
"input_length_cv": std(input_lengths) / mean(input_lengths),
"burstiness": std(inter_arrivals) / mean(inter_arrivals),
}
rows.append(row)
frame = DataFrame(rows)
five_dims = [
    "load_tokens_per_s",
    "mean_input_length",
    "p95_input_length",
    "input_length_cv",
    "burstiness",
]
for col in five_dims:
    med = median(frame[col])
    iqr = p75(frame[col]) - p25(frame[col])
    if iqr <= 0:
        iqr = 1.0
    frame[f"global_z_{col}"] = (frame[col] - med) / iqr
vectors = frame[[f"global_z_{col}" for col in five_dims]].to_numpy()
for i in range(len(vectors)):
for j in range(len(vectors)):
dist[i,j] = l2_norm(vectors[i] - vectors[j])
sim[i,j] = exp(-dist[i,j])
plot_heatmap(sim)
```
**Two additional notes**
- This figure is not drawn with `SIGNATURE_WEIGHTS`. The `0.35/0.2/...` weights are only used for `signature_score`, not for the heatmap's pairwise similarities.
- This figure uses `global_robust_scale`: it does not scale `chat/coder` separately, nor does each window scale itself internally.

TODO: also write out the command line that produces this figure, and map its outputs (`metrics csv / normalization csv / similarity matrix csv`) to the corresponding files.
## Semi-real definition
- Arrival process: Poisson arrivals generated at the source trace's average req_rate
  Code: prepare_figure08_evaluator_assets.py (line 115), prepare_figure08_evaluator_assets.py (line 537)
- Request count: not fixed; it depends on how many arrivals the Poisson process actually generates
- Length distribution: for each generated arrival, one source request is randomly sampled with replacement from the source real trace's request list
  Code: prepare_figure08_evaluator_assets.py (line 272)
  - Inherited fields: input_length, output_length, turn/type
- Original request identity is not preserved: hash_ids are regenerated, not the original real trace's hash_ids
  Code: prepare_figure08_evaluator_assets.py (line 169)
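The construction above can be sketched as follows (the `Request` shape, function name, and RNG choice are assumptions for illustration; the real logic lives in prepare_figure08_evaluator_assets.py):

```python
import random
import uuid
from dataclasses import dataclass

@dataclass
class Request:
    timestamp: float
    input_length: int
    output_length: int
    hash_id: str

def make_semi_real(source_reqs, duration_s, rng=None):
    # Poisson arrivals at the source trace's mean req_rate; each arrival
    # inherits lengths from a source request sampled with replacement;
    # hash_ids are regenerated, so prefix-cache identity is not preserved.
    rng = rng or random.Random(0)
    rate = len(source_reqs) / duration_s     # average req_rate of source trace
    t, out = 0.0, []
    while True:
        t += rng.expovariate(rate)           # exponential gaps => Poisson process
        if t >= duration_s:
            break
        src = rng.choice(source_reqs)        # "sample with replacement"
        out.append(Request(t, src.input_length, src.output_length,
                           uuid.uuid4().hex))  # fresh hash_id, new identity
    return out
```

Note the request count varies run to run, exactly as stated above: it is whatever the Poisson process generates within `duration_s`.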