## TODO

- [x] What are peak/valley? Add a one-day workload-fluctuation plot
- [ ] Need cross-hour (+1h) trace similarity
- [ ] "How long do we need to tune? Validity of the prefix trace window": this part needs a 30-min experiment
- [ ] evaluator-reliability-compare: add a valley-setup comparison [5/10]

```bash
# Probe the trace sample threshold for qwen235b
# [x] cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle && ./start_qwen235b_tp4dp1_threshold_dash0123_tmux.sh
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_threshold_refresh_step1_dash0123_tmux.sh
# After all of step1 finishes:
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_threshold_experiment_step2_dash0123_tmux.sh
# The qwen235b 10-parallel-configs experiment above is done; observations are summarized in paper/workload_pattern_to_config_principles.md
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen30b_0311_peak_valley_19parallel_threshold_chat_dash0123_tmux.sh
# After it finishes, merge: bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/tmp/qwen30b_0311_peak_valley_19parallel_threshold_chat_dash0123_ts100/merge_results_v2_trace_tables.sh
# TODO: have codex run the threshold versions of synthetic/semi-real  # done

# Ongoing
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_0311_peak_threshold_batching_tp4dp1_epoff19_dash0123_tmux.sh

# decode-only
# TBD MODEL=qwen30b bash ./start_decode_peak_parallel_sweep_dash0123_tmux.sh  # going
MODEL=qwen235b bash ./start_decode_peak_parallel_sweep_dash0123_tmux.sh  # Ongoing
# [x] bash ./start_decode_peak_thresholds_dash0123_tmux.sh
# MODEL=qwen30b bash ./search_decode_peak_thresholds.sh
# MODEL=qwen235b bash ./search_decode_peak_thresholds.sh

#### TBD
# 19-parallel-configs smoke test for qwen3-coder
cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle
tmux new -s codernext_parallel_smoke 'bash ./start_qwen_coder_next_internal_parallel_only_smoke.sh'
```

- qwen30b:
  - chat: 19 parallel configs x 5 days
  - coder: 19 parallel configs x 5 days
- qwen235b:
  - chat: 10 parallel configs x 5 days
  - coder: 10 parallel configs x 5 days

✅: TP4DP1EPOFF,
TP8DP1EPOFF, TP4DP2EPOFF

- L is now 3-dimensional
  - log_mean_raw_len
  - log_p95_over_mean_raw_len
  - cv_raw_len
- C is now 4-dimensional
  - log_mean_hit_len
  - log_p95_over_mean_hit_len
  - cv_hit_len
  - cache_saving_ratio
- A is now 3-dimensional
  - log_qps
  - cv_interarrival
  - log_fano_1s_request_counts

## One-sentence principles

- For prefill, always use DP=1.
- For prefill, EP off beats EP on in almost all cases; we still need to find and analyze the cases where EP on > EP off.

## prefill node

- Show that config choice strongly affects performance and that the optimum differs across workloads: `data/qwen30b-config-performance-spread-v1.csv`
  qwen30b, 361 configs (19 parallel x 19 batching (LPT=20480/32768)), chat/coder, 0311 10:00~10:03, timescale/GPU=0.5, linear SLO 0.001L + 1.0
- Cross-day workload similarity: `data/weekday-workload-similarity.csv`
  0311~0317, 5 weekdays, chat/coder, peak/valley (10:00~10:30/22:00~22:30)
- Overlap of the tuned top-5 configs with future days: `data/qwen30b-high-perf-configs-jaccard.csv`
  qwen30b, 19 configs (parallel), 0311~0317, 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10), timescale/GPU=0.5, SLO: 0;8k;32k;=2s;4s;6s
- The tuned best config retains near-oracle performance on future days: `data/qwen30b-tuned-best-config-perf-across-5days.csv`
  qwen30b, 19 configs (parallel), 0311~0317, 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10), timescale/GPU=0.5, SLO: 0;8k;32k;=2s;4s;6s
- synthetic/semi-real/real comparison: `data/qwen30b-evaluator-reliability-compare.csv`
  qwen30b, 19 configs (parallel), 0311~0317, 5 weekdays, chat/coder, peak (10:00~10:10), timescale/GPU=0.5, SLO: 0;8k;32k;=2s;4s;6s
- How long do we need to tune? Validity of the prefix trace window: `data/qwen30b-prefix-trace-window-stability.csv`
  qwen30b, 19 configs (parallel), 0311~0317, 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10), timescale/GPU=0.5, SLO: 0;8k;32k;=1s;2s;4s

## decode node

TBD

## Similarity computation

For the algorithm behind this figure, read these Python files directly:

- Main script: [plot_similarity_heatmap_custom_windows.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py)
- Normalization definition: [compute_signatures.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py)
- Trace ordering and catalog:
[trace_catalog.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/trace_catalog.py)

For the figure `Similarity: 10:00-10:30 / 22:00-22:30` specifically, the core algorithm lives in:

- Feature extraction: [plot_similarity_heatmap_custom_windows.py:132](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L132)
- Global robust normalization: [compute_signatures.py:64](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py#L64)
- Similarity matrix: [plot_similarity_heatmap_custom_windows.py:247](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L247)

**How it actually computes**

1. Enumerate all traces via [trace_catalog.py:77](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/trace_catalog.py#L77), collecting:
   - `chat peak`
   - `chat valley`
   - `coder peak`
   - `coder valley`
   sorted by `trace_family, date, day_part`.
2. For each trace, cut out the specified window at [plot_similarity_heatmap_custom_windows.py:179](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L179): peak traces use the `10:00-10:30` window you pass in; valley traces use `22:00-22:30`.
3. For each window, compute 5 raw features at [plot_similarity_heatmap_custom_windows.py:168](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L168):
   - `load_tokens_per_s = total_input_tokens / total_duration_seconds`
   - `mean_input_length = mean(input_lengths)`
   - `p95_input_length = quantile(input_lengths, 0.95)`
   - `input_length_cv = std(input_lengths) / mean(input_lengths)`
   - `burstiness = std(inter_arrivals) / mean(inter_arrivals)`
4. Pool all windows and apply per-dimension global robust normalization at [compute_signatures.py:64](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py#L64):
   - for each dimension `x`, compute `median, q1, q3` over all windows
   - `iqr = q3 - q1`
   - `global_z_x = (x - median) / iqr`
   - if `iqr <= 0`, force `iqr` to `1.0`
5.
Each window yields a 5-dimensional normalized vector, assembled at [plot_similarity_heatmap_custom_windows.py:248](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L248):

```python
[
    global_z_load_tokens_per_s,
    global_z_mean_input_length,
    global_z_p95_input_length,
    global_z_input_length_cv,
    global_z_burstiness,
]
```

6. Compute pairwise Euclidean distances, then map them to similarities at [plot_similarity_heatmap_custom_windows.py:247](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L247):
   - `distance(i, j) = ||v_i - v_j||_2`
   - `similarity(i, j) = exp(-distance(i, j))`

So the mathematical definition behind this figure is:

```text
G(w) = [
  z(load_tokens_per_s),
  z(mean_input_length),
  z(p95_input_length),
  z(input_length_cv),
  z(burstiness),
]
z(x)     = (x - median(x_all_windows)) / IQR(x_all_windows)
d(a,b)   = ||G(a) - G(b)||_2
sim(a,b) = exp(-d(a,b))
```

Here `x_all_windows` means all windows that participate in this plot, not the values inside a single window.

**Pseudocode**

```python
specs = build_trace_catalog()  # chat/coder x peak/valley x all days

rows = []
for spec in specs:
    if spec.day_part == "peak":
        window = peak_window    # 10:00-10:30
    else:
        window = valley_window  # 22:00-22:30

    reqs = load_requests_in_window(spec.trace_path, window)
    input_lengths = [r.input_length for r in reqs]
    inter_arrivals = diff([r.timestamp for r in reqs])

    rows.append({
        "trace_family": spec.trace_family,
        "day_part": spec.day_part,
        "date": spec.date,
        "load_tokens_per_s": sum(input_lengths) / window_duration_sec,
        "mean_input_length": mean(input_lengths),
        "p95_input_length": p95(input_lengths),
        "input_length_cv": std(input_lengths) / mean(input_lengths),
        "burstiness": std(inter_arrivals) / mean(inter_arrivals),
    })

frame = DataFrame(rows)
for col in [
    "load_tokens_per_s",
    "mean_input_length",
    "p95_input_length",
    "input_length_cv",
    "burstiness",
]:
    med = median(frame[col])
    iqr = p75(frame[col]) - p25(frame[col])
    if iqr <= 0:
        iqr = 1.0
    frame[f"global_z_{col}"] = (frame[col] - med) / iqr

vectors = frame[[global_z_5_dims]].to_numpy()
for i in range(len(vectors)):
    for j in range(len(vectors)):
        dist[i, j] = l2_norm(vectors[i] - vectors[j])
        sim[i, j] = exp(-dist[i, j])

plot_heatmap(sim)
```

**Two clarifications**

- This figure is NOT drawn with `SIGNATURE_WEIGHTS`. The `0.35/0.2/...` weights are only used for `signature_score`, not for the heatmap's pairwise similarity.
- This figure uses `global_robust_scale`: it does not scale `chat`/`coder` separately, nor does each window scale internally on its own.

TODO: write out the exact command line for this figure, and map its outputs (metrics csv / normalization csv / similarity matrix csv) one-to-one to files.

## Semi-real definition

- Arrival process: generate Poisson arrivals at the source trace's mean req_rate
  code: prepare_figure08_evaluator_assets.py (line 115) and prepare_figure08_evaluator_assets.py (line 537)
- Request count: not fixed; it depends on how many arrivals this run of the Poisson process actually generates
- Length distribution: for each generated arrival, sample one source request with replacement from the real source trace's request list
  code: prepare_figure08_evaluator_assets.py (line 272)
- Inherited fields: input_length, output_length, turn/type
- Original request identity is not preserved: hash_ids are regenerated, not the hash_ids of the original real trace
  code: prepare_figure08_evaluator_assets.py (line 169)
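The semi-real construction above can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation in prepare_figure08_evaluator_assets.py; `Request` and `make_semi_real_trace` are hypothetical names. The idea: draw exponential inter-arrival gaps at the source window's mean rate (i.e. a Poisson arrival process), and for each arrival inherit lengths by sampling a source request with replacement.

```python
import random
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float
    input_length: int
    output_length: int


def make_semi_real_trace(source, duration_s, seed=0):
    """Semi-real trace sketch: Poisson arrivals at the source window's
    mean req_rate; lengths sampled with replacement from source requests."""
    rng = random.Random(seed)
    rate = len(source) / duration_s  # mean req_rate of the source window
    t, out = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential gap -> Poisson arrivals
        if t >= duration_s:
            break
        src = rng.choice(source)  # sample with replacement
        out.append(Request(t, src.input_length, src.output_length))
    return out
```

Because arrivals are Poisson, the generated request count fluctuates around `len(source)` rather than matching it exactly; per-request identities (e.g. hash_ids) would be regenerated, not copied from the source trace.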