TODO
- What are peak/valley? Add a plot of one day's workload fluctuation
- Need cross-hour (+1h) trace similarity
- "How long does tuning take? Effectiveness of the prefix trace window": this part needs a 30-minute experiment run
- Add a valley-setup comparison to evaluator-reliability-compare [5/10]
# probe qwen235b's trace sample threshold
# [x] cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle && ./start_qwen235b_tp4dp1_threshold_dash0123_tmux.sh
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_threshold_refresh_step1_dash0123_tmux.sh
# after step1 fully completes:
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_threshold_experiment_step2_dash0123_tmux.sh
# the qwen235b 10-parallel-configs experiments above have finished; observations are summarized in paper/workload_pattern_to_config_principles.md
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen30b_0311_peak_valley_19parallel_threshold_chat_dash0123_tmux.sh
# after it finishes, merge:
bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/tmp/qwen30b_0311_peak_valley_19parallel_threshold_chat_dash0123_ts100/merge_results_v2_trace_tables.sh
# TODO: have codex run the threshold version of synthetic/semi-real
# done
# Ongoing
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_0311_peak_threshold_batching_tp4dp1_epoff19_dash0123_tmux.sh
# decode-only
# TBD
MODEL=qwen30b bash ./start_decode_peak_parallel_sweep_dash0123_tmux.sh
# Ongoing
MODEL=qwen235b bash ./start_decode_peak_parallel_sweep_dash0123_tmux.sh
# Ongoing
# [x] bash ./start_decode_peak_thresholds_dash0123_tmux.sh
# MODEL=qwen30b bash ./search_decode_peak_thresholds.sh
# MODEL=qwen235b bash ./search_decode_peak_thresholds.sh
#### TBD
# qwen3-coder's 19-parallel-configs smoke test
cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle
tmux new -s codernext_parallel_smoke 'bash ./start_qwen_coder_next_internal_parallel_only_smoke.sh'
qwen30b: chat: 19 parallel configs x 5 days; coder: 19 parallel configs x 5 days
qwen235b: chat: 10 parallel configs x 5 days; coder: 10 parallel configs x 5 days
✅: TP4DP1EPOFF, TP8DP1EPOFF
TP4DP2EPOFF
- L is now 3-dimensional
- log_mean_raw_len
- log_p95_over_mean_raw_len
- cv_raw_len
- C is now 4-dimensional
- log_mean_hit_len
- log_p95_over_mean_hit_len
- cv_hit_len
- cache_saving_ratio
- A is now 3-dimensional
- log_qps
- cv_interarrival
- log_fano_1s_request_counts
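The A (arrival) features above can be derived from raw request timestamps. A minimal sketch, assuming numpy, a natural log, and 1-second Fano bins; the function name `arrival_features` is mine, not the repo's actual code:

```python
import math
import numpy as np

def arrival_features(timestamps):
    """Return [log_qps, cv_interarrival, log_fano_1s_request_counts].

    Assumes at least two timestamps (seconds) spanning a nonzero duration.
    """
    ts = np.sort(np.asarray(timestamps, dtype=float))
    duration = ts[-1] - ts[0]
    qps = len(ts) / duration
    # Coefficient of variation of inter-arrival gaps.
    gaps = np.diff(ts)
    cv_interarrival = gaps.std() / gaps.mean()
    # Fano factor of per-second request counts: variance / mean.
    counts = np.histogram(ts, bins=np.arange(ts[0], ts[-1] + 1.0, 1.0))[0]
    fano = counts.var() / counts.mean()
    return [math.log(qps), cv_interarrival, math.log(fano)]
```

For perfectly regular arrivals cv_interarrival is 0 and the Fano factor is well below 1; bursty arrivals push both up.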
One-sentence principles
During prefill, always use DP=1. During prefill, EP off beats EP on in almost all cases; we need to find and analyze the cases where EP on > EP off.
prefill node
- Show that config strongly affects performance and that the performance optimum differs across workloads:
data/qwen30b-config-performance-spread-v1.csv
qwen30b, 361 configs (19 parallel x 19 batching (LPT=20480/32768)), chat/coder, 0311 10:00~10:03, timescale/GPU=0.5, linear SLO 0.001L + 1.0
- Cross-day workload similarity:
data/weekday-workload-similarity.csv
0311~0317 5 weekdays
chat/coder, peak/valley (10:00~10:30/22:00~22:30)
- Similarity of tuned top-5 configs to future windows:
data/qwen30b-high-perf-configs-jaccard.csv
qwen30b
19 configs (parallel)
0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
timescale/GPU=0.5
SLO: 0;8k;32k = 2s;4s;6s
- Tuned best configs reach near-oracle performance on future days:
data/qwen30b-tuned-best-config-perf-across-5days.csv
qwen30b
19 configs (parallel)
0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
timescale/GPU=0.5
SLO: 0;8k;32k = 2s;4s;6s
- Synthetic/semi-real/real comparison:
data/qwen30b-evaluator-reliability-compare.csv
qwen30b
19 configs (parallel)
0311~0317 5 weekdays, chat/coder, peak (10:00~10:10)
timescale/GPU=0.5
SLO: 0;8k;32k = 2s;4s;6s
- How long does tuning take? Effectiveness of the prefix trace window:
data/qwen30b-prefix-trace-window-stability.csv
qwen30b
19 configs (parallel)
0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
timescale/GPU=0.5
SLO: 0;8k;32k = 1s;2s;4s
decode node
TBD
Similarity computation
For this figure's algorithm, just read these Python files:
- Main script: plot_similarity_heatmap_custom_windows.py
- Normalization definition: compute_signatures.py
- Trace ordering and catalog: trace_catalog.py
If you mean the "Similarity: 10:00-10:30 / 22:00-22:30" figure, the core algorithm is in:
- Metric extraction: plot_similarity_heatmap_custom_windows.py:132
- Global robust normalization: compute_signatures.py:64
- Similarity matrix: plot_similarity_heatmap_custom_windows.py:247
How it is actually computed
- Enumerate all traces via trace_catalog.py:77, which collects chat peak / chat valley / coder peak / coder valley and sorts by trace_family, date, day_part.
- For each trace, cut out the specified window in plot_similarity_heatmap_custom_windows.py:179. Peak traces use the 10:00-10:30 you passed; valley traces use 22:00-22:30.
- For each window, compute 5 raw features in plot_similarity_heatmap_custom_windows.py:168:
  - load_tokens_per_s = total_input_tokens / total_duration_seconds
  - mean_input_length = mean(input_lengths)
  - p95_input_length = quantile(input_lengths, 0.95)
  - input_length_cv = std(input_lengths) / mean_input_length
  - burstiness = std(inter_arrivals) / mean(inter_arrivals)
- Pool all windows together and apply global robust normalization per dimension in compute_signatures.py:64:
  - for each dimension x, compute median, q1, q3 across all windows
  - iqr = q3 - q1
  - global_z_x = (x - median) / iqr
  - if iqr <= 0, force it to 1.0
- Each window yields a 5-dimensional normalized vector in plot_similarity_heatmap_custom_windows.py:248:
  [global_z_load_tokens_per_s, global_z_mean_input_length, global_z_p95_input_length, global_z_input_length_cv, global_z_burstiness]
- Compute pairwise Euclidean distances and map them to similarities in plot_similarity_heatmap_custom_windows.py:247:
  - distance(i, j) = ||v_i - v_j||_2
  - similarity(i, j) = exp(-distance(i, j))
So the mathematical definition behind this figure is:
G(w) = [
z(load_tokens_per_s),
z(mean_input_length),
z(p95_input_length),
z(input_length_cv),
z(burstiness),
]
z(x) = (x - median(x_all_windows)) / IQR(x_all_windows)
d(a,b) = ||G(a) - G(b)||_2
sim(a,b) = exp(-d(a,b))
Here x_all_windows means all windows participating in this plot, not the values inside a single window.
Pseudocode
specs = build_trace_catalog()  # chat/coder x peak/valley x all days
rows = []
for spec in specs:
    if spec.day_part == "peak":
        window = peak_window    # 10:00-10:30
    else:
        window = valley_window  # 22:00-22:30
    reqs = load_requests_in_window(spec.trace_path, window)
    input_lengths = [r.input_length for r in reqs]
    inter_arrivals = diff([r.timestamp for r in reqs])
    row = {
        "trace_family": spec.trace_family,
        "day_part": spec.day_part,
        "date": spec.date,
        "load_tokens_per_s": sum(input_lengths) / window_duration_sec,
        "mean_input_length": mean(input_lengths),
        "p95_input_length": p95(input_lengths),
        "input_length_cv": std(input_lengths) / mean(input_lengths),
        "burstiness": std(inter_arrivals) / mean(inter_arrivals),
    }
    rows.append(row)
frame = DataFrame(rows)
for col in [
    "load_tokens_per_s",
    "mean_input_length",
    "p95_input_length",
    "input_length_cv",
    "burstiness",
]:
    med = median(frame[col])
    iqr = p75(frame[col]) - p25(frame[col])
    if iqr <= 0:
        iqr = 1.0
    frame[f"global_z_{col}"] = (frame[col] - med) / iqr
vectors = frame[[global_z_5_dims]].to_numpy()
for i in range(len(vectors)):
    for j in range(len(vectors)):
        dist[i, j] = l2_norm(vectors[i] - vectors[j])
        sim[i, j] = exp(-dist[i, j])
plot_heatmap(sim)
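The normalization-and-similarity core of that pipeline can be condensed into a runnable numpy sketch. Assumptions: `global_robust_scale` and `similarity_matrix` are my names, and feature extraction is replaced by a ready-made window-by-feature matrix:

```python
import numpy as np

def global_robust_scale(features):
    """Per-dimension (x - median) / IQR across all windows; IQR <= 0 -> 1.0."""
    features = np.asarray(features, dtype=float)
    med = np.median(features, axis=0)
    q1 = np.percentile(features, 25, axis=0)
    q3 = np.percentile(features, 75, axis=0)
    iqr = q3 - q1
    iqr[iqr <= 0] = 1.0
    return (features - med) / iqr

def similarity_matrix(features):
    """sim(i, j) = exp(-||z_i - z_j||_2) over robust-normalized window vectors."""
    z = global_robust_scale(features)
    diff = z[:, None, :] - z[None, :, :]   # pairwise differences
    dist = np.linalg.norm(diff, axis=-1)   # Euclidean distance matrix
    return np.exp(-dist)

# Three toy windows (load, mean_len, cv): two similar, one outlier.
sim = similarity_matrix([[100.0, 512, 0.8],
                         [105.0, 530, 0.9],
                         [900.0, 4096, 2.5]])
```

The diagonal is exp(0) = 1, and the two similar windows score higher with each other than with the outlier, which is exactly the structure the heatmap visualizes.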
Two additional notes
- This figure is NOT drawn with SIGNATURE_WEIGHTS. The 0.35/0.2/... weights are only used for signature_score, not for the heatmap's pairwise similarity.
- This figure uses global_robust_scale: it does not scale chat/coder separately, and it does not scale each window internally on its own.
If you want, next I can write out the exact command line for this figure and map each of its outputs (metrics csv / normalization csv / similarity matrix csv) to its file.
Semi-real definition
- Arrival process: generate Poisson arrivals at the source trace's mean req_rate
  Code in prepare_figure08_evaluator_assets.py (line 115) and prepare_figure08_evaluator_assets.py (line 537)
- Request count: not fixed; it depends on how many arrivals this run of the Poisson process actually generates
- Length distribution: for each generated arrival, randomly sample (with replacement) a source request from the source real trace's request list
  Code in prepare_figure08_evaluator_assets.py (line 272)
- Inherited fields: input_length, output_length, turn/type
- Original request identity is not preserved: hash_ids are regenerated, not the original real trace's hash_ids
  Code in prepare_figure08_evaluator_assets.py (line 169)
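The semi-real construction above can be sketched as follows. This is an illustrative assumption, not the actual script's API: `make_semi_real` and the dict field names are mine, mirroring the note's description:

```python
import random
import uuid

def make_semi_real(source_requests, duration_s, seed=0):
    """Build a semi-real trace: Poisson arrivals at the source's mean rate;
    each arrival inherits lengths from a request sampled with replacement."""
    rng = random.Random(seed)
    rate = len(source_requests) / duration_s  # mean req_rate of the source trace
    t, out = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential gaps -> Poisson process
        if t >= duration_s:
            break
        src = rng.choice(source_requests)  # sample with replacement
        out.append({
            "timestamp": t,
            "input_length": src["input_length"],    # inherited
            "output_length": src["output_length"],  # inherited
            "turn": src.get("turn"),                # inherited turn/type
            "hash_id": uuid.uuid4().hex,            # regenerated, not inherited
        })
    return out
```

Note how the request count is not fixed: it is whatever the Poisson process produces within duration_s, and hash_ids never carry over from the real trace.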