docs: add qwen235b prefill 7-day compare
docs/qwen235b-thinking-prefill/7day-compare.md
# qwen235b-thinking-prefill-7day-compare
Prefill-only replay of a qwen3-235b-a22b `thinking` trace with `output_length=1`, comparing 3 configs across 7 daily `10:00-10:10` windows on `request_rate_per_gpu`.
## Setup

- Hardware: `dash1`, `8x H20`
- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
- Engine: internal vLLM, baseline aligned to `~/run_qwen235b.sh`
- Trace set: `thinking_w20260321_1000` to `thinking_w20260327_1000`
- Window duration: `600s` each
- Request mode: `chat`
- Replay override: `min_tokens=max_tokens=1`
- SLO:
  - pass target: `95%`
  - `TTFT <= 2000ms` for `<=8191` input tokens
  - `TTFT <= 4000ms` for `<=32767` input tokens
  - `TTFT <= 6000ms` for `>32767` input tokens
- Search:
  - each candidate independently binary-searches its own `sampling_u`
  - `sampling_u in [0, 0.125]`
  - `max_probes = 6`

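The SLO check and search loop described above can be sketched as follows. This is an illustrative reconstruction, not the internal aituner implementation: `replay_window`, `window_passes`, and `search_sampling_u` are hypothetical names, and the assumption that higher `sampling_u` means more replayed load (so SLO pass is monotone in `u`) is mine; the tiered TTFT budgets, the `[0, 0.125]` range, the `95%` target, and `max_probes = 6` come from the Setup list.

```python
def ttft_slo_ms(input_tokens: int) -> int:
    """Tiered TTFT budget from the SLO section."""
    if input_tokens <= 8191:
        return 2000
    if input_tokens <= 32767:
        return 4000
    return 6000


def window_passes(samples, target: float = 0.95) -> bool:
    """samples: list of (input_tokens, ttft_ms); pass if >= target meet budget."""
    ok = sum(1 for toks, ttft in samples if ttft <= ttft_slo_ms(toks))
    return ok / len(samples) >= target


def search_sampling_u(replay_window, lo: float = 0.0, hi: float = 0.125,
                      max_probes: int = 6):
    """Binary-search the highest sampling_u whose replayed window passes SLO.

    replay_window(u) -> list of (input_tokens, ttft_ms) samples for one
    10-minute window replayed at sampling share u.  Returns the best
    passing u found within the probe budget, or None if every probe fails.
    """
    best = None
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if window_passes(replay_window(mid)):
            best = mid  # passing: try a higher sampling share
            lo = mid
        else:
            hi = mid    # failing: back off
    return best
```

Each candidate runs this search independently per window, which is why the per-day `request_rate_per_gpu` values below are directly comparable across configs.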
## Candidates

- `baseline`
  - `TP=4, DP=1, EP=off`
  - baseline `run_qwen235b.sh` shape
- `tuned_0323`
  - tuned on `thinking_w20260323_1000`
  - `TP=4, DP=1, EP=off`
  - `max-num-batched-tokens=3072`
  - `max-num-seqs=32`
- `tuned_0327`
  - tuned on `thinking_w20260327_1000`
  - `TP=8, DP=1, EP=off`
  - `max-num-batched-tokens=6144`
  - `max-num-seqs=48`
  - `block-size=32`

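For reference, the three shapes map onto standard vLLM engine flags roughly as below. This is a sketch only: the actual launches go through the internal `run_qwen235b.sh`, whose remaining flags are not reproduced here, and the `vllm serve` invocation form is an assumption.

```sh
MODEL=/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717

# baseline: TP=4, stock batching limits
vllm serve "$MODEL" --tensor-parallel-size 4

# tuned_0323: TP=4, tighter prefill batch
vllm serve "$MODEL" --tensor-parallel-size 4 \
  --max-num-batched-tokens 3072 --max-num-seqs 32

# tuned_0327: TP=8, larger prefill batch, bigger KV blocks
vllm serve "$MODEL" --tensor-parallel-size 8 \
  --max-num-batched-tokens 6144 --max-num-seqs 48 --block-size 32
```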
## Run assets

- Compare root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327`
- Summary: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327/summary.json`
- Report: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327/report.md`
- Compare spec: `/home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash1_qwen235b_prefill_thinking_7day_compare.json`

## Aggregate result

- Wins by `request_rate_per_gpu`:
  - `tuned_0327`: `5 / 7`
  - `baseline`: `2 / 7`
  - `tuned_0323`: `0 / 7`
- Mean `request_rate_per_gpu`:
  - `baseline`: `0.13845`
  - `tuned_0323`: `0.12756`
  - `tuned_0327`: `0.17232`
- Relative to baseline:
  - `tuned_0323`: `0.92x` mean per-GPU throughput
  - `tuned_0327`: `1.24x` mean per-GPU throughput

## Per-day result

| Date | baseline req/s/gpu | tuned_0323 req/s/gpu | tuned_0327 req/s/gpu | Winner |
| --- | ---: | ---: | ---: | --- |
| `2026-03-21` | `0.08500` | `0.03917` | `0.14375` | `tuned_0327` |
| `2026-03-22` | `0.10125` | `0.12083` | `0.15313` | `tuned_0327` |
| `2026-03-23` | `0.12792` | `0.12792` | `0.19167` | `tuned_0327` |
| `2026-03-24` | `0.09000` | `0.09583` | `0.11250` | `tuned_0327` |
| `2026-03-25` | `0.13792` | `0.13208` | `0.13146` | `baseline` |
| `2026-03-26` | `0.32000` | `0.25917` | `0.23375` | `baseline` |
| `2026-03-27` | `0.10708` | `0.11792` | `0.24000` | `tuned_0327` |

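The aggregate numbers can be re-derived from the per-day table above; the snippet below recomputes the means, win counts, and baseline-relative ratios. This is plain arithmetic over values copied from the table, with no aituner internals assumed.

```python
# request_rate_per_gpu per day (2026-03-21 .. 2026-03-27), from the table above
rates = {
    "baseline":   [0.08500, 0.10125, 0.12792, 0.09000, 0.13792, 0.32000, 0.10708],
    "tuned_0323": [0.03917, 0.12083, 0.12792, 0.09583, 0.13208, 0.25917, 0.11792],
    "tuned_0327": [0.14375, 0.15313, 0.19167, 0.11250, 0.13146, 0.23375, 0.24000],
}

# mean per-GPU request rate per candidate
means = {name: sum(v) / len(v) for name, v in rates.items()}

# daily winner = candidate with the highest rate that day
wins = {name: 0 for name in rates}
for day in range(7):
    winner = max(rates, key=lambda name: rates[name][day])
    wins[winner] += 1

# throughput relative to baseline
rel = {name: means[name] / means["baseline"] for name in rates}
```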
## Key takeaways

- `tuned_0327` is the only candidate with clear cross-day value: it wins `5/7` windows and improves mean per-GPU throughput by about `24%`.
- `tuned_0323` does not generalize. It is slightly more conservative and keeps a high pass rate, but its mean per-GPU throughput falls below baseline (`0.92x`).
- The `0327` winner is not universal. On `2026-03-25` and especially `2026-03-26`, the 4-GPU baseline is more efficient per GPU than the `TP8` tuned shape.
- The practical reading: prefill-only tuning is sensitive to the workload regime. `TP8 + 6144 + 48 + block-size=32` is a strong default candidate, but not a global static optimum across all days.