Add qwen235b prefill docs and tight TTFT spec
# qwen235b-thinking-prefill
Prefill-only replay of a qwen3-235b-a22b `thinking` trace with `output_length=1`, served by the internal vLLM build (`/usr/local/bin/vllm`); candidate configurations are compared by `request_rate_per_gpu`.
## Setup
- Hardware: `dash0`, `8x H20`
- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
- Engine: internal vLLM, baseline aligned to `~/run_qwen235b.sh`
- Baseline topology: `TP=4, DP=1, EP=1`
- Trace: `thinking_w20260327_1000`
- Trace source: `trace_windows/traces/thinking_w20260327_1000.jsonl`
- Window duration: `600s` (`10:00-10:10`, `2026-03-27`)
- Request mode: `chat`
- Replay override: `min_tokens=max_tokens=1`
- SLO (see the check sketched after this list):
- pass target: `95%`
- `TTFT <= 3000ms` for `<=4096` input tokens
- `TTFT <= 6000ms` for `<=32768` input tokens
- `TTFT <= 9000ms` for `>32768` input tokens
- Search:
- `sampling_u in [0, 0.125]`
- `max_probes = 6`
- `12` trials total
- Proposal model: `codex / gpt-5.4`
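
The tiered SLO above is the feasibility gate for every trial. A minimal sketch of the pass-rate check, assuming illustrative per-request field names (`input_tokens`, `ttft_ms`) rather than the replay tool's actual schema:

```python
# Minimal sketch of the tiered TTFT SLO described above.
# Field names (input_tokens, ttft_ms) are illustrative, not the harness schema.

def ttft_budget_ms(input_tokens: int) -> int:
    """TTFT budget for one request, chosen by its input-length tier."""
    if input_tokens <= 4096:
        return 3000
    if input_tokens <= 32768:
        return 6000
    return 9000

def slo_pass_rate(requests: list[dict]) -> float:
    """Fraction of requests whose TTFT stays within its tier budget."""
    passed = sum(1 for r in requests
                 if r["ttft_ms"] <= ttft_budget_ms(r["input_tokens"]))
    return passed / len(requests)

# A probed request rate is feasible when slo_pass_rate(...) >= 0.95.
```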
## Run assets
- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology`
- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology/state.json`
- Log: `/home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen235b_prefill_thinking_run1_ttft_topology.log`
- Spec: `/home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash0_qwen235b_prefill_thinking_run1_ttft.json`
## Best result
- Best trial: `trial-0010`
- Best config:
- `tensor-parallel-size=8`
- `data-parallel-size=1`
- `enable-expert-parallel=false`
- `max-num-batched-tokens=3712`
- Best `sampling_u`: `0.120422363281`
- Best request rate: `3.035 req/s`
- Best request rate per GPU: `0.379375 req/s/gpu`
- Best pass rate: `0.9533223503569467`

Compared with baseline:
- `trial-0001`: `0.8116666666666666 req/s`, `0.20291666666666666 req/s/gpu`
- `trial-0010`: `3.035 req/s`, `0.379375 req/s/gpu`
- Raw throughput gain: `3.74x`
- Per-GPU throughput gain: `1.87x`
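
The gains follow from normalizing by the GPUs each topology occupies (4 for the `TP4` baseline, 8 for the `TP8` best point), which is consistent with the per-GPU figures above; a quick check:

```python
# Reproduces the throughput-gain figures above. Per-GPU rate divides by the
# GPUs the topology occupies: 4 for the TP4 baseline, 8 for the TP8 best point.
baseline_rate, baseline_gpus = 0.8116666666666666, 4  # trial-0001, TP4/DP1
best_rate, best_gpus = 3.035, 8                        # trial-0010, TP8/DP1

raw_gain = best_rate / baseline_rate                                      # ~3.74x
per_gpu_gain = (best_rate / best_gpus) / (baseline_rate / baseline_gpus)  # ~1.87x
print(f"raw {raw_gain:.2f}x, per-GPU {per_gpu_gain:.2f}x")
```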
Best-point latency:
- `TTFT mean/p50/p90/p95/p99 = 863.84 / 253.58 / 2392.48 / 3154.26 / 5377.00 ms`
## 12-trial summary
| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline `TP4/DP1/EP-off`, `max-num-batched-tokens=8192` | `0.8117 req/s`, feasible |
| `trial-0002` | `DP=2`, `max-num-batched-tokens=4096` | probe-time runtime failure |
| `trial-0003` | `DP=2`, `max-num-batched-tokens=8192` | probe-time runtime failure |
| `trial-0004` | `EP=4`, `enable-expert-parallel=true` | launch fail |
| `trial-0005` | `max-num-batched-tokens=4096` | infeasible |
| `trial-0006` | `TP=8, DP=1`, `max-num-batched-tokens=4096` | `2.8600 req/s`, feasible |
| `trial-0007` | `trial-0006 + max-num-batched-tokens=3072` | infeasible |
| `trial-0008` | `trial-0006 + max-num-batched-tokens=3584` | `2.9667 req/s`, feasible |
| `trial-0009` | `trial-0006 + max-num-batched-tokens=3328` | infeasible |
| `trial-0010` | `trial-0006 + max-num-batched-tokens=3712` | `3.0350 req/s`, feasible, best |
| `trial-0011` | `trial-0010 + max-num-batched-tokens=3840` | infeasible |
| `trial-0012` | `trial-0010 + max-num-batched-tokens=3776` | infeasible |
## Key insights
- The main win came from topology first, then local batch-shape refinement. `TP4 -> TP8` was the key change.
- `TP4/DP2` was not just suboptimal; it was unstable at runtime under probing and should be treated as negative evidence for this stack.
- `EP=4` on the baseline `TP4/DP1` path failed at launch with `group_gemm's contiguous kernel requires deepgemm`, so EP is currently not a viable direction here.
- After switching to `TP8/DP1/EP-off`, the remaining gain came from tightening `max-num-batched-tokens` from `4096` into a narrow sweet spot around `3584~3712`.
- Too-small prefill batches (`3072`, `3328`) and too-large ones (`3776`, `3840`) both hurt the TTFT tail enough to lose the `95%` target.
## Current recommendation
Use `trial-0010` as the default serving shape for this workload:
- `tensor-parallel-size=8`
- `data-parallel-size=1`
- `enable-expert-parallel=false`
- `max-num-batched-tokens=3712`
- keep the rest of the `run_qwen235b.sh` baseline unchanged
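
For reference, a minimal launch sketch of this shape. It assumes the stock `vllm serve` entrypoint and shows only the overrides listed above; the internal build's actual entrypoint and the remaining `run_qwen235b.sh` flags may differ and are not reproduced here.

```python
# Hypothetical launcher for the recommended shape. Assumes the stock
# `vllm serve` CLI; the internal build and the baseline script's remaining
# flags are not reproduced here.
import subprocess

cmd = [
    "/usr/local/bin/vllm", "serve",
    "/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717",
    "--tensor-parallel-size", "8",
    "--data-parallel-size", "1",
    "--max-num-batched-tokens", "3712",
    # expert parallelism stays disabled: --enable-expert-parallel is not passed
]
subprocess.run(cmd, check=True)
```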