Add qwen27b and qwen235b tuning notes
qwen3-235b-a22b `thinking` trace, `decode_only` mode, internal vLLM (`/usr/local/bin/vllm`), SLO: `p95-equivalent pass target 95%`, `TPOT <= 40ms`, `TTFT` not enforced.

## Setup

- Hardware: `dash0`, `8x H20`
- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
- Engine: internal vLLM, decode-only mode with `--kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'` (the JSON value must be quoted in the shell)
- Baseline topology: `TP=4, DP=2, EP=8`
- Trace: `thinking_w20260327_1000`
- Trace source: `trace_windows/traces/thinking_w20260327_1000.jsonl`
- Window duration: `600s` (`10:00-10:10`, `2026-03-27`)
- Request mode: `decode_only`
- SLO:
  - pass target: `95%`
  - `TPOT <= 40ms`
  - `TTFT` not enforced
- Search:
  - `sampling_u in [0, 0.125]`
  - `max_probes = 6`
  - `12` trials total
- Proposal model: `codex / gpt-5.4`

## Run assets

- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology`
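The feasibility rule for this study (pass target `95%`, `TPOT <= 40ms`, `TTFT` not enforced) can be sketched as a per-request check. This is an illustrative sketch, not tuner code: the helper names and the sample latencies are made up.

```python
# Sketch of the decode SLO check: a trial is feasible when at least 95% of
# requests keep mean time-per-output-token (TPOT) at or under 40ms.
def slo_pass_rate(tpot_ms, tpot_limit_ms=40.0):
    """Fraction of requests whose TPOT meets the limit."""
    met = sum(1 for t in tpot_ms if t <= tpot_limit_ms)
    return met / len(tpot_ms)

def trial_feasible(tpot_ms, pass_target=0.95):
    return slo_pass_rate(tpot_ms) >= pass_target

# Illustrative per-request TPOT measurements in milliseconds.
tpots = [31.2, 38.9, 35.0, 41.5, 28.7, 39.9, 36.1, 33.3, 30.0, 37.5]
print(slo_pass_rate(tpots))   # 0.9  (one request exceeds 40ms)
print(trial_feasible(tpots))  # False (0.9 < 0.95 pass target)
```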
# qwen27b-chat-pd-colocation

qwen3.5-27b `chat` trace, `0~8k` input bucket, internal vLLM (`/usr/local/bin/vllm`), baseline aligned to `~/run_qwen27b.sh`, compared by `request_rate_per_gpu`.

## Setup

- Hardware: `dash0`, `8x H20`
- Model: `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`
- Engine: internal vLLM, PD-colocation baseline from `~/run_qwen27b.sh`
- Baseline topology: `TP=1, DP=1, EP=1`
- Trace: `chat_w20260311_1000`
- Trace source: `trace_windows/traces/chat_w20260311_1000.jsonl`
- Window duration: `600s` (`10:00-10:10`, `2026-03-11`)
- Request mode: `chat`
- Input bucket: `0 <= input_length <= 8192`
- SLO:
  - pass target: `95%`
  - `TTFT <= 2000ms` for `<=4096` input tokens
  - `TTFT <= 4000ms` for `<=32768` input tokens
  - `TTFT <= 6000ms` for `>32768` input tokens
  - `TPOT <= 50ms`
- Search:
  - `sampling_u in [0, 0.0625]`
  - `max_probes = 6`
  - `12` trials total
- Proposal model: `codex / gpt-5.4`
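The tiered SLO above can be read as a small predicate: the TTFT limit depends on input length, while TPOT has a single cap. A minimal sketch of the rule as stated (function names are illustrative; the `>32768` tier cannot trigger inside this `0~8k` bucket but is part of the stated SLO):

```python
# Tiered TTFT limit by input length, per the Setup section.
def ttft_limit_ms(input_length: int) -> float:
    if input_length <= 4096:
        return 2000.0
    if input_length <= 32768:
        return 4000.0
    return 6000.0

# A request passes when both its TTFT tier and the flat 50ms TPOT cap hold.
def request_meets_slo(input_length, ttft_ms, tpot_ms, tpot_limit_ms=50.0):
    return ttft_ms <= ttft_limit_ms(input_length) and tpot_ms <= tpot_limit_ms

print(ttft_limit_ms(3000))                    # 2000.0
print(request_meets_slo(8192, 3500.0, 42.0))  # True (8192 falls in the 4000ms tier)
```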
## Run assets

- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`
- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology/state.json`
- Log: `/home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen27b_tight_slo_run9_0_8k_codex_topology.log`
- Spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/specs/dash0_qwen27b_tight_slo_run9_0_8k_codex_topology.json`
## Best result

- Best trial: `trial-0004`
- Best config:
  - `tensor-parallel-size=2`
  - `data-parallel-size=1`
- Best `sampling_u`: `0.013061523438`
- Best request rate: `0.405 req/s`
- Best request rate per GPU: `0.2025 req/s/gpu`
- Best pass rate: `0.9629629629629629`
Compared with baseline:

- `trial-0001`: `0.035 req/s`, `0.035 req/s/gpu`
- `trial-0004`: `0.405 req/s`, `0.2025 req/s/gpu`
- Raw throughput gain: `11.57x`
- Per-GPU throughput gain: `5.79x` (`trial-0004` runs on 2 GPUs, the baseline on 1)
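The two gain figures follow directly from the rates above; a sketch reproducing them (GPU counts are inferred from the per-GPU rates, not stated separately in the run assets):

```python
# Reproduce the baseline-vs-best comparison from the doc's own figures.
baseline = {"req_s": 0.035, "gpus": 1}  # trial-0001, TP1/DP1
best = {"req_s": 0.405, "gpus": 2}      # trial-0004, TP2/DP1

base_per_gpu = baseline["req_s"] / baseline["gpus"]
best_per_gpu = best["req_s"] / best["gpus"]

print(round(best["req_s"] / baseline["req_s"], 2))  # 11.57 (raw gain)
print(round(best_per_gpu / base_per_gpu, 2))        # 5.79 (per-GPU gain)
```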
## 12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline `TP1/DP1` | `0.0350 req/s`, `0.0350 req/s/gpu`, feasible |
| `trial-0002` | `DP=2` | `0.1233 req/s`, `0.0617 req/s/gpu`, feasible |
| `trial-0003` | `DP=4` | `0.1567 req/s`, `0.0392 req/s/gpu`, feasible |
| `trial-0004` | `TP=2, DP=1` | `0.4050 req/s`, `0.2025 req/s/gpu`, feasible, best |
| `trial-0005` | `trial-0004 + max-num-batched-tokens=16384` | infeasible |
| `trial-0006` | `trial-0004 + max-num-seqs=24` | infeasible |
| `trial-0007` | `trial-0004 + max-num-batched-tokens=12288` | infeasible |
| `trial-0008` | `trial-0004 + block-size=32` | infeasible |
| `trial-0009` | `trial-0004 + gpu-memory-utilization=0.93` | infeasible |
| `trial-0010` | `trial-0004 + max-num-seqs=16, max-num-batched-tokens=6144` | infeasible |
| `trial-0011` | `trial-0004 + enable-prefix-caching=false` | infeasible |
| `trial-0012` | `trial-0004 + block-size=128` | infeasible |
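The selection rule behind "best" in the table (maximize `request_rate_per_gpu` over feasible trials) can be sketched as follows. Only the feasible rows carry rates; the GPU counts are inferred from the per-GPU rates (`DP=2 -> 2`, `DP=4 -> 4`, `TP=2/DP=1 -> 2`), not stated directly.

```python
# Pick the best trial by request_rate_per_gpu among feasible trials,
# mirroring the 12-trial summary (one infeasible row shown as an example).
trials = [
    ("trial-0001", 0.0350, 1, True),   # baseline TP1/DP1
    ("trial-0002", 0.1233, 2, True),   # DP=2
    ("trial-0003", 0.1567, 4, True),   # DP=4
    ("trial-0004", 0.4050, 2, True),   # TP=2, DP=1
    ("trial-0005", None, 2, False),    # runtime tweak, infeasible -> no rate
]
feasible = [(name, rate / gpus) for name, rate, gpus, ok in trials if ok]
best = max(feasible, key=lambda t: t[1])
print(best)  # ('trial-0004', 0.2025)
```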
## Key insights

- The baseline must be the real `~/run_qwen27b.sh` `TP=1` shape. Under that correct baseline, `TP=2, DP=1` is clearly better on both raw throughput and `request_rate_per_gpu`.
- Pure DP scaling helped from `DP1 -> DP2`, but `DP4` already lost per-GPU efficiency. The main win came from `TP2`, not from adding more replicas.
- After the topology settled at `TP2/DP1`, the remaining bottleneck was the `TTFT` tail, not `TPOT`. Later runtime-only trials generally failed around `0.435 req/s` with `pass_rate ~= 0.89`: `TPOT p95` stayed acceptable while `TTFT p95` stayed near `2.5s~3.0s`, above the `2000ms` tier for short inputs.
- For this `0~8k` chat bucket, the useful topology search space was small but important. Without per-topology `sampling_u` search isolation, this result would have been easy to miss.
## Current recommendation

Use `trial-0004` as the default serving shape for this workload:

- `tensor-parallel-size=2`
- `data-parallel-size=1`
- keep the rest of the `run_qwen27b.sh` baseline unchanged
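A minimal launch sketch of this shape, assuming the internal vLLM binary accepts the standard `vllm serve` flags named above. The model path and flag names come from Setup; the real `run_qwen27b.sh` baseline may set additional options that this sketch omits.

```python
# Hypothetical launch command for the recommended trial-0004 shape.
import shlex

cmd = [
    "/usr/local/bin/vllm", "serve",
    "/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal",
    "--tensor-parallel-size", "2",
    "--data-parallel-size", "1",
]
print(shlex.join(cmd))
# e.g. subprocess.run(cmd, check=True) to actually launch
```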