docs: expand qwen27b 0-8k compare summary
@@ -34,6 +34,7 @@ qwen3.5-27b `chat` trace, `0~8k` input bucket, tuned-best vs baseline cross-day

- Search:
  - binary search on `sampling_u`
  - `max_probes = 6`
- Proposal model for tuned source: `codex / gpt-5.4`

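The bounded `sampling_u` search above can be sketched as a binary search that stops after `max_probes` iterations. This is an illustrative reconstruction, not the study's actual probe code: the `is_feasible` callback, the `[0, 1]` bounds, and the assumption that feasibility is monotone in `sampling_u` are all hypothetical.

```python
def search_sampling_u(is_feasible, lo=0.0, hi=1.0, max_probes=6):
    """Bounded binary search for the largest feasible `sampling_u`.

    `is_feasible(u)` is a hypothetical probe that runs the workload at
    load `u` and reports whether the target SLO still holds. Higher `u`
    is assumed to mean more load, so feasibility is monotone in `u`.
    Returns None when no probed point was feasible.
    """
    best = None
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if is_feasible(mid):
            best = mid  # feasible: remember it and push the load higher
            lo = mid
        else:
            hi = mid    # infeasible: back off
    return best
```

With 6 probes the search resolves the feasibility boundary to within `1/64` of the initial range, which matches the precision of the reported best `sampling_u`.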
## Run assets
@@ -43,6 +44,41 @@ qwen3.5-27b `chat` trace, `0~8k` input bucket, tuned-best vs baseline cross-day
- Compare spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/specs/qwen27b_chat_0_8k_compare_dash1.json`
- Tuned study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`

## Tuned-source result

- Best trial: `trial-0004`
- Best config:
  - `tensor-parallel-size=2`
  - `data-parallel-size=1`
- Best `sampling_u`: `0.013061523438`
- Best request rate: `0.405 req/s`
- Best request rate per GPU: `0.2025 req/s/gpu`
- Best pass rate: `0.9629629629629629`
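If the study launches a vLLM-style server, the winning `trial-0004` shape would map to flags like the following. This is only a sketch: the `vllm serve` CLI and the model path are assumptions, and only the two parallelism values come from the result above.

```shell
# Hypothetical launch for the winning trial-0004 topology (model path assumed).
vllm serve /path/to/qwen3.5-27b \
  --tensor-parallel-size 2 \
  --data-parallel-size 1
```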

Compared with the single-day baseline on `chat_w20260311_1000`:

- `trial-0001`: `0.035 req/s`, `0.035 req/s/gpu`
- `trial-0004`: `0.405 req/s`, `0.2025 req/s/gpu`
- Raw throughput gain: `11.57x`
- Per-GPU throughput gain: `5.79x`
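The two gain figures follow directly from the rates above. A minimal check, assuming GPU counts of 1 for the `TP1/DP1` baseline and 2 for the tuned `TP2/DP1` shape:

```python
# Throughput gains of tuned trial-0004 over baseline trial-0001.
baseline_rps, baseline_gpus = 0.035, 1  # TP1/DP1 -> 1 GPU (assumed)
tuned_rps, tuned_gpus = 0.405, 2        # TP2/DP1 -> 2 GPUs (assumed)

raw_gain = tuned_rps / baseline_rps
per_gpu_gain = (tuned_rps / tuned_gpus) / (baseline_rps / baseline_gpus)

print(f"raw: {raw_gain:.2f}x, per-GPU: {per_gpu_gain:.2f}x")
# prints "raw: 11.57x, per-GPU: 5.79x"
```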

## 12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline `TP1/DP1` | `0.0350 req/s`, `0.0350 req/s/gpu`, feasible |
| `trial-0002` | `DP=2` | `0.1233 req/s`, `0.0617 req/s/gpu`, feasible |
| `trial-0003` | `DP=4` | `0.1567 req/s`, `0.0392 req/s/gpu`, feasible |
| `trial-0004` | `TP=2, DP=1` | `0.4050 req/s`, `0.2025 req/s/gpu`, feasible, best |
| `trial-0005` | `trial-0004 + max-num-batched-tokens=16384` | infeasible |
| `trial-0006` | `trial-0004 + max-num-seqs=24` | infeasible |
| `trial-0007` | `trial-0004 + max-num-batched-tokens=12288` | infeasible |
| `trial-0008` | `trial-0004 + block-size=32` | infeasible |
| `trial-0009` | `trial-0004 + gpu-memory-utilization=0.93` | infeasible |
| `trial-0010` | `trial-0004 + max-num-seqs=16, max-num-batched-tokens=6144` | infeasible |
| `trial-0011` | `trial-0004 + enable-prefix-caching=false` | infeasible |
| `trial-0012` | `trial-0004 + block-size=128` | infeasible |

## Aggregate result

- Comparable wins: tuned `5`, baseline `0`

@@ -66,6 +102,7 @@ qwen3.5-27b `chat` trace, `0~8k` input bucket, tuned-best vs baseline cross-day

## Key insights

- The tuned-source tuning itself was simple and topology-driven. The winning patch is only `TP1 -> TP2`; later runtime-only tweaks all failed to beat it.
- This compare does not support the conclusion that the tuned config lacks generalization. Across the full 7-day slice, tuned wins every directly comparable window.
- The two `incomparable` days are not execution failures. Baseline completed probing but never found a single feasible `sampling_u` under the target SLO, while tuned still found feasible operating points.
- The tuned `TP=2, DP=1` shape is materially more robust than the `TP=1, DP=1` baseline for this `0~8k` chat bucket.