From ee9ec3c60bc9505893ad55bb4b7d49a92c32c10b Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Mon, 13 Apr 2026 09:33:02 +0800
Subject: [PATCH] docs: add qwen235b decode 0323 summary

---
 docs/qwen235b-thinking-decode/0323.md | 85 +++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100644 docs/qwen235b-thinking-decode/0323.md

diff --git a/docs/qwen235b-thinking-decode/0323.md b/docs/qwen235b-thinking-decode/0323.md
new file mode 100644
index 0000000..f4d52a3
--- /dev/null
+++ b/docs/qwen235b-thinking-decode/0323.md
@@ -0,0 +1,85 @@
+# qwen235b-thinking-decode-0323
+
+qwen3-235b-a22b `thinking` trace, `decode_only` mode, internal vLLM (`/usr/local/bin/vllm`), tuned on `thinking_w20260323_1000` with `TPOT <= 40ms`.
+
+## Setup
+
+- Hardware: `dash2`, `8x H20`
+- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
+- Engine: internal vLLM, decode-only mode with `--kv-transfer-config {"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}`
+- Baseline topology: `TP=4, DP=2, EP=8`
+- Trace: `thinking_w20260323_1000`
+- Trace source: `trace_windows/traces/thinking_w20260323_1000.jsonl`
+- Window duration: `600s` (`10:00-10:10`, `2026-03-23`)
+- Request mode: `decode_only`
+- SLO:
+  - pass target: `95%`
+  - `TPOT <= 40ms`
+  - `TTFT` not enforced
+- Search:
+  - `sampling_u in [0, 0.125]`
+  - `max_probes = 6`
+  - `12` trials total
+- Proposal model: `codex / gpt-5.4`
+
+## Run assets
+
+- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology`
+- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology/state.json`
+- Spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash2_qwen235b_decode_thinking_run1_0323_tpot40_topology.json`
+
+## Best result
+
+- Best trial: `trial-0007`
+- Best config delta:
+  - `gpu-memory-utilization=0.86`
+- Best topology stayed unchanged:
+  - `tensor-parallel-size=4`
+  - `data-parallel-size=2`
+  - `expert-parallel-size=8`
+- Best `sampling_u`: `0.033736228943`
+- Best request rate: `0.48333333333333334 req/s`
+- Best request rate per GPU: `0.06041666666666667 req/s/gpu`
+- Best pass rate: `0.9551724137931035`
+
+Compared with baseline:
+
+- `trial-0001`: `0.43333333333333335 req/s`, `0.05416666666666667 req/s/gpu`
+- `trial-0007`: `0.48333333333333334 req/s`, `0.06041666666666667 req/s/gpu`
+- Throughput gain: `1.12x`
+
+Best-point latency:
+
+- `TPOT mean/p50/p90/p95/p99 = 26.18 / 24.04 / 38.46 / 39.55 / 40.76 ms`
+
+## 12-trial summary
+
+| Trial | Proposed config delta | Result |
+| --- | --- | --- |
+| `trial-0001` | baseline `TP4/DP2/EP8` | `0.4333 req/s`, feasible |
+| `trial-0002` | `EP=4` | launch fail |
+| `trial-0003` | `gpu-memory-utilization=0.8`, `max-num-seqs=224` | infeasible |
+| `trial-0004` | `gpu-memory-utilization=0.8` | infeasible |
+| `trial-0005` | `gpu-memory-utilization=0.82` | `0.4650 req/s`, feasible |
+| `trial-0006` | `gpu-memory-utilization=0.84` | infeasible |
+| `trial-0007` | `gpu-memory-utilization=0.86` | `0.4833 req/s`, feasible, best |
+| `trial-0008` | `gpu-memory-utilization=0.87` | infeasible |
+| `trial-0009` | `gpu-memory-utilization=0.86`, `block-size=32` | launch fail |
+| `trial-0010` | `gpu-memory-utilization=0.86`, `max-num-seqs=208` | infeasible |
+| `trial-0011` | `gpu-memory-utilization=0.86`, `max-num-seqs=176` | infeasible |
+| `trial-0012` | `gpu-memory-utilization=0.86`, `max-num-batched-tokens=896` | infeasible |
+
+## Key insights
+
+- This `0323` window did not produce a better topology than the baseline. The winning move was a small memory-headroom increase, not a TP/DP/EP change.
+- `EP=4` was not viable under the current deployment shape and failed at launch, so the run quickly converged away from topology changes.
+- The best point is very close to the SLO edge: `TPOT p95 ~= 39.55ms`, so the remaining headroom is small.
+- Compared with the heavier `0327` decode-only tuning, `0323` is a milder window: baseline is already strong, and tuning only adds about `11.5%`.
+
+## Recommendation
+
+For a `0323`-like decode-only `thinking` window, keep the baseline `TP4/DP2/EP8` topology and use:
+
+- `gpu-memory-utilization=0.86`
+
+Do not treat this run as evidence that `0323` prefers the same topology changes as `0327`; this study mainly supports a small residency-headroom refinement.
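As a sanity check on the headline numbers, the per-GPU rate, the `1.12x` gain, and the SLO pass comparison can be reproduced from values copied verbatim out of the summary above. This is an illustrative script only; `NUM_GPUS` is a name introduced here, reflecting the `8x H20` host:

```python
# Illustrative check of the headline numbers in this summary.
# All inputs are copied from the tables above; nothing here is measured.

NUM_GPUS = 8  # dash2, 8x H20

baseline_rate = 0.43333333333333335  # trial-0001, req/s
best_rate = 0.48333333333333334      # trial-0007, req/s

# Per-GPU throughput at the best point.
best_rate_per_gpu = best_rate / NUM_GPUS
print(f"best per-GPU rate: {best_rate_per_gpu:.6f} req/s/gpu")  # 0.060417

# Throughput gain of trial-0007 over the baseline.
gain = best_rate / baseline_rate
print(f"gain: {gain:.2f}x")  # 1.12x

# Feasibility under the study's SLO: at least 95% of requests
# must meet TPOT <= 40ms (TTFT is not enforced).
pass_rate = 0.9551724137931035
print(f"feasible: {pass_rate >= 0.95}")  # True
```

The `1.12x` figure is the rounded ratio `0.4833 / 0.4333`, i.e. the roughly `11.5%` uplift cited in the key insights.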
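For reference, the recommended settings assemble into a launch command along these lines. This is a hedged sketch, not a verified invocation: the study uses an internal vLLM build, so the exact entry point and flag spellings may differ; the flag names are taken from this document, and the `serve` subcommand is an assumption.

```shell
# Hypothetical launch sketch for the recommended 0323 decode-only setup.
# Flag spellings follow this document's naming; verify against the
# internal vLLM build before use.
/usr/local/bin/vllm serve \
  /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --expert-parallel-size 8 \
  --gpu-memory-utilization 0.86 \
  --kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'
```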