Document eight-GPU harness rerun
@@ -130,3 +130,26 @@ Fix:
- parse `engine.base_envs.CUDA_VISIBLE_DEVICES`;
- compute effective GPU count as `min(hardware.gpu_count, visible_device_count)`;
- filter topology candidates and adjacent TP frontier candidates by the effective GPU count.
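
The three fix steps above can be sketched roughly as follows. This is a minimal illustration, not the harness code: the function names and the candidate shape (a dict with `tp` and an optional `dp` field) are assumptions made for the example. An unset variable is treated as "no restriction", and an empty string as "no visible devices".

```python
def parse_visible_devices(value):
    """Split a CUDA_VISIBLE_DEVICES string into its device entries.

    Returns None when the variable is unset (no restriction applies).
    """
    if value is None:
        return None
    return [token.strip() for token in value.split(",") if token.strip()]


def effective_gpu_count(hardware_gpu_count, cuda_visible_devices):
    """min(hardware.gpu_count, visible_device_count), as described above."""
    devices = parse_visible_devices(cuda_visible_devices)
    if devices is None:
        return hardware_gpu_count
    return min(hardware_gpu_count, len(devices))


def filter_candidates(candidates, gpu_limit):
    """Keep candidates whose GPU demand fits within the effective count.

    Assumed (hypothetical) candidate shape: {"tp": int, "dp": int (optional)}.
    """
    return [c for c in candidates if c["tp"] * c.get("dp", 1) <= gpu_limit]


old_limit = effective_gpu_count(8, "0,1,2,4,5,6,7")    # GPU 3 skipped
new_limit = effective_gpu_count(8, "0,1,2,3,4,5,6,7")  # all eight cards
candidates = [{"tp": 4}, {"tp": 8}, {"tp": 4, "dp": 2}]
print(old_limit, filter_candidates(candidates, old_limit))
print(new_limit, filter_candidates(candidates, new_limit))
```

Under the old seven-device profile only the `tp=4` candidate survives, while the eight-device profile keeps all three.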

## GPU Visibility Correction

On 2026-05-13 we corrected the intended experiment setup: `CUDA_VISIBLE_DEVICES` should be `0,1,2,3,4,5,6,7`, not the previous `0,1,2,4,5,6,7`, which omitted GPU 3.

This invalidates direct comparison between the old `gpu3skip` runs and the new 8-GPU runs. The old v2 failure was real under the old visible-device profile, but that profile was not the intended 8-card H20 setup.

New comparable studies:

| Variant | Study ID | Status |
| --- | --- | --- |
| no-harness baseline | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-noharness-minprompt-gpt54-20260513` | running first |
| harness | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-harness-profileplanner-20260513` | queued to run after baseline |

Both specs set:

- `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`
- model endpoint: `gpt-5.4`
- workload: qwen3.5-27b chat 0-8k
- SLO: TTFT p95 <= 4000ms, TPOT p95 <= 25ms, target pass rate 0.95
- search: full range, `inherit_incumbent_floor=false`
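
To make the latency part of the SLO concrete, here is a small sketch of a p95 check against the two thresholds listed above. It assumes per-request TTFT and TPOT samples in milliseconds and uses a nearest-rank percentile. The function names are hypothetical, and the benchmark tool may compute percentiles differently. How the 0.95 target pass rate is combined with these checks is not stated here, so it is left out.

```python
TTFT_P95_LIMIT_MS = 4000  # from the SLO above
TPOT_P95_LIMIT_MS = 25    # from the SLO above


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(ttft_ms, tpot_ms):
    """True when both p95 values are within their limits."""
    return p95(ttft_ms) <= TTFT_P95_LIMIT_MS and p95(tpot_ms) <= TPOT_P95_LIMIT_MS


# One slow request out of twenty does not move the nearest-rank p95.
print(meets_latency_slo([3000] * 19 + [5000], [20] * 20))
# Two slow requests out of twenty do.
print(meets_latency_slo([3000] * 18 + [5000] * 2, [20] * 20))
```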

The no-harness baseline is running in tmux session `qwen27b-gpu8-noharness-20260513`. Start the harness run only after the baseline finishes or reaches a sufficient early comparison point: both runs need the full GPU host and must not run concurrently.