Constrain harness topology by visible GPUs
Started on `dash0` (`11.73.2.172`) at commit `e3ed775`.
- monitor: read-only subagent `Wegener`
Acceptance for this run is based on end-to-end trial results, not unit tests. If the first four trials lag the min-prompt no-harness baseline (`0.0650`, `0.1992`, `0.2696`, then failed/NA), the run should be treated as a failed harness iteration and the harness should be optimized again.
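The acceptance rule above can be sketched as a per-iteration comparison against the baseline. This is an illustrative sketch, not the harness's actual code: `BASELINE` and `run_passes` are hypothetical names, and the baseline values are the raw `req/s/GPU` numbers from the results table.

```python
# Hypothetical acceptance check: a harness run passes only if every
# completed trial meets or beats the min-prompt no-harness baseline
# at the same iteration; a failed/NA trial (None) fails the run.
BASELINE = [0.0650, 0.1992, 0.2696, 0.2696]  # no-harness min-prompt, iters 1-4

def run_passes(trial_rates, baseline=BASELINE):
    """trial_rates: raw req/s/GPU per iteration; None marks a failed/NA trial."""
    for rate, base in zip(trial_rates, baseline):
        if rate is None or rate < base:
            return False
    return True

# V2 matched the baseline for three iterations, then failed at iter4,
# so under this rule the run counts as a failed harness iteration:
print(run_passes([0.0650, 0.1992, 0.2696, None]))  # False
```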
## V2 Result And Failure
V2 was stopped early after four trials because it did not improve the no-harness baseline and made a preventable launch-risk proposal.
Raw `request_rate/GPU`:
| Variant | iter1 | iter2 | iter3 | iter4 |
| --- | ---: | ---: | ---: | ---: |
| no-harness min-prompt | 0.0650 | 0.1992 | 0.2696 | 0.2696 |
| harness v2 | 0.0650 | 0.1992 | 0.2696 | failed |
Harness v2 correctly diagnosed the first bottleneck and proposed:
- iter2: `tensor-parallel-size=2`, raw `0.1992 req/s/GPU`;
- iter3: `tensor-parallel-size=4`, raw `0.2696 req/s/GPU`.
However, iter4 proposed `tensor-parallel-size=8` and failed at engine launch. The study's `hardware.gpu_count` is 8, but the launch environment sets `CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7`, which exposes only 7 GPUs. Therefore TP=8 should not have been considered launch-safe.
This is a general harness bug: topology planning must use the effective visible GPU count from the execution profile, not just the nominal hardware count.
Fix:
- parse `engine.base_envs.CUDA_VISIBLE_DEVICES`;
- compute effective GPU count as `min(hardware.gpu_count, visible_device_count)`;
- filter topology candidates and adjacent TP frontier candidates by the effective GPU count.
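The fix steps above can be sketched as follows. The config keys `engine.base_envs.CUDA_VISIBLE_DEVICES` and `hardware.gpu_count` are from this run's setup; the function names (`effective_gpu_count`, `launch_safe_tp_sizes`) and the candidate list are hypothetical, for illustration only.

```python
def effective_gpu_count(hardware_gpu_count, base_envs):
    """Effective GPU count: min of the nominal hardware count and the
    number of devices exposed via CUDA_VISIBLE_DEVICES (if set)."""
    cvd = base_envs.get("CUDA_VISIBLE_DEVICES")
    if cvd is None:
        return hardware_gpu_count
    visible = len([d for d in cvd.split(",") if d.strip()])
    return min(hardware_gpu_count, visible)

def launch_safe_tp_sizes(hardware_gpu_count, base_envs, candidates=(1, 2, 4, 8)):
    """Keep only tensor-parallel sizes that fit the effective GPU count."""
    n = effective_gpu_count(hardware_gpu_count, base_envs)
    return [tp for tp in candidates if tp <= n]

# The failing case from this run: nominal gpu_count is 8, but only 7
# devices are visible, so TP=8 is filtered out before launch.
envs = {"CUDA_VISIBLE_DEVICES": "0,1,2,4,5,6,7"}
print(launch_safe_tp_sizes(8, envs))  # [1, 2, 4]
```

With this filter in place, iter4 would have been forced to pick a launch-safe candidate instead of failing at engine launch.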