Constrain harness topology by visible GPUs

commit 5c2958e6c1 (parent fb6d74a18c), 2026-05-13 01:25:31 +08:00
3 changed files with 143 additions and 7 deletions

@@ -104,3 +104,29 @@ Started on `dash0` (`11.73.2.172`) at commit `e3ed775`.
- monitor: read-only subagent `Wegener`
Acceptance for this run is based on end-to-end trial results, not unit tests. If none of the first four trials beats the min-prompt no-harness baseline (`0.0650`, `0.1992`, `0.2696`, then failed/NA), the run is treated as a failed harness iteration and the harness must be optimized again.
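Stated as code, a minimal sketch of this acceptance rule (the function name and the `None`-for-failed encoding are illustrative assumptions, not part of the harness):

```python
# Hypothetical sketch of the acceptance rule above; the name and the
# None-encodes-a-failed-trial convention are illustrative, not harness API.
BASELINE = [0.0650, 0.1992, 0.2696, None]  # min-prompt no-harness, req/s/GPU

def run_improves_baseline(trials):
    """Accept only if at least one of the first four trials beats the baseline."""
    for got, base in zip(trials[:4], BASELINE):
        if got is None:        # this trial failed outright
            continue
        if base is None or got > base:
            return True        # beat the baseline, or succeeded where it failed
    return False               # every trial lags: treat as a failed harness iteration
```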
## V2 Result And Failure
V2 was stopped early after four trials because it did not improve on the no-harness baseline and made a preventable launch-risk proposal.
Raw `request_rate/GPU` (req/s/GPU):
| Variant | iter1 | iter2 | iter3 | iter4 |
| --- | ---: | ---: | ---: | ---: |
| no-harness min-prompt | 0.0650 | 0.1992 | 0.2696 | 0.2696 |
| harness v2 | 0.0650 | 0.1992 | 0.2696 | failed |
Harness v2 correctly diagnosed the first bottleneck and proposed:
- iter2: `tensor-parallel-size=2`, raw `0.1992 req/s/GPU`;
- iter3: `tensor-parallel-size=4`, raw `0.2696 req/s/GPU`.
However, iter4 proposed `tensor-parallel-size=8` and failed at engine launch. The study's `hardware.gpu_count` is 8, but the launch environment sets `CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7`, which exposes only 7 GPUs. Therefore TP=8 should not have been considered launch-safe.
This is a general harness bug: topology planning must use the effective visible GPU count from the execution profile, not just the nominal hardware count.
Fix (a minimal sketch follows the list):
- parse `engine.base_envs.CUDA_VISIBLE_DEVICES`;
- compute effective GPU count as `min(hardware.gpu_count, visible_device_count)`;
- filter topology candidates and adjacent TP frontier candidates by the effective GPU count.
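A minimal sketch of the fix, assuming a dict-shaped execution profile; `visible_device_count`, `effective_gpu_count`, and `filter_tp_candidates` are illustrative names, not the harness's actual API:

```python
from typing import Iterable, List, Optional

def visible_device_count(base_envs: dict) -> Optional[int]:
    """Count devices exposed by CUDA_VISIBLE_DEVICES; None if unset (all visible)."""
    raw = base_envs.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return len([tok for tok in raw.split(",") if tok.strip()])

def effective_gpu_count(hardware_gpu_count: int, base_envs: dict) -> int:
    """Effective count = min(nominal hardware count, visible device count)."""
    visible = visible_device_count(base_envs)
    return hardware_gpu_count if visible is None else min(hardware_gpu_count, visible)

def filter_tp_candidates(candidates: Iterable[int], effective: int) -> List[int]:
    """Drop tensor-parallel sizes that cannot launch on the visible devices."""
    return [tp for tp in candidates if tp <= effective]

# The failing configuration: hardware.gpu_count=8, but only 7 devices visible.
envs = {"CUDA_VISIBLE_DEVICES": "0,1,2,4,5,6,7"}
eff = effective_gpu_count(8, envs)              # -> 7
print(filter_tp_candidates([1, 2, 4, 8], eff))  # -> [1, 2, 4]; TP=8 is pruned
```

With this filter in place, the iter4 `tensor-parallel-size=8` proposal would have been pruned before engine launch.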