Constrain harness topology by visible GPUs

commit 5c2958e6c1 (parent fb6d74a18c), 2026-05-13 01:25:31 +08:00
3 changed files with 143 additions and 7 deletions

@@ -104,3 +104,29 @@ Started on `dash0` (`11.73.2.172`) at commit `e3ed775`.
- monitor: read-only subagent `Wegener`
Acceptance for this run is based on end-to-end trial results, not unit tests. If none of the first four trials beats the min-prompt no-harness baseline (`0.0650`, `0.1992`, `0.2696`, then failed/NA), the run is treated as a failed harness iteration and the harness must be optimized again.
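Stated as code, a minimal sketch of this acceptance rule (the function name and the `None`-for-failed encoding are illustrative assumptions, not part of the harness):

```python
# Hypothetical sketch of the acceptance rule above; the name and the
# None-encodes-a-failed-trial convention are illustrative, not harness API.
BASELINE = [0.0650, 0.1992, 0.2696, None]  # min-prompt no-harness, req/s/GPU

def run_improves_baseline(trials):
    """Accept only if at least one of the first four trials beats the baseline."""
    for got, base in zip(trials[:4], BASELINE):
        if got is None:        # this trial failed outright
            continue
        if base is None or got > base:
            return True        # beat the baseline, or succeeded where it failed
    return False               # every trial lags: treat as a failed harness iteration
```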
## V2 Result And Failure
V2 was stopped early after four trials because it did not improve on the no-harness baseline and made a preventable launch-risk proposal.
Raw `request_rate/GPU` (req/s/GPU):
| Variant | iter1 | iter2 | iter3 | iter4 |
| --- | ---: | ---: | ---: | ---: |
| no-harness min-prompt | 0.0650 | 0.1992 | 0.2696 | 0.2696 |
| harness v2 | 0.0650 | 0.1992 | 0.2696 | failed |
Harness v2 correctly diagnosed the first bottleneck and proposed:
- iter2: `tensor-parallel-size=2`, raw `0.1992 req/s/GPU`;
- iter3: `tensor-parallel-size=4`, raw `0.2696 req/s/GPU`.
However, iter4 proposed `tensor-parallel-size=8` and failed at engine launch. The study's `hardware.gpu_count` is 8, but the launch environment sets `CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7`, which exposes only 7 GPUs. Therefore TP=8 should not have been considered launch-safe.
This is a general harness bug: topology planning must use the effective visible GPU count from the execution profile, not just the nominal hardware count.
Fix (a minimal sketch follows the list):
- parse `engine.base_envs.CUDA_VISIBLE_DEVICES`;
- compute effective GPU count as `min(hardware.gpu_count, visible_device_count)`;
- filter topology candidates and adjacent TP frontier candidates by the effective GPU count.
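A minimal sketch of the fix, assuming a dict-shaped execution profile; `visible_device_count`, `effective_gpu_count`, and `filter_tp_candidates` are illustrative names, not the harness's actual API:

```python
from typing import Iterable, List, Optional

def visible_device_count(base_envs: dict) -> Optional[int]:
    """Count devices exposed by CUDA_VISIBLE_DEVICES; None if unset (all visible)."""
    raw = base_envs.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return len([tok for tok in raw.split(",") if tok.strip()])

def effective_gpu_count(hardware_gpu_count: int, base_envs: dict) -> int:
    """Effective count = min(nominal hardware count, visible device count)."""
    visible = visible_device_count(base_envs)
    return hardware_gpu_count if visible is None else min(hardware_gpu_count, visible)

def filter_tp_candidates(candidates: Iterable[int], effective: int) -> List[int]:
    """Drop tensor-parallel sizes that cannot launch on the visible devices."""
    return [tp for tp in candidates if tp <= effective]

# The failing configuration: hardware.gpu_count=8, but only 7 devices visible.
envs = {"CUDA_VISIBLE_DEVICES": "0,1,2,4,5,6,7"}
eff = effective_gpu_count(8, envs)              # -> 7
print(filter_tp_candidates([1, 2, 4, 8], eff))  # -> [1, 2, 4]; TP=8 is pruned
```

With this filter in place, the iter4 `tensor-parallel-size=8` proposal would have been pruned before engine launch.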