server: serve gpt-oss on a single GPU via the TP engine (world=1)

gpt-oss has no single-GPU engine path, so --tp 1 fell through to the
Qwen3-only engine and every request 503'd. Route gpt_oss to run_tp
even at tp=1: NCCL world-1 init works and all_reduce already no-ops
(bench-gpt-oss --tp 1 exercised this path). Quantized gpt-oss (22 GB
FP8 / 13 GB MXFP4) now serves on one 32 GB 5090.

Also fix eval_gsm8k_fast.py --gpu to accept a device list ("2,3"):
it was type=int, so any --tp 2 run pinned CUDA_VISIBLE_DEVICES to one
GPU and rank 1's set_device panicked while rank 0 spun in NCCL init.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@@ -81,7 +81,8 @@ def main():
     parser.add_argument("--max-tokens", type=int, default=512, help="Max generation tokens")
     parser.add_argument("--tp", type=int, default=1, help="Tensor parallelism")
     parser.add_argument("--offset", type=int, default=0, help="Start from problem N")
-    parser.add_argument("--gpu", type=int, default=0, help="GPU device index")
+    parser.add_argument("--gpu", type=str, default="0",
+                        help="CUDA_VISIBLE_DEVICES value, e.g. '0' or '2,3' (must cover --tp ranks)")
     args = parser.parse_args()
 
     if not DATA_PATH.exists():
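The help text says --gpu "must cover --tp ranks", but the hunk above does not show any check for that. A minimal sketch of such a check follows, assuming the script passes the --gpu string through as CUDA_VISIBLE_DEVICES for the server process. The helper name visible_devices_for is hypothetical and not part of this commit, and the sketch accepts only integer indices (not GPU UUIDs), which is a simplification.

```python
def visible_devices_for(gpu: str, tp: int) -> str:
    """Validate a --gpu string such as '0' or '2,3' against the --tp value.

    Hypothetical helper (not in the commit). Returns a normalized string
    suitable for CUDA_VISIBLE_DEVICES, or raises ValueError when the list
    is malformed or names fewer devices than there are TP ranks.
    """
    devices = []
    for part in gpu.split(","):
        part = part.strip()
        if not part.isdigit():
            raise ValueError(f"invalid GPU index {part!r} in --gpu {gpu!r}")
        devices.append(int(part))

    if len(set(devices)) != len(devices):
        raise ValueError(f"duplicate GPU index in --gpu {gpu!r}")

    # Each TP rank needs its own visible device; with too few, a rank's
    # device selection fails while the other ranks wait in NCCL init.
    if len(devices) < tp:
        raise ValueError(
            f"--tp {tp} needs at least {tp} devices, but --gpu {gpu!r} lists {len(devices)}"
        )
    return ",".join(str(d) for d in devices)
```

Called right after parse_args(), for example as visible_devices_for(args.gpu, args.tp), a check like this would turn the hang described in the commit message into an immediate error before any process is launched.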