From 0eb49dcc346f027e510dd7a814dfa32b2407b12f Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Tue, 26 May 2026 15:09:16 +0800
Subject: [PATCH] Fix NIXL multi-instance port conflict: per-instance SIDE_CHANNEL_PORT

NIXL's _nixl_handshake_listener
(vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:700)
binds a ZMQ ROUTER socket on the side_channel_port, which defaults to
5600. When 8 NIXL vLLMs launch concurrently on the same host all 8
race for tcp://localhost:5600; exactly one succeeds and the others
silently hang in the listener thread with:

    zmq.error.ZMQError: Address already in use (addr='tcp://localhost:5600')

The engines themselves never reach "Application startup complete" and
the b3_isolated_policy.sh health-check times out. First observed when
7 of 8 inst_X.log files contained the ZMQ error and the 8th (by
random ordering) was the one healthy instance.

Fix: set VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600 + i)) per instance in the
NIXL launch branch. Each engine now gets a distinct handshake port
(5600..5607 by default). Verified: all 8 instances now reach
"Application startup complete" within the 360 s health budget.

This is NIXL-specific; Mooncake uses VLLM_MOONCAKE_BOOTSTRAP_PORT
which we were already varying per instance.

Co-Authored-By: Claude Opus 4.7
---
 scripts/b3_isolated_policy.sh | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/scripts/b3_isolated_policy.sh b/scripts/b3_isolated_policy.sh
index ccf3203..cb4eb3e 100755
--- a/scripts/b3_isolated_policy.sh
+++ b/scripts/b3_isolated_policy.sh
@@ -79,9 +79,13 @@ for gpu in $GPU_INDICES; do
             $EXTRA_VLLM_ARGS \
             > "$RUNDIR/logs/vllm_inst_${i}_gpu${gpu}.log" 2>&1 &
     elif [ "$ENABLE_KV_BOTH" = "1" ] && [ "$KV_CONNECTOR" = "Nixl" ]; then
-        # NixlConnector uses UCX side-channels for handshake (no bootstrap
-        # port needed). Side-channel host defaults to NIC IP discovered
-        # at register_kv_caches time.
+        # NixlConnector's handshake listener binds to a fixed default
+        # port 5600 unless VLLM_NIXL_SIDE_CHANNEL_PORT is overridden.
+        # Multiple instances on the same host MUST use distinct ports
+        # or only one will start; the rest hit
+        # `zmq.error.ZMQError: Address already in use`.
+        nixl_port=$((5600 + i))
+        VLLM_NIXL_SIDE_CHANNEL_PORT=$nixl_port \
         AGENTIC_STEP_LOG_PATH="$RUNDIR/engine_state/engine_${i}.jsonl" \
         AGENTIC_WORKER_ID="engine_${i}" \
         CUDA_VISIBLE_DEVICES=$gpu \