Fix NIXL multi-instance port conflict: per-instance SIDE_CHANNEL_PORT

NIXL's _nixl_handshake_listener (vllm/distributed/kv_transfer/
kv_connector/v1/nixl_connector.py:700) binds a ZMQ ROUTER socket on
the side_channel_port, which defaults to 5600. When 8 NIXL vLLM
instances launch concurrently on the same host, all 8 race to bind
tcp://localhost:5600; exactly one succeeds and the other 7 hang, with
the listener thread reporting:

    zmq.error.ZMQError: Address already in use (addr='tcp://localhost:5600')

The 7 affected engines never reach "Application startup complete",
so the b3_isolated_policy.sh health check times out. First observed
when 7 of 8 inst_X.log files contained the ZMQ error and the 8th
(whichever won the bind race) was the only healthy instance.
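
For triage, the number of instances that lost the bind race can be
counted from the logs. This is a minimal sketch, not part of this
change: count_bind_failures is a hypothetical helper, and it assumes
the instance logs sit in one directory with a .log suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count instance logs that contain the ZMQ bind
# failure. The directory layout (*.log in one folder) is an assumption.
count_bind_failures() {
    local log_dir=$1
    # grep -l prints each matching file name once; -s hides errors for
    # missing or unreadable files, so an empty directory yields 0.
    grep -ls 'Address already in use' "$log_dir"/*.log | wc -l | tr -d ' '
}
```

In the failure described above, this would report 7 when pointed at
the directory holding the 8 instance logs.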

Fix: set VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600 + i)) per instance in
the NIXL launch branch. Each engine now gets a distinct handshake
port (5600..5607 for 8 instances). Verified: all 8 instances now
reach "Application startup complete" within the 360 s health budget.

This is NIXL-specific; Mooncake uses VLLM_MOONCAKE_BOOTSTRAP_PORT,
which we were already varying per instance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 15:09:16 +08:00
parent 151bf33541
commit 0eb49dcc34

@@ -79,9 +79,13 @@ for gpu in $GPU_INDICES; do
 $EXTRA_VLLM_ARGS \
 > "$RUNDIR/logs/vllm_inst_${i}_gpu${gpu}.log" 2>&1 &
 elif [ "$ENABLE_KV_BOTH" = "1" ] && [ "$KV_CONNECTOR" = "Nixl" ]; then
-# NixlConnector uses UCX side-channels for handshake (no bootstrap
-# port needed). Side-channel host defaults to NIC IP discovered
-# at register_kv_caches time.
+# NixlConnector's handshake listener binds to a fixed default
+# port 5600 unless VLLM_NIXL_SIDE_CHANNEL_PORT is overridden.
+# Multiple instances on the same host MUST use distinct ports
+# or only one will start; the rest hit
+# `zmq.error.ZMQError: Address already in use`.
+nixl_port=$((5600 + i))
+VLLM_NIXL_SIDE_CHANNEL_PORT=$nixl_port \
 AGENTIC_STEP_LOG_PATH="$RUNDIR/engine_state/engine_${i}.jsonl" \
 AGENTIC_WORKER_ID="engine_${i}" \
 CUDA_VISIBLE_DEVICES=$gpu \