0eb49dcc346f027e510dd7a814dfa32b2407b12f
NIXL's _nixl_handshake_listener (vllm/distributed/kv_transfer/
kv_connector/v1/nixl_connector.py:700) binds a ZMQ ROUTER socket on
the side_channel_port, which defaults to 5600. When 8 NIXL vLLMs
launch concurrently on the same host all 8 race for tcp://localhost:5600;
exactly one succeeds and the others silently hang in the listener
thread with:
zmq.error.ZMQError: Address already in use (addr='tcp://localhost:5600')
The engines themselves never reach "Application startup complete"
and the b3_isolated_policy.sh health-check times out. First observed
when 7 of 8 inst_X.log files contained the ZMQ error and the 8th
(by random ordering) was the one healthy instance.
Fix: set VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600 + i)) per instance in
the NIXL launch branch. Each engine now gets a distinct handshake
port (5600..5607 by default). Verified: all 8 instances now reach
"Application startup complete" within the 360 s health budget.
This is NIXL-specific; Mooncake uses VLLM_MOONCAKE_BOOTSTRAP_PORT
which we were already varying per instance.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Description
No description provided
Languages
Python
82.9%
Shell
17.1%