The upstream mooncake_connector_proxy round-robins both P and D
selection. For agentic multi-turn sessions this destroys prefix-cache
reuse on the producer side — every turn of a session lands on a
different P, so the prefix-cache hit ratio collapses to 0 (observed in
the 6P+2D round-robin baseline) and every turn re-prefills from
scratch, piling extra load on the P pool.

Add an env-gated routing mode so the same proxy serves both arms of a
clean A/B:

    MB5_P_ROUTING=rr       round-robin (default, = upstream behavior)
    MB5_P_ROUTING=session  consistent md5 hash on X-Session-Id -> same
                           producer for all turns of a session

Decode side stays round-robin (load balance) in both modes — decode
KV is freshly transferred per turn, so D gains nothing from affinity
but everything from even load spreading.
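
A minimal sketch of the two routing modes described above, not the vendored proxy code. The class name `ProducerRouter`, the md5-mod-N mapping, and the fallback to round-robin when no session id is present are assumptions for illustration; only the MB5_P_ROUTING values and the use of md5 over X-Session-Id come from this change.

```python
import hashlib
import itertools
import os
from typing import Optional, Sequence


class ProducerRouter:
    """Hypothetical helper: pick a prefill (P) instance for each request."""

    def __init__(self, producers: Sequence[str], mode: Optional[str] = None):
        if not producers:
            raise ValueError("need at least one producer")
        self.producers = list(producers)
        # Default "rr" keeps the upstream round-robin behavior.
        self.mode = mode or os.environ.get("MB5_P_ROUTING", "rr")
        if self.mode not in ("rr", "session"):
            raise ValueError(f"unknown MB5_P_ROUTING value: {self.mode!r}")
        self._rr = itertools.cycle(range(len(self.producers)))

    def pick(self, session_id: Optional[str] = None) -> str:
        if self.mode == "session" and session_id:
            # md5 is used only as a stable, process-independent hash here.
            digest = hashlib.md5(session_id.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % len(self.producers)
            return self.producers[index]
        # Assumption: requests without a session id fall back to round-robin.
        return self.producers[next(self._rr)]
```

With this mapping a session keeps its producer only while the producer list is unchanged; adding or removing a P remaps most sessions, which is acceptable for a fixed-size A/B run.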
mb5_launch.sh threads MB5_P_ROUTING through to the proxy and logs the
active mode. Default path is byte-for-byte the old behavior, so an
in-flight round-robin sweep is unaffected if this is redeployed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three independent bugs were blocking PD-disagg smoke; each fix is
isolated so the next PD experiment doesn't re-hit them.
1. mb5_launch.sh
   - stop_all() also kills mb5_pd_proxy.py (our vendored copy),
     not just the upstream filename, and asserts ports 8000-8007 +
     PROXY_PORT are free before launching; stale proxies were
     silently passing the readiness check.
   - Proxy readiness uses a generic "any HTTP response" probe;
     mooncake_connector_proxy only exposes /v1/completions, so a
     404 from /v1/models is expected.
2. mb5_pd_proxy.py (vendored from third_party so deploy.sh ships it)
   - Force min_tokens=1 on the prefill leg. Clients that set
     min_tokens == max_tokens (our replayer does) collide with
     vLLM's min_tokens<=max_tokens check after the proxy caps
     max_tokens=1.
3. instrument_kv_snapshot.py
   - Adds a second patch target: initialize
     MooncakeConnectorWorker.bootstrap_server = None in __init__.
     vLLM 0.18.1 only sets it under the is_kv_producer branch, so
     kv_consumer hits AttributeError as soon as the first remote
     prefill request lands.
   - apply/revert refactored to iterate over (path, patches) pairs.
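
The min_tokens fix in item 2 can be sketched roughly as follows. This is an illustration under assumptions, not the vendored proxy code: the function name `build_prefill_request` and the use of a deep copy are hypothetical; only the max_tokens=1 cap and the forced min_tokens=1 come from this change.

```python
import copy
from typing import Any, Dict


def build_prefill_request(client_body: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: derive the prefill-leg request from the client's."""
    # Copy so the request later sent to the decode leg keeps the client's values.
    body = copy.deepcopy(client_body)
    # The prefill leg only needs to produce a single token.
    body["max_tokens"] = 1
    # Force min_tokens down as well, otherwise min_tokens > max_tokens
    # and the request is rejected by the min_tokens<=max_tokens check.
    body["min_tokens"] = 1
    return body
```

For example, a replayer request with `max_tokens=256, min_tokens=256` becomes `max_tokens=1, min_tokens=1` on the prefill leg while the original dict is left unchanged.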
plot_kv_pool_timeline.py also handles snapshot files that never
captured a running request (would otherwise IndexError on an empty
stackplot input).
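
One way to express the empty-input guard, as a sketch only. The function name `stackplot_inputs` and the snapshot layout (a list of timestamps plus a per-request mapping of usage series) are assumptions; the real snapshot format in plot_kv_pool_timeline.py is not shown here.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def stackplot_inputs(
    timestamps: Sequence[float],
    usage_by_request: Dict[str, Sequence[float]],
) -> Optional[Tuple[List[float], List[List[float]]]]:
    """Hypothetical helper: return (x, ys) for a stackplot, or None if empty."""
    # Drop requests that never recorded a sample.
    series = [list(values) for values in usage_by_request.values() if len(values) > 0]
    # No timestamps or no running request captured: nothing to stack.
    if len(timestamps) == 0 or not series:
        return None
    return list(timestamps), series
```

The caller would skip the stackplot call (or draw an empty axis) when the result is None instead of passing an empty series list through.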
Smoke: 4P+4D × 20 reqs → 20/20 success, mean 3.9s, p99 17s, 8 PIDs
all writing snapshots (601 total), well above the 8C baseline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>