3 Commits

Author SHA1 Message Date
ee5db0b321 MB5 driver updates: PD-proxy + snapshot instrument + launcher tweaks
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:27 +08:00
e8980ce957 MB5 proxy: session-affinity P routing (MB5_P_ROUTING=session)
The upstream mooncake_connector_proxy round-robins both P and D
selection. For agentic multi-turn sessions this destroys prefix-cache
reuse on the producer side — every turn of a session lands on a
different P, so the prefix-cache hit ratio collapses to 0 (observed in
the 6P+2D round-robin baseline) and every turn re-prefills from
scratch, piling extra load on the P pool.

Add an env-gated routing mode so the same proxy serves both arms of a
clean A/B:
  MB5_P_ROUTING=rr       round-robin (default, = upstream behavior)
  MB5_P_ROUTING=session  consistent md5 hash on X-Session-Id -> same
                         producer for all turns of a session

Decode side stays round-robin (load balance) in both modes — decode
KV is freshly transferred per turn, so D gains nothing from affinity
but everything from even load spreading.
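
A minimal sketch of the two modes above, not the code in
mb5_pd_proxy.py. The function names, the plain-dict header lookup, and
the fallback to round-robin when X-Session-Id is missing are
assumptions for illustration. The mapping shown is a simple
hash-modulo, which is stable only while the P pool size is fixed.

```python
import hashlib
import itertools
import os


def make_p_selector(producers, mode=None):
    """Return a function that maps request headers to one producer.

    Hypothetical helper: mode defaults to the MB5_P_ROUTING env var,
    falling back to "rr" (round-robin) when it is unset.
    """
    mode = mode or os.environ.get("MB5_P_ROUTING", "rr")
    rr = itertools.cycle(producers)

    def select(headers):
        session_id = headers.get("X-Session-Id")
        if mode == "session" and session_id:
            # md5 is used only as a stable, well-spread hash here,
            # not for any security purpose.
            digest = hashlib.md5(session_id.encode("utf-8")).digest()
            idx = int.from_bytes(digest[:8], "big") % len(producers)
            return producers[idx]
        # Default path (and assumed fallback): plain round-robin.
        return next(rr)

    return select


def make_d_selector(decoders):
    """Decode side: round-robin in both modes."""
    rr = itertools.cycle(decoders)
    return lambda: next(rr)
```

With mode="session", every call carrying the same X-Session-Id returns
the same producer; with mode="rr", the header is ignored.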

mb5_launch.sh threads MB5_P_ROUTING through to the proxy and logs the
active mode. Default path is byte-for-byte the old behavior, so an
in-flight round-robin sweep is unaffected if this is redeployed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 11:05:25 +08:00
a9c7310f4a MB5 PD-disagg pipeline: working end-to-end
Three independent bugs were blocking PD-disagg smoke; each fix is
isolated so the next PD experiment doesn't re-hit them.

1. mb5_launch.sh
   - stop_all() also kills mb5_pd_proxy.py (our vendored copy),
     not just the upstream filename, and asserts ports 8000-8007 +
     PROXY_PORT are free before launching — stale proxies were
     silently passing the readiness check.
   - Proxy readiness uses a generic "any HTTP response" probe;
     mooncake_connector_proxy only exposes /v1/completions, so a
     404 from /v1/models is expected and still counts as ready.

2. mb5_pd_proxy.py (vendored from third_party so deploy.sh ships it)
   - Force min_tokens=1 on the prefill leg. Clients that set
     min_tokens == max_tokens (our replayer does) otherwise fail
     vLLM's min_tokens <= max_tokens check once the proxy caps
     max_tokens to 1.

3. instrument_kv_snapshot.py
   - Adds a second patch target: initialize
     MooncakeConnectorWorker.bootstrap_server = None in __init__.
     vLLM 0.18.1 only sets it under the is_kv_producer branch, so
     kv_consumer hits AttributeError as soon as the first remote
     prefill request lands.
   - apply/revert refactored to iterate over (path, patches) pairs.
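
The min_tokens fix in item 2 can be sketched as below. This is an
illustration under assumptions, not the vendored proxy code: the
function name and the idea that the proxy works on a plain dict
request body are hypothetical.

```python
def build_prefill_request(client_body):
    """Return a copy of the client request body for the prefill leg.

    Hypothetical helper. The prefill leg only needs one token, so
    max_tokens is capped to 1; min_tokens is forced to 1 as well so
    that a client-supplied min_tokens (for example 64) cannot exceed
    the capped max_tokens.
    """
    body = dict(client_body)  # shallow copy; the decode leg keeps the original
    body["max_tokens"] = 1
    body["min_tokens"] = 1
    return body
```

The original body is left untouched, so the decode leg still sees the
client's own min_tokens and max_tokens.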

plot_kv_pool_timeline.py also handles snapshot files that never
captured a running request (would otherwise IndexError on an empty
stackplot input).
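
A sketch of that kind of guard, assuming a hypothetical snapshot row
shape ({"t": timestamp, "blocks": {request_id: block_count}}) that is
not taken from plot_kv_pool_timeline.py:

```python
def stack_series(rows):
    """Build (xs, ys) for a stacked plot, or None if there is nothing to stack.

    rows is a hypothetical list of snapshot dicts. Returning None lets
    the caller skip the stacked plot instead of passing it empty input.
    """
    req_ids = sorted({rid for row in rows for rid in row["blocks"]})
    if not req_ids:
        # No running request was ever captured in this snapshot file.
        return None
    xs = [row["t"] for row in rows]
    # One series per request; 0 where the request is absent from a row.
    ys = [[row["blocks"].get(rid, 0) for row in rows] for rid in req_ids]
    return xs, ys
```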

Smoke: 4P+4D × 20 reqs → 20/20 success, mean 3.9s, p99 17s, 8 PIDs
all writing snapshots (601 total), well above the 8C baseline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:14:22 +08:00