e8980ce95753eafa768a9b0df9026caf747e1d2a
The upstream mooncake_connector_proxy round-robins both P and D
selection. For agentic multi-turn sessions this destroys prefix-cache
reuse on the producer side — every turn of a session lands on a
different P, so the prefix-cache hit ratio collapses to 0 (observed in
the 6P+2D round-robin baseline) and every turn re-prefills from
scratch, piling extra load on the P pool.
Add an env-gated routing mode so the same proxy serves both arms of a
clean A/B:
MB5_P_ROUTING=rr round-robin (default, = upstream behavior)
MB5_P_ROUTING=session consistent md5 hash on X-Session-Id -> same
producer for all turns of a session
Decode side stays round-robin (load balance) in both modes — decode
KV is freshly transferred per turn, so D gains nothing from affinity
but everything from even load spreading.
mb5_launch.sh threads MB5_P_ROUTING through to the proxy and logs the
active mode. Default path is byte-for-byte the old behavior, so an
in-flight round-robin sweep is unaffected if this is redeployed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Description
No description provided
Languages
Python
82.9%
Shell
17.1%