agentic-kvc

gahow/agentic-kvc


Commit Graph


Author

SHA1

Message

Date

Gahow Wang

ee5db0b321

MB5 driver updates: PD-proxy + snapshot instrument + launcher tweaks

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-29 11:53:27 +08:00

Gahow Wang

e8980ce957

MB5 proxy: session-affinity P routing (MB5_P_ROUTING=session)

The upstream mooncake_connector_proxy round-robins both P and D
selection. For agentic multi-turn sessions this destroys prefix-cache
reuse on the producer side — every turn of a session lands on a
different P, so the prefix-cache hit ratio collapses to 0 (observed in
the 6P+2D round-robin baseline) and every turn re-prefills from
scratch, piling extra load on the P pool.

Add an env-gated routing mode so the same proxy serves both arms of a
clean A/B:
  MB5_P_ROUTING=rr       round-robin (default, = upstream behavior)
  MB5_P_ROUTING=session  consistent md5 hash on X-Session-Id -> same
                         producer for all turns of a session

Decode side stays round-robin (load balance) in both modes — decode
KV is freshly transferred per turn, so D gains nothing from affinity
but everything from even load spreading.

mb5_launch.sh threads MB5_P_ROUTING through to the proxy and logs the
active mode. Default path is byte-for-byte the old behavior, so an
in-flight round-robin sweep is unaffected if this is redeployed.
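The routing modes described above can be sketched as follows. This is a minimal illustration, not the code in mb5_pd_proxy.py: the `ProducerRouter` class, its method names, and `router_from_env` are hypothetical, the hash is reduced with a plain modulo (an assumption; the commit only says "consistent md5 hash"), and falling back to round-robin when no session ID is present is also an assumption.

```python
import hashlib
import itertools
import os


class ProducerRouter:
    """Hypothetical sketch of env-gated P selection (rr vs. session)."""

    def __init__(self, producers, mode="rr"):
        if not producers:
            raise ValueError("at least one producer is required")
        if mode not in ("rr", "session"):
            raise ValueError(f"unknown routing mode: {mode!r}")
        self.producers = list(producers)
        self.mode = mode
        # Endless 0, 1, ..., n-1, 0, 1, ... index stream for round-robin.
        self._rr = itertools.cycle(range(len(self.producers)))

    def pick(self, session_id=None):
        """Return the producer that should serve this request."""
        if self.mode == "session" and session_id:
            # md5 is used only as a stable, well-mixed hash, not for security.
            digest = hashlib.md5(session_id.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % len(self.producers)
            return self.producers[index]
        # "rr" mode, or (assumption) a request without a session ID.
        return self.producers[next(self._rr)]


def router_from_env(producers):
    """Build a router from MB5_P_ROUTING, defaulting to round-robin."""
    return ProducerRouter(producers, mode=os.environ.get("MB5_P_ROUTING", "rr"))
```

With `mode="session"`, repeated calls to `pick()` with the same session ID return the same producer for as long as the producer list is unchanged, which is what lets later turns of a session hit that producer's prefix cache. With a modulo reduction, adding or removing a producer remaps most sessions; a hash ring would limit that, at the cost of more code.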

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 11:05:25 +08:00

Gahow Wang

a9c7310f4a

MB5 PD-disagg pipeline: working end-to-end

Three independent bugs were blocking PD-disagg smoke; each fix is
isolated so the next PD experiment doesn't re-hit them.

1. mb5_launch.sh
   - stop_all() also kills mb5_pd_proxy.py (our vendored copy),
     not just the upstream filename, and asserts ports 8000-8007 +
     PROXY_PORT are free before launching — stale proxies were
     silently passing the readiness check.
   - Proxy readiness uses a generic "any HTTP response" probe;
     mooncake_connector_proxy only exposes /v1/completions so
     /v1/models 404 is expected.

2. mb5_pd_proxy.py (vendored from third_party so deploy.sh ships it)
   - Force min_tokens=1 on the prefill leg. Clients that set
     min_tokens == max_tokens (our replayer does) collide with
     vLLM's min_tokens<=max_tokens check after the proxy caps
     max_tokens=1.

3. instrument_kv_snapshot.py
   - Adds a second patch target: initialize
     MooncakeConnectorWorker.bootstrap_server = None in __init__.
     vLLM 0.18.1 only sets it under the is_kv_producer branch, so
     kv_consumer hits AttributeError as soon as the first remote
     prefill request lands.
   - apply/revert refactored to iterate over (path, patches) pairs.
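The min_tokens fix from item 2 can be illustrated with a small sketch. The function name `build_prefill_request` and the dict-based request shape are hypothetical, chosen for illustration; only the two field names and the values forced on the prefill leg come from the commit message.

```python
import copy


def build_prefill_request(client_request):
    """Return a copy of the client's request adjusted for the prefill leg.

    The prefill leg only needs to populate KV, so max_tokens is capped at 1.
    A client that sent min_tokens == max_tokens (for example 128 and 128)
    would then carry min_tokens=128 with max_tokens=1 and fail the
    min_tokens <= max_tokens check, so min_tokens is forced to 1 as well.
    """
    prefill = copy.deepcopy(client_request)  # leave the decode-leg request untouched
    prefill["max_tokens"] = 1
    prefill["min_tokens"] = 1
    return prefill
```

For a client request of `{"prompt": "hi", "max_tokens": 128, "min_tokens": 128}`, the returned dict carries `max_tokens=1` and `min_tokens=1`, while the original dict keeps both values at 128 for the decode leg.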

plot_kv_pool_timeline.py also handles snapshot files that never
captured a running request (would otherwise IndexError on an empty
stackplot input).
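A guard of this kind might look like the sketch below. The snapshot layout is an assumption made for illustration (a list of dicts with a timestamp under "t" and a "running" mapping of request ID to block count); the real schema written by instrument_kv_snapshot.py is not shown in this commit, and `running_series` is a hypothetical helper name.

```python
def running_series(snapshots):
    """Build stackplot inputs, or return None if no request was ever running.

    Returns (timestamps, rows), where rows holds one list per request ID,
    aligned with timestamps. Returning None lets the caller skip the
    stacked plot instead of passing it an empty series list.
    """
    request_ids = sorted(
        {rid for snap in snapshots for rid in snap.get("running", {})}
    )
    if not request_ids:
        return None
    timestamps = [snap["t"] for snap in snapshots]
    rows = [
        [snap.get("running", {}).get(rid, 0) for snap in snapshots]
        for rid in request_ids
    ]
    return timestamps, rows
```

A caller would check for `None` before plotting, for example by drawing an empty axis with a "no running requests captured" note rather than calling the stackplot routine.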

Smoke: 4P+4D × 20 reqs → 20/20 success, mean 3.9s, p99 17s, 8 PIDs
all writing snapshots (601 total), well above the 8C baseline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 00:14:22 +08:00

3 Commits