agentic-kvc

gahow/agentic-kvc

Fork 0

Commit Graph

Author	SHA1	Message	Date
Gahow Wang	c6b7c3471b	B3: load_only + sticky policies, capped-trace builder, sweep driver Three additions land together because B3's whole point is comparing LMetric against meaningful controls. - scripts/cache_aware_proxy.py: two new --policy values. - load_only: pure min(num_requests) routing, no cache or affinity. The B3 control that strips locality so the LMetric-vs-load gap is legible. - sticky: first turn goes to min-load, subsequent turns ALWAYS return to the same instance, even under saturation. The B3 control that maxes out locality so the hot-spot cost is legible. - scripts/build_capped_trace.py: per-session turn cap (default 8). Generates the session-mass-equalized variant the TODO calls for so that hot-spot index can be re-measured with the heavy-tail removed. - scripts/b3_sweep.sh: orchestrates the 5-cell sweep. - GPU_INDICES makes it easy to skip a dead GPU. - EXTRA_VLLM_ARGS defaults to --enable-prompt-tokens-details so usage.prompt_tokens_details.cached_tokens is populated. vLLM 0.18.1 omits the field by default and breaks the reuse-decomp pipeline; the smoke run surfaced this. - Trap kills EngineCore by name in addition to "vllm serve" — the parent dies first but the child holds GPU memory. Was the root cause of the 89 GB ghost on GPU 0 earlier today. - Proxy readiness is a polling loop, not a fixed sleep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 17:54:24 +08:00

Author

SHA1

Message

Date

Gahow Wang

c6b7c3471b

B3: load_only + sticky policies, capped-trace builder, sweep driver

Three additions land together because B3's whole point is comparing
LMetric against meaningful controls.

- scripts/cache_aware_proxy.py: two new --policy values.
  - load_only: pure min(num_requests) routing, no cache or affinity.
    The B3 control that strips locality so the LMetric-vs-load gap is
    legible.
  - sticky: first turn goes to min-load, subsequent turns ALWAYS
    return to the same instance, even under saturation. The B3
    control that maxes out locality so the hot-spot cost is legible.
- scripts/build_capped_trace.py: per-session turn cap (default 8).
  Generates the session-mass-equalized variant the TODO calls for so
  that hot-spot index can be re-measured with the heavy-tail removed.
- scripts/b3_sweep.sh: orchestrates the 5-cell sweep.
  - GPU_INDICES makes it easy to skip a dead GPU.
  - EXTRA_VLLM_ARGS defaults to --enable-prompt-tokens-details so
    usage.prompt_tokens_details.cached_tokens is populated. vLLM
    0.18.1 omits the field by default and breaks the reuse-decomp
    pipeline; the smoke run surfaced this.
  - Trap kills EngineCore by name in addition to "vllm serve" — the
    parent dies first but the child holds GPU memory. Was the root
    cause of the 89 GB ghost on GPU 0 earlier today.
  - Proxy readiness is a polling loop, not a fixed sleep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 17:54:24 +08:00

1 Commits