10 Commits

Author SHA1 Message Date
68f21bef23 bench harness: env-tunable vLLM health timeout + both-modes 5-policy driver
- b3_isolated_policy.sh: HEALTH_MAX_TRIES is now env-overridable (default
  stays at 180 tries x 2 s = 360 s); slow-node launches can pass
  HEALTH_MAX_TRIES=300 (600 s) to ride out a single-instance startup flake
  without aborting the whole arm.
- run_5policy_both_modes.sh: runs run_5policy_600s.sh twice on the SAME ttp
  trace with REPLAY_DISPATCH_MODE={tracets,thinktime}, so the only variable is
  dispatch mode. Outputs to outputs/policy5_600s_{mode}_<date>/.
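
Illustration only (the real logic lives in the shell script): a minimal Python
sketch of how an env-tunable try count maps to a wall-clock budget, assuming
the 2 s poll interval implied by the 180-tries/360 s figures. The helper name
health_budget_s is hypothetical.

```python
import os

POLL_INTERVAL_S = 2        # assumed: seconds between health probes
DEFAULT_MAX_TRIES = 180    # default number of probes


def health_budget_s(env=None):
    """Return the total health-check budget in seconds."""
    env = os.environ if env is None else env
    tries = int(env.get("HEALTH_MAX_TRIES", DEFAULT_MAX_TRIES))
    return tries * POLL_INTERVAL_S


print(health_budget_s({}))                           # 360
print(health_budget_s({"HEALTH_MAX_TRIES": "300"}))  # 600
```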

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 20:59:02 +08:00
160c29133d Unified bench report: mean+TPS+per-worker GPU util, auto-captured
scripts/bench_report.py is now the canonical analyzer: per run + per input-
class it emits TTFT/TPOT/E2E mean+p50+p90+p99, decode/prefill TPS (aggregate
and per-worker), APC, per-worker GPU util mean/max, and load-spread ratios.
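
For readers unfamiliar with the summary columns, a rough sketch of a
mean+p50+p90+p99 reduction over one latency series. The nearest-rank
percentile rule and the function names are assumptions for illustration;
bench_report.py may use a different interpolation.

```python
from statistics import mean


def percentile(sorted_vals, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    n = len(sorted_vals)
    # Integer ceil(q * n / 100) avoids floating-point rounding surprises.
    rank = max(1, (q * n + 99) // 100)
    return sorted_vals[rank - 1]


def summarize(samples):
    """Reduce one metric series (e.g. TTFT in seconds) to the report columns."""
    vals = sorted(samples)
    return {
        "mean": mean(vals),
        "p50": percentile(vals, 50),
        "p90": percentile(vals, 90),
        "p99": percentile(vals, 99),
    }
```

The same reduction would be applied per run and per input class, once for
each of TTFT, TPOT and E2E.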

b3_isolated_policy.sh auto-captures the analyzer inputs for every run:
gpu_util.csv (via gpu_monitor.sh, sampled every 5 s, replay window only) and
bench_config.json (worker->GPU map); teardown stops the sampler. Future runs
populate per-worker GPU util automatically.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 16:08:22 +08:00
f739f7d461 Proxy/runner support for Nixl connector + unified_v3 (offload-decode) policy
scripts/b3_isolated_policy.sh:
  Recognize unified_v3 as a kv_both-requiring policy; respect explicit
  KV_CONNECTOR=Nixl override (so unified_v2 / unified_v3 / unified_kv_both
  can run against either Mooncake or Nixl back-end). When Nixl is
  selected, skip the bootstrap-ports plumbing — Nixl uses its own UCX
  side-channel and the proxy forwards kv_transfer_params from the src
  response body instead of pre-baking engine_id/bootstrap_addr.

scripts/cache_aware_proxy.py:
  - New unified_v3 policy (~250 lines): prefill stays on session-affinity
    host (preserves intra-session prefix-cache reuse), decode is migrated
    to a lower-load target when the affinity host is busy with concurrent
    decodes. KV transfer flows prefill_host → decode_target, opposite of
    v2. Knobs: v3_min_new_tokens, v3_min_prefill_decode_busy,
    v3_target_load_ratio, v3_min_load_gap, v3_rotate_affinity,
    v3_prefer_cache_target. cache_miss_audit found that rotation hurts
    cross-turn locality (9.5% hit rate with rotation vs ~80% without), so
    the default is v3_rotate_affinity=False.
  - New connector_type setting ("mooncake" | "nixl") gating the PD-sep
    handshake form: mooncake uses pre-baked kv_transfer_params,
    nixl forwards them from the response body.
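
A minimal sketch of the connector_type switch described above, assuming the
prefill (src) response body exposes a kv_transfer_params object. The function
name decode_kv_params and the dict shapes are hypothetical, not the proxy's
actual code.

```python
def decode_kv_params(connector_type, prebaked, src_response_body):
    """Pick the kv_transfer_params to attach to the decode request."""
    if connector_type == "mooncake":
        # Mooncake: engine_id / bootstrap_addr are known before prefill runs.
        return dict(prebaked)
    if connector_type == "nixl":
        # Nixl: forward whatever the prefill response returned, unchanged.
        return dict(src_response_body["kv_transfer_params"])
    raise ValueError(f"unknown connector_type: {connector_type!r}")
```

Raising on an unknown value keeps a misconfigured proxy from silently
falling back to the wrong handshake form.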

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:05:19 +08:00
3fdcec9c0f Fix review P2s: lockfile, model path convention, trap robustness
- Regenerate uv.lock after adding fastapi/uvicorn deps so uv sync
  --locked no longer fails
- B3 scripts: default MODEL to $HOME/models/... matching documented
  convention and other launch scripts (repo has no models/ directory)
- launch_elastic_p2p: append || true to each trap command so set -e
  doesn't abort cleanup when jobs -p is empty and EngineCore orphans
  remain
2026-05-26 16:05:43 +08:00
645b067dd4 Fix review bugs: PD-sep counter leaks, hardcoded paths, missing deps
Critical:
- cache_aware_proxy: _handle_pd_sep leaked p_inst.num_requests (never
  decremented) and never tracked d_inst.num_requests at all. Also change
  media_type from application/json to text/event-stream, since the response
  is an SSE stream
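
One common way to make such counters leak-proof is to pair the increment and
decrement in a context manager. This is a sketch of that pattern, not the
actual patch; Instance and in_flight are hypothetical names.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Instance:
    num_requests: int = 0


@contextmanager
def in_flight(*instances):
    """Count a request against every instance for the duration of the block."""
    for inst in instances:
        inst.num_requests += 1
    try:
        yield
    finally:
        # Runs on success, exception, and cancellation alike.
        for inst in instances:
            inst.num_requests -= 1
```

For a streamed response the block has to wrap the whole stream, so the
decrement happens only once the last chunk has been sent.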

High:
- b3_sweep/b3_isolated_policy/b3_analyze: replace hardcoded
  /home/admin/cpfs/wjh/ ROOT with script-relative $(dirname "$0")/..
- b3_analyze: replace hardcoded 8-port WORKER_MAP with dynamic
  generation from BASE_PORT and N_INSTANCES
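
A sketch of what generating the map from BASE_PORT and N_INSTANCES could look
like, assuming consecutive ports starting at BASE_PORT. The worker_<i> labels
and the function name are hypothetical.

```python
def build_worker_map(base_port, n_instances):
    """Map each instance port to a worker label, one port per instance."""
    return {base_port + i: f"worker_{i}" for i in range(n_instances)}


print(build_worker_map(8000, 3))
# {8000: 'worker_0', 8001: 'worker_1', 8002: 'worker_2'}
```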

Medium:
- analyze_breakdown: warn on stderr when records are skipped (was silent)
- deploy_vllm_patches: fail-fast on SSH/SCP errors instead of
  continuing with empty VENV_SITE
- pyproject.toml: declare fastapi and uvicorn as runtime dependencies
- launch_elastic_p2p: kill EngineCore and proxy in trap handler to
  prevent GPU memory leaks on exit
2026-05-26 15:54:55 +08:00
0eb49dcc34 Fix NIXL multi-instance port conflict: per-instance SIDE_CHANNEL_PORT
NIXL's _nixl_handshake_listener (vllm/distributed/kv_transfer/
kv_connector/v1/nixl_connector.py:700) binds a ZMQ ROUTER socket on
the side_channel_port, which defaults to 5600. When 8 NIXL vLLMs
launch concurrently on the same host, all 8 race for tcp://localhost:5600;
exactly one succeeds and the others hang in the listener thread (without
exiting) after logging:

    zmq.error.ZMQError: Address already in use (addr='tcp://localhost:5600')

The engines that lose the race never reach "Application startup complete",
so the b3_isolated_policy.sh health-check times out. First observed
when 7 of 8 inst_X.log files contained the ZMQ error and the 8th
(by random ordering) was the one healthy instance.

Fix: set VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600 + i)) per instance in
the NIXL launch branch. Each engine now gets a distinct handshake
port (5600..5607 by default). Verified: all 8 instances now reach
"Application startup complete" within the 360 s health budget.
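
The port assignment, restated as a small Python sketch (the launcher itself
is a shell script; nixl_side_channel_env is a hypothetical helper name):

```python
def nixl_side_channel_env(n_instances, base_port=5600):
    """Return one env overlay per instance, each with a distinct handshake port."""
    return [
        {"VLLM_NIXL_SIDE_CHANNEL_PORT": str(base_port + i)}
        for i in range(n_instances)
    ]


for i, env in enumerate(nixl_side_channel_env(8)):
    print(i, env["VLLM_NIXL_SIDE_CHANNEL_PORT"])  # 5600 .. 5607
```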

This is NIXL-specific; Mooncake uses VLLM_MOONCAKE_BOOTSTRAP_PORT
which we were already varying per instance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 15:09:16 +08:00
151bf33541 Add unified_nixl_both policy: NIXL connector isolation control
Adds a NIXL-backed counterpart to unified_kv_both so we can attribute
the kv_both substrate overhead measured in the elastic_migration_v2
section to either Mooncake-specific code or a generic v1-connector
cost shared by all connectors.

- scripts/cache_aware_proxy.py: register --policy unified_nixl_both.
  Picker is identical to unified (and unified_kv_both); routing
  decisions never go through the PD-sep branch. Differs only at the
  vLLM launch layer.
- scripts/b3_isolated_policy.sh: new KV_CONNECTOR env var
  (Mooncake|Nixl), auto-set based on POLICY. NIXL launch path uses
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
  with no VLLM_MOONCAKE_BOOTSTRAP_PORT (NIXL uses UCX side-channels).
- Health-check timeout: 90 iterations * 2s -> 180 iterations * 2s
  (180s -> 360s). Empirically NIXL needs ~100-150s per instance to
  initialize the UCX agent and register KV cache memory; 8
  concurrent NIXL launches frequently overshoot the previous 180s
  budget. Mooncake is unaffected (still finishes well inside the new
  budget). The 8-vLLM unified_nixl_both first launch tripped the
  old timeout despite 7/8 instances reaching startup-complete.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 14:57:54 +08:00
4b833d33b7 unified_v2.1: relax gates + add unified_kv_both isolation control
v2.0 ran on B3 and triggered PD-sep only 2 / 1214 times (0.2%). The
gates were too conservative; the v2-vs-v1 latency gap (TTFT p90
7.35 -> 8.96 s) is therefore probably attributable to kv_both
always-on overhead, not to the PD-sep mechanism itself. v2.1 has two
fixes plus an isolation control.

Bug fix:
- The "chosen has live decodes worth protecting" gate combined
  num_requests and ongoing_decode_tokens with AND, falling through
  when EITHER was small. Under agentic workloads each worker rarely
  stacks more than 1-2 concurrent requests, so the gate killed 84%
  of v2.0 candidates that reached it. Replace with a pure
  ongoing_decode_tokens == 0 check ("chosen_no_active_decode"):
  same semantics, much higher recall.
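
A toy comparison of the two gate shapes. The request threshold of 2 matches
the v2.0 ">= 2 in-flight" rule; min_decode_tokens and both function names are
hypothetical stand-ins, since the exact old token threshold is not restated
here.

```python
def old_gate_passes(num_requests, ongoing_decode_tokens,
                    min_requests=2, min_decode_tokens=1):
    """v2.0 shape: both signals must be large enough, so either one can veto."""
    return (num_requests >= min_requests
            and ongoing_decode_tokens >= min_decode_tokens)


def new_gate_passes(ongoing_decode_tokens):
    """v2.1 shape: reject only when nothing is decoding on the chosen worker."""
    return ongoing_decode_tokens > 0


# A typical agentic case: one request, actively decoding.
print(old_gate_passes(num_requests=1, ongoing_decode_tokens=500))  # False
print(new_gate_passes(ongoing_decode_tokens=500))                  # True
```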

Threshold relaxation (B2 microbench is the calibration source):
- pd_sep_min_new_tokens: 16000 -> 8000 (B2 TPOT idx 1.9x already
  at 8k, TTFT idx 12x — strictly worth migrating)
- pd_sep_min_decodes_protected: 2 -> 1
- pd_sep_min_src_cache_tokens: 8000 -> 4000
- pd_sep_min_extra_cache_tokens: 4000 -> 2000

Isolation control:
- New --policy unified_kv_both option. Uses the exact same picker as
  --policy unified but the vLLMs are launched in kv_role=kv_both
  (the same launch mode unified_v2 requires). PD-sep never fires.
  Compares against unified_v2 to attribute any v2 effect to the
  PD-sep branch alone, not the kv_both always-on overhead.
- Both unified_kv_both and unified_v2 auto-enable kv_both launch in
  b3_isolated_policy.sh.

Tests:
- Updated the existing "chosen has no decodes" test for the new
  gate name and semantics.
- All 24 proxy tests pass.

Refs: window_1_results/v2_breakdown analysis (88.7% of candidates
caught by old new_local_below_threshold; 84% of the remainder
caught by the old few_decodes gate).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 10:40:57 +08:00
19f69a9d2e unified_v2: selective per-request PD-sep via Mooncake (E3+E4)
Adds a sixth routing policy --policy unified_v2 that wraps the
existing unified hybrid picker with a selective PD-sep branch.
When all of the following hold, a request is split prefill-on-src,
decode-on-chosen via Mooncake kv_role=kv_both transfer:

  1. new_local = input_length - chosen.cache_hit, and new_local > 16k
     (B2 microbench shows same-worker TTFT idx >= 3x from this size up)
  2. chosen has live decodes worth protecting (>= 2 in-flight)
  3. some other instance holds materially more cache for this prefix
     (>= 8k tokens, and >= 4k more than chosen)
  4. cost(src_interference + RDMA xfer) + 0.2s margin < cost(chosen_interference)

The cost model is the audit-blessed shape from E1's post-mortem:
- gate on new_tokens (post-cache), NOT input_length (the old PUSH gate)
- bind to a single transfer mechanism (kv_both peer-to-peer pull)
- realistic RDMA cost as a function of bytes: 0.3s base +
  bytes / 2.7 GB/s (calibrated against contention_16s_elastic p50)
- both source and target decode counts considered
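
A compact sketch of the four gates and the transfer-cost term, using the
v2.0 numbers quoted above. Assumptions: 1 GB = 1e9 bytes, and the two
interference costs are passed in rather than read off the B2 curve.
pd_sep_wins and its parameter names are illustrative, not the proxy's API.

```python
def estimate_transfer_cost(num_bytes, base_s=0.3, gb_per_s=2.7):
    """RDMA cost in seconds: fixed overhead plus bytes over effective bandwidth."""
    return base_s + num_bytes / (gb_per_s * 1e9)


def pd_sep_wins(new_local, chosen_decodes, src_cache, chosen_cache,
                src_interference_s, chosen_interference_s, xfer_bytes,
                margin_s=0.2):
    """Return True only when all four v2.0 conditions hold."""
    if new_local <= 16_000:                 # 1. enough uncached prefill work
        return False
    if chosen_decodes < 2:                  # 2. decodes worth protecting
        return False
    if src_cache < 8_000 or src_cache - chosen_cache < 4_000:
        return False                        # 3. src holds materially more cache
    split_cost = src_interference_s + estimate_transfer_cost(xfer_bytes)
    return split_cost + margin_s < chosen_interference_s   # 4. cost compare
```

Keeping the cost compare last means the cheap integer checks reject most
candidates before any cost estimate is computed.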

E2 mechanism-level patches not yet applied (this commit is policy-only).
Patches 6.2 / 6.3 / 6.5 remain on the table. Patch 6.6 (per-request
xfer timeout, 60s default) is implemented on the proxy side as an
httpx per-chunk read timeout on the dst streaming call, so a stuck
KV transfer fails the request instead of hanging for 600s.

cache_aware_proxy.py:
- Settings: kv_bytes_per_token, prefill_throughput_kv_both,
  rdma_base_overhead_s, rdma_effective_gb_per_s, pd_sep_* gating knobs
- estimate_transfer_cost(bytes) replaces the constant rdma_overhead_s
- estimate_same_worker_interference_s(new_tokens, num_decodes) reads off
  the B2 penalty curve in 4 bins
- pick_instance_unified_v2: inherits unified, returns extra
  (src_inst, src_idx) tuple when PD-sep wins the cost compare
- _handle_combined_pd_sep_v2: prefill on src (do_remote_decode=True,
  max_tokens=1), Mooncake xfer, decode-stream on dst with httpx
  Timeout(read=pd_sep_xfer_timeout_s)
- --policy unified_v2 added to argparse choices
- lifespan auto-runs init_prefill_bootstrap when policy is unified_v2

b3_isolated_policy.sh:
- ENABLE_KV_BOTH env var, auto-set when POLICY=unified_v2, threads
  kv_role=kv_both + VLLM_MOONCAKE_BOOTSTRAP_PORT to vllm and
  --bootstrap-ports to the proxy

Tests: 8 new unit tests cover the gating predicates and the cost
estimators; all 32 proxy tests still pass.

Refs: E1 (PUSH post-mortem) + E2 (Mooncake audit) reports.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 09:25:45 +08:00
1d87082ca1 B3: cold-start isolated policy runner (clean APC per cell)
scripts/b3_isolated_policy.sh wraps one policy run in a fresh
8-instance vLLM lifecycle: hard reset -> launch -> health -> proxy
-> replayer -> snapshot artifacts -> cleanup. Used when cross-
policy APC contamination matters more than the ~25-min vLLM
warmup overhead per policy.

Counterpart to the existing b3_sweep.sh which keeps vLLM warm
across all policies (faster but warm-cache; we found via the
sticky pre-flight that contamination is < 1% on this trace, so
b3_sweep.sh stays the default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 20:33:44 +08:00