Commit Graph

7 Commits

Author SHA1 Message Date
3fdcec9c0f Fix review P2s: lockfile, model path convention, trap robustness
- Regenerate uv.lock after adding fastapi/uvicorn deps so uv sync
  --locked no longer fails
- B3 scripts: default MODEL to $HOME/models/... matching documented
  convention and other launch scripts (repo has no models/ directory)
- launch_elastic_p2p: append || true to each trap command so set -e
  doesn't abort cleanup when jobs -p is empty and EngineCore orphans
  remain
2026-05-26 16:05:43 +08:00
645b067dd4 Fix review bugs: PD-sep counter leaks, hardcoded paths, missing deps
Critical:
- cache_aware_proxy: _handle_pd_sep leaked p_inst.num_requests (never
  decremented) and never managed d_inst.num_requests; fix media_type
  from application/json to text/event-stream for SSE stream

High:
- b3_sweep/b3_isolated_policy/b3_analyze: replace hardcoded
  /home/admin/cpfs/wjh/ ROOT with script-relative $(dirname "$0")/..
- b3_analyze: replace hardcoded 8-port WORKER_MAP with dynamic
  generation from BASE_PORT and N_INSTANCES

Medium:
- analyze_breakdown: warn on stderr when records are skipped (was silent)
- deploy_vllm_patches: fail-fast on SSH/SCP errors instead of
  continuing with empty VENV_SITE
- pyproject.toml: declare fastapi and uvicorn as runtime dependencies
- launch_elastic_p2p: kill EngineCore and proxy in trap handler to
  prevent GPU memory leaks on exit
2026-05-26 15:54:55 +08:00
0eb49dcc34 Fix NIXL multi-instance port conflict: per-instance SIDE_CHANNEL_PORT
NIXL's _nixl_handshake_listener (vllm/distributed/kv_transfer/
kv_connector/v1/nixl_connector.py:700) binds a ZMQ ROUTER socket on
the side_channel_port, which defaults to 5600. When 8 NIXL vLLMs
launch concurrently on the same host all 8 race for tcp://localhost:5600;
exactly one succeeds and the others silently hang in the listener
thread with:

    zmq.error.ZMQError: Address already in use (addr='tcp://localhost:5600')

The engines themselves never reach "Application startup complete"
and the b3_isolated_policy.sh health-check times out. First observed
when 7 of 8 inst_X.log files contained the ZMQ error and the 8th
(by random ordering) was the one healthy instance.

Fix: set VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600 + i)) per instance in
the NIXL launch branch. Each engine now gets a distinct handshake
port (5600..5607 by default). Verified: all 8 instances now reach
"Application startup complete" within the 360 s health budget.

This is NIXL-specific; Mooncake uses VLLM_MOONCAKE_BOOTSTRAP_PORT
which we were already varying per instance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 15:09:16 +08:00
151bf33541 Add unified_nixl_both policy: NIXL connector isolation control
Adds a NIXL-backed counterpart to unified_kv_both so we can attribute
the kv_both substrate overhead measured in the elastic_migration_v2
section to either Mooncake-specific code or a generic v1-connector
cost shared by all connectors.

- scripts/cache_aware_proxy.py: register --policy unified_nixl_both.
  Picker is identical to unified (and unified_kv_both); routing
  decisions never go through the PD-sep branch. Differs only at the
  vLLM launch layer.
- scripts/b3_isolated_policy.sh: new KV_CONNECTOR env var
  (Mooncake|Nixl), auto-set based on POLICY. NIXL launch path uses
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
  with no VLLM_MOONCAKE_BOOTSTRAP_PORT (NIXL uses UCX side-channels).
- Health-check timeout: 90 iterations * 2s -> 180 iterations * 2s
  (180s -> 360s). Empirically NIXL needs ~100-150s per instance to
  initialize the UCX agent and register KV cache memory; 8
  concurrent NIXL launches frequently overshoot the previous 180s
  budget. Mooncake is unaffected (still finishes well inside the new
  budget). The 8-vLLM unified_nixl_both first launch tripped the
  old timeout despite 7/8 instances reaching startup-complete.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 14:57:54 +08:00
4b833d33b7 unified_v2.1: relax gates + add unified_kv_both isolation control
v2.0 ran on B3 and triggered PD-sep only 2 / 1214 times (0.2%). The
gates were too conservative; the v2-vs-v1 latency gap (TTFT p90
7.35 -> 8.96 s) is therefore probably attributable to kv_both
always-on overhead, not to the PD-sep mechanism itself. v2.1 has two
fixes plus an isolation control.

Bug fix:
- The "chosen has live decodes worth protecting" gate combined
  num_requests and ongoing_decode_tokens with AND, falling through
  when EITHER was small. Under agentic workloads each worker rarely
  stacks more than 1-2 concurrent requests, so the gate killed 84%
  of v2.0 candidates that reached it. Replace with a pure
  ongoing_decode_tokens == 0 check ("chosen_no_active_decode") —
  same semantic, much higher recall.

Threshold relaxation (B2 microbench is the calibration source):
- pd_sep_min_new_tokens: 16000 -> 8000 (B2 TPOT idx 1.9x already
  at 8k, TTFT idx 12x — strictly worth migrating)
- pd_sep_min_decodes_protected: 2 -> 1
- pd_sep_min_src_cache_tokens: 8000 -> 4000
- pd_sep_min_extra_cache_tokens: 4000 -> 2000

Isolation control:
- New --policy unified_kv_both option. Uses the exact same picker as
  --policy unified but the vLLMs are launched in kv_role=kv_both
  (the same launch mode unified_v2 requires). PD-sep never fires.
  Compares against unified_v2 to attribute any v2 effect to the
  PD-sep branch alone, not the kv_both always-on overhead.
- Both unified_kv_both and unified_v2 auto-enable kv_both launch in
  b3_isolated_policy.sh.

Tests:
- Updated the existing "chosen has no decodes" test for the new
  gate name and semantic.
- All 24 proxy tests pass.

Refs: window_1_results/v2_breakdown analysis (88.7% of candidates
caught by old new_local_below_threshold; 84% of the remainder
caught by the old few_decodes gate).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 10:40:57 +08:00
19f69a9d2e unified_v2: selective per-request PD-sep via Mooncake (E3+E4)
Adds a sixth routing policy --policy unified_v2 that wraps the
existing unified hybrid picker with a selective PD-sep branch.
When all of the following hold, a request is split prefill-on-src,
decode-on-chosen via Mooncake kv_role=kv_both transfer:

  1. new_local = input_length - chosen.cache_hit > 16k
     (B2 microbench shows same-worker TTFT idx >= 3x from this size up)
  2. chosen has live decodes worth protecting (>= 2 in-flight)
  3. some other instance holds materially more cache for this prefix
     (>= 8k tokens, and >= 4k more than chosen)
  4. cost(src_interference + RDMA xfer) + 0.2s margin < cost(chosen_interference)

The cost model is the audit-blessed shape from E1's post-mortem:
- gate on new_tokens (post-cache), NOT input_length (the old PUSH gate)
- bind to a single transfer mechanism (kv_both peer-to-peer pull)
- realistic RDMA cost as a function of bytes: 0.3s base +
  bytes / 2.7 GB/s (calibrated against contention_16s_elastic p50)
- both source and target decode counts considered

E2 mechanism-level patches not yet applied (this commit is policy-only).
Patches 6.2 / 6.3 / 6.5 remain on the table. Patch 6.6 (per-request
xfer timeout, 60s default) is implemented on the proxy side as an
httpx per-chunk read timeout on the dst streaming call, so a stuck
KV transfer fails the request instead of hanging for 600s.

cache_aware_proxy.py:
- Settings: kv_bytes_per_token, prefill_throughput_kv_both,
  rdma_base_overhead_s, rdma_effective_gb_per_s, pd_sep_* gating knobs
- estimate_transfer_cost(bytes) replaces the constant rdma_overhead_s
- estimate_same_worker_interference_s(new_tokens, num_decodes) reads off
  the B2 penalty curve in 4 bins
- pick_instance_unified_v2: inherits unified, returns extra
  (src_inst, src_idx) tuple when PD-sep wins the cost compare
- _handle_combined_pd_sep_v2: prefill on src (do_remote_decode=True,
  max_tokens=1), Mooncake xfer, decode-stream on dst with httpx
  Timeout(read=pd_sep_xfer_timeout_s)
- --policy unified_v2 added to argparse choices
- lifespan auto-runs init_prefill_bootstrap when policy is unified_v2

b3_isolated_policy.sh:
- ENABLE_KV_BOTH env var, auto-set when POLICY=unified_v2, threads
  kv_role=kv_both + VLLM_MOONCAKE_BOOTSTRAP_PORT to vllm and
  --bootstrap-ports to the proxy

Tests: 8 new unit tests cover the gating predicates and the cost
estimators; all 32 proxy tests still pass.

Refs: E1 (PUSH post-mortem) + E2 (Mooncake audit) reports.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 09:25:45 +08:00
1d87082ca1 B3: cold-start isolated policy runner (clean APC per cell)
scripts/b3_isolated_policy.sh wraps one policy run in a fresh
8-instance vLLM lifecycle: hard reset -> launch -> health -> proxy
-> replayer -> snapshot artifacts -> cleanup. Used when cross-
policy APC contamination matters more than the ~25-min vLLM
warmup overhead per policy.

Counterpart to the existing b3_sweep.sh which keeps vLLM warm
across all policies (faster but warm-cache; we found via the
sticky pre-flight that contamination is < 1% on this trace, so
b3_sweep.sh stays the default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 20:33:44 +08:00