agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	68f21bef23	bench harness: env-tunable vLLM health timeout + both-modes 5-policy driver - b3_isolated_policy.sh: HEALTH_MAX_TRIES now env-overridable (default 180 -> 360s unchanged); slow-node launches can pass HEALTH_MAX_TRIES=300 (600s) to ride out a single-instance startup flake without aborting the whole arm. - run_5policy_both_modes.sh: runs run_5policy_600s.sh twice on the SAME ttp trace with REPLAY_DISPATCH_MODE={tracets,thinktime}, so the only variable is dispatch mode. Outputs to outputs/policy5_600s_{mode}_<date>/. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 20:59:02 +08:00
Gahow Wang	160c29133d	Unified bench report: mean+TPS+per-worker GPU util, auto-captured scripts/bench_report.py is now the canonical analyzer: per run + per input-class it emits TTFT/TPOT/E2E mean+p50+p90+p99, decode/prefill TPS (aggregate and per-worker), APC, per-worker GPU util mean/max, and load-spread ratios. b3_isolated_policy.sh auto-captures the inputs for every run: gpu_util.csv (via gpu_monitor.sh, 5s, replay-window only) + bench_config.json (worker->GPU map); teardown stops the sampler. Future runs populate per-worker GPU util automatically. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 16:08:22 +08:00
Gahow Wang	f739f7d461	Proxy/runner support for Nixl connector + unified_v3 (offload-decode) policy scripts/b3_isolated_policy.sh: Recognize unified_v3 as a kv_both-requiring policy; respect explicit KV_CONNECTOR=Nixl override (so unified_v2 / unified_v3 / unified_kv_both can run against either Mooncake or Nixl back-end). When Nixl is selected, skip the bootstrap-ports plumbing — Nixl uses its own UCX side-channel and the proxy forwards kv_transfer_params from the src response body instead of pre-baking engine_id/bootstrap_addr. scripts/cache_aware_proxy.py: - New unified_v3 policy (~250 lines): prefill stays on session-affinity host (preserves intra-session prefix-cache reuse), decode is migrated to a lower-load target when the affinity host is busy with concurrent decodes. KV transfer flows prefill_host → decode_target, opposite of v2. Knobs: v3_min_new_tokens, v3_min_prefill_decode_busy, v3_target_load_ratio, v3_min_load_gap, v3_rotate_affinity, v3_prefer_cache_target. cache_miss_audit found rotation hurts cross-turn locality (9.5% hit with vs ~80% without) so default v3_rotate_affinity=False. - New connector_type setting ("mooncake" | "nixl") gating the PD-sep handshake form: mooncake uses pre-baked kv_transfer_params, nixl forwards them from the response body. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:05:19 +08:00
Gahow Wang	3fdcec9c0f	Fix review P2s: lockfile, model path convention, trap robustness - Regenerate uv.lock after adding fastapi/uvicorn deps so uv sync --locked no longer fails - B3 scripts: default MODEL to $HOME/models/... matching documented convention and other launch scripts (repo has no models/ directory) - launch_elastic_p2p: append || true to each trap command so set -e doesn't abort cleanup when jobs -p is empty and EngineCore orphans remain	2026-05-26 16:05:43 +08:00
Gahow Wang	645b067dd4	Fix review bugs: PD-sep counter leaks, hardcoded paths, missing deps Critical: - cache_aware_proxy: _handle_pd_sep leaked p_inst.num_requests (never decremented) and never managed d_inst.num_requests; fix media_type from application/json to text/event-stream for SSE stream High: - b3_sweep/b3_isolated_policy/b3_analyze: replace hardcoded /home/admin/cpfs/wjh/ ROOT with script-relative $(dirname "$0")/.. - b3_analyze: replace hardcoded 8-port WORKER_MAP with dynamic generation from BASE_PORT and N_INSTANCES Medium: - analyze_breakdown: warn on stderr when records are skipped (was silent) - deploy_vllm_patches: fail-fast on SSH/SCP errors instead of continuing with empty VENV_SITE - pyproject.toml: declare fastapi and uvicorn as runtime dependencies - launch_elastic_p2p: kill EngineCore and proxy in trap handler to prevent GPU memory leaks on exit	2026-05-26 15:54:55 +08:00
Gahow Wang	0eb49dcc34	Fix NIXL multi-instance port conflict: per-instance SIDE_CHANNEL_PORT NIXL's _nixl_handshake_listener (vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:700) binds a ZMQ ROUTER socket on the side_channel_port, which defaults to 5600. When 8 NIXL vLLMs launch concurrently on the same host all 8 race for tcp://localhost:5600; exactly one succeeds and the others silently hang in the listener thread with: zmq.error.ZMQError: Address already in use (addr='tcp://localhost:5600') The engines themselves never reach "Application startup complete" and the b3_isolated_policy.sh health-check times out. First observed when 7 of 8 inst_X.log files contained the ZMQ error and the 8th (by random ordering) was the one healthy instance. Fix: set VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600 + i)) per instance in the NIXL launch branch. Each engine now gets a distinct handshake port (5600..5607 by default). Verified: all 8 instances now reach "Application startup complete" within the 360 s health budget. This is NIXL-specific; Mooncake uses VLLM_MOONCAKE_BOOTSTRAP_PORT which we were already varying per instance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 15:09:16 +08:00
Gahow Wang	151bf33541	Add unified_nixl_both policy: NIXL connector isolation control Adds a NIXL-backed counterpart to unified_kv_both so we can attribute the kv_both substrate overhead measured in the elastic_migration_v2 section to either Mooncake-specific code or a generic v1-connector cost shared by all connectors. - scripts/cache_aware_proxy.py: register --policy unified_nixl_both. Picker is identical to unified (and unified_kv_both); routing decisions never go through the PD-sep branch. Differs only at the vLLM launch layer. - scripts/b3_isolated_policy.sh: new KV_CONNECTOR env var (Mooncake|Nixl), auto-set based on POLICY. NIXL launch path uses --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' with no VLLM_MOONCAKE_BOOTSTRAP_PORT (NIXL uses UCX side-channels). - Health-check timeout: 90 iterations * 2s -> 180 iterations * 2s (180s -> 360s). Empirically NIXL needs ~100-150s per instance to initialize the UCX agent and register KV cache memory; 8 concurrent NIXL launches frequently overshoot the previous 180s budget. Mooncake is unaffected (still finishes well inside the new budget). The 8-vLLM unified_nixl_both first launch tripped the old timeout despite 7/8 instances reaching startup-complete. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 14:57:54 +08:00
Gahow Wang	4b833d33b7	unified_v2.1: relax gates + add unified_kv_both isolation control v2.0 ran on B3 and triggered PD-sep only 2 / 1214 times (0.2%). The gates were too conservative; the v2-vs-v1 latency gap (TTFT p90 7.35 -> 8.96 s) is therefore probably attributable to kv_both always-on overhead, not to the PD-sep mechanism itself. v2.1 has two fixes plus an isolation control. Bug fix: - The "chosen has live decodes worth protecting" gate combined num_requests and ongoing_decode_tokens with AND, falling through when EITHER was small. Under agentic workloads each worker rarely stacks more than 1-2 concurrent requests, so the gate killed 84% of v2.0 candidates that reached it. Replace with a pure ongoing_decode_tokens == 0 check ("chosen_no_active_decode") — same semantic, much higher recall. Threshold relaxation (B2 microbench is the calibration source): - pd_sep_min_new_tokens: 16000 -> 8000 (B2 TPOT idx 1.9x already at 8k, TTFT idx 12x — strictly worth migrating) - pd_sep_min_decodes_protected: 2 -> 1 - pd_sep_min_src_cache_tokens: 8000 -> 4000 - pd_sep_min_extra_cache_tokens: 4000 -> 2000 Isolation control: - New --policy unified_kv_both option. Uses the exact same picker as --policy unified but the vLLMs are launched in kv_role=kv_both (the same launch mode unified_v2 requires). PD-sep never fires. Compares against unified_v2 to attribute any v2 effect to the PD-sep branch alone, not the kv_both always-on overhead. - Both unified_kv_both and unified_v2 auto-enable kv_both launch in b3_isolated_policy.sh. Tests: - Updated the existing "chosen has no decodes" test for the new gate name and semantic. - All 24 proxy tests pass. Refs: window_1_results/v2_breakdown analysis (88.7% of candidates caught by old new_local_below_threshold; 84% of the remainder caught by the old few_decodes gate). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 10:40:57 +08:00
Gahow Wang	19f69a9d2e	unified_v2: selective per-request PD-sep via Mooncake (E3+E4) Adds a sixth routing policy --policy unified_v2 that wraps the existing unified hybrid picker with a selective PD-sep branch. When all of the following hold, a request is split prefill-on-src, decode-on-chosen via Mooncake kv_role=kv_both transfer: 1. new_local = input_length - chosen.cache_hit > 16k (B2 microbench shows same-worker TTFT idx >= 3x from this size up) 2. chosen has live decodes worth protecting (>= 2 in-flight) 3. some other instance holds materially more cache for this prefix (>= 8k tokens, and >= 4k more than chosen) 4. cost(src_interference + RDMA xfer) + 0.2s margin < cost(chosen_interference) The cost model is the audit-blessed shape from E1's post-mortem: - gate on new_tokens (post-cache), NOT input_length (the old PUSH gate) - bind to a single transfer mechanism (kv_both peer-to-peer pull) - realistic RDMA cost as a function of bytes: 0.3s base + bytes / 2.7 GB/s (calibrated against contention_16s_elastic p50) - both source and target decode counts considered E2 mechanism-level patches not yet applied (this commit is policy-only). Patches 6.2 / 6.3 / 6.5 remain on the table. Patch 6.6 (per-request xfer timeout, 60s default) is implemented on the proxy side as an httpx per-chunk read timeout on the dst streaming call, so a stuck KV transfer fails the request instead of hanging for 600s. 
cache_aware_proxy.py: - Settings: kv_bytes_per_token, prefill_throughput_kv_both, rdma_base_overhead_s, rdma_effective_gb_per_s, pd_sep_* gating knobs - estimate_transfer_cost(bytes) replaces the constant rdma_overhead_s - estimate_same_worker_interference_s(new_tokens, num_decodes) reads off the B2 penalty curve in 4 bins - pick_instance_unified_v2: inherits unified, returns extra (src_inst, src_idx) tuple when PD-sep wins the cost compare - _handle_combined_pd_sep_v2: prefill on src (do_remote_decode=True, max_tokens=1), Mooncake xfer, decode-stream on dst with httpx Timeout(read=pd_sep_xfer_timeout_s) - --policy unified_v2 added to argparse choices - lifespan auto-runs init_prefill_bootstrap when policy is unified_v2 b3_isolated_policy.sh: - ENABLE_KV_BOTH env var, auto-set when POLICY=unified_v2, threads kv_role=kv_both + VLLM_MOONCAKE_BOOTSTRAP_PORT to vllm and --bootstrap-ports to the proxy Tests: 8 new unit tests cover the gating predicates and the cost estimators; all 32 proxy tests still pass. Refs: E1 (PUSH post-mortem) + E2 (Mooncake audit) reports. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 09:25:45 +08:00
Gahow Wang	1d87082ca1	B3: cold-start isolated policy runner (clean APC per cell) scripts/b3_isolated_policy.sh wraps one policy run in a fresh 8-instance vLLM lifecycle: hard reset -> launch -> health -> proxy -> replayer -> snapshot artifacts -> cleanup. Used when cross-policy APC contamination matters more than the ~25-min vLLM warmup overhead per policy. Counterpart to the existing b3_sweep.sh which keeps vLLM warm across all policies (faster but warm-cache; we found via the sticky pre-flight that contamination is < 1% on this trace, so b3_sweep.sh stays the default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 20:33:44 +08:00
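
Note on 19f69a9d2e and 4b833d33b7: the unified_v2 gates and the RDMA cost model described in those two messages can be illustrated with a minimal Python sketch. This is not the code in scripts/cache_aware_proxy.py. The constants (0.3 s base, 2.7 GB/s, 0.2 s margin, the v2.1 thresholds) and the gate names new_local_below_threshold and chosen_no_active_decode are taken from the messages. Everything else is an assumption made for illustration: the PdSepSettings container, the margin_s field, the function signatures, the remaining reject-reason strings, and reading "GB/s" as decimal gigabytes. The interference costs come from the B2 penalty curve, which the log does not reproduce, so they are passed in as plain numbers.

```python
from dataclasses import dataclass

GB = 1e9  # assumption: "GB/s" read as decimal gigabytes per second


@dataclass
class PdSepSettings:
    """Hypothetical container; the values are the ones quoted in the commit log."""
    rdma_base_overhead_s: float = 0.3
    rdma_effective_gb_per_s: float = 2.7
    pd_sep_min_new_tokens: int = 8000          # v2.1 value (v2.0 used 16000)
    pd_sep_min_src_cache_tokens: int = 4000    # v2.1 value (v2.0 used 8000)
    pd_sep_min_extra_cache_tokens: int = 2000  # v2.1 value (v2.0 used 4000)
    margin_s: float = 0.2                      # hypothetical field name


def estimate_transfer_cost(num_bytes: float, s: PdSepSettings) -> float:
    """RDMA cost in seconds: fixed base overhead plus bytes over bandwidth."""
    return s.rdma_base_overhead_s + num_bytes / (s.rdma_effective_gb_per_s * GB)


def pd_sep_decision(input_length, chosen_cache_hit, chosen_ongoing_decode_tokens,
                    src_cache_hit, xfer_bytes, src_interference_s,
                    chosen_interference_s, s):
    """Return "pd_sep" if every gate passes, else the name of the first failing gate."""
    # Gate 1: only the post-cache (new) tokens count, not the raw input length.
    new_local = input_length - chosen_cache_hit
    if new_local <= s.pd_sep_min_new_tokens:
        return "new_local_below_threshold"
    # Gate 2 (v2.1 form): the chosen worker must have a decode worth protecting.
    if chosen_ongoing_decode_tokens == 0:
        return "chosen_no_active_decode"
    # Gate 3: the source must hold enough cache, and materially more than chosen.
    if src_cache_hit < s.pd_sep_min_src_cache_tokens:
        return "src_cache_too_small"        # hypothetical reason name
    if src_cache_hit - chosen_cache_hit < s.pd_sep_min_extra_cache_tokens:
        return "src_cache_gain_too_small"   # hypothetical reason name
    # Gate 4: splitting must beat running locally by at least the margin.
    split_cost = src_interference_s + estimate_transfer_cost(xfer_bytes, s)
    if split_cost + s.margin_s >= chosen_interference_s:
        return "cost_model_prefers_local"   # hypothetical reason name
    return "pd_sep"
```

Under these assumptions a 2.7 GB transfer costs about 1.3 s (0.3 s base plus 1.0 s on the wire), so a split only wins when the chosen worker's interference estimate exceeds the source-side interference by more than roughly 1.5 s.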

10 Commits
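
Note on 151bf33541, 0eb49dcc34 and 68f21bef23: the health-check budget and the per-instance NIXL port assignment are simple arithmetic, sketched below in Python for reference. The real logic lives in shell in scripts/b3_isolated_policy.sh. The 2 s poll interval, the default of 180 tries, the base port 5600, and the variable names HEALTH_MAX_TRIES and VLLM_NIXL_SIDE_CHANNEL_PORT come from the commit messages; the function names and the zero-based instance index are assumptions (the latter inferred from the stated 5600..5607 range for 8 instances).

```python
HEALTH_POLL_INTERVAL_S = 2          # "180 iterations * 2s" in 151bf33541
DEFAULT_HEALTH_MAX_TRIES = 180      # default kept by 68f21bef23
NIXL_BASE_SIDE_CHANNEL_PORT = 5600  # NIXL default side-channel port per 0eb49dcc34


def health_budget_s(max_tries: int = DEFAULT_HEALTH_MAX_TRIES) -> int:
    """Total seconds the runner waits for instances to become healthy."""
    return max_tries * HEALTH_POLL_INTERVAL_S


def nixl_launch_env(instance_index: int) -> dict:
    """Environment for one NIXL instance: a distinct handshake port per engine.

    Hypothetical helper; mirrors VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600 + i)).
    """
    port = NIXL_BASE_SIDE_CHANNEL_PORT + instance_index
    return {"VLLM_NIXL_SIDE_CHANNEL_PORT": str(port)}


if __name__ == "__main__":
    print(health_budget_s())      # default budget in seconds
    print(health_budget_s(300))   # slow-node override, HEALTH_MAX_TRIES=300
    for i in range(8):
        print(i, nixl_launch_env(i))
```

With the default of 180 tries the budget is 360 s; passing HEALTH_MAX_TRIES=300 stretches it to 600 s. Eight instances get ports 5600 through 5607, so no two engines bind the same ZMQ address.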