agentic-kvc/scripts/b3_isolated_policy.sh
Gahow Wang 4b833d33b7 unified_v2.1: relax gates + add unified_kv_both isolation control
v2.0 ran on B3 and triggered PD-sep only 2 / 1214 times (0.2%). The
gates were too conservative; the v2-vs-v1 latency gap (TTFT p90
7.35 -> 8.96 s) is therefore probably attributable to kv_both
always-on overhead, not to the PD-sep mechanism itself. v2.1 has two
fixes plus an isolation control.

Bug fix:
- The "chosen has live decodes worth protecting" gate combined
  num_requests and ongoing_decode_tokens with AND, so it fell through
  whenever EITHER was small. Under agentic workloads each worker rarely
  stacks more than 1-2 concurrent requests, so the gate killed 84%
  of the v2.0 candidates that reached it. Replace it with a pure
  ongoing_decode_tokens == 0 check ("chosen_no_active_decode"):
  same intent, much higher recall.
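
The difference between the two gates can be sketched in shell. This is an illustration only: the real gate lives in the proxy, and the function names and the threshold values passed below (2 requests, 1000 decode tokens) are hypothetical, not the v2.0 settings.

```shell
# v2.0-style gate (sketch): the chosen worker counts as "worth protecting"
# only if BOTH the request count and the in-flight decode tokens are large.
old_gate_protects() {
    local num_requests="$1" ongoing_decode_tokens="$2"
    local min_requests="$3" min_decode_tokens="$4"   # hypothetical thresholds
    [ "$num_requests" -ge "$min_requests" ] &&
        [ "$ongoing_decode_tokens" -ge "$min_decode_tokens" ]
}

# v2.1-style gate (sketch): any active decode is worth protecting; only
# ongoing_decode_tokens == 0 rejects the candidate.
new_gate_protects() {
    local ongoing_decode_tokens="$1"
    [ "$ongoing_decode_tokens" -gt 0 ]
}

# A worker holding a single request that is mid-decode with 3000 tokens.
if old_gate_protects 1 3000 2 1000; then
    echo "old: protect"
else
    echo "old: few_decodes"
fi
if new_gate_protects 3000; then
    echo "new: protect"
else
    echo "new: chosen_no_active_decode"
fi
```

With one in-flight request the old gate rejects the candidate even though a long decode is running, while the new gate keeps it.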

Threshold relaxation (B2 microbench is the calibration source):
- pd_sep_min_new_tokens: 16000 -> 8000 (B2 TPOT idx 1.9x already
  at 8k, TTFT idx 12x — strictly worth migrating)
- pd_sep_min_decodes_protected: 2 -> 1
- pd_sep_min_src_cache_tokens: 8000 -> 4000
- pd_sep_min_extra_cache_tokens: 4000 -> 2000

Isolation control:
- New --policy unified_kv_both option. Uses the exact same picker as
  --policy unified but the vLLMs are launched in kv_role=kv_both
  (the same launch mode unified_v2 requires). PD-sep never fires.
  Compare it against unified_v2 to attribute any v2 effect to the
  PD-sep branch alone rather than to the kv_both always-on overhead.
- Both unified_kv_both and unified_v2 auto-enable kv_both launch in
  b3_isolated_policy.sh.
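
A three-way comparison could be planned as below. This is a dry-run sketch that only prints the commands; the trace path and run directory are placeholders, and needs_kv_both is a hypothetical helper that mirrors the auto-enable check in b3_isolated_policy.sh.

```shell
# Mirrors the auto-enable check in b3_isolated_policy.sh (hypothetical helper).
needs_kv_both() {
    case "$1" in
        unified_v2|unified_kv_both) return 0 ;;
        *) return 1 ;;
    esac
}

# Print (do not run) one isolated invocation per policy.
plan_runs() {
    local trace="$1" outroot="$2" policy kv
    for policy in unified unified_kv_both unified_v2; do
        if needs_kv_both "$policy"; then kv=1; else kv=0; fi
        echo "bash scripts/b3_isolated_policy.sh $policy $trace $outroot/$policy  # kv_both=$kv"
    done
}

# Placeholder trace and output paths.
plan_runs traces/b3.jsonl runs/v2.1
```

Comparing unified with unified_kv_both isolates the kv_both launch overhead; comparing unified_kv_both with unified_v2 isolates the PD-sep branch.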

Tests:
- Updated the existing "chosen has no decodes" test for the new
  gate name and semantics.
- All 24 proxy tests pass.

Refs: window_1_results/v2_breakdown analysis (88.7% of candidates
caught by old new_local_below_threshold; 84% of the remainder
caught by the old few_decodes gate).
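
The breakdown percentages can be turned into approximate counts. A rough sketch, assuming the 1214 above is the number of PD-sep candidates (the message does not say so explicitly) and that the two gates apply in the order listed; funnel is a hypothetical helper:

```shell
# Approximate survivors after each of the two old gates.
funnel() {
    awk -v n="$1" 'BEGIN {
        after_first  = n * (1 - 0.887)            # survive new_local_below_threshold
        after_second = after_first * (1 - 0.84)   # survive few_decodes
        printf "%.0f %.0f\n", after_first, after_second
    }'
}

funnel 1214   # prints "137 22"
```

Under those assumptions roughly 137 candidates pass the first gate and roughly 22 pass the second, of which only 2 triggered PD-sep.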

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 10:40:57 +08:00


#!/usr/bin/env bash
# Run a single B3 policy with a cold-start vLLM (clean APC).
#
# Usage:
# bash scripts/b3_isolated_policy.sh <policy> <trace> <rundir>
#
# Launches one fresh vLLM instance per GPU in GPU_INDICES (8 by default),
# captures their engine_state into <rundir>/engine_state/, runs the policy
# through the proxy on <trace>, then kills everything. This differs from
# b3_sweep.sh, which shares one vLLM set across all five policies (faster
# but warm-cache).
set -euo pipefail
ROOT="${ROOT:-/home/admin/cpfs/wjh/agentic-kv}"
VENV="$ROOT/.venv/bin"
MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
PROXY_PORT="${PROXY_PORT:-9300}"
BASE_PORT="${BASE_PORT:-8000}"
GPU_INDICES="${GPU_INDICES:-0 1 2 3 4 5 6 7}"
EXTRA_VLLM_ARGS="${EXTRA_VLLM_ARGS:---enable-prompt-tokens-details}"
N_INSTANCES=$(echo $GPU_INDICES | wc -w)
# When ENABLE_KV_BOTH=1, vLLM launches with the Mooncake KV connector in
# kv_both role and the proxy is given bootstrap ports. This is required
# for --policy unified_v2 (per-request PD-sep) and is also used by the
# unified_kv_both isolation control. It is disabled by default because
# it adds always-on KV-transfer overhead even when PD-sep never fires.
ENABLE_KV_BOTH="${ENABLE_KV_BOTH:-0}"
BOOTSTRAP_BASE_PORT="${BOOTSTRAP_BASE_PORT:-8998}"
POLICY="${1:?usage: $0 <policy> <trace> <rundir>}"
TRACE="${2:?usage: $0 <policy> <trace> <rundir>}"
RUNDIR="${3:?usage: $0 <policy> <trace> <rundir>}"
# Auto-enable kv_both when the policy requires it.
if [ "$POLICY" = "unified_v2" ] || [ "$POLICY" = "unified_kv_both" ]; then
    ENABLE_KV_BOTH=1
fi
mkdir -p "$RUNDIR/engine_state" "$RUNDIR/logs"
echo "[isolated] policy=$POLICY trace=$(basename "$TRACE") rundir=$RUNDIR"
cleanup() {
    pkill -9 -f cache_aware_proxy 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 3
}
trap cleanup EXIT
# Hard reset first
cleanup
echo "[isolated] launching $N_INSTANCES vLLM on GPUs $GPU_INDICES ENABLE_KV_BOTH=$ENABLE_KV_BOTH ..."
i=0
kv_both_extra=""
if [ "$ENABLE_KV_BOTH" = "1" ]; then
    kv_both_extra="--kv-transfer-config {\"kv_connector\":\"MooncakeConnector\",\"kv_role\":\"kv_both\"}"
fi
for gpu in $GPU_INDICES; do
    port=$((BASE_PORT + i))
    master=$((29500 + i))
    bp=$((BOOTSTRAP_BASE_PORT + i))
    if [ "$ENABLE_KV_BOTH" = "1" ]; then
        PYTHONHASHSEED=42 \
        VLLM_MOONCAKE_BOOTSTRAP_PORT=$bp \
        AGENTIC_STEP_LOG_PATH="$RUNDIR/engine_state/engine_${i}.jsonl" \
        AGENTIC_WORKER_ID="engine_${i}" \
        CUDA_VISIBLE_DEVICES=$gpu \
        MASTER_PORT=$master \
        nohup "$VENV/vllm" serve "$MODEL" \
            --host 0.0.0.0 --port "$port" \
            --tensor-parallel-size 1 \
            --trust-remote-code --enable-prefix-caching \
            --dtype auto --gpu-memory-utilization 0.9 \
            --max-model-len 200000 \
            --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
            $EXTRA_VLLM_ARGS \
            > "$RUNDIR/logs/vllm_inst_${i}_gpu${gpu}.log" 2>&1 &
    else
        AGENTIC_STEP_LOG_PATH="$RUNDIR/engine_state/engine_${i}.jsonl" \
        AGENTIC_WORKER_ID="engine_${i}" \
        CUDA_VISIBLE_DEVICES=$gpu \
        MASTER_PORT=$master \
        nohup "$VENV/vllm" serve "$MODEL" \
            --host 0.0.0.0 --port "$port" \
            --tensor-parallel-size 1 \
            --trust-remote-code --enable-prefix-caching \
            --dtype auto --gpu-memory-utilization 0.9 \
            --max-model-len 200000 \
            $EXTRA_VLLM_ARGS \
            > "$RUNDIR/logs/vllm_inst_${i}_gpu${gpu}.log" 2>&1 &
    fi
    disown
    sleep 2
    i=$((i + 1))
done
echo "[isolated] waiting for vLLM health ..."
for i in $(seq 0 $((N_INSTANCES - 1))); do
    port=$((BASE_PORT + i))
    tries=0
    while ! curl -sf "http://127.0.0.1:$port/health" >/dev/null 2>&1; do
        tries=$((tries + 1))
        if [ $tries -gt 90 ]; then
            echo "[isolated] FATAL: inst_$i not healthy after 180s"
            exit 1
        fi
        sleep 2
    done
    echo " inst_$i ready"
done
echo "[isolated] launching proxy with --policy $POLICY ..."
combined_args=""
for i in $(seq 0 $((N_INSTANCES - 1))); do
    combined_args="$combined_args http://127.0.0.1:$((BASE_PORT + i))"
done
proxy_extra=""
if [ "$ENABLE_KV_BOTH" = "1" ]; then
    bp_list=""
    for i in $(seq 0 $((N_INSTANCES - 1))); do
        if [ -z "$bp_list" ]; then
            bp_list="$((BOOTSTRAP_BASE_PORT + i))"
        else
            bp_list="$bp_list,$((BOOTSTRAP_BASE_PORT + i))"
        fi
    done
    proxy_extra="--bootstrap-ports $bp_list"
fi
nohup "$VENV/python" "$ROOT/scripts/cache_aware_proxy.py" \
    --port "$PROXY_PORT" \
    --combined $combined_args \
    --policy "$POLICY" \
    $proxy_extra \
    > "$RUNDIR/proxy.log" 2>&1 &
disown
tries=0
until curl -sf "http://127.0.0.1:$PROXY_PORT/stats" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ $tries -gt 30 ]; then
        echo "[isolated] FATAL: proxy did not come up in 60s"
        tail -30 "$RUNDIR/proxy.log"
        exit 1
    fi
    sleep 2
done
t_start=$(date +%s.%N)
echo "[isolated] running replayer ..."
PYTHONPATH="$ROOT" "$VENV/python" -m replayer \
    --trace "$TRACE" \
    --output "$RUNDIR/metrics.jsonl" \
    --endpoint "http://127.0.0.1:$PROXY_PORT" \
    --model "$MODEL" \
    2>&1 | tee "$RUNDIR/replayer.log" | tail -3
t_end=$(date +%s.%N)
python3 - "$RUNDIR" "$POLICY" "$TRACE" "$t_start" "$t_end" <<'PY'
import json, sys
rundir, policy, trace, t_start, t_end = sys.argv[1:]
with open(f"{rundir}/run_window.json", "w") as f:
    json.dump({
        "policy": policy, "trace": trace,
        "t_start_unix": float(t_start),
        "t_end_unix": float(t_end),
        "isolated": True,
    }, f, indent=2)
PY
curl -s "http://127.0.0.1:$PROXY_PORT/breakdown" > "$RUNDIR/breakdown.json"
curl -s "http://127.0.0.1:$PROXY_PORT/worker_state" > "$RUNDIR/worker_state.json"
curl -s "http://127.0.0.1:$PROXY_PORT/stats" > "$RUNDIR/stats.json"
echo "[isolated] $POLICY done: $(wc -l < "$RUNDIR/metrics.jsonl") metric rows"