Adds an env-gated skip for the per-step `set(cache.keys())` walk in
MooncakeConnectorScheduler.build_connector_meta() that was introduced in our
own commit a7df84b (Direct RDMA read). Re-runs the cache_sweep A/B with three
configs: plain (control), mooncake_both (baseline), and mooncake_both_drfix
(VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).

Files:
  apply_direct_read_fix.py      one-line env-gate patch (markered revert)
  run_drfix.sh                  orchestrator for plain + mooncake_both + drfix
  analyze.py                    extended to compare mooncake_both_drfix vs plain
                                and mooncake_both vs mooncake_both_drfix
  REPORT_DRFIX.md               findings
  results/20260526_1543_drfix/  run artifacts

Headline:
  config                | slope (μs/1k blocks) | step_dur p50 @ 16.6k
  ----------------------|----------------------|---------------------
  mooncake_both         | +81.0                | 1 550 μs
  mooncake_both_drfix   | -0.7 (≈ 0)           | 95 μs
  plain (control)       | -1.8 (≈ 0)           | 72 μs

build_meta p50 @ 16.6k blocks:
  mooncake_both       = 1 459 μs
  mooncake_both_drfix =     6 μs (residual loop bookkeeping)

worker get_finished p50:
  mooncake_both       = 178 μs (unchanged; this fix doesn't touch it)
  mooncake_both_drfix = 183 μs

The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at
|cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within ±50 μs
across the full cache range; that's noise-level. The slope goes from +81 to
essentially zero.

Worker-side get_finished (180 μs constant) is unchanged because the DR-fix
touches scheduler.build_connector_meta only. That's the next target if we
want to bring kv_both fully back to plain-level.

Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
  before:       build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
  after DR-fix: build_meta     6 μs + get_finished 180 μs = ~0.19 ms/step
  → 85% reduction in per-step connector cost
  → TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step

Confirms: the entire O(|cache|) slope was introduced by our own
direct-RDMA-read implementation (commit a7df84b), not upstream Mooncake.
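The trace-replay extrapolation can be re-derived from the quoted figures. The sketch below is only a sanity check of that arithmetic: it assumes the per-step connector cost is the plain sum of build_meta and get_finished, and that TPOT inflation is that cost divided by the 7 ms decode step. The helper names are illustrative and do not come from analyze.py.

```python
# Figures quoted in the commit message, in microseconds per engine step.
BUILD_META_BEFORE_US = 1060.0  # build_meta at |cache| ~ 13k with the sync enabled
BUILD_META_AFTER_US = 6.0      # build_meta with the sync skipped
GET_FINISHED_US = 180.0        # worker-side cost, not touched by the fix
DECODE_STEP_US = 7000.0        # 7 ms decode step used as the TPOT baseline


def connector_cost_us(build_meta_us: float, get_finished_us: float) -> float:
    """Per-step connector cost, modelled as a plain sum (an assumption)."""
    return build_meta_us + get_finished_us


def tpot_inflation_pct(connector_us: float, decode_step_us: float) -> float:
    """Connector cost expressed as a percentage of the decode step."""
    return 100.0 * connector_us / decode_step_us


before_us = connector_cost_us(BUILD_META_BEFORE_US, GET_FINISHED_US)
after_us = connector_cost_us(BUILD_META_AFTER_US, GET_FINISHED_US)
reduction_pct = 100.0 * (before_us - after_us) / before_us

print(f"before: {before_us / 1000:.2f} ms/step, after: {after_us / 1000:.2f} ms/step")
print(f"reduction in connector cost: {reduction_pct:.0f}%")
print(
    f"TPOT inflation: {tpot_inflation_pct(before_us, DECODE_STEP_US):.1f}% -> "
    f"{tpot_inflation_pct(after_us, DECODE_STEP_US):.1f}%"
)
```

Under these assumptions the totals come out at 1 240 μs and 186 μs per step, which is where the rounded 1.24 ms, ~0.19 ms, 85%, ~+18% and ~+3% figures come from.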
Production fix: gate the sync on the presence of any direct_read consumer, or
replace the per-step diff with an incremental delta listener fed by block_pool
add/remove callbacks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
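The second production option, an incremental delta listener, could look roughly like the sketch below. It is a hypothetical illustration, not the planned implementation: `BlockHashDeltaTracker`, `on_block_added` and `on_block_removed` are invented names standing in for whatever add/remove hooks block_pool would expose, and it assumes the pool reports an add only for an absent hash and a remove only for a present one.

```python
from typing import Dict, Hashable, Set, Tuple


class BlockHashDeltaTracker:
    """Accumulates cache changes between steps instead of re-walking the cache.

    Hypothetical sketch: the two callbacks are not real vLLM APIs.
    """

    def __init__(self) -> None:
        self._added: Set[Hashable] = set()
        self._removed: Set[Hashable] = set()

    def on_block_added(self, block_hash: Hashable) -> None:
        # An add cancels a pending remove of the same hash within one step.
        if block_hash in self._removed:
            self._removed.discard(block_hash)
        else:
            self._added.add(block_hash)

    def on_block_removed(self, block_hash: Hashable) -> None:
        # A remove cancels a pending add of the same hash within one step.
        if block_hash in self._added:
            self._added.discard(block_hash)
        else:
            self._removed.add(block_hash)

    def drain(self) -> Tuple[Set[Hashable], Set[Hashable]]:
        """Return (added, removed) since the last drain and reset the tracker."""
        added, removed = self._added, self._removed
        self._added, self._removed = set(), set()
        return added, removed


def full_diff(
    prev: Set[Hashable], cache: Dict[Hashable, object]
) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Reference O(|cache|) diff, i.e. the per-step walk the tracker would replace."""
    current = set(cache.keys())
    return current - prev, prev - current


# Usage: the tracker agrees with the full diff, at cost proportional to the churn.
cache: Dict[Hashable, object] = {"h1": object(), "h2": object()}
prev = set(cache.keys())
tracker = BlockHashDeltaTracker()

cache["h3"] = object()
tracker.on_block_added("h3")
del cache["h1"]
tracker.on_block_removed("h1")
cache["h4"] = object()
tracker.on_block_added("h4")
del cache["h4"]
tracker.on_block_removed("h4")  # added and evicted within the same step

delta = tracker.drain()
assert delta == full_diff(prev, cache)
```

The per-step cost then scales with the number of blocks that changed rather than with |cache|. The first option (gating on the presence of a direct_read consumer) is simpler but still pays the full walk whenever a consumer exists.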
134 lines
4.4 KiB
Bash
Executable File
#!/bin/bash
# A/B re-measurement after the direct-RDMA-read hash-sync env-gate.
#
# Applies all instrumentation (v1+v2) AND the CT_DR_FIX patch, then runs
# the same workload as run_all.sh on:
#   plain                (control)
#   mooncake_both        (baseline: env not set → original behaviour)
#   mooncake_both_drfix  (with VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1)
# Same vLLM lifetime per config; full apply→run→revert cycle at the end.
#
# Usage: bash run_drfix.sh
# Env overrides: DURATION (default 240), RATE (default 1.5), CONFIGS (override list)

set -uo pipefail

HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
CT_DIR="$(cd "$HERE/.." && pwd)"
PROJ_DIR="$(cd "$HERE/../../.." && pwd)"
PYTHON="${PYTHON:-$PROJ_DIR/.venv/bin/python}"
VLLM_ROOT="${VLLM_ROOT:-$PROJ_DIR/.venv/lib/python3.12/site-packages/vllm}"

DURATION="${DURATION:-240}"
RATE="${RATE:-1.5}"
PORT="${PORT:-8000}"
GPU_ID="${GPU_ID:-0}"
MODEL_PATH="${MODEL_PATH:-$HOME/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
CONFIGS="${CONFIGS:-plain mooncake_both mooncake_both_drfix}"

DATE="$(date +%Y%m%d_%H%M)"
RUN_ROOT="$HERE/results/${DATE}_drfix"
mkdir -p "$RUN_ROOT"

echo "=== Cache-size sweep + DR-fix A/B ==="
echo "Run dir : $RUN_ROOT"
echo "Configs : $CONFIGS"
echo "Rate    : $RATE  Duration: ${DURATION}s"
echo ""

kill_all_vllm() {
    local used
    pkill -9 -f "VLLM::EngineCore" 2>/dev/null || true
    pkill -9 -f "vllm.entrypoints" 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    sleep 4
    # Poll until the GPU reports under 1000 MiB in use (up to 20 tries, 3 s apart).
    for _ in $(seq 1 20); do
        used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i "$GPU_ID" 2>/dev/null | tr -d ' ')
        [[ -n "$used" && "$used" -lt 1000 ]] && return 0
        sleep 3
    done
    echo "WARN: GPU $GPU_ID not free after kill" >&2
}

cleanup_patches() {
    # Revert in reverse order of application. Failures are ignored so that
    # the EXIT trap always runs to completion.
    "$PYTHON" "$HERE/apply_direct_read_fix.py" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
    "$PYTHON" "$HERE/apply_step_timing_v2.py" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
    [[ -f "$CT_DIR/patches/apply_step_timing.py" ]] && \
        "$PYTHON" "$CT_DIR/patches/apply_step_timing.py" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
}

trap 'kill_all_vllm; cleanup_patches' EXIT

echo "[stage 0] applying v1 + v2 + DR_FIX patches"
"$PYTHON" "$CT_DIR/patches/apply_step_timing.py" --apply --vllm-root "$VLLM_ROOT"
"$PYTHON" "$HERE/apply_step_timing_v2.py" --apply --vllm-root "$VLLM_ROOT"
"$PYTHON" "$HERE/apply_direct_read_fix.py" --apply --vllm-root "$VLLM_ROOT"

kill_all_vllm
for cfg in $CONFIGS; do
    cfg_dir="$RUN_ROOT/$cfg"
    mkdir -p "$cfg_dir"

    # Pick the launch script. mooncake_both_drfix reuses mooncake_both's launcher.
    case "$cfg" in
        mooncake_both_drfix) launch_cfg="mooncake_both" ;;
        *) launch_cfg="$cfg" ;;
    esac

    launch_script="$CT_DIR/launch/launch_${launch_cfg}.sh"
    if [[ ! -f "$launch_script" ]]; then
        echo "SKIP $cfg (no launch script at $launch_script)"
        continue
    fi

    echo ""
    echo "====== Config: $cfg (launcher=$launch_cfg) ======"

    export RUN_DIR="$cfg_dir"
    export PORT GPU_ID MODEL_PATH
    export AGENTIC_STEP_LOG_PATH="$cfg_dir/engine_step.jsonl"
    export CT_WORKER_STEP_LOG_PATH="$cfg_dir/worker_step.jsonl"
    export PYTHONPATH="$PROJ_DIR:${PYTHONPATH:-}"

    # The env-gated skip: set only for the drfix config, cleared for the others.
    if [[ "$cfg" == "mooncake_both_drfix" ]]; then
        export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1
        echo "  VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1 (hash sync skipped)"
    else
        unset VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC
    fi

    # Start each config from empty logs.
    : > "$cfg_dir/engine_step.jsonl"
    rm -f "$cfg_dir/worker_step.jsonl".*
    : > "$cfg_dir/requests.jsonl"

    # With pipefail (set above), $? reflects a launcher failure, not tail's status.
    bash "$launch_script" 2>&1 | tail -5
    rc=$?
    if [[ $rc -ne 0 ]]; then
        echo "FAIL $cfg (launch rc=$rc), skipping bench"
        kill_all_vllm
        continue
    fi

    echo "[bench] running ${DURATION}s open-loop at rate=$RATE"
    "$PYTHON" "$HERE/run_cache_sweep.py" \
        --url "http://127.0.0.1:$PORT/v1/chat/completions" \
        --model "$MODEL_PATH" \
        --rate "$RATE" --duration "$DURATION" \
        --output-dir "$cfg_dir" 2>&1 | tail -8

    curl -s "http://127.0.0.1:$PORT/metrics" > "$cfg_dir/metrics_final.txt" 2>&1 || true

    echo "[teardown] $cfg"
    kill_all_vllm
done

echo ""
echo "[stage Z] reverting patches"
cleanup_patches

echo ""
echo "[analyze]"
"$PYTHON" "$HERE/analyze.py" --run-root "$RUN_ROOT"
echo ""
echo "Done. Artifacts: $RUN_ROOT"
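The script only exports VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC; the consuming side lives in the patch written by apply_direct_read_fix.py, which is not shown here. As a rough illustration of what a one-line env gate of this kind could look like, the sketch below uses hypothetical names (`direct_read_sync_disabled`, `snapshot_cache_hashes`) and assumes the flag is treated as active only when it equals "1", the value the script sets. The real patch may accept other values.

```python
import os
from typing import Dict, Hashable, Mapping, Optional, Set

ENV_FLAG = "VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC"


def direct_read_sync_disabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when the flag equals "1" (assumed semantics, see the lead-in)."""
    env = os.environ if env is None else env
    return env.get(ENV_FLAG, "") == "1"


def snapshot_cache_hashes(
    cache: Dict[Hashable, object],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Set[Hashable]]:
    """Hypothetical stand-in for the sync step in build_connector_meta().

    Returns None when the gate is active, otherwise the O(|cache|) key snapshot.
    """
    if direct_read_sync_disabled(env):
        return None  # skip the per-step walk entirely
    return set(cache.keys())


# Usage: the same call with and without the flag.
blocks = {"h1": object(), "h2": object()}
print(snapshot_cache_hashes(blocks, env={}))
print(snapshot_cache_hashes(blocks, env={ENV_FLAG: "1"}))
```

In a hot path the flag would more likely be read once at import time rather than on every step; it is read per call here only to keep the example easy to exercise.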