Gahow Wang ef9e0102ec Connector tax: trace-replay confirms +45% kv_both penalty is gone; DR-fix adds 22% more
Re-runs the elastic_migration_v2 trace (w600 r0.0015 st30, 1214 reqs,
274 sessions, 8×TP1 vLLM + cache_aware_proxy) with three configs:
- plain unified
- unified + Mooncake kv_both
- unified + Mooncake kv_both + DR-fix (env-gated O(|cache|) hash sync removal)

TTFT p90: 11.97 s → 9.74 s (−18.6%) → 7.58 s (−36.6% vs plain)
E2E p90:  23.48 s → 21.25 s (−9.5%) → 17.93 s (−23.6% vs plain)
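
The percentages above are relative changes in p90, mostly against the plain baseline; the "22% more" figure is kv_both + DR-fix relative to kv_both. A minimal sketch for recomputing them from the rounded p90 values follows. The helper name `pct_delta` is hypothetical and not part of the harness, and results can differ in the last digit from figures computed on unrounded data.

```bash
# pct_delta BASE NEW: signed relative change of NEW vs BASE, in percent.
# Hypothetical helper for illustration only; not part of this repo.
pct_delta() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%+.1f%%\n", (new - base) / base * 100 }'
}

pct_delta 11.97 9.74    # TTFT p90: plain -> kv_both
pct_delta 9.74 7.58     # TTFT p90: kv_both -> kv_both + DR-fix
pct_delta 23.48 17.93   # E2E p90:  plain -> kv_both + DR-fix
```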

Two findings:
1. The "+45% kv_both penalty" claim from elastic_migration_v2 is OBSOLETE
   on the current codebase: kv_both is now *faster* than plain at p90.
   Likely fixed by e3a1d70 (RDMA-READ → bootstrap PUSH refactor) and
   the connector-mode delay_free_blocks extending cross-turn prefix
   cache hits on a 93%-intra-session-reuse trace.
2. DR-fix removes another 22% from TTFT p90 by skipping the
   O(|cache|) hash sync in build_connector_meta. The cache sweep with
   DR-fix shows the slope dropping from +94.5 to +2.3 μs per 1k blocks.
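
The slope figure can be read as the per-1k-block growth in build_connector_meta time. The commit does not say how it was fitted; the sketch below assumes an ordinary least-squares fit of time (μs) against cache size (1k blocks). The helper name `fit_slope` and the sample points are hypothetical, not measured data.

```bash
# fit_slope: least-squares slope of y on x, reading "x y" pairs from stdin.
# Hypothetical helper; the repo's analyze.py may compute the slope differently.
fit_slope() {
    awk '{ n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2 }
         END {
             # Need at least two points with distinct x values.
             if (n < 2 || n * sxx == sx * sx) exit 1
             printf "%.1f\n", (n * sxy - sx * sy) / (n * sxx - sx * sx)
         }'
}

# Synthetic points (cache size in 1k blocks, time in μs), constructed to lie
# on a line of slope 2; they are NOT results from this run.
printf '%s\n' "10 120" "20 140" "30 160" | fit_slope
```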

Adds:
- run_trace_replay_drfix.sh: A/B/C harness (env CT_DR_FIX gates patch)
- analyze_trace_replay.py: TTFT/TPOT/E2E delta analysis
- REPORT_TRACE_REPLAY.md: summary + reproduction
- results/20260526_1627_drfix/: cache-sweep with DR-fix
- results/trace_replay_20260526_1652/: full trace-replay A/B/C

Implication for EAR paper: the kv_both substrate is no longer the
bottleneck blocking session migration. The prior 4 migration reverts
were dominated by transfer overhead that has now been characterized
and (partially) removed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:13:50 +08:00

#!/bin/bash
# A/B re-measurement after the direct-RDMA-read hash-sync env-gate.
#
# Applies all instrumentation (v1+v2) AND the CT_DR_FIX patch, then runs
# the same workload as run_all.sh on:
#   plain                (control)
#   mooncake_both        (baseline: env not set → original behaviour)
#   mooncake_both_drfix  (with VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1)
# Each config runs in its own vLLM instance for the same duration. Patches
# are applied once before the first config and reverted once after the last.
#
# Usage: bash run_drfix.sh
# Env overrides: DURATION (default 240), RATE (default 1.5), CONFIGS (override list),
#                SEED, PORT, GPU_ID, MODEL_PATH, PYTHON, VLLM_ROOT
set -uo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
CT_DIR="$(cd "$HERE/.." && pwd)"
PROJ_DIR="$(cd "$HERE/../../.." && pwd)"
PYTHON="${PYTHON:-$PROJ_DIR/.venv/bin/python}"
VLLM_ROOT="${VLLM_ROOT:-$PROJ_DIR/.venv/lib/python3.12/site-packages/vllm}"
DURATION="${DURATION:-240}"
RATE="${RATE:-1.5}"
PORT="${PORT:-8000}"
GPU_ID="${GPU_ID:-0}"
MODEL_PATH="${MODEL_PATH:-$HOME/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
CONFIGS="${CONFIGS:-plain mooncake_both mooncake_both_drfix}"
SEED="${SEED:-12345}" # shared seed across configs → identical Poisson + content
DATE="$(date +%Y%m%d_%H%M)"
RUN_ROOT="$HERE/results/${DATE}_drfix"
mkdir -p "$RUN_ROOT"
echo "=== Cache-size sweep + DR-fix A/B ==="
echo "Run dir : $RUN_ROOT"
echo "Configs : $CONFIGS"
echo "Rate : $RATE Duration: ${DURATION}s Seed: $SEED"
echo ""
kill_all_vllm() {
    pkill -9 -f "VLLM::EngineCore" 2>/dev/null || true
    pkill -9 -f "vllm.entrypoints" 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    sleep 4
    # Poll until used GPU memory drops below 1000 MiB (up to 20 tries, 3 s apart).
    local used
    for _ in $(seq 1 20); do
        used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i "$GPU_ID" 2>/dev/null | tr -d ' ')
        [[ -n "$used" && "$used" -lt 1000 ]] && return 0
        sleep 3
    done
    echo "WARN: GPU $GPU_ID not free after kill" >&2
}
cleanup_patches() {
    "$PYTHON" "$HERE/apply_direct_read_fix.py" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
    "$PYTHON" "$HERE/apply_step_timing_v2.py" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
    [[ -f "$CT_DIR/patches/apply_step_timing.py" ]] && \
        "$PYTHON" "$CT_DIR/patches/apply_step_timing.py" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
}
trap 'kill_all_vllm; cleanup_patches' EXIT
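echo "[stage 0] applying v1 + v2 + DR_FIX patches"
# Abort if a patch fails to apply (assumes the apply scripts exit non-zero on
# failure); otherwise the configs would run against a partially patched vLLM.
# The EXIT trap reverts whatever was applied.
"$PYTHON" "$CT_DIR/patches/apply_step_timing.py" --apply --vllm-root "$VLLM_ROOT" || exit 1
"$PYTHON" "$HERE/apply_step_timing_v2.py" --apply --vllm-root "$VLLM_ROOT" || exit 1
"$PYTHON" "$HERE/apply_direct_read_fix.py" --apply --vllm-root "$VLLM_ROOT" || exit 1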
kill_all_vllm
for cfg in $CONFIGS; do
    cfg_dir="$RUN_ROOT/$cfg"
    mkdir -p "$cfg_dir"

    # Pick the launch script. mooncake_both_drfix reuses mooncake_both's launcher.
    case "$cfg" in
        mooncake_both_drfix) launch_cfg="mooncake_both" ;;
        *)                   launch_cfg="$cfg" ;;
    esac
    launch_script="$CT_DIR/launch/launch_${launch_cfg}.sh"
    if [[ ! -f "$launch_script" ]]; then
        echo "SKIP $cfg (no launch script at $launch_script)"
        continue
    fi

    echo ""
    echo "====== Config: $cfg (launcher=$launch_cfg) ======"
    export RUN_DIR="$cfg_dir"
    export PORT GPU_ID MODEL_PATH
    export AGENTIC_STEP_LOG_PATH="$cfg_dir/engine_step.jsonl"
    export CT_WORKER_STEP_LOG_PATH="$cfg_dir/worker_step.jsonl"
    export PYTHONPATH="$PROJ_DIR:${PYTHONPATH:-}"

    # The env-gated skip
    if [[ "$cfg" == "mooncake_both_drfix" ]]; then
        export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1
        echo " VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1 (hash sync skipped)"
    else
        unset VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC
    fi

    # Start each config with empty logs.
    : > "$cfg_dir/engine_step.jsonl"
    rm -f "$cfg_dir/worker_step.jsonl".*
    : > "$cfg_dir/requests.jsonl"

    # pipefail (set above) makes $? reflect a launcher failure, not tail's status.
    bash "$launch_script" 2>&1 | tail -5
    rc=$?
    if [[ $rc -ne 0 ]]; then
        echo "FAIL $cfg (launch rc=$rc), skipping bench"
        kill_all_vllm
        continue
    fi

    echo "[bench] running ${DURATION}s open-loop at rate=$RATE"
    "$PYTHON" "$HERE/run_cache_sweep.py" \
        --url "http://127.0.0.1:$PORT/v1/chat/completions" \
        --model "$MODEL_PATH" \
        --rate "$RATE" --duration "$DURATION" \
        --seed "$SEED" \
        --output-dir "$cfg_dir" 2>&1 | tail -8
    curl -s "http://127.0.0.1:$PORT/metrics" > "$cfg_dir/metrics_final.txt" 2>&1 || true

    echo "[teardown] $cfg"
    kill_all_vllm
done
echo ""
echo "[stage Z] reverting patches"
cleanup_patches
echo ""
echo "[analyze]"
"$PYTHON" "$HERE/analyze.py" --run-root "$RUN_ROOT"
echo ""
echo "Done. Artifacts: $RUN_ROOT"