Follow-up to Microbench 3 that finally tests H5 (cache-size
dependence) and instruments worker-side connector callbacks the
original patch missed.
Patch v2 (apply_step_timing_v2.py) adds:
  scheduler: `cache_size` field in engine_step.jsonl
  worker:    `get_finished_us` + `start_load_kv_us` in worker_step.r0.jsonl
  revert:    BLOCK_BEGIN/END sentinels for safe multi-line revert
             (the original v1 patch survives this v2's apply/revert cycle)
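A minimal sketch of how BLOCK_BEGIN/END sentinels keep a multi-line revert from touching an earlier patch. The marker format and the `apply_block` / `revert_block` names are hypothetical and not taken from apply_step_timing_v2.py; the source text is assumed to end with a newline.

```python
# Hypothetical marker format; the real patch's sentinel strings may differ.
BEGIN = "# BLOCK_BEGIN {tag}"
END = "# BLOCK_END {tag}"


def apply_block(source: str, anchor: str, payload: str, tag: str) -> str:
    """Insert `payload`, wrapped in sentinels, after the first line containing `anchor`."""
    begin, end = BEGIN.format(tag=tag), END.format(tag=tag)
    if begin in source:  # already applied: keep apply idempotent
        return source
    out, done = [], False
    for line in source.splitlines(keepends=True):
        out.append(line)
        if not done and anchor in line:
            out.append(f"{begin}\n{payload.rstrip()}\n{end}\n")
            done = True
    if not done:
        raise ValueError(f"anchor {anchor!r} not found")
    return "".join(out)


def revert_block(source: str, tag: str) -> str:
    """Drop every line from the BEGIN sentinel through the END sentinel for `tag`."""
    begin, end = BEGIN.format(tag=tag), END.format(tag=tag)
    out, skipping = [], False
    for line in source.splitlines(keepends=True):
        if line.strip() == begin:
            skipping = True
        elif line.strip() == end and skipping:
            skipping = False
        elif not skipping:
            out.append(line)
    return "".join(out)
```

Because revert only removes what sits between its own tagged markers, reverting the "v2" block leaves a "v1" block in the same file untouched.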
Driver: continuous open-loop (1.5 req/s, 4096x256 random per req)
that lets APC fill from 0 → ceiling within one vLLM lifetime so a
single run produces the full cache_size sweep. Decode-only steps
are filtered post-hoc to remove prefill-mix variance.
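A rough sketch of the post-hoc step, assuming each engine_step.jsonl record carries `cache_size` and `step_duration_us`. The `num_prefill_tokens` field used to spot decode-only steps is a hypothetical name for illustration, and the plain least-squares fit is a stand-in that may differ from what analyze.py does.

```python
import json
from typing import Iterable


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]


def decode_only(records: Iterable[dict]) -> list[dict]:
    # `num_prefill_tokens` is a hypothetical field name used only in this sketch.
    return [r for r in records if r.get("num_prefill_tokens", 0) == 0]


def slope_us_per_1k_blocks(records: list[dict]) -> float:
    """Ordinary least-squares slope of step_duration_us vs cache_size, per 1k blocks."""
    xs = [r["cache_size"] / 1000.0 for r in records]
    ys = [float(r["step_duration_us"]) for r in records]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

Usage would look like `slope_us_per_1k_blocks(decode_only(load_jsonl("engine_step.jsonl")))`, run once per config directory.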
Findings (H20 96GB, ceiling reached ~17.5k blocks; n=15-18k decode
steps per config):
config         | slope (μs / 1k blocks) | step_dur p50 @ |cache|=16.6k
---------------|------------------------|-------------------------------
mooncake_both  | +85.6                  | 1528 μs (build_meta=1442, 94%)
noop_connector | -0.8 (≈0)              | 79 μs
plain          | +1.0 (≈0)              | 84 μs
Worker-side get_finished p50/p90/p99 (μs/step):
mooncake_both: 180 / 257 / 333
noop_connector: 0 / 0 / 2
H5 PASSES. mooncake_both step_duration scales linearly with |cache|
because build_connector_meta walks set(cache.keys()) every step
(`mooncake_connector.py:434-450`). plain and noop are flat.
The previously-uninstrumented get_finished() adds a constant
180 μs/step on top — two `run_coroutine_threadsafe(...).result()`
blocking waits in kv_both mode (`mooncake_connector.py:1107-1137`)
fire every step even when no transfer is pending.
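To illustrate why a blocking `run_coroutine_threadsafe(...).result()` has a per-call cost even when the coroutine has nothing to do, here is a standalone measurement sketch. It is not the connector code: `poll_once` is a hypothetical idle poll, and the latency it reports depends on the machine and will not match the 180 μs above.

```python
import asyncio
import threading
import time


async def poll_once() -> list:
    """Stand-in for an idle poll: nothing pending, so return immediately."""
    return []


def start_background_loop() -> asyncio.AbstractEventLoop:
    """Run an event loop on a daemon thread, as a worker-side helper might."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def median_round_trip_us(loop: asyncio.AbstractEventLoop, n: int = 200) -> float:
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        # Blocks the calling thread until the loop thread has run the coroutine.
        asyncio.run_coroutine_threadsafe(poll_once(), loop).result()
        samples.append((time.perf_counter() - t0) * 1e6)
    samples.sort()
    return samples[len(samples) // 2]


if __name__ == "__main__":
    loop = start_background_loop()
    print(f"idle round trip p50: {median_round_trip_us(loop):.0f} μs")
    loop.call_soon_threadsafe(loop.stop)
```

Each call pays for a cross-thread wakeup and a future hand-off regardless of whether any transfer is pending, which is the kind of fixed per-step cost described above.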
Trace-replay reconciliation (APC ≈ 79% → |cache| ≈ 13k blocks):
build_meta @ 13k ≈ 1060 μs + get_finished ≈ 180 μs = 1.24 ms/step
On ~7 ms decode forward → +15-20% TPOT per step.
This explains most of the trace-replay +25% TPOT p90 gap from
single-instance per-step cost alone, leaving a smaller residual
for multi-instance coupling than originally assumed.
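The arithmetic behind the reconciliation, using only the figures quoted above. Comparing a p50-derived per-step cost against a p90 TPOT gap is a rough estimate, so the share is indicative only.

```python
# Figures quoted in the reconciliation above (all per decode step).
BUILD_META_US = 1060       # build_connector_meta at |cache| ≈ 13k blocks
GET_FINISHED_US = 180      # worker-side get_finished p50
DECODE_FORWARD_US = 7000   # ~7 ms decode forward pass
TPOT_GAP_PCT = 25.0        # trace-replay TPOT p90 gap


def overhead_pct(extra_us: float, base_us: float) -> float:
    """Extra per-step time as a percentage of the base step time."""
    return 100.0 * extra_us / base_us


extra_us = BUILD_META_US + GET_FINISHED_US
pct = overhead_pct(extra_us, DECODE_FORWARD_US)
share = pct / TPOT_GAP_PCT
print(f"{extra_us} μs/step -> +{pct:.1f}% TPOT, {share:.0%} of the {TPOT_GAP_PCT:.0f}% gap")
# prints: 1240 μs/step -> +17.7% TPOT, 71% of the 25% gap
```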
Two clear fixes pointed out in REPORT.md:
  1. replace the O(|cache|) per-step walk with an incremental delta
     listener using block_pool's add/remove callbacks
  2. short-circuit get_finished() when both producer/consumer
     queues are empty in kv_both
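A sketch of what the two fixes could look like in isolation. Everything here is hypothetical: `DeltaTracker`, its callback names, and the `get_finished` signature are illustrative and not the block_pool or connector API, and whether empty queues really imply no in-flight transfer in kv_both is an assumption that would need checking against the connector.

```python
from typing import Callable, Hashable


class DeltaTracker:
    """Accumulates cache changes between steps instead of re-walking the cache."""

    def __init__(self) -> None:
        self._added: set = set()
        self._removed: set = set()

    # Hypothetical callbacks; wiring them to block_pool is left out.
    def on_add(self, key: Hashable) -> None:
        if key in self._removed:
            self._removed.discard(key)  # remove then re-add cancels out
        else:
            self._added.add(key)

    def on_remove(self, key: Hashable) -> None:
        if key in self._added:
            self._added.discard(key)    # add then remove cancels out
        else:
            self._removed.add(key)

    def drain(self) -> tuple[set, set]:
        """Return (added, removed) since the last call; cost is O(|delta|)."""
        added, removed = self._added, self._removed
        self._added, self._removed = set(), set()
        return added, removed


def get_finished(producer_pending: int, consumer_pending: int,
                 slow_poll: Callable[[], tuple[set, set]]) -> tuple[set, set]:
    """Skip the blocking cross-thread poll when neither side has work queued."""
    if producer_pending == 0 and consumer_pending == 0:
        return set(), set()
    return slow_poll()
```

With this shape the per-step metadata build touches only keys that changed since the previous step, and the idle path of `get_finished` never crosses threads.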
Heavy raw artifacts (engine_step.jsonl, vllm_stdout/stderr,
.vllm.pid) are .gitignored — they re-derive from `bash run_all.sh`
and SUMMARY.md / per_config.json fully capture the conclusions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
137 lines
5.6 KiB
Bash
Executable File
#!/bin/bash
# Cache-size sweep orchestrator.
#
#   1. Apply CONNECTOR_TAX_PATCH (v1, if present), then stack
#      CT_CACHE_SWEEP_PATCH (v2: adds cache_size + worker timings) on top.
#   2. For each config in {plain, noop_connector, mooncake_both}:
#        launch vLLM → wait ready → 8-min open-loop bench → kill,
#        release GPU.
#   3. Revert CT_CACHE_SWEEP_PATCH, then CONNECTOR_TAX_PATCH, so a
#      follow-up run is clean.
#   4. Run analyze.py and emit figures + summary.
#
# Usage:  bash run_all.sh
#
# Env overrides:
#   DURATION     per-config bench duration (default 480 s)
#   RATE         open-loop rate (default 1.5 req/s)
#   PORT         vLLM port (default 8000)
#   GPU_ID       GPU index (default 0)
#   MODEL_PATH   model dir (default $HOME/models/Qwen/Qwen3-Coder-30B-A3B-Instruct)
#   CONFIGS      space-separated subset (default "plain noop_connector mooncake_both")
#   SKIP_PATCH   set to 1 to skip apply/revert (e.g. patch already applied)
#   PYTHON       Python interpreter (default $PROJ_DIR/.venv/bin/python)
#   VLLM_ROOT    installed vllm package dir (default under $PROJ_DIR/.venv)

set -uo pipefail

HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
CT_DIR="$(cd "$HERE/.." && pwd)"
PROJ_DIR="$(cd "$HERE/../../.." && pwd)"
PYTHON="${PYTHON:-$PROJ_DIR/.venv/bin/python}"
VLLM_ROOT="${VLLM_ROOT:-$PROJ_DIR/.venv/lib/python3.12/site-packages/vllm}"

DURATION="${DURATION:-480}"
RATE="${RATE:-1.5}"
PORT="${PORT:-8000}"
GPU_ID="${GPU_ID:-0}"
MODEL_PATH="${MODEL_PATH:-$HOME/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
CONFIGS="${CONFIGS:-plain noop_connector mooncake_both}"
SKIP_PATCH="${SKIP_PATCH:-0}"

DATE="$(date +%Y%m%d_%H%M)"
RUN_ROOT="$HERE/results/$DATE"
mkdir -p "$RUN_ROOT"

echo "=== Cache-size sweep ==="
echo "Run dir : $RUN_ROOT"
echo "vLLM root : $VLLM_ROOT"
echo "Configs : $CONFIGS"
echo "Rate : $RATE Duration: ${DURATION}s"
echo ""

# ── kill any leftover vLLM ────────────────────────────────────────────────
kill_all_vllm() {
    local used
    pkill -9 -f "VLLM::EngineCore" 2>/dev/null || true
    pkill -9 -f "vllm.entrypoints" 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    sleep 4
    # Poll (up to ~60 s) until GPU memory use drops below 1000 MiB.
    for _ in $(seq 1 20); do
        used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i "$GPU_ID" 2>/dev/null | tr -d ' ')
        # Numeric check first: a non-numeric reading would break the -lt test.
        [[ "$used" =~ ^[0-9]+$ && "$used" -lt 1000 ]] && return 0
        sleep 3
    done
    echo "WARN: GPU $GPU_ID not free after kill" >&2
}

# On any exit: stop vLLM, then revert v2 and (if present) v1.
cleanup() {
    kill_all_vllm
    if [[ "$SKIP_PATCH" != "1" ]]; then
        "$PYTHON" "$HERE/apply_step_timing_v2.py" --revert --vllm-root "$VLLM_ROOT" || true
        if [[ -f "$CT_DIR/patches/apply_step_timing.py" ]]; then
            "$PYTHON" "$CT_DIR/patches/apply_step_timing.py" --revert --vllm-root "$VLLM_ROOT" || true
        fi
    fi
}
trap cleanup EXIT

# ── patch ─────────────────────────────────────────────────────────────────
if [[ "$SKIP_PATCH" != "1" ]]; then
    echo "[stage 0] applying CONNECTOR_TAX_PATCH (v1) then CT_CACHE_SWEEP_PATCH (v2)"
    # v1 patch adds step_duration_us + build_meta_us; v2 stacks on top.
    if [[ -f "$CT_DIR/patches/apply_step_timing.py" ]]; then
        "$PYTHON" "$CT_DIR/patches/apply_step_timing.py" --apply --vllm-root "$VLLM_ROOT" || true
    fi
    # No `set -e` here, so check v2 explicitly: without it the logs lack cache_size.
    if ! "$PYTHON" "$HERE/apply_step_timing_v2.py" --apply --vllm-root "$VLLM_ROOT"; then
        echo "FATAL: CT_CACHE_SWEEP_PATCH failed to apply" >&2
        exit 1
    fi
fi

# ── per-config runs ───────────────────────────────────────────────────────
kill_all_vllm
for cfg in $CONFIGS; do
    cfg_dir="$RUN_ROOT/$cfg"
    mkdir -p "$cfg_dir"

    launch_script="$CT_DIR/launch/launch_${cfg}.sh"
    if [[ ! -f "$launch_script" ]]; then
        echo "SKIP $cfg (no launch script at $launch_script)"
        continue
    fi

    echo ""
    echo "====== Config: $cfg ======"
    export RUN_DIR="$cfg_dir"
    export PORT GPU_ID MODEL_PATH
    export AGENTIC_STEP_LOG_PATH="$cfg_dir/engine_step.jsonl"
    export CT_WORKER_STEP_LOG_PATH="$cfg_dir/worker_step.jsonl"
    export PYTHONPATH="$PROJ_DIR:${PYTHONPATH:-}"

    : > "$cfg_dir/engine_step.jsonl"        # truncate
    rm -f "$cfg_dir/worker_step.jsonl".*    # any leftover ranks
    : > "$cfg_dir/requests.jsonl"

    # Launch (the launch scripts already use common.sh).
    # pipefail (set above) makes $? reflect a launch failure even though
    # tail is the last command in the pipeline.
    bash "$launch_script" 2>&1 | tail -5
    rc=$?
    if [[ $rc -ne 0 ]]; then
        echo "FAIL $cfg (launch rc=$rc) — skipping bench"
        kill_all_vllm
        continue
    fi

    echo "[bench] running ${DURATION}s open-loop at rate=$RATE"
    "$PYTHON" "$HERE/run_cache_sweep.py" \
        --url "http://127.0.0.1:$PORT/v1/chat/completions" \
        --model "$MODEL_PATH" \
        --rate "$RATE" --duration "$DURATION" \
        --output-dir "$cfg_dir" 2>&1 | tail -8

    # Final fetch of /metrics so the analyze step has the ceiling.
    curl -s "http://127.0.0.1:$PORT/metrics" > "$cfg_dir/metrics_final.txt" 2>&1 || true

    echo "[teardown] $cfg"
    kill_all_vllm
done

# ── revert + analyze ──────────────────────────────────────────────────────
if [[ "$SKIP_PATCH" != "1" ]]; then
    echo ""
    echo "[stage Z] reverting CT_CACHE_SWEEP_PATCH then CONNECTOR_TAX_PATCH"
    "$PYTHON" "$HERE/apply_step_timing_v2.py" --revert --vllm-root "$VLLM_ROOT"
    if [[ -f "$CT_DIR/patches/apply_step_timing.py" ]]; then
        "$PYTHON" "$CT_DIR/patches/apply_step_timing.py" --revert --vllm-root "$VLLM_ROOT" || true
    fi
fi

echo ""
echo "[analyze]"
"$PYTHON" "$HERE/analyze.py" --run-root "$RUN_ROOT"
echo ""
echo "Done. Artifacts: $RUN_ROOT"