agentic-kvc/microbench/interference/run_sweep.sh
Gahow Wang f784e49c07 Microbench: prefill-decode interference + PD transfer lifecycle
Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.
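
The prefix-cache fix in item 1 can be illustrated with a minimal Bash sketch. It assumes the driver prepends a per-request tag to each prompt; the driver's actual implementation is not shown here, and `unique_tag` and `make_prompt` are hypothetical helper names. The tag combines a UUID with a nanosecond timestamp, so no two repetitions share a leading token sequence.

```shell
#!/bin/bash
# Sketch: build a prompt whose leading text differs on every call, so a
# prefix cache cannot reuse KV blocks from an earlier repetition.
# unique_tag and make_prompt are hypothetical helpers, not part of driver.py.
set -euo pipefail

unique_tag() {
    local uuid
    if [[ -r /proc/sys/kernel/random/uuid ]]; then
        # Linux: the kernel returns a fresh random UUID on every read.
        uuid=$(< /proc/sys/kernel/random/uuid)
    else
        uuid=$(uuidgen)
    fi
    # %N (nanoseconds) is a GNU date extension.
    printf '%s-%s' "$uuid" "$(date +%s%N)"
}

make_prompt() {
    local body=$1
    # The tag goes first because prefix caches match from the start of the prompt.
    printf '[%s] %s' "$(unique_tag)" "$body"
}

a=$(make_prompt "Summarize the following text.")
b=$(make_prompt "Summarize the following text.")
echo "$a"
echo "$b"
```

Putting the tag at the end of the prompt would not help: the shared body would still match as a cached prefix.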

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
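
One way to read the comparison above is as a per-request check: offload when the transfer cost is lower than the interference cost it avoids. The sketch below assumes both costs are available as integer milliseconds; `cost_ratio_pct` and `should_offload` are hypothetical names, and the inputs are placeholders rather than measurements from these benchmarks.

```shell
#!/bin/bash
set -euo pipefail

# cost_ratio_pct TRANSFER_MS INTERFERENCE_MS
# Prints the transfer cost as an integer percentage of the interference cost.
cost_ratio_pct() {
    local transfer_ms=$1 interference_ms=$2
    echo $(( 100 * transfer_ms / interference_ms ))
}

# should_offload TRANSFER_MS INTERFERENCE_MS
# Succeeds (exit status 0) when transferring costs less than interfering.
should_offload() {
    local transfer_ms=$1 interference_ms=$2
    (( transfer_ms < interference_ms ))
}

# Placeholder inputs, not measured values.
transfer_ms=300
interference_ms=1000

if should_offload "$transfer_ms" "$interference_ms"; then
    echo "offload (transfer is $(cost_ratio_pct "$transfer_ms" "$interference_ms")% of interference)"
else
    echo "keep prefill local"
fi
```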
2026-05-26 00:57:06 +08:00


#!/bin/bash
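# Run the interference microbenchmark sweep.
# Assumes vLLM is already running on the specified port.
#
# Usage: [REPS=n] bash run_sweep.sh [port] [chunk_size]
#   port        vLLM server port (default: 8000)
#   chunk_size  value passed to driver.py --chunk-size (default: 8192)
#   REPS        repetitions passed to driver.py --reps (default: 5)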
set -euo pipefail
PORT=${1:-8000}
CHUNK_SIZE=${2:-8192}
REPS=${REPS:-5}
OUTPUT_DIR="results/interference"
echo "=== Interference Microbench Sweep ==="
echo "Server: http://127.0.0.1:$PORT"
echo "Chunk size: $CHUNK_SIZE"
echo "Reps: $REPS"
echo "Output: $OUTPUT_DIR"
echo ""
# Quick sanity check: fail fast if the server does not answer.
curl -sf "http://127.0.0.1:$PORT/v1/models" > /dev/null || {
    echo "ERROR: vLLM not reachable on port $PORT" >&2
    exit 1
}
# Run from the script's directory so driver.py and the relative
# OUTPUT_DIR resolve against it.
cd "$(dirname "$0")"
python driver.py \
    --host 127.0.0.1 \
    --port "$PORT" \
    --chunk-size "$CHUNK_SIZE" \
    --decode-batch-sizes "0,1,2,4,6,8,12" \
    --prefill-tokens "512,1024,2048,4096,8192,16384,32768" \
    --reps "$REPS" \
    --output-dir "$OUTPUT_DIR"
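
To estimate how long a sweep will run, it can help to count the grid before launching it. The sketch below assumes driver.py runs the full cross product of decode batch sizes and prefill lengths, repeating each point REPS times; the script itself does not confirm that. The two lists are copied from the invocation above.

```shell
#!/bin/bash
set -euo pipefail

DECODE_BATCH_SIZES="0,1,2,4,6,8,12"
PREFILL_TOKENS="512,1024,2048,4096,8192,16384,32768"
REPS=${REPS:-5}

# Split the comma-separated lists into arrays.
IFS=',' read -r -a decode_sizes <<< "$DECODE_BATCH_SIZES"
IFS=',' read -r -a prefill_sizes <<< "$PREFILL_TOKENS"

# Assumed full cross product: one grid point per (decode, prefill) pair.
points=$(( ${#decode_sizes[@]} * ${#prefill_sizes[@]} ))
total=$(( points * REPS ))
echo "grid points: $points, total runs: $total"
```

Under that assumption the two seven-entry lists give 49 grid points, so the default REPS=5 means 245 runs.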