Two microbenchmarks quantifying the elastic offload decision:
1. Interference (corrected): cold prefill causes 14-214x TPOT p90
degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
The earlier run had a prefix-cache bug: deterministic prompts hit the
cache after rep 0, so later reps did not measure cold prefill. Fixed by
making every prompt unique with uuid+time_ns.
2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
measuring prefill→RDMA→decode startup overhead.
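The prefix-cache fix from item 1 can be sketched as follows. This is a minimal illustration, not the driver's actual code: `make_unique_prompt`, its parameters, and the filler text are hypothetical, and only the uuid + time_ns idea comes from the note above. The nonce is placed at the start of the prompt on the assumption that prefix caching matches from the first token, so a unique tail alone would not force a cold prefill.

```python
import time
import uuid


def make_unique_prompt(n_filler_words: int, filler: str = "lorem") -> str:
    """Return a prompt whose leading text differs on every call.

    Hypothetical helper: the nonce combines a random UUID with a
    nanosecond timestamp, mirroring the uuid+time_ns fix described above.
    """
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    body = " ".join([filler] * n_filler_words)
    # Nonce goes first so that no two prompts share a cacheable prefix.
    return f"[{nonce}] {body}"


if __name__ == "__main__":
    a = make_unique_prompt(8)
    b = make_unique_prompt(8)
    print(a != b)
```

Word count is only a rough proxy for token count; a real driver would likely size the prompt with the model's tokenizer.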
Key finding: offload wins at all operating points with P≥2048.
Transfer cost is 25-50% of interference cost, even with bulk Mooncake.
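The comparison behind the key finding can be written as a simple rule: offload when the transfer cost is lower than the interference cost it avoids. The sketch below assumes both costs are expressed in the same unit (milliseconds here); the function names are hypothetical and the numbers in the usage example are placeholders, not measurements from these benchmarks.

```python
def transfer_to_interference_ratio(transfer_ms: float, interference_ms: float) -> float:
    """Ratio of transfer cost to interference cost (both in milliseconds)."""
    if interference_ms <= 0:
        raise ValueError("interference_ms must be positive")
    return transfer_ms / interference_ms


def should_offload(transfer_ms: float, interference_ms: float) -> bool:
    """Offload the prefill when moving it costs less than the slowdown it would cause."""
    return transfer_ms < interference_ms


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(transfer_to_interference_ratio(30.0, 100.0))
    print(should_offload(30.0, 100.0))
```

A ratio in the reported 25-50% range is below 1, which is why the rule selects offload at those operating points.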
#!/bin/bash
# Run the interference microbenchmark sweep.
# Assumes vLLM is already running on the specified port.
#
# Usage: bash run_sweep.sh [port] [chunk_size]

set -euo pipefail

PORT=${1:-8000}
CHUNK_SIZE=${2:-8192}
REPS=${REPS:-5}
OUTPUT_DIR="results/interference"

echo "=== Interference Microbench Sweep ==="
echo "Server: http://127.0.0.1:$PORT"
echo "Chunk size: $CHUNK_SIZE"
echo "Reps: $REPS"
echo "Output: $OUTPUT_DIR"
echo ""

# Quick sanity check
curl -sf "http://127.0.0.1:$PORT/v1/models" > /dev/null || {
    echo "ERROR: vLLM not reachable on port $PORT"
    exit 1
}

cd "$(dirname "$0")"

python driver.py \
    --host 127.0.0.1 \
    --port "$PORT" \
    --chunk-size "$CHUNK_SIZE" \
    --decode-batch-sizes "0,1,2,4,6,8,12" \
    --prefill-tokens "512,1024,2048,4096,8192,16384,32768" \
    --reps "$REPS" \
    --output-dir "$OUTPUT_DIR"
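
To show the size of the sweep the script requests, the sketch below expands the two comma-separated lists into a grid of (decode batch size, prefill tokens) points. It is an assumption about how a driver could interpret these flags, not the actual logic of `driver.py`; `parse_int_list` and `sweep_points` are hypothetical names. D=0 is presumably a baseline with no concurrent decode.

```python
from itertools import product


def parse_int_list(csv: str) -> list[int]:
    """Parse a comma-separated string such as "0,1,2" into a list of ints."""
    return [int(item) for item in csv.split(",") if item.strip()]


def sweep_points(decode_csv: str, prefill_csv: str) -> list[tuple[int, int]]:
    """Cartesian product of decode batch sizes and prefill token counts."""
    return list(product(parse_int_list(decode_csv), parse_int_list(prefill_csv)))


if __name__ == "__main__":
    points = sweep_points("0,1,2,4,6,8,12", "512,1024,2048,4096,8192,16384,32768")
    print(len(points))  # 49 grid points (7 x 7)
```

If every grid point is repeated `REPS` times, the default of 5 would give 245 runs per sweep.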