Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.
Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
  = prefill_ttft / (num_tokens_during_prefill / D), where
  num_tokens_during_prefill is the decode-token count summed over all
  D streams. This is the average time each decode stream takes per
  token during the burst. The p50 of inter-token intervals is deceptive
  (with chunked prefill most intervals look normal); the burst average
  captures the true cost.
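
A minimal sketch of the headline metric, assuming decode timestamps are available as one list of per-token wall-clock times per stream. The function name, argument names, and sample timestamps are hypothetical and are not taken from driver.py or analyze_mb1.py:

```python
import math
from typing import Sequence


def effective_tpot_during_prefill(
    prefill_start: float,
    prefill_end: float,
    decode_token_times: Sequence[Sequence[float]],
) -> float:
    """Return prefill_ttft / (num_tokens_during_prefill / D), in seconds."""
    num_streams = len(decode_token_times)  # D
    if num_streams == 0:
        raise ValueError("need at least one decode stream")
    prefill_ttft = prefill_end - prefill_start
    # Count decode tokens, over all streams, stamped inside the prefill window.
    tokens_during = sum(
        1
        for stream in decode_token_times
        for t in stream
        if prefill_start <= t <= prefill_end
    )
    if tokens_during == 0:
        return math.inf  # no decode progress at all during the burst
    return prefill_ttft / (tokens_during / num_streams)


# Hypothetical timestamps (seconds) for D=2 streams around a 1.0 s prefill.
streams = [[0.2, 0.6, 1.5], [0.5, 1.2]]
tpot = effective_tpot_during_prefill(0.0, 1.0, streams)
print(f"{tpot * 1000:.0f} ms per token per stream")
```

Three tokens fall inside the window, so each of the two streams averages 1.5 tokens over the 1.0 s burst. Returning infinity for a fully stalled burst is a design choice of this sketch, not something the source specifies.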
Results (D=8 row, the most agentic-realistic case):
  P (tokens) | prefill_ttft | per-stream TPOT during prefill | penalty
        2048 |       143 ms |                          32 ms |      4×
        8192 |       583 ms |                         114 ms |     15×
       32768 |      4520 ms |                         388 ms |     52×
       65536 |     15615 ms |                         757 ms |     99×
      131072 |     56991 ms |                        1419 ms |    183×
Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.
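
The penalty column can be checked with a few lines of arithmetic. This sketch uses the rounded ~7.7 ms baseline, so it reproduces the table only approximately (for example about 50× rather than 52× at P=32768). The assumption here is that the table was computed from unrounded baseline values. The helper names are illustrative:

```python
BASELINE_TPOT_MS = 7.7  # rounded D=8 baseline quoted in the text

# (P tokens, prefill_ttft ms, per-stream TPOT during prefill ms), D=8 row
ROWS = [
    (2048, 143, 32),
    (8192, 583, 114),
    (32768, 4520, 388),
    (65536, 15615, 757),
    (131072, 56991, 1419),
]


def penalty(tpot_during_ms: float, baseline_ms: float = BASELINE_TPOT_MS) -> float:
    """Slowdown factor of a decode stream relative to the baseline TPOT."""
    return tpot_during_ms / baseline_ms


def tokens_per_stream(prefill_ttft_ms: float, tpot_during_ms: float) -> float:
    """Average number of tokens each stream emits during the burst."""
    return prefill_ttft_ms / tpot_during_ms


for p_tokens, ttft_ms, tpot_ms in ROWS:
    print(
        f"P={p_tokens:>6}: penalty ~{penalty(tpot_ms):.1f}x, "
        f"~{tokens_per_stream(ttft_ms, tpot_ms):.1f} tokens/stream during burst"
    )
```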
§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).
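
The comparison behind this claim can be sketched with the two ranges quoted above. The sketch takes the most generous reading, in which PD-disagg recovers the entire decode duration as benefit (the stated ceiling). It covers only the corner cases of the quoted ranges, not the measured MB2 cost curve, and the function name is hypothetical:

```python
def best_case_net_benefit_ms(decode_duration_ms: float, transfer_cost_ms: float) -> float:
    """Upper bound on net gain: the benefit cannot exceed the decode duration."""
    return decode_duration_ms - transfer_cost_ms


DECODE_MS = (50, 200)        # tool-call decode duration range from the text
TRANSFER_MS = (300, 10_000)  # KV-transfer cost range for agentic-size requests

# Longest decode paired with the cheapest transfer: most favorable to PD-disagg.
most_favorable = best_case_net_benefit_ms(max(DECODE_MS), min(TRANSFER_MS))
# Shortest decode paired with the costliest transfer: least favorable.
least_favorable = best_case_net_benefit_ms(min(DECODE_MS), max(TRANSFER_MS))

print(f"most favorable corner:  {most_favorable} ms")
print(f"least favorable corner: {least_favorable} ms")
```

Both corners come out negative, which is consistent with the cost curve sitting above the benefit band over the quoted ranges.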
The new figs/pd_cost_vs_benefit.png overlays the MB1 benefit ceiling
(50–200 ms band, capped by decode duration) onto the MB2 transfer-cost
curve and marks the agentic-distribution waypoints (trace mean, p90,
p95, p99) on the x-axis. Across the entire agentic distribution, the
cost curve sits above the benefit band.
Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit across the
  agentic distribution)
Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
  interleave decode more aggressively. Chunk-size sensitivity is
  flagged for the next run.
- Only D ≤ 8 was tested; at higher D the penalty may saturate or
  shrink further.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
#!/usr/bin/env bash
# Launch a SINGLE vLLM instance on dash1 for MB1 (prefill-decode interference).
# No kv_connector: MB1 measures intra-GPU phase interference, not transfer.
# chunked_prefill is enabled by default in vLLM 0.18.1 (this is the regime
# we want to characterize: how much benefit can PD-disagg buy on top of
# the existing chunked-prefill colocated baseline?).
#
# Usage:
#   GPU=0 PORT=8000 CHUNK_TOKENS=8192 bash mb1_launch.sh start
#   bash mb1_launch.sh status
#   bash mb1_launch.sh stop

set -eo pipefail

FRESH_ROOT="/home/admin/cpfs/wjh/agentic-kv-fresh"
VENV="${FRESH_ROOT}/.venv"
MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
LOGS_DIR="${LOGS_DIR:-${FRESH_ROOT}/mb1_logs}"

GPU="${GPU:-0}"
PORT="${PORT:-8000}"
MASTER="${MASTER:-29500}"
# max_num_batched_tokens controls the chunked-prefill chunk granularity.
# vLLM 0.18.1 default is 8192; we keep that as the headline run and
# optionally repeat at 32768 to expose the chunk-size effect.
CHUNK_TOKENS="${CHUNK_TOKENS:-8192}"

mkdir -p "${LOGS_DIR}"

# Kill any server previously started on ${PORT}, plus its engine workers.
stop_local() {
    pkill -9 -f "vllm serve.*--port ${PORT} " 2>/dev/null || true
    # NOTE: this pattern is not port-specific; it matches the EngineCore
    # processes of every vLLM instance on this host.
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 2
}

case "${1:-start}" in
    stop)
        stop_local; exit 0;;
    status)
        if curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; then
            echo "port ${PORT}: UP"
        else
            echo "port ${PORT}: DOWN"
        fi
        exit 0;;
    start) ;;
    *) echo "Unknown command: $1"; exit 1;;
esac

stop_local
source "${VENV}/bin/activate"

echo "[mb1] launching: gpu=${GPU} port=${PORT} chunk_tokens=${CHUNK_TOKENS} (no kv_connector)"

PYTHONHASHSEED=42 \
CUDA_VISIBLE_DEVICES="${GPU}" \
MASTER_PORT="${MASTER}" \
nohup vllm serve "${MODEL}" \
    --host 0.0.0.0 --port "${PORT}" \
    --tensor-parallel-size 1 \
    --trust-remote-code --enable-prefix-caching \
    --dtype auto --gpu-memory-utilization 0.9 \
    --max-model-len 200000 \
    --max-num-batched-tokens "${CHUNK_TOKENS}" \
    --enable-prompt-tokens-details \
    > "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" 2>&1 &
disown

# Poll /health every 2 s; give up after 180 tries (6 minutes).
echo "[mb1] waiting for /health on port ${PORT}..."
tries=0
while ! curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; do
    tries=$((tries+1))
    if [ "${tries}" -gt 180 ]; then
        echo "[mb1] FATAL port ${PORT} did not come up in 6 min"
        tail -40 "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" || true
        exit 1
    fi
    sleep 2
done
echo "[mb1] UP on $(hostname -s):${PORT} (GPU ${GPU}, chunk_tokens=${CHUNK_TOKENS})"