agentic-kvc/microbench/fresh_setup/mb1_launch.sh
Gahow Wang 029821c1b6 MB1: prefill-decode interference under chunked-prefill default; §3.2 headline
Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.

Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
  defined as prefill_ttft / (num_tokens_during_prefill / D), i.e. the
  average time each decode stream takes per token during the burst.
  The p50 of inter-token intervals is deceptive (chunked-prefill makes
  most intervals look normal); the burst average gives the true cost.
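
A minimal sketch of the headline-metric arithmetic, assuming the three
inputs are already known per (D, P) cell. The function name and the
sample numbers are hypothetical and are not taken from driver.py or
from the sweep:

```shell
# Effective per-stream TPOT (ms) during a prefill burst:
#   prefill_ttft_ms / (num_tokens_during_prefill / D)
# Hypothetical helper for illustration only.
effective_tpot_ms() {
    local ttft_ms="$1" tokens_during="$2" d="$3"
    awk -v t="${ttft_ms}" -v n="${tokens_during}" -v d="${d}" 'BEGIN {
        # Guard against an empty burst window or a zero stream count.
        if (n <= 0 || d <= 0) { print "nan"; exit }
        printf "%.1f\n", t / (n / d)
    }'
}

# Made-up inputs: a 1000 ms burst during which 8 streams emit 40 tokens total.
effective_tpot_ms 1000 40 8   # prints 200.0
```

Each stream emits 40 / 8 = 5 tokens in 1000 ms, hence 200 ms per token.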

Results (D=8 row, the most agentic-realistic case):
  P (tokens) | prefill_ttft (ms) | per-stream TPOT during prefill (ms) | penalty
        2048 |               143 |                                  32 |      4×
        8192 |               583 |                                 114 |     15×
       32768 |              4520 |                                 388 |     52×
       65536 |             15615 |                                 757 |     99×
      131072 |             56991 |                                1419 |    183×

Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.
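
The penalty column reads as the during-prefill TPOT divided by the
baseline TPOT. A small sketch of that ratio follows; the function name
is hypothetical, and because the baseline is only quoted as ~7.7 ms,
the rounded value reproduces the table to within a few percent rather
than exactly:

```shell
# Slowdown factor = TPOT during the prefill burst / baseline TPOT.
# Hypothetical helper for illustration only.
penalty_x() {
    local during_ms="$1" baseline_ms="$2"
    awk -v a="${during_ms}" -v b="${baseline_ms}" 'BEGIN {
        if (b <= 0) { print "nan"; exit }
        printf "%.1f\n", a / b
    }'
}

# P=131072 row with the rounded 7.7 ms baseline (the table reports 183×).
penalty_x 1419 7.7   # prints 184.3
```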

§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).

The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.

Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
  kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
  microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
  per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
  pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit across the
  agentic distribution)

Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
  interleave decode more aggressively. Chunk-size sensitivity is
  flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:25:09 +08:00

81 lines
2.5 KiB
Bash

#!/usr/bin/env bash
# Launch a SINGLE vLLM instance on dash1 for MB1 (prefill-decode interference).
# No kv_connector — MB1 measures intra-GPU phase interference, not transfer.
# chunked_prefill is enabled by default in vLLM 0.18.1 (this is the regime
# we want to characterize: how much benefit can PD-disagg buy on top of
# the existing chunked-prefill colocated baseline?).
#
# Usage:
#   GPU=0 PORT=8000 CHUNK_TOKENS=8192 bash mb1_launch.sh start
#   bash mb1_launch.sh status
#   bash mb1_launch.sh stop
set -eo pipefail
FRESH_ROOT="/home/admin/cpfs/wjh/agentic-kv-fresh"
VENV="${FRESH_ROOT}/.venv"
MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
LOGS_DIR="${LOGS_DIR:-${FRESH_ROOT}/mb1_logs}"
GPU="${GPU:-0}"
PORT="${PORT:-8000}"
MASTER="${MASTER:-29500}"
# max_num_batched_tokens — controls the chunked-prefill chunk granularity.
# vLLM 0.18.1 default is 8192; we keep that as the headline run and
# optionally repeat at 32768 to expose the chunk-size effect.
CHUNK_TOKENS="${CHUNK_TOKENS:-8192}"
mkdir -p "${LOGS_DIR}"
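stop_local() {
    # Kill the server bound to this port, then any leftover engine workers.
    # NOTE: the EngineCore pattern is not port-scoped, so it also matches
    # engine processes belonging to any other vLLM instance on this host.
    pkill -9 -f "vllm serve.*--port ${PORT} " 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 2
}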
case "${1:-start}" in
    stop)
        stop_local; exit 0;;
    status)
        if curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; then
            echo "port ${PORT}: UP"
        else
            echo "port ${PORT}: DOWN"
        fi
        exit 0;;
    start) ;;
    *) echo "Unknown command: $1 (expected start|status|stop)" >&2; exit 1;;
esac
stop_local
source "${VENV}/bin/activate"
echo "[mb1] launching: gpu=${GPU} port=${PORT} chunk_tokens=${CHUNK_TOKENS} (no kv_connector)"
PYTHONHASHSEED=42 \
CUDA_VISIBLE_DEVICES="${GPU}" \
MASTER_PORT="${MASTER}" \
nohup vllm serve "${MODEL}" \
    --host 0.0.0.0 --port "${PORT}" \
    --tensor-parallel-size 1 \
    --trust-remote-code --enable-prefix-caching \
    --dtype auto --gpu-memory-utilization 0.9 \
    --max-model-len 200000 \
    --max-num-batched-tokens "${CHUNK_TOKENS}" \
    --enable-prompt-tokens-details \
    > "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" 2>&1 &
disown
echo "[mb1] waiting for /health on port ${PORT}..."
tries=0
while ! curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; do
    tries=$((tries+1))
    if [ "${tries}" -gt 180 ]; then
        echo "[mb1] FATAL port ${PORT} did not come up in 6 min"
        tail -40 "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" || true
        exit 1
    fi
    sleep 2
done
echo "[mb1] UP on $(hostname -s):${PORT} (GPU ${GPU}, chunk_tokens=${CHUNK_TOKENS})"
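
The header comment mentions optionally repeating the run at 32768 to expose the chunk-size effect. A hedged sketch of a wrapper for that repeat is shown here. The `LAUNCHER` path, the `DRY_RUN` switch, and the `run` and `sweep_chunks` helpers are assumptions made for illustration, and the driver invocation is left out because its command line is not shown in this file:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: start and stop mb1_launch.sh once per chunk size.
set -euo pipefail

LAUNCHER="${LAUNCHER:-./mb1_launch.sh}"   # assumed location of the launcher
DRY_RUN="${DRY_RUN:-1}"                   # 1 = print commands, do not run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

sweep_chunks() {
    local chunk
    for chunk in "$@"; do
        run env GPU=0 PORT=8000 CHUNK_TOKENS="${chunk}" bash "${LAUNCHER}" start
        # The benchmark driver would run here; its CLI is not shown in this
        # file, so no invocation is guessed.
        run bash "${LAUNCHER}" stop
    done
}

sweep_chunks 8192 32768
```

With the default `DRY_RUN=1` the wrapper only prints the start and stop commands for each chunk size, which makes it safe to check the sequence before setting `DRY_RUN=0` on the benchmark host.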