MB5 driver fixes: bash env-prefix + replayer flag names + python date math

Two bugs caught by 8C smoke:

mb5_launch.sh
  ${env_bp_arg} expanded as a literal command line prefix doesn't work
  when env_bp_arg is itself a variable — bash only treats VAR=val as
  an env assignment if it sees the literal in the parsed command, not
  after expansion. Fix: always export VLLM_MOONCAKE_BOOTSTRAP_PORT as
  a literal, defaulting to 9999 when caller passed no port (consumer
  mode ignores the var so the placeholder is harmless).

mb5_run.sh
  replayer's actual CLI flags are --trace / --output / --endpoint /
  --model, not the --*-path / --*-name variants I had. Plus dash1
  has no `bc`; compute wall_clock_s via python instead.

Both fixed; 8C smoke (CONFIG=8C REPS=1 REQUEST_LIMIT=20) now runs
end-to-end in ~30 s:
  - 8 vLLM kv_both instances on GPU 0-7 come up
  - replayer round-robins 20 reqs across them
  - MB5 instrumentation captures 8 snapshot files (one per EngineCore
    PID), ranging 7-139 snapshots each = ~10 Hz throttle works
  - plot_kv_pool_timeline.py renders the stacked-area + queue-depth
    chart cleanly (figs/mb5_smoke/*.png)

Pipeline validated. Ready for the real PD-ratio sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-27 23:23:23 +08:00
parent e9abd70c8d
commit e0d3b5150a
2 changed files with 20 additions and 12 deletions

View File

@@ -71,19 +71,23 @@ run_one() {
local t0
t0=$(date +%s.%N)
if ! PYTHONPATH="${FRESH_ROOT}" python -m replayer \
--endpoint-url "${endpoints}" \
--trace-path "${TRACE}" \
--output-path "${replay_out}" \
--model-name "${MODEL_NAME}" \
--endpoint "${endpoints}" \
--trace "${TRACE}" \
--output "${replay_out}" \
--model "${MODEL_NAME}" \
${REQUEST_LIMIT_ARG} \
> "${OUT_ROOT}/${config}_rep${rep}_replay.log" 2>&1; then
local t1=$(date +%s.%N)
echo "[mb5-run] REPLAY FAILED after $(echo "$t1 - $t0" | bc) s; see ${OUT_ROOT}/${config}_rep${rep}_replay.log"
local t1
t1=$(date +%s.%N)
local wall=$(python -c "print(${t1} - ${t0})")
echo "[mb5-run] REPLAY FAILED after ${wall} s; see ${OUT_ROOT}/${config}_rep${rep}_replay.log"
bash "${LAUNCH}" stop > /dev/null 2>&1 || true
return 1
fi
local t1=$(date +%s.%N)
local wall_clock_s=$(echo "$t1 - $t0" | bc)
local t1
t1=$(date +%s.%N)
local wall_clock_s
wall_clock_s=$(python -c "print(${t1} - ${t0})")
echo "[mb5-run] replay done in ${wall_clock_s}s"
echo "${wall_clock_s}" > "${rundir}/wall_clock_s.txt"