MB5 driver fixes: bash env-prefix + replayer flag names + python date math
Bugs in two scripts, caught by the 8C smoke test:
mb5_launch.sh
Prefixing the command line with ${env_bp_arg} does not set an
environment variable: bash treats VAR=val as an assignment only when
it appears literally in the parsed command, not when it is produced
by expansion, so the expanded text is run as the command name
instead. Fix: always export VLLM_MOONCAKE_BOOTSTRAP_PORT explicitly,
defaulting to 9999 when the caller passed no port (consumer mode
ignores the variable, so the placeholder is harmless).
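The parse-time rule can be reproduced in isolation. This is a minimal sketch, assuming bash; DEMO_PORT and prefix are hypothetical names used only for illustration, not variables from mb5_launch.sh:

```shell
#!/usr/bin/env bash
# DEMO_PORT and prefix are hypothetical names for illustration only.
unset DEMO_PORT
prefix="DEMO_PORT=9999"

# 1) Literal assignment prefix: recognized at parse time, so the child sees it.
literal=$(DEMO_PORT=9999 sh -c 'echo "${DEMO_PORT:-unset}"')

# 2) Same text arriving via expansion: the shell treats it as the command
#    name and fails with "command not found" (exit status 127).
${prefix} sh -c 'echo "${DEMO_PORT:-unset}"' 2>/dev/null
expanded_status=$?

# 3) The fix pattern: export explicitly, with a default.
export DEMO_PORT="${DEMO_PORT:-9999}"
exported=$(sh -c 'echo "${DEMO_PORT:-unset}"')

echo "literal=${literal} expanded_status=${expanded_status} exported=${exported}"
```

An unconditional export with a default keeps the launch line identical whether or not the caller supplied a port, which is the property the fix relies on.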
mb5_run.sh
replayer's actual CLI flags are --endpoint / --trace / --output /
--model, not the --endpoint-url / --trace-path / --output-path /
--model-name variants I had. Also, dash1 has no `bc`, so
wall_clock_s is now computed via python instead.
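One way to do the subtraction without `bc` is sketched below. It assumes `python3` is on PATH and GNU `date` (for `%N`); `elapsed_s` is a hypothetical helper, not a function in mb5_run.sh. Passing the timestamps as arguments rather than interpolating them into the Python source is a choice made for this sketch, not what the commit does.

```shell
#!/usr/bin/env bash
# elapsed_s START END: print END - START in seconds with 3 decimals.
# Hypothetical helper; assumes python3 is available.
elapsed_s() {
    python3 -c 'import sys; print(f"{float(sys.argv[2]) - float(sys.argv[1]):.3f}")' "$1" "$2"
}

# Timing a real command (GNU date assumed for the %N nanosecond field).
t0=$(date +%s.%N)
sleep 0.2
t1=$(date +%s.%N)
wall_clock_s=$(elapsed_s "${t0}" "${t1}")
echo "elapsed: ${wall_clock_s}s"

# Fixed inputs, to check the arithmetic independently of the clock.
fixed=$(elapsed_s 10.25 12.75)
echo "fixed: ${fixed}"
```

Formatting to three decimals keeps wall_clock_s.txt free of floating-point noise from the subtraction.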
Both fixed; 8C smoke (CONFIG=8C REPS=1 REQUEST_LIMIT=20) now runs
end-to-end in ~30 s:
- 8 vLLM kv_both instances on GPU 0-7 come up
- replayer round-robins 20 reqs across them
- MB5 instrumentation captures 8 snapshot files (one per EngineCore
PID) with 7-139 snapshots each, so the ~10 Hz throttle works
- plot_kv_pool_timeline.py renders the stacked-area + queue-depth
chart cleanly (figs/mb5_smoke/*.png)
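For reference, a plain round-robin split of 20 requests over 8 endpoints can be sketched as below. This is an illustration only: the replayer's actual dispatch code is not shown in this commit, and the variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative round-robin: request i goes to endpoint (i mod n_endpoints).
n_endpoints=8
n_requests=20

counts=()
for ((e = 0; e < n_endpoints; e++)); do counts[e]=0; done

for ((i = 0; i < n_requests; i++)); do
    e=$(( i % n_endpoints ))
    counts[e]=$(( counts[e] + 1 ))
done

# Requests per endpoint, in endpoint order.
echo "${counts[*]}"
```

With these numbers the first four endpoints receive three requests each and the remaining four receive two.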
Pipeline validated. Ready for the real PD-ratio sweep.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -71,19 +71,23 @@ run_one() {
     local t0
     t0=$(date +%s.%N)
     if ! PYTHONPATH="${FRESH_ROOT}" python -m replayer \
-        --endpoint-url "${endpoints}" \
-        --trace-path "${TRACE}" \
-        --output-path "${replay_out}" \
-        --model-name "${MODEL_NAME}" \
+        --endpoint "${endpoints}" \
+        --trace "${TRACE}" \
+        --output "${replay_out}" \
+        --model "${MODEL_NAME}" \
         ${REQUEST_LIMIT_ARG} \
         > "${OUT_ROOT}/${config}_rep${rep}_replay.log" 2>&1; then
-        local t1=$(date +%s.%N)
-        echo "[mb5-run] REPLAY FAILED after $(echo "$t1 - $t0" | bc) s; see ${OUT_ROOT}/${config}_rep${rep}_replay.log"
+        local t1
+        t1=$(date +%s.%N)
+        local wall=$(python -c "print(${t1} - ${t0})")
+        echo "[mb5-run] REPLAY FAILED after ${wall} s; see ${OUT_ROOT}/${config}_rep${rep}_replay.log"
         bash "${LAUNCH}" stop > /dev/null 2>&1 || true
         return 1
     fi
-    local t1=$(date +%s.%N)
-    local wall_clock_s=$(echo "$t1 - $t0" | bc)
+    local t1
+    t1=$(date +%s.%N)
+    local wall_clock_s
+    wall_clock_s=$(python -c "print(${t1} - ${t0})")
     echo "[mb5-run] replay done in ${wall_clock_s}s"
     echo "${wall_clock_s}" > "${rundir}/wall_clock_s.txt"