agentic-kvc/scripts/gpu_monitor.sh
Gahow Wang 67149130be Add GPU utilization A/B test and fix cache-aware proxy bugs
- GPU monitor: 5s interval nvidia-smi sampling during benchmarks
- A/B test script: clean restart + monitor + benchmark for Combined vs PD-Sep
- Fixed proxy: await bootstrap init (fixes a race condition) and normalize LB scoring
- Fixed port conflict: proxy now uses 9090 to avoid clashing with bootstrap on 9000

Key finding: PD-Sep mean GPU utilization is about 40% of Combined (12.4% vs 30.5%)
- Decode GPUs: mean=7.8%, max=47% (memory-bound, compute wasted)
- Prefill GPUs: active only 17% of samples (bursty, idle between requests)
- Combined: 8 GPUs flexibly used, mean=30.5%, active=64%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 22:13:38 +08:00

#!/bin/bash
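# Sample per-GPU utilization, memory, and power at a fixed interval and write CSV rows.
# Usage: bash gpu_monitor.sh [output_file] [interval_s]
#   output_file defaults to /tmp/gpu_util.csv (overwritten on start), interval_s to 5
# Runs until killed (Ctrl+C or kill)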
OUT="${1:-/tmp/gpu_util.csv}"
INTERVAL="${2:-5}"
echo "timestamp,gpu,util_pct,mem_used_mb,mem_total_mb,power_w" > "$OUT"
while true; do
    TS=$(date +%s.%N)
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,power.draw \
        --format=csv,noheader,nounits 2>/dev/null |
        while IFS=', ' read -r idx util mem_used mem_total power; do
            echo "$TS,$idx,$util,$mem_used,$mem_total,$power"
        done >> "$OUT"
    sleep "$INTERVAL"
done
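
The commit message reports per-GPU mean, max, and "active" share, but the script that derived those numbers is not shown here. The following is a minimal sketch of how the CSV written by gpu_monitor.sh could be summarized with awk. The function name `summarize_gpu_csv` and the `threshold` argument are hypothetical, and treating a sample as "active" when `util_pct` is above 0 is an assumption, since the commit does not define the cutoff.

```shell
#!/bin/bash
# Hypothetical helper: summarize a CSV produced by gpu_monitor.sh.
# Prints one line per GPU with mean and max utilization and the share of
# samples above a threshold. The default threshold of 0 is an assumption;
# the commit message does not say how "active" was defined.
summarize_gpu_csv() {
    local csv="$1" threshold="${2:-0}"
    awk -F, -v thr="$threshold" '
        NR == 1 { next }                # skip the CSV header row
        $3 !~ /^[0-9.]+$/ { next }      # skip non-numeric utilization such as [N/A]
        {
            n[$2]++                     # sample count per GPU index
            sum[$2] += $3               # running total of util_pct
            if ($3 > max[$2]) max[$2] = $3
            if ($3 > thr) active[$2]++
        }
        END {
            for (g in n)
                printf "gpu=%s mean=%.1f max=%d active=%.0f%%\n",
                       g, sum[g] / n[g], max[g], 100 * active[g] / n[g]
        }
    ' "$csv" | sort -t= -k2n            # awk iteration order is unspecified, so sort by GPU index
}
```

After a benchmark run, `summarize_gpu_csv /tmp/gpu_util.csv` prints one summary line per GPU, and a second argument such as `10` raises the "active" cutoff to 10% utilization.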