Files
agentic-pd-hybrid/scripts/sweep_tp1_v5_optD_profile.sh
kzlin 51f5386691 profile(kvc): add D KV pool timeseries poller + analyzer for v6 root-cause
v5 dropped errors but pushed session-cap fallback to 46-51%. Before adding
v6 mitigations we need to attribute that capacity loss to one of:
  (a) active sessions — real footprint
  (b) idle-evictable sessions — LRU not aggressive enough
  (c) prefill backup blocks / in-flight / fragmentation — release timing

Without this it's all guessing. Plumb a 1Hz poller into replay that hits
each P/D worker's /server_info, captures session_cache + memory_usage, and
writes a per-worker time-series JSONL to <run_dir>/d-pool-timeseries.jsonl.
Off by default (--pool-poll-interval-s 0); v5+profile sweep enables it at
1.0s. Per-tick HTTP cost is ~8 parallel /server_info calls — negligible
relative to the 50min run.

Analyzer (scripts/analysis/analyze_pool_timeseries.py) decomposes each D's
capacity into active_held / idle_evictable / other (= cap-held-avail, the
backup-blocks bucket) / free, and reports session residency churn across
workers as a starvation/thrashing signal.

Mock-tested poller end-to-end (cancellation clean, file flushed, sessions
captured); analyzer validated against synthetic timeseries.

Next: run scripts/sweep_tp1_v5_optD_profile.sh on hardware (~90min), then
analyze results to pick a v6 direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:04:21 +08:00

126 lines
4.5 KiB
Bash
Executable File

#!/bin/bash
# TP1 v5 + profiling — re-run the v5 (Option D) config with the new
# d-pool-timeseries poller enabled, so we can attribute each session-cap
# fallback to actual D KV pool occupancy (held vs available vs idle-evictable
# vs prefill-backup) instead of guessing.
#
# Output:
# outputs/qwen3-30b-tp1-v5-optD-profile/
# ├── kvcache-centric-kv-aware-worker-admission-<ts>/
# │ ├── request-metrics.jsonl
# │ ├── request-metrics.jsonl.summary.json
# │ └── d-pool-timeseries.jsonl ← NEW (1Hz P/D /server_info snapshots)
# ├── exp1_1p7d_kvc_optD_profile_metrics.jsonl
# └── exp2_2p6d_kvc_optD_profile_metrics.jsonl
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v5-optD-profile
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
POLL_INTERVAL=1.0
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
if [ -f "$run_dir/d-pool-timeseries.jsonl" ]; then
cp "$run_dir/d-pool-timeseries.jsonl" "$OUTPUT/${label}_pool_timeseries.jsonl"
log "Pool timeseries: $(wc -l < $OUTPUT/${label}_pool_timeseries.jsonl) rows"
else
log "WARNING: no d-pool-timeseries.jsonl produced"
fi
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl + ${label}_pool_timeseries.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 v5 + profile sweep (Option D + ${POLL_INTERVAL}s pool polling)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Profiling: --pool-poll-interval-s $POLL_INTERVAL (writes d-pool-timeseries.jsonl)"
########################################
# Experiment 1: 1P + 7D KVC kv-aware Option D + profile
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware Option D + profile ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--pool-poll-interval-s $POLL_INTERVAL
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp1_1p7d_kvc_optD_profile" "$EXP1_DIR"
########################################
# Experiment 2: 2P + 6D KVC kv-aware Option D + profile
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware Option D + profile ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--pool-poll-interval-s $POLL_INTERVAL
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_2p6d_kvc_optD_profile" "$EXP2_DIR"
log ""
log "=== ALL TP1 V5+PROFILE EXPERIMENTS DONE ==="