KVC v2 beats 4DP at ts=1 same-scale on 7/8 metrics: TTFT mean -24%, p50 -54%, p90 -64%; lat mean -0.8%, p50 -12.6%, p90 -0.7%. Direct-to-D rate jumped 42.8% -> 91.7%. REFACTOR_PLAN_V1 scenario C achieved.

Two-knob fix:
- reset-on-success blacklist decay: clear the (sess, D) reject counter on a successful direct-to-D path. Eliminates the v1 thrashing where session 6880 was stable on decode-1 for 70 turns, then collapsed to 75 D-changes after cumulative transient pressure tripped the permanent blacklist.
- bump the --kvcache-direct-max-uncached-tokens default 2048 -> 8192 via CLI flag. 41% of v1 fallbacks were 'real-large-append' (>2048-token append); raising the threshold lets these go through the direct-to-D fast path.

Code:
- policies.py: RoutingState.session_d_rejects counter + KvAwarePolicy migration_reject_threshold; degenerate fallback picks the least-rejected D.
- replay.py: record_admission_reject + reset-on-success in _run_request; _fallthrough_reason classifies turn-2+ fall-throughs as session-not-resident / real-large-append / etc., replacing the misleading 'large-append' suffix (TEAM_REPORT §2.7).
- cli.py + benchmark.py: --kvcache-migration-reject-threshold flag wiring.

Docs:
- REFACTOR_PLAN_V1_ZH.md: forward-looking plan after ts=1 validation.
- MIGRATION_V1_FINDINGS_ZH.md: v1 thrashing root-cause analysis.
- V2_RESULTS_ZH.md: v2 results, scenario C achievement, attribution.
- TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md: comprehensive team report.

Scripts:
- sweep_ts1_kvc_n3_plus_dp.sh: ts=1 baseline (KVC 1P3D N=3 + 4DP CA).
- sweep_ts1_migration_v1.sh / v2.sh: validation runs.
- analyze_ts1_validation.py: 4-way comparison analyzer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
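The reset-on-success decay described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in policies.py or replay.py: the `RejectTracker` class, its `record_success`, `is_blacklisted` and `pick_decode` methods, and the threshold value of 3 are all hypothetical; only the names `session_d_rejects`, `migration_reject_threshold` and `record_admission_reject` come from the commit message.

```python
class RejectTracker:
    """Hypothetical stand-in for a per-(session, D) reject counter."""

    def __init__(self, migration_reject_threshold=3):
        # Illustrative threshold; the commit message does not state a default.
        self.migration_reject_threshold = migration_reject_threshold
        self.session_d_rejects = {}  # (session, decode worker) -> reject count

    def record_admission_reject(self, session, decode):
        key = (session, decode)
        self.session_d_rejects[key] = self.session_d_rejects.get(key, 0) + 1

    def record_success(self, session, decode):
        # Reset-on-success: one good direct-to-D turn clears earlier rejects,
        # so scattered transient pressure cannot accumulate into a permanent ban.
        self.session_d_rejects.pop((session, decode), None)

    def is_blacklisted(self, session, decode):
        count = self.session_d_rejects.get((session, decode), 0)
        return count >= self.migration_reject_threshold

    def pick_decode(self, session, preferred, candidates):
        if not self.is_blacklisted(session, preferred):
            return preferred
        allowed = [d for d in candidates if not self.is_blacklisted(session, d)]
        if allowed:
            return allowed[0]
        # Degenerate case: every D is over the threshold, take the least-rejected.
        return min(candidates, key=lambda d: self.session_d_rejects.get((session, d), 0))


workers = ["decode-1", "decode-2", "decode-3"]
tracker = RejectTracker(migration_reject_threshold=3)
for _ in range(2):
    tracker.record_admission_reject("6880", "decode-1")
tracker.record_success("6880", "decode-1")       # counter cleared here
for _ in range(2):
    tracker.record_admission_reject("6880", "decode-1")
sticky = tracker.pick_decode("6880", "decode-1", workers)
print(sticky)  # decode-1
```

In this toy run the session sees four rejects in total but never three in a row, so it stays on decode-1; without the reset, the cumulative count would have crossed the threshold and forced a D-change.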
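The capacity-ratio estimate in the script header below can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the ~92K tokens per D worker and the ≈ 1.5M working set are taken as stated in the header rather than recomputed (a straight 52 × 50K product would be 2.6M), and the constant and function names are hypothetical.

```python
KV_TOKENS_PER_DECODE = 92_000    # "~92K tok" per D worker, from the header
WORKING_SET_TOKENS = 1_500_000   # "≈ 1.5M" working set, taken as stated


def overload_ratio(decode_workers):
    """Working set divided by the aggregate KV pool of the D workers."""
    pool_tokens = decode_workers * KV_TOKENS_PER_DECODE
    return WORKING_SET_TOKENS / pool_tokens


ratio_1p3d = overload_ratio(3)   # this host: 3 D workers
ratio_2p6d = overload_ratio(6)   # original setup: 6 D workers
print(f"1P3D: {ratio_1p3d:.1f}x, 2P6D: {ratio_2p6d:.1f}x")
# 1P3D: 5.4x, 2P6D: 2.7x
```

Halving the number of D workers doubles the overload ratio, which is why the header describes pressure as higher than in the original 2P6D runs.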
#!/bin/bash
# Time-scale=1 validation sweep, downscaled to 4 GPUs:
#   - KVC v5 1P3D × N=3 (new data, validates §1/§2 structural claims at real timing)
#   - 4-way DP cache-aware × 1 (sanity baseline at same scale + ts=1)
#
# Goal: per docs/AGENTIC_FIT_ANALYSIS_ZH.md §7 / TEAM_REPORT §2.6, all v3-v6 KVC
# data was at time-scale=10 (inter-turn gap p50 = 0.25s, vs real 2.5s). This run
# tests whether the gap structurally reverses any conclusion.
#
# CONFIG NOTE: Original experiments used 8 GPUs (2P6D / 8-way DP). This host has
# only 4 H100s available, so we downscale proportionally to 1P3D / 4-way DP.
# Cross-compare against existing 2P6D ts=10 data is confounded by *both*
# time-scale and capacity. Internal comparison (1P3D KVC vs 4DP) at ts=1 is the
# clean signal. §5 (P-side imbalance) is NOT testable here: only 1 P.
#
# Capacity ratio: 3D × ~92K tok = 276K KV pool vs 52 sessions × ~50K peak input
# working set ≈ 1.5M → ~5.4× overload (vs 2.7× in original 2P6D).
# Pressure is HIGHER than original; partly offset by ts=1 letting D drain between turns.
#
# Output:
#   outputs/qwen3-30b-tp1-ts1-validation/
#     ├── kvc_1p3d_run{1,2,3}_summary.json
#     ├── kvc_1p3d_run{1,2,3}_metrics.jsonl
#     ├── dp4_summary.json
#     ├── dp4_metrics.jsonl
#     └── kvcache-centric-... / pd-colo-kv-aware-... (raw run dirs)
#
# Estimated GPU time: KVC ts=1 ≈ 100-180 min/run × 3 = 5-9h
#                     DP  ts=1 ≈ 100-120 min × 1     = ~2h
#                     Total                          = 7-11h
set -euo pipefail
cd "$(dirname "$0")/.."

MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-ts1-validation
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt

mkdir -p "$OUTPUT"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$RESULTS_FILE"
}

run_kvc_1p3d() {
    local run_idx=$1
    local label="kvc_1p3d_run${run_idx}"
    log ""
    log "=== [KVC ${run_idx}/3] 1P3D KVC kv-aware Option D, time-scale=1 ==="
    PYTHONPATH=src:third_party/sglang/python \
    "$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
        --trace "$TRACE" \
        --output-root "$OUTPUT" \
        --mechanism kvcache-centric \
        --policy kv-aware \
        --model-path "$MODEL" \
        --prefill-workers 1 --decode-workers 3 \
        --prefill-tp-size 1 --decode-tp-size 1 \
        --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
        --transfer-backend mooncake \
        --gpu-budget 4 \
        --time-scale 1 \
        --session-sample-rate 1.0 \
        --target-duration-s 100000 \
        --concurrency-limit 32 \
        --timeout-s 900 \
        --request-timeout-s 300 \
        --kvcache-admission-mode worker \
        --kvcache-seed-min-turn-id 1 \
        --kvcache-seed-max-inflight-decode -1 \
        --kvcache-prefill-backup-policy release-after-transfer \
        --kvcache-prefill-priority-eviction

    # Newest run directory for this mechanism; empty if nothing matched,
    # in which case the WARNING branch below reports it.
    local run_dir=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1)
    log "=== [KVC ${run_idx}/3] $label COMPLETED ==="
    if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
        # Copy summary and per-request metrics under a stable, per-run label.
        cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
        cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
        local errs=$("$VENV_PYTHON" -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
        log " errors = $errs"
        cat "$run_dir/request-metrics.jsonl.summary.json" >> "$RESULTS_FILE"
        echo "" >> "$RESULTS_FILE"
    else
        log "WARNING: no summary file in $run_dir"
    fi
}

run_dp4_sanity() {
    local label="dp4"
    log ""
    log "=== [DP] 4-way DP cache-aware sanity, time-scale=1 ==="
    PYTHONPATH=src:third_party/sglang/python \
    "$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
        --trace "$TRACE" \
        --output-root "$OUTPUT" \
        --mechanism pd-colo \
        --policy kv-aware \
        --model-path "$MODEL" \
        --prefill-workers 0 --decode-workers 0 \
        --direct-workers 4 --direct-tp-size 1 \
        --direct-gpu-ids 0,1,2,3 \
        --gpu-budget 4 \
        --time-scale 1 \
        --session-sample-rate 1.0 \
        --target-duration-s 100000 \
        --concurrency-limit 32 \
        --timeout-s 900 \
        --request-timeout-s 300

    # Newest run directory for this mechanism; empty if nothing matched,
    # in which case the WARNING branch below reports it.
    local run_dir=$(ls -td "$OUTPUT"/pd-colo-kv-aware-*/ 2>/dev/null | head -1)
    log "=== [DP] $label COMPLETED ==="
    if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
        # Copy summary and per-request metrics under a stable label.
        cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
        cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
        local errs=$("$VENV_PYTHON" -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
        log " errors = $errs"
        cat "$run_dir/request-metrics.jsonl.summary.json" >> "$RESULTS_FILE"
        echo "" >> "$RESULTS_FILE"
    else
        log "WARNING: no summary file in $run_dir"
    fi
}

log "=== TS=1 VALIDATION (4-GPU): KVC 1P3D × N=3 + 4DP × 1 ==="
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Goal: validate whether ts=10 was the main distortion in v3-v6 KVC vs DP"

# KVC × 3 first (the new data we need); DP last (cheaper sanity at end)
for i in 1 2 3; do
    run_kvc_1p3d "$i"
done

run_dp4_sanity

log ""
log "=== TS=1 SUMMARY ==="
for label in kvc_1p3d_run1 kvc_1p3d_run2 kvc_1p3d_run3 dp4; do
    if [ -f "$OUTPUT/${label}_summary.json" ]; then
        e=$("$VENV_PYTHON" -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
        p50=$("$VENV_PYTHON" -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('latency_stats_s',{}).get('p50','n/a'))")
        log " ${label}: errors=$e lat_p50=${p50}s"
    fi
done
log "=== TS=1 ALL DONE ==="
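The final summary loop starts one Python interpreter per field per label. As a possible alternative, the same fields could be read in a single pass by a small helper. This is only a sketch: `read_summary` and `summarize` are hypothetical names, and the only things taken from the script are the `error_count` and `latency_stats_s.p50` keys, their defaults, and the `<label>_summary.json` naming.

```python
import json
from pathlib import Path


def read_summary(path):
    """Return (error_count, latency p50) with the same defaults as the script."""
    data = json.loads(Path(path).read_text())
    errors = data.get("error_count", 0)
    p50 = data.get("latency_stats_s", {}).get("p50", "n/a")
    return errors, p50


def summarize(output_dir, labels):
    """Build one summary line per label whose <label>_summary.json exists."""
    lines = []
    for label in labels:
        path = Path(output_dir) / f"{label}_summary.json"
        if path.is_file():  # missing runs are skipped, as in the shell loop
            errors, p50 = read_summary(path)
            lines.append(f"{label}: errors={errors} lat_p50={p50}s")
    return lines
```

Calling `summarize("outputs/qwen3-30b-tp1-ts1-validation", ["kvc_1p3d_run1", "kvc_1p3d_run2", "kvc_1p3d_run3", "dp4"])` would return the lines the shell loop logs, without the timestamp prefix that `log` adds.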