KVC v2 beats 4DP at ts=1 same-scale on 7/8 metrics: TTFT mean -24%, p50 -54%, p90 -64%; lat mean -0.8%, p50 -12.6%, p90 -0.7%. Direct-to-D rate jumped 42.8% -> 91.7%. REFACTOR_PLAN_V1 scenario C achieved. Two-knob fix: - reset-on-success blacklist decay: clear (sess, D) reject counter on successful direct-to-D path. Eliminates v1 thrashing where session 6880 was stable on decode-1 for 70 turns then collapsed to 75 D-changes after cumulative transient pressure tripped the permanent blacklist. - bump --kvcache-direct-max-uncached-tokens default 2048 -> 8192 via CLI flag. 41% of v1 fallbacks were 'real-large-append' (>2048 token append); raising the threshold lets these go through the direct-to-D fast path. Code: - policies.py: RoutingState.session_d_rejects counter + KvAwarePolicy migration_reject_threshold; degenerate fallback picks least-rejected D. - replay.py: record_admission_reject + reset-on-success in _run_request; _fallthrough_reason classifies turn-2+ fall-throughs as session-not-resident / real-large-append / etc, replacing misleading 'large-append' suffix (TEAM_REPORT §2.7). - cli.py + benchmark.py: --kvcache-migration-reject-threshold flag wiring. Docs: - REFACTOR_PLAN_V1_ZH.md: forward-looking plan after ts=1 validation. - MIGRATION_V1_FINDINGS_ZH.md: v1 thrashing root-cause analysis. - V2_RESULTS_ZH.md: v2 results, scenario C achievement, attribution. - TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md: comprehensive team report. Scripts: - sweep_ts1_kvc_n3_plus_dp.sh: ts=1 baseline (KVC 1P3D N=3 + 4DP CA). - sweep_ts1_migration_v1.sh / v2.sh: validation runs. - analyze_ts1_validation.py: 4-way comparison analyzer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
77 lines
3.4 KiB
Bash
Executable File
77 lines
3.4 KiB
Bash
Executable File
#!/bin/bash
|
|
# Migration v2 validation: KVC 1P3D ts=1 with BOTH:
|
|
# (1) reset-on-success blacklist decay (replay.py code change)
|
|
# (2) --kvcache-direct-max-uncached-tokens 8192 (was 2048 default)
|
|
#
|
|
# v1 results (kvc_1p3d_migration_run1) showed:
|
|
# - lat mean WORSE +11.7%, TTFT mean WORSE +71.3% — thrashing tax
|
|
# - direct-to-D rate UP +10.5pp (42.8 → 53.3%)
|
|
# - Fallback breakdown surprise: 41.3% are 'real-large-append' (>2048 tok),
|
|
# NOT 'session-not-resident' as we hypothesized
|
|
#
|
|
# v2 design (REFACTOR_PLAN_V1 + MIGRATION_V1_FINDINGS):
|
|
# (1) reset-on-success: clear (sess,D) reject counter on successful direct-to-D
|
|
# — eliminates blacklist-permanence bug → kills thrashing
|
|
# (2) bump direct-append threshold 2048 → 8192: lets more large-append turns
|
|
# go direct-to-D instead of fall through to seed (which often rejects)
|
|
set -euo pipefail
|
|
cd "$(dirname "$0")/.."
|
|
|
|
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
|
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
|
OUTPUT=outputs/qwen3-30b-tp1-ts1-migration-v2
|
|
VENV_PYTHON=.venv/bin/python
|
|
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
|
|
|
mkdir -p $OUTPUT
|
|
|
|
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE; }
|
|
|
|
log "=== TS=1 MIGRATION v2: reset-on-success + threshold=8192 ==="
|
|
log "Baselines:"
|
|
log " baseline (no migration): kvc_1p3d_run1 errors=5 lat_p50=0.811s ttft_p50=0.124s direct=42.8%"
|
|
log " v1 (migration permanent): kvc_1p3d_migration_run1 errors=6 lat_p50=0.773s ttft_p50=0.057s direct=53.3% lat_mean=1.758s"
|
|
log " 4DP ts=1: errors=0 lat_p50=0.659s ttft_p50=0.090s lat_mean=1.443s"
|
|
log "Goal: kill thrashing tax (lat_mean ≤ 1.5s, p99 ≤ 9s) while preserving v1's direct-to-D gains."
|
|
|
|
label=kvc_1p3d_migration_v2_run1
|
|
log ""
|
|
log "=== [migration v2] starting ==="
|
|
PYTHONPATH=src:third_party/sglang/python \
|
|
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
|
--trace $TRACE \
|
|
--output-root $OUTPUT \
|
|
--mechanism kvcache-centric \
|
|
--policy kv-aware \
|
|
--model-path $MODEL \
|
|
--prefill-workers 1 --decode-workers 3 \
|
|
--prefill-tp-size 1 --decode-tp-size 1 \
|
|
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
|
--transfer-backend mooncake \
|
|
--gpu-budget 4 \
|
|
--time-scale 1 \
|
|
--session-sample-rate 1.0 \
|
|
--target-duration-s 100000 \
|
|
--concurrency-limit 32 \
|
|
--timeout-s 900 \
|
|
--request-timeout-s 300 \
|
|
--kvcache-admission-mode worker \
|
|
--kvcache-seed-min-turn-id 1 \
|
|
--kvcache-seed-max-inflight-decode -1 \
|
|
--kvcache-prefill-backup-policy release-after-transfer \
|
|
--kvcache-prefill-priority-eviction \
|
|
--kvcache-migration-reject-threshold 3 \
|
|
--kvcache-direct-max-uncached-tokens 8192
|
|
|
|
run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
|
log "=== [migration v2] $label COMPLETED ==="
|
|
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
|
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
|
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
|
errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
|
|
p50=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('latency_stats_s',{}).get('p50',0))")
|
|
log " errors=$errs lat_p50=${p50}s"
|
|
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
|
fi
|
|
log "=== migration v2 DONE ==="
|