Hostile audit of the original report flagged three load-bearing errors:
1. The held_tokens semantics were inverted. session_held_tokens() at
session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len)
per slot, i.e. the slot-private tokens that are NOT in the radix tree. So
"other = cap - held - avail" actually CONTAINS the radix-tree protected
prefix cache (likely the single biggest component for shared agentic
prefixes), not just running batch + in-flight as the original report
claimed.
2. The admission-race causal hypothesis for the 415 EXP2+profile errors is
contradicted by the data: 414/415 errors have kv_transfer_blocks > 0, so
they passed admission and died downstream ("generate stream ended before
producing any token", raised by the client when a 200 response had an empty
stream).
3. The polling confound was dismissed too quickly. Mode counts shift ~1:1
(session-cap-fb -356 / kvcache-centric +406), and /server_info is not a
passive read: it dispatches into the scheduler main loop and iterates
every session slot.
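The corrected accounting in item 1 can be illustrated with a minimal Python
sketch. The Slot class and all numbers are hypothetical; only the two field
names and the "other = cap - held - avail" formula come from the report. It
assumes avail does not include protected radix tokens and that each slot's
protected prefix is distinct (shared prefixes would make the protected total
smaller).

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Hypothetical stand-in for a session slot; not the real SGLang type."""
    kv_allocated_len: int      # tokens allocated to this slot
    cache_protected_len: int   # prefix tokens protected in the radix tree


def session_held_tokens(slots):
    # Slot-private tokens only: the radix-protected prefix is subtracted out.
    return sum(s.kv_allocated_len - s.cache_protected_len for s in slots)


def other_tokens(cap, held, avail):
    # Residual bucket as defined in the original report.
    return cap - held - avail


slots = [Slot(300, 200), Slot(150, 100)]   # illustrative numbers only
held = session_held_tokens(slots)
other = other_tokens(cap=1000, held=held, avail=400)
protected = sum(s.cache_protected_len for s in slots)
print(held, other, protected)
```

Because the protected tokens are excluded from held, under these assumptions
they can only land in the "other" bucket, which is why "other" cannot be read
as running batch + in-flight alone.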
Plus: per-D error% is confounded by sticky session affinity (only 18 unique
sessions account for the 415 errors; decode-3 had 0 errors only because no
high-error session landed there); the decile 10 "recovery" was an
equal-time binning artifact (24.5% under equal-count); the v5 vs v5+profile
time gap was 21h, not 6h; and the p50/p90 latency comparison is N=1.
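The binning artifact can be reproduced on synthetic data. The sketch below is
not the analyzer's code; the function names and the records are made up for
illustration. It shows how a sparse tail makes the last equal-time bin look
healthy while the last equal-count bin does not.

```python
def equal_count_bins(records, n_bins=10):
    # Sort by timestamp and split into bins holding the same number of rows.
    ordered = sorted(records)
    n = len(ordered)
    return [ordered[b * n // n_bins:(b + 1) * n // n_bins] for b in range(n_bins)]


def equal_time_bins(records, n_bins=10):
    # Split the wall-clock span into equal-width windows.
    ordered = sorted(records)
    t0, t1 = ordered[0][0], ordered[-1][0]
    span = (t1 - t0) or 1.0
    bins = [[] for _ in range(n_bins)]
    for t, err in ordered:
        idx = min(int((t - t0) / span * n_bins), n_bins - 1)
        bins[idx].append((t, err))
    return bins


def error_rate(chunk):
    return sum(err for _, err in chunk) / len(chunk) if chunk else float("nan")


# Synthetic (timestamp, is_error) rows: a dense burst, then two late stragglers.
records = [(float(i), i >= 80) for i in range(98)] + [(500.0, False), (1000.0, False)]

last_time_bin = equal_time_bins(records)[-1]
last_count_bin = equal_count_bins(records)[-1]
print(len(last_time_bin), error_rate(last_time_bin), error_rate(last_count_bin))
```

With this data the final equal-time bin holds a single successful request, so
its error rate looks like a recovery, while the final equal-count bin still
contains the tail of the error burst.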
Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction
with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4).
Action items split into P0 (verify, must do first) and P1 (instrument):
P0 — scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2
(no polling, identical config to the original v5 run) to test whether the
9-error baseline result is reproducible. If 3 runs give ~9 errors and
profile gives 415, polling is the leading suspect. Currently running
in background.
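Once the runs finish, the per-run error counts can be collected with a small
helper like the one below. This is a sketch, not part of the commit: the file
pattern and the error_count key match what the sweep script writes and reads,
but baseline_reproduces() and its slack factor are a hypothetical rule of
thumb for "~9 errors", not a threshold taken from the report.

```python
import json
from pathlib import Path


def load_error_counts(output_dir, pattern="exp2_2p6d_run*_summary.json"):
    # Map each per-run summary file name to its error_count (0 if missing).
    counts = {}
    for path in sorted(Path(output_dir).glob(pattern)):
        with path.open() as f:
            summary = json.load(f)
        counts[path.name] = summary.get("error_count", 0)
    return counts


def baseline_reproduces(counts, reference=9, slack=3.0):
    # Hypothetical criterion: every run stays within slack x the reference.
    return bool(counts) and all(c <= reference * slack for c in counts.values())
```

For example, load_error_counts("outputs/qwen3-30b-tp1-v5-optD-baseline-rerun")
would return one entry per completed run; an empty dict means no summaries
were found and is deliberately treated as "not reproduced".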
P1 — scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only
"pool_breakdown" dict to /server_info covering: radix_evictable_tokens,
radix_protected_tokens, slot_private_held_tokens, session_slot_count,
running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens},
prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these,
"unaccounted = cap - sum(known)" exposes true leakage. replay.py captures
all fields into the per-tick row; analyzer prints the decomposition and
gracefully handles old timeseries (prints "P1 instrument absent").
Mock-tested end-to-end. SGLang patch is read-only and does not affect
admission/scheduling. Old v5+profile data still analyzes correctly.
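A minimal sketch of the "unaccounted = cap - sum(known)" decomposition and the
fallback for old timeseries, for illustration only. It is not the analyzer's
code: the row layout, the cap and free_tokens arguments, and the choice of
which pool_breakdown fields to sum are assumptions, and it assumes the summed
components are disjoint (any overlap would understate the residual).

```python
# Token-valued pool_breakdown fields named in this commit; the *_reqs fields
# and session_slot_count are counts, not tokens, so they are not summed.
TOKEN_FIELDS = (
    "radix_evictable_tokens",
    "radix_protected_tokens",
    "slot_private_held_tokens",
    "running_batch_kv_tokens",
    "transfer_queue_tokens",
    "prealloc_queue_tokens",
    "retracted_queue_tokens",
)


def unaccounted_tokens(row, cap, free_tokens):
    # Returns None for rows captured before the P1 instrument existed.
    breakdown = row.get("pool_breakdown")
    if not breakdown:
        return None
    known = free_tokens + sum(int(breakdown.get(name, 0)) for name in TOKEN_FIELDS)
    return cap - known


def describe(row, cap, free_tokens):
    residual = unaccounted_tokens(row, cap, free_tokens)
    if residual is None:
        return "P1 instrument absent"
    return f"unaccounted = {residual}"


new_row = {"pool_breakdown": {
    "radix_evictable_tokens": 100, "radix_protected_tokens": 300,
    "slot_private_held_tokens": 150, "running_batch_kv_tokens": 200,
    "transfer_queue_tokens": 50, "prealloc_queue_tokens": 30,
    "retracted_queue_tokens": 20,
}}
old_row = {}  # a v5+profile row without the new dict
print(describe(new_row, cap=1000, free_tokens=100))
print(describe(old_row, cap=1000, free_tokens=100))
```

A persistently positive residual would point at tokens held by something
outside the listed components; a negative one would suggest that two of the
summed fields overlap.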
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
90 lines
3.2 KiB
Bash
Executable File
#!/bin/bash
# P0: Re-run v5 baseline EXP2 (2P6D) three times to establish whether
# errors=9 is a stable property of the v5 config or single-run variance.
# Critic of V5_PROFILE_INVESTIGATION_ZH.md flagged that the 415 errors in
# v5+profile EXP2 may have been polling-induced. We need 3 baseline runs
# (no polling, identical config to original v5) to test reproducibility.
#
# Output:
#   outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
#   ├── exp2_2p6d_run{1,2,3}_summary.json
#   ├── exp2_2p6d_run{1,2,3}_metrics.jsonl
#   └── kvcache-centric-...<ts>/ (one per run)
set -euo pipefail
cd "$(dirname "$0")/.."

MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v5-optD-baseline-rerun
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt

mkdir -p $OUTPUT

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}

run_exp2() {
    local run_idx=$1
    local label="exp2_2p6d_run${run_idx}"
    log ""
    log "=== [RUN ${run_idx}/3] EXP2 2P6D KVC kv-aware Option D (no polling) ==="
    PYTHONPATH=src:third_party/sglang/python \
    $VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
        --trace $TRACE \
        --output-root $OUTPUT \
        --mechanism kvcache-centric \
        --policy kv-aware \
        --model-path $MODEL \
        --prefill-workers 2 --decode-workers 6 \
        --prefill-tp-size 1 --decode-tp-size 1 \
        --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
        --transfer-backend mooncake \
        --gpu-budget 8 \
        --time-scale 10 \
        --session-sample-rate 1.0 \
        --target-duration-s 100000 \
        --concurrency-limit 32 \
        --timeout-s 900 \
        --request-timeout-s 300 \
        --kvcache-admission-mode worker \
        --kvcache-seed-min-turn-id 1 \
        --kvcache-seed-max-inflight-decode -1 \
        --kvcache-prefill-backup-policy release-after-transfer \
        --kvcache-prefill-priority-eviction

    local run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
    log "=== [RUN ${run_idx}/3] $label COMPLETED ==="
    if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
        cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
        cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
        local errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
        log " errors = $errs (baseline reference = 9)"
        cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
        echo "" >> $RESULTS_FILE
    else
        log "WARNING: no summary file in $run_dir"
    fi
}

log "=== P0: v5 baseline EXP2 reproducibility test (3 runs, no polling) ==="
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Goal: confirm whether errors=9 in v5 baseline EXP2 is reproducible"
log " (v5+profile saw 415 errors; we need to know if polling was causal)"

for i in 1 2 3; do
    run_exp2 $i
done

log ""
log "=== P0 SUMMARY: errors per run ==="
for i in 1 2 3; do
    if [ -f "$OUTPUT/exp2_2p6d_run${i}_summary.json" ]; then
        e=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/exp2_2p6d_run${i}_summary.json')); print(d.get('error_count',0))")
        log " run ${i}: errors = $e"
    fi
done
log "=== P0 ALL DONE ==="