agentic-pd-hybrid/scripts/run_all_experiments.sh
kzlin c9d350b372 docs: KVC v1-v4 debug journey + raise session soft_cap to 16
Document the iterative debugging from v1 (broken KVC) through v4
(routing fixed + session cap raised), with code-level analysis of
the two main bugs encountered:

1. v2 root cause (mis-diagnosed previously as `allow_local_prefill`):
   `--policy default` for KVC mechanism caused replay's round-robin
   policy and the PD router's round-robin to diverge, sending requests
   with `session_params` to a D worker that did not have the session
   open. The result was 56-61% truncation with finish_reason
   "session id X does not exist".
   Fix: use `--policy kv-aware` (sweep_tp1_v3_kvaware.sh) so replay
   emits `x-smg-target-worker` and PD router uses consistent_hashing.

2. v3 new bottleneck: `pd-router-fallback-large-append-session-cap`
   dominated 52-65% of requests. Root cause was hardcoded
   `min(4, ...)` in `_decode_session_soft_cap`. With 7 D workers x 4
   sessions = 28 slots for 52 trace sessions, ~24 sessions starved
   permanently (bimodal direct-to-D rate of 0% or 99%).
   Fix: raise the cap to 16 (replay.py).
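
The slot arithmetic in item 2 can be checked with a few lines of shell. This is a minimal sketch: the helper names are hypothetical and not taken from replay.py, and it assumes each session occupies exactly one slot on one D worker.

```bash
#!/bin/bash
set -euo pipefail

# Illustrative helpers; the names are hypothetical and not taken from replay.py.

# $1 = decode workers, $2 = per-worker session soft cap
total_slots() {
  echo $(( $1 * $2 ))
}

# $1 = decode workers, $2 = soft cap, $3 = sessions in the trace
sessions_without_slot() {
  local slots
  slots=$(total_slots "$1" "$2")
  echo $(( $3 > slots ? $3 - slots : 0 ))
}

# Compare the old hardcoded cap (4) with the raised cap (16)
# for 7 D workers and 52 trace sessions.
for cap in 4 16; do
  echo "cap=${cap} slots=$(total_slots 7 "$cap") unslotted=$(sessions_without_slot 7 "$cap" 52)"
done
```

Under this simple count, a cap of 4 leaves 24 of the 52 sessions without a slot, while a cap of 16 gives 112 slots, enough for every session at once.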

Also includes the v3 finding that the direct-to-d-session path
(P50=0.495s, TTFT P50=0.043s) already beats the 8-way DP baseline
(0.65s/0.093s): the KVC core mechanism works when fallback paths are avoided.
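
As a rough sketch of the fix in item 1, a KVC run can be launched with `--policy kv-aware` instead of `--policy default`. The flag names below are copied from run_all_experiments.sh. The worker layout (1 prefill + 7 decode at TP1), the placeholder model path, the `build_kvc_args` helper, and the `DRY_RUN` switch are assumptions for illustration and are not taken from sweep_tp1_v3_kvaware.sh. The remaining timing and kvcache flags are omitted here and would need to be copied from the Experiment C invocation.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper: fills the global array KVC_ARGS with benchmark-live
# arguments for a KVC run. $1 = routing policy (e.g. kv-aware).
# Assumed layout: 1 prefill worker + 7 decode workers, TP size 1, 8 GPUs.
build_kvc_args() {
  local policy="$1"
  KVC_ARGS=(
    --trace "${TRACE:-outputs/qwen35-swebench-50sess.jsonl}"
    --output-root "${OUTPUT:-outputs/swebench-exps}"
    --mechanism kvcache-centric
    --policy "$policy"
    --model-path "${MODEL:-/path/to/model}"   # placeholder path
    --prefill-workers 1 --decode-workers 7
    --prefill-tp-size 1 --decode-tp-size 1
    --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7
    --transfer-backend mooncake
    --gpu-budget 8
  )
}

build_kvc_args kv-aware

if [[ "${DRY_RUN:-1}" == "1" ]]; then
  # Default: print the command instead of launching it, so the sketch is safe to run.
  printf '%q ' uv run agentic-pd-hybrid benchmark-live "${KVC_ARGS[@]}"
  echo
else
  uv run agentic-pd-hybrid benchmark-live "${KVC_ARGS[@]}"
fi
```

Building the arguments in an array keeps the policy choice in one place, which makes it harder for a KVC run to fall back to `--policy default` by accident.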

Files:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: full journey + code location index
- docs/SWEBENCH_EXPERIMENT_{PROGRESS,RESULTS}.md: prior session notes
- scripts/sweep_tp1_v{2,3,4}*.sh: experiment driver scripts
- src/agentic_pd_hybrid/replay.py: cap 4 -> 16, audit fields
- src/agentic_pd_hybrid/pd_router.py: strip session_params from prefill
- src/agentic_pd_hybrid/metrics.py: truncated_request_count

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 21:10:41 +08:00

#!/bin/bash
# Run all 3 PD hybrid experiments sequentially
# Uses 52 sessions / 4,449 requests (10% sample of 497 sessions)
# Each experiment takes ~30-40 min
set -euo pipefail
cd "$(dirname "$0")/.."
TRACE="outputs/qwen35-swebench-50sess.jsonl"
MODEL="/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B"
OUTPUT="outputs/swebench-exps"
echo "=== Experiment A: pd-disaggregation ==="
uv run agentic-pd-hybrid benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism pd-disaggregation \
    --policy default \
    --model-path "$MODEL" \
    --prefill-workers 1 --decode-workers 1 \
    --prefill-tp-size 4 --decode-tp-size 4 \
    --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300
echo "=== Experiment B: pd-colo ==="
uv run agentic-pd-hybrid benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism pd-colo \
    --policy default \
    --model-path "$MODEL" \
    --prefill-workers 0 --decode-workers 0 \
    --direct-workers 2 --direct-tp-size 4 \
    --direct-gpu-ids 0,1,2,3,4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300
# NOTE: this run keeps --policy default with a single decode worker.
# docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md reports that --policy default misroutes
# session requests for the KVC mechanism when replay and the PD router pick
# different D workers; the fix used there is --policy kv-aware.
echo "=== Experiment C: kvcache-centric ==="
uv run agentic-pd-hybrid benchmark-live \
    --trace "$TRACE" \
    --output-root "$OUTPUT" \
    --mechanism kvcache-centric \
    --policy default \
    --model-path "$MODEL" \
    --prefill-workers 1 --decode-workers 1 \
    --prefill-tp-size 4 --decode-tp-size 4 \
    --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
    --transfer-backend mooncake \
    --gpu-budget 8 \
    --time-scale 10 \
    --session-sample-rate 1.0 \
    --target-duration-s 100000 \
    --concurrency-limit 32 \
    --timeout-s 900 \
    --request-timeout-s 300 \
    --kvcache-admission-mode worker \
    --kvcache-seed-min-turn-id 2 \
    --kvcache-prefill-backup-policy release-after-transfer \
    --kvcache-prefill-priority-eviction
echo "=== All experiments complete ==="