Commit Graph

8 Commits

Author SHA1 Message Date
b2ede1da77 bench.sh: add trap for graceful cleanup on kill/interrupt
Added EXIT/INT/TERM traps to ensure vLLM, proxy, and gpu_monitor
processes are cleaned up even when bench.sh is killed externally.
Also includes gpu_monitor in cleanup_gpu pattern matching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 20:24:13 +08:00
bf037594c4 Production-realistic baseline: APC 67.5%, TPOT +139% from interference
Updated methodology:
- Window+thin sampling preserves cross-session sharing (48% vs 16%)
- --max-single-turn-ratio 0.3 boosts multi-turn to 70%
- --window-seconds 600 for 10-min contiguous window
- Trace-driven replay (no session limit, no time compression)
- Daily config: --requests 850 (~13 min, APC~76%)

Key result: TPOT p90=0.175s (vs 0.073s in legacy 1-req/GPU setup),
confirming prefill-decode interference is real at production concurrency.
APC 67.5% (vs 44%) from better KV reuse preservation.

Also fixed KV reuse breakdown: 62% intra-session / 38% cross-session
(was incorrectly reported as 91% / 9%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 15:44:34 +08:00
4089ffd63f Fix replay methodology: trace-driven dispatch, no artificial limits
The replayer was artificially limiting concurrency with --max-inflight-sessions
(semaphore) and --time-scale (time compression), producing unrealistically low
1 req/GPU load that masked prefill-decode interference.

Replayer changes:
- Remove session_sem and time_scale entirely
- Each request dispatched at its trace timestamp exactly
- Sessions still sequential (turn N+1 waits for turn N completion)
- If turn completes late, next turn fires immediately

Sampler changes:
- Add --sample-ratio for GPU-proportional session sampling
- Keep --target-requests for backwards compat
- No time compression (preserve original arrival pattern)

bench.sh: remove --time-scale and --max-inflight-sessions args

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:43:41 +08:00
3594f7dce0 Fix LMetric routing: remove session affinity, align with OSDI'26 spec
LMetric was incorrectly sharing session-sticky logic with Linear policy.
Fixed to pure per-request routing: score = P_tokens × BS where
P = pending_prefill + (input - cache_hit), BS = num_requests.

Experiment result (200 req, fresh restart): Linear vs corrected LMetric
show <2% difference on all metrics — LMetric's cache-hit estimation
provides implicit soft affinity that preserves locality without explicit
session stickiness.

Also fix bench.sh missing cd (replayer module not found from non-project
cwd) and rewrite run_lmetric_ab.sh as thin wrapper around bench.sh to
eliminate duplicated launch/cleanup logic that broke under set -euo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 11:56:58 +08:00
080a8fa138 Chunk-size ablation + comprehensive synthesis
max_num_batched_tokens sweep at 16 sessions (2048/4096/8192/16384):
- Default 8192 has best overall TPOT p90 (0.106) and E2E p50 (5.83)
- 16384: HEAVY TTFT -16%, HEAVY TPOT -17%, but overall worse (+18%)
- Smaller chunks (2048/4096) always worse (scheduler overhead)

bench.sh now supports --max-batched-tokens flag.

Updated elastic_hypotheses.md with H8 (high concurrency validated),
H9 (elastic RDMA at 16s rejected), and final synthesis.

Key conclusion: for agentic workloads, the dominant optimization is
cache-aware session-sticky routing (-60% TTFT, +24pp APC vs RR).
Neither PD-Sep, LMetric, elastic RDMA, nor chunk-size tuning provides
additional benefit beyond well-tuned routing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 07:15:02 +08:00
85b230455e H7 OVERLOAD_FACTOR sweep: negative result + H4 GPU profiling
H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU
imbalance (~3.5-4x across all settings). Root cause: imbalance is from
workload skew at session placement (turn 1), not from routing at turn 2+.

H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x),
and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded
requests have bimodal transfer times (0.6s or 18-31s) that negate the
routing benefit.

Updated elastic_hypotheses.md with H7 results and next directions:
higher load experiments where contention amplifies routing differences.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 03:04:02 +08:00
3bc37cc6d5 PS experiments + H4 cache-gate + GPU profiling + Mooncake elif→if fix
Experiments run:
- Phase 0: kv_both has zero idle overhead (TPOT +1.3%, noise)
- PS V1 (cold prefill): REJECTED — PS always slower than cached C
- PS V1+flexD: 92.5% OK, HEAVY TTFT 7.8s (baseline 5.0s) — PS bottleneck
- V2 (C_s prefill + flexible D): E2E -9% but 6 errors, RDMA bimodal
- H4 (cache-gate): 198/200 OK, GPU imbalance 4.0x→2.0x, but HEAVY_OFFLOAD
  TTFT=11.5s due to RDMA. HEAVY_COLO improved 10.5% from better balance.
- H5: Mooncake RDMA transfer R²=0.095, bimodal (0.6s or 18-30s)

Key findings:
- Mooncake lacks layerwise KV transfer → RDMA is pure sequential overhead
- 92% of HEAVY are turn-1 cold → offloading cold requests always loses
- GPU balance improvement from routing IS real (-10.5% HEAVY_COLO TTFT)
- RDMA transfer negates the routing benefit for offloaded requests

Code changes:
- bench.sh: add GPU timeline monitoring (gpu_monitor.sh during benchmark)
- cache_aware_proxy.py: H4 cache-gate, flexible D, PS routing
- mooncake_connector.py: elif→if fix (allow dual prefill+decode flags)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 02:14:37 +08:00
fc92410ec9 Invalidate prior A/B results + add proper experiment harness
Prior cross-machine comparison (commit 1e86285) was invalid: dash0
baseline used warm instances with residual KV cache, inflating TTFT
by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start
requests; WARM TTFT p90=3.3s vs fresh=0.26s.

Fair same-machine comparison (both fresh restart on dash0):
  Baseline:    TTFT50=1.075  TPOT90=0.076  E2E50=5.075  OK=198/200
  Elastic P2P: TTFT50=1.018  TPOT90=0.085  E2E50=6.977  OK=195/200
Elastic is WORSE due to Mooncake kv_both memory overhead.

Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
  add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
  and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:54:21 +08:00