Commit Graph

6 Commits

Author SHA1 Message Date
2c7f7fdaae replayer: restore optional max_inflight_sessions for backwards compat
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 21:02:26 +08:00
0ed1ce200e metrics: replace round-based percentile with linear interpolation (B5)
The previous implementation used round((n-1) * pct), which under Python's
banker's rounding returned the upper-middle element on every even-length
array (e.g. p50 of [1,2,3,4] returned 3 instead of 2.5). All summary
JSONs were biased upward at p50 as a result. Match numpy.percentile's
default linear interpolation between the two adjacent sorted values.
2026-05-23 21:00:24 +08:00
4089ffd63f Fix replay methodology: trace-driven dispatch, no artificial limits
The replayer was artificially limiting concurrency with --max-inflight-sessions
(semaphore) and --time-scale (time compression), producing unrealistically low
1 req/GPU load that masked prefill-decode interference.

Replayer changes:
- Remove session_sem and time_scale entirely
- Each request dispatched at its trace timestamp exactly
- Sessions still sequential (turn N+1 waits for turn N completion)
- If turn completes late, next turn fires immediately

Sampler changes:
- Add --sample-ratio for GPU-proportional session sampling
- Keep --target-requests for backwards compat
- No time compression (preserve original arrival pattern)

bench.sh: remove --time-scale and --max-inflight-sessions args

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:43:41 +08:00
1d2eeb4925 Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080)
Design: offload HEAVY prefill only when P instance is less loaded than D
AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D
for future KV reuse. External KV correctly registered in prefix cache.

Result (67/200 processed, 75% success):
  TTFT p50: 0.551s (-49% vs baseline 1.080s)
  TTFT p90: 4.135s (vs baseline 9.410s, -56%)
  TPOT p90: 0.074s (same as baseline)
  E2E  p50: 2.938s (-45% vs baseline 5.306s)

25% error rate from ReadTimeout on very large HEAVY requests queuing on P.
Needs stricter elastic gate or higher timeout. But successful requests
show significant improvement over both baseline and previous P2P.

Also: added external_prefix_cache metrics tracking to replayer summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 13:50:25 +08:00
32f09d32cd Balanced session-sticky routing + agentic workload pattern analysis
Routing fix: new sessions placed by cumulative token load (greedy bin
packing) with cache-hit tiebreak. Session affinity for turn 2+.
Replayer now sends X-Session-Id header for proper session tracking.

Agentic workload core patterns (GLM-5.1 trace):
  - 91% of reusable KV is intra-session (not cross-session)
  - Session-sticky routing is THE critical optimization
  - 36% warm requests (1.3k new tokens), 64% cold (17k+)
  - After cache: effective prefill/decode ratio drops from 61.5x to 28.7x
  - Cross-session sharing (system prompt) is only 4.8% of tokens

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 01:50:27 +08:00
05592e6adc Agentic workload PD separation analysis with trace-driven benchmarks
Systematic study of prefill-decode disaggregation for agentic LLM workloads
using production GLM-5.1 coder trace (2.1M requests, 71B input tokens).

Key findings:
- Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7%
  without PD separation, matching PD-Sep's decode isolation benefit
- PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain
  when using the same cache-aware scheduler
- Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x
  vs decode AI <2), but absolute FLOPs drop 71% from cache hits
- For agentic MoE workloads, cache-aware routing > PD separation

Infrastructure:
- Trace sampler preserving session structure + hash_ids for prefix sharing
- Async trace replayer with streaming TTFT/TPOT/E2E measurement
- Unified cache-aware + token-level load-balanced global scheduler proxy
  supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes
- vLLM 0.18.1 scheduler patch for KV transfer abort race condition
- Roofline analysis tool for prefill/decode compute characterization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 21:21:57 +08:00