5 Commits

Author SHA1 Message Date
bf037594c4 Production-realistic baseline: APC 67.5%, TPOT +139% from interference
Updated methodology:
- Window+thin sampling preserves cross-session sharing (48% vs 16%)
- --max-single-turn-ratio 0.3 boosts multi-turn to 70%
- --window-seconds 600 for 10-min contiguous window
- Trace-driven replay (no session limit, no time compression)
- Daily config: --requests 850 (~13 min, APC~76%)

Key result: TPOT p90=0.175s (vs 0.073s in legacy 1-req/GPU setup),
confirming prefill-decode interference is real at production concurrency.
APC 67.5% (vs 44%) from better KV reuse preservation.

Also fixed KV reuse breakdown: 62% intra-session / 38% cross-session
(was incorrectly reported as 91% / 9%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 15:44:34 +08:00
d8dc9dc0ce Add --max-single-turn-ratio to control single-turn session fraction
Single-turn sessions with unique prefixes get 0% cache hit, diluting
APC in benchmarks.  --max-single-turn-ratio caps their fraction,
boosting multi-turn density and theoretical APC.

Example: --sample-ratio 0.008 --max-single-turn-ratio 0.3
  Before: 9.2% multi-turn, APC=70.5%
  After:  70.0% multi-turn, APC=85.0%, sharing=53.3%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 14:17:25 +08:00
1e1e2e774d Fix sampler: window+thin preserves cross-session KV cache sharing
Random session sampling destroys cross-session hash block sharing
(52% -> 16%) because sessions sharing system prompts get scattered.

New approach: take a contiguous time window from the trace (preserving
temporal locality of shared-prefix sessions), then thin within the
window to hit target QPS. This preserves both intra-session reuse
(62% of reusable tokens) and cross-session sharing (38%).

Results (block sharing rate):
  Old random r=0.002:  16.0%  ->  Window+thin: 29.7%
  Old random r=0.016:  19.5%  ->  Window+thin: 42.7%
  Full trace baseline: 52%

Also corrected the "91% intra-session" claim: actual split is
62% intra / 38% cross (token-level), making cross-session sharing
preservation critical for valid APC benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 14:03:12 +08:00
4089ffd63f Fix replay methodology: trace-driven dispatch, no artificial limits
The replayer was artificially limiting concurrency with --max-inflight-sessions
(semaphore) and --time-scale (time compression), producing unrealistically low
1 req/GPU load that masked prefill-decode interference.

Replayer changes:
- Remove session_sem and time_scale entirely
- Each request dispatched at its trace timestamp exactly
- Sessions still sequential (turn N+1 waits for turn N completion)
- If turn completes late, next turn fires immediately

Sampler changes:
- Add --sample-ratio for GPU-proportional session sampling
- Keep --target-requests for backwards compat
- No time compression (preserve original arrival pattern)

bench.sh: remove --time-scale and --max-inflight-sessions args

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:43:41 +08:00
05592e6adc Agentic workload PD separation analysis with trace-driven benchmarks
Systematic study of prefill-decode disaggregation for agentic LLM workloads
using production GLM-5.1 coder trace (2.1M requests, 71B input tokens).

Key findings:
- Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7%
  without PD separation, matching PD-Sep's decode isolation benefit
- PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain
  when using the same cache-aware scheduler
- Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x
  vs decode AI <2), but absolute FLOPs drop 71% from cache hits
- For agentic MoE workloads, cache-aware routing > PD separation

Infrastructure:
- Trace sampler preserving session structure + hash_ids for prefix sharing
- Async trace replayer with streaming TTFT/TPOT/E2E measurement
- Unified cache-aware + token-level load-balanced global scheduler proxy
  supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes
- vLLM 0.18.1 scheduler patch for KV transfer abort race condition
- Roofline analysis tool for prefill/decode compute characterization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 21:21:57 +08:00