The replayer was artificially limiting concurrency with --max-inflight-sessions
(semaphore) and --time-scale (time compression), producing unrealistically low
1 req/GPU load that masked prefill-decode interference.
Replayer changes:
- Remove session_sem and time_scale entirely
- Each request dispatched at its trace timestamp exactly
- Sessions still sequential (turn N+1 waits for turn N completion)
- If turn completes late, next turn fires immediately
Sampler changes:
- Add --sample-ratio for GPU-proportional session sampling
- Keep --target-requests for backwards compat
- No time compression (preserve original arrival pattern)
bench.sh: remove --time-scale and --max-inflight-sessions args
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>