Gahow Wang c8ba666517 Benchmark concurrency gap: 1 req/GPU is 10-15x below production
Our --max-inflight-sessions 8 yields 1 req/GPU, masking prefill-decode
interference that appears at 2/GPU (+38% TPOT) and would dominate at
production load (~15/GPU). Updated §8 to re-evaluate elastic PS at
production concurrency. Next step: --max-inflight-sessions 64 benchmark.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:16:20 +08:00
Description
No description provided
48 MiB
Languages
Python 82.9%
Shell 17.1%