Benchmark concurrency gap: 1 req/GPU is 10-15x below production

Our --max-inflight-sessions 8 yields 1 req/GPU, masking prefill-decode
interference that appears at 2/GPU (+38% TPOT) and would dominate at
production load (~15/GPU). Updated §8 to re-evaluate elastic PS at
production concurrency. Next step: --max-inflight-sessions 64 benchmark.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:16:20 +08:00
parent fefbd71ca9
commit c8ba666517


@@ -384,59 +384,78 @@ Agentic workloads produce long-lived sessions with growing context:
Top 5 sessions = 29% of all tokens. With session-sticky, these lock their instances, creating persistent load hotspots.
### 8.3 Can Elastic Prefill Service Help?
### 8.3 Benchmark Concurrency vs Production Reality
> **Critical caveat**: All prior experiments used `--max-inflight-sessions 8` (1 session/GPU). This is **10–15× below production concurrency** and masks the interference that elastic PS is designed to solve.
| | Our Benchmark | Production Estimate |
|--|---------------|---------------------|
| Concurrent requests/GPU | 1–2 | **8–15** |
| KV cache usage/GPU | 26–28% (single req) | **80–100%** |
| Prefill-decode interference | Minimal | **Significant** |
**KV cache capacity**: 281,888 tokens/GPU (25.8 GiB). A single 75k-token request consumes 27% of KV cache. At production concurrency (~15 req/GPU), KV cache is near-full, triggering eviction, cache misses, and prefill queuing, none of which appear in our 1-req/GPU benchmark.
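The capacity arithmetic above can be checked with a short sketch. The 281,888-token capacity and the 75k-token request size come from the text; the function names and the assumption that every concurrent request holds the same full-size context are illustrative only, not part of the benchmark tooling.

```python
# Per-GPU KV cache capacity in tokens, as reported in the text.
KV_CAPACITY_TOKENS = 281_888


def kv_demand_fraction(tokens_per_request: int, requests_per_gpu: int) -> float:
    """KV cache demand as a fraction of capacity.

    Assumes (hypothetically) that every concurrent request holds the same
    number of tokens. Values above 1.0 mean the cache is oversubscribed.
    """
    return tokens_per_request * requests_per_gpu / KV_CAPACITY_TOKENS


def max_resident_requests(tokens_per_request: int) -> int:
    """How many equally sized requests fit in the KV cache at once."""
    return KV_CAPACITY_TOKENS // tokens_per_request


if __name__ == "__main__":
    print(f"1 x 75k request:  {kv_demand_fraction(75_000, 1):.0%} of KV cache")
    print(f"15 x 75k requests: {kv_demand_fraction(75_000, 15):.0%} of KV cache")
    print(f"75k requests that fit at once: {max_resident_requests(75_000)}")
```

Under this uniform-size assumption only a handful of 75k-token requests fit at once, so ~15 concurrent requests per GPU would only stay near-full (rather than heavily oversubscribed) if most of them are much smaller than 75k tokens or share cached prefixes.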
**Measured interference scaling**:
| Concurrency | TPOT p90 | vs 8-session |
|-------------|----------|-------------|
| 8 sessions (1/GPU) | 0.075s | baseline |
| 16 sessions (2/GPU) | 0.103s | **+38%** |
| Production (~15/GPU) | *not tested* | *expected >>+45%* |
### 8.4 Elastic PS: Two Capabilities Re-Evaluated
**Capability 1: Reduce prefill-decode interference (lower TPOT)**
| Concurrency | TPOT p90 | Interference |
|-------------|----------|--------------|
| 8 sessions (1/GPU) | 0.073s | Minimal, TPOT uniform across instances |
| 16 sessions (2/GPU) | 0.106s (+45%) | Significant, MEDIUM TPOT +163% |
Verdict: **No benefit at ≤1 session/GPU** (current setup). Potential benefit at higher concurrency where chunked prefill interrupts decode steps.
At 1 req/GPU (our benchmark): no interference, no benefit. But this is an artifact of unrealistically low concurrency. At 2 req/GPU, chunked prefill interrupts decode steps, raising TPOT by 38–45%. At production concurrency (~15/GPU), multiple HEAVY prefills sharing a GPU with decode requests would cause severe interference. Elastic PS's ability to isolate heavy prefills on separate GPUs directly addresses this.
**Capability 2: Session migration for load balancing**
Elastic PS enables mid-session migration: prefill on original instance (cache hit), KV transfer to new instance for decode + future turns.
Elastic PS enables mid-session migration: prefill on original instance (cache hit), KV transfer to a different instance for decode + future turns. This provides two benefits:
Simulation of migration strategies (1000 req):
1. **Break session lock-in**: Agentic sessions grow +2k tokens/turn over 30+ turns. With session-sticky, a 36-turn session (2.3M tokens total) locks its GPU, creating a hotspot. Elastic PS lets the session migrate to a less-loaded GPU while preserving cache on the original (PS does fast cached prefill, new GPU decodes).
2. **Rebalance without cache loss**: Unlike breaking affinity (which destroys cache), elastic PS migration preserves the prefix cache on the original instance: the PS re-uses it for fast prefill, then transfers only the new KV to the destination.
Simulation of migration strategies (1000 req, at current low concurrency):
| Strategy | Imbalance | Migrations | KV Transfer Overhead |
|----------|-----------|------------|---------------------|
| No migration | 1.24× | 0 | 0s |
| Every 10 turns | 1.21× | 10 | 15s |
| Every 5 turns | 1.20× | 20 | 30s |
| Every 1 turn | 1.18× | 38 | 57s |
Even aggressive every-turn migration only reduces imbalance from 1.24× to 1.18×: diminishing returns at ~1.5s overhead per migration event.
At 1 req/GPU, migration benefit is marginal (imbalance is only 1.24×). At production concurrency where imbalance combines with KV cache pressure and interference, the benefit would be substantially larger.
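The diminishing returns in the migration table can be made explicit with a small calculation. The rows below are copied from the table; the variable names and the "seconds of KV transfer per 0.01 of imbalance removed" metric are illustrative choices, not quantities reported by the simulation.

```python
# (strategy, imbalance ratio, migrations, total KV transfer overhead in seconds)
# Values copied from the simulation table (1000 requests, low concurrency).
ROWS = [
    ("No migration", 1.24, 0, 0.0),
    ("Every 10 turns", 1.21, 10, 15.0),
    ("Every 5 turns", 1.20, 20, 30.0),
    ("Every 1 turn", 1.18, 38, 57.0),
]

BASELINE_IMBALANCE = ROWS[0][1]


def per_migration_cost(migrations: int, overhead_s: float) -> float:
    """Average KV transfer time per migration event."""
    return overhead_s / migrations


def cost_per_imbalance_point(imbalance: float, overhead_s: float) -> float:
    """Seconds of transfer spent per 0.01 reduction in the imbalance ratio."""
    reduction_points = round((BASELINE_IMBALANCE - imbalance) * 100)
    return overhead_s / reduction_points


if __name__ == "__main__":
    for name, imbalance, migrations, overhead_s in ROWS[1:]:
        print(
            f"{name}: {per_migration_cost(migrations, overhead_s):.1f}s/migration, "
            f"{cost_per_imbalance_point(imbalance, overhead_s):.1f}s per 0.01 imbalance"
        )
```

Each migration costs the same on average, but every additional 0.01 of imbalance removed costs progressively more transfer time, which is what "diminishing returns" means at this concurrency.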
**Capability 3 (unexpected): Soft affinity replaces hard stickiness**
**Capability 3: Soft affinity from cache-hit scoring**
The corrected LMetric experiment (§3.4) reveals a key insight: **explicit session affinity is unnecessary**. LMetric's cache-hit estimation (`new_tokens = input - cached`) creates implicit soft affinity: instances with cached prefixes score lower, naturally attracting subsequent turns. This achieves the same APC and latency as explicit session-sticky routing, while providing better load balancing automatically.
The corrected LMetric experiment (§3.4) reveals that **explicit session affinity is unnecessary**. Cache-hit scoring (`new_tokens = input - cached`) creates implicit soft affinity: instances with cached prefixes score lower, naturally attracting subsequent turns. This matches hard session-sticky on all metrics (< 2% difference) while providing more flexible load balancing.
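A minimal sketch of how cache-hit scoring produces soft affinity, assuming the score is just the number of uncached tokens. The `Instance` class, its `cached_prefix_tokens` map, and `pick_instance` are hypothetical names for illustration; the actual LMetric score presumably also includes load terms, which are omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical router-side view of one serving instance."""
    name: str
    # session_id -> number of prefix tokens believed to be cached there
    cached_prefix_tokens: dict[str, int] = field(default_factory=dict)


def new_tokens(instance: Instance, session_id: str, input_tokens: int) -> int:
    """Tokens that must actually be prefilled: input minus cached prefix."""
    cached = min(instance.cached_prefix_tokens.get(session_id, 0), input_tokens)
    return input_tokens - cached


def pick_instance(instances: list[Instance], session_id: str, input_tokens: int) -> Instance:
    """Route to the instance with the lowest score (fewest new tokens).

    No explicit session-to-instance pinning is stored: affinity emerges
    because the instance holding the prefix scores lowest.
    """
    return min(instances, key=lambda inst: new_tokens(inst, session_id, input_tokens))


if __name__ == "__main__":
    a = Instance("gpu0")
    b = Instance("gpu1", cached_prefix_tokens={"s1": 40_000})
    # Turn N+1 of session s1 has a 42k-token input, 40k of it cached on gpu1.
    print(pick_instance([a, b], "s1", 42_000).name)
```

Because the affinity is only a score, a router that adds a load term to it can still send a turn elsewhere when the cached instance is overloaded, which hard stickiness cannot do.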
### 8.4 Elastic PS Verdict
### 8.5 Elastic PS Verdict
| Aspect | Benefit | Cost | Net |
|--------|---------|------|-----|
| TPOT reduction | 0% at 1/GPU | Mooncake overhead | **Negative** |
| Session migration | 1.24× → 1.18× imbalance | 1.5s/migration KV transfer | **Marginal** |
| Load balance | Already achieved by soft affinity (LMetric) | N/A | **Not needed** |
| Aspect | At 1 req/GPU (tested) | At production load (expected) |
|--------|----------------------|-------------------------------|
| TPOT reduction | 0% (no interference) | **Significant** (interference scales with concurrency) |
| Session migration | Marginal (1.24× → 1.20×) | **Larger benefit** (KV pressure + interference amplify imbalance) |
| Cache preservation | N/A | **Key advantage** vs breaking affinity |
Elastic PS is not justified for single-machine agentic workloads at moderate concurrency. The dominant optimization (cache-aware routing, -60% TTFT) works via soft affinity without any KV transfer. The remaining load imbalance (1.24×) is too small for migration to overcome KV transfer costs.
**At our benchmark concurrency (1 req/GPU), elastic PS is not justified**: Mooncake overhead exceeds the non-existent interference benefit. **But our benchmark is 10–15× below production load.** The real question is whether elastic PS helps at production-realistic concurrency (64–128 concurrent sessions, 8–15 req/GPU), where:
- Prefill-decode interference is measurable (already +38% TPOT at just 2/GPU)
- KV cache pressure causes eviction and queue delays
- Session accumulation creates compounding hotspots
- Heavy prefills (50–100k tokens) block decode for seconds
**When elastic PS could become justified:**
- Multi-machine deployment (no shared GPU memory competition)
- Higher concurrency (>1 session/GPU sustained) where prefill-decode interference is measurable
- Cheaper KV transfer (layerwise pipelining, not available in Mooncake 0.3.10)
**Next step: run `--max-inflight-sessions 64` benchmark** to test elastic PS at production-realistic concurrency.
## 9. Conclusions & Next Steps
### Established findings:
1. Full PD separation is **net negative** for single-machine agentic workloads (KV cache memory wall)
2. Cache-aware routing is the **dominant optimization** (+24pp APC, -60% TTFT vs round-robin)
3. **Explicit session affinity is unnecessary**: cache-hit scoring in LMetric creates implicit soft affinity that matches hard session-sticky on all metrics (< 2% difference)
4. **Elastic P2P offload does NOT improve single-machine performance**: Mooncake overhead outweighs both interference reduction and load-balancing benefits
5. **GPU load imbalance** from session accumulation is moderate at scale (1.24× at 1000 req) and does not affect TPOT; only TTFT on heavy instances (1.7× gap)
3. **Explicit session affinity is unnecessary**: cache-hit scoring creates implicit soft affinity that matches hard session-sticky (< 2% difference)
4. At low concurrency (1 req/GPU), elastic P2P offload adds overhead without benefit
5. **Our benchmark concurrency is 10–15× below production**: `--max-inflight-sessions 8` yields 1 req/GPU, masking prefill-decode interference that appears at 2 req/GPU (+38% TPOT) and would dominate at production load (~15 req/GPU)
6. **Experimental methodology matters**: warm vs fresh instances cause 2× TTFT difference
### Lessons learned:
@@ -444,13 +463,14 @@ Elastic PS is not justified for single-machine agentic workloads at moderate con
- Prior LMetric implementation was invalid: it incorrectly shared session-sticky logic with Linear
- `kv_role=kv_both` has non-trivial always-on overhead even when P2P transfer is not used
- Experiment isolation (kill all → verify GPU free → fresh start) is critical for reproducibility
- **Benchmark concurrency must match production**: sub-production concurrency hides interference effects that drive system design decisions
### Open problems:
1. **Higher concurrency regime**: At 2+ sessions/GPU, prefill-decode interference becomes significant (+45% TPOT). Does elastic PS help there?
### Open problems (priority ordered):
1. **Production-concurrency benchmark** (`--max-inflight-sessions 64+`): Validate whether prefill-decode interference at 8–15 req/GPU makes elastic PS net-positive
2. **Multi-machine elastic**: P on a different node eliminates GPU memory competition, the main cost that makes single-machine elastic net negative
3. **Router state accuracy**: proxy shadow state vs vLLM-internal exact state (TODO: vLLM Redis router)
4. **Layerwise KV transfer**: Mooncake's block-level transfer after full prefill is the bottleneck. Layerwise pipelining could reduce transfer latency by overlapping with computation
3. **Layerwise KV transfer**: Mooncake's block-level transfer after full prefill is the bottleneck. Layerwise pipelining could reduce transfer latency by overlapping with computation
4. **Router state accuracy**: proxy shadow state vs vLLM-internal exact state (TODO: vLLM Redis router)
---
*Updated 2026-05-23. LMetric corrected (§3.4 errata). GPU imbalance + elastic PS analysis added (§8).*
*Updated 2026-05-23. LMetric corrected (§3.4 errata). GPU imbalance analysis added (§8). Benchmark concurrency gap identified — production-load experiments needed.*