Elastic PS eval: near-neutral, offload gate triggers only 14% of HEAVY

Root cause: 75% of HEAVY requests are cold (cache_ratio=0%), failing the
cache_ratio>=0.3 gate. Only 17/118 HEAVY offloaded, insufficient to reduce
prefill-decode interference. Offloaded requests are 50% SLOWER due to
P-side queuing (14.7s) + RDMA overhead (5.7s).

Interference IS real: 89% of WARM/MEDIUM have 1+ concurrent HEAVY prefill.
But elastic PS in current form can't address it because cold HEAVY prefills
(the majority) can't benefit from offload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-23 16:49:25 +08:00
parent 03e88b30bd
commit 9835d6af5d

View File

@@ -271,6 +271,37 @@ The elastic numbers on dash1 were genuinely fresh. The "improvement" was actuall
**Output**: `outputs/baseline_r0015_st30/` on dash0.
### 3.7 Elastic PS vs Baseline (production-realistic trace)
850 requests, `w600_r0.0015_st30.jsonl`, peak QPS1.6. Baseline on dash0, elastic on dash1.
| Metric | Baseline | Elastic PS | Delta |
|--------|----------|-----------|-------|
| TTFT mean | 4.35s | 4.01s | -7.8% |
| TTFT p50 | 0.94s | 0.93s | -1% |
| TPOT p50 | 0.070 | 0.071 | +2% |
| TPOT p90 | 0.162 | 0.157 | -3.1% |
| E2E p50 | 6.38s | 6.44s | +0.9% |
| APC mean | 60.7% | 59.9% | -0.8pp |
| Errors | 0/850 | 4/832 | 4 ReadTimeout |
**Elastic PS is near-neutral.** Root cause analysis:
**Problem 1: Offload gate too restrictive** only 17/118 HEAVY requests (14%) were offloaded. 75% of HEAVY requests had `cache_ratio=0%` (cold Turn 1), failing the `cache_ratio >= 0.3` gate. The gate was designed to avoid offloading cold requests (full prefill on P is slower than co-located), but this means 86% of HEAVY prefills still interfere with decode.
**Problem 2: Offloaded requests are slower (+50.6%)** HEAVY_OFFLOAD TTFT=19.94s vs HEAVY_COLO=13.25s. Breakdown:
- Prefill on P: 14.72s (P also queued, no faster than co-located)
- KV transfer + decode start on D: 5.71s (pure overhead)
**Interference is real but unaddressed**: 89% of WARM/MEDIUM requests ran concurrently with 1+ HEAVY prefills (up to 60 concurrent). Elastic PS only offloaded 17/118 HEAVY requests insufficient to reduce interference.
**Conclusion**: The offload gate (`cache_ratio >= 0.3`) is correct in principle (cold offload IS slower), but leaves the core problem unsolved. Reducing prefill-decode interference requires either:
1. Offloading ALL heavy prefills (accepting higher TTFT for offloaded requests in exchange for lower TPOT for all)
2. Chunked prefill scheduling that yields to decode (vLLM-side optimization)
3. Dedicated prefill GPUs (full PD separation) if KV memory wall can be solved
**Output**: `outputs/eval_baseline_linear/` on dash0, `outputs/eval_elastic_linear/` on dash1.
## 4. System-Level Analysis
### 4.1 Elastic P2P Does Not Improve Single-Machine Performance