Elastic PS eval: near-neutral, offload gate triggers only 14% of HEAVY

Root cause: 75% of HEAVY requests are cold (cache_ratio=0%), failing the
cache_ratio>=0.3 gate. Only 17/118 HEAVY offloaded, insufficient to reduce
prefill-decode interference. Offloaded requests are 50% SLOWER due to
P-side queuing (14.7s) + RDMA overhead (5.7s).

Interference IS real: 89% of WARM/MEDIUM have 1+ concurrent HEAVY prefill.
But elastic PS in current form can't address it because cold HEAVY prefills
(the majority) can't benefit from offload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
31
REPORT.md
@@ -271,6 +271,37 @@ The elastic numbers on dash1 were genuinely fresh. The "improvement" was actuall

**Output**: `outputs/baseline_r0015_st30/` on dash0.

### 3.7 Elastic PS vs Baseline (production-realistic trace)

850 requests, `w600_r0.0015_st30.jsonl`, peak QPS≈1.6. Baseline on dash0, elastic on dash1.

| Metric | Baseline | Elastic PS | Delta |
|--------|----------|-----------|-------|
| TTFT mean | 4.35s | 4.01s | -7.8% |
| TTFT p50 | 0.94s | 0.93s | -1% |
| TPOT p50 | 0.070 | 0.071 | +2% |
| TPOT p90 | 0.162 | 0.157 | -3.1% |
| E2E p50 | 6.38s | 6.44s | +0.9% |
| APC mean | 60.7% | 59.9% | -0.8pp |
| Errors | 0/850 | 4/832 | 4 ReadTimeout |
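The relative deltas can be recomputed from the table as a sanity check. The following is a minimal sketch that only uses the rounded values printed above; the helper name `relative_delta_pct` is hypothetical and not part of the evaluation scripts. Because the inputs are rounded, a recomputed delta can differ slightly from the Delta column, which was presumably derived from unrounded measurements (TPOT p50 is the visible case).

```python
def relative_delta_pct(baseline, elastic):
    """Relative change of Elastic PS vs. baseline, in percent."""
    return (elastic - baseline) / baseline * 100.0

# (baseline, elastic) pairs copied from the table; rounded inputs only.
metrics = {
    "TTFT mean (s)": (4.35, 4.01),
    "TTFT p50 (s)": (0.94, 0.93),
    "TPOT p50": (0.070, 0.071),
    "TPOT p90": (0.162, 0.157),
    "E2E p50 (s)": (6.38, 6.44),
}

deltas = {
    name: round(relative_delta_pct(base, elas), 1)
    for name, (base, elas) in metrics.items()
}

# APC is a percentage already, so its delta is in percentage points.
apc_delta_pp = round(59.9 - 60.7, 1)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
print(f"APC mean: {apc_delta_pp:+.1f}pp")
```

All relative deltas stay within a few percent except mean TTFT, which is consistent with the near-neutral reading of the table.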

**Elastic PS is near-neutral.** Root cause analysis:

**Problem 1: Offload gate too restrictive.** Only 17/118 HEAVY requests (14%) were offloaded. 75% of HEAVY requests had `cache_ratio=0%` (cold Turn 1) and therefore failed the `cache_ratio >= 0.3` gate. The gate was designed to avoid offloading cold requests, because a full prefill on P is slower than a co-located one, but as a result 86% of HEAVY prefills (101/118) still run co-located and interfere with decode.
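The effect of the gate can be illustrated with a small sketch. It assumes the gate is a single threshold on `cache_ratio` applied to HEAVY requests; the `Request` record, `should_offload`, and `offload_rate` are hypothetical names for illustration, not the actual Elastic PS code, and the request list is synthetic data chosen to reproduce the 17/118 split.

```python
from dataclasses import dataclass

CACHE_RATIO_GATE = 0.3  # threshold quoted in the report

@dataclass
class Request:
    req_id: int
    cls: str            # "HEAVY", "MEDIUM" or "WARM" (hypothetical labels)
    cache_ratio: float  # fraction of the prompt already cached, 0.0 to 1.0

def should_offload(req, threshold=CACHE_RATIO_GATE):
    """Assumed gate: offload only HEAVY requests with enough cached prefix."""
    return req.cls == "HEAVY" and req.cache_ratio >= threshold

def offload_rate(requests):
    """Fraction of HEAVY requests that pass the gate."""
    heavy = [r for r in requests if r.cls == "HEAVY"]
    if not heavy:
        return 0.0
    return sum(should_offload(r) for r in heavy) / len(heavy)

# Synthetic population: 17 warm HEAVY requests pass, 101 do not.
requests = (
    [Request(i, "HEAVY", 0.5) for i in range(17)]
    + [Request(17 + i, "HEAVY", 0.0) for i in range(101)]
)
rate = offload_rate(requests)
print(f"offloaded: {rate:.1%} of HEAVY")
```

With a mostly cold HEAVY population, any threshold above zero excludes the majority of the requests that cause the interference.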

**Problem 2: Offloaded requests are slower (+50.6%).** HEAVY_OFFLOAD TTFT=19.94s vs HEAVY_COLO=13.25s. Breakdown:
- Prefill on P: 14.72s (P is also queued, so prefill there is no faster than co-located)
- KV transfer + decode start on D: 5.71s (pure overhead)

**Interference is real but unaddressed**: 89% of WARM/MEDIUM requests ran concurrently with at least one HEAVY prefill (up to 60 concurrent). With only 17/118 HEAVY requests offloaded, Elastic PS removes too few co-located prefills to reduce that interference.
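One way to measure this kind of overlap is to count, for each WARM/MEDIUM request, how many HEAVY prefill intervals intersect its lifetime. The sketch below assumes each request is reduced to a `(start, end)` pair in seconds; the function names and the toy intervals are hypothetical and do not come from the actual trace or analysis script.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end

def concurrent_heavy_counts(light_requests, heavy_prefills):
    """For each light (WARM/MEDIUM) request, count overlapping HEAVY prefills."""
    return [
        sum(overlaps(start, end, h_start, h_end) for h_start, h_end in heavy_prefills)
        for start, end in light_requests
    ]

def interfered_fraction(light_requests, heavy_prefills):
    """Fraction of light requests that overlap at least one HEAVY prefill."""
    counts = concurrent_heavy_counts(light_requests, heavy_prefills)
    if not counts:
        return 0.0
    return sum(c >= 1 for c in counts) / len(counts)

# Toy data: three light requests and three HEAVY prefill windows.
light = [(0, 5), (10, 12), (20, 30)]
heavy = [(4, 8), (25, 26), (27, 40)]

counts = concurrent_heavy_counts(light, heavy)
fraction = interfered_fraction(light, heavy)
print(counts, f"{fraction:.0%}")
```

This pairwise scan is quadratic in the number of requests, which is acceptable for a trace of a few hundred requests; a sweep over sorted interval endpoints would be the usual choice for larger traces.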

**Conclusion**: The offload gate (`cache_ratio >= 0.3`) is correct in principle (cold offload IS slower), but it leaves the core problem unsolved. Reducing prefill-decode interference requires one of the following:
1. Offloading ALL heavy prefills (accepting higher TTFT for offloaded requests in exchange for lower TPOT for all)
2. Chunked prefill scheduling that yields to decode (vLLM-side optimization)
3. Dedicated prefill GPUs (full PD separation), if the KV memory wall can be solved

**Output**: `outputs/eval_baseline_linear/` on dash0, `outputs/eval_elastic_linear/` on dash1.

## 4. System-Level Analysis

### 4.1 Elastic P2P Does Not Improve Single-Machine Performance