Elastic PS eval: near-neutral, offload gate triggers only 14% of HEAVY

Root cause: 75% of HEAVY requests are cold (cache_ratio=0%), failing the cache_ratio>=0.3 gate. Only 17/118 HEAVY offloaded, insufficient to reduce prefill-decode interference. Offloaded requests are 50% SLOWER due to P-side queuing (14.7s) + RDMA overhead (5.7s). Interference IS real: 89% of WARM/MEDIUM have 1+ concurrent HEAVY prefill. But elastic PS in current form can't address it because cold HEAVY prefills (the majority) can't benefit from offload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 16:49:25 +08:00
parent 03e88b30bd
commit 9835d6af5d
1 changed files with 31 additions and 0 deletions
--- a/REPORT.md
+++ b/REPORT.md
@@ -271,6 +271,37 @@ The elastic numbers on dash1 were genuinely fresh. The "improvement" was actuall

 **Output**: `outputs/baseline_r0015_st30/` on dash0.

+### 3.7 Elastic PS vs Baseline (production-realistic trace)
+
+850 requests, `w600_r0.0015_st30.jsonl`, peak QPS≈1.6. Baseline on dash0, elastic on dash1.
+
+| Metric | Baseline | Elastic PS | Delta |
+|--------|----------|-----------|-------|
+| TTFT mean | 4.35s | 4.01s | -7.8% |
+| TTFT p50 | 0.94s | 0.93s | -1% |
+| TPOT p50 | 0.070 | 0.071 | +2% |
+| TPOT p90 | 0.162 | 0.157 | -3.1% |
+| E2E p50 | 6.38s | 6.44s | +0.9% |
+| APC mean | 60.7% | 59.9% | -0.8pp |
+| Errors | 0/850 | 4/832 | 4 ReadTimeout |
+
+**Elastic PS is near-neutral.** Root cause analysis:
+
+**Problem 1: Offload gate too restrictive** — only 17/118 HEAVY requests (14%) were offloaded. 75% of HEAVY requests had `cache_ratio=0%` (cold Turn 1), failing the `cache_ratio >= 0.3` gate. The gate was designed to avoid offloading cold requests (full prefill on P is slower than co-located), but this means 86% of HEAVY prefills still interfere with decode.
+
+**Problem 2: Offloaded requests are slower (+50.6%)** — HEAVY_OFFLOAD TTFT=19.94s vs HEAVY_COLO=13.25s. Breakdown:
+- Prefill on P: 14.72s (P also queued, no faster than co-located)
+- KV transfer + decode start on D: 5.71s (pure overhead)
+
+**Interference is real but unaddressed**: 89% of WARM/MEDIUM requests ran concurrently with 1+ HEAVY prefills (up to 60 concurrent). Elastic PS only offloaded 17/118 HEAVY requests — insufficient to reduce interference.
+
+**Conclusion**: The offload gate (`cache_ratio >= 0.3`) is correct in principle (cold offload IS slower), but leaves the core problem unsolved. Reducing prefill-decode interference requires either:
+1. Offloading ALL heavy prefills (accepting higher TTFT for offloaded requests in exchange for lower TPOT for all)
+2. Chunked prefill scheduling that yields to decode (vLLM-side optimization)
+3. Dedicated prefill GPUs (full PD separation) if KV memory wall can be solved
+
+**Output**: `outputs/eval_baseline_linear/` on dash0, `outputs/eval_elastic_linear/` on dash1.
+
 ## 4. System-Level Analysis

 ### 4.1 Elastic P2P Does Not Improve Single-Machine Performance