Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired after cc6e562 / 4c583f2; point readers to docs/migration-policy-design.md.
- docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison".
- analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity.
- analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction.
- analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Milestone Report: Elastic P2P vs PD-Combined Baseline

**Date**: 2026-05-22
**Author**: Gahow Wang
**Status**: Phase 1 complete: baseline + elastic validated, system-level analysis done

---

## 1. Research Question

For agentic LLM workloads (long input, short output, high KV cache reuse), is prefill-decode disaggregation beneficial? If full PD separation hurts (proven in §3), can **selective** disaggregation of only heavy requests improve serving latency while preserving KV cache locality?

## 1.1 Errata / Superseded sections

> This report has been revised several times as the methodology matured.
> The sections below are kept for historical context but their numerical
> conclusions have been **superseded**; do not cite them in isolation.
>
> - **§3.1 (initial PD-sep vs PD-combined)**: ran with the old random
>   sampler + `--time-scale` compression + `--max-inflight-sessions 8`.
>   Cross-session KV reuse dropped from 52% to 16%, and per-GPU concurrency
>   was capped at 1 req/GPU. Superseded by §3.6.
> - **Earlier "elastic v3" warm-vs-fresh runs**: baselines were not
>   restarted between trials, leaving residual KV cache that inflated
>   baseline TTFT ~2×. Superseded by the cold-start results in §3.6/§3.7.
> - **Any reference to running `--max-inflight-sessions 64+`**: that flag
>   was removed when replay moved to trace-driven dispatch. The next-step
>   experiment requires restoring the flag first (see `FIXES.md` §B2
>   route A) before any production-concurrency numbers can be produced.
> - **§3.9 "Final Design" framing**: the single-argmin + PUSH-migration
>   design was retired after `cc6e562` / `4c583f2` showed forced and
>   relaxed-gate migration variants both regressed E2E tail. Current
>   policy is the hybrid LMetric + high-cache affinity landed in
>   `255c8e6`. See the per-section note in §3.9 and the active algorithm
>   in `docs/migration-policy-design.md`.
>
> The authoritative results are in **§3.6 and §3.7**.

## 2. Experimental Setup

### 2.1 Hardware

| Resource | Spec |
|----------|------|
| Machine | dash0 / dash1 (identical config) |
| GPU | 8× NVIDIA H20 96GB HBM, NVLink |
| Network | 4× ConnectX-7 200Gbps RDMA |
| Storage | cpfs shared storage across machines |

### 2.2 Software

| Component | Version | Notes |
|-----------|---------|-------|
| vLLM | 0.18.1 (source in `third_party/vllm/`) | Patched scheduler assert (see `patches/`) |
| Mooncake | 0.3.10 | RDMA-based KV transfer between instances |
| Python | 3.x managed by `uv` | `.venv/` at project root |
| Model | `Qwen3-Coder-30B-A3B-Instruct` | MoE 128 experts top-8, 3B active params |
| Model path | `~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct` | Same on dash0 and dash1 |

### 2.3 Workload Trace

| Property | Value |
|----------|-------|
| Source | GLM-5.1 Agentic Coder, production cluster, 2h window |
| Raw trace | `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` on dash0 |
| Total requests | 2,114,220 |
| Avg input tokens | 33,600 (p50=20k, p90=88k) |
| Avg output tokens | 445 (p50=80) |
| I/O ratio | 75.6× aggregate |
| Prefill token share | 98% |
| KV reuse breakdown | 62% intra-session, 38% cross-session (token-level) |
| Theoretical max APC | 67% (infinite cache, single instance, prefix-only) |

**Sampled trace for benchmarks**: `traces/w600_r0.0015_st30.jsonl` (1214 requests, 688 sessions, 70% multi-turn). Generated with window+thin sampling:

```bash
python scripts/sample_trace.py \
    --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/w600_r0.0015_st30.jsonl \
    --sample-ratio 0.0015 --max-single-turn-ratio 0.3 \
    --window-seconds 600 --seed 42
```

| Trace property | Value |
|---------------|-------|
| Sessions | 688 (70% multi-turn, avg 4.9 turns) |
| Requests | 1214 (use `--request-limit 850` for daily, full for validation) |
| Avg input tokens | 48,776 |
| Trace span | 2912s (48.5 min); dense segment 0-990s (850 req) |
| Peak QPS | 1.6 req/s (in dense segment) |
| Hash block sharing | 48.3% (vs 52% full trace) |
| Theoretical APC | 80% (full), 76% (first 850 req) |

> **Sampling methodology (2026-05-23)**: Prior traces used random session sampling + `--time-scale` compression + `--max-inflight-sessions` semaphore, which (a) destroyed cross-session hash block sharing (52% → 16%), (b) artificially limited concurrency to 1 req/GPU, and (c) masked prefill-decode interference. The new approach uses contiguous time-window sampling with session thinning (`--max-single-turn-ratio 0.3`) to preserve KV reuse patterns, and trace-driven replay with no artificial concurrency limits.

### 2.4 Two Configurations Compared

#### Baseline: PD-Combined (8× TP=1 DP=8)

```
8 independent vLLM instances, 1 GPU each, no Mooncake.
All instances do both prefill and decode.
Global scheduler (cache_aware_proxy.py --combined) handles:
- Session-sticky routing (multi-turn → same instance)
- Load-aware override (if pinned instance > 2× avg load, redirect)
- Cache-hit scoring (prefer instance with matching prefix blocks)
```

Launch:

```bash
# On dash0:
for i in $(seq 0 7); do
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
    vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --port $((8000+i)) --tp 1 \
        --enable-prefix-caching --enforce-eager \
        --gpu-memory-utilization 0.9 --max-model-len 200000 \
        > /tmp/ab_base_$i.log 2>&1 &
done

python scripts/cache_aware_proxy.py \
    --combined http://127.0.0.1:800{0..7} --port 9090
```

#### Elastic P2P Offload (8× TP=1 kv_both + selective offload)

```
8 independent vLLM instances, 1 GPU each, all kv_role=kv_both (Mooncake).
Same global scheduler, plus elastic offload logic:
- Proxy classifies each request: WARM (<5k new), MEDIUM (5-20k), HEAVY (>20k)
- WARM/MEDIUM: co-located on session-sticky instance (no KV transfer)
- HEAVY: prefill on a different instance (P), KV via Mooncake RDMA,
  decode on session-sticky instance (D)
- Cap: max 4 concurrent offloads (MAX_OFFLOAD_INFLIGHT)
- P instance selection: round-robin with overload skip
```

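The classification step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic in `cache_aware_proxy.py`: the names `RequestClass`, `classify`, and `may_offload` are hypothetical, "new" tokens are assumed to mean input tokens minus tokens already cached on the sticky instance, and a request with exactly 20k new tokens is assumed to count as MEDIUM.

```python
from enum import Enum


class RequestClass(Enum):
    WARM = "WARM"
    MEDIUM = "MEDIUM"
    HEAVY = "HEAVY"


# Thresholds taken from the description above (new, i.e. uncached, tokens).
WARM_LIMIT = 5_000
HEAVY_THRESHOLD = 20_000   # matches --heavy-threshold 20000
MAX_OFFLOAD_INFLIGHT = 4   # cap on concurrent offloads


def classify(input_tokens: int, cached_tokens: int) -> RequestClass:
    """Classify a request by the number of tokens that still need prefill."""
    new_tokens = max(input_tokens - cached_tokens, 0)
    if new_tokens < WARM_LIMIT:
        return RequestClass.WARM
    if new_tokens <= HEAVY_THRESHOLD:
        return RequestClass.MEDIUM
    return RequestClass.HEAVY


def may_offload(request_class: RequestClass, inflight_offloads: int) -> bool:
    """Only HEAVY requests are offload candidates, and only below the cap."""
    return (
        request_class is RequestClass.HEAVY
        and inflight_offloads < MAX_OFFLOAD_INFLIGHT
    )
```

For example, a 30k-token turn with 28k tokens already cached has 2k new tokens and stays co-located as WARM, while a cold 60k-token request is HEAVY and becomes an offload candidate as long as fewer than four offloads are in flight.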

Launch:

```bash
# On dash1 (or use scripts/launch_elastic_p2p.sh):
for i in $(seq 0 7); do
    VLLM_MOONCAKE_BOOTSTRAP_PORT=$((8998+i)) \
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
    vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --port $((8000+i)) --tp 1 \
        --enable-prefix-caching --enforce-eager \
        --gpu-memory-utilization 0.9 --max-model-len 200000 \
        --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
        > /tmp/ab_elastic_$i.log 2>&1 &
    sleep 2  # stagger to avoid NCCL port collision
done

# Wait for bootstrap servers
for bp in $(seq 8998 9005); do
    until curl -s localhost:$bp/query > /dev/null 2>&1; do sleep 2; done
done

python scripts/cache_aware_proxy.py \
    --combined http://127.0.0.1:800{0..7} \
    --bootstrap-ports 8998,8999,9000,9001,9002,9003,9004,9005 \
    --offload --heavy-threshold 20000 --port 9090
```

### 2.5 Benchmark Parameters

| Parameter | Value |
|-----------|-------|
| Trace | `traces/w600_r0.0015_st30.jsonl` (window+thin, 70% multi-turn) |
| Daily iteration | `--request-limit 850` (~13 min, APC≈76%) |
| Full validation | All 1214 requests (~48 min, APC≈80%) |
| Replay mode | Trace-driven (no session limit, no time compression) |
| Request timeout | 600s |
| vLLM flags | `--enforce-eager --enable-prefix-caching --max-model-len 200000` |
| GPU memory util | 0.9 |
| Fresh restart | Both configs started from cold (no warm cache) |

### 2.6 Reproducing the Benchmark

```bash
# Activate environment
cd ~/agentic-kv && source .venv/bin/activate

# Ensure the sampled trace exists (same command as §2.3, so the output
# matches the trace passed to bench.sh below)
python scripts/sample_trace.py \
    --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/w600_r0.0015_st30.jsonl \
    --sample-ratio 0.0015 --max-single-turn-ratio 0.3 \
    --window-seconds 600 --seed 42

# Run benchmark (daily iteration)
bash scripts/bench.sh --tag my_experiment --mode baseline --policy linear \
    --trace traces/w600_r0.0015_st30.jsonl --requests 850

# Run benchmark (full validation)
bash scripts/bench.sh --tag my_experiment_full --mode baseline --policy linear \
    --trace traces/w600_r0.0015_st30.jsonl
```

## 3. Results

> **Errata (2026-05-22)**: The initial cross-machine A/B (dash0 baseline vs dash1 elastic) reported -44% E2E improvement. Post-hoc analysis revealed the dash0 baseline instances were **not freshly restarted**: residual KV cache from prior experiments caused 2× TTFT inflation.

> **Errata (2026-05-23)**: §3.1 results used artificial concurrency limits (`--max-inflight-sessions 8`, 1 req/GPU) and random session sampling that destroyed cross-session KV sharing (52% → 16%). See §3.6 for production-realistic results with corrected methodology.

### 3.1 Legacy Comparison (artificial 1 req/GPU, 200 req)

| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|---------|
| **Baseline (no Mooncake)** | **198/200** | **1.075s** | **9.384s** | **0.076s** | **5.075s** |
| LMetric routing | 198/200 | 1.099s | 9.392s | 0.073s | 5.205s |
| Elastic P2P (kv_both) | 195/200 | 1.018s | 11.312s | 0.085s | 6.977s |

### 3.2 Per-Class Breakdown

**Baseline (fresh):**

| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|-------|-------|---|----------|----------|----------|
| WARM (<5k) | 46 | 23% | 0.137s | 0.262s | 0.061s |
| MEDIUM (5-20k) | 50 | 25% | 0.921s | 1.846s | 0.079s |
| HEAVY (20-50k) | 64 | 32% | 2.660s | 6.278s | 0.076s |
| HEAVY (>50k) | 38 | 19% | 9.587s | 30.415s | 0.102s |

**Elastic P2P (fresh):**

| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|-------|-------|---|----------|----------|----------|
| WARM (<5k) | 46 | 23% | 0.142s | 0.279s | 0.072s |
| MEDIUM (5-20k) | 50 | 25% | 0.766s | 1.814s | 0.197s |
| HEAVY (>20k) | 99 | 51% | 6.390s | 22.668s | 0.085s |

### 3.3 Success Rate

| Config | OK | Total | Rate | Failure mode |
|--------|-----|-------|------|-------------|
| Baseline | 198 | 200 | 99.0% | RemoteProtocolError (replayer-side) |
| Elastic P2P | 195 | 200 | 97.5% | 2× RemoteProtocolError + 3× ReadTimeout on >60k |

The three extra Elastic errors are D-side KV pull failures: prefill succeeded on P and the KV was pushed to Mooncake, but D never produced a first token because the decode scheduler could not allocate KV cache space. The prefill-timeout fallback (120s, then co-located execution) was never triggered.

### 3.4 Routing Policy: Linear vs LMetric (OSDI'26)

LMetric (`score = P_tokens × BS`, pure per-request, no session affinity) vs Linear (`score = ongoing_tokens - α·cache_hit`, session-sticky). Both fresh-restart, same trace.

> **Errata (2026-05-23)**: Prior LMetric implementation incorrectly shared session-sticky logic with Linear. Fixed to pure per-request routing per OSDI'26 spec: `score = (pending_prefill + new_tokens) × num_requests`, no affinity, no overload override. Results below use corrected implementation.

| Policy | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 | Delta E2E |
|--------|----------|----------|----------|---------|-----------|
| Linear (session-sticky) | 1.073s | 9.347s | 0.073s | 5.119s | n/a |
| LMetric (no affinity) | 1.081s | 9.408s | 0.072s | 5.102s | **-0.3%** |

**Key finding**: LMetric without explicit session affinity matches Linear with session affinity on all metrics (< 2% difference). The cache-hit term in LMetric's scoring (`new_tokens = input - cache_hit`) creates **implicit soft affinity**: instances that already cached a session's prefix get lower P_tokens, naturally attracting subsequent turns. Explicit session-sticky routing is not required; cache-aware load balancing captures it automatically.

APC distribution (LMetric, no affinity): inst_0=60.6%, inst_1=58.3%, inst_2=43.2%, inst_3=28.9%, inst_4=16.6%, inst_5=24.0%, inst_6=13.9%, inst_7=0.0%. Non-uniform but comparable aggregate to Linear's explicit affinity.

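The implicit soft affinity described above can be made concrete with a small sketch. It implements the two score formulas literally as quoted in this section; the `InstanceState` fields, the `pick_lmetric` helper, and the default `alpha=1.0` are hypothetical names and values chosen for illustration, not the proxy's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceState:
    name: str
    pending_prefill_tokens: int  # prefill tokens already queued on the instance
    num_requests: int            # requests currently on the instance (the BS term)
    ongoing_tokens: int          # tokens in flight, used by the Linear score
    cache_hit_tokens: int        # prefix tokens of *this* request cached here


def lmetric_score(inst: InstanceState, input_tokens: int) -> int:
    """score = (pending_prefill + new_tokens) × num_requests; lower is better."""
    new_tokens = input_tokens - inst.cache_hit_tokens
    return (inst.pending_prefill_tokens + new_tokens) * inst.num_requests


def linear_score(inst: InstanceState, alpha: float = 1.0) -> float:
    """score = ongoing_tokens - α·cache_hit; alpha=1.0 is an assumed value."""
    return inst.ongoing_tokens - alpha * inst.cache_hit_tokens


def pick_lmetric(instances: List[InstanceState], input_tokens: int) -> str:
    """Pure per-request argmin, with no session table and no overload override."""
    return min(instances, key=lambda i: lmetric_score(i, input_tokens)).name


# Two equally loaded instances; only inst_0 holds the session's 30k-token prefix.
instances = [
    InstanceState("inst_0", 10_000, 2, 50_000, 30_000),
    InstanceState("inst_1", 10_000, 2, 50_000, 0),
]
chosen = pick_lmetric(instances, input_tokens=40_000)
```

With equal load, the cached prefix alone lowers `inst_0`'s score to 40,000 against 100,000 for `inst_1`, so the follow-up turn lands on the instance that served the earlier turns without any explicit stickiness. Note that in this literal form an instance with `num_requests == 0` always scores 0 regardless of cache state.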

### 3.5 Errata: Why Prior Cross-Machine A/B Was Invalid

The initial comparison (commit `1e86285`) reported:

```
Baseline (dash0): TTFT50=2.383  E2E50=10.232  ← WRONG (warm instances)
Elastic (dash1):  TTFT50=1.315  E2E50=5.708
Delta:            -45%          -44%          ← INVALID
```

**Evidence that prior baseline was not fresh:**

1. `inst_7` APC = 68.3%: impossible from 25 cold-start requests (max ~25%)
2. WARM TTFT p90 = 3.327s (fresh = 0.262s, 12.7× gap): indicates KV cache memory pressure from prior experiments
3. HEAVY TPOT p90 = 0.154s (fresh = 0.076s, 2.0× gap): heavy prefill-decode interference from full KV cache

The elastic numbers on dash1 were genuinely fresh. The "improvement" was actually comparing fresh elastic against degraded baseline.

### 3.6 Production-Realistic Baseline (trace-driven, corrected methodology)

> Corrected sampling (window+thin, 70% multi-turn, block sharing 48%) and trace-driven replay (no session limit, no time compression). See §2.3 for trace details.

**Linear policy, 912 requests (dense segment), peak QPS ≈ 1.6:**

| Metric | Legacy (§3.1, 1 req/GPU) | **New (trace-driven)** | Delta |
|--------|-------------------------|----------------------|-------|
| TTFT mean | 1.07s | **4.54s** | +4.2× |
| TTFT p50 | 1.08s | **0.94s** | -13% |
| TTFT p90 | 9.38s | **14.12s** | +51% |
| TPOT p50 | 0.038s | **0.070s** | **+84%** |
| TPOT p90 | 0.073s | **0.175s** | **+139%** |
| APC (mean) | ~44% | **67.5%** | **+23pp** |
| Errors | 2/200 (1.0%) | 0/912 (0%) | better |
| E2E p50 | 5.08s | 6.98s | +37% |

**Key differences from legacy methodology:**

1. **APC 67.5% vs 44%**: Window+thin sampling preserves cross-session block sharing (48% vs 16% in legacy random sampling), yielding production-realistic cache hit rates. Per-instance APC ranges 46–84%.

2. **TPOT +139% at p90**: With trace-driven replay, multiple concurrent requests per GPU create **real prefill-decode interference**. The legacy 1 req/GPU setup showed TPOT p90=0.073s (no interference), but production-realistic load shows TPOT p90=0.175s. This validates that prefill-decode interference is a real problem at production concurrency.

3. **TTFT p50 improved (-13%) but mean degraded (+4.2×)**: Higher APC means cached requests get very fast TTFT (p50=0.94s). But concurrent heavy prefills cause queuing for non-cached requests, inflating the mean and p90.

4. **Per-instance APC imbalance (46–84%)**: Routing quality directly determines per-instance APC. The 38pp gap between worst and best instance suggests routing optimization is still the highest-leverage improvement.

**Output**: `outputs/baseline_r0015_st30/` on dash0.

### 3.7 Elastic PS vs Baseline (production-realistic trace)

850 requests, `w600_r0.0015_st30.jsonl`, peak QPS≈1.6. Baseline on dash0, elastic on dash1.

| Metric | Baseline | Elastic PS | Delta |
|--------|----------|-----------|-------|
| TTFT mean | 4.35s | 4.01s | -7.8% |
| TTFT p50 | 0.94s | 0.93s | -1% |
| TPOT p50 | 0.070s | 0.071s | +2% |
| TPOT p90 | 0.162s | 0.157s | -3.1% |
| E2E p50 | 6.38s | 6.44s | +0.9% |
| APC mean | 60.7% | 59.9% | -0.8pp |
| Errors | 0/850 | 4/832 | 4 ReadTimeout |

**Elastic PS is near-neutral.** Root cause analysis:

**Problem 1: Offload gate too restrictive.** Only 17/118 HEAVY requests (14%) were offloaded. 75% of HEAVY requests had `cache_ratio=0%` (cold Turn 1), failing the `cache_ratio >= 0.3` gate. The gate was designed to avoid offloading cold requests (full prefill on P is slower than co-located), but this means 86% of HEAVY prefills still interfere with decode.

**Problem 2: Offloaded requests are slower (+50.6%).** HEAVY_OFFLOAD TTFT=19.94s vs HEAVY_COLO=13.25s. Breakdown:

- Prefill on P: 14.72s (P also queued, no faster than co-located)
- KV transfer + decode start on D: 5.71s (pure overhead)

**Interference is real but unaddressed**: 89% of WARM/MEDIUM requests ran concurrently with 1+ HEAVY prefills (up to 60 concurrent). Elastic PS only offloaded 17/118 HEAVY requests, which is insufficient to reduce interference.

**Conclusion**: The offload gate (`cache_ratio >= 0.3`) is correct in principle (cold offload IS slower), but leaves the core problem unsolved. Reducing prefill-decode interference requires either:

1. Offloading ALL heavy prefills (accepting higher TTFT for offloaded requests in exchange for lower TPOT for all)
2. Chunked prefill scheduling that yields to decode (vLLM-side optimization)
3. Dedicated prefill GPUs (full PD separation) if KV memory wall can be solved

**Output**: `outputs/eval_baseline_linear/` on dash0, `outputs/eval_elastic_linear/` on dash1.

### 3.8 Direct KV Cache Migration (Bootstrap-Triggered PUSH)

**Architecture**: D asks C's bootstrap server to PUSH cached KV blocks directly into D's GPU memory via Mooncake RDMA WRITE. C's vLLM scheduler is NOT involved (0 GPU compute on C). D then does local prefill for new tokens + decode.

**Implementation details** (vLLM + Mooncake patches):

1. **Hash table sync** (scheduler → worker → bootstrap): Each step, scheduler computes delta of `BlockPool.cached_block_hash_to_block` and syncs to worker's bootstrap server via `MooncakeConnectorMetadata.hash_table_updates`.

2. **Token-based block lookup**: D sends `POST /push_blocks` with prompt `token_ids` + D's GPU addresses. C's bootstrap computes block hashes using `sha256` + `NONE_HASH` (same hash function as scheduler), matches against synced hash table.

3. **RDMA PUSH**: C's bootstrap calls `TransferEngine.batch_transfer_sync_write` to push matched KV blocks from C's GPU into D's GPU. This uses the existing RDMA WRITE path (proven reliable), not RDMA READ (which fails on `batch_register_memory`'d GPU memory due to missing `IBV_ACCESS_REMOTE_READ` flags).

4. **Cost model**: `offload when colocated_cost + interference > offload_cost`, where `interference = prefill_time × min(num_requests, 3) × 0.3`. Offload triggers when C has 1+ concurrent request.

5. **Requirements**: `PYTHONHASHSEED` must be set (bench.sh sets `PYTHONHASHSEED=42` for elastic mode) to ensure deterministic `NONE_HASH` across scheduler/worker code paths.

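The cost-model rule in item 4 can be sketched as below. Only the inequality and the interference formula come from this report; how `colocated_cost` and `offload_cost` are computed is not specified here, so the sketch takes them as inputs, and the function names are hypothetical.

```python
INTERFERENCE_FACTOR = 0.3    # weight from the formula in item 4
MAX_COUNTED_REQUESTS = 3     # concurrency is capped at 3 in the penalty


def interference_penalty(prefill_time_s: float, num_requests: int) -> float:
    """interference = prefill_time × min(num_requests, 3) × 0.3"""
    return prefill_time_s * min(num_requests, MAX_COUNTED_REQUESTS) * INTERFERENCE_FACTOR


def should_offload(
    colocated_cost_s: float,
    offload_cost_s: float,
    prefill_time_s: float,
    num_requests: int,
) -> bool:
    """Offload when colocated_cost + interference > offload_cost."""
    penalty = interference_penalty(prefill_time_s, num_requests)
    return colocated_cost_s + penalty > offload_cost_s


# Illustrative numbers only: offloading costs 0.1s more than co-locating.
idle = should_offload(2.0, 2.1, prefill_time_s=2.0, num_requests=0)
busy = should_offload(2.0, 2.1, prefill_time_s=2.0, num_requests=1)
```

With an idle C the penalty is zero and the slightly more expensive offload is rejected; with one concurrent request the 0.6s penalty tips the decision toward offload, which matches the statement that offload triggers when C has 1+ concurrent request.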

**Minimal test verification** (`scripts/test_direct_read.py`):

| Metric | inst_0 (local cache) | inst_1 (RDMA push from inst_0) |
|--------|---------------------|-------------------------------|
| Turn 2 TTFT | 0.338s | **0.367s** |
| Blocks transferred | n/a | **640/640 matched, push ret=0** |
| External APC | 0% | **80%** |

**Key bugs fixed during development**:

- `NameError: field not imported`: missing dataclass import
- Scheduler assertion crash (`assert RequestStatus.is_finished`): partial remote prefill state mismatch
- Hash mismatch 0/640: `sha256` vs `sha256_cbor` (default hash algo is `sha256`, not `sha256_cbor`)
- Hash mismatch 0/640: `from X import NONE_HASH` creates stale value binding after `init_none_hash` reassigns the global; fixed with `import X; X.NONE_HASH`
- RDMA READ ret=-1: `batch_register_memory` only sets `IBV_ACCESS_REMOTE_WRITE`; switched to bootstrap-triggered PUSH
- Cost model 0% trigger: removed stale `cache_gate_ratio` check; added interference penalty

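The stale `NONE_HASH` binding in the list above is a general Python import pitfall and can be reproduced without vLLM. The sketch below builds a throwaway module named `fake_kv_hashing` (a hypothetical stand-in, not a real vLLM module) whose `init_none_hash` reassigns a module-level global, then compares the two import styles.

```python
import sys
import types

# Source of a stand-in module: a global that an init function later reassigns.
_SOURCE = '''
NONE_HASH = b"placeholder"

def init_none_hash(value):
    global NONE_HASH
    NONE_HASH = value
'''

module = types.ModuleType("fake_kv_hashing")
exec(_SOURCE, module.__dict__)
sys.modules["fake_kv_hashing"] = module

# Style 1: copies the *current* object into a new local name at import time.
from fake_kv_hashing import NONE_HASH as stale_none_hash

# Style 2: keeps a reference to the module and looks the attribute up on use.
import fake_kv_hashing

fake_kv_hashing.init_none_hash(b"seeded")

stale_value = stale_none_hash              # still the pre-init value
fresh_value = fake_kv_hashing.NONE_HASH    # sees the reassigned global
```

After `init_none_hash` runs, `stale_value` is still `b"placeholder"` while `fresh_value` is `b"seeded"`. Any code that hashed blocks with the first style would therefore disagree with code using the second, which is consistent with the 0/640 match rate reported above.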

**Output**: `outputs/eval_direct_rdma_v*/` on dash0.

### 3.9 Unified Routing (Historical — superseded)

> **Superseded by git history.** The "single argmin + PUSH migration" design
> described here was implemented in `6b255fa`, refined through
> `5892739` (soft affinity), `2b9eae0` (numbers below), and `4b50c5a`
> (queue/overload-gate fixes). Follow-on attempts to scale migration,
> `e991960`/`5772149` (forced session migration) and `bf4469a` (relaxed
> push gate), were both reverted (`cc6e562`, `4c583f2`) after they
> regressed E2E tail (57 migrations → HEAVY TTFT p90 15.9s → 59.1s;
> 134 offloads → E2E p90 37s → 82s).
>
> Current implementation is the **hybrid LMetric + high-cache affinity**
> direction landed in `255c8e6`. See `docs/migration-policy-design.md`
> for the active algorithm and `analysis/unified_routing_fix_review.md`
> for the reasoning. The numbers below remain valid for the
> `eval_unified_v3` artifact; do not treat them as the current
> production policy.

Replaced two-phase routing (pick_instance → offload gate) with single `argmin(expected_latency)` per instance:

```
latency(D) = queue(D) + prefill(D) + transfer(D)
- Local cache: prefill = (input - local_hit) / throughput, transfer = 0
- PUSH from C: prefill = (input - push_hit) / throughput, transfer = 0.1s
- Cold: prefill = input / throughput, transfer = 0
```

Session affinity as soft preference: use last instance if its cost ≤ 2× global best.

Only 2 measured parameters: `prefill_throughput=7000 tok/s`, `rdma_overhead=0.1s`.

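A minimal sketch of this retired single-argmin rule follows, under stated assumptions. The latency formula, the 2× soft-affinity rule, and the two constants come from the text above; the `Instance` fields, the function names, and the choice to use the best hit on any other instance as `push_hit` are illustrative assumptions rather than the code from `6b255fa`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

PREFILL_THROUGHPUT_TOK_S = 7000.0  # measured parameter quoted above
RDMA_OVERHEAD_S = 0.1              # measured parameter quoted above
AFFINITY_SLACK = 2.0               # keep last instance if cost <= 2x global best


@dataclass
class Instance:
    name: str
    queue_s: float          # estimated queueing delay on this instance
    local_hit_tokens: int   # prefix tokens of this request cached locally


def expected_latency(
    target: Instance, input_tokens: int, instances: List[Instance]
) -> Tuple[float, str]:
    """Return (latency, mode) for serving the request on `target`."""
    local = target.queue_s + (input_tokens - target.local_hit_tokens) / PREFILL_THROUGHPUT_TOK_S
    # Assumption: a PUSH can source the best-cached prefix held by any other instance.
    push_hit = max(
        (i.local_hit_tokens for i in instances if i is not target), default=0
    )
    push = (
        target.queue_s
        + (input_tokens - push_hit) / PREFILL_THROUGHPUT_TOK_S
        + RDMA_OVERHEAD_S
    )
    # The cold case is simply local with zero hit; PUSH must beat it strictly.
    return (push, "PUSH_MIGRATE") if push < local else (local, "LOCAL")


def route(
    input_tokens: int, instances: List[Instance], last_instance: Optional[str] = None
) -> Tuple[str, str, float]:
    """Single argmin over instances, with soft session affinity."""
    scored = {i.name: expected_latency(i, input_tokens, instances) for i in instances}
    best = min(scored, key=lambda name: scored[name][0])
    if last_instance in scored and scored[last_instance][0] <= AFFINITY_SLACK * scored[best][0]:
        best = last_instance
    latency, mode = scored[best]
    return best, mode, latency
```

For a 35k-token request whose 28k-token prefix is cached on instance A, A with a 10s queue costs about 11s locally, while an idle B can receive the prefix by PUSH for about 1.1s, so the request migrates. If A's queue were only 1s, its 2.0s cost would fall within 2× of the best and soft affinity would keep the request on A. A cold PUSH never wins because it only adds the RDMA overhead.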

**Results (eval_unified_v3, 850/850, 0 errors):**

Baseline = `eval_baseline_linear` (plain vLLM, no Mooncake, linear policy, 850 req, same trace).

| Metric | Baseline (plain) | **Unified v3 (kv_both)** | Delta |
|--------|-----------------|-------------------------|-------|
| TTFT mean | 4.348s | **3.277s** | **-24.6%** |
| TTFT p50 | 0.945s | **0.793s** | **-16.1%** |
| TTFT p90 | 12.468s | **8.472s** | **-32.1%** |
| TTFT p99 | 48.149s | **41.587s** | **-13.6%** |
| TPOT mean | 0.116s | 0.112s | -3.1% |
| TPOT p50 | 0.071s | 0.077s | +8.9% |
| TPOT p90 | 0.177s | 0.198s | +11.7% |
| TPOT p99 | 1.018s | **0.816s** | **-19.9%** |
| E2E mean | 19.10s | 19.81s | +3.7% |
| E2E p50 | 6.443s | **5.599s** | **-13.1%** |
| E2E p90 | 42.27s | 47.48s | +12.3% |
| E2E p99 | 192.2s | 238.0s | +23.8% |

**Routing**: 723 LOCAL + 116 PUSH_MIGRATE (13.8%). All 116 pushes had cache (avg 25k tokens), with no cold offloads. The unified cost model naturally avoids cold migration because `cold + RDMA > cold` (RDMA adds overhead without reducing prefill).

**Tradeoff**: TTFT uniformly improves (p50 -16%, p90 -32%). TPOT is mixed: p50/p90 are worse (+9%/+12%), but p99 improves (-20%) because PUSH migration relieves the heaviest tail prefills. **E2E tail degrades significantly** (p90 +12%, p99 +24%): kv_both always-on overhead + PUSH transfer latency on migrated requests inflates E2E for long requests, offsetting the TTFT gain. The p50 benefit (-13%) comes from the majority of LOCAL requests getting faster prefill due to reduced queue contention.

**Output**: `outputs/eval_unified_v3/` on dash0, baseline from `outputs/eval_baseline_linear/`.

## 4. System-Level Analysis

### 4.1 Elastic P2P Does Not Improve Single-Machine Performance

Under fair comparison (same machine, both fresh):

| Metric | Baseline | Elastic | Delta |
|--------|----------|---------|-------|
| TTFT p50 | 1.075s | 1.018s | -5.3% |
| TTFT p90 | 9.384s | 11.312s | +20.5% |
| TPOT p90 | 0.076s | 0.085s | +11.6% |
| E2E p50 | 5.075s | 6.977s | +37.5% |

Elastic is **worse** on all metrics except TTFT p50. Root causes:

**1. Mooncake kv_both memory overhead**

Each instance with `kv_role=kv_both` maintains RDMA buffers + Mooncake bootstrap server, reducing GPU memory available for KV cache. This affects ALL requests (including WARM/MEDIUM that don't use P2P transfer), causing more cache eviction and higher TPOT.

Evidence: MEDIUM TPOT p90 = 0.197s (elastic) vs 0.079s (baseline), which is **2.5× worse** despite MEDIUM requests not using P2P at all.

**2. D-side KV pull failures**

3 HEAVY requests completed prefill on P instance successfully but D-side never produced first token. The KV cache on D was too full to allocate space for the transferred blocks. These became 600s timeouts.

**3. P2P overhead without proportional benefit**

The P2P path adds: prefill queue on P (p50=6.3s) + KV transfer + decode start on D (p50=0.8s). For requests where the D instance isn't under heavy prefill load (which is the case on fresh instances), co-located execution is faster.

### 4.2 When Elastic P2P Could Help

Elastic P2P is designed for the scenario where D-instance decode is disrupted by co-located heavy prefill. On fresh instances with 200 requests, this contention is moderate. The benefit may emerge under:

- Higher sustained load (1000+ concurrent requests)
- Longer experiment duration (KV cache fills up, eviction pressure increases)
- Multi-machine deployment (P on a different node, no memory competition)

## 5. Data & Log Locations

### 5.1 Experiment Outputs (on respective machines)

| Directory | Machine | Config | Notes |
|-----------|---------|--------|-------|
| `outputs/ab_baseline/` | dash0 | Combined 8× TP=1 | ~~Initial A/B~~ (INVALIDATED: warm instances) |
| `outputs/ab_elastic/` | dash0 | Elastic P2P cap=4 | ~~Initial A/B~~ (INVALIDATED) |
| `outputs/baseline_stability_fresh/` | dash0 | Combined 8× fresh | **Canonical baseline** (§3.1) |
| `outputs/elastic_stability_*/` | dash0 | Elastic P2P kv_both fresh | **Canonical elastic** (§3.1) |
| `outputs/ab_linear/` | dash0 | Linear policy, 200 req | §3.4 routing policy comparison |
| `outputs/ab_lmetric/` | dash0 | LMetric policy, 200 req | §3.4 routing policy comparison |
| `outputs/gpu_ab_combined/` | local | Combined 8× TP=1 | Earlier run, has gpu_util.csv |
| `outputs/gpu_ab_pdsep/` | local | PD-Sep 4P+4D | Earlier run, has gpu_util.csv |
| `outputs/exp2_combined_tp1_dp8/` | local | Combined 8× TP=1 | 1000 req, cache-aware |
| `outputs/exp3_pd_sep_tp1_mooncake/` | local | PD-Sep 4P+4D Mooncake | 1000 req |

### 5.2 vLLM Instance Logs

| Path pattern | Machine | Config |
|-------------|---------|--------|
| `/tmp/ab_base_$i.log` | dash0 | Baseline instances 0-7 |
| `/tmp/ab_elastic_$i.log` | dash1 | Elastic instances 0-7 |
| `/tmp/lmetric_ab_inst_$i.log` | dash0 | Linear policy instances 0-7 (§3.6) |
| `/tmp/lmetric_inst_$i.log` | dash0 | LMetric policy instances 0-7 (§3.6) |

Logs contain `Prefix cache hit rate` and `External prefix cache hit rate` lines for APC extraction.

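APC extraction from these logs could look like the sketch below. The exact log line layout is an assumption: the patterns expect each metric name to be followed by `: <number>%`, which should be checked against the actual vLLM 0.18.1 output before relying on it. The function name `last_hit_rates` is hypothetical.

```python
import re
from typing import Dict, Iterable, Optional

# Assumed layout: "<metric name>: <float>%". The lookbehind keeps the local
# pattern from matching inside "External prefix cache hit rate".
LOCAL_RE = re.compile(r"(?<!external )prefix cache hit rate: ([\d.]+)%", re.IGNORECASE)
EXTERNAL_RE = re.compile(r"external prefix cache hit rate: ([\d.]+)%", re.IGNORECASE)


def last_hit_rates(lines: Iterable[str]) -> Dict[str, Optional[float]]:
    """Return the most recently logged local and external hit rates (percent)."""
    rates: Dict[str, Optional[float]] = {"local": None, "external": None}
    for line in lines:
        local = LOCAL_RE.search(line)
        if local:
            rates["local"] = float(local.group(1))
        external = EXTERNAL_RE.search(line)
        if external:
            rates["external"] = float(external.group(1))
    return rates


# Hypothetical log lines, for illustration only.
sample_lines = [
    "INFO metrics: Prefix cache hit rate: 41.0%",
    "INFO metrics: Prefix cache hit rate: 67.5%, External prefix cache hit rate: 12.0%",
]
sample_rates = last_hit_rates(sample_lines)
```

To apply it to a real log, pass an open file object, for example `last_hit_rates(open("/tmp/ab_base_0.log"))`; taking the last reported value assumes the logged rate is cumulative over the run.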

### 5.3 Trace Data

| Path | Machine | Description |
|------|---------|-------------|
| `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` | dash0 | Full 2h production trace (2.1M requests) |
| `traces/sampled_1000req_seed42.jsonl` | all | Sampled 1000 requests (gitignored, regenerate with `sample_trace.py`) |

### 5.4 Analysis Documents

| File | Content |
|------|---------|
| `analysis/pd_separation_analysis.md` | Main report: PD-Sep vs Combined + Elastic P2P (§5) |
| `analysis/elastic_offload_design.md` | Elastic P2P design rationale |
| `analysis/kv_lifecycle_design.md` | KV cache eviction policy analysis |
| `analysis/adaptive_prefill_offload_design.md` | Initial adaptive offload design (superseded by elastic) |

## 6. Repository Structure

```
agentic-kv/
├── analysis/                      # Research reports and design docs
│   ├── pd_separation_analysis.md  # Main comprehensive report
│   ├── elastic_offload_design.md  # Elastic P2P design
│   ├── kv_lifecycle_design.md     # Cache eviction analysis
│   └── ...
├── replayer/                      # Trace replay framework
│   ├── __main__.py                # CLI entry: python -m replayer
│   ├── replay.py                  # Async replayer (session-aware, SSE streaming)
│   ├── trace.py                   # TraceRequest dataclass, session/hash_id handling
│   └── metrics.py                 # RequestMetrics, crash-safe JSONL sink
├── scripts/
│   ├── cache_aware_proxy.py       # Global scheduler (combined + PD-sep + elastic offload)
│   ├── sample_trace.py            # Cluster-to-machine trace sampler
│   ├── launch_vllm.sh             # Launch combined TP=8
│   ├── launch_pd_mooncake.sh      # Launch PD-Sep with Mooncake
│   ├── launch_elastic_p2p.sh      # Launch elastic P2P (8× kv_both + offload proxy)
│   ├── run_experiments.sh         # Full experiment matrix (combined/PD-sep)
│   ├── run_benchmark.sh           # Single benchmark run
│   ├── gpu_monitor.sh             # GPU utilization sampler (5s CSV)
│   ├── compute_roofline.py        # Prefill/decode roofline analysis
│   ├── analyze_*.py               # Various analysis scripts
│   └── compare_*.py               # Experiment comparison scripts
├── patches/
│   ├── 0001-fix-kv-transfer-abort-race.patch
│   └── README.md
├── third_party/vllm/              # vLLM 0.18.1 source (with patch applied)
├── outputs/                       # Experiment results (gitignored)
├── traces/                        # Sampled traces (gitignored)
├── TODO.md                        # Original research goals
└── REPORT.md                      # This milestone report
```

## 7. Key Scripts Reference
|
||
|
||
| Script | What it does | Key flags |
|
||
|--------|-------------|-----------|
|
||
| `scripts/cache_aware_proxy.py` | Global scheduler + elastic offload proxy | `--combined`, `--offload`, `--policy {linear,lmetric}`, `--heavy-threshold`, `--bootstrap-ports` |
|
||
| `scripts/run_lmetric_ab.sh` | A/B: linear vs lmetric routing policy | Runs both experiments with fresh restart |
|
||
| `scripts/run_elastic_stability_test.sh` | Elastic vs baseline with full isolation | Fresh start/stop per experiment |
|
||
| `scripts/bench.sh` | Standard single-experiment harness | `--tag`, `--mode {baseline,elastic}` |
|
||
| `scripts/sample_trace.py` | Sample complete sessions from cluster trace | `--target-requests`, `--seed` |
|
||
| `python -m replayer` | Replay trace against vLLM endpoint | `--time-scale`, `--max-inflight-sessions`, `--request-limit` |
|
||
| `scripts/gpu_monitor.sh` | Sample nvidia-smi to CSV | Pipe to `outputs/<tag>/gpu_util.csv` |
|
||
| `scripts/launch_elastic_p2p.sh` | Launch all 8 kv_both instances + offload proxy | `HEAVY_THRESHOLD`, `MAX_OFFLOAD` env vars |
|
||
|
## 8. GPU Load Imbalance & Elastic Prefill Service Analysis

### 8.1 Load Imbalance Characterization

Session-sticky routing creates token load imbalance across instances. The severity depends on scale:

| Scale | Imbalance | Top 5 sessions | Cause |
|-------|-----------|----------------|-------|
| 200 req (143 sessions) | **8.6×** tokens | 49% of all tokens | Small sample, few dominant sessions |
| 1000 req (668 sessions) | **1.24×** tokens | 29% of all tokens | More sessions → natural averaging |

At 1000 requests, the heaviest instance handles 4.5M tokens vs 3.6M for the lightest (1.24×). Despite this, TPOT is uniform across all instances (0.070–0.077s), confirming that prefill-decode interference is minimal at ≤1 session/GPU. The imbalance manifests in **TTFT only**: the 2 heaviest instances have TTFT p50 = 1.42s vs 0.83s for the 2 lightest (1.7× gap).

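
The ratios quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes "imbalance" means heaviest divided by lightest per-instance token load, which matches the 4.5M vs 3.6M comparison; the function name is illustrative and not part of the repository.

```python
def imbalance_ratio(tokens_per_instance):
    """Return heaviest / lightest token load across instances.

    Assumption: imbalance is defined as max/min, not max/mean.
    """
    heaviest = max(tokens_per_instance)
    lightest = min(tokens_per_instance)
    if lightest <= 0:
        raise ValueError("every instance must have a positive token load")
    return heaviest / lightest


# Rounded endpoints from the 1000-request run (heaviest and lightest instance).
token_ratio = imbalance_ratio([4.5e6, 3.6e6])

# TTFT p50 of the 2 heaviest vs the 2 lightest instances, in seconds.
ttft_gap = 1.42 / 0.83

print(f"token imbalance: {token_ratio:.2f}x, TTFT gap: {ttft_gap:.1f}x")
```

The rounded endpoints give 1.25×, which is consistent with the reported 1.24× if that figure was computed from unrounded per-instance totals.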
### 8.2 Session Accumulation Pattern

Agentic workloads produce long-lived sessions with growing context:

| Session | Turns | Total Tokens | Context Growth |
|---------|-------|-------------|----------------|
| 1569319 | 36 | 2.32M | 27k → 98k (+2.0k/turn) |
| 1206593 | 36 | 2.31M | 15k → 106k (+2.6k/turn) |
| 178176 | 25 | 1.93M | 36k → 95k (+2.5k/turn) |

The top 5 sessions account for 29% of all tokens. Under session-sticky routing, these sessions pin their instances, creating persistent load hotspots.

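
The per-turn growth figures in the table follow from the first and last context sizes. The sketch below assumes growth is averaged over the `turns - 1` intervals between the first and last turn; the helper name is hypothetical.

```python
def growth_per_turn(start_tokens, end_tokens, turns):
    """Average context growth per turn, over the turns - 1 intervals."""
    if turns < 2:
        raise ValueError("need at least two turns to measure growth")
    return (end_tokens - start_tokens) / (turns - 1)


# (first-turn context, last-turn context, number of turns) from the table above.
sessions = {
    "1569319": (27_000, 98_000, 36),
    "1206593": (15_000, 106_000, 36),
    "178176": (36_000, 95_000, 25),
}

for session_id, (start, end, turns) in sessions.items():
    rate = growth_per_turn(start, end, turns)
    print(f"session {session_id}: +{rate / 1000:.1f}k tokens/turn")
```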
### 8.3 Benchmark Concurrency vs Production Reality

> **Critical caveat**: All prior experiments used `--max-inflight-sessions 8` (1 session/GPU). This is **10–15× below production concurrency** and masks the interference that elastic PS is designed to solve.

| Metric | Our Benchmark | Production Estimate |
|--------|---------------|---------------------|
| Concurrent requests/GPU | 1–2 | **8–15** |
| KV cache usage/GPU | 26–28% (single req) | **80–100%** |
| Prefill-decode interference | Minimal | **Significant** |

**KV cache capacity**: 281,888 tokens/GPU (25.8 GiB). A single 75k-token request consumes 27% of KV cache. At production concurrency (~15 req/GPU), KV cache is near-full, triggering eviction, cache misses, and prefill queuing, none of which appear in our 1-req/GPU benchmark.

**Measured interference scaling**:

| Concurrency | TPOT p90 | vs 8-session |
|-------------|----------|-------------|
| 8 sessions (1/GPU) | 0.075s | baseline |
| 16 sessions (2/GPU) | 0.103s | **+38%** |
| Production (~15/GPU) | *not tested* | *expected >>+45%* |

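
The capacity and interference figures in this section reduce to simple arithmetic. The sketch below reproduces them from the numbers quoted above; the helper names are illustrative, and the "fully resident" count assumes no prefix sharing between requests.

```python
KV_CAPACITY_TOKENS = 281_888  # per-GPU KV cache capacity quoted above


def kv_fraction(request_tokens, capacity=KV_CAPACITY_TOKENS):
    """Fraction of one GPU's KV cache consumed by a single request."""
    return request_tokens / capacity


def max_resident_requests(request_tokens, capacity=KV_CAPACITY_TOKENS):
    """How many requests of this size fit at once (assumes no shared prefixes)."""
    return capacity // request_tokens


def relative_increase(baseline, measured):
    """Relative change of `measured` over `baseline`."""
    return (measured - baseline) / baseline


print(f"75k-token request: {kv_fraction(75_000):.0%} of KV cache")
print(f"75k-token requests resident at once: {max_resident_requests(75_000)}")
# Rounded table values; the reported +38% presumably uses unrounded measurements.
print(f"TPOT p90, 16 vs 8 sessions: {relative_increase(0.075, 0.103):+.1%}")
```

Only three 75k-token requests fit fully resident under this assumption, which illustrates why roughly 15 concurrent requests per GPU would be expected to push the cache toward eviction.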
### 8.4 Elastic PS: Three Capabilities Re-Evaluated

**Capability 1: Reduce prefill-decode interference (lower TPOT)**

At 1 req/GPU (our benchmark) there is no interference and therefore no benefit, but this is an artifact of unrealistically low concurrency. At ≥2 req/GPU, chunked prefill interrupts decode steps, raising TPOT by 38–45%. At production concurrency (~15/GPU), multiple HEAVY prefills sharing a GPU with decode requests would cause severe interference. Elastic PS's ability to isolate heavy prefills on separate GPUs directly addresses this.

**Capability 2: Session migration for load balancing**

Elastic PS enables mid-session migration: prefill on the original instance (cache hit), then KV transfer to a different instance for decode and future turns. This provides two benefits:

1. **Break session lock-in**: Agentic sessions grow +2k tokens/turn over 30+ turns. With session-sticky routing, a 36-turn session (2.3M tokens total) locks its GPU, creating a hotspot. Elastic PS lets the session migrate to a less-loaded GPU while preserving cache on the original (PS does fast cached prefill, new GPU decodes).

2. **Rebalance without cache loss**: Unlike breaking affinity (which destroys cache), elastic PS migration preserves the prefix cache on the original instance: the PS reuses it for fast prefill, then transfers only the new KV to the destination.

Simulation of migration strategies (1000 req, at current low concurrency):

| Strategy | Imbalance | Migrations | KV Transfer Overhead |
|----------|-----------|------------|---------------------|
| No migration | 1.24× | 0 | 0s |
| Every 10 turns | 1.21× | 10 | 15s |
| Every 5 turns | 1.20× | 20 | 30s |

At 1 req/GPU, migration benefit is marginal (imbalance is only 1.24×). At production concurrency, where imbalance combines with KV cache pressure and interference, the benefit would be substantially larger.

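
The "migrate every N turns" strategies in the table can be illustrated with a toy model. This is a sketch under strong simplifying assumptions, not the simulator that produced the table: sessions are processed one after another rather than interleaved, each session starts on the least-loaded instance, and every migration costs a fixed 1.5s (the value implied by 15s / 10 migrations). All names are hypothetical.

```python
TRANSFER_SECONDS_PER_MIGRATION = 1.5  # implied by the table: 15s / 10 migrations


def least_loaded(load):
    """Index of the instance with the smallest accumulated token load."""
    return min(range(len(load)), key=load.__getitem__)


def simulate(sessions, num_instances, migrate_every=None):
    """Toy migration model.

    sessions: one list of per-turn token counts per session.
    Returns (imbalance, migrations, transfer_seconds).
    """
    load = [0] * num_instances
    migrations = 0
    for turns in sessions:
        home = least_loaded(load)  # initial sticky placement
        for turn_index, tokens in enumerate(turns):
            # Re-home the session every `migrate_every` turns if a
            # less-loaded instance exists.
            if migrate_every and turn_index > 0 and turn_index % migrate_every == 0:
                target = least_loaded(load)
                if target != home:
                    home = target
                    migrations += 1
            load[home] += tokens
    if min(load) == 0:
        raise ValueError("an instance received no load; imbalance is undefined")
    imbalance = max(load) / min(load)
    return imbalance, migrations, migrations * TRANSFER_SECONDS_PER_MIGRATION


# One long session and one short session on two instances.
demo = [[10] * 10, [1] * 2]
print(simulate(demo, num_instances=2))                   # sticky only
print(simulate(demo, num_instances=2, migrate_every=5))  # re-home every 5 turns
```

In this toy case a single migration collapses the imbalance, which overstates the effect seen in the real trace, where many sessions already average out the load.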
**Capability 3: Soft affinity from cache-hit scoring**

The corrected LMetric experiment (§3.4) reveals that **explicit session affinity is unnecessary**. Cache-hit scoring (`new_tokens = input − cached`) creates implicit soft affinity: instances with cached prefixes score lower, naturally attracting subsequent turns. This matches hard session-sticky on all metrics (< 2% difference) while providing more flexible load balancing.

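
The soft-affinity effect can be sketched as a scoring function. This is an illustrative model, not the scoring code in `scripts/cache_aware_proxy.py`: only `new_tokens = input − cached` comes from the text, while the `pending_tokens` load term, the `Instance` type, and the additive score are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    cached_prefix_tokens: int  # tokens of this request's prefix already cached here
    pending_tokens: int = 0    # hypothetical load term: work already queued


def new_tokens(input_tokens, cached_tokens):
    """Tokens that still need prefill: input minus cached, never negative."""
    return max(input_tokens - cached_tokens, 0)


def pick_instance(input_tokens, instances):
    """Choose the instance with the lowest (queued + uncached) token score."""
    return min(
        instances,
        key=lambda inst: inst.pending_tokens
        + new_tokens(input_tokens, inst.cached_prefix_tokens),
    )


# Turn N+1 of a session: instance A holds 78k of the 80k-token prompt.
warm = Instance("A", cached_prefix_tokens=78_000, pending_tokens=10_000)
cold = Instance("B", cached_prefix_tokens=0, pending_tokens=0)
print(pick_instance(80_000, [warm, cold]).name)  # A: cache hit outweighs its queue

# Same request, but A is now heavily loaded: the affinity yields.
busy = Instance("A", cached_prefix_tokens=78_000, pending_tokens=90_000)
print(pick_instance(80_000, [busy, cold]).name)  # B
```

No session table is consulted in this sketch. The cached prefix lowers the score enough to attract the next turn, and the affinity gives way once the warm instance's queue exceeds the prefill it would save.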
### 8.5 Elastic PS Verdict

| Aspect | At 1 req/GPU (tested) | At production load (expected) |
|--------|----------------------|-------------------------------|
| TPOT reduction | 0% (no interference) | **Significant** (interference scales with concurrency) |
| Session migration | Marginal (1.24× → 1.20×) | **Larger benefit** (KV pressure + interference amplify imbalance) |
| Cache preservation | N/A | **Key advantage** vs breaking affinity |

**At our benchmark concurrency (1 req/GPU), elastic PS is not justified**: Mooncake overhead is pure cost when there is no interference to remove. **But our benchmark is 10–15× below production load.** The real question is whether elastic PS helps at production-realistic concurrency (64–128 concurrent sessions, 8–15 req/GPU), where:
- Prefill-decode interference is measurable (already +38% TPOT at just 2/GPU)
- KV cache pressure causes eviction and queue delays
- Session accumulation creates compounding hotspots
- Heavy prefills (50–100k tokens) block decode for seconds

**Next step: run `--max-inflight-sessions 64` benchmark** to test elastic PS at production-realistic concurrency.

## 9. Conclusions & Next Steps

### Established findings:
1. Full PD separation is **net negative** for single-machine agentic workloads (KV cache memory wall)
2. Cache-aware routing is the **dominant optimization** (+24pp APC, -60% TTFT vs round-robin)
3. **Explicit session affinity is unnecessary**: cache-hit scoring creates implicit soft affinity that matches hard session-sticky (< 2% difference)
4. At low concurrency (1 req/GPU), elastic P2P offload adds overhead without benefit
5. **Our benchmark concurrency is 10–15× below production**: `--max-inflight-sessions 8` yields 1 req/GPU, masking prefill-decode interference that appears at ≥2 req/GPU (+38% TPOT) and would dominate at production load (~15 req/GPU)
6. **Experimental methodology matters**: warm vs fresh instances cause 2× TTFT difference

### Lessons learned:
- Prior cross-machine A/B (commit `1e86285`) was invalid: warm baseline inflated by 2×
- Prior LMetric implementation was invalid: it incorrectly shared session-sticky logic with Linear
- `kv_role=kv_both` has non-trivial always-on overhead even when P2P transfer is not used
- Experiment isolation (kill all → verify GPU free → fresh start) is critical for reproducibility
- **Benchmark concurrency must match production**: sub-production concurrency hides interference effects that drive system design decisions

### Open problems (priority ordered):
1. **Production-concurrency benchmark** (`--max-inflight-sessions 64+`): Validate whether prefill-decode interference at 8–15 req/GPU makes elastic PS net-positive
2. **Multi-machine elastic**: P on a different node eliminates GPU memory competition, the main cost that makes single-machine elastic net negative
3. **Layerwise KV transfer**: Mooncake's block-level transfer after full prefill is the bottleneck. Layerwise pipelining could reduce transfer latency by overlapping with computation
4. **Router state accuracy**: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)

---

*Updated 2026-05-23. LMetric corrected (§3.4 errata). GPU imbalance analysis added (§8). Benchmark concurrency gap identified; production-load experiments needed.*