Commit Graph

67 Commits

Author SHA1 Message Date
a7514fc3d5 Fix retry syntax: async generator can't use return, use break+try/finally
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 22:37:32 +08:00
daeb95eca0 RDMA overhead 2.0→0.1s (direct read is raw memory, not scheduler flow) +
retry on ConnectError to handle kv_both connection instability

With RDMA_overhead=0.1s, offload triggers when C_s has just 700 tokens
pending (0.1s queue), vs 38k tokens (5.4s) with the old 2.0s estimate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 22:33:10 +08:00
5c66f500fc Fix offload gate: remove cache_gate for direct RDMA read, fix cost model
The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (75%)
because they were cold (cache_ratio=0). But with direct RDMA read,
D reads C's cached blocks via RDMA regardless of cache ratio — the
gate was protecting against the OLD flow (C does prefill + push).

Also fixed cost model: offload_cost now reflects direct read reality:
  OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive)
  NEW: D_queue + RDMA_read + D_local_prefill(new_tokens)

Offload wins when C_s queue > RDMA_overhead (~2s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 22:01:43 +08:00
23788f7cd5 Fix: import field from dataclasses for PullReqMeta
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 21:29:24 +08:00
1dea82f2ff launch_phase1_ps: parameterise project + model paths (B6 followup) 2026-05-23 21:14:15 +08:00
52a54e44af proxy: split session_affinity per mode + vLLM patch self-check (M4, S2)
- Replace the global session_affinity dict with two namespace-isolated
  ones (combined / prefill) so a session_id never indexes the wrong
  instance list across mode switches. Keep `session_affinity` as a
  read-only alias to the combined dict for any existing tooling.
- Add a startup _verify_vllm_patch() that scans
  vllm.v1.core.sched.scheduler.Scheduler for the original
  `assert req_id in self.requests` line. If the patch was not
  re-applied after a vLLM upgrade we now print a loud warning at
  lifespan startup instead of dying mid-experiment on a KV-transfer
  abort race.
2026-05-23 21:12:56 +08:00
c843f2e3db proxy: Settings dataclass + cache-ratio gate + P-pick offload penalty (B4, M2, M3, D5)
- Replace mutable module constants (HEAVY_THRESHOLD/OVERLOAD_FACTOR/
  MAX_OFFLOAD_INFLIGHT/PREFILL_THROUGHPUT/RDMA_OVERHEAD_S/
  CACHE_CAPACITY_BLOCKS) with a Settings dataclass + SETTINGS singleton.
  __main__ now mutates SETTINGS so CLI overrides survive even when the
  module is imported as a library (e.g. by tests/) (D5).
- Add --max-offload-inflight CLI flag (M3) and read it from SETTINGS.
- Add --cache-gate-ratio CLI flag and a real gate before the cost-model
  branch: if cache_hit/input_length < ratio, mark cache_gate_REASON and
  fall back to colocated. cache_ratio is no longer a write-only field
  (B4).
- P candidate selection penalises instances already running offloaded
  HEAVY prefills, so back-to-back HEAVY requests don't pile onto the
  same P (M2).
- bench.sh forwards --max-offload-inflight / --cache-gate-ratio to the
  proxy.
- Tests cover SETTINGS knobs + the heavy_threshold-driven P-offload
  penalty.
2026-05-23 21:11:17 +08:00
0701f84c00 tests: add minimal coverage for percentile + proxy routing (S1)
- tests/test_metrics.py asserts the new linear-interp _percentile against
  hand-computed expected values (single value, two-value interpolation,
  endpoints, numpy-equivalent linear default, on-integer rank).
- tests/test_proxy_pick.py exercises InstanceState LRU eviction and
  move-to-end on hit, plus session-affinity stickiness, the overload
  fallback, the active_p_offloads penalty, and lmetric scoring. The
  proxy is loaded by file path with stub fastapi/uvicorn/httpx modules
  so the suite runs without the FastAPI server deps installed.
- pyproject.toml gets a hatchling wheel target and a [tool.pytest]
  section so `uv run --extra dev pytest` works out of the box.
2026-05-23 21:07:14 +08:00
7c7f8b951a replayer: wire --max-inflight-sessions cap into replay loop (B2)
Trace-driven dispatch is preserved by default (semaphore=None when the
flag is not set), but operators can now cap concurrent sessions to
reproduce session-admission scenarios from earlier sweeps without
artificial time compression.
2026-05-23 21:04:09 +08:00
2c7f7fdaae replayer: restore optional max_inflight_sessions for backwards compat
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 21:02:26 +08:00
a7df84bd3b Direct RDMA read: D reads cached KV from C's GPU without C's scheduler
Complete implementation of direct RDMA read for KV cache migration:

vLLM Mooncake connector (mooncake_connector.py):
- PullReqMeta: add direct_read flag + block_hashes
- MooncakeConnectorMetadata: add hash_table_updates/removals for
  scheduler->worker block hash sync
- MooncakeConnectorScheduler: set_block_pool() to access BlockPool,
  build_connector_meta() computes hash table deltas each step,
  update_state_after_alloc() captures request block hashes for direct_read
- MooncakeConnectorWorker: _start_direct_read() + _direct_read_single()
  implements D-side RDMA read via batch_transfer_sync_read, with
  HTTP query/unpin to C's bootstrap server

Bootstrap server (mooncake_utils.py):
- POST /query_blocks: look up block hashes, return block_ids + GPU layout
- POST /unpin_blocks: release pin tracking
- set_worker_kv_info(): register GPU addresses at init
- update_hash_table(): receive scheduler deltas each step

Scheduler (scheduler.py):
- One-line hookup: pass block_pool to connector after KVCacheManager init

Proxy (cache_aware_proxy.py):
- _handle_direct_read_offload: sends request ONLY to D with
  direct_read=True + remote_bootstrap_addr. No request to C at all.
- C's scheduler is completely uninvolved (0 GPU time on C)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 21:02:13 +08:00
020be9f444 proxy: real LRU for cached_blocks + shadow-state reconcile loop (M1, M5)
M1: cached_blocks was a plain set with a "trim half via list slicing"
eviction. CPython does not guarantee set iteration order, so the trim
discarded an arbitrary half of the entries — completely unlike vLLM's
LRU and a known contributor to the router's cache_hit estimate
diverging from real APC. Replace with an OrderedDict-backed LRU:
move_to_end on hits, popitem(last=False) on overflow. Capacity exposed
as CACHE_CAPACITY_BLOCKS module constant (200000 by default).

M5: streamed responses decrement load counters in their generator's
finally block. If a client disconnects before consuming the body the
generator is never entered and the decrement is lost, causing
ongoing_tokens / num_requests / pending_prefill_tokens to drift
negative under load. Add a 60s background reconcile_loop that clamps
those counters at zero as a safety net. Started in lifespan, cancelled
on shutdown. Does not replace proper vLLM exact-state syncing.
2026-05-23 21:00:35 +08:00
0ed1ce200e metrics: replace round-based percentile with linear interpolation (B5)
The previous implementation used round((n-1) * pct), which under Python's
banker's rounding returned the upper-middle element on every even-length
array (e.g. p50 of [1,2,3,4] returned 3 instead of 2.5). All summary
JSONs were biased upward at p50 as a result. Match numpy.percentile's
default linear interpolation between the two adjacent sorted values.
2026-05-23 21:00:24 +08:00
0958823cdb REPORT: add §1.1 errata flagging superseded sections (S3)
Calls out that §3.1 (old random sampler, time-scale compression, 1 req/GPU
cap) and the early elastic v3 warm-vs-fresh runs are no longer current,
and that the "--max-inflight-sessions 64+" next-step text refers to a
flag that was removed and must be restored per FIXES.md §B2 before those
numbers can be reproduced. Points readers at §3.6/§3.7 as authoritative.
2026-05-23 20:58:38 +08:00
ea5c3bfe6b compute_roofline: argparse --trace, fix stale default path (D4)
The hardcoded traces/sampled_1000req_seed42.jsonl no longer exists; switch
the default to the current sampled trace file w600_r0.0015_st30.jsonl and
let users override via --trace. Skip Part 4 cleanly when the file is
missing instead of relying on os.path.exists.
2026-05-23 20:58:09 +08:00
547611e022 scripts: archive obsolete one-off shell/python scripts to legacy/ (D2, D3)
D2: run_benchmark.sh and run_experiments.sh still pass --time-scale and
--max-inflight-sessions to the replayer, but those flags were removed when
the project moved to trace-driven dispatch. The scripts cannot run as-is.

D3: ~25 ad-hoc analyze_* / compare_* / profile_* / final_* scripts and a
handful of single-experiment run_*.sh point at /home/admin/cpfs paths,
deleted output directories, or a sampled trace file that no longer exists.
Keep them in scripts/legacy/ for historical reference; the scripts that
remain in scripts/ (analyze_trace, analyze_breakdown, analyze_cache_hit,
analyze_eviction, compare_results, compute_roofline, sample_trace,
analyze_agentic_patterns, simulate_cache_policies, plus launch_*.sh,
gpu_monitor.sh, bench.sh) cover the current workflow.

Adds scripts/legacy/README.md to document the archival policy.
2026-05-23 20:57:32 +08:00
c64b0b39c7 bench.sh: fix stale MODEL and TRACE defaults (B6)
The default MODEL pointed at /home/admin/cpfs/... which never existed on
the public dev machines (other launch_*.sh and TODO.md use $HOME/models),
and the default TRACE pointed at traces/sampled_1000req_seed42.jsonl
which was deleted when the sampler moved to window+thin output. Update
both to the values the rest of the repo already standardized on.
2026-05-23 20:56:40 +08:00
556f3011c6 proxy: remove dead state and broken fire-and-forget path (B1, D1)
B1: _inst_cumulative_tokens was written by pick_instance but never read
anywhere; delete the variable, global declaration, and per-call increment.
Load is already tracked via inst.ongoing_tokens.

D1: _send_prefill_async + the --fire-and-forget branch were unreachable
in practice (no launch/bench script enabled the flag) and broken even if
exercised: D-decode would fire before P registered the transfer_id,
guaranteeing a Mooncake 502. Collapse _handle_pd_sep to its synchronous
path and drop the CLI flag.
2026-05-23 20:56:11 +08:00
fc445df0ad Add FIXES.md with prioritized repo cleanup checklist
Captures the full review of bugs, fake/half-implemented features, dead
branches, and quality gaps found in cache_aware_proxy.py, replayer, and
the shell scripts. Each item has file:line, problem, fix, and verification
steps so any contributor can pick it up directly.
2026-05-23 20:35:56 +08:00
b2ede1da77 bench.sh: add trap for graceful cleanup on kill/interrupt
Added EXIT/INT/TERM traps to ensure vLLM, proxy, and gpu_monitor
processes are cleaned up even when bench.sh is killed externally.
Also includes gpu_monitor in cleanup_gpu pattern matching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 20:24:13 +08:00
ea5149726c Partial remote prefill: C_s exports cache, D computes new tokens locally
vLLM Mooncake patch:
- get_num_new_matched_tokens: support remote_num_tokens parameter for
  partial remote prefill (pull N tokens from remote, compute rest locally)
- update_state_after_alloc: only allocate receive blocks for external portion

Proxy _handle_heavy_offload rewrite:
- Step 1: C_s exports ONLY cached blocks (truncated prompt, 0 compute)
- Step 2: D pulls cached blocks + does local prefill for new tokens + decodes
- C_s's blocks auto-freed by Mooncake delay_free after D confirms receipt

This enables true session migration: C_s releases cache, D takes over.
C_s's GPU is freed immediately (no compute), vs old approach where C_s
had to do full prefill (1-15s GPU occupancy).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 20:04:13 +08:00
be273f7f27 Replace static offload gate with runtime cost model
Old gate: cache_ratio >= 0.3 (static, only 14% of HEAVY triggered)
New gate: offload when offload_cost < colocated_cost, where:
  colocated_cost = queue(C_s) + prefill(new_tokens)
  offload_cost = queue(P_idle) + prefill(P_tokens) + RDMA_overhead

Key changes:
- P is now least-loaded instance (not session-sticky C_s)
- Gate considers C_s queue depth dynamically
- Crossover: offload wins when C_s queue >= 38k tokens (~5.4s)
- Cold HEAVY requests CAN be offloaded if C_s is busy enough
- P accounting uses P's actual cache hit, not C_s's

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 19:42:33 +08:00
9835d6af5d Elastic PS eval: near-neutral, offload gate triggers only 14% of HEAVY
Root cause: 75% of HEAVY requests are cold (cache_ratio=0%), failing the
cache_ratio>=0.3 gate. Only 17/118 HEAVY offloaded, insufficient to reduce
prefill-decode interference. Offloaded requests are 50% SLOWER due to
P-side queuing (14.7s) + RDMA overhead (5.7s).

Interference IS real: 89% of WARM/MEDIUM have 1+ concurrent HEAVY prefill.
But elastic PS in current form can't address it because cold HEAVY prefills
(the majority) can't benefit from offload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 16:49:25 +08:00
03e88b30bd Add elastic PS evaluation plan for production-realistic trace
4 experiments: baseline vs elastic × linear vs lmetric
Using corrected trace (w600_r0.0015_st30, 70% multi-turn, APC~76%)
and fixed elastic PS (D accounting, offload cap, cache sync).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 15:56:05 +08:00
f5e45afd4e Fix 4 elastic PS bugs: D accounting, offload cap, cache migration, prefix sync
Bug 1+5: D instance had no accounting during prefill phase (7-11s window).
Router saw D as idle, routing extra traffic that caused KV allocation failures.
Fix: reserve D's ongoing_tokens+num_requests at offload decision time.

Bug 7: No cap on concurrent offloads despite REPORT claiming MAX_OFFLOAD=4.
Fix: add MAX_OFFLOAD_INFLIGHT=4 check before offloading.

Bug 6: Session affinity migrated to D but proxy cache estimator wasn't
updated for D. Future turns scored D as cache-cold.
Fix: call d_inst.record_prefix(token_ids) after successful decode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 15:55:11 +08:00
bf037594c4 Production-realistic baseline: APC 67.5%, TPOT +139% from interference
Updated methodology:
- Window+thin sampling preserves cross-session sharing (48% vs 16%)
- --max-single-turn-ratio 0.3 boosts multi-turn to 70%
- --window-seconds 600 for 10-min contiguous window
- Trace-driven replay (no session limit, no time compression)
- Daily config: --requests 850 (~13 min, APC~76%)

Key result: TPOT p90=0.175s (vs 0.073s in legacy 1-req/GPU setup),
confirming prefill-decode interference is real at production concurrency.
APC 67.5% (vs 44%) from better KV reuse preservation.

Also fixed KV reuse breakdown: 62% intra-session / 38% cross-session
(was incorrectly reported as 91% / 9%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 15:44:34 +08:00
d8dc9dc0ce Add --max-single-turn-ratio to control single-turn session fraction
Single-turn sessions with unique prefixes get 0% cache hit, diluting
APC in benchmarks.  --max-single-turn-ratio caps their fraction,
boosting multi-turn density and theoretical APC.

Example: --sample-ratio 0.008 --max-single-turn-ratio 0.3
  Before: 9.2% multi-turn, APC=70.5%
  After:  70.0% multi-turn, APC=85.0%, sharing=53.3%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 14:17:25 +08:00
1e1e2e774d Fix sampler: window+thin preserves cross-session KV cache sharing
Random session sampling destroys cross-session hash block sharing
(52% -> 16%) because sessions sharing system prompts get scattered.

New approach: take a contiguous time window from the trace (preserving
temporal locality of shared-prefix sessions), then thin within the
window to hit target QPS. This preserves both intra-session reuse
(62% of reusable tokens) and cross-session sharing (38%).

Results (block sharing rate):
  Old random r=0.002:  16.0%  ->  Window+thin: 29.7%
  Old random r=0.016:  19.5%  ->  Window+thin: 42.7%
  Full trace baseline: 52%

Also corrected the "91% intra-session" claim: actual split is
62% intra / 38% cross (token-level), making cross-session sharing
preservation critical for valid APC benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 14:03:12 +08:00
4089ffd63f Fix replay methodology: trace-driven dispatch, no artificial limits
The replayer was artificially limiting concurrency with --max-inflight-sessions
(semaphore) and --time-scale (time compression), producing unrealistically low
1 req/GPU load that masked prefill-decode interference.

Replayer changes:
- Remove session_sem and time_scale entirely
- Each request dispatched at its trace timestamp exactly
- Sessions still sequential (turn N+1 waits for turn N completion)
- If turn completes late, next turn fires immediately

Sampler changes:
- Add --sample-ratio for GPU-proportional session sampling
- Keep --target-requests for backwards compat
- No time compression (preserve original arrival pattern)

bench.sh: remove --time-scale and --max-inflight-sessions args

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:43:41 +08:00
c8ba666517 Benchmark concurrency gap: 1 req/GPU is 10-15x below production
Our --max-inflight-sessions 8 yields 1 req/GPU, masking prefill-decode
interference that appears at 2/GPU (+38% TPOT) and would dominate at
production load (~15/GPU). Updated §8 to re-evaluate elastic PS at
production concurrency. Next step: --max-inflight-sessions 64 benchmark.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:16:20 +08:00
fefbd71ca9 GPU imbalance analysis + elastic PS verdict + corrected LMetric results
Key findings:
- Session-sticky imbalance is 8.6x at 200 req (small-sample artifact)
  but only 1.24x at 1000 req (moderate, TPOT unaffected)
- Elastic PS not justified: interference reduction 0% at 1/GPU,
  migration reduces imbalance 1.24x→1.18x at 1.5s/event cost
- Corrected LMetric (no affinity) matches Linear (sticky) on all
  metrics (<2%), proving soft affinity from cache-hit scoring works
- Updated §3.4 errata, added §8 GPU imbalance + elastic PS analysis

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:11:23 +08:00
3594f7dce0 Fix LMetric routing: remove session affinity, align with OSDI'26 spec
LMetric was incorrectly sharing session-sticky logic with Linear policy.
Fixed to pure per-request routing: score = P_tokens × BS where
P = pending_prefill + (input - cache_hit), BS = num_requests.

Experiment result (200 req, fresh restart): Linear vs corrected LMetric
show <2% difference on all metrics — LMetric's cache-hit estimation
provides implicit soft affinity that preserves locality without explicit
session stickiness.

Also fix bench.sh missing cd (replayer module not found from non-project
cwd) and rewrite run_lmetric_ab.sh as thin wrapper around bench.sh to
eliminate duplicated launch/cleanup logic that broke under set -euo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 11:56:58 +08:00
8e0c6e78b0 Add comprehensive research findings document
Synthesizes all experiments into a paper-ready analysis:
- Agentic workload characteristics vs chatbot/API
- Why PD-Sep, LMetric, elastic RDMA, chunk-size tuning don't work
- Why cache-aware session-sticky routing IS the key optimization
  (-60% TTFT, +24pp APC vs round-robin)
- System-level insights: prefill-decode interference threshold,
  Mooncake limitations, effective request weight after cache
- GPU balance → HEAVY TTFT -10.5% (demonstrated)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 07:16:31 +08:00
080a8fa138 Chunk-size ablation + comprehensive synthesis
max_num_batched_tokens sweep at 16 sessions (2048/4096/8192/16384):
- Default 8192 has best overall TPOT p90 (0.106) and E2E p50 (5.83)
- 16384: HEAVY TTFT -16%, HEAVY TPOT -17%, but overall worse (+18%)
- Smaller chunks (2048/4096) always worse (scheduler overhead)

bench.sh now supports --max-batched-tokens flag.

Updated elastic_hypotheses.md with H8 (high concurrency validated),
H9 (elastic RDMA at 16s rejected), and final synthesis.

Key conclusion: for agentic workloads, the dominant optimization is
cache-aware session-sticky routing (-60% TTFT, +24pp APC vs RR).
Neither PD-Sep, LMetric, elastic RDMA, nor chunk-size tuning provides
additional benefit beyond well-tuned routing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 07:15:02 +08:00
baf7ffb08c 16-session contention: TPOT +45% from prefill-decode interference
Key finding: at 16 concurrent sessions (2 per GPU), TPOT p90 degrades
from 0.073 to 0.106 (+45%), with MEDIUM TPOT at 0.197 (+149%).
This is the first time we've reproduced real prefill-decode interference
in controlled experiments.

Elastic RDMA at 16 sessions doesn't help: only 13/500 offloaded (cache-gate
correct for cold turn-1), kv_both adds ~16% TPOT overhead at high concurrency.

Load scaling: 1000req_ts20, 200req_ts10, 200req_ts5, 500req_ts10 all show
~30% GPU util at 8 sessions. The bottleneck is max_inflight_sessions, not
arrival rate.

Updated elastic_hypotheses.md with H8, H9, and comprehensive final analysis.
The real bottleneck is vLLM's chunked prefill scheduling, not routing or
PD disaggregation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 05:51:47 +08:00
85b230455e H7 OVERLOAD_FACTOR sweep: negative result + H4 GPU profiling
H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU
imbalance (~3.5-4x across all settings). Root cause: imbalance is from
workload skew at session placement (turn 1), not from routing at turn 2+.

H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x),
and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded
requests have bimodal transfer times (0.6s or 18-31s) that negate the
routing benefit.

Updated elastic_hypotheses.md with H7 results and next directions:
higher load experiments where contention amplifies routing differences.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 03:04:02 +08:00
3bc37cc6d5 PS experiments + H4 cache-gate + GPU profiling + Mooncake elif→if fix
Experiments run:
- Phase 0: kv_both has zero idle overhead (TPOT +1.3%, noise)
- PS V1 (cold prefill): REJECTED — PS always slower than cached C
- PS V1+flexD: 92.5% OK, HEAVY TTFT 7.8s (baseline 5.0s) — PS bottleneck
- V2 (C_s prefill + flexible D): E2E -9% but 6 errors, RDMA bimodal
- H4 (cache-gate): 198/200 OK, GPU imbalance 4.0x→2.0x, but HEAVY_OFFLOAD
  TTFT=11.5s due to RDMA. HEAVY_COLO improved 10.5% from better balance.
- H5: Mooncake RDMA transfer R²=0.095, bimodal (0.6s or 18-30s)

Key findings:
- Mooncake lacks layerwise KV transfer → RDMA is pure sequential overhead
- 92% of HEAVY are turn-1 cold → offloading cold requests always loses
- GPU balance improvement from routing IS real (-10.5% HEAVY_COLO TTFT)
- RDMA transfer negates the routing benefit for offloaded requests

Code changes:
- bench.sh: add GPU timeline monitoring (gpu_monitor.sh during benchmark)
- cache_aware_proxy.py: H4 cache-gate, flexible D, PS routing
- mooncake_connector.py: elif→if fix (allow dual prefill+decode flags)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 02:14:37 +08:00
098d86385a Add elastic hypotheses tracking doc with H1-H6 analysis
Tracks all hypotheses tested during elastic PD disaggregation research:
- H1 (kv_both overhead): REJECTED — zero overhead at idle
- H2 (PS cold prefill): REJECTED — PS slower than cached C
- H3 (C_s+flexD): PARTIALLY VALIDATED — E2E -9% but HEAVY p90 +117%
- H4 (cache-aware offload): TODO — only offload high-cache-hit HEAVY
- H5 (RDMA overhead): TODO — Mooncake lacks layerwise transfer
- H6 (session migration): TODO — verify D's APC after migration

Key insight: offload decision should be cache-aware (new_tokens),
not size-based (total_input). 80k request with 90% cache = 8k prefill.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 01:17:12 +08:00
fc92410ec9 Invalidate prior A/B results + add proper experiment harness
Prior cross-machine comparison (commit 1e86285) was invalid: dash0
baseline used warm instances with residual KV cache, inflating TTFT
by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start
requests; WARM TTFT p90=3.3s vs fresh=0.26s.

Fair same-machine comparison (both fresh restart on dash0):
  Baseline:    TTFT50=1.075  TPOT90=0.076  E2E50=5.075  OK=198/200
  Elastic P2P: TTFT50=1.018  TPOT90=0.085  E2E50=6.977  OK=195/200
Elastic is WORSE due to Mooncake kv_both memory overhead.

Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
  add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
  and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:54:21 +08:00
e4fa56cb1e LMetric routing policy (OSDI'26) + A/B results vs linear baseline
Implement LMetric (P_tokens × BS multiplication score) from "Simple is
Better" (Zhang et al., OSDI'26) as alternative routing policy for
combined mode. Key changes:

- cache_aware_proxy.py: add --policy {linear,lmetric} flag, track
  pending_prefill_tokens and num_requests per instance, /stats endpoint
- run_lmetric_ab.sh: automated A/B script for fair comparison

Results (200 req, fresh restart, same trace):
  Linear:  TTFT50=1.086  TPOT90=0.077  E2E50=5.423
  LMetric: TTFT50=1.099  TPOT90=0.073  E2E50=5.205
  Delta:   TTFT +1.2%    TPOT -5.9%    E2E -4.0%

LMetric improves TPOT/E2E modestly through better load balancing, but
routing policy headroom is limited vs elastic P2P offload (-44% E2E).

TODO: vLLM → Redis → router pipeline for exact state ablation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 16:57:32 +08:00
2b0ac70ee7 Phase 1 milestone: system-level analysis + reproducible report
- REPORT.md: self-contained milestone report covering baseline vs elastic
  setup, exact launch commands, benchmark params, results, log locations,
  and repo structure — sufficient for anyone to reproduce
- analysis/pd_separation_analysis.md §5: elastic P2P system-level breakdown
  (KV cache hit ratio, per-class TTFT, GPU util paradox explanation)
- scripts/cache_aware_proxy.py: round-robin P-instance selection replacing
  argmin(ongoing_tokens) to fix GPU load imbalance (3.0x → expected ~2x)
- scripts/launch_elastic_p2p.sh: one-command launch for elastic P2P config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 16:17:41 +08:00
1e8628581b Fair A/B: Elastic P2P wins on ALL metrics vs baseline (fresh restart)
Same-condition comparison (both fresh restart, same trace, same params):
  Baseline (combined):  TTFT=2.383/27.622  TPOT90=0.117  E2E=10.232
  Elastic P2P (cap=4):  TTFT=1.315/13.179  TPOT90=0.075  E2E=5.708
  Delta:                -45%  / -52%        -36%          -44%

Key finding: TPOT p90 dropped 36% — confirming heavy prefill DOES
disrupt decode in combined mode, and elastic offload effectively
isolates it. Previous comparisons missed this because baselines
were run under different conditions (stale instances, different time_scale).

GPU util: elastic uses less GPU (15.8% vs 28.7%) but achieves better
latency — higher efficiency through better cache distribution.

APC: elastic has more balanced per-instance APC (36-38% prefix + 30-35%
external) vs baseline's skewed distribution (3.8% - 68.3%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 15:48:51 +08:00
76ee28a40f Elastic P2P v4: error rate 25% -> 4%, TTFT p50 -12% (median-tail tradeoff)
Fixed offload decision: removed p>=d gate (was blocking all offloads),
added MAX_OFFLOAD_INFLIGHT=4 cap and p_saturated threshold.

Result (200 req, fresh restart):
  Baseline: 99% success, TTFT=1.080/9.410, TPOT90=0.076, E2E=5.306
  Elastic:  96% success, TTFT=0.946/15.843, TPOT90=0.077, E2E=5.717

Architectural tradeoff confirmed:
  - Median (p50) improves: D instances not disrupted by heavy prefill
  - Tail (p90) worsens: offloaded HEAVY requests pay KV transfer cost
  - TPOT unchanged: decode isolation is not the bottleneck

To improve p90: need layerwise pipelined KV transfer (overlap with prefill
compute) or smarter offload gating that avoids offloading the very largest
requests (which have the longest prefill time and generate the most KV).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 15:08:16 +08:00
1d2eeb4925 Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080)
Design: offload HEAVY prefill only when P instance is less loaded than D
AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D
for future KV reuse. External KV correctly registered in prefix cache.

Result (67/200 processed, 75% success):
  TTFT p50: 0.551s (-49% vs baseline 1.080s)
  TTFT p90: 4.135s (vs baseline 9.410s, -56%)
  TPOT p90: 0.074s (same as baseline)
  E2E  p50: 2.938s (-45% vs baseline 5.306s)

25% error rate from ReadTimeout on very large HEAVY requests queuing on P.
Needs stricter elastic gate or higher timeout. But successful requests
show significant improvement over both baseline and previous P2P.

Also: added external_prefix_cache metrics tracking to replayer summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 13:50:25 +08:00
e9e313f9c5 P2P cache analysis: external KV correctly registered in prefix cache
Investigation confirms vLLM Mooncake connector DOES correctly register
externally-received KV blocks in the prefix cache. No bug exists.

Evidence from vLLM logs (per-instance):
  inst_1: prefix_cache=14.7%, external_cache=72.1%  <- high external hit
  inst_4: prefix_cache=52.4%, external_cache=59.0%

The 0.5% aggregate APC from /metrics was a measurement artifact:
inst_0 received 718M query tokens (cold-start prefills) with 0% hit,
diluting the aggregate. D-instances have 20-72% external cache hit.

The /metrics endpoint's prefix_cache_hits_total counter does not include
external hits. The vLLM log's "External prefix cache hit rate" is the
correct metric for Mooncake-transferred KV reuse.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 13:25:34 +08:00
1b9268ba4c P2P prefill offload: TTFT p50 -13% but p90 +59% (median-vs-tail tradeoff)
Fixed race condition in P instance selection (all going to inst_0).
P2P design: HEAVY requests prefill on least-loaded OTHER instance,
KV transfer via Mooncake, decode on session-sticky instance.

Result (200 req, fresh restart, vs baseline):
  TTFT p50: 1.080 -> 0.939 (-13%)   <- median improves (decode not disrupted)
  TTFT p90: 9.410 -> 14.987 (+59%)  <- tail worsens (KV transfer on large req)
  TPOT p90: 0.076 -> 0.075 (-1%)    <- unchanged (not the bottleneck)
  E2E p50: 5.306 -> 5.565 (+5%)     <- slightly worse overall

The P2P offload helps the common case (WARM/MEDIUM get lower TTFT because
their instance isn't blocked by a heavy prefill) but hurts HEAVY requests
(extra KV transfer latency). This is a median-vs-tail tradeoff.

For SLOs targeting p50: P2P offload helps.
For SLOs targeting p90/p99: baseline combined is better.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 12:28:24 +08:00
7f93d36970 System profile: 4 mechanisms why PD-Sep loses to session-sticky combined
Evidence-backed analysis with per-request matched comparison:

1. KV CACHE MEMORY WALL (Evidence 3)
   Combined: 12% KV cache per instance (comfortable)
   PD-Sep 6P+2D: 48-97% on decode instances (saturation -> 100s waits)

2. KV TRANSFER OVERHEAD (Evidence 4, matched requests)
   Mean 1.79s extra TTFT per request, 3.3x slower overall
   Small requests (<5k) hit 8.0x ratio (transfer dominates prefill)
   Large requests (>50k) hit 1.3x ratio (prefill dominates)

3. SESSION AFFINITY BROKEN (Evidence 5)
   Combined: turn N+1 hits same GPU -> 80% multi-turn APC
   PD-Sep: turn N+1 prefill on P has NO prior KV (sent to D) -> 0% APC on P
   Must re-prefill + re-transfer on every turn

4. GPU UNDERUTILIZATION (Evidence 2)
   PD-Sep: 12-17% GPU util (decode is memory-bound, wastes GPU compute)
   Combined: 28-54% GPU util (flexible P+D on same GPU)

Root cause: agentic workloads break PD-Sep's assumptions (short input,
no prefix sharing, compute-heavy prefill) with long context, 91%
intra-session KV reuse, and lightweight MoE compute.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 10:58:59 +08:00
42bcd31976 TP=2 DP=4 + hybrid routing: best TTFT at cost of TPOT
TP=2 DP=4 with hybrid routing achieves TTFT p50=0.611s (-43% vs TP=1),
the best TTFT across all tested configurations. But TPOT p90=0.109s
(+51% vs TP=1) due to cross-GPU all-reduce in decode.

Full comparison across 7 configurations shows two Pareto-optimal points:
  TP=1 DP=8 hybrid: best TPOT (0.072s), good TTFT (1.064s)
  TP=2 DP=4 hybrid: best TTFT (0.611s), acceptable TPOT (0.109s)

The choice depends on SLO:
  TTFT-sensitive (interactive) -> TP=2 DP=4
  TPOT-sensitive (streaming)   -> TP=1 DP=8

All PD-Sep configurations are strictly dominated by one of these two.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 10:35:18 +08:00
a65ec42467 Update report: adaptive v2 confirms no KV transfer helps single-machine
All PD/offload schemes tested are worse than PD-combined + hybrid routing:
  Combined hybrid:    TTFT=0.737  TPOT90=0.072  APC=49.4%  (BEST)
  PD-Sep 4P+4D:       TTFT=1.994  TPOT90=0.075  APC=40.2%
  Adaptive v2 offload: TTFT=1.462  TPOT90=0.077  APC=~45%

Definitive: single-machine agentic serving = PD-combined + smart routing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 10:15:08 +08:00
2fee355626 Adaptive v2 (selective Mooncake offload): worse than baseline
Implemented --offload mode: HEAVY requests (>20k new tokens) get P on
least-loaded instance, KV via Mooncake RDMA, D on session-sticky instance.
WARM/MEDIUM stay co-located (no KV transfer). All 8 instances run kv_both.

Result (200 req, same instances, fresh restart):
  Baseline (no offload):   TTFT=1.073  TPOT90=0.074  E2E=5.086
  Offload HEAVY:            TTFT=1.462  TPOT90=0.077  E2E=6.847
  Delta:                    +36%        +4%            +35%

Conclusion: even selective KV transfer (only 44% of requests) adds more
overhead than the isolation benefit provides. On single-machine 8 GPU,
PD-combined with hybrid routing is strictly optimal. No form of KV
transfer — full PD-sep, selective offload, or otherwise — improves
over co-located serving for this workload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 10:14:10 +08:00