Elastic P2P v4: error rate 25% -> 4%, TTFT p50 -12% (median-tail tradeoff)

Fixed offload decision: removed p>=d gate (was blocking all offloads),
added MAX_OFFLOAD_INFLIGHT=4 cap and p_saturated threshold.

Result (200 req, fresh restart):
  Baseline: 99% success, TTFT=1.080/9.410, TPOT90=0.076, E2E=5.306
  Elastic:  96% success, TTFT=0.946/15.843, TPOT90=0.077, E2E=5.717

Architectural tradeoff confirmed:
  - Median (p50) improves: D instances not disrupted by heavy prefill
  - Tail (p90) worsens: offloaded HEAVY requests pay KV transfer cost
  - TPOT unchanged: decode isolation is not the bottleneck

To improve p90: need layerwise pipelined KV transfer (overlap with prefill
compute) or smarter offload gating that avoids offloading the very largest
requests (which have the longest prefill time and generate the most KV).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-22 15:08:16 +08:00
parent 1d2eeb4925
commit 76ee28a40f
2 changed files with 74 additions and 10 deletions

View File

@@ -116,6 +116,8 @@ decode_instances: list[InstanceState] = []
session_affinity: dict[str, int] = {}
is_pd_sep = False
_breakdown_log: list[dict] = []
_offload_inflight = 0 # number of currently in-flight offloaded HEAVY requests
MAX_OFFLOAD_INFLIGHT = 4 # cap concurrent offloads to prevent P overload
async def init_prefill_bootstrap(instances: list[InstanceState], ready: asyncio.Event):
@@ -242,18 +244,21 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
avg_load = max(sum(i.ongoing_tokens for i in combined_instances) / len(combined_instances), 1.0)
# Decision logic:
# 1. P must be less loaded than D (otherwise offload makes things worse)
# 2. P must not be overloaded (ongoing > 1.5x average = would queue too long)
# 3. D should be currently decoding (otherwise no disruption to avoid)
if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
offload_reason = "p_busier_than_d"
elif p_inst.ongoing_tokens > avg_load * 1.5:
offload_reason = "p_overloaded"
elif d_inst.ongoing_decode_tokens == 0 and d_inst.ongoing_tokens < avg_load * 0.5:
offload_reason = "d_idle_no_benefit"
# 1. Global cap: max N concurrent offloads (prevents all-offload storm)
# 2. P must not already be saturated with heavy prefills
# 3. D must be doing something (otherwise no benefit from offloading)
# NOTE: We do NOT require P < D. P can be busier than D — the point
# is to keep heavy prefill OFF the session-sticky D instance so D's
# decode is not disrupted and D's KV cache is available for future turns.
global _offload_inflight
if _offload_inflight >= MAX_OFFLOAD_INFLIGHT:
offload_reason = "max_concurrent_reached"
elif p_inst.ongoing_tokens >= HEAVY_THRESHOLD * 2:
offload_reason = "p_saturated"
else:
use_offload = True
offload_reason = "p_available_d_busy"
offload_reason = "offload_accepted"
_offload_inflight += 1
if use_offload:
d_idx = best_idx
@@ -331,9 +336,12 @@ async def _handle_heavy_offload(api, req_data, headers, token_ids, input_length,
breakdown["t_prefill_done"] = _time.monotonic()
breakdown["error"] = str(e)
_breakdown_log.append(breakdown)
global _offload_inflight
_offload_inflight = max(0, _offload_inflight - 1)
raise HTTPException(status_code=502, detail="Prefill failed: %s" % e)
finally:
p_inst.ongoing_tokens -= input_length
_offload_inflight = max(0, _offload_inflight - 1)
# Step 2: Stream decode on d_inst (pulls KV from Mooncake)
d_inst.ongoing_tokens += input_length