agentic-kvc/docs/migration-policy-design.md
Gahow Wang 45b82272c3 Add migration policy design doc with A/B experiment results
Approach A (contention-aware cost model): TTFT p90 -52% vs baseline.
Approach B (session migration): 0 triggers at 1.5x threshold — needs tuning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 18:24:49 +08:00


Migration Policy Design: Improving Load Balance in Elastic KV

Problem Statement

With the unified cost model (v3), elastic routing reduces TTFT p90 by 37% relative to baseline on WARM/MEDIUM requests. However, HEAVY requests at turn >= 2 with a 99% cache hit still see TTFT of 5-150 s because of queuing contention on overloaded instances.

Root cause: the cost model combines cache benefit and queuing into a single scalar. When the cache hit is 99%, the cost is dominated by the queue estimate, but that estimate is computed as (pending_prefill + decode_tokens) / throughput, a token-based proxy that misses the real source of contention (batch size).

Key data (v3, 850 requests, 8 instances):

  • 391 turn>=2 HEAVY LOCAL requests were migration candidates
  • 298 (76%) had cache>80% — affinity held correctly
  • 38 of those 298 (13%) had TTFT>5s despite 94-99% cache hit (queuing victims)
  • Only 8 offloads were triggered in total (2 real migrations, 6 useless turn-1 offloads)
  • Theoretical TTFT for turn2+ HEAVY: mean=0.81s (actual: 4.73s, 5.8x gap)

Approach A: Contention-Aware Cost Model [ADOPTED]

Replace the queue estimate (pending_prefill + decode_tokens) / throughput with num_requests * decode_iteration_s + pending_prefill / throughput. num_requests (batch size) is the primary driver of decode iteration time and therefore of real contention.
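
As a minimal sketch, the two queue estimates can be compared side by side. The `InstanceLoad` container, the function names, and the throughput and token numbers are hypothetical and chosen only for illustration; only the two formulas and decode_iteration_s = 0.05 come from this document.

```python
from dataclasses import dataclass


@dataclass
class InstanceLoad:
    """Hypothetical snapshot of one instance; field names mirror the text."""
    num_requests: int      # current batch size
    pending_prefill: int   # prefill tokens waiting to run
    decode_tokens: int     # decode tokens in flight
    throughput: float      # tokens per second (made-up value below)


DECODE_ITERATION_S = 0.05  # per-request decode iteration cost (see Parameters)


def queue_estimate_v3(load: InstanceLoad) -> float:
    """Token-based proxy used by v3: ignores batch size."""
    return (load.pending_prefill + load.decode_tokens) / load.throughput


def queue_estimate_a(load: InstanceLoad,
                     decode_iteration_s: float = DECODE_ITERATION_S) -> float:
    """Approach A: batch size drives the decode term, prefill stays token-based."""
    return (load.num_requests * decode_iteration_s
            + load.pending_prefill / load.throughput)


# Two instances with identical token counts but very different batch sizes.
few = InstanceLoad(num_requests=2, pending_prefill=4000,
                   decode_tokens=2000, throughput=10000.0)
many = InstanceLoad(num_requests=20, pending_prefill=4000,
                    decode_tokens=2000, throughput=10000.0)

for name, load in (("few", few), ("many", many)):
    print(name, queue_estimate_v3(load), queue_estimate_a(load))
```

With these made-up numbers the v3 estimate cannot tell the two instances apart, while the Approach A estimate ranks the instance with 20 requests as the more contended one.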

Add a migration discount for sessions with accumulated cache (turn >= 2), reflecting the long-term value of migrating a session off a loaded instance.

Parameters

  • decode_iteration_s = 0.05 (per-request decode iteration cost on H20)
  • migration_discount_cap = 5 (max turns to discount)

Results (vs baseline, 850 requests, 8×H20)

| Metric | Baseline | Approach A | Change |
|---|---|---|---|
| ALL TTFT mean | 5.639 | 3.675 | -35% |
| ALL TTFT p90 | 16.058 | 7.638 | -52% |
| MEDIUM TTFT p90 | 4.412 | 1.681 | -62% |
| HEAVY TTFT p90 | 23.780 | 15.929 | -33% |
| ALL TPOT p90 | 0.105 | 0.075 | -28% |
| ALL E2E p50 | 7.446 | 6.429 | -14% |
| Errors | 0 | 0 | |

Approach B: Session-Level Lazy Migration [UNDER TUNING]

Add a migration trigger before the cost model. When a request arrives for a session on an overloaded instance, force migration if:

  1. Instance busy: num_requests > avg * migration_request_factor
  2. Session has cache: cache_ratio > 0.5
  3. Request is HEAVY: input_length >= heavy_threshold
  4. Target meaningfully less loaded: target.num_requests < source.num_requests - 2
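
The four conditions above could be sketched as a single predicate. This sketch assumes that all four must hold at once and that "avg" is the mean num_requests across all instances. The `Instance` type, the function name, and the example values (including heavy_threshold=8192) are hypothetical, not taken from the implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Instance:
    """Hypothetical view of a serving instance."""
    name: str
    num_requests: int


def should_force_migration(source: Instance,
                           target: Instance,
                           instances: Sequence[Instance],
                           cache_ratio: float,
                           input_length: int,
                           heavy_threshold: int,
                           migration_request_factor: float = 1.5) -> bool:
    """Return True when all four trigger conditions hold (assumed conjunction)."""
    avg = sum(i.num_requests for i in instances) / len(instances)
    busy = source.num_requests > avg * migration_request_factor      # condition 1
    has_cache = cache_ratio > 0.5                                    # condition 2
    is_heavy = input_length >= heavy_threshold                       # condition 3
    target_lighter = target.num_requests < source.num_requests - 2   # condition 4
    return busy and has_cache and is_heavy and target_lighter


# Made-up cluster states: one skewed, one balanced.
skewed = [Instance("i0", 20), Instance("i1", 4),
          Instance("i2", 4), Instance("i3", 4)]
balanced = [Instance("i0", 10), Instance("i1", 9),
            Instance("i2", 10), Instance("i3", 9)]

print(should_force_migration(skewed[0], skewed[1], skewed,
                             cache_ratio=0.99, input_length=10000,
                             heavy_threshold=8192))
print(should_force_migration(balanced[0], balanced[1], balanced,
                             cache_ratio=0.99, input_length=10000,
                             heavy_threshold=8192))
```

In the skewed example the source holds 20 requests against an average of 8, so the trigger fires. In the balanced example no instance exceeds 1.5x the average, so condition 1 fails and the request falls through to the cost model.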

Results (A+B combined, migration_request_factor=1.5)

No migrations were triggered: Approach A's contention-aware routing already distributes load well enough that no instance reaches 1.5x the average num_requests. The threshold needs to be lowered or the trigger redesigned.

Next steps

  • Lower migration_request_factor (e.g. 1.2 or 1.3)
  • Consider absolute threshold instead of relative (e.g. > avg + 3)
  • Or trigger based on recent TTFT rather than instantaneous num_requests
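
To make the first two options concrete, here is a small sketch that compares a relative and an absolute "busy" test on one load snapshot. The function names and the per-instance request counts are invented for illustration and are not measured data. The TTFT-based option is not covered.

```python
def busy_relative(num_requests: int, avg: float, factor: float) -> bool:
    """Relative threshold: busy when above avg * factor."""
    return num_requests > avg * factor


def busy_absolute(num_requests: int, avg: float, margin: float) -> bool:
    """Absolute threshold: busy when above avg + margin."""
    return num_requests > avg + margin


# Made-up num_requests for 8 instances under fairly even routing.
loads = [10, 12, 11, 15, 9, 10, 11, 12]
avg = sum(loads) / len(loads)
hottest = max(loads)

for factor in (1.5, 1.3, 1.2):
    print("relative", factor, busy_relative(hottest, avg, factor))
print("absolute", 3, busy_absolute(hottest, avg, 3))
```

On this snapshot the hottest instance (15 requests, average 11.25) stays below the 1.5x threshold but exceeds the 1.3x, 1.2x, and avg + 3 thresholds. That matches the observation that a 1.5x factor is hard to reach once routing is already well balanced.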

Evolution of Results

| Version | Description | ALL TTFT p90 | HEAVY TTFT p90 | tok max/min |
|---|---|---|---|---|
| Baseline | linear routing | 16.058 | 23.780 | 2.7x |
| v2 (bug) | unified, queue=prefill only | 23.339 | 38.070 | 10.3x |
| v3 | +decode in queue, +hard gate | 10.121 | 18.471 | 2.6x |
| A | +num_requests contention | 7.638 | 15.929 | 3.5x |
| A+B | +session migration (1.5x) | 8.291 | 16.384 | 3.0x |