agentic-kvc/analysis/migration_trigger_validation/README.md
Gahow Wang 8d422c4301 Migration trigger validation: unified_v4 fires at 2x QPS, not at 1x
Ran unified vs unified_v4 A/B on dash2 (8×H20, kv_both+DR-fix substrate,
w600_r0.0015_st30_first600s trace). Key findings:

- At 1x QPS (~1.3 req/s): zero migrations. pending_prefill_tokens is 0 for
  95% of routing decisions because instances complete prefill before the next
  request arrives. The relative arm (src_pp > fleet_median*1.5) never fires.
- At 2x QPS (~2.7 req/s): 4 migrations (0.5%). src_pp>0 rises to 24% of
  eligible decisions. Trigger correctly identifies genuinely overloaded
  instances (src_pp 13k–73k vs fleet median 3.8k–33k).

Conclusion: mechanism is correct but migration benefit requires higher
concurrency (scale-out or >3x QPS) where queue pressure makes the signal
non-trivial. At single-node 8-instance scale, Pillar 1 (affinity routing)
is sufficient and Pillar 2 gracefully degrades to no-op.

Next: scale-out validation (16+ GPU) where session skew naturally
concentrates load and triggers migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:36:58 +08:00


Migration Trigger Validation (unified_v4) — 2026-05-30

Hardware: dash2, 8×H20, Qwen3-Coder-30B-A3B, TP=1, kv_both + DR-fix substrate. Trace: w600_r0.0015_st30_first600s.jsonl (807 reqs, 600s span). Policy: unified_v4 = unified hybrid routing + pending-prefill-queue-triggered session migration (commit 3a6bf5d on kzlin-dev branch).

Research Question

Does Pillar 2 (hot-triggered session migration) provide measurable benefit on top of Pillar 1 (affinity-default routing)?

Experiment Design

| Arm | Policy | Substrate | Trace QPS |
|---|---|---|---|
| unified_1x | unified (affinity-only) | kv_both + DR-fix | ~1.3 (original) |
| unified_v4_1x | unified_v4 (affinity + migration) | kv_both + DR-fix | ~1.3 |
| unified_v4_2x | unified_v4 (affinity + migration) | kv_both + DR-fix | ~2.7 (2× compressed) |

The 2× trace was generated by halving inter-request intervals: ts_new = ts_min + (ts_orig - ts_min) / 2.

Results

1x QPS: unified vs unified_v4

| Metric | unified | unified_v4 | Delta |
|---|---|---|---|
| OK/total | 807/807 | 807/807 | |
| TTFT mean | 3.990s | 4.142s | +3.8% |
| TTFT p50 | 0.719s | 0.711s | -1.0% |
| TTFT p90 | 11.499s | 12.293s | +6.9% |
| TPOT p90 | 0.024s | 0.022s | -9.3% |
| E2E p50 | 2.265s | 2.293s | +1.2% |
| E2E p90 | 24.507s | 23.955s | -2.3% |
| Migrations | 0 | 0 | |

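Conclusion: At 1x QPS (~1.3 req/s, ~0.16 req/instance/s), the migration trigger never fires. With zero migrations, unified_v4 takes the same routing path as unified, and the remaining deltas (all under 10%, with mixed signs) are consistent with run-to-run variation rather than a policy effect.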

2x QPS: unified_v4 under higher load

| Metric | unified_v4 @ 2x |
|---|---|
| OK/total | 807/807 |
| TTFT mean | 5.227s |
| TTFT p90 | 15.000s |
| E2E p90 | 39.401s |
| Migrations | 4/807 (0.5%) |

4 migrated requests (all verified via v3_decode_target_url in breakdown):

| Session | Input | new_local | src_pp | fleet_median | proj_prefill | Target |
|---|---|---|---|---|---|---|
| 1313181 | 22,686 | 22,686 | 13,360 | 6,680 | 5.1s | inst_5 |
| 1310590 | 32,440 | 14,520 | 57,051 | 12,630 | 10.2s | inst_4 |
| 1373431 | 126,340 | 126,340 | 73,385 | 33,294 | 28.5s | inst_4 |
| 1313181 | 60,004 | 17,508 | 19,503 | 3,806 | 5.3s | inst_5 |

Root Cause Analysis: Why Zero Migrations at 1x

The unified_v4 trigger requires BOTH arms to fire simultaneously:

  • Absolute SLO arm: proj_prefill_s(src) > 2.5s — fires for 41% of eligible requests
  • Relative arm: src_pending_prefill > fleet_median × 1.5 — NEVER fires at 1x
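The two-arm condition can be written as a small predicate. The following is a minimal sketch, not the code from commit 3a6bf5d: the `Decision` type, its field names, and `should_migrate` are hypothetical, and the thresholds are taken from the values quoted in this README (2.5s and 1.5×). It assumes strict `>` comparisons, as written in the bullets above.

```python
from dataclasses import dataclass

# Threshold names mirror the knobs mentioned in this README (v4_ttft_slo_s,
# v4_rel_hi); everything else here is a hypothetical illustration.
V4_TTFT_SLO_S = 2.5
V4_REL_HI = 1.5


@dataclass
class Decision:
    src_pp: int            # pending prefill tokens on the source instance
    fleet_median: float    # median pending prefill tokens across the fleet
    proj_prefill_s: float  # projected prefill time on the source, in seconds


def should_migrate(d: Decision,
                   slo_s: float = V4_TTFT_SLO_S,
                   rel_hi: float = V4_REL_HI) -> bool:
    """Return True only when both arms fire for the same decision."""
    absolute_arm = d.proj_prefill_s > slo_s
    relative_arm = d.src_pp > d.fleet_median * rel_hi
    return absolute_arm and relative_arm


# The four migrated requests from the 2x table above.
migrated = [
    Decision(13_360, 6_680, 5.1),
    Decision(57_051, 12_630, 10.2),
    Decision(73_385, 33_294, 28.5),
    Decision(19_503, 3_806, 5.3),
]
all_fire = all(should_migrate(d) for d in migrated)

# Typical 1x case: a slow projected prefill but an empty queue everywhere,
# so the relative arm blocks the migration.
idle_queue = should_migrate(Decision(0, 0.0, 9.0))
```

Under these assumptions all four logged migrations satisfy both arms, while a decision with `src_pp = 0` cannot pass the relative arm regardless of how slow the projected prefill is.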

The relative arm fails because pending_prefill_tokens (the proxy's shadow counter) is 0 for 95% of routing decisions at 1x QPS:

| QPS | src_pp > 0 (% of eligible) | Migrations |
|---|---|---|
| 1.3 (1×) | 5% (8/241) | 0 |
| 2.7 (2×) | 24% (62/257) | 4 |

Mechanism: pending_prefill_tokens reflects previously dispatched requests that have not yet finished prefill. At 0.16 req/instance/s, each instance usually completes its prefill before the next request arrives, so the counter is almost always 0 at decision time. Only under genuine queueing pressure (2× and above) is the counter non-zero often enough for the relative arm to fire.
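A toy replay makes the effect of arrival spacing concrete. This is a sketch under simplifying assumptions: a single instance that prefills one request at a time in arrival order, and made-up durations and token counts chosen only for illustration. `pending_seen_at_arrival` is a hypothetical helper, not part of the proxy.

```python
def pending_seen_at_arrival(arrivals_s, prefill_s, tokens):
    """Return the pending-prefill token count each request sees on arrival.

    Assumes one instance that prefills requests serially in arrival order.
    """
    finished = []      # (prefill finish time, tokens) of earlier requests
    seen = []
    busy_until = 0.0   # time at which the instance finishes its current prefill
    for t, dur, n in zip(arrivals_s, prefill_s, tokens):
        # Count tokens of earlier requests whose prefill is still unfinished at t.
        seen.append(sum(tok for fin, tok in finished if fin > t))
        start = max(t, busy_until)
        busy_until = start + dur
        finished.append((busy_until, n))
    return seen


# Illustrative numbers only: three requests, 4s of prefill each.
durations = [4.0, 4.0, 4.0]
tokens = [20_000, 20_000, 20_000]

sparse = pending_seen_at_arrival([0.0, 6.0, 12.0], durations, tokens)
dense = pending_seen_at_arrival([0.0, 3.0, 6.0], durations, tokens)  # gaps halved
```

With 6s gaps every request arrives to an empty counter (`sparse == [0, 0, 0]`); halving the gaps leaves one earlier prefill unfinished at each later arrival (`dense == [0, 20000, 20000]`).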

The high TTFT at 1x (~11.5s p90) comes from compute-bound large prefills (single 60k+ token requests inherently need ~9s), NOT from queue depth.

Interpretation for Paper

  1. The migration mechanism is functionally correct. At 2x it fires on the right signal (src genuinely overloaded relative to fleet) and selects valid targets (cooler instances with load gap).

  2. At benchmark scale (8 instances, ~1 QPS), migration is not needed. The affinity-default routing (Pillar 1) already achieves APC ~79% and the remaining hot-pin issue is mild (max-worker/median-worker ≈ 3.7×). The "dispatch coupling" feedback loop is present but not yet at the catastrophic amplification regime.

  3. Migration becomes relevant under scale-out + higher load. With more instances (16–32), session skew concentrates more load per hot instance while cold instances sit idle, which is exactly the condition where src_pp > fleet_median × 1.5 naturally fires. The 1x→2x progression (0%→0.5% migration rate) shows the correct scaling direction.

  4. Paper §3.3 framing: Migration is a scale-out insurance mechanism that gracefully degrades to no-op under low load. Its value is NOT demonstrable at 8-instance single-node benchmark; the argument must rely on (a) the mechanism's correctness (this experiment), (b) the substrate's net-positive property (commit ef9e010), and (c) scale-out projection (future: 16+ GPU, multi-node).

Next Steps

  • Scale-out validation (16 GPU, 2 nodes): With more instances and the same trace, more sessions compete per-instance → higher pending_prefill → migration triggers naturally. This is the strongest evidence path.
  • 3–4× QPS on 8 instances: Push to saturation to measure migration's effect in the catastrophic regime. Risk: may exceed serving capacity (errors).
  • Threshold sensitivity: Ablate v4_rel_hi (1.5→1.2→1.0) and v4_ttft_slo_s (2.5→1.5→1.0) to characterize the trigger landscape.
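The threshold ablation could start as an offline sweep over logged routing decisions before any serving run is spent on it. The sketch below is an assumption-laden illustration: the record keys (`src_pp`, `fleet_median`, `proj_prefill_s`) are hypothetical stand-ins for whatever the v4_* fields in breakdown.json are actually called, and the three records are synthetic, not measured data.

```python
from itertools import product


def count_triggers(decisions, rel_hi, slo_s):
    """Count decisions where both arms would fire under the given thresholds."""
    return sum(
        1
        for d in decisions
        if d["proj_prefill_s"] > slo_s and d["src_pp"] > d["fleet_median"] * rel_hi
    )


def sweep(decisions, rel_his=(1.5, 1.2, 1.0), slos=(2.5, 1.5, 1.0)):
    """Return {(rel_hi, slo_s): trigger count} over the full threshold grid."""
    return {(r, s): count_triggers(decisions, r, s) for r, s in product(rel_his, slos)}


# Synthetic records for illustration only (hypothetical keys and values).
toy = [
    {"src_pp": 13_000, "fleet_median": 10_000, "proj_prefill_s": 3.0},
    {"src_pp": 9_000, "fleet_median": 4_000, "proj_prefill_s": 2.0},
    {"src_pp": 0, "fleet_median": 0, "proj_prefill_s": 9.0},
]
grid = sweep(toy)
```

On the toy records the default setting (1.5, 2.5) fires for none, and the loosest setting (1.0, 1.0) fires for two. The third record never fires: with a strict comparison, no value of v4_rel_hi helps when the counter is zero, so loosening the relative threshold alone would not be expected to change the 1x result.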

Reproduction

# On dash2 (local /tmp, does NOT modify shared NAS):
# 1x QPS
bash /tmp/migration_exp/run_migration_ab.sh   # interleaved unified vs unified_v4

# 2x QPS
python3 -c "
import json
trace_in = '/home/admin/cpfs/wjh/agentic-kv/traces/w600_r0.0015_st30_first600s.jsonl'
rows = [json.loads(l) for l in open(trace_in)]
ts_min = min(r['timestamp'] for r in rows)
for r in rows: r['timestamp'] = ts_min + (r['timestamp'] - ts_min) / 2.0
with open('/tmp/migration_exp/trace_2x.jsonl', 'w') as f:
    for r in rows: f.write(json.dumps(r) + '\n')
"
bash /tmp/migration_exp/run_2x.sh

Data Locations (dash2 /tmp, ephemeral)

| Path | Content |
|---|---|
| /tmp/migration_exp/outputs/unified_run1/ | Baseline arm (1x) |
| /tmp/migration_exp/outputs/unified_v4_run1/ | Migration arm (1x) |
| /tmp/migration_exp/outputs/unified_v4_2x/ | Migration arm (2x) |
| /tmp/migration_exp/outputs/*/breakdown.json | Per-request routing decisions with v4_* fields |
| /tmp/migration_exp/outputs/*/metrics.jsonl | Per-request latency metrics |