Ran unified vs unified_v4 A/B on dash2 (8×H20, kv_both+DR-fix substrate, w600_r0.0015_st30_first600s trace).

Key findings:
- At 1x QPS (~1.3 req/s): zero migrations. pending_prefill_tokens is 0 for 95% of routing decisions because instances complete prefill before the next request arrives. The relative arm (src_pp > fleet_median*1.5) never fires.
- At 2x QPS (~2.7 req/s): 4 migrations (0.5%). src_pp>0 rises to 24% of eligible decisions. Trigger correctly identifies genuinely overloaded instances (src_pp 13k–73k vs fleet median 3.8k–33k).

Conclusion: mechanism is correct but migration benefit requires higher concurrency (scale-out or >3x QPS) where queue pressure makes the signal non-trivial. At single-node 8-instance scale, Pillar 1 (affinity routing) is sufficient and Pillar 2 gracefully degrades to no-op.

Next: scale-out validation (16+ GPU) where session skew naturally concentrates load and triggers migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migration Trigger Validation (unified_v4) — 2026-05-30
Hardware: dash2, 8×H20, Qwen3-Coder-30B-A3B, TP=1, kv_both + DR-fix substrate.
Trace: w600_r0.0015_st30_first600s.jsonl (807 reqs, 600s span).
Policy: unified_v4 = unified hybrid routing + pending-prefill-queue-triggered
session migration (commit 3a6bf5d on kzlin-dev branch).
Research Question
Does Pillar 2 (hot-triggered session migration) provide measurable benefit on top of Pillar 1 (affinity-default routing)?
Experiment Design
| Arm | Policy | Substrate | Trace QPS |
|---|---|---|---|
| unified_1x | unified (affinity-only) | kv_both + DR-fix | ~1.3 (original) |
| unified_v4_1x | unified_v4 (affinity + migration) | kv_both + DR-fix | ~1.3 |
| unified_v4_2x | unified_v4 (affinity + migration) | kv_both + DR-fix | ~2.7 (2× compressed) |
The 2× trace was generated by halving inter-request intervals:
ts_new = ts_min + (ts_orig - ts_min) / 2.
Results
1x QPS: unified vs unified_v4
| Metric | unified | unified_v4 | Delta |
|---|---|---|---|
| OK/total | 807/807 | 807/807 | — |
| TTFT mean | 3.990s | 4.142s | +3.8% |
| TTFT p50 | 0.719s | 0.711s | −1.0% |
| TTFT p90 | 11.499s | 12.293s | +6.9% |
| TPOT p90 | 0.024s | 0.022s | −9.3% |
| E2E p50 | 2.265s | 2.293s | +1.2% |
| E2E p90 | 24.507s | 23.955s | −2.3% |
| Migrations | 0 | 0 | — |
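Conclusion: At 1x QPS (~1.3 req/s, ~0.16 req/instance/s), the migration trigger NEVER fires. With zero migrations the two arms should route identically, and the observed deltas (all within ±10%) are consistent with run-to-run noise rather than a policy effect.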
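The Delta column can be rechecked from the table itself. A minimal sketch, assuming Delta is the relative change of unified_v4 against unified; the helper name is illustrative. TPOT p90 is left out because its values are shown at only two significant digits, and recomputing from rounded values can differ from the Delta column in the last digit.

```python
def rel_delta_pct(baseline, treatment):
    """Relative change of treatment vs. baseline, in percent."""
    return (treatment - baseline) / baseline * 100.0


# (unified, unified_v4) values in seconds, copied from the 1x table.
rows = {
    "TTFT mean": (3.990, 4.142),
    "TTFT p50": (0.719, 0.711),
    "TTFT p90": (11.499, 12.293),
    "E2E p50": (2.265, 2.293),
    "E2E p90": (24.507, 23.955),
}

deltas = {name: rel_delta_pct(b, t) for name, (b, t) in rows.items()}
for name, d in deltas.items():
    print(f"{name}: {d:+.1f}%")
```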
2x QPS: unified_v4 under higher load
| Metric | unified_v4 @ 2x |
|---|---|
| OK/total | 807/807 |
| TTFT mean | 5.227s |
| TTFT p90 | 15.000s |
| E2E p90 | 39.401s |
| Migrations | 4/807 (0.5%) |
4 migrated requests (all verified via v3_decode_target_url in breakdown):
| Session | Input | new_local | src_pp | fleet_median | proj_prefill | Target |
|---|---|---|---|---|---|---|
| 1313181 | 22,686 | 22,686 | 13,360 | 6,680 | 5.1s | inst_5 |
| 1310590 | 32,440 | 14,520 | 57,051 | 12,630 | 10.2s | inst_4 |
| 1373431 | 126,340 | 126,340 | 73,385 | 33,294 | 28.5s | inst_4 |
| 1313181 | 60,004 | 17,508 | 19,503 | 3,806 | 5.3s | inst_5 |
Root Cause Analysis: Why Zero Migrations at 1x
The unified_v4 trigger requires BOTH arms to fire simultaneously:
- Absolute SLO arm: `proj_prefill_s(src) > 2.5s`, which fires for 41% of eligible requests
- Relative arm: `src_pending_prefill > fleet_median × 1.5`, which NEVER fires at 1x
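A minimal sketch of the two-arm check, using the thresholds quoted above (2.5 s and 1.5×). The function and parameter names are illustrative and not taken from the unified_v4 source, and it assumes the proj_prefill column of the 2x migration table is the projected prefill time on the source instance. The sample rows are the four migrated requests from that table.

```python
def should_migrate(proj_prefill_s, src_pp, fleet_median_pp,
                   ttft_slo_s=2.5, rel_hi=1.5):
    """Return True only when both trigger arms fire (hypothetical helper)."""
    absolute_arm = proj_prefill_s > ttft_slo_s        # projected prefill breaches the SLO
    relative_arm = src_pp > fleet_median_pp * rel_hi  # source is hot relative to the fleet
    return absolute_arm and relative_arm


# (proj_prefill_s, src_pp, fleet_median) for the four migrated requests at 2x.
migrated = [
    (5.1, 13_360, 6_680),
    (10.2, 57_051, 12_630),
    (28.5, 73_385, 33_294),
    (5.3, 19_503, 3_806),
]
for proj, src, med in migrated:
    # Print the decision and the src/median ratio that the relative arm compares to 1.5.
    print(should_migrate(proj, src, med), round(src / med, 2))

# Typical 1x decision: the shadow counter is 0, so the relative arm cannot fire
# even when the projected prefill is long.
print(should_migrate(9.0, 0, 0))
```

Every migrated row has a src/median ratio of at least 2.0, so none of them sits near the 1.5× boundary.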
The relative arm fails because pending_prefill_tokens (the proxy's shadow
counter) is 0 for 95% of routing decisions at 1x QPS:
| QPS | src_pp > 0 (% of eligible) | Migrations |
|---|---|---|
| 1.3 (1×) | 5% (8/241) | 0 |
| 2.7 (2×) | 24% (62/257) | 4 |
Mechanism: pending_prefill_tokens reflects previously-dispatched requests
that haven't finished their prefill yet. At 0.16 req/instance/s, each instance
completes its prefill before the next request arrives — the counter is almost
always 0 at decision time. Only under genuine queueing pressure (2× and above)
does the counter become non-zero and the relative arm can fire.
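The mechanism can be illustrated with a toy single-instance model. This is a sketch under stated assumptions, not the proxy's implementation: prefills run FIFO one at a time, prefill time is linear in input tokens at an assumed 7,000 tok/s (a round figure loosely matching "60k tokens need ~9s"), and the request sizes are synthetic.

```python
def nonzero_fraction(arrivals, prefill_tok_per_s=7_000.0):
    """Fraction of arrivals that see pending prefill tokens on one instance.

    arrivals: list of (arrival_time_s, input_tokens), sorted by time.
    """
    in_flight = []    # (finish_time_s, tokens) for dispatched, unfinished prefills
    busy_until = 0.0  # time at which the instance finishes its queued prefills
    nonzero = 0
    for t, tokens in arrivals:
        # Retire prefills that completed before this routing decision.
        in_flight = [(fin, n) for fin, n in in_flight if fin > t]
        pending_prefill_tokens = sum(n for _, n in in_flight)
        if pending_prefill_tokens > 0:
            nonzero += 1
        # Dispatch: the new prefill starts once earlier ones are done.
        start = max(t, busy_until)
        busy_until = start + tokens / prefill_tok_per_s
        in_flight.append((busy_until, tokens))
    return nonzero / len(arrivals)


sizes = [5_000, 30_000, 5_000, 5_000]   # synthetic input lengths
gap_1x = 1 / 0.16                       # ~6.25 s between requests per instance
trace_1x = [(i * gap_1x, n) for i, n in enumerate(sizes)]
trace_2x = [(t / 2, n) for t, n in trace_1x]
print(nonzero_fraction(trace_1x), nonzero_fraction(trace_2x))
```

In this toy trace the 30k-token prefill finishes inside the 1x gap but overlaps the next arrival once the gaps are halved, which is the same transition the table above shows between 1× and 2×.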
The high TTFT at 1x (~11.5s p90) comes from compute-bound large prefills (single 60k+ token requests inherently need ~9s), NOT from queue depth.
Interpretation for Paper
- The migration mechanism is functionally correct. At 2x it fires on the right signal (src genuinely overloaded relative to fleet) and selects valid targets (cooler instances with load gap).
- At benchmark scale (8 instances, ~1.3 req/s), migration is not needed. The affinity-default routing (Pillar 1) already achieves APC ~79% and the remaining hot-pin issue is mild (max-worker/median-worker ≈ 3.7×). The "dispatch coupling" feedback loop is present but not yet at the catastrophic amplification regime.
- Migration becomes relevant under scale-out + higher load. With more instances (16–32), session skew concentrates more load per hot instance while cold instances sit idle, which is exactly the condition where `src_pp > fleet_median × 1.5` naturally fires. The 1x→2x progression (0%→0.5% migration rate) is consistent with the expected scaling direction.
- Paper §3.3 framing: Migration is a scale-out insurance mechanism that gracefully degrades to no-op under low load. Its value is NOT demonstrable at 8-instance single-node benchmark; the argument must rely on (a) the mechanism's correctness (this experiment), (b) the substrate's net-positive property (commit `ef9e010`), and (c) scale-out projection (future: 16+ GPU, multi-node).
Next Steps
- Scale-out validation (16 GPU, 2 nodes): With more instances and the same trace, session skew should keep a few hot instances loaded while the fleet median drops, so the relative arm is expected to fire naturally. This is the strongest evidence path.
- 3–4× QPS on 8 instances: Push to saturation to measure migration's effect in the catastrophic regime. Risk: may exceed serving capacity (errors).
- Threshold sensitivity: Ablate `v4_rel_hi` (1.5→1.2→1.0) and `v4_ttft_slo_s` (2.5→1.5→1.0) to characterize the trigger landscape.
Reproduction
# On dash2 (local /tmp, does NOT modify shared NAS):
# 1x QPS
bash /tmp/migration_exp/run_migration_ab.sh # interleaved unified vs unified_v4
# 2x QPS
python3 -c "
import json
trace_in = '/home/admin/cpfs/wjh/agentic-kv/traces/w600_r0.0015_st30_first600s.jsonl'
# Load the original trace (one JSON object per line).
with open(trace_in) as f:
    rows = [json.loads(line) for line in f if line.strip()]
# Halve every offset from the first timestamp, which doubles the request rate.
ts_min = min(r['timestamp'] for r in rows)
for r in rows:
    r['timestamp'] = ts_min + (r['timestamp'] - ts_min) / 2.0
with open('/tmp/migration_exp/trace_2x.jsonl', 'w') as f:
    for r in rows:
        f.write(json.dumps(r) + '\n')
"
bash /tmp/migration_exp/run_2x.sh
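For the 3–4× runs listed under Next Steps, the same compression generalizes to an arbitrary factor. A minimal sketch, with hypothetical helper names and synthetic timestamps standing in for the real trace:

```python
def compress_timestamps(timestamps, factor):
    """Scale offsets from the earliest timestamp by 1/factor (factor=2 doubles QPS)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    ts_min = min(timestamps)
    return [ts_min + (ts - ts_min) / factor for ts in timestamps]


def mean_qps(timestamps):
    """Requests per second over the span between first and last arrival."""
    span = max(timestamps) - min(timestamps)
    return len(timestamps) / span


# Synthetic arrivals, one every 0.75 s, standing in for the real trace.
original = [100.0 + 0.75 * i for i in range(9)]
for factor in (2, 3, 4):
    scaled = compress_timestamps(original, factor)
    # The QPS ratio should match the compression factor.
    print(factor, round(mean_qps(scaled) / mean_qps(original), 2))
```

Compression preserves request order and the first timestamp, so only arrival density changes between arms.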
Data Locations (dash2 /tmp, ephemeral)
| Path | Content |
|---|---|
| `/tmp/migration_exp/outputs/unified_run1/` | Baseline arm (1x) |
| `/tmp/migration_exp/outputs/unified_v4_run1/` | Migration arm (1x) |
| `/tmp/migration_exp/outputs/unified_v4_2x/` | Migration arm (2x) |
| `/tmp/migration_exp/outputs/*/breakdown.json` | Per-request routing decisions with v4_* fields |
| `/tmp/migration_exp/outputs/*/metrics.jsonl` | Per-request latency metrics |