Files

Gahow Wang 8d422c4301 Migration trigger validation: unified_v4 fires at 2x QPS, not at 1x

Ran unified vs unified_v4 A/B on dash2 (8×H20, kv_both+DR-fix substrate,
w600_r0.0015_st30_first600s trace). Key findings:

- At 1x QPS (~1.3 req/s): zero migrations. pending_prefill_tokens is 0 for
  95% of routing decisions because instances complete prefill before the next
  request arrives. The relative arm (src_pp > fleet_median*1.5) never fires.
- At 2x QPS (~2.7 req/s): 4 migrations (0.5%). src_pp>0 rises to 24% of
  eligible decisions. Trigger correctly identifies genuinely overloaded
  instances (src_pp 13k–73k vs fleet median 3.8k–33k).

Conclusion: mechanism is correct but migration benefit requires higher
concurrency (scale-out or >3x QPS) where queue pressure makes the signal
non-trivial. At single-node 8-instance scale, Pillar 1 (affinity routing)
is sufficient and Pillar 2 gracefully degrades to no-op.

Next: scale-out validation (16+ GPU) where session skew naturally
concentrates load and triggers migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-30 15:36:58 +08:00

6.1 KiB

Raw Blame History

Migration Trigger Validation (unified_v4) — 2026-05-30

Hardware: dash2, 8×H20, Qwen3-Coder-30B-A3B, TP=1, kv_both + DR-fix substrate. Trace: w600_r0.0015_st30_first600s.jsonl (807 reqs, 600s span). Policy: unified_v4 = unified hybrid routing + pending-prefill-queue-triggered session migration (commit 3a6bf5d on kzlin-dev branch).

Research Question

Does Pillar 2 (hot-triggered session migration) provide measurable benefit on top of Pillar 1 (affinity-default routing)?

Experiment Design

Arm	Policy	Substrate	Trace QPS
unified_1x	unified (affinity-only)	kv_both + DR-fix	~1.3 (original)
unified_v4_1x	unified_v4 (affinity + migration)	kv_both + DR-fix	~1.3
unified_v4_2x	unified_v4 (affinity + migration)	kv_both + DR-fix	~2.7 (2× compressed)

The 2× trace was generated by halving inter-request intervals: ts_new = ts_min + (ts_orig - ts_min) / 2.

Results

1x QPS: unified vs unified_v4

Metric	unified	unified_v4	Delta
OK/total	807/807	807/807	—
TTFT mean	3.990s	4.142s	+3.8%
TTFT p50	0.719s	0.711s	−1.0%
TTFT p90	11.499s	12.293s	+6.9%
TPOT p90	0.024s	0.022s	−9.3%
E2E p50	2.265s	2.293s	+1.2%
E2E p90	24.507s	23.955s	−2.3%
Migrations	0	0	—

Conclusion: At 1x QPS (~1.3 req/s, ~0.16 req/instance/s), the migration trigger NEVER fires. The two arms produce statistically identical results.

2x QPS: unified_v4 under higher load

Metric	unified_v4 @ 2x
OK/total	807/807
TTFT mean	5.227s
TTFT p90	15.000s
E2E p90	39.401s
Migrations	4/807 (0.5%)

4 migrated requests (all verified via v3_decode_target_url in breakdown):

Session	Input	new_local	src_pp	fleet_median	proj_prefill	Target
1313181	22,686	22,686	13,360	6,680	5.1s	inst_5
1310590	32,440	14,520	57,051	12,630	10.2s	inst_4
1373431	126,340	126,340	73,385	33,294	28.5s	inst_4
1313181	60,004	17,508	19,503	3,806	5.3s	inst_5

Root Cause Analysis: Why Zero Migrations at 1x

The unified_v4 trigger requires BOTH arms to fire simultaneously:

Absolute SLO arm: proj_prefill_s(src) > 2.5s — fires for 41% of eligible requests
Relative arm: src_pending_prefill > fleet_median × 1.5 — NEVER fires at 1x

The relative arm fails because pending_prefill_tokens (the proxy's shadow counter) is 0 for 95% of routing decisions at 1x QPS:

QPS	src_pp > 0 (% of eligible)	Migrations
1.3 (1×)	5% (8/241)	0
2.7 (2×)	24% (62/257)	4

Mechanism: pending_prefill_tokens reflects previously-dispatched requests that haven't finished their prefill yet. At 0.16 req/instance/s, each instance completes its prefill before the next request arrives — the counter is almost always 0 at decision time. Only under genuine queueing pressure (2× and above) does the counter become non-zero and the relative arm can fire.

The high TTFT at 1x (~11.5s p90) comes from compute-bound large prefills (single 60k+ token requests inherently need ~9s), NOT from queue depth.

Interpretation for Paper

The migration mechanism is functionally correct. At 2x it fires on the right signal (src genuinely overloaded relative to fleet) and selects valid targets (cooler instances with load gap).
At benchmark scale (8 instances, ~1 QPS), migration is not needed. The affinity-default routing (Pillar 1) already achieves APC ~79% and the remaining hot-pin issue is mild (max-worker/median-worker ≈ 3.7×). The "dispatch coupling" feedback loop is present but not yet at the catastrophic amplification regime.
Migration becomes relevant under scale-out + higher load. With more instances (16–32), session skew concentrates more load per hot instance while cold instances sit idle — exactly the condition where src_pp > fleet_median × 1.5 naturally fires. The 1x→2x progression (0%→0.5% migration rate) shows the correct scaling direction.
Paper §3.3 framing: Migration is a scale-out insurance mechanism that gracefully degrades to no-op under low load. Its value is NOT demonstrable at 8-instance single-node benchmark; the argument must rely on (a) the mechanism's correctness (this experiment), (b) the substrate's net-positive property (commit ef9e010), and (c) scale-out projection (future: 16+ GPU, multi-node).

Next Steps

Scale-out validation (16 GPU, 2 nodes): With more instances and the same trace, more sessions compete per-instance → higher pending_prefill → migration triggers naturally. This is the strongest evidence path.
3–4× QPS on 8 instances: Push to saturation to measure migration's effect in the catastrophic regime. Risk: may exceed serving capacity (errors).
Threshold sensitivity: Ablate v4_rel_hi (1.5→1.2→1.0) and v4_ttft_slo_s (2.5→1.5→1.0) to characterize the trigger landscape.

Reproduction

# On dash2 (local /tmp, does NOT modify shared NAS):
# 1x QPS
bash /tmp/migration_exp/run_migration_ab.sh   # interleaved unified vs unified_v4

# 2x QPS
python3 -c "
import json
trace_in = '/home/admin/cpfs/wjh/agentic-kv/traces/w600_r0.0015_st30_first600s.jsonl'
rows = [json.loads(l) for l in open(trace_in)]
ts_min = min(r['timestamp'] for r in rows)
for r in rows: r['timestamp'] = ts_min + (r['timestamp'] - ts_min) / 2.0
with open('/tmp/migration_exp/trace_2x.jsonl', 'w') as f:
    for r in rows: f.write(json.dumps(r) + '\n')
"
bash /tmp/migration_exp/run_2x.sh

Data Locations (dash2 /tmp, ephemeral)

Path	Content
`/tmp/migration_exp/outputs/unified_run1/`	Baseline arm (1x)
`/tmp/migration_exp/outputs/unified_v4_run1/`	Migration arm (1x)
`/tmp/migration_exp/outputs/unified_v4_2x/`	Migration arm (2x)
`/tmp/migration_exp/outputs/*/breakdown.json`	Per-request routing decisions with v4_* fields
`/tmp/migration_exp/outputs/*/metrics.jsonl`	Per-request latency metrics

6.1 KiB Raw Blame History Unescape Escape