Gahow Wang 8d422c4301 Migration trigger validation: unified_v4 fires at 2x QPS, not at 1x
Ran unified vs unified_v4 A/B on dash2 (8×H20, kv_both+DR-fix substrate,
w600_r0.0015_st30_first600s trace). Key findings:

- At 1x QPS (~1.3 req/s): zero migrations. pending_prefill_tokens is 0 for
  95% of routing decisions because instances complete prefill before the next
  request arrives. The relative arm (src_pp > fleet_median*1.5) never fires.
- At 2x QPS (~2.7 req/s): 4 migrations (0.5%). src_pp>0 rises to 24% of
  eligible decisions. Trigger correctly identifies genuinely overloaded
  instances (src_pp 13k–73k vs fleet median 3.8k–33k).

Conclusion: mechanism is correct but migration benefit requires higher
concurrency (scale-out or >3x QPS) where queue pressure makes the signal
non-trivial. At single-node 8-instance scale, Pillar 1 (affinity routing)
is sufficient and Pillar 2 gracefully degrades to no-op.

Next: scale-out validation (16+ GPU) where session skew naturally
concentrates load and triggers migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:36:58 +08:00

# Migration Trigger Validation (unified_v4) — 2026-05-30

Hardware: dash2, 8×H20, Qwen3-Coder-30B-A3B, TP=1, kv_both + DR-fix substrate.

Trace: `w600_r0.0015_st30_first600s.jsonl` (807 reqs, 600s span).

Policy: `unified_v4` = unified hybrid routing + pending-prefill-queue-triggered
session migration (commit `3a6bf5d` on `kzlin-dev` branch).

## Research Question
Does Pillar 2 (hot-triggered session migration) provide measurable benefit on
top of Pillar 1 (affinity-default routing)?
## Experiment Design
| Arm | Policy | Substrate | Trace QPS |
|---|---|---|---|
| unified_1x | unified (affinity-only) | kv_both + DR-fix | ~1.3 (original) |
| unified_v4_1x | unified_v4 (affinity + migration) | kv_both + DR-fix | ~1.3 |
| unified_v4_2x | unified_v4 (affinity + migration) | kv_both + DR-fix | ~2.7 (2× compressed) |

The 2× trace was generated by halving inter-request intervals:
`ts_new = ts_min + (ts_orig - ts_min) / 2`.
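
As a quick sanity check on the two QPS figures, the sketch below applies the same compression formula to a stand-in trace. The evenly spaced timestamps and the helper names `compress_timestamps` and `mean_qps` are assumptions made for illustration; only the request count (807), the 600s span, and the formula come from this report.

```python
def compress_timestamps(timestamps, factor=2.0):
    """Shrink every offset from the first request by `factor`."""
    ts_min = min(timestamps)
    return [ts_min + (t - ts_min) / factor for t in timestamps]


def mean_qps(timestamps):
    """Average request rate over the span of the trace."""
    span = max(timestamps) - min(timestamps)
    return len(timestamps) / span


# Stand-in trace (assumption): 807 requests spread evenly over 600 s.
original = [i * 600.0 / 806 for i in range(807)]
compressed = compress_timestamps(original, factor=2.0)

print(f"1x: {mean_qps(original):.3f} req/s")
print(f"2x: {mean_qps(compressed):.3f} req/s")
```

On this stand-in the rates come out near 1.3 and 2.7 req/s, in line with the "Trace QPS" column. The real trace is bursty, so its instantaneous rate will differ from these averages.
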
## Results
### 1x QPS: unified vs unified_v4
| Metric | unified | unified_v4 | Delta |
|---|---:|---:|---:|
| OK/total | 807/807 | 807/807 | — |
| TTFT mean | 3.990s | 4.142s | +3.8% |
| TTFT p50 | 0.719s | 0.711s | -1.0% |
| TTFT p90 | 11.499s | 12.293s | +6.9% |
| TPOT p90 | 0.024s | 0.022s | -9.3% |
| E2E p50 | 2.265s | 2.293s | +1.2% |
| E2E p90 | 24.507s | 23.955s | -2.3% |
| **Migrations** | **0** | **0** | — |
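
**Conclusion**: At 1x QPS (~1.3 req/s, ~0.16 req/instance/s), the migration
trigger NEVER fires. With zero migrations, unified_v4 reduces to the same
affinity routing as unified, so the deltas above (all under 10% in magnitude)
are attributable to run-to-run variation, not to the policy.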
### 2x QPS: unified_v4 under higher load
| Metric | unified_v4 @ 2x |
|---|---:|
| OK/total | 807/807 |
| TTFT mean | 5.227s |
| TTFT p90 | 15.000s |
| E2E p90 | 39.401s |
| **Migrations** | **4/807 (0.5%)** |

The four migrated requests, each verified via `v3_decode_target_url` in `breakdown.json`:

| Session | Input (tokens) | new_local (tokens) | src_pp (tokens) | fleet_median (tokens) | proj_prefill | Target |
|---|---:|---:|---:|---:|---:|---|
| 1313181 | 22,686 | 22,686 | 13,360 | 6,680 | 5.1s | inst_5 |
| 1310590 | 32,440 | 14,520 | 57,051 | 12,630 | 10.2s | inst_4 |
| 1373431 | 126,340 | 126,340 | 73,385 | 33,294 | 28.5s | inst_4 |
| 1313181 | 60,004 | 17,508 | 19,503 | 3,806 | 5.3s | inst_5 |
## Root Cause Analysis: Why Zero Migrations at 1x
The unified_v4 trigger requires BOTH arms to fire simultaneously:
- **Absolute SLO arm**: `proj_prefill_s(src) > 2.5s` — fires for 41% of eligible requests
- **Relative arm**: `src_pending_prefill > fleet_median × 1.5` — NEVER fires at 1x
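
The two-arm condition above can be sketched as a single predicate. This is a minimal illustration, not the code from commit `3a6bf5d`: the `RoutingDecision` class and `should_migrate` function are hypothetical names, while the thresholds (2.5s and 1.5) and the four sample rows are taken from this report.

```python
from dataclasses import dataclass


@dataclass
class RoutingDecision:
    proj_prefill_s: float    # projected prefill time on the source instance
    src_pp: int              # pending prefill tokens on the source instance
    fleet_median_pp: float   # median pending prefill tokens across the fleet


def should_migrate(d, ttft_slo_s=2.5, rel_hi=1.5):
    """Return True only when both arms fire for the same decision."""
    absolute_arm = d.proj_prefill_s > ttft_slo_s
    relative_arm = d.src_pp > d.fleet_median_pp * rel_hi
    return absolute_arm and relative_arm


# The four migrated requests from the 2x run (values from the results table).
migrated = [
    RoutingDecision(5.1, 13_360, 6_680),
    RoutingDecision(10.2, 57_051, 12_630),
    RoutingDecision(28.5, 73_385, 33_294),
    RoutingDecision(5.3, 19_503, 3_806),
]
print([should_migrate(d) for d in migrated])

# Illustrative 1x case (made-up values): a slow projected prefill,
# but an empty shadow counter on the source and across the fleet.
idle_source = RoutingDecision(proj_prefill_s=9.0, src_pp=0, fleet_median_pp=0)
print(should_migrate(idle_source))
```

Because the relative arm uses a strict inequality, a source with `src_pp == 0` can never satisfy it, whatever the absolute arm says. That is the 1x situation analysed in this section.
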

The relative arm fails because `pending_prefill_tokens` (the proxy's shadow
counter) is 0 for **95% of routing decisions** at 1x QPS:

| QPS | src_pp > 0 (% of eligible) | Migrations |
|---:|---:|---:|
| 1.3 (1×) | 5% (8/241) | 0 |
| 2.7 (2×) | 24% (62/257) | 4 |

**Mechanism**: `pending_prefill_tokens` reflects previously-dispatched requests
that haven't finished their prefill yet. At 0.16 req/instance/s, each instance
completes its prefill before the next request arrives, so the counter is almost
always 0 at decision time. Only under genuine queueing pressure (2× and above)
does the counter become non-zero and the relative arm can fire.

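
The effect can be reproduced with a toy model of one instance. The sketch assumes FIFO prefill with one request at a time, a constant prefill throughput, and a counter that holds a request's full token count until its prefill finishes. None of these details are confirmed by this report, and the function name and numbers are illustrative.

```python
def pending_at_arrivals(arrivals, tokens, tokens_per_s):
    """For each arrival, sum the tokens of earlier requests still in prefill."""
    in_flight = []       # (finish_time, tokens) for every dispatched request
    busy_until = 0.0     # time at which the instance finishes its current queue
    pending = []
    for t, n in zip(arrivals, tokens):
        # Shadow counter as seen at decision time, before this request is added.
        pending.append(sum(m for finish, m in in_flight if finish > t))
        start = max(t, busy_until)
        busy_until = start + n / tokens_per_s
        in_flight.append((busy_until, n))
    return pending


# Each 6,000-token prefill takes 1 s at the assumed 6,000 tokens/s.
sparse = pending_at_arrivals([0.0, 6.0, 12.0], [6000] * 3, 6000.0)
dense = pending_at_arrivals([0.0, 0.5, 1.0], [6000] * 3, 6000.0)
print(sparse)  # arrivals 6 s apart: prefill always done before the next request
print(dense)   # arrivals 0.5 s apart: earlier prefills are still running
```

With arrivals spaced wider than the prefill time the counter stays at zero, so only overlapping arrivals give the relative arm a non-zero signal.
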
The high TTFT at 1x (~11.5s p90) comes from **compute-bound large prefills**
(single 60k+ token requests inherently need ~9s), NOT from queue depth.
## Interpretation for Paper
1. **The migration mechanism is functionally correct.** At 2x it fires on the
right signal (src genuinely overloaded relative to fleet) and selects valid
targets (cooler instances with load gap).
2. **At benchmark scale (8 instances, ~1.3 req/s), migration is not needed.** The
affinity-default routing (Pillar 1) already achieves APC ~79% and the
remaining hot-pin issue is mild (max-worker/median-worker ≈ 3.7×). The
"dispatch coupling" feedback loop is present but not yet at the catastrophic
amplification regime.
3. **Migration becomes relevant under scale-out + higher load.** With more
instances (16–32), session skew concentrates more load per hot instance
while cold instances sit idle, which is exactly the condition where `src_pp >
fleet_median × 1.5` naturally fires. The 1x→2x progression (0%→0.5%
migration rate) shows the correct scaling direction.
4. **Paper §3.3 framing**: Migration is a **scale-out insurance mechanism**
that gracefully degrades to no-op under low load. Its value is NOT
demonstrable on the 8-instance single-node benchmark; the argument must rely on
(a) the mechanism's correctness (this experiment), (b) the substrate's
net-positive property (commit `ef9e010`), and (c) scale-out projection
(future: 16+ GPU, multi-node).
## Next Steps
- [ ] **Scale-out validation** (16 GPU, 2 nodes): With more instances and the
same trace, session skew should keep hot instances loaded while the fleet
median drops, so the relative arm is expected to fire without raising QPS.
This is the strongest evidence path.
- [ ] **3–4× QPS on 8 instances**: Push to saturation to measure migration's
effect in the catastrophic regime. Risk: may exceed serving capacity (errors).
- [ ] **Threshold sensitivity**: Ablate `v4_rel_hi` (1.5→1.2→1.0) and
`v4_ttft_slo_s` (2.5→1.5→1.0) to characterize the trigger landscape.
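
A first pass at the threshold ablation could replay logged decisions offline before any new runs. The sketch below is a rough illustration under stated assumptions. The tuple layout is hypothetical because the `breakdown.json` field names are not given here, and the last two decisions are synthetic values added only to make the sweep visible. The first four are the migrated requests from the 2x run.

```python
from itertools import product


def count_triggers(decisions, ttft_slo_s, rel_hi):
    """Count decisions where both the absolute and the relative arm fire."""
    return sum(
        1
        for proj_s, src_pp, median_pp in decisions
        if proj_s > ttft_slo_s and src_pp > median_pp * rel_hi
    )


# (proj_prefill_s, src_pp, fleet_median) per routing decision.
decisions = [
    (5.1, 13_360, 6_680),
    (10.2, 57_051, 12_630),
    (28.5, 73_385, 33_294),
    (5.3, 19_503, 3_806),
    (3.0, 5_000, 4_000),   # synthetic: blocked only by the relative arm at 1.5
    (1.8, 9_000, 3_000),   # synthetic: blocked only by the absolute arm at 2.5
]

rel_hi_values = (1.5, 1.2, 1.0)
slo_values = (2.5, 1.5, 1.0)
grid = {
    (rel_hi, slo): count_triggers(decisions, slo, rel_hi)
    for rel_hi, slo in product(rel_hi_values, slo_values)
}

for (rel_hi, slo), n in grid.items():
    print(f"v4_rel_hi={rel_hi} v4_ttft_slo_s={slo}: {n} triggers")
```

An offline replay like this only bounds the trigger rate. Each real migration changes later queue state, so live runs are still needed to measure the effect on latency.
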
## Reproduction
```bash
# On dash2 (local /tmp, does NOT modify shared NAS):
# 1x QPS
bash /tmp/migration_exp/run_migration_ab.sh # interleaved unified vs unified_v4
# 2x QPS
python3 -c "
import json
trace_in = '/home/admin/cpfs/wjh/agentic-kv/traces/w600_r0.0015_st30_first600s.jsonl'
rows = [json.loads(l) for l in open(trace_in)]
ts_min = min(r['timestamp'] for r in rows)
for r in rows: r['timestamp'] = ts_min + (r['timestamp'] - ts_min) / 2.0
with open('/tmp/migration_exp/trace_2x.jsonl', 'w') as f:
    for r in rows: f.write(json.dumps(r) + '\n')
"
bash /tmp/migration_exp/run_2x.sh
```
## Data Locations (dash2 /tmp, ephemeral)
| Path | Content |
|---|---|
| `/tmp/migration_exp/outputs/unified_run1/` | Baseline arm (1x) |
| `/tmp/migration_exp/outputs/unified_v4_run1/` | Migration arm (1x) |
| `/tmp/migration_exp/outputs/unified_v4_2x/` | Migration arm (2x) |
| `/tmp/migration_exp/outputs/*/breakdown.json` | Per-request routing decisions with v4_* fields |
| `/tmp/migration_exp/outputs/*/metrics.jsonl` | Per-request latency metrics |