agentic-pd-hybrid/docs/E3_FINDINGS_ZH.md
tim d40db1f117 docs(experiments): E3 first run — load-floor bonus works, exposes SGLang bug
H1 (load balance) confirmed at the 15-min checkpoint: D2 received
22.5% of bindings (225 out of 1001) covering 30 unique sessions,
versus 0 in both E1 and E2. The graduated load-floor formula with
K=200 produces the intended distribution: fresh sessions on
under-loaded D, sticky sessions stay put.

But decode-1 crashed at 11:51:21 (~5 min into benchmark) with an
SGLang AssertionError in schedule_batch.py:1646. Root cause: the
streaming-session correction at line 1572-1585 patches
req.extend_input_len to 0 when len(fill_ids) < len(prefix_indices),
but the downstream invariant uses raw fill_ids/prefix_indices
lengths, so the arithmetic check fails. This is a pre-existing
landmine in the b8e6f13 SGLang vendor patch, not caused by the
load-floor bonus. It just happened to be masked in E2 by the
failure cascade preventing sessions from accumulating deep enough
prefix to trigger the correction.

Crash session 1000195 stayed on decode-1 the whole time (not a
migration race). E3 exposes this faster because sessions actually
run further with rebalanced load.

5 fix options evaluated. Recommended: Fix A — local patch at
schedule_batch.py:1646 to skip zero-extend-len reqs before
asserting. Less invasive than C (recomputing seq/prefix arrays);
addresses the actual case (D and E are workarounds, not fixes).

4 decision points for review; no code changes in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:05:51 +08:00


E3 — first run findings + bug exposure

Status: the first E3 attempt aborted at ~16 min wall-clock time when decode-1 crashed on an SGLang assertion. Partial data confirms the load-floor bonus works as designed; the crash is an independent bug in the vendored SGLang, exposed by E3's new routing pattern.

Branch: h200-cu130. Companion: docs/E1_E2_RESULTS_ZH.md, docs/E1_E2_FIX_DESIGN_ZH.md.


1. What worked: load-floor bonus (K=200)

Within the first ~15 minutes of E3, before the crash:

| | E1 (run1) | E2 (run1) | E3 (run1, partial) |
|---|---|---|---|
| total bindings | 1285 | 1186 | admit attempts: 1001 |
| decode-0 bindings | 575 | 600 | 240 (24.0%) |
| decode-1 bindings | 710 | 685 | 536 (53.5%) |
| decode-2 bindings | 0 | 0 | 225 (22.5%) |
| unique sessions on D2 | 0 | 0 | 30 |

The load-floor bonus broke the overlap-pinning death spiral: D2 now receives traffic on Inferact's shared-boilerplate workload. The graduated formula (K * deficit / mean) plus the not-sticky gate produces the intended behavior: fresh sessions land on under-loaded Ds, while established sessions keep going to their original D for cache locality.
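
To make the formula concrete, here is a minimal sketch of how a graduated load-floor bonus of this shape could be computed. It is not the project's routing code: the helper name load_floor_bonus is hypothetical, and it assumes that load is measured in bindings and that deficit means "mean load minus this worker's load, clamped at zero". The sample loads are the E2 (run1) per-worker counts from the table above.

```python
from statistics import mean

K = 200  # bonus scale used in the E3 run


def load_floor_bonus(worker_load, all_loads, is_sticky, k=K):
    """Hypothetical helper: graduated bonus k * deficit / mean.

    Assumptions (not confirmed by the source): load is a binding count,
    and deficit = max(0, mean load - this worker's load).
    """
    if is_sticky:
        # Not-sticky gate: established sessions get no bonus, so they
        # keep going to their original D for cache locality.
        return 0.0
    avg = mean(all_loads)
    if avg <= 0:
        return 0.0  # no load anywhere, nothing to rebalance
    deficit = max(0.0, avg - worker_load)
    return k * deficit / avg


# Per-worker binding counts from E2 (run1), where D2 received nothing.
loads = {"decode-0": 600, "decode-1": 685, "decode-2": 0}
bonuses = {
    name: load_floor_bonus(load, list(loads.values()), is_sticky=False)
    for name, load in loads.items()
}
for name, bonus in bonuses.items():
    print(f"{name}: bonus={bonus:.1f}")
```

Under these assumptions a completely idle worker receives the full bonus K (up to float rounding), workers at or above the mean receive none, and the bonus shrinks linearly as a worker approaches the mean.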

This empirically validates the Q2.B design from docs/E1_E2_FIX_DESIGN_ZH.md, but only as far as the run got. End-to-end metrics (latency, TTFT, failure rate) are not interpretable yet because the worker died.

2. The new crash: SGLang streaming-session correction leaves an invariant violated

At 01:51:21 (~5 min into the benchmark), decode-1 hit:

[01:51:21] Correcting streaming-session extend_input_len from 6648 to 0
  (rid=6f4318e93dd543a49dbf19248cfc1e6f, session_id=1000195,
   fill_len=6648, prefix_len=43459, kv_committed_len=43459)
[01:51:21] Scheduler hit an exception: AssertionError
  at third_party/sglang/python/sglang/srt/managers/schedule_batch.py:1646
  → assert seq_len - pre_len == req.extend_input_len

Mechanism

With --enable-streaming-session, SGLang's session_aware_cache hands the scheduler a request whose fill_ids holds only the tokens added since the last turn (6648), while prefix_indices covers the prefix already cached on this D (43459 entries, matching prefix_len and kv_committed_len in the log). When the cached prefix is longer than fill_ids (e.g., the new turn's input is short relative to the conversation history already in cache), this code path fires at schedule_batch.py:1572-1585:

actual_extend_len = max(0, len(req.fill_ids) - len(req.prefix_indices))
if req.extend_input_len != actual_extend_len:
    logger.warning("Correcting streaming-session extend_input_len from %d to %d ...")
    req.set_extend_input_len(actual_extend_len)

So req.extend_input_len becomes max(0, 6648 - 43459) = 0.

Then at lines 1588-1590:

seq_lens = [len(r.fill_ids) for r in reqs]       # 6648
prefix_lens = [len(r.prefix_indices) for r in reqs]  # 43459

And at line 1646:

assert seq_len - pre_len == req.extend_input_len  # 6648 - 43459 == 0 → FAIL

The correction patches extend_input_len, but the downstream invariant is computed from the raw fill_ids/prefix_indices lengths, which the correction never touched. With the logged values the assertion compares 6648 - 43459 = -36811 against 0, so it cannot pass once the correction has clamped extend_input_len to zero.
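
The mismatch can be reproduced outside SGLang with a few lines. The sketch below is a standalone model, not the vendored code: FakeReq, correct_extend_len, and check_invariant are hypothetical stand-ins that mirror only the three fields and two steps quoted above, using the lengths from the crash log.

```python
from dataclasses import dataclass


@dataclass
class FakeReq:
    """Hypothetical stand-in for a scheduler request (three fields only)."""
    fill_ids: list
    prefix_indices: list
    extend_input_len: int


def correct_extend_len(req):
    # Mirrors the streaming-session correction: clamp at zero.
    req.extend_input_len = max(0, len(req.fill_ids) - len(req.prefix_indices))


def check_invariant(req):
    # Mirrors the downstream assertion, which uses the raw lengths.
    seq_len = len(req.fill_ids)
    pre_len = len(req.prefix_indices)
    assert seq_len - pre_len == req.extend_input_len, (
        seq_len - pre_len,
        req.extend_input_len,
    )


# Lengths taken from the crash log: fill_len=6648, prefix_len=43459.
req = FakeReq(
    fill_ids=[0] * 6648,
    prefix_indices=[0] * 43459,
    extend_input_len=6648,
)
correct_extend_len(req)

try:
    check_invariant(req)
    invariant_held = True
except AssertionError as exc:
    invariant_held = False
    mismatch = exc.args[0]  # (raw difference, corrected extend_input_len)
    print("invariant violated:", mismatch)
```

The clamp at zero is what breaks the equality: whenever the prefix is strictly longer than fill_ids, the raw difference is negative while the corrected value is 0.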

Provenance

The streaming-session correction (schedule_batch.py:1572-1585) and the assertion site (line 1646) are both inside the project's SGLang vendor patches: git log on this file shows the patch came from commit b8e6f13 feat(sglang): support decode session cache admission. So this is a bug in the project's own SGLang fork, not in upstream SGLang.

Why E3 triggers it and E2 didn't

The crash is independent of migration (session 1000195 stayed on decode-1 the entire time). Two factors combined to expose it in E3:

  1. D1 was under more sustained load in E3: 536 bindings across 17 unique sessions (about 31.5 bindings per session) means high re-binding density, hence more concurrent turns of the same session at the scheduler and a higher rate of streaming-session corrections.
  2. Faster overall dispatch: with D2 actually consuming work, the prefill→decode pipeline moves faster, so streaming-session entries reach the corrected state more often than in E2's saturated cap-out regime.

Both factors are consequences of the load-floor fix, but the fix is not the root cause of the crash. The crash is a pre-existing landmine in the vendored streaming-session code that E1 and E2 happened to avoid because their pipelines stalled before sessions accumulated enough committed prefix to trigger the correction.


3. Decision space for the fix

| # | Fix | Layer | Where / change | Risk |
|---|---|---|---|---|
| A | Patch the assertion to match the corrected state | vendored SGLang | schedule_batch.py:1646. Add: if req.extend_input_len == 0 and len(req.fill_ids) < len(req.prefix_indices): continue, to skip degenerate reqs before asserting. | Local, scoped, doesn't touch correctness elsewhere. Need to handle the skipped reqs (set was_skipped flag, drop from batch). |
| B | Fix the correction site to also drop the req from the batch | vendored SGLang | schedule_batch.py:1572-1585. When actual_extend_len == 0 and the req has nothing to extend, signal upstream to remove the req from this batch (defer or drop). | Slightly more invasive. The upstream call path needs to handle a "filtered" return. |
| C | Compute seq_lens and prefix_lens consistently with the correction | vendored SGLang | schedule_batch.py:1588-1590. After correction, recompute seq_lens = [len(r.fill_ids[:pre_len] + extension)] or align both sides. | Risky; affects all downstream tensor sizing. |
| D | Workaround: disable session migration in E3 (the trigger combination) | our CLI flag | --kvcache-migration-reject-threshold 0. One-line config change in sweep_e3_*.sh. | Doesn't actually fix the crash: session 1000195 didn't migrate. May reduce but not eliminate; might still hit it on a different session. |
| E | Workaround: disable streaming session | server flag | Remove --enable-streaming-session. Sidesteps the entire correction path. | Loses KVC's direct-to-D fast path (the central perf win we measure). Defeats the experiment. |

Recommendation

Fix A — patch schedule_batch.py:1646 to skip the malformed req before asserting. It's the minimal-blast-radius change and matches the apparent intent of the correction (graceful handling of the degenerate state).

Concretely:

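# Just before the assertion at line ~1646, inside the per-request loop.
# skip_indices is a hypothetical list created before the loop; i is the loop index.
if req.extend_input_len == 0 and len(req.fill_ids) < len(req.prefix_indices):
    # The streaming-session correction zeroed extend_input_len because
    # prefix_indices already covers fill_ids. Skip this req from the
    # extend batch: its KV is already committed; nothing to compute.
    skip_indices.append(i)
    continue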

Then the caller of prepare_for_extend needs to handle skipped requests (return them to the decode queue without an extend pass).
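
One way the caller-side handling could look is sketched below. This is an illustration under assumptions, not the vendored prepare_for_extend: Req, split_degenerate, and prepare_extend are hypothetical names, and lengths are stored as plain integers rather than derived from fill_ids/prefix_indices.

```python
from dataclasses import dataclass


@dataclass
class Req:
    """Hypothetical stand-in for a scheduler request."""
    rid: str
    fill_len: int
    prefix_len: int
    extend_input_len: int


def split_degenerate(reqs):
    """Partition reqs into (to_extend, skipped).

    A req is degenerate when the correction zeroed extend_input_len
    because the cached prefix is longer than fill_ids.
    """
    to_extend, skipped = [], []
    for req in reqs:
        if req.extend_input_len == 0 and req.fill_len < req.prefix_len:
            skipped.append(req)
        else:
            to_extend.append(req)
    return to_extend, skipped


def prepare_extend(reqs):
    """Check the invariant only on reqs that still have work to do."""
    to_extend, skipped = split_degenerate(reqs)
    for req in to_extend:
        assert req.fill_len - req.prefix_len == req.extend_input_len
    # The caller would return `skipped` to the decode queue
    # without running an extend pass on them.
    return to_extend, skipped


batch = [
    Req("normal", fill_len=5000, prefix_len=4000, extend_input_len=1000),
    Req("degenerate", fill_len=6648, prefix_len=43459, extend_input_len=0),
]
to_extend, skipped = prepare_extend(batch)
print([r.rid for r in to_extend], [r.rid for r in skipped])
```

Partitioning before the loop keeps the surviving batch dense, so downstream per-request arrays do not need index remapping. Whether that fits the real call path better than collecting skip_indices inside the loop is a design choice to settle during the patch.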

Avoid Fix D/E — D doesn't address the root cause (the failing session didn't migrate), and E loses the entire reason we're running this experiment.


4. Decision points for review

| # | Question | Default if no answer |
|---|---|---|
| D1 | Implement Fix A (vendor patch to skip zero-extend-len reqs)? | Yes |
| D2 | Re-run E3 with same K=200, same subset, after the fix? | Yes |
| D3 | Add a structured log entry every time the correction fires so we can track its frequency? | Recommended |
| D4 | File this as a separate feat(sglang) commit on the branch so the patch and the failure case it fixes are traceable? | Yes |
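
For D3, a structured entry could be as simple as one JSON line per correction plus a per-session counter. The sketch below uses only the Python standard library. The function name log_correction, the logger name, the event name, and the field names are all hypothetical; the sample values come from the crash log in section 2.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("streaming_session_correction")  # hypothetical name
correction_counts = Counter()  # corrections seen so far, keyed by session_id


def log_correction(rid, session_id, old_len, new_len, fill_len, prefix_len):
    """Emit one JSON line per correction and return the record."""
    correction_counts[session_id] += 1
    record = {
        "event": "streaming_session_extend_len_corrected",  # hypothetical
        "rid": rid,
        "session_id": session_id,
        "old_extend_input_len": old_len,
        "new_extend_input_len": new_len,
        "fill_len": fill_len,
        "prefix_len": prefix_len,
        "count_for_session": correction_counts[session_id],
    }
    logger.warning(json.dumps(record, sort_keys=True))
    return record


# Values from the crash log in section 2.
record = log_correction(
    rid="6f4318e93dd543a49dbf19248cfc1e6f",
    session_id=1000195,
    old_len=6648,
    new_len=0,
    fill_len=6648,
    prefix_len=43459,
)
```

One JSON object per line keeps the entries greppable and lets a post-run script aggregate correction frequency per session without parsing free-form warning text.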

5. What this tells us about KVC v2 maturity

The load-floor bonus's first real exposure to the production codepath uncovered an existing patch bug that was masked by E2's failure cascade. This is good news: the failure cascade in E2 was hiding another layer of breakage. Without rebalancing, sessions cap-out → cascade → never run long enough to commit deep prefixes → never hit the streaming-session correction → never crash. With rebalancing, sessions DO commit deep prefixes → trigger the correction → crash.

Each fix tends to expose the next-shallowest bug. This is expected for a stack of ~6 interacting subsystems (kv-aware policy, KVC admission, session_aware_cache, streaming session, mooncake transfer, prefill batch prep). The path forward is to keep patching, re-running, and pushing the failure boundary out.