Files
agentic-pd-hybrid/docs
tim ef4dc81ea9 docs(experiments): forensic explanation for E2 80% failure rate
Pulling admission-events.jsonl, prefill-0.log, and request-metrics
sampling shows the 1054 failures are NOT timeouts as initially
assumed. They are a 3-layer cascade:

  L1: 562 "no-space" + 43 "session-not-resident" worker admission
      rejects (51% of all admit attempts) because D0/D1 KV pools
      saturate while D2 stays empty.
  L2: rejects re-route to seed/reseed which need mooncake P→D KV
      transfer; the backlog drops mooncake heartbeats and prefill-0
      logs "Decode instance could be dead, remote mooncake session
      ... is not alive".
  L3: SGLang aborts the request, SSE stream closes with 0 tokens,
      agentic-pd-hybrid raises "generate stream ended before
      producing any token" (the literal error string for all 1054).

E1 didn't hit this because pd-disaggregation has no admission RPC —
sessions just queue behind the running batch, paying TTFT instead
of failing. KVC v2's worker admission is supposed to be a safety
valve; on the cold-D pathology it becomes a failure amplifier.

The real fix is upstream D rebalancing (cold-D bonus or pre-warm),
not relaxing admission.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:38:49 +08:00
..