Captures the current state of the D→P RDMA snapshot push work for
the next agent (or future me): which commits land which phase, which
phases are verified vs in-flight, and the known unverified surfaces
(byte-level KV layout, cross-node, multi-D contention, token_id
consistency, D-side evict races, chunked-prefill interactions).
Also maps the §2 design points to their implementation locations so
the doc-to-code traceability is explicit.
Pre-registers the E4 experiment that tests whether KVC + D→P RDMA
snapshot push beats the naive PD-disagg E1 baseline on the
inferact_50sess subset. Compared to E3 the only changed flag is
--enable-d-to-p-sync.
Three hypotheses (see docs/E4_PROTOCOL_ZH.md §2.3):
  H1 (main): E4 TTFT p99 ≤ E1 TTFT p99
  H2:        E4 reseed-mode TTFT < E3 reseed-mode TTFT
  H3:        E4 success count ≥ E3 success count
The full reseed → snapshot-push orchestration is wired in b9b0cf0
(_attempt_d_to_p_sync); the SGLang scheduler RPCs and the runtime
mem-leak fix are in 86412bb / a369722.
Phase 2 prepare_receive allocates kv_pool slots that aren't visible
to radix / session bookkeeping until finalize_ingest. Without this
fix, the scheduler's idle self_check fires:
    ValueError: token_to_kv_pool_allocator memory leak detected!
      available=288391, evictable=5, protected=0, session_held=0
      (expected sum == 288460)
_check_radix_cache_memory now subtracts
    sum(len(rec.slot_indices) for rec in ctrl._ingest_records.values())
from the expected total before flagging a leak. The leak message also
prints snapshot_reserved for diagnostics.
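A minimal sketch of the accounting rule described above, not the vendored implementation. The function name and argument names are hypothetical; only the numbers come from the quoted leak message. It shows that the 64-slot gap in that message disappears once the same number of slots is treated as snapshot-reserved.

```python
def pool_accounting_gap(total, available, evictable, protected,
                        session_held, snapshot_reserved=0):
    """Return how many kv_pool slots are unaccounted for (0 means no leak).

    snapshot_reserved counts slots allocated by prepare_receive that
    finalize_ingest has not inserted yet; it is subtracted from the
    expected total, as the commit message describes.
    """
    expected = total - snapshot_reserved
    return expected - (available + evictable + protected + session_held)


# Numbers copied from the leak message quoted above.
observed = dict(total=288460, available=288391, evictable=5,
                protected=0, session_held=0)

gap_without_fix = pool_accounting_gap(**observed)
# Assumption for illustration: the whole gap is in-flight snapshot slots.
gap_with_fix = pool_accounting_gap(**observed,
                                   snapshot_reserved=gap_without_fix)
print(gap_without_fix, gap_with_fix)
```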
Smoke confirmed (scripts/smoke_snapshot_sglang_integration.py):
    [smoke] prepare_receive on P → 200: ok=true (96 layer bufs)
    [smoke] dump on D            → 200: ok=false, reason=session-not-resident
    [smoke] finalize on P        → 200: ok=true, inserted_prefix_len=0
    [smoke] OVERALL: PASS
End-to-end KV-correctness (snapshot ingest yields cache hit on next
prefill) still requires the agentic+router stack — covered in the E4
sweep, not this smoke.
Phase 3 — wires the SGLang-side snapshot RPCs (committed in 86412bb)
into the agentic reseed slow-path. On _invoke_kvcache_seeded_router:
  1. POST {prefill_url}/_snapshot/prepare_receive     alloc P-side slots
  2. POST {old_decode_url}/_snapshot/dump             RDMA push session KV
  3. POST {prefill_url}/_snapshot/finalize_ingest     insert into P radix
After step 3 P's radix tree has the session prefix cached; the subsequent
SGLang router-driven prefill on P hits cache instead of re-computing.
Any RPC failure short-circuits to the existing seeded_router fallback
(re-prefill from scratch). All steps are best-effort and emit structured
logs for post-hoc analysis.
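The three-step sequence and its fallback rule can be sketched as follows. This is an illustration under stated assumptions, not the body of _attempt_d_to_p_sync: the function name, the injected `post` callable, the `{"session_id": ...}` payload, and the example URLs are all hypothetical. Only the endpoint paths and the `ok` / `reason` reply fields are taken from the text and smoke output above.

```python
from typing import Any, Callable, Dict

PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def attempt_d_to_p_sync(post: PostFn, prefill_url: str,
                        old_decode_url: str, session_id: str) -> bool:
    """Run the three snapshot RPCs in order; return False on the first failure.

    `post` sends a JSON body and returns the decoded JSON reply. Injecting it
    keeps this sketch runnable without a network.
    """
    steps = [
        ("prepare_receive", f"{prefill_url}/_snapshot/prepare_receive"),
        ("dump", f"{old_decode_url}/_snapshot/dump"),
        ("finalize_ingest", f"{prefill_url}/_snapshot/finalize_ingest"),
    ]
    for name, url in steps:
        try:
            reply = post(url, {"session_id": session_id})  # payload shape assumed
        except Exception as exc:  # best-effort: transport errors mean fallback
            print(f"[d2p] step={name} error={exc!r}")
            return False
        if not reply.get("ok", False):
            print(f"[d2p] step={name} ok=false reason={reply.get('reason')}")
            return False
    return True


def fake_post(url: str, body: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in transport: the D-side dump reports the session as not resident.
    if url.endswith("/_snapshot/dump"):
        return {"ok": False, "reason": "session-not-resident"}
    return {"ok": True}


synced = attempt_d_to_p_sync(fake_post, "http://p:30000", "http://d:30001", "s1")
```

A False return is the caller's signal to continue on the existing seeded_router path, so a failed sync costs one re-prefill rather than a failed request.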
Flag plumbing:
  cli.py        --enable-d-to-p-sync (replay + benchmark)
  topology.py   SingleNodeTopology.enable_d_to_p_sync
  stack.py      SGLANG_SNAPSHOT_LINK_ENABLE=1 injection per worker
  replay.py     ReplayConfig.enable_d_to_p_sync +
                _attempt_d_to_p_sync helper
Snapshot port per worker derives from disaggregation_bootstrap_port +
1000 (set in third_party/.../snapshot/controller.py), so different
workers get distinct mooncake snapshot engines on the same node.
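As a small illustration of the port rule above: the +1000 offset is from the text, while the helper name and the bootstrap port values are hypothetical.

```python
SNAPSHOT_PORT_OFFSET = 1000  # offset stated in the commit message


def snapshot_port(disaggregation_bootstrap_port: int) -> int:
    """Derive a worker's snapshot-engine port from its bootstrap port."""
    return disaggregation_bootstrap_port + SNAPSHOT_PORT_OFFSET


# Hypothetical per-worker bootstrap ports on one node.
bootstrap_ports = {"prefill-0": 8998, "decode-0": 8999, "decode-1": 9000}
snapshot_ports = {w: snapshot_port(p) for w, p in bootstrap_ports.items()}

# Distinct bootstrap ports yield distinct snapshot ports.
all_distinct = len(set(snapshot_ports.values())) == len(snapshot_ports)
```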
Smoke (next): scripts/smoke_snapshot_sglang_integration.py spawns one
D + one P, exercises the 3 RPCs end-to-end, checks cache_tokens on a
follow-up generate request.
See docs/D_TO_P_SYNC_DESIGN_ZH.md for the full design.
Phase 2 of the D→P sync feature (Phase 1 in dc4867c verified the
underlying RDMA link in isolation). This commit wires that link into
each SGLang worker's scheduler so D and P can exchange session KV
without going through the PD prefill pipeline.
New module:
  third_party/sglang/python/sglang/srt/disaggregation/snapshot/
    controller.py: SnapshotLinkController owns one mooncake transfer
                   engine per worker, pre-registers all kv_pool layer
                   buffers, and exposes prepare_receive() and
                   push_session_kv() APIs. Receive bookkeeping via
                   a session_id → SnapshotIngestRecord side-table.
Three RPC types added to io_struct.py and full plumbing wired through:
  SnapshotPrepareReceiveReqInput/Output    P-side alloc + return layout
  SnapshotDumpReqInput/Output              D-side read kv_pool + RDMA push
  SnapshotFinalizeIngestReqInput/Output    P-side radix tree insert
Files touched:
  managers/io_struct.py                      3 new ReqInput/ReqOutput pairs
  managers/tokenizer_communicator_mixin.py   3 communicators, 3 awaitables
  managers/scheduler.py                      init controller + 3 handlers
  entrypoints/http_server.py                 3 HTTP endpoints under /_snapshot
Activation: set SGLANG_SNAPSHOT_LINK_ENABLE=1 (and
SGLANG_SNAPSHOT_LINK_HOST / _PORT / _IB_DEVICE) per worker. Controller
init is opt-in and defaults off, so production PD pipeline is
untouched.
Subsequent work (Phase 3): agentic-pd-hybrid orchestration in
_invoke_kvcache_seeded_router to call prepare_receive on P, dump on
D-old, finalize_ingest on P, then trigger the existing P→D' transfer
which will now hit P's radix cache (skipping re-prefill).
Confirms snapshot_link works for cuda device pointers, not just host
memory. Sender on cuda:0 pushes to receiver on cuda:1 via RDMA over
mlx5_60. All 5 sizes (16K, 1M, 16M, 64M, 256M) pass SHA verification.
   16 KB    8.3  ms    0.016 Gbps   (cold openSegment)
    1 MB    0.10 ms   87.6   Gbps
   16 MB    0.84 ms  159     Gbps
   64 MB    2.52 ms  213     Gbps
  256 MB    8.54 ms  251     Gbps   (~60% NDR400 line rate)
For Inferact-scale sessions (~50K tokens × ~80 KB per token across layers =
~4 GB), the 251 Gbps large-transfer rate projects D→P transfer time at
~130 ms, within the "reseed-savings" envelope sketched in design doc §3.2.
Files:
scripts/snapshot_link_receiver_gpu.py
scripts/smoke_snapshot_link_gpu.py
Next: SGLang scheduler integration for D-side dump + P-side ingest.
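The throughput figures and the ~130 ms projection above can be re-derived with a few lines of arithmetic. This is a back-of-envelope sketch: it assumes MB means 2^20 bytes in the table and KB means 1000 bytes in the per-token estimate, and the helper names are hypothetical.

```python
def throughput_gbps(size_bytes: int, elapsed_ms: float) -> float:
    """Bits transferred per second, expressed in Gbps."""
    return size_bytes * 8 / (elapsed_ms / 1e3) / 1e9


def transfer_ms(size_bytes: float, gbps: float) -> float:
    """Time in milliseconds to move size_bytes at a sustained rate of gbps."""
    return size_bytes * 8 / (gbps * 1e9) * 1e3


MIB = 1 << 20
rate_256m = throughput_gbps(256 * MIB, 8.54)   # measured 256 MB row

session_bytes = 50_000 * 80_000                # ~50K tokens x ~80 KB per token
projected_ms = transfer_ms(session_bytes, rate_256m)
print(f"{rate_256m:.0f} Gbps, {projected_ms:.0f} ms")
```

The projection assumes the session transfers at the large-message rate; small or fragmented transfers would sit closer to the slower rows of the table.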
Goal: skip P-side re-prefill on reseed path. Push session KV
snapshot from D back to P after each direct-to-D append; reseed
re-uses P's snapshot to fire only the P→D' transfer (no model.forward
on P).
Decision: Option C — D→P snapshot at append-commit, P-side
PrefillSnapshotStore (side-table, not in radix tree), prefill
bypass when snapshot is fresh. Rejects A (radix multi-producer),
B (D→D' direct, fails for session-not-resident), D (eviction-only).
Lays out 8-commit roadmap, wire protocol, failure modes, and the
E4 experiment plan (KVC + D→P vs naive PD-disagg E1 baseline).
After E3 exposed massive session-level eviction (90 trims × avg
67K tokens/evict = 6.1M tokens trashed in 1h12min), we have to
acknowledge the local-patch sequence (E2→load-floor→Fix A →
proposed disable-migration → proposed disable-admission) was a
KVC-to-DP collapse trajectory, not a fix.
The fundamental issue: SessionAwareCache merged two responsibilities
that should be separate.
1. Session lifecycle tracking (legitimate — streaming sessions
reuse KV across turns and need per-session metadata).
2. Eviction granularity decision (wrong — sessions should not be
the eviction unit).
`release_session` frees the session-exclusive range
[cache_protected_len, kv_allocated_len), which is the post-radix-
commit tail accumulated over decode/extend. On Inferact's
50-session workload this is 35-87K tokens per session. The radix
tree never gets a chance to do block-level leaf-LRU on that range
because it was never committed there.
Effect: evict-revisit cycle forces full 50-90K re-prefill per
session per evict — which is exactly the per-request cost of naive
PD-disagg. KVC's direct-to-D fast-path advantage collapses.
The right fix is structural (not a patch): progressively commit
streaming-session decode output to the radix tree so SGLang's
block-level LRU can shed only the deepest leaves, preserving the
recent prefix that next-turn requests are most likely to match.
SessionSlot becomes pure metadata. Scope is ~1-2 weeks of vendored
SGLang refactor, orthogonal-and-complementary to the D→P sync work
proposed in RESEED_SLOW_PATH_AND_D_TO_P_GAP §4.
Doc lists five anti-patterns the next agent should avoid (tuning
migration_reject_threshold, disabling migration/admission, etc) —
all of those are local symptoms downstream of the eviction
granularity choice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix A from docs/E3_FINDINGS_ZH.md §3. The existing streaming-session
correction at the top of ScheduleBatch.prepare_for_extend zeroes
req.extend_input_len when len(fill_ids) <= len(prefix_indices), but
the per-req invariant later in the same function (assert
seq_len - pre_len == req.extend_input_len) is computed from raw
fill_ids/prefix_indices lengths and has no path to be satisfied
when fill_len < prefix_len. The result is an AssertionError that
crashes the entire decode worker.
Add a pre-filter pass at the start of prepare_for_extend that
detects this state, marks the affected reqs with FINISH_ABORT (so
the client gets an error response instead of the worker hanging),
and drops them from the batch before the correction loop runs. If
all reqs are filtered, populate empty tensor/list state and return
early so downstream model.forward sees a valid no-op batch.
This treats len(fill_ids) < len(prefix_indices) as an upstream state
inconsistency that should be reported to the client rather than
silently miscomputed. The narrower invariant after this filter:
prepare_for_extend's body only ever sees streaming-session reqs
where actual_extend_len > 0, which is the regime the existing
correction logic was designed for.
Reproduced by E3 first run on 2026-05-12 02:51:21 UTC (rid
6f4318e93dd543a49dbf19248cfc1e6f, session 1000195, fill_len=6648,
prefix_len=43459) — masked in E1/E2 because the cap-out failure
cascade prevented sessions from accumulating deep enough committed
prefix to trigger the inconsistency.
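A minimal sketch of the pre-filter pass, assuming a simplified request type. `Req`, `prefilter_inconsistent`, and the string "FINISH_ABORT" marker are hypothetical stand-ins for the scheduler's real objects; the strict `<` boundary follows the fill_len < prefix_len wording above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Req:  # hypothetical stand-in for the scheduler's request object
    rid: str
    fill_ids: List[int]
    prefix_indices: List[int]
    finished_reason: Optional[str] = None


def prefilter_inconsistent(reqs: List[Req]) -> Tuple[List[Req], List[Req]]:
    """Split reqs into (kept, aborted) before the correction loop runs."""
    kept: List[Req] = []
    aborted: List[Req] = []
    for req in reqs:
        if len(req.fill_ids) < len(req.prefix_indices):
            # Upstream state is inconsistent: report it instead of asserting.
            req.finished_reason = "FINISH_ABORT"
            aborted.append(req)
        else:
            kept.append(req)
    return kept, aborted


# Lengths from the reproduced crash: fill_len=6648, prefix_len=43459.
bad = Req("crash-repro", fill_ids=[0] * 6648, prefix_indices=[0] * 43459)
good = Req("normal", fill_ids=[0] * 100, prefix_indices=[0] * 64)
kept, aborted = prefilter_inconsistent([bad, good])
batch_is_empty = not kept  # callers return an empty no-op batch in this case
```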
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
H1 (load balance) confirmed at the 15-min checkpoint: D2 received
22.5% of bindings (225 out of 1001) covering 30 unique sessions,
versus 0 in both E1 and E2. The graduated load-floor formula with
K=200 produces the intended distribution: fresh sessions on
under-loaded D, sticky sessions stay put.
But decode-1 crashed at 11:51:21 (~5 min into benchmark) with an
SGLang AssertionError in schedule_batch.py:1646. Root cause: the
streaming-session correction at lines 1572-1585 patches
req.extend_input_len to 0 when len(fill_ids) < len(prefix_indices),
but the downstream invariant uses raw fill_ids/prefix_indices
lengths, so the arithmetic check fails. This is a pre-existing
landmine in the b8e6f13 SGLang vendor patch, not caused by the
load-floor bonus. It just happened to be masked in E2 by the
failure cascade preventing sessions from accumulating deep enough
prefix to trigger the correction.
Crash session 1000195 stayed on decode-1 the whole time (not a
migration race). E3 exposes this faster because sessions actually
run further with rebalanced load.
5 fix options evaluated. Recommended: Fix A — local patch at
schedule_batch.py:1646 to skip zero-extend-len reqs before
asserting. Less invasive than C (recomputing seq/prefix arrays);
addresses the actual case (D and E are workarounds, not fixes).
4 decision points for review; no code changes in this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same outputs/inferact_50sess.jsonl subset as E1/E2 (md5
7bb263a32600ef5a6ef5099ba340a487). Identical to E2 except adds
--kvcache-load-floor-bonus 200. Tests three hypotheses:
  H1 (load balance): D2 receives non-trivial bindings (E1/E2: 0)
  H2 (failure rate): mooncake batch_transfer timeouts disappear
                     because D0/D1 KV pool no longer saturates
                     (E2 had 1054 fails; expect ≤ E1's 85)
  H3 (TTFT):         E2's 0.43s p50 (over the 231 successes)
                     generalizes to most reqs once cascade is gone
K override via LOAD_FLOOR_BONUS env var (default 200).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the design proposed and approved in
docs/E1_E2_FIX_DESIGN_ZH.md §Q2.B.
KvAwarePolicy gains a `load_floor_bonus: int = 0` knob. When > 0:
    mean_assigned = sum(assigned[*]) / len(D)
    for each D candidate:
        if not sticky and mean_assigned > 0:
            deficit     = max(0, mean_assigned - assigned[D])
            floor_bonus = K * deficit / mean_assigned
        else:
            floor_bonus = 0
        score = (overlap + sticky*α + floor_bonus, sticky, -inflight, -assigned)
Properties (verified by unit-style probe in commit message):
  - Default 0 = old behavior preserved
  - Sticky-gated: turn-1+ requests of an existing session keep going
    to their original D (cache locality preserved)
  - Graduated: bonus magnitude scales with the D's deficit ratio,
    approaches K as deficit/mean → 1, drops to 0 when balanced
  - Set above max expected boilerplate overlap (Inferact ~50 → 200)
    so cross-session shared-prefix overlap doesn't pin cold D's idle,
    but real per-session prefix overlap (>K blocks) still wins
Plumbed through ReplayConfig, BenchmarkConfig, and CLI flag
--kvcache-load-floor-bonus on both `replay` and `benchmark-live`.
Empirical verification on synthetic state (same conditions as the
E2 cold-D pathology):
- OFF (K=0): route fresh session → decode-0 (boilerplate winner)
- ON (K=200): route fresh session → decode-1 (cold D rebalanced)
Validation pass next: scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
(committed separately).
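The scoring rule in this commit can be made runnable as below. This is a sketch under assumptions, not the KvAwarePolicy code: the function name, the `alpha` value, and the synthetic worker state are hypothetical, and the sticky gate is read as "the session already has a prior D", matching the sticky-gated property listed above.

```python
from typing import Dict, Optional, Tuple


def pick_decode_worker(overlap: Dict[str, int], assigned: Dict[str, int],
                       inflight: Dict[str, int], sticky_worker: Optional[str],
                       load_floor_bonus: int = 0, alpha: int = 1000) -> str:
    """Return the D worker with the highest lexicographic score."""
    mean_assigned = sum(assigned.values()) / len(assigned)
    fresh = sticky_worker is None  # no prior D for this session

    def score(d: str) -> Tuple[float, int, int, int]:
        sticky = 1 if d == sticky_worker else 0
        if load_floor_bonus > 0 and fresh and mean_assigned > 0:
            deficit = max(0.0, mean_assigned - assigned[d])
            floor_bonus = load_floor_bonus * deficit / mean_assigned
        else:
            floor_bonus = 0.0
        return (overlap[d] + sticky * alpha + floor_bonus, sticky,
                -inflight[d], -assigned[d])

    return max(assigned, key=score)


# Synthetic state (illustrative only): decode-2 is cold, boilerplate
# overlap favors the workers that already host sessions.
overlap = {"decode-0": 50, "decode-1": 40, "decode-2": 0}
assigned = {"decode-0": 25, "decode-1": 25, "decode-2": 0}
inflight = {"decode-0": 3, "decode-1": 3, "decode-2": 0}

off = pick_decode_worker(overlap, assigned, inflight, None, load_floor_bonus=0)
on = pick_decode_worker(overlap, assigned, inflight, None, load_floor_bonus=200)
stuck = pick_decode_worker(overlap, assigned, inflight, "decode-0",
                           load_floor_bonus=200)
```

With K=0 the boilerplate overlap decides the route; with K=200 the cold worker's full deficit earns the whole bonus and outranks a 50-block overlap; a session that already has a D keeps it regardless of K.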
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mooncake C++ batch_transfer_sync defaults to 30s timeout; on
saturated D scheduler threads doing LRU eviction, that fires as a
false positive and the SGLang hair-trigger in conn.py:1270
permanently blacklists the D's mooncake_session_id (E2 forensic in
docs/E1_E2_RESULTS_ZH.md §5c). Bump to 1800s in setup_env.sh and
mirror to subprocess env in stack.py so SGLang workers get it too.
30-min envelope still detects genuinely broken peers eventually.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For Q1 (D scheduler LRU starves mooncake control plane → 30s
batch_transfer_sync timeout → hair-trigger blacklist), six candidate
fixes evaluated. Recommendation: do Q2 fix first since it removes
the only condition under which we observe LRU thrash; bump mooncake
timeout to 120s as cheap defense-in-depth; avoid invasive SGLang
vendor changes (windowed hair-trigger, async eviction thread) until
Q2 fix demonstrates they're insufficient.
For Q2 (overlap-first lex score + shared boilerplate → permanent
D2 cold), seven candidate fixes evaluated. Recommendation: load-
floor bonus (graduated, decoupled from overlap, gated on
not-sticky) as the primary mechanism — proactive on first-touch as
user requested, avoiding the binary one-shot pitfall of the
reverted cold-D bonus. Orthogonal cleanup: fix the substring filter
in _is_admission_rejection_mode so the existing migration mechanism
serves as a backstop when load balancing alone isn't enough.
7 decision points listed for review; no code merged until a shape
is approved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implementation jumped ahead of design. The cold-D bonus is one of
several candidates for the overlap-pinning fix (others: load-floor
bonus, idle-D bonus, capacity-aware overlap discount, pre-warming
boilerplate). Need to evaluate the design space first, including
whether a single bonus is even the right shape vs a separate term
in the lex score, before committing to a specific knob.
This reverts commit 786cbb8 cleanly (forensic docs in bf4da28 and
7f2ebf3 are kept since they record observations, not designs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KvAwarePolicy now accepts an optional cold_d_bonus int. When > 0,
fresh requests (sticky=0, i.e. no prior D for this session) receive
the bonus added to lex-score position 0 (overlap+sticky_bonus) for
any D worker that has never been assigned a session yet
(decode_assignment_counts == 0). This breaks the pathology
documented in docs/E1_E2_RESULTS_ZH.md §5d where workloads with
shared cross-session prefix (e.g. Inferact's "permissions
instructions" boilerplate) cause every D that has hosted any session
to dominate the overlap term against any cold D, leaving the cold D
permanently unused.
Sticky behavior is preserved: turn 1+ requests of an existing
session continue to stick to their original D because the bonus is
gated on `not sticky`.
Plumbed through ReplayConfig.kvcache_cold_d_bonus (default 0,
keeping current behavior unchanged), BenchmarkConfig, and CLI flag
--kvcache-cold-d-bonus on both `replay` and `benchmark-live`
subcommands. Set above max expected boilerplate overlap (Inferact's
~50 24-token blocks → 1000 is safe).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Q1 mystery resolves: P-side mooncake C++ logs show
"Sync batch data transfer timeout after 37452515723ns" (37.45 s) at
01:56:42 — this is mooncake's batch_transfer_sync giving up after
its internal timeout. The hair-trigger >=1 in conn.py:1270 is
correct in the idle case (a 30-s RDMA stall genuinely means the
peer is broken), but it fires here because of D-side congestion:
decode-0.log shows two consecutive LRU evictions ("Trimmed decode
session cache via LRU. evicted_sessions: 2, freed_tokens: 77675")
firing at the exact same wall second the timeout triggers.
The D scheduler thread is busy with multi-session GPU memory frees
+ session-aware-cache bookkeeping under lock; the mooncake C++
control plane on the receive side gets starved for >30 s; P times
out and marks the whole D's mooncake_session_id failed.
Two-layer fix listed in §5c: root-cause = spread load to D2 (cold-D
bonus, next commit); defense-in-depth = windowed threshold + retry
in vendored mooncake conn.py.
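One possible shape for the windowed threshold mentioned above, as a hedged sketch only. The class name, the threshold of 3, and the 60-second window are hypothetical choices, and nothing here is taken from conn.py.

```python
from collections import deque
from typing import Deque


class WindowedFailureGate:
    """Mark a peer failed only after max_failures within window_s seconds."""

    def __init__(self, max_failures: int, window_s: float) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures: Deque[float] = deque()

    def record_failure(self, now: float) -> bool:
        """Record one failure; return True if the peer should be marked failed."""
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures


gate = WindowedFailureGate(max_failures=3, window_s=60.0)
# One isolated stall at t=0, then three failures within 20 seconds.
decisions = [gate.record_failure(t) for t in (0.0, 100.0, 110.0, 120.0)]
```

Compared with the >=1 hair-trigger, a single congestion-induced timeout no longer blacklists the session, while a peer that keeps failing is still caught within the window.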
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Q1: Mooncake "is not alive" is hair-trigger — a single
send_kvcache_slice ret != 0 in
third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py
:1270 permanently adds the D's mooncake_session_id to failed_sessions
and blacklists it for the rest of the process lifetime. The D worker
process is alive (D1 keeps serving admit_direct_append OK seconds
after), but every subsequent P→D transfer for that session
short-circuits at conn.py:1184. The "Failures should never happen if
the session is not dead" comment encodes the wrong assumption for the
saturation regime we hit.
Q2: KVC v2's migration mechanism IS sound but its trigger is gated
by replay.py:1379 _ADMISSION_REJECTION_SUBSTRINGS = ("session-cap",
"no-d-capacity", "d-backpressure"). All 1054 failures have
execution_mode="kvcache-centric" (generic fallback bucket) which
contains none of those substrings, so session_d_rejects is never
incremented. Empirically 46 of 49 (sess, D) pairs that the worker
RPC rejected would have qualified for blacklist (most-rejected
pair: 25 rejects), but policy never saw them. Result: D0 reject
→ next-bind D0 (253×), D1 reject → next-bind D1 (329×), D0/D1 reject
→ next-bind D2 (0×).
Fix paths are documented for both. The shortest path is widening the
substring filter to include the failure-fallback bucket, but the
right fix is to call record_admission_reject directly from the
actual rejection signal site instead of string-matching execution_mode.
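The mismatch can be reproduced in a few lines. The substring tuple and the "kvcache-centric" mode string are quoted from the text above; the function body, the `RejectTracker` class, and the example session and worker names are hypothetical illustrations rather than the replay.py code.

```python
from typing import Dict, Tuple

_ADMISSION_REJECTION_SUBSTRINGS = ("session-cap", "no-d-capacity", "d-backpressure")


def is_admission_rejection_mode(execution_mode: str) -> bool:
    """String-matching gate: True only if a known substring appears."""
    return any(s in execution_mode for s in _ADMISSION_REJECTION_SUBSTRINGS)


class RejectTracker:  # hypothetical holder for per-(session, D) reject counts
    def __init__(self) -> None:
        self.session_d_rejects: Dict[Tuple[str, str], int] = {}

    def record_admission_reject(self, session_id: str, worker: str) -> None:
        key = (session_id, worker)
        self.session_d_rejects[key] = self.session_d_rejects.get(key, 0) + 1


# The generic fallback bucket never matches, so the counter stays at zero.
matched = is_admission_rejection_mode("kvcache-centric")

# Direct recording at the rejection site does not depend on the mode string.
tracker = RejectTracker()
tracker.record_admission_reject("session-a", "decode-0")
```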
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulling admission-events.jsonl, prefill-0.log, and request-metrics
sampling shows the 1054 failures are NOT timeouts as initially
assumed. They are a 3-layer cascade:
L1: 562 "no-space" + 43 "session-not-resident" worker admission
rejects (51% of all admit attempts) because D0/D1 KV pools
saturate while D2 stays empty.
L2: rejects re-route to seed/reseed which need mooncake P→D KV
transfer; the backlog drops mooncake heartbeats and prefill-0
logs "Decode instance could be dead, remote mooncake session
... is not alive".
L3: SGLang aborts the request, SSE stream closes with 0 tokens,
agentic-pd-hybrid raises "generate stream ended before
producing any token" (the literal error string for all 1054).
E1 didn't hit this because pd-disaggregation has no admission RPC —
sessions just queue behind the running batch, paying TTFT instead
of failing. KVC v2's worker admission is supposed to be a safety
valve; on the cold-D pathology it becomes a failure amplifier.
The real fix is upstream D rebalancing (cold-D bonus or pre-warm),
not relaxing admission.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E2 finished 1h33min wall. Headline contrast on the matched Inferact
50-session subset:
  E1 (naive 1P3D + kv-aware + RDMA):
    1200/1285 succ, lat p50=93s  p99=219s, TTFT p50=89s   p99=207s
  E2 (KVC v2 + RDMA):
     231/1285 succ, lat p50=7.4s p99=65s,  TTFT p50=0.43s p99=8.7s
E2 is 12.4× worse on failure count (1054 vs 85) but roughly 200× better
on TTFT p50 (0.43s vs 89s) among
the requests that did complete. Both runs leave D2 entirely unused
for the same structural reason: Inferact's shared "permissions
instructions" boilerplate makes overlap dominate the kv-aware lex
score, and v2's migration mechanism only fires on capacity rejects
which never reach D2. The 1054 E2 timeouts are downstream of that
imbalance, not a v2 bug per se.
The doc closes with five concrete follow-ups for the next agent —
cold-D bonus, router-mode admission, default-policy control arm,
TCP-loopback comparison, failure mode forensics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same pathological imbalance E1 showed reproduces in E2: D2 has zero
bindings at 33% POSTs in. Root cause is structural, not a KVC v2 bug:
all 50 Inferact sessions begin with identical "permissions
instructions" boilerplate, so the converter assigns them identical
first-block hash_ids. kv-aware policy's overlap term (lex-score
position 0) makes any already-resident D dominate a fresh D
unconditionally, and v2's migration only activates on admission
rejects which never fire because D0/D1 KV pools have headroom. The
H1 conclusion is qualified: KVC v2 helps per-request work (direct-
to-D fast path) but does not rebalance D worker load on workloads
with shared cross-session prefixes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E1 finished 1h29min wall on the 50-session Inferact subset. Headline:
1200/1285 succeeded, latency p50=93s p99=219s, TTFT p50=89s p99=207s,
85 timeouts. Decode-2 was never bound to a single session — all 50
sessions stuck to decode-0/1 by kv-aware policy stickiness with no
migration to rebalance, so effective topology was 1P2D, not 1P3D.
This is exactly the failure mode H1 predicts naive pd-disaggregation
should exhibit, giving E2 (full KVC v2 with migration) a concrete
baseline to improve against.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KVC v2 config from sweep_ts1_migration_v2.sh (reset-on-success +
direct-append threshold 8192) layered on top of the RDMA-enabled
mooncake stack, against the same outputs/inferact_50sess.jsonl
subset that E1 uses. Pair-wise contrast tests H1 (KVC layer marginal
contribution on top of 1P3D + kv-aware) and H2/H3 (RDMA reducing
reseed slow-path tail).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/sample_trace_subset.py — file-order head-cut that takes the
first N sessions of a converted trace. No RNG, no hashing — same
input yields byte-identical output (the included assertion compares
md5 across two runs).
scripts/sweep_e1_naive_1p3d.sh — E1 of ONBOARDING_NEXT_AGENT_ZH §3.1:
mechanism=pd-disaggregation, policy=kv-aware, 1P3D, RDMA on
(mlx5_60). Defaults to outputs/inferact_50sess.jsonl so E1 and E2
can share the exact same subset; override via TRACE= env var to run
on the full 20,230-request trace.
Reproducing the subset:
    uv run --no-sync python scripts/sample_trace_subset.py \
        --input outputs/inferact_codex_swebenchpro.jsonl \
        --output outputs/inferact_50sess.jsonl \
        --sessions 50
    # expected output_md5: 7bb263a32600ef5a6ef5099ba340a487
    # 1285 requests, mean input_length 67631 tokens
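A file-order head-cut of the kind described above might look like the following sketch. It is not the contents of sample_trace_subset.py: the `session_id` key and the toy records are hypothetical, and the real trace schema may identify sessions differently.

```python
import hashlib
import json
from typing import Iterable, List


def head_cut(lines: Iterable[str], n_sessions: int,
             key: str = "session_id") -> List[str]:
    """Keep every line belonging to the first n_sessions sessions seen."""
    seen = set()
    out: List[str] = []
    for line in lines:
        sid = json.loads(line)[key]
        if sid not in seen:
            if len(seen) == n_sessions:
                continue  # a session beyond the cut: skip, keep scanning
            seen.add(sid)
        out.append(line)
    return out


def md5_of(lines: List[str]) -> str:
    return hashlib.md5("".join(lines).encode("utf-8")).hexdigest()


# Toy trace: sessions a, b, c interleaved in file order.
trace = [json.dumps({"session_id": s, "turn": t}) + "\n"
         for s, t in [("a", 0), ("b", 0), ("a", 1), ("c", 0), ("b", 1)]]

subset = head_cut(trace, n_sessions=2)
same_twice = md5_of(subset) == md5_of(head_cut(trace, n_sessions=2))
```

Because the selection depends only on file order, rerunning it on the same input reproduces the same bytes, which is what the md5 comparison checks.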
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the full debugging journey of getting vendored SGLang 0.5.10
+ mooncake RDMA running on a 4×H200 node with the older driver
570.86.15. Driver 570's actual API is cu12.8 — nvidia-smi's
"CUDA Version: 13.0" header is a forward-compat ceiling, not the
driver's own version — and that single misreading drove most of the
detours. Lessons cover: pip vs vendor sglang divergence, why cu13
switching was a dead end (mooncake is cu12-only by wheel, driver 570
can't run cu13 anyway), why --disable-overlap-schedule alone isn't
enough, why pip nvidia-cuda-nvcc-cu12 doesn't ship the nvcc binary,
and how tvm_ffi's ninja-driven nvcc invocation makes CUDA_HOME the
single hook point that fixes everything.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
setup_env.sh: source-able shell snippet that points tvm_ffi (vendor
sglang JIT compiler) at $HOME/cuda-12.8/bin/nvcc and exposes both
libcudart.so.12 (for mooncake.engine, a cu12 wheel) and cu12.8 lib64
(for tvm_ffi compile-time linker) on LD_LIBRARY_PATH. Without this,
JIT-compiled kernels NEEDED libcudart.so.13 and driver 570 rejected
them at every JIT call.
convert_inferact_to_trace.py: turns Inferact codex_swebenchpro_traces
(ShareGPT {"from","value"} pairs) into the chat_id/parent_chat_id/
turn/hash_ids JSONL schema replay.py expects. Tokenizes with the
model's own tokenizer, builds prefix-sharing 24-token block hashes,
synthesizes timestamps. Output cross-checks 20,230 LLM calls — exactly
matches the Inferact README count for 610 successful trials.
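To illustrate why identical boilerplate yields identical leading hash_ids, here is a sketch of prefix-sharing block hashing. The 24-token block size is from the text; the chained SHA-256 scheme, the 64-bit truncation, and the dropping of a trailing partial block are assumptions, and the converter may hash differently.

```python
import hashlib
from typing import List, Sequence

BLOCK_TOKENS = 24  # block size stated in the commit message


def block_hash_ids(token_ids: Sequence[int],
                   block_tokens: int = BLOCK_TOKENS) -> List[int]:
    """Hash each full block chained to its parent so equal prefixes match."""
    hash_ids: List[int] = []
    parent = b""
    for i in range(len(token_ids) // block_tokens):
        block = token_ids[i * block_tokens:(i + 1) * block_tokens]
        payload = parent + ",".join(map(str, block)).encode("ascii")
        digest = hashlib.sha256(payload).digest()
        hash_ids.append(int.from_bytes(digest[:8], "big"))
        parent = digest  # the next block's hash depends on the whole prefix
    return hash_ids


# Two sessions share 48 tokens of boilerplate, then diverge.
boilerplate = list(range(48))
session_a = block_hash_ids(boilerplate + [7] * 24)
session_b = block_hash_ids(boilerplate + [9] * 24)
shared_blocks = sum(1 for x, y in zip(session_a, session_b) if x == y)
```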
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On a node with driver 570.86.15 (cu12.8 driver API ceiling), SGLang's
overlap event loop hits cudaErrorInsufficientDriver inside
event_loop_overlap_disagg_prefill → resolve_future_token_ids JIT
kernel. Switching to the normal event loop sidesteps this specific
codepath. The flag is harmless on newer drivers and remains a useful
default until overlap is independently re-validated on this hardware.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace pip-resolved sglang==0.5.10 with an editable install from
third_party/sglang/python. The vendored fork carries patches the pip
release does not (admit_direct_append RPC types,
_should_allow_local_prefill_on_decode, maybe_trim_decode_session_cache,
backpressure pause
hint) — KVC routing depends on them, so the vendored copy must be the
import target, not just on PYTHONPATH at runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups:
1. Drop "E1: naive 1P3D default" experiment from the onboarding manual.
   GPU hours are precious; naive 1P3D + policy=default has near-certain
   loss on multi-turn cache hit (it's round-robin without prefix awareness),
   so the comparison doesn't add information vs E1=naive 1P3D kv-aware.
   The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC
   v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial /
   5.5h parallel. Updated:
     - §0 TL;DR ("3 groups" -> "2 groups")
     - §2 H1 hypothesis (drop "one each of default and kv-aware" -> just kv-aware)
     - §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop)
     - §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2)
     - §6 decision table + expected-range table
     - §7 FAQ ("3 runs, E1-E3" -> "2 runs, E1-E2")
     - §9 deliverables
2. Move 8 deprecated docs to docs/archive/:
     AGENTIC_FIT_ANALYSIS_ZH.md           (ts=10 era analysis; superseded)
     STRUCTURAL_VALIDATION_REPORT_ZH.md   (ts=10 era validation; superseded)
     KVC_DEBUG_JOURNEY_V1_TO_V5.md        (v1-v5 sweep process notes)
     V5_PROFILE_INVESTIGATION_ZH.md       (v5 1Hz polling investigation)
     REFACTOR_PLAN_ZH.md                  (v0 plan; superseded by V1)
     KVCACHE_CENTRIC_PROGRESS_ZH.md       (earliest 2026-04-27 progress)
     SWEBENCH_EXPERIMENT_PROGRESS.md      (early SWE trace setup)
     SWEBENCH_EXPERIMENT_RESULTS.md       (early SWE result snapshot)
All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS /
REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from
`docs/FOO.md` to `docs/archive/FOO.md` via sed pass.
Added `docs/archive/README.md` explaining what each archived doc is
and when (if ever) to reopen it. Designed so a new reader hitting
the archive dir immediately knows it's not required reading.
After this commit the active docs in docs/ are 9 files (down from 17),
which should make the onboarding doc's "Level 1 / Level 2 / Level 3"
classification self-evident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A single self-contained reading manual designed to bring a fresh agent
(LLM or human) to current-state proficiency in 30 min of reading +
30 min of environment validation, then have them run the next round of
ablation experiments without re-litigating questions already settled.
Structure:
  §0  TL;DR -- what you are inheriting in 5 lines
  §1  Reading order, tiered into Must-Read / On-Demand / Archive,
      with reasons for each
  §2  Current-state snapshot: trace/hardware/branches + claims verified
      + hypotheses pending
  §3  The three ablation experiments (E1/E2/E3) with full CLI flag
      specifications and environment-validation checklist
  §4  Known gotchas (8 of them) with symptoms and fixes -- the most
      important section to skim before you start
  §5  CLI cheatsheet: run experiments / read data / plot / git
  §6  Result-analysis checklist: numbers to collect, expected ranges
  §7  FAQ for likely stuck-points
  §8  Anti-patterns: what NOT to do
  §9  Two specific deliverables the main agent expects back
  Appendix A: file location lookup table
  Appendix B: commit lookup table (by intent)
Goals encoded into the doc:
- Frame "your job is ablation, not new development" -- the new agent
should not be tempted to start D->P sync work; that goes on the
feat/d-to-p-sync branch in a separate phase.
- Make abort-accounting / max-input-len / mooncake-TCP-default
pitfalls extremely visible up front so they don't get repeated.
- Provide expected-result ranges so a 2x deviation is treated as a
config check, not a "finding".
- Make the critic-vs-production framing explicit so the new agent
knows when an audit-style "MAJOR" is actually a design intent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bar-overlap fix: extend ylim by 35-45% above the tallest bar to give the
"P GPU only sees 328 requests" and "P GPU does 1.07M tokens" annotations
clean white-bbox space above the bars instead of crashing into the KVC D
bars at x=1. Move both annotation xytext positions to x=2.4 (left panel)
and x=5.5 (right panel) so the arrows pull away from the orange P bar
toward the center of the panel.
Group labels (KVC 1P3D / DP 4-way CA) kept in axes-fraction bboxes at
y=1.02; subplot titles raised to pad=24 to leave room.
Note: a small visual collision between the bboxed group labels and the
subplot-title second line remains in the rendered output (acknowledged
in the prior conversation). Acceptable for now; full layout rework is
deferred. The annotation-vs-bar overlap (the original blocker) is fixed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone reference document capturing the v2 reseed slow-path forensic
audit before opening the feat/d-to-p-sync branch. Designed to be quoted
directly by future paper drafts and to prevent the team from re-litigating
the same questions verbally.
Contents:
§1. The three team-member challenges that disproved "capacity-backup will
save the slow path" (each with code citation and verdict):
1) P pool can't fit all backups -- replay.py:1618-1620 caps backup
count at 1 for sessions with ~50K peak input.
2) P's backup is a stale snapshot -- 49K of direct-to-D append work
never flows through P. _commit_prefill_backup_residency
(replay.py:1483) is only called from seed/reseed paths;
direct-to-D path (replay.py:2719) never touches P-side state.
3) When D evicts, old KV is freed directly (no D->P dump).
session_aware_cache.release_session only calls
kv_pool_allocator.free().
§2. End-to-end reseed timeline (t=0 to t=4550ms) with code citations
showing exactly where each component sits. P-side re-prefill =
1.5-3s, mooncake transfer = 1.5-4s, both contributing 50/50 to
total reseed cost.
§3. Table of "looks like D->P but isn't" code locations -- every
candidate found during forensic search ruled out with line citations.
§4. Specification of what D->P incremental sync would require:
mooncake bidirectional roles (~400 LOC), D-side append commit hook
(easy), P-side radix tree multi-producer extension (the real blocker),
agentic-pd-hybrid replay.py hooks. Estimated 1-2 weeks engineering.
§5. Confirmation via `git ls-remote origin --refs` that the author has NOT
secretly implemented D->P on another branch -- only main + this
working branch exist on the server.
§6. Roadmap for the upcoming feat/d-to-p-sync branch.
Appendices: code position crosswalk, related commits, paper section
suggestions.
This document is referenced by V2_DEEP_ANALYSIS_ZH §4.2 and by
KVC_ROUTER_ALGORITHM §9 Open Question 4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After an independent Opus-agent forensic audit, the previous "(c) incremental
fetch (large engineering effort, not implemented)" line in V2_DEEP_ANALYSIS §4.2
turned out to understate the gap. The audit confirmed:
- No D->P KV transfer code exists in the framework at any layer
(agentic_pd_hybrid orchestration, vendored SGLang disaggregation,
or mooncake transport).
- Mooncake MooncakeKVManager has a hard role split: PREFILL = sender,
DECODE = receiver-only loop. `add_transfer_request` asserts the
disaggregation_mode is PREFILL.
- The BaseKVSender / BaseKVReceiver abstraction has no bidirectional slot.
- session_aware_cache.release_session only calls kv_pool_allocator.free()
on eviction -- no serialization, no outbound network call.
- _commit_prefill_backup_residency is only called from the seed/reseed
path (_invoke_kvcache_seeded_router). direct-to-D path never updates
P-side backup state.
- "capacity-backup" policy semantics: it only skips the close on P after
reseed -- the backup is the seed-time static snapshot, never refreshed
by D-side append-prefill activity.
V2_DEEP_ANALYSIS §4.2:
- Decomposed the 3-7s reseed cost into the P-side re-prefill segment
(1.5-3s, dominant) and the P->D mooncake transfer segment (1.5-4s).
- Quantified the realistic effect of enabling RDMA: only the transfer
segment shrinks, reseed reduces to 1.7-3.2s, TTFT p99 ~0.7s, still
loses to DP's 0.43s.
- Replaced the throwaway "(c) incremental fetch" line with a full
paragraph explaining what D->P sync would require, why it's the
largest engineering gap, and that the blocker is SGLang's radix-tree
single-producer assumption, not the network layer.
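The §4.2 decomposition can be checked with a few lines of arithmetic. This is a sketch only: the ~0.2s RDMA transfer figure is an assumption back-solved from the quoted 1.7-3.2s range, not a measured value.

```python
# Reseed cost = P-side re-prefill + P->D transfer (ranges in seconds, from §4.2).
REPREFILL_S = (1.5, 3.0)
TCP_TRANSFER_S = (1.5, 4.0)
# Assumption: RDMA shrinks the transfer segment to roughly 0.2s
# (back-solved from the quoted 1.7-3.2s range, not measured).
RDMA_TRANSFER_S = (0.2, 0.2)


def reseed_range(prefill, transfer):
    """Add two (low, high) ranges element-wise."""
    return (round(prefill[0] + transfer[0], 3), round(prefill[1] + transfer[1], 3))


tcp_total = reseed_range(REPREFILL_S, TCP_TRANSFER_S)
rdma_total = reseed_range(REPREFILL_S, RDMA_TRANSFER_S)
print("TCP reseed range (s): ", tcp_total)
print("RDMA reseed range (s):", rdma_total)
```

Under this assumption the re-prefill segment sets the floor: even a zero-cost transfer leaves 1.5-3s of reseed time on the table.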
KVC_ROUTER_ALGORITHM §9:
- Refined Open Question 3 (RDMA) to clarify it only helps the transfer
segment, not the re-prefill segment.
- Added Open Question 4: D->P incremental KV sync as the central
future-work contribution gap, with cited evidence for why it doesn't
currently exist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two figures inserted into V2_DEEP_ANALYSIS §4.5 and §4.4 respectively, to
visually rebut the two critic-agent claims that we argued in prose were
design intent, not deficiencies.
(1) gpu_utilization.png -- §4.5 "P GPU is wasted 90% of the time"
Two-panel side-by-side:
Left (request count view, the naive reading): KVC P = 328 reqs (7.4%),
KVC D = ~1450 each, DP = ~1100 each. P "looks idle."
Right (compute work view, the honest reading): KVC P does 1.07M tokens
of prefill, comparable to each KVC D worker's ~0.80M. P is a
low-frequency high-cost safety net, not idle capacity.
Bonus finding: KVC's total compute (3.47M tokens across 4 GPUs) is 33%
LESS than DP's (5.17M). Same GPUs, less work done. That's the affinity
win.
(2) cache_efficiency.png -- §4.4 "Cache concentration is not policy win"
Two-panel side-by-side. The setup: KVC has 21% LESS total KV pool
(276K vs 351K tokens) yet caches MORE per request.
Left (cache hit rate vs turn number): KVC's session-affinity lets
hit rate accumulate with turns; DP's hash + radix-LRU causes
a mid-turn drift around turns 8-25 where KVC = 97.0% vs DP
= 95.8% (1.24pp gap). Shows mechanism, not just outcome.
Right (ECDF of per-request uncached tokens, log x): KVC's distribution
concentrates near zero (50% < 187 tokens), DP's is spread
(50% < 781 tokens). At uncached = 500 tokens threshold, KVC
has 74% of requests below, DP has 31%.
→ smaller pool, better retention, less per-request work. Direct empirical
rebuttal to "fragmentation is architectural, not policy."
Bundled scripts (rerunable):
- scripts/analysis/plot_gpu_utilization.py
- scripts/analysis/plot_cache_efficiency.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a two-panel TTFT PDF comparison plot inserted as a new V2_DEEP_ANALYSIS
§3.4 ("TTFT probability density comparison: bimodal vs unimodal"). Single-percentile numbers
(p50 / p99) hide the qualitative difference between the two distributions;
the figure makes it visible at a glance.
Left panel (linear x in [0, 0.6]s, body):
KVC has a sharp peak at ~40ms (the direct-to-D fast path).
DP has a broad peak around 50-200ms (full prefill per request).
Annotated with p50 and p90 markers for each side.
Right panel (log x in [10ms, 10s], full range):
KVC is visibly bimodal: a tall fast-path peak plus a small reseed tail
around 1-5s.
DP is unimodal: a single broad peak with shorter tail.
Annotated with p99 callouts pointing to each tail.
KDE: scipy.stats.gaussian_kde, bandwidth=0.15 for the body (Scott's rule
oversmooths the sharp fast-path peak), log10-transformed for the full-range
panel so the bimodal structure is visible.
Bundled:
- scripts/analysis/plot_ttft_pdf.py -- rerunable when v2 / DP data change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V2_DEEP_ANALYSIS §3.1 (execution_mode distribution) and §3.2 (path-level
latency vs DP) had hand-typed tables with approximate latencies (e.g.
"~1.0s") and required readers to mentally compare 5+ rows × 5 columns.
Both sections now reference generated PNG figures derived directly from
the v2 + DP metrics.jsonl files.
§3.1 figure (v2_execution_mode_distribution.png):
Horizontal bar chart, log x-axis. 4076 direct-to-D fast-path requests
(green) dwarf the rest by ~30x; the long tail of slow / fallback /
failure modes is visible at one glance. Counts and percentages
annotated on each bar.
§3.2 figure (v2_path_level_latency.png):
Grouped bar chart, log y-axis. Per-path TTFT p50 / TTFT p99 / Lat p50
with exact numeric labels (no more "~1.0s" approximations). Sample
counts annotated below each path. Quick visual reads:
- KVC fast path TTFT p50 41ms vs DP 92ms (2.2x faster)
- KVC reseed TTFT p99 5.12s vs DP 0.43s (12x slower) -- the cost
- KVC no-d-capacity TTFT p99 7.65s (worst case)
Bundled:
- scripts/analysis/plot_v2_path_breakdown.py -- the script that
generates both figures; rerunable when v2 data changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the critic-agent audit, V2_DEEP_ANALYSIS had drifted into an
audit-grade "5 wins / 1 loss / 3 draws" framing that mistook KVC's
deliberate design motifs (cache concentration via session affinity;
prefill-GPU idle as TTFT-stability trade-off) for "comparison
unfairness." This commit corrects the framing back to a production-
decision lens and adds a paper-track formal specification of the
router algorithm.
V2_DEEP_ANALYSIS_ZH.md changes:
- §0 TL;DR: lead with "online coding agent serving should pick
KVC 1P3D"; the only real cost is TTFT p99 long-tail (3x DP) from
the 8.3% mooncake reseed path, mitigable with real RDMA.
- §4 restructured into three buckets:
real costs (TTFT p99 tail, abort accounting now fixed),
counter-arguments to the critic (cache concentration and idle
prefill GPU are design intent, not deficits),
methodology to-do (naive-1P3D control, v2 N>=2 determinism).
- §6 replaces "5/1/3 rescoring" with production decision rationale:
KVC wins on 6 latency/TTFT metrics + lower failure rate; pays
TTFT p99 tail; lists workloads where DP would reverse the call.
- §8 decision points: D1 recommends Yes (accept v2 as milestone);
D8 added: paper motif "KVC trades P idle for TTFT stability."
KVC_ROUTER_ALGORITHM.md (new, paper-track, Chinese narrative + English
algorithm boxes / variable names / theorems for direct paper reuse):
- Problem formulation, system model, full notation
- Algorithm 1 Route: lexicographic-tuple scoring on
(overlap+alpha*sticky, sticky, -inflight, -assigned)
- Algorithm 2 Admit: D-worker autonomous admission deciding
Direct / Seed / Reseed / reject (with reason)
- Algorithm 3 Dispatch: end-to-end orchestration with reset-on-success
(the v2-specific fix that eliminates v1's self-amplifying thrashing)
- Theorem 1 (no permanent starvation) and Theorem 2 (fast-path
determinism), each with a proof sketch
- Comparison table vs vanilla pd-disagg / DP cache-aware
- Anti-patterns ("what KVC explicitly is NOT")
- Open questions for reviewers
- Suggested paper citation phrasing
- Appendix A: algorithm-step to source-file:line crosswalk
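The lexicographic-tuple scoring in Algorithm 1 can be illustrated with a minimal sketch. The `DWorker` fields, the `alpha` default, and the sample workers are hypothetical stand-ins for illustration, not the names used in the router source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DWorker:
    name: str
    overlap: float  # prefix-overlap score with the incoming request
    sticky: int     # 1 if the session is already bound to this worker, else 0
    inflight: int   # requests currently running on the worker
    assigned: int   # sessions currently assigned to the worker


def route_key(w: DWorker, alpha: float) -> tuple:
    # Tuples compare left to right, so later fields only break ties
    # among workers whose earlier fields are equal.
    return (w.overlap + alpha * w.sticky, w.sticky, -w.inflight, -w.assigned)


def route(workers, alpha=0.5):
    """Pick the worker with the largest lexicographic key."""
    return max(workers, key=lambda w: route_key(w, alpha))


workers = [
    DWorker("d0", overlap=0.6, sticky=0, inflight=1, assigned=4),
    DWorker("d1", overlap=0.2, sticky=1, inflight=3, assigned=5),
    DWorker("d2", overlap=0.6, sticky=0, inflight=1, assigned=2),
]
chosen = route(workers, alpha=0.5)
print(chosen.name)
```

With `alpha=0.5` the sticky bonus lifts d1 above the two higher-overlap workers; with `alpha=0.0` d0 and d2 tie on the first three fields and the lower `assigned` count decides.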
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old filter `if row.latency_s is not None` accepted SGLang's fast
input-length-aborts (latency_s ~ 0.08s, finish_reason='abort/BadRequest')
as if they were successful zero-cost requests. This deflated mean/p50
of any run where the model rejected oversized inputs.
Impact on existing comparisons (ts=1 4-run validation + v2):
KVC v2 has 40 aborts + 5 ReadTimeouts (was reported as just 5);
DP 4w has 67 aborts (was reported as 5).
Both runs have abort behavior; the asymmetry (40 vs 67) is purely from
SGLang's mem-fraction-derived max-input-len: KVC decode-only worker gets
~10 GB free GPU mem -> max-input=92098, DP fused worker gets ~9 GB ->
max-input=87811, because DP also needs chunked-prefill workspace.
The KVC-vs-DP latency-win direction holds and widens slightly under the
fixed filter (lat mean delta: -0.8% -> -1.4%); see V2_DEEP_ANALYSIS_ZH
§4.3 for the recomputed table.
Changes:
- metrics.py: new _is_failed_request(row) helper; latency/ttft/tpot
stats now exclude both errors and aborts. New summary fields
abort_count and failure_count expose the counts directly.
- scripts/analysis/recompute_summary.py: re-derives summary.json from
existing metrics.jsonl using the fixed code, with optional --diff
against the old buggy summary for inspection.
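A minimal sketch of the filter change, for readers without the repo at hand. The `Row` fields and the abort-prefix check are assumptions modeled on the description above; the real helper in metrics.py may differ.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Row:
    latency_s: Optional[float]
    finish_reason: Optional[str] = None
    error: Optional[str] = None


def is_abort(row: Row) -> bool:
    # Assumption: aborts are identified by a finish_reason starting with "abort".
    return (row.finish_reason or "").startswith("abort")


def is_failed_request(row: Row) -> bool:
    """True for errors, missing latencies, and aborts."""
    return row.error is not None or row.latency_s is None or is_abort(row)


def summarize(rows):
    ok = [r.latency_s for r in rows if not is_failed_request(r)]
    return {
        "lat_mean": mean(ok) if ok else None,
        "abort_count": sum(1 for r in rows if is_abort(r)),
        "failure_count": sum(1 for r in rows if is_failed_request(r)),
    }


rows = [
    Row(1.0, "stop"),
    Row(3.0, "length"),
    Row(0.08, "abort/BadRequest"),     # fast input-length abort
    Row(None, error="ReadTimeout"),
]
# Old filter: anything with a latency counts, so the 0.08s abort drags the mean down.
old_mean = mean(r.latency_s for r in rows if r.latency_s is not None)
summary = summarize(rows)
print(old_mean, summary)
```

On this toy input the old filter reports a mean of 1.36s while the fixed filter reports 2.0s, which is the deflation effect described above.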
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-v2 audit consolidating ts=1 validation + v1 thrashing + v2 win, plus
critic-agent adversarial review of the v2 vs 4DP comparison.
Headline outcomes:
- TEAM_REPORT §1 (session pin starvation) fully fixed by v2 migration +
reset-on-success; direct-to-D 42.8% -> 91.6%.
- TEAM_REPORT §2/§3/§5 (LRU, backpressure, admission RPC) are absorbed by
ts=1 natural drain time, not mechanism-fixed -- will resurface under
ts=10/longer traces/higher concurrency.
- TEAM_REPORT §6 (ts=10 distortion) confirmed and locked as precondition;
TEAM_REPORT §8 (N=1 unreliable) rewritten to "high-pressure N>=3, normal N=1".
Three new problems exposed by adversarial review:
- TTFT p99: KVC 1.285s vs DP 0.427s (KVC 3.0x worse) -- omitted from the
V2_RESULTS_ZH.md headline table. Root cause: the 8.3% non-direct path pays a
3-7s mooncake reseed cost on 50-90K-token KV transfers.
- Error accounting asymmetry: DP has 67 fast-aborts (not 5) at ~0.08s each
counted in latency stats; KVC's 5 ReadTimeouts excluded entirely. Root
cause: --max-input-len 87811 (DP) vs 92098 (KVC) + metrics.py:124 filter.
- Topology mismatch: KVC 1P3D's prefill GPU is idle 91.7% of the time
(only ~373/4449 requests use seed/P path); 4DP CA has all 4 GPUs at full
utilization. Plus: no naive 1P3D control exists in the repo -- cannot
isolate KVC-layer contribution from 1P3D-topology contribution.
Re-scored headline: 5 KVC wins / 1 DP win / 3 draws -- still net positive
but not the "7/8 wins" framing that V2_RESULTS_ZH.md claims.
Recommended follow-ups (ROI order):
1. naive 1P3D ts=1 N=1 control (critic's only CRITICAL finding)
2. v2 N=2/N=3 to verify ts=1 determinism with new code paths
3. symmetric error accounting recompute + DP max-input-len = 92098 rerun
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new docs covering the structural-fit investigation:
- AGENTIC_FIT_ANALYSIS_ZH.md: §1-§7 cover the structural design issues behind
the KVC vs vanilla DP gap on real agentic workloads (SWE 50sess).
Quantifies session pinning, LRU shortfall, P-side imbalance,
time-scale distortion, etc., with code citations and N=3 rerun data.
- REFACTOR_PLAN_ZH.md: KISS-edition refactor plan. After verifying the
original "estimate inflation" and "resident_blocks aging" claims were
not real bugs, scope shrinks to one code change (backpressure) plus a
4-run smoke sweep within an 8h budget.
- STRUCTURAL_VALIDATION_REPORT_ZH.md: validates §1-§7 claims using
existing v5 baseline rerun data + 8DP CA baseline. Each claim labeled
fully-supported / indirect / retracted with the data source. Notes
that backpressure E2E validation is pending GPU smoke run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replay-side changes paired with the SGLang admission hint:
- DecodeResidencyState gains pause_until_s; admission probe parses
recommended_pause_ms and updates the per-D pause window.
- _wait_for_decode_pause is invoked at request entry points
(_invoke_router, _invoke_session_direct) so requests stall before
hitting a saturated D instead of timing out via mooncake.
- New CLI flags: --enable-backpressure (default off, baseline preserved),
--backpressure-max-pause-s (cap on per-request sleep, default 2s).
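The pause-window bookkeeping can be sketched as follows. This is an illustrative model under stated assumptions: only `pause_until_s` and `recommended_pause_ms` come from the change itself; the helper names and the explicit `now_s` argument (used here instead of a real clock so the logic is easy to check) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DecodeResidencyState:
    pause_until_s: float = 0.0  # requests to this D should wait until this time


def apply_admission_hint(state, recommended_pause_ms, now_s):
    """Extend the per-D pause window; never shorten an existing one."""
    if recommended_pause_ms > 0:
        candidate = now_s + recommended_pause_ms / 1000.0
        state.pause_until_s = max(state.pause_until_s, candidate)


def pause_needed_s(state, now_s, max_pause_s=2.0):
    """Seconds a request should sleep before hitting this D, capped per request."""
    remaining = max(0.0, state.pause_until_s - now_s)
    return min(remaining, max_pause_s)


state = DecodeResidencyState()
apply_admission_hint(state, recommended_pause_ms=500, now_s=10.0)
print(pause_needed_s(state, now_s=10.0))   # inside the window
print(pause_needed_s(state, now_s=10.6))   # window already elapsed
```

A caller at a request entry point would then sleep for the returned duration (for example via `asyncio.sleep`) before dispatching; the cap keeps a single oversized hint from stalling a request indefinitely.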
Structural instrumentation written under <run_dir>/structural/:
- admission-events.jsonl: every admission probe (RTT, queue_depth,
pause_ms, available_tokens, evicted_count)
- backpressure-events.jsonl: every actual pause sleep
- session-d-binding.jsonl: per-request policy decision
Used to validate the structural claims documented separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `recommended_pause_ms` field to DirectAppendAdmissionReqOutput so D
can advise callers when its transfer queue is heavy or KV pool is near
capacity. The hint is computed from transfer_queue_depth,
retracted_queue_depth, and post-trim token_usage; thresholds are simple
heuristics (>0.90 usage, >=8 queue depth, retracted>0).
Default behavior is unchanged for callers that ignore the field.
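A rough sketch of how such a hint could be derived. The three thresholds are the ones stated above; the pause magnitudes (`base_pause_ms` per triggered condition) are purely hypothetical, since the actual values are not given here.

```python
def recommended_pause_ms(token_usage, transfer_queue_depth, retracted_queue_depth,
                         base_pause_ms=100):
    """Return a pause hint in ms; 0 means the caller need not wait."""
    pressure = 0
    if token_usage > 0.90:            # KV pool near capacity (post-trim)
        pressure += 1
    if transfer_queue_depth >= 8:     # transfer queue is heavy
        pressure += 1
    if retracted_queue_depth > 0:     # retracted requests are waiting
        pressure += 1
    # Hypothetical scaling: each triggered condition adds one base pause.
    return pressure * base_pause_ms


print(recommended_pause_ms(0.50, 0, 0))
print(recommended_pause_ms(0.95, 8, 1))
```

Returning 0 when no threshold fires is what keeps the default behavior unchanged for callers that ignore the field.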
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hostile audit of the original report flagged three load-bearing errors:
1. held_tokens semantics were inverted. session_held_tokens() at
session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len)
per slot, i.e. slot-private (NOT in radix tree). So "other = cap - held -
avail" actually CONTAINS the radix-tree protected prefix cache (likely the
single biggest component for shared agentic prefixes), not just running
batch + in-flight as the original report claimed.
2. Admission-race causal hypothesis for the 415 EXP2+profile errors is
contradicted by the data: 414/415 errors have kv_transfer_blocks > 0 — they
passed admission and died downstream ("generate stream ended before
producing any token", raised by the client when a 200 response had an empty
stream).
3. The polling confound was dismissed too quickly. Mode counts shift ~1:1
(session-cap-fb -356 / kvcache-centric +406), and /server_info is not a
passive read — it dispatches into the scheduler main loop and iterates
every session slot.
Plus: per-D error% confounded by sticky session affinity (only 18 unique
sessions cause 415 errors, decode-3 had 0 errors only because no high-error
session landed there); decile 10 "recovery" was an equal-time binning
artifact (24.5% under equal-count); v5 vs v5+profile time gap was 21h not
6h; p50/p90 latency comparison is N=1.
Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction
with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4).
Action items split into P0 (verify, must do first) and P1 (instrument):
P0 — scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2
(no polling, identical config to the original v5 run) to test whether the
9-error baseline result is reproducible. If 3 runs give ~9 errors and
profile gives 415, polling is the leading suspect. Currently running
in background.
P1 — scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only
"pool_breakdown" dict to /server_info covering: radix_evictable_tokens,
radix_protected_tokens, slot_private_held_tokens, session_slot_count,
running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens},
prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these,
"unaccounted = cap - sum(known)" exposes true leakage. replay.py captures
all fields into the per-tick row; analyzer prints the decomposition and
gracefully handles old timeseries (prints "P1 instrument absent").
Mock-tested end-to-end. SGLang patch is read-only and does not affect
admission/scheduling. Old v5+profile data still analyzes correctly.
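The "unaccounted = cap - sum(known)" decomposition might look like the sketch below. The bucket names follow the pool_breakdown fields listed above; the `capacity_tokens` / `available_tokens` row keys and the assumption that the buckets are disjoint are hypothetical, so treat the result as illustrative.

```python
# Token buckets named in the pool_breakdown dict. Assumption: the buckets are
# disjoint, so they can be summed without double counting.
KNOWN_TOKEN_FIELDS = (
    "radix_evictable_tokens",
    "radix_protected_tokens",
    "slot_private_held_tokens",
    "running_batch_kv_tokens",
    "transfer_queue_tokens",
    "prealloc_queue_tokens",
    "retracted_queue_tokens",
)


def unaccounted_tokens(row):
    """Return cap - sum(known), or None for rows that predate the P1 patch."""
    breakdown = row.get("pool_breakdown")
    if breakdown is None:
        return None
    known = row["available_tokens"] + sum(breakdown.get(f, 0) for f in KNOWN_TOKEN_FIELDS)
    return row["capacity_tokens"] - known


def describe(row):
    gap = unaccounted_tokens(row)
    return "P1 instrument absent" if gap is None else f"unaccounted={gap}"


new_row = {
    "capacity_tokens": 100_000,
    "available_tokens": 20_000,
    "pool_breakdown": {
        "radix_evictable_tokens": 10_000,
        "radix_protected_tokens": 40_000,
        "slot_private_held_tokens": 15_000,
        "running_batch_kv_tokens": 5_000,
        "transfer_queue_tokens": 4_000,
        "prealloc_queue_tokens": 3_000,
        "retracted_queue_tokens": 0,
    },
}
old_row = {"capacity_tokens": 100_000, "available_tokens": 20_000}
print(describe(new_row))
print(describe(old_row))
```

A persistently positive gap would point at leakage or an untracked bucket; a negative gap would suggest the buckets overlap and the disjointness assumption is wrong.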
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v5 dropped errors but pushed session-cap fallback to 46-51%. Before adding
v6 mitigations we need to attribute that capacity loss to one of:
(a) active sessions — real footprint
(b) idle-evictable sessions — LRU not aggressive enough
(c) prefill backup blocks / in-flight / fragmentation — release timing
Without this it's all guessing. Plumb a 1Hz poller into replay that hits
each P/D worker's /server_info, captures session_cache + memory_usage, and
writes a per-worker time-series JSONL to <run_dir>/d-pool-timeseries.jsonl.
Off by default (--pool-poll-interval-s 0); v5+profile sweep enables it at
1.0s. Per-tick HTTP cost is ~8 parallel /server_info calls — negligible
relative to the 50min run.
Analyzer (scripts/analysis/analyze_pool_timeseries.py) decomposes each D's
capacity into active_held / idle_evictable / other (= cap-held-avail, the
backup-blocks bucket) / free, and reports session residency churn across
workers as a starvation/thrashing signal.
Mock-tested poller end-to-end (cancellation clean, file flushed, sessions
captured); analyzer validated against synthetic timeseries.
Next: run scripts/sweep_tp1_v5_optD_profile.sh on hardware (~90min), then
analyze results to pick a v6 direction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v5 sweep (sweep_tp1_v5_optD.sh) lands the previously-deferred Option D:
worker admission_mode authoritative for direct_append + seed + reseed,
bypassing replay's local _decode_session_soft_cap.
Key findings now documented:
- errors collapse from 9-10% to 0.2% (mooncake timeouts gone)
- session-cap fallback rises 33-35% -> 46-51% — D's true KV pool is the
binding constraint, not replay's estimator; v4's "low fallback" was
hiding capacity overruns as transfer-timeout errors
- direct-to-D subset latency unchanged from v4 (admission overhead negligible)
- new bottleneck: D's physical KV pool — points v6 at prefill backup release
timing, priority eviction tuning, chunked seed, cross-D session migration,
and real RDMA
Also adds a 5th lesson on errors-vs-fallback reciprocity and updates the
code index with the v5 endpoint extension and new CLI knobs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v4 (cap=16) saw 35% session-cap fallback because the local soft_cap
min(16, usable / target) evaluates to 1-2 for large agentic inputs.
The cap was hit not because D was full but because replay's heuristic
underestimated capacity.
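A worked example of why the cap collapses. The token counts are hypothetical, chosen only to be in the range of large agentic inputs; integer division is also an assumption about how the heuristic rounds.

```python
def local_soft_cap(usable_tokens, target_tokens, hard_cap=16):
    """Replay-side heuristic: min(hard_cap, usable / target)."""
    return min(hard_cap, usable_tokens // target_tokens)


# Hypothetical pool with 90K usable tokens.
small_inputs = local_soft_cap(usable_tokens=90_000, target_tokens=2_000)
large_inputs = local_soft_cap(usable_tokens=90_000, target_tokens=60_000)
print(small_inputs, large_inputs)
```

For short inputs the hard cap of 16 binds; for a 60K-token target the same pool admits a single session, so the configured cap never comes into play.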
This change makes worker admission_mode authoritative for ALL paths:
SGLang side:
- io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field
("direct_append" | "seed", default "direct_append" preserves prior
behavior).
- scheduler.py:admit_direct_append: when mode == "seed", skip the
resident-on-D requirement and run the same capacity check + LRU
eviction (maybe_trim_decode_session_cache) that direct_append uses.
This lets D atomically decide if a new session can be admitted based
on actual token_to_kv_pool_allocator state.
Replay side (replay.py):
- _query_decode_direct_admission gains a `mode` parameter.
- _reserve_decode_session_capacity: in worker admission_mode, the
seed/reseed branch now queries D with mode="seed" and trusts the
result, instead of estimating capacity from the residency snapshot.
- _should_admit_new_decode_session: in worker mode, skip the local
soft_cap pre-check and let D decide. Same-D session fast-path is
preserved.
Effects:
- Local hardcoded cap of 16 is bypassed under worker mode; D's real
KV pool size is the only constraint.
- LRU eviction runs in D's process atomically with admission, so
starvation (the v3 bimodal "lucky vs starved sessions" pattern)
should resolve.
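The admission semantics can be modeled with a toy pool. This is not the SGLang code: `ToyDecodePool`, its fields, and the reason strings are illustrative assumptions that only mirror the described behavior (seed mode skips the residency requirement; both modes share one capacity check with LRU eviction).

```python
from collections import OrderedDict


class ToyDecodePool:
    """Toy model of a D worker's KV pool with LRU session eviction."""

    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.sessions = OrderedDict()  # session_id -> tokens, least recent first

    def free_tokens(self):
        return self.capacity - sum(self.sessions.values())

    def admit(self, session_id, need_tokens, mode="direct_append"):
        if mode not in ("direct_append", "seed"):
            raise ValueError(f"unknown mode: {mode}")
        # Only direct_append requires the session to be resident already.
        if mode == "direct_append" and session_id not in self.sessions:
            return False, "session-not-resident"
        # Check feasibility first so a doomed request evicts nothing.
        evictable = sum(t for s, t in self.sessions.items() if s != session_id)
        if self.free_tokens() + evictable < need_tokens:
            return False, "no-d-capacity"
        # Evict least-recently-used other sessions until the request fits.
        while self.free_tokens() < need_tokens:
            victim = next(s for s in self.sessions if s != session_id)
            del self.sessions[victim]
        self.sessions[session_id] = self.sessions.get(session_id, 0) + need_tokens
        self.sessions.move_to_end(session_id)  # mark as most recently used
        return True, "ok"


pool = ToyDecodePool(capacity_tokens=100)
print(pool.admit("a", 60, mode="seed"))
print(pool.admit("b", 30, mode="direct_append"))
print(pool.admit("b", 60, mode="seed"))
```

Because the capacity check and the eviction happen in one call against the pool's own state, the caller never has to guess capacity from a stale snapshot.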
scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D
configs as v4 with the new admission path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add v4 sweep results and post-mortem analysis showing:
- direct-to-D path: 54.3% (1P7D) / 58.0% (2P6D) of requests now use
KVC cleanly. P50=0.5s and TTFT P50=0.043s; this path beats baseline
8DP across the board (P50 -24%, TTFT P50 -54%, TTFT P90 -79%).
- Overall vs baseline (errors+truncated excluded):
v4 2P6D P50=0.85s vs baseline 0.66s (28% slower).
Reason is not errors -- 35% of requests still hit
fallback-large-append-session-cap, where capacity-based
cap = usable_tokens / target_tokens evaluates to 1-2 (not 16)
for large agentic inputs.
- 9-10% errors on KVC variants are mooncake TCP transfer timeouts,
not SGLang logic bugs. Prefill log shows
"Failed to send kv chunk ... 32s timeout ... session not alive".
Errors concentrate in turn>=31 (large inputs), after the run passes the 44.8% mark.
Track:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: append v4 results table,
per-mode breakdown, and error root cause.
- scripts/analysis/{analyze_v3,analyze_v4,analyze_errors,compare_no_error}.py
- outputs/qwen3-30b-tp1-v{3,4}*/exp*_summary.json (force-added,
small JSON; metrics.jsonl excluded due to size).
- outputs/qwen3-30b-tp1-v{3,4}*/sweep_results.txt
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>