34 Commits

Author SHA1 Message Date
110bd68000 docs(failures): consolidated 5-mode failure taxonomy
Consolidates failure modes scattered across V2_DEEP_ANALYSIS,
E1_E2_RESULTS, E3_FINDINGS, KVC_EVICTION_GRANULARITY,
REAL_ALI_KVC_EXPERIMENT into a single lookup table with
five fields per mode: symptom → root cause → trigger →
current mitigation → real fix.

Five modes covered:
  A. Mooncake "instance not alive" cascade
     — E2 80%-failure pathology; admission no-space →
       seed burst → heartbeat drop → batch abort
  B. Cold-D / overlap-pinning
     — shared boilerplate hash pins all sessions to a
       subset of D's; load_floor_bonus is a patch, the
       real fix is exclusive_overlap redefinition
  C. Evict storm (session-level eviction)
     — release_session frees 38–88K tokens in one shot;
       fix is BLOCK_LEVEL_EVICTION_DESIGN
  C'. Reseed storm (turn-1 concurrent seeds)
     — startup-phase mooncake burst; fix is per-D
       pending-seed budget, frequency drops after C
  D. Streaming-session correction invariant crash (E3)
     — schedule_batch.py:1646 landmine, hotfixed by
       986f351, root-fix is removing the correction
       path entirely (BLOCK_LEVEL_EVICTION §3.7)

Each mode has a forensic link back to the original
experiment doc that surfaced it.

§6 adds a diagnostic cheat sheet: "if you see X, look at Y."
§7 wires every mode to a roadmap item — Milestone 1 should
graduate §1–§4 to "mitigated" and eliminate §5.

INDEX_ZH gets a new §1.6 section linking this and the
SGLang patch inventory.

No code change. Reading dependency for anyone debugging
a sweep or writing paper §Limitations.
2026-05-13 00:43:58 +08:00
d93228e156 docs(sglang): patch surface inventory + retire-after-refactor list
Resolves AUDIT_AND_ROADMAP §S6: the 785 lines of vendored
SGLang patch are a known reviewer trust risk because the
prototype touches scheduler.py / schedule_batch.py /
session_aware_cache.py / disaggregation hot paths. Without
classification, readers cannot tell core mechanism from
temporary scaffold.

Classifies each of the 10 patched files into:
  MUST-HAVE         — Algorithm 1/2/3, streaming session
                       lifecycle, admit RPC. ~450 lines.
                       Long-term retention.
  WORKAROUND        — release_session token-free,
                       maybe_trim_decode_session_cache,
                       streaming-session extend_input_len
                       correction (incl. the E3 landmine
                       hotfix from commit 986f351),
                       DecodePreallocQueue trim trigger.
                       ~150 lines. To DELETE entirely
                       after block-level eviction refactor
                       (BLOCK_LEVEL_EVICTION_DESIGN §3.7).
  EXPERIMENTAL      — backpressure pause hint
                       (_compute_backpressure_pause_hint).
                       ~60 lines. Signal not closed-loop
                       per REAL_ALI §4.3; retain as hook
                       or retire in 1 month.
  INSTRUMENTATION   — _compute_pool_breakdown_for_diagnostics.
                       ~50 lines. Keep behind a flag.
  MINOR             — ~3 lines. Ignore.

The §2 summary gives reviewers a one-glance picture of
what's core vs. scaffold. Maintenance convention in §3
mandates classifying every new (sglang) patch at commit
time.

§4 wires the classification into the roadmap: clearing
the WORKAROUND bucket is the objective completion marker
for block-level eviction refactor.

No code change.
2026-05-13 00:42:22 +08:00
9a81c993ab docs(onboarding): link new audit / design / eval docs from
the root README + AGENTS.md

Without this, the five docs added on this branch
(AUDIT_AND_ROADMAP, INDEX, BLOCK_LEVEL_EVICTION_DESIGN,
D_TO_P_SYNC_CONTRACT, EVALUATION_PROTOCOL) are reachable
only by listing docs/. This wires them into the two entry
points an agent or collaborator hits first.

README.md changes:
  - top-of-page pointer to INDEX_ZH for new collaborators
  - pointer to AUDIT_AND_ROADMAP_ZH for project state
  - "单元测试 (无 GPU)" section: how to run pytest
  - "评测脚本" section: invocations for the two new
    analysis scripts

AGENTS.md changes:
  - top section "For new collaborators / agents" before
    the existing "Environment" block, pointing at INDEX_ZH,
    AUDIT_AND_ROADMAP_ZH, the two ready-to-pick-up design
    docs, and EVALUATION_PROTOCOL_ZH
  - pytest invocation under Environment
2026-05-12 23:58:56 +08:00
dbb9eee471 feat(analysis): paired comparison with bootstrap CI
Implements docs/EVALUATION_PROTOCOL_ZH.md §2.2 (M2 fix):
mechanism A vs B comparisons on the same trace must be
paired on same-trial-mask, with errors and aborts surfaced
rather than silently dropped.

How it differs from scripts/analysis/compare_no_error.py:
  - works on raw request-metrics.jsonl (not pre-aggregated
    summary.json) so it can recompute paired masks
  - reports 95% bootstrap CIs for mean / p50 / p90
  - exposes intersection size + per-side failure count in
    the intersection so the reader can see how many rows
    were dropped from the comparison and whether the
    candidate's win came from selection effects

stdlib only — random.Random for bootstrap, no scipy/numpy.
Default 2000 bootstrap iterations; seed is configurable
for reproducibility.

Verified locally on a synthetic 20-row pair (5s constant
delta + one candidate failure): correctly reports
paired_size=19, candidate_fail_in_common=1, mean delta
-5.000s, 19/0/0 win/loss/tie.
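
A minimal sketch of the pairing and bootstrap steps, assuming each run has
already been reduced to a dict of request id -> metric value (None for a
failed request). The function names and input shape are illustrative and
are not taken from paired_compare.py.

```python
import random
from statistics import mean


def pair_on_common_ok(baseline, candidate):
    """Pair two runs on the requests that succeeded on both sides.

    Inputs map request id -> metric value, or None for a failed request
    (an assumed shape, chosen for illustration).
    """
    common = sorted(baseline.keys() & candidate.keys())
    base_fail = sum(1 for k in common if baseline[k] is None)
    cand_fail = sum(1 for k in common if candidate[k] is None)
    deltas = [
        candidate[k] - baseline[k]
        for k in common
        if baseline[k] is not None and candidate[k] is not None
    ]
    return deltas, base_fail, cand_fail


def bootstrap_mean_ci(deltas, iterations=2000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the mean paired delta (stdlib only)."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(iterations))
    lo = means[int(alpha / 2 * iterations)]
    hi = means[int((1 - alpha / 2) * iterations) - 1]
    return lo, hi


# Synthetic 20-row pair: constant 5 s improvement, one candidate failure.
baseline = {i: 10.0 + i for i in range(20)}
candidate = {i: 5.0 + i for i in range(20)}
candidate[7] = None

deltas, base_fail, cand_fail = pair_on_common_ok(baseline, candidate)
ci_low, ci_high = bootstrap_mean_ci(deltas)
```

Reporting base_fail and cand_fail next to the delta is what lets a reader
judge whether a win could come from rows dropped out of the intersection.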

CLI:
  scripts/analysis/paired_compare.py \
      --baseline outputs/run-dp/request-metrics.jsonl \
      --candidate outputs/run-kvc/request-metrics.jsonl \
      [--metric latency_s|ttft_s|tpot_s] \
      [--bootstrap 5000] [--seed 42] [--json]
2026-05-12 23:57:57 +08:00
4021f27ee2 feat(analysis): stratified latency / TTFT reporter
Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix):
headline numbers must be accompanied by stratified
breakdowns so reviewers can see which slice the gains
come from.

The script reads one or more request-metrics.jsonl files
and buckets rows along four orthogonal dimensions:
  - turn_id        : {1, 2-5, 6-20, 21+}
  - input_length   : {<=8K, 8K-64K, >64K}
  - overlap_ratio  : {<=0.3, 0.3-0.7, >0.7}
  - append_tokens  : {<=128, 128-1K, 1K-8K, >8K}

Per bucket: n, n_ok, err_pct, latency/ttft mean+p50+p90+p99.
Output is markdown by default, --json for machine read.
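
A minimal sketch of the bucketing and the per-bucket error rate, assuming
rows are dicts with "turn_id" and "ok" keys and that "8K" means 8192
tokens. Both are assumptions for illustration; the real field names and
boundaries live in the script.

```python
from bisect import bisect_left
from collections import defaultdict

# Inclusive upper bounds and labels for two of the four dimensions.
# Treating "8K" as 8192 tokens is an assumption.
TURN_BOUNDS, TURN_LABELS = [1, 5, 20], ["1", "2-5", "6-20", "21+"]
INPUT_BOUNDS, INPUT_LABELS = [8 * 1024, 64 * 1024], ["<=8K", "8K-64K", ">64K"]


def bucket(value, bounds, labels):
    """Return the label of the first bucket whose upper bound is >= value."""
    return labels[bisect_left(bounds, value)]


def error_pct_by_turn(rows):
    """Aggregate n, n_ok and err_pct per turn_id bucket.

    Each row is a dict with "turn_id" and "ok" keys (assumed field names).
    """
    counts = defaultdict(lambda: {"n": 0, "n_ok": 0})
    for row in rows:
        slot = counts[bucket(row["turn_id"], TURN_BOUNDS, TURN_LABELS)]
        slot["n"] += 1
        slot["n_ok"] += 1 if row["ok"] else 0
    return {
        label: {**c, "err_pct": round(100.0 * (c["n"] - c["n_ok"]) / c["n"], 1)}
        for label, c in counts.items()
    }


rows = [
    {"turn_id": 1, "ok": True},
    {"turn_id": 1, "ok": True},
    {"turn_id": 1, "ok": False},
    {"turn_id": 3, "ok": True},
    {"turn_id": 25, "ok": True},
]
report = error_pct_by_turn(rows)
```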

stdlib only — no pandas/numpy. Verified on a synthetic
5-row jsonl (the turn=1 bucket, which contains one error,
correctly reports an err_pct of 33.3%).

Why this script and not pandas:
  - the existing scripts/analysis/* are stdlib-only, so
    this stays consistent with them
  - reviewers can run it on the artifact without
    pip-installing anything beyond pytest
  - speed is not a concern: it runs in <1s on the largest
    existing sweep (4449 rows)

Usage shown in EVALUATION_PROTOCOL_ZH §3.
2026-05-12 23:57:13 +08:00
c5f552e122 test(policy): Theorem 1 no-starvation property tests
Adds the algorithm-layer guarantee tests for
docs/KVC_ROUTER_ALGORITHM.md §4.1. The full Dispatch loop
lives in replay.py (HTTP + mooncake), but the policy-layer
guarantee is testable in isolation: under any reject
sequence, select() must keep returning a valid worker.

Cases:
  - select returns a valid decision even after every (s,d)
    is past τ_reject (degenerate fallback)
  - |D|·τ_reject rejects suffice to explore every D
    (cannot trap a session on one D under universal
    rejection)
  - degenerate fallback picks the least-rejected D
    (Algorithm 1 line 4)
  - per-(session, D) isolation: session A's blacklist
    does not affect session B
  - migration_reject_threshold=0 disables blacklist
  - select() does NOT silently bump the reject counter
    (the only mutator is record_admission_reject)

Adds tests/_fixtures.py with minimal make_topology() and
make_request() helpers that skip build_single_node_topology's
GPU-budget validation (irrelevant in unit tests).

Verified locally: 20/20 passing under pytest 9.0.3. The
six new tests cover only Algorithm 1's policy-layer
half of Theorem 1; the reset-on-success half lives in
Algorithm 3 (replay.py) and is a future test target.
2026-05-12 23:55:57 +08:00
a785b83023 test(policy): unit tests for Algorithm 1 lex scoring
Adds the project's first test suite. Covers the
score_candidate() pure function from the previous refactor
commit, validating the qualitative properties that
KVC_ROUTER_ALGORITHM.md §3.1 and §4.2 rely on.

Tests / properties:
  - determinism: same args -> same tuple
  - shape: 4-int tuple
  - primary term: overlap dominates pure sticky
  - primary term: sticky_bonus credited
  - tie-2 inflight: lower wins
  - tie-3 assigned: lower wins
  - strict lex order: sticky wins position-1 over fresh-idle
  - load_floor disabled by default
  - load_floor gated off when sticky=True
  - load_floor zero during warmup (mean=0)
  - load_floor proportional to deficit (200/100/0 at 0/50/100% load)
  - load_floor does not underflow when overloaded
  - real per-session overlap beats load_floor on warm D
  - boilerplate overlap loses to load_floor on cold D
    (the cold-D fix from E1_E2_FIX_DESIGN §Q2)

Test infrastructure:
  - tests/ package with README explaining the GPU-free
    scope and the run instruction
  - pyproject.toml [dependency-groups] test = [pytest>=8]
    (install via `uv sync --group test`)
  - pyproject.toml [tool.pytest.ini_options] sets testpaths

Verified locally: 14/14 passing under pytest 9.0.3 in an
isolated 3.13 venv. No SGLang / GPU touched.
2026-05-12 23:54:48 +08:00
76a79dfdda refactor(policy): extract pure score_candidate() from KvAwarePolicy
Pulls the per-D score computation out of KvAwarePolicy.select
into a top-level pure function that takes primitives. The
in-method behavior is unchanged — the loop now calls
score_candidate() instead of inlining the arithmetic.

Motivation:
  Algorithm 1 (KVC_ROUTER_ALGORITHM.md §3.1) is the routing
  core. Until now its only API was select(), which requires
  building TraceRequest + SingleNodeTopology + RoutingState
  to test even a single lex-score property. After this
  extraction, unit tests can drive the four-tuple score
  directly with integers.

What changed:
  - Added module-level CandidateScore type alias.
  - Added score_candidate(*, overlap, sticky, inflight,
    assigned, mean_assigned, sticky_bonus,
    load_floor_bonus) -> CandidateScore.
  - KvAwarePolicy.select() loop body collapsed to a
    score_candidate() call; sticky now bool (was int)
    inside the call site.
  - Moved the load-floor docstring from KvAwarePolicy
    onto score_candidate where the formula lives.

Verified pure:
  - same kwargs -> same tuple
  - overlap=5 beats sticky-only (no load_floor): (5,0,0,0) > (1,1,0,0)
  - load_floor gated off when sticky=True

No behavior change; follow-up commit adds the unit tests
this refactor enables.
2026-05-12 23:53:17 +08:00
591cd6d382 docs(eval): paper-quality evaluation protocol (M1–M6)
Codifies the methodology fixes for every weakness called
out in AUDIT_AND_ROADMAP_ZH §3.1. Existing sweep reports
(KVCACHE_CENTRIC_PROGRESS_ZH, V2_RESULTS_ZH) violate at
least one of these; future runs must use this protocol.

Contents:
- §1.1 M1 — N≥3 + bootstrap CI; no N=1 in headline
- §1.2 M2 — paired-on-same-trial-mask; same trace /
       timeout / max_input_len / time_scale; errors
       and aborts each get their own column
- §1.3 M3 — required stratification dimensions
       (turn_id / append_len / overlap_ratio /
       inter_turn_gap / input_len)
- §1.4 M4 — minimum 2 baselines from a 6-item list,
       including at least one non-SGLang baseline
- §1.5 M5 — trace mix: Ali full + SWE-Bench +
       ShareGPT + synthetic adversarial
- §1.6 M6 — hardware tiers; single-node 4xH200 +
       dual-node NVLink/IB as minimum
- §2 report templates (main table, paired delta,
      stratified, negative-result section)
- §3 tool support: marks the two scripts that the
      follow-up commits on this branch add
- §4 SOSP/OSDI artifact requirements
- §5 pre-submission self-checklist
- §6 phased delivery plan for catching up to protocol

No code change; reading dependency for the analyzer
scripts that follow.
2026-05-12 23:51:46 +08:00
fd37eda367 docs(design): D->P sync interface contract + 4-phase rollout
Companion to BLOCK_LEVEL_EVICTION_DESIGN_ZH. Specifies the
three-layer contract (mooncake / SGLang / agentic-pd-hybrid)
that the empty feat/d-to-p-sync branch is meant to fill.

Contents:
- §1 staleness budget β as a first-class system parameter,
      with recommended default (page_size .. 4096 tokens)
- §2.1 mooncake double-role API: KVRole enum extension,
      DecodeKVSender / PrefillKVReceiver class shapes,
      independent bootstrap channel
- §2.2 SGLang RadixCache.insert_external signature with
      five concrete design decisions (re-mapping policy,
      failure handling, lock_ref discipline, evict
      interaction, multi-P backup view)
- §2.3 agentic-pd-hybrid CLI flags, DirectSessionState
      additions, hook points in _invoke_session_direct
      and _invoke_kvcache_seeded_router
- §3 candidate Theorem 4 (reseed_cost upper bound under
      staleness budget β)
- §4 P1..P4 rollout with validation criteria per phase
- §5 five enumerated risks + mitigation
- §6 explicit decoupling: block-level eviction first,
      then D->P sync; do NOT bundle in one PR

Makes the feat/d-to-p-sync branch actionable for the next
collaborator; no GPU is needed until the P2 microbench phase.
2026-05-12 23:50:39 +08:00
683c44bd71 docs(design): block-level eviction refactor — concrete API plan
Turns the architectural manifesto
(KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) into a
function-by-function design the next collaborator can
implement against.

Contents:
- §1 current SessionAwareCache state with exact field
      semantics (req_pool_idx / kv_committed_len /
      kv_allocated_len / cache_protected_len)
- §3.1–§3.6 post-refactor source sketches for
      SessionSlot, cache_finished_req,
      cache_unfinished_req, match_prefix,
      release_session, get_session_status
- §3.7 the schedule_batch.py:1572-1646 correction
      block we can remove (the E3 landmine)
- §4 five invariants the PR must defend
- §5 GPU-free unit + property test plan with a
      MockRadixCache shape
- §6 ~1 week engineering estimate and three risks
- §7 dependency relationship to the planned
      D->P sync work
- §8 minimal step list for the implementing agent

No code change yet. Future commits on a
feat/block-level-evict branch will execute against
this spec.
2026-05-12 23:49:18 +08:00
baa843a3f9 docs(index): collaborator-facing doc index
Single navigation entry point. Existing docs were scattered
across five branches with no clear reading order — this is
the fix. Includes:

- 3-doc fast path for anyone joining
- topic-grouped table (algorithm / experiments / design
  discussions / evaluation / environment / archive)
- role-based reading paths (new SWE, paper reviewer,
  reproducing student, control-plane reader)

Index also references the four docs added later on this
branch (AUDIT_AND_ROADMAP, BLOCK_LEVEL_EVICTION_DESIGN,
D_TO_P_SYNC_CONTRACT, EVALUATION_PROTOCOL) so reviewers
can see the planned layout up front.
2026-05-12 23:47:28 +08:00
6cdea52f28 docs(audit): cross-branch audit + 3-milestone roadmap
Consolidates the state of the five working branches
(main / kvc-debug-journey-v1-to-v4 / feat/d-to-p-sync /
h200-cu130 / kvc-real-ali-iter-v1) into a single
collaborator-facing document.

Sections:
- §1 per-branch state
- §2 contributions a reviewer cannot refute
- §3 weaknesses (M1–M6 methodology, S1–S10 system,
      infra) ranked by how badly they hurt at OSDI/SOSP
- §4 3-milestone roadmap (defensible submission →
      production substrate → OSDI'27 increments)
- §5 GPU-free work queue (what subsequent commits
      in this branch deliver)

No code change. Acts as the index target for the
follow-up commits on this branch.
2026-05-12 23:46:40 +08:00
tim
6d1c9237fa docs(architecture): KVC eviction granularity is the wrong abstraction
After E3 exposed massive session-level eviction (90 trims × avg
67K tokens/evict = 6.1M tokens trashed in 1h12min), we have to
acknowledge the local-patch sequence (E2 → load-floor → Fix A →
proposed disable-migration → proposed disable-admission) was a
KVC-to-DP collapse trajectory, not a fix.

The fundamental issue: SessionAwareCache merged two responsibilities
that should be separate.

  1. Session lifecycle tracking (legitimate — streaming sessions
     reuse KV across turns and need per-session metadata).
  2. Eviction granularity decision (wrong — sessions should not be
     the eviction unit).

`release_session` frees the session-exclusive range
[cache_protected_len, kv_allocated_len), which is the post-radix-
commit tail accumulated over decode/extend. On Inferact's
50-session workload this is 35-87K tokens per session. The radix
tree never gets a chance to do block-level leaf-LRU on that range
because it was never committed there.

Effect: evict-revisit cycle forces full 50-90K re-prefill per
session per evict — which is exactly the per-request cost of naive
PD-disagg. KVC's direct-to-D fast-path advantage collapses.

The right fix is structural (not a patch): progressively commit
streaming-session decode output to the radix tree so SGLang's
block-level LRU can shed only the deepest leaves, preserving the
recent prefix that next-turn requests are most likely to match.
SessionSlot becomes pure metadata. Scope is ~1-2 weeks of vendored
SGLang refactor, orthogonal and complementary to the D→P sync work
proposed in RESEED_SLOW_PATH_AND_D_TO_P_GAP §4.

Doc lists five anti-patterns the next agent should avoid (tuning
migration_reject_threshold, disabling migration/admission, etc.);
all of those are local symptoms downstream of the eviction
granularity choice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:21:45 +08:00
tim
986f351365 feat(sglang): drop streaming-session reqs with fill_ids < prefix_indices
Fix A from docs/E3_FINDINGS_ZH.md §3. The existing streaming-session
correction at the top of ScheduleBatch.prepare_for_extend zeroes
req.extend_input_len when len(fill_ids) <= len(prefix_indices), but
the per-req invariant later in the same function (assert
seq_len - pre_len == req.extend_input_len) is computed from raw
fill_ids/prefix_indices lengths and has no path to be satisfied
when fill_len < prefix_len. The result is an AssertionError that
crashes the entire decode worker.

Add a pre-filter pass at the start of prepare_for_extend that
detects this state, marks the affected reqs with FINISH_ABORT (so
the client gets an error response instead of the worker hanging),
and drops them from the batch before the correction loop runs. If
all reqs are filtered, populate empty tensor/list state and return
early so downstream model.forward sees a valid no-op batch.

This treats fill_ids < prefix_indices as upstream state
inconsistency that should be reported to the client rather than
silently miscomputed. The narrower invariant after this filter:
prepare_for_extend's body only ever sees streaming-session reqs
where actual_extend_len > 0, which is the regime the existing
correction logic was designed for.
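
A minimal sketch of the pre-filter idea, using a hypothetical ToyReq
dataclass rather than SGLang's real request type. The strict `<` test
follows the commit title; the field names and the "abort" marker are
assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyReq:
    """Hypothetical stand-in for a scheduler request."""
    rid: str
    fill_len: int    # stands in for len(fill_ids)
    prefix_len: int  # stands in for len(prefix_indices)
    finished_reason: Optional[str] = None


def prefilter_streaming_reqs(reqs: List[ToyReq]) -> Tuple[List[ToyReq], List[ToyReq]]:
    """Split a batch into (kept, aborted) before any extend-length correction."""
    kept, aborted = [], []
    for req in reqs:
        if req.fill_len < req.prefix_len:
            # Upstream state is inconsistent: report it to the client
            # instead of letting a later assertion crash the worker.
            req.finished_reason = "abort"
            aborted.append(req)
        else:
            kept.append(req)
    return kept, aborted


batch = [
    ToyReq("ok-req", fill_len=50_000, prefix_len=43_459),
    ToyReq("bad-req", fill_len=6_648, prefix_len=43_459),  # lengths from the E3 repro
]
kept, aborted = prefilter_streaming_reqs(batch)
```

After the split, every request left in `kept` has a non-negative
fill_len - prefix_len, so a length invariant computed from the raw
lengths can hold.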

Reproduced by E3 first run on 2026-05-12 02:51:21 UTC (rid
6f4318e93dd543a49dbf19248cfc1e6f, session 1000195, fill_len=6648,
prefix_len=43459) — masked in E1/E2 because the cap-out failure
cascade prevented sessions from accumulating deep enough committed
prefix to trigger the inconsistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:12:14 +08:00
tim
d40db1f117 docs(experiments): E3 first run — load-floor bonus works, exposes SGLang bug
H1 (load balance) confirmed at the 15-min checkpoint: D2 received
22.5% of bindings (225 out of 1001) covering 30 unique sessions,
versus 0 in both E1 and E2. The graduated load-floor formula with
K=200 produces the intended distribution: fresh sessions on
under-loaded D, sticky sessions stay put.

But decode-1 crashed at 11:51:21 (~5 min into benchmark) with an
SGLang AssertionError in schedule_batch.py:1646. Root cause: the
streaming-session correction at lines 1572-1585 patches
req.extend_input_len to 0 when len(fill_ids) < len(prefix_indices),
but the downstream invariant uses raw fill_ids/prefix_indices
lengths, so the arithmetic check fails. This is a pre-existing
landmine in the b8e6f13 SGLang vendor patch, not caused by the
load-floor bonus. It just happened to be masked in E2 by the
failure cascade preventing sessions from accumulating deep enough
prefix to trigger the correction.

Crash session 1000195 stayed on decode-1 the whole time (not a
migration race). E3 exposes this faster because sessions actually
run further with rebalanced load.

5 fix options evaluated. Recommended: Fix A — local patch at
schedule_batch.py:1646 to skip zero-extend-len reqs before
asserting. Less invasive than C (recomputing seq/prefix arrays);
addresses the actual case (D and E are workarounds, not fixes).

4 decision points for review; no code changes in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:05:51 +08:00
tim
a1abdcd50c feat(experiments): E3 sweep — KVC v2 + RDMA + load-floor bonus
Same outputs/inferact_50sess.jsonl subset as E1/E2 (md5
7bb263a32600ef5a6ef5099ba340a487). Identical to E2 except adds
--kvcache-load-floor-bonus 200. Tests three hypotheses:

  H1 (load balance):  D2 receives non-trivial bindings (E1/E2: 0)
  H2 (failure rate):  mooncake batch_transfer timeouts disappear
                      because D0/D1 KV pool no longer saturates
                      (E2 had 1054 fails; expect ≤ E1's 85)
  H3 (TTFT):          E2's 0.43s p50 (over the 231 successes)
                      generalizes to most reqs once cascade is gone

K override via LOAD_FLOOR_BONUS env var (default 200).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:45:09 +08:00
tim
93fce42747 feat(policy): load-floor bonus for KvAwarePolicy (Q2.B)
Implements the design proposed and approved in
docs/E1_E2_FIX_DESIGN_ZH.md §Q2.B.

KvAwarePolicy gains a `load_floor_bonus: int = 0` knob. When > 0:

  mean_assigned = sum(assigned[*]) / len(D)
  for each D candidate:
    if not sticky and mean_assigned > 0:
      deficit = max(0, mean_assigned - assigned[D])
      floor_bonus = K * deficit / mean_assigned
    else:
      floor_bonus = 0
    score = (overlap + sticky*α + floor_bonus, sticky, -inflight, -assigned)

Properties (verified by unit-style probe in commit message):
- Default 0 = old behavior preserved
- Sticky-gated: turn-1+ requests of an existing session keep going
  to their original D (cache locality preserved)
- Graduated: bonus magnitude scales with the D's deficit ratio,
  approaches K as deficit/mean → 1, drops to 0 when balanced
- Set above max expected boilerplate overlap (Inferact ~50 → 200)
  so cross-session shared-prefix overlap doesn't pin cold D's idle,
  but real per-session prefix overlap (>K blocks) still wins
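
A runnable sketch of the formula above, written against the keyword
signature that the later score_candidate() refactor describes. Truncating
the bonus to an int is an assumption (the pseudocode only gives
K * deficit / mean), as is sticky_bonus=1 in the examples.

```python
from typing import Tuple

CandidateScore = Tuple[int, int, int, int]


def score_candidate(*, overlap: int, sticky: bool, inflight: int, assigned: int,
                    mean_assigned: float, sticky_bonus: int,
                    load_floor_bonus: int) -> CandidateScore:
    """Lexicographic score for one decode worker; larger tuples win."""
    floor_bonus = 0
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        deficit = max(0.0, mean_assigned - assigned)
        # Integer truncation is an assumption made for this sketch.
        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
    return (primary, int(sticky), -inflight, -assigned)


def fresh(overlap: int, assigned: int, k: int) -> CandidateScore:
    """Score a non-sticky candidate with mean_assigned fixed at 10.0."""
    return score_candidate(overlap=overlap, sticky=False, inflight=0,
                           assigned=assigned, mean_assigned=10.0,
                           sticky_bonus=1, load_floor_bonus=k)


# Cold-D scenario: a warm D holds 50 blocks of boilerplate overlap,
# a cold D holds none and has no assignments yet.
warm_off, cold_off = fresh(50, 15, k=0), fresh(0, 0, k=0)
warm_on, cold_on = fresh(50, 15, k=200), fresh(0, 0, k=200)
```

With K=0 the warm D wins on boilerplate overlap alone; with K=200 the
cold D's full deficit earns the whole bonus and it wins the comparison.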

Plumbed through ReplayConfig, BenchmarkConfig, and CLI flag
--kvcache-load-floor-bonus on both `replay` and `benchmark-live`.

Empirical verification on synthetic state (same conditions as the
E2 cold-D pathology):
  - OFF (K=0):   route fresh session → decode-0 (boilerplate winner)
  - ON  (K=200): route fresh session → decode-1 (cold D rebalanced)

Validation pass next: scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
(committed separately).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:45:09 +08:00
tim
905d671135 feat(env): MC_TRANSFER_TIMEOUT=1800s default in setup_env + stack
Mooncake's C++ batch_transfer_sync defaults to a 30s timeout. When
a saturated D scheduler thread is busy with LRU eviction, that
timeout fires as a false positive and the SGLang hair-trigger in conn.py:1270
permanently blacklists the D's mooncake_session_id (E2 forensic in
docs/E1_E2_RESULTS_ZH.md §5c). Bump to 1800s in setup_env.sh and
mirror to subprocess env in stack.py so SGLang workers get it too.
30-min envelope still detects genuinely broken peers eventually.
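
A minimal sketch of mirroring the default into a worker's environment.
The helper name is hypothetical, and applying the value only when the
variable is unset (rather than overwriting it) is an assumption about how
stack.py behaves.

```python
import os
from typing import Dict, Mapping, Optional

DEFAULT_MC_TRANSFER_TIMEOUT = "1800"


def build_worker_env(parent: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the parent environment and apply the timeout only if it is unset."""
    env = dict(os.environ if parent is None else parent)
    env.setdefault("MC_TRANSFER_TIMEOUT", DEFAULT_MC_TRANSFER_TIMEOUT)
    return env


defaulted = build_worker_env({"PATH": "/usr/bin"})
overridden = build_worker_env({"MC_TRANSFER_TIMEOUT": "120"})
```

The returned dict can be passed as the env= argument of subprocess.Popen
so the launched worker sees the same value as the parent shell.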

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:45:09 +08:00
tim
9a166ac43b docs(experiments): design space for Q1 (mooncake stall) + Q2 (cold-D)
For Q1 (D scheduler LRU starves mooncake control plane → 30s
batch_transfer_sync timeout → hair-trigger blacklist), six candidate
fixes evaluated. Recommendation: do Q2 fix first since it removes
the only condition under which we observe LRU thrash; bump mooncake
timeout to 120s as cheap defense-in-depth; avoid invasive SGLang
vendor changes (windowed hair-trigger, async eviction thread) until
Q2 fix demonstrates they're insufficient.

For Q2 (overlap-first lex score + shared boilerplate → permanent
D2 cold), seven candidate fixes evaluated. Recommendation: load-
floor bonus (graduated, decoupled from overlap, gated on
not-sticky) as the primary mechanism — proactive on first-touch as
user requested, avoiding the binary one-shot pitfall of the
reverted cold-D bonus. Orthogonal cleanup: fix the substring filter
in _is_admission_rejection_mode so the existing migration mechanism
serves as a backstop when load balancing alone isn't enough.

7 decision points listed for review; no code merged until a shape
is approved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:20:00 +08:00
tim
976115ea5e Revert "feat(policy): cold-D bonus to break overlap-pinning death spiral"
Implementation jumped ahead of design. The cold-D bonus is one of
several candidates for the overlap-pinning fix (others: load-floor
bonus, idle-D bonus, capacity-aware overlap discount, pre-warming
boilerplate). Need to evaluate the design space first, including
whether a single bonus is even the right shape vs a separate term
in the lex score, before committing to a specific knob.

This reverts commit 786cbb8 cleanly (forensic docs in bf4da28 and
7f2ebf3 are kept since they record observations, not designs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:17:16 +08:00
tim
786cbb8d91 feat(policy): cold-D bonus to break overlap-pinning death spiral
KvAwarePolicy now accepts an optional cold_d_bonus int. When > 0,
fresh requests (sticky=0, i.e. no prior D for this session) receive
the bonus added to lex-score position 0 (overlap+sticky_bonus) for
any D worker that has never been assigned a session yet
(decode_assignment_counts == 0). This breaks the pathology
documented in docs/E1_E2_RESULTS_ZH.md §5d where workloads with
shared cross-session prefix (e.g. Inferact's "permissions
instructions" boilerplate) cause every D that has hosted any session
to dominate the overlap term against any cold D, leaving the cold D
permanently unused.

Sticky behavior is preserved: turn 1+ requests of an existing
session continue to stick to their original D because the bonus is
gated on `not sticky`.

Plumbed through ReplayConfig.kvcache_cold_d_bonus (default 0,
keeping current behavior unchanged), BenchmarkConfig, and CLI flag
--kvcache-cold-d-bonus on both `replay` and `benchmark-live`
subcommands. Set above max expected boilerplate overlap (Inferact's
~50 24-token blocks → 1000 is safe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:14:00 +08:00
tim
bf4da281c0 docs(experiments): mooncake "is not alive" deep-dives to LRU starvation
The Q1 mystery resolves: P-side mooncake C++ logs show
"Sync batch data transfer timeout after 37452515723ns" (37.45 s) at
01:56:42 — this is mooncake's batch_transfer_sync giving up after
its internal timeout. The hair-trigger >=1 in conn.py:1270 is
correct in the idle case (a 30-s RDMA stall genuinely means the
peer is broken), but it fires here because of D-side congestion:
decode-0.log shows two consecutive LRU evictions ("Trimmed decode
session cache via LRU. evicted_sessions: 2, freed_tokens: 77675")
firing in the same wall-clock second as the timeout.

The D scheduler thread is busy with multi-session GPU memory frees
+ session-aware-cache bookkeeping under lock; the mooncake C++
control plane on the receive side gets starved for >30 s; P times
out and marks the whole D's mooncake_session_id failed.

Two-layer fix listed in §5c: root-cause = spread load to D2 (cold-D
bonus, next commit); defense-in-depth = windowed threshold + retry
in vendored mooncake conn.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:14:00 +08:00
tim
7f2ebf3d87 docs(experiments): forensic on Q1 (mooncake death) and Q2 (no D2 migration)
Q1: Mooncake "is not alive" is hair-trigger — a single
send_kvcache_slice ret != 0 in
third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1270
permanently adds the D's mooncake_session_id to failed_sessions
and blacklists it for the rest of the process lifetime. The D worker
process is alive (D1 keeps serving admit_direct_append OK seconds
after), but every subsequent P→D transfer for that session
short-circuits at conn.py:1184. The "Failures should never happen if
the session is not dead" comment encodes the wrong assumption for the
saturation regime we hit.

Q2: KVC v2's migration mechanism IS sound but its trigger is gated
by replay.py:1379 _ADMISSION_REJECTION_SUBSTRINGS = ("session-cap",
"no-d-capacity", "d-backpressure"). All 1054 failures have
execution_mode="kvcache-centric" (generic fallback bucket) which
contains none of those substrings, so session_d_rejects is never
incremented. Empirically 46 of 49 (sess, D) pairs that the worker
RPC rejected would have qualified for blacklist (most-rejected
pair: 25 rejects), but policy never saw them. Result: D0 reject
→ next-bind D0 (253×), D1 reject → next-bind D1 (329×), D0/D1 reject
→ next-bind D2 (0×).
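
A minimal sketch of why the filter never fires. The tuple is copied from
the text above; the function body is a guess at what a substring filter
of this kind looks like, and the second mode string is made up for
contrast.

```python
_ADMISSION_REJECTION_SUBSTRINGS = ("session-cap", "no-d-capacity", "d-backpressure")


def is_admission_rejection_mode(execution_mode: str) -> bool:
    """True if the mode string names one of the recognised rejection reasons."""
    return any(token in execution_mode for token in _ADMISSION_REJECTION_SUBSTRINGS)


# The generic fallback bucket that all 1054 failures landed in.
generic_fallback = is_admission_rejection_mode("kvcache-centric")

# A made-up mode string that does carry a recognised reason.
tagged_reject = is_admission_rejection_mode("kvcache-centric/no-d-capacity")
```

Because the generic bucket matches no token, the reject counter is never
bumped for those requests, however many times the worker RPC refused them.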

Fix paths are documented for both. The shortest path is widening the
substring filter to include the failure-fallback bucket, but the
right fix is to call record_admission_reject directly from the
actual rejection signal site instead of string-matching execution_mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:45:18 +08:00
tim
ef4dc81ea9 docs(experiments): forensic explanation for E2 80% failure rate
Pulling admission-events.jsonl, prefill-0.log, and request-metrics
sampling shows the 1054 failures are NOT timeouts as initially
assumed. They are a 3-layer cascade:

  L1: 562 "no-space" + 43 "session-not-resident" worker admission
      rejects (51% of all admit attempts) because D0/D1 KV pools
      saturate while D2 stays empty.
  L2: rejects re-route to seed/reseed which need mooncake P→D KV
      transfer; the backlog drops mooncake heartbeats and prefill-0
      logs "Decode instance could be dead, remote mooncake session
      ... is not alive".
  L3: SGLang aborts the request, SSE stream closes with 0 tokens,
      agentic-pd-hybrid raises "generate stream ended before
      producing any token" (the literal error string for all 1054).

E1 didn't hit this because pd-disaggregation has no admission RPC —
sessions just queue behind the running batch, paying TTFT instead
of failing. KVC v2's worker admission is supposed to be a safety
valve; on the cold-D pathology it becomes a failure amplifier.

The real fix is upstream D rebalancing (cold-D bonus or pre-warm),
not relaxing admission.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:38:49 +08:00
tim
3db2d84df8 docs(experiments): E2 complete — qualified H1 with a surprise
E2 finished 1h33min wall. Headline contrast on the matched Inferact
50-session subset:

E1 (naive 1P3D + kv-aware + RDMA):
  1200/1285 succ, lat p50=93s p99=219s, TTFT p50=89s p99=207s
E2 (KVC v2 + RDMA):
   231/1285 succ, lat p50= 7.4s p99=65s, TTFT p50=0.43s p99=8.7s

E2 is 12.4× worse on failure rate but roughly 200× better on TTFT p50
(0.43s vs 89s) among the requests that did complete. Both runs leave D2 entirely unused
for the same structural reason: Inferact's shared "permissions
instructions" boilerplate makes overlap dominate the kv-aware lex
score, and v2's migration mechanism only fires on capacity rejects
which never reach D2. The 1054 E2 timeouts are downstream of that
imbalance, not a v2 bug per se.

The doc closes with five concrete follow-ups for the next agent —
cold-D bonus, router-mode admission, default-policy control arm,
TCP-loopback comparison, failure mode forensics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 03:23:33 +08:00
tim
e3e5c45ed4 docs(experiments): E2 mid-run finding — D2 stays cold in KVC v2 too
Same pathological imbalance E1 showed reproduces in E2: D2 has zero
bindings at 33% POSTs in. Root cause is structural, not a KVC v2 bug:
all 50 Inferact sessions begin with identical "permissions
instructions" boilerplate, so the converter assigns them identical
first-block hash_ids. kv-aware policy's overlap term (lex-score
position 0) makes any already-resident D dominate a fresh D
unconditionally, and v2's migration only activates on admission
rejects which never fire because D0/D1 KV pools have headroom. The
H1 conclusion is qualified: KVC v2 helps per-request work (direct-
to-D fast path) but does not rebalance D worker load on workloads
with shared cross-session prefixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 02:08:00 +08:00
tim
631b2c8847 docs(experiments): E1 results — naive 1P3D + kv-aware confirms H1 baseline
E1 finished 1h29min wall on the 50-session Inferact subset. Headline:
1200/1285 succeeded, latency p50=93s p99=219s, TTFT p50=89s p99=207s,
85 timeouts. Decode-2 was never bound to a single session — all 50
sessions stuck to decode-0/1 by kv-aware policy stickiness with no
migration to rebalance, so effective topology was 1P2D, not 1P3D.
This is exactly the failure mode H1 predicts naive pd-disaggregation
should exhibit, giving E2 (full KVC v2 with migration) a concrete
baseline to improve against.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 01:49:52 +08:00
tim
ad8aaa8c5a feat(experiments): E2 sweep — KVC v2 + RDMA on the matched subset
KVC v2 config from sweep_ts1_migration_v2.sh (reset-on-success +
direct-append threshold 8192) layered on top of the RDMA-enabled
mooncake stack, against the same outputs/inferact_50sess.jsonl
subset that E1 uses. Pair-wise contrast tests H1 (KVC layer marginal
contribution on top of 1P3D + kv-aware) and H2/H3 (RDMA reducing
reseed slow-path tail).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:49:53 +08:00
tim
bb9cc249cd feat(experiments): E1 sweep on 50-session deterministic subset
scripts/sample_trace_subset.py — file-order head-cut that takes the
first N sessions of a converted trace. No RNG, no hashing — same
input yields byte-identical output (the included assertion compares
md5 across two runs).

scripts/sweep_e1_naive_1p3d.sh — E1 of ONBOARDING_NEXT_AGENT_ZH §3.1:
mechanism=pd-disaggregation, policy=kv-aware, 1P3D, RDMA on
(mlx5_60). Defaults to outputs/inferact_50sess.jsonl so E1 and E2
can share the exact same subset; override via TRACE= env var to run
on the full 20,230-request trace.

Reproducing the subset:
  uv run --no-sync python scripts/sample_trace_subset.py \
    --input outputs/inferact_codex_swebenchpro.jsonl \
    --output outputs/inferact_50sess.jsonl \
    --sessions 50
  # expected output_md5: 7bb263a32600ef5a6ef5099ba340a487
  # 1285 requests, mean input_length 67631 tokens

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:21:36 +08:00
tim
b55371fe69 docs: H200 + driver 570 setup guide + 11 lessons learned
Captures the full debugging journey of getting vendored SGLang 0.5.10
+ mooncake RDMA running on a 4×H200 node with the older driver
570.86.15. Driver 570's actual API is cu12.8 — nvidia-smi's
"CUDA Version: 13.0" header is a forward-compat ceiling, not the
driver's own version — and that single misreading drove most of the
detours. Lessons cover: pip vs vendor sglang divergence, why cu13
switching was a dead end (mooncake is cu12-only by wheel, driver 570
can't run cu13 anyway), why --disable-overlap-schedule alone isn't
enough, why pip nvidia-cuda-nvcc-cu12 doesn't ship the nvcc binary,
and how tvm_ffi's ninja-driven nvcc invocation makes CUDA_HOME the
single hook point that fixes everything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:10:14 +08:00
tim
d11a66d11b feat(scripts): cu12.8 env wrapper + Inferact trace converter
setup_env.sh: source-able shell snippet that points tvm_ffi (vendor
sglang JIT compiler) at $HOME/cuda-12.8/bin/nvcc and exposes both
libcudart.so.12 (for mooncake.engine, a cu12 wheel) and cu12.8 lib64
(for tvm_ffi compile-time linker) on LD_LIBRARY_PATH. Without this,
JIT-compiled kernels NEEDED libcudart.so.13 and driver 570 rejected
them at every JIT call.

convert_inferact_to_trace.py: turns Inferact codex_swebenchpro_traces
(ShareGPT {"from","value"} pairs) into the chat_id/parent_chat_id/
turn/hash_ids JSONL schema replay.py expects. Tokenizes with the
model's own tokenizer, builds prefix-sharing 24-token block hashes,
synthesizes timestamps. Output cross-checks 20,230 LLM calls — exactly
matches the Inferact README count for 610 successful trials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:10:06 +08:00
tim
a418aafeed feat(stack): pin PD workers to --disable-overlap-schedule
On a node with driver 570.86.15 (cu12.8 driver API ceiling), SGLang's
overlap event loop hits cudaErrorInsufficientDriver inside
event_loop_overlap_disagg_prefill → resolve_future_token_ids JIT
kernel. Switching to the normal event loop sidesteps this specific
codepath. The flag is harmless on newer drivers and remains a useful
default until overlap is independently re-validated on this hardware.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:09:56 +08:00
tim
e874b1f055 feat(env): install vendored SGLang via uv path source
Replace pip-resolved sglang==0.5.10 with an editable install from
third_party/sglang/python. The vendored fork carries patches the pip
release does not (admit_direct_append RPC types, _should_allow_local_
prefill_on_decode, maybe_trim_decode_session_cache, backpressure pause
hint) — KVC routing depends on them, so the vendored copy must be the
import target, not just on PYTHONPATH at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:09:50 +08:00
35 changed files with 4841 additions and 102 deletions


@@ -1,9 +1,33 @@
# AGENTS.md
## For new collaborators / agents
Before doing anything else, read [docs/INDEX_ZH.md](docs/INDEX_ZH.md). It points to the
3 must-read docs and a role-based reading path (new SWE, paper reviewer,
reproducing student, control-plane reader).
Cross-branch progress, weaknesses, and roadmap live in
[docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md). It is the single source of truth
for "what's done, what's broken, what to do next."
Two engineering work items are pre-specced and ready to pick up:
- block-level eviction refactor — [docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)
- D→P incremental KV sync — [docs/D_TO_P_SYNC_CONTRACT_ZH.md](docs/D_TO_P_SYNC_CONTRACT_ZH.md)
Evaluation protocol (paper-quality N, paired CI, stratification,
baselines) is in [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md).
## Environment
Use `uv` to manage the Python environment. Add dependencies with `uv add` so that `uv sync` reproduces exactly the same runnable environment every time.
Algorithm-layer unit tests (no GPU, no SGLang):
```bash
uv sync --group test
uv run pytest
```
## Goal
Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.


@@ -6,6 +6,9 @@
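For a fuller but still concise description, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).
New collaborators: start with [docs/INDEX_ZH.md](docs/INDEX_ZH.md) and pick the 3 must-read docs that match your role.
For an overview of current progress, weak points, and the roadmap, see [docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md).
## What has been done so far
- Launches a single-node SGLang P/D stack.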
@@ -99,3 +102,28 @@ uv run agentic-pd-hybrid replay \
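- SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`
- The baseline for `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
- Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.
## Unit tests (no GPU)
The algorithm layer (policies, Algorithm 1 / Theorem 1) has pure-Python unit tests; running them requires neither a GPU nor SGLang: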
```bash
uv sync --group test
uv run pytest
```
See [tests/README.md](tests/README.md) for details.
## Evaluation scripts
After collecting data according to [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md):
```bash
# M3: bucket by turn_id / input_length / overlap_ratio / append_tokens
scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl
# M2: paired-on-same-trial bootstrap 95% CI
scripts/analysis/paired_compare.py \
--baseline outputs/run-dp/request-metrics.jsonl \
--candidate outputs/run-kvc/request-metrics.jsonl
```


@@ -0,0 +1,140 @@
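# Project-wide Audit and Next-phase Roadmap
**Date**: 2026-05-12
**Branch base**: `improve/audit-and-foundations` (based on `h200-cu130`)
**Nature**: cross-branch consolidation + roadmap, so collaborators can judge whether each commit is worth merging
**Audience**: the project's next SWE / research agent, plus pre-reading for paper reviewers
This document gathers in one place the progress, established contributions, weak points, and the roadmap toward SOSP/OSDI and industrial-grade quality across five branches (`main` / `kvc-debug-journey-v1-to-v4` / `feat/d-to-p-sync` / `h200-cu130` / `kvc-real-ali-iter-v1`), so that everyone can align quickly.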
---
## 0. TL;DR
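1. **Already established**: the v1 → v2 algorithm (reset-on-success, lexicographic Route, worker-mode Admit RPC) has a formal definition, two theorems, and measurements on SWE-Bench 50 sess ts=1 in which it beats 4DP CA on 6 of 8 metrics.
2. **Core weak points**: (a) session-level eviction conflicts with KVC's design intent; (b) D→P incremental KV sync does not exist, and the TTFT p99 long tail comes from this; (c) the mooncake "instance not alive" cascade is a fundamental control-layer availability problem; (d) the evaluation still lacks multiple baselines, multiple traces, and strong statistics.
3. **Work that can advance without a GPU**: algorithm-layer unit tests, formal design docs (block-level evict, D→P sync interface contract), the evaluation protocol, stratified analysis tooling, and consolidating the documentation. Most of Milestone 1 in this roadmap falls into this category.
4. **Required for OSDI/SOSP**: execute §S1 (block-level evict) + §S2 (D→P sync POC) + §M2/M3/M4 (multiple baselines / full Ali / paired protocol). Estimated 3–4 months for one or two people.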
---
## 1. Status overview of the five branches
| Branch | Role | Current status | Key output |
|---|---|---|---|
| `main` | "Released" baseline | 18 commits behind origin; 2P4D + worker-admission + seed-min2 reports a 9% mean / 19% p90 improvement vs default PD | The two-tier strategy in `KVCACHE_CENTRIC_PROGRESS_ZH.md` (latency-best vs stable) |
| `kvc-debug-journey-v1-to-v4` | Main working branch | Full v1→v5 algorithm evolution; `KVC_ROUTER_ALGORITHM.md` has the three-part algorithm + two theorems | SWE-Bench 50 sess ts=1: v2 beats 4DP CA on 6/8 metrics; **TTFT p99 still loses by 3×** (1.28s vs 0.43s), diagnosed as the 8.3% reseed slow path |
| `feat/d-to-p-sync` | Placeholder branch | No code; only `RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` | Rules out the misconception that "capacity-backup is D→P sync"; lists 4 engineering subtasks |
| `h200-cu130` | Real-hardware + RDMA validation | Runs E1/E2/E3 on 4×H200 + mlx5_60 NDR 400 Gb/s | **E2 80% failure** (mooncake dead-link cascade); **E3 triggers the SGLang patch invariant crash at 16 min**; the latest `KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` elevates the root cause to "session-level is the wrong eviction granularity" |
| `kvc-real-ali-iter-v1` | Real Ali trace validation | 8×H20; 179-req KVC-fit slice + 600-req/15min cold-window | KVC vs DP: KVC-fit p50 −46% ✅; real 15min p90 +19s ❌ (53 errors vs 1 for DP); KVC OOMs at the default mem-fraction, which must be lowered to 0.82 |
---
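## 2. Contributions that already hold up firmly
Measured by whether a reviewer could refute them:
1. **Reset-on-success fixes v1 thrashing**: the failure mode in which v1's permanent blacklist leads to a migration livelock is backed by measurements, the Algorithm 3 formalization, and Theorem 1's starvation-freedom proof (`KVC_ROUTER_ALGORITHM.md` §3.4 / §4.1).
2. **Clear division of labor across the three algorithms**: Algorithm 1 (lexicographic Route) + Algorithm 2 (D-autonomous Admit RPC) + Algorithm 3 (Dispatch + reset-on-success). v5 moves admission from a router-side estimate to a D-side RPC (Option D), which is the correct layering for decoupling capacity ground truth from the routing score.
3. **Deterministic hits on the direct-to-D fast path**: by Theorem 2, whenever the three conditions residency ⊇ prefix ∧ append ≤ τ_append ∧ cap_ok hold simultaneously, the fast path is taken. The 91.6% hit rate on SWE-Bench and TTFT p50 = 0.43s are structural results.
4. **Every negative result has a forensic-grade explanation**: mooncake death, cold-D, the reseed slow path, and session-level evict each come with code locations, a timeline, and a counterexample. This is a genuine plus for the paper.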
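The three conditions of Theorem 2 can be read as a single predicate. The sketch below is illustrative only: `FastPathInputs`, `takes_fast_path`, and the representation of residency as a set of block hashes are assumptions made here, not the project's actual types. The 8192 value in the usage lines mirrors the direct-append threshold quoted for the E2 sweep.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class FastPathInputs:
    """Hypothetical bundle of the quantities named in Theorem 2."""
    resident_hashes: FrozenSet[int]  # block hashes resident on the candidate D
    prefix_hashes: FrozenSet[int]    # block hashes of the request's prefix
    append_tokens: int               # new tokens this turn would append
    tau_append: int                  # direct-append threshold
    cap_ok: bool                     # result of the D-side capacity check


def takes_fast_path(x: FastPathInputs) -> bool:
    # residency ⊇ prefix ∧ append ≤ τ_append ∧ cap_ok
    covers_prefix = x.prefix_hashes <= x.resident_hashes
    return covers_prefix and x.append_tokens <= x.tau_append and x.cap_ok


hit = FastPathInputs(frozenset({1, 2, 3}), frozenset({1, 2}), 512, 8192, True)
miss = FastPathInputs(frozenset({1}), frozenset({1, 2}), 512, 8192, True)
print(takes_fast_path(hit), takes_fast_path(miss))
```

Any single false condition sends the request down the slow path, which is why the hit rate is a structural property of the workload rather than a tuning outcome.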
---
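## 3. Weak points a reviewer could use for a fatal blow
### 3.1 Evaluation methodology
- **M1 insufficient N**: the SWE-Bench v2 baseline was confirmed categorical at N=3, but v2 itself has too few runs; bootstrap CIs are missing.
- **M2 unequal comparison basis**: with E2 failing 80% of the time, latency computed over "successful only" requests is compared against E1's full set; the paper must use a paired-on-same-trial comparison.
- **M3 trace skews KVC-friendly**: the KVC-fit slice was filtered for small-append + high overlap; the diluted result on full Ali (turn2+ ratio 26%, very many single-turn requests) has never been run.
- **M4 baselines are not strong enough**: none of vLLM + prefix-cache, DistServe, SplitWise, or Mooncake-Master is included.
- **M5 trace monoculture**: no ShareGPT/Mooncake traces, no long-context tool-use agent benchmark, no synthetic adversarial trace.
- **M6 hardware coverage**: single-node ≤ 8 GPU only; no cross-node runs and no measurements on clusters of ≥ 32 GPUs.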
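For M2, a paired-on-same-trial comparison could look roughly like the following. This is a minimal standard-library sketch, not the contents of `scripts/analysis/paired_compare.py`; the function name, the dict-of-latencies input shape, and the choice to drop pairs where either arm failed (with failure rates reported separately) are assumptions.

```python
import random
from statistics import mean
from typing import Mapping, Optional, Tuple


def paired_bootstrap_ci(
    baseline: Mapping[str, Optional[float]],
    candidate: Mapping[str, Optional[float]],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean_diff, ci_low, ci_high) for candidate - baseline.

    Inputs map request id -> latency in seconds, with None marking a failure.
    Only requests that succeeded in BOTH runs are paired.
    """
    shared = sorted(
        rid for rid in baseline.keys() & candidate.keys()
        if baseline[rid] is not None and candidate[rid] is not None
    )
    if not shared:
        raise ValueError("no request succeeded in both runs")
    diffs = [candidate[rid] - baseline[rid] for rid in shared]

    rng = random.Random(seed)
    # Resample the per-request differences with replacement.
    boot_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), boot_means[lo_idx], boot_means[hi_idx]
```

Because the difference is taken per request before resampling, a run that only succeeds on easy requests cannot look better simply by failing the hard ones.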
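### 3.2 System design
- **S1 session-level eviction conflicts with KVC's design intent**: 90 evictions, 67K tokens freed per eviction on average, and 25/50 sessions forced into a 50–90K re-prefill. `KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` identifies the problem, but no fix has been implemented.
- **S2 D→P incremental sync does not exist**: 50% of the TTFT p99 long tail comes from P re-prefill. `capacity-backup` is a static seed-time snapshot, not D→P sync. The fix requires changing the single-producer assumption of SGLang's radix.
- **S3 mooncake cascading death**: admission no-space → sustained seed retries → heartbeats drop → SGLang aborts the whole batch (E2: 1054/1285 failed). A fundamental control-layer availability bug.
- **S4 admission RPC blocks synchronously**: no backoff / hedging / staleness budget. Any GIL jitter in the D scheduler stalls the router.
- **S5 cold-D / overlap-pinning**: the boilerplate 24-token block hashes make every session overlap with D0/D1 → D2/D3 get 0 bindings. The load-floor bonus is a patch, not a first-principles fix.
- **S6 the local SGLang patch is already 785 lines across 10 files**, including hot-path invariant changes such as `schedule_batch.py:1646`; the E3 crash is a latent landmine introduced by the vendored patch.
- **S7 failure recovery / idempotence**: idempotence of streaming sessions under chunked-prefill retry relies on `SessionSlot.restore_to_req`; there is no recovery protocol for worker crashes, mooncake reconnects, or partial KV corruption.
- **S8 no multi-tenant / SLO-aware scheduling**: the algorithm's objective implicitly sets w_ttft = w_lat = 1. Production needs separate tiers for interactive / batch / background.
- **S9 topology fixed at boot**: the P/D ratio is a startup parameter. Production load needs elasticity.
- **S10 backpressure pause hint is not closed-loop**: it fired 20 times, but because of no-BP nobody responded; the control plane is not wired up.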
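As one possible shape for the backoff that S4 says is missing, capped exponential backoff with full jitter is a common choice. The helper name and all default values below are hypothetical; nothing here is taken from the router code.

```python
import random
from typing import List, Optional


def backoff_delays(
    base_s: float = 0.05,
    cap_s: float = 2.0,
    attempts: int = 5,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Delays (seconds) to sleep before each retry of an admission RPC.

    Attempt i draws uniformly from [0, min(cap_s, base_s * 2**i)], so
    concurrent retries from many sessions spread out instead of arriving
    in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


print(backoff_delays(rng=random.Random(0)))
```

The jitter matters most for S3: without it, every rejected session would retry its seed at the same instants, which is the burst pattern the cascade depends on.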
### 3.3 Engineering infrastructure
- **Observability**: metrics are jsonl + offline `recompute_summary.py`; production needs Prometheus + Grafana + OpenTelemetry traces.
- **Formal testing**: the algorithm and state layers lack unit tests; idempotence of `SessionSlot.restore_to_req` is an invariant the authors themselves flagged.
- **Chaos injection**: control-plane failures such as mooncake death need a fault injection harness.
- **Code size**: `replay.py` is 2460 lines and combines orchestration / policy hooks / control plane / metrics; acceptable for a prototype, weak as a paper-quality artifact.
---
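## 4. Roadmap
Three milestones, each independently deliverable (as a paper section or an engineering release).
### Milestone 1: Defensible SOSP/OSDI submission (3–4 months, one or two people)
**Goal**: consolidate the existing algorithm and failure diagnoses into a submission that can survive the first PC round.
1. **Execute §S1 (block-level eviction refactor)**: see `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`
   - Streaming-session decode output is committed incrementally into the radix tree via `cache_finished_req` when each turn finishes.
   - `SessionSlot` is reduced to pure metadata (holding only `last_node` + lock_ref).
   - `release_session` becomes `dec_lock_ref` + slot deletion; eviction is handed entirely to SGLang's radix LRU.
   - Expected: eviction granularity drops from 67K tokens per eviction to 24 tokens per eviction; reseed frequency drops by an order of magnitude.
2. **Execute §S2 (D→P incremental sync POC)**: see `docs/D_TO_P_SYNC_CONTRACT_ZH.md`
   - Microbenchmark to show: after a D append completes, asynchronously pushing the KV blocks back to the P-side radix lets the next reseed skip re-prefill.
3. **Fix §S3 (mooncake death cascade)**: admission RPC backoff + jitter; a per-D pending-seed budget; decouple the mooncake heartbeat from admission.
4. **First-principles fix for §S5**: redefine `overlap` as "the number of prefix hashes the session holds exclusively on D" (dropping the contribution of shared boilerplate hashes), so that scores spread out naturally.
5. **Redo the evaluation**: see `docs/EVALUATION_PROTOCOL_ZH.md`. N≥3 + bootstrap CI + multiple baselines + full Ali + stratified reporting.
6. **Extend the formalization**: add Theorem 3 (an upper bound on re-prefill cost under block-level evict) + Theorem 4 (the relationship between the D→P sync staleness budget β and reseed cost).
7. **Artifact**: one-command script + Dockerfile + reproduction of the core tables/figures in one hour on 4×A100.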
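Item 4's redefinition of `overlap` can be sketched as follows. The function names and the rule "a hash counts as shared when more than one session contains it" are assumptions for illustration; the actual redefinition may weigh hashes differently.

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence, Set


def sessions_per_hash(all_sessions: Mapping[str, Iterable[int]]) -> Counter:
    """Count, for every block hash, how many sessions contain it."""
    counts: Counter = Counter()
    for hashes in all_sessions.values():
        counts.update(set(hashes))
    return counts


def exclusive_overlap(
    request_hashes: Sequence[int],
    resident_on_d: Set[int],
    counts: Counter,
) -> int:
    """Overlap with one D, ignoring hashes shared by more than one session."""
    return sum(
        1 for h in request_hashes
        if h in resident_on_d and counts[h] <= 1
    )


boilerplate = [1, 2, 3]  # identical first blocks in every session
sessions = {"s0": boilerplate + [10, 11], "s1": boilerplate + [20]}
counts = sessions_per_hash(sessions)
d0 = {1, 2, 3, 10, 11}  # D0 holds s0
d2 = set()              # D2 is cold
new_session = boilerplate + [30]
print(exclusive_overlap(new_session, d0, counts),
      exclusive_overlap(new_session, d2, counts))
```

Under this definition a brand-new session scores zero on every D, so the tie falls through to the later terms of the lexicographic score (for example load) instead of being pinned to whichever D already holds the boilerplate.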
### Milestone 2: Production-quality serving substrate (a further 3–6 months, 2–3 people)
8. **Control-plane layering**: split `replay.py` into `router/` / `control/` / `obs/` / `orch/`.
9. **Elastic topology**: an autoscaling controller with inputs (P queue, D transfer queue, D KV usage).
10. **Multi-tenant + SLO classes**: three tiers (interactive / batch / background), each with an independent admission budget.
11. **Failure injection harness**: mooncake link flap / D OOM kill / router GC pause / partial KV corruption, each case with a recovery SLA.
12. **Persistent KV tier**: CPU DRAM + NVMe + RDMA-attached pool; evict becomes demote.
13. **Cross-node + heterogeneous**: mixed H100 + H200 + L40S; topology-aware routing.
14. **Observability**: per-request OpenTelemetry + per-D Prometheus + a main Grafana dashboard.
### Milestone 3: Research increments that could genuinely reach OSDI'27 (6–12 months)
15. **Learning-based admission / migration**: multi-armed bandit / RL control of τ_reject and K; train a session-aliveness predictor from traces.
16. **Cross-router residency consensus**: lightweight gossip to share `Σ.resident[d]`.
17. **Provable competitive ratio**: under an oracle KV-residency model, prove that the ratio of KVC's expected TTFT to the offline optimum is bounded.
18. **Distributed prefix tree**: map a logical prefix onto physical replicas across multiple Ds, supporting multi-tenant prefix sharing (system prompt / tool schema).
19. **Energy-aware variant**: bring GPU SM utilization + PCIe/RDMA energy into the objective function.
20. **End-to-end agent serving framing**: move from request-level latency up to agent task completion time (one coding-agent task spans 30+ turns).
---
## 5. Work that can advance without a GPU
Ordered by ROI:
- [x] This roadmap (`AUDIT_AND_ROADMAP_ZH.md`).
- [x] Collaborator entry point (`docs/INDEX_ZH.md`).
- [x] Concrete block-level eviction design (`docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`).
- [x] D→P sync interface contract (`docs/D_TO_P_SYNC_CONTRACT_ZH.md`).
- [x] Evaluation protocol (`docs/EVALUATION_PROTOCOL_ZH.md`).
- [x] Pure-function score extraction for `KvAwarePolicy` + unit tests (Algorithm 1).
- [x] Starvation-freedom property test (Theorem 1).
- [x] Stratified analysis script (bucketing along three dimensions: turn-index / append-size / overlap).
- [x] Paired-comparison protocol helper.
- [ ] Reproducible mock harness for mooncake death (runnable without a GPU).
- [ ] Classified inventory of the SGLang patch surface (each patch labeled "required" / "experimental" / "retirable").
- [ ] Failure-mode taxonomy doc (cold-D, overlap-pin, mooncake death, reseed storm, evict storm).
---
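## 6. One-sentence conclusion
> The project already has the material for a SOSP/OSDI workshop paper or poster; to reach the main track it needs to make §S1 (block-level evict) and §S2 (D→P sync) real, fill in §M3 (full Ali) and §M4 (two strong baselines), and put the control-plane fix for §S3 (mooncake cascading death) into a reproducible artifact. If only one thing can be done, do the block-level eviction refactor first: it addresses both "reseed is too frequent" and the "prerequisite for extending the P-side radix to multiple producers".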


@@ -0,0 +1,309 @@
# Block-level Eviction Refactor: Design Document
**Date**: 2026-05-12
**Prerequisite**: [KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) (architecture-level manifesto)
**Nature**: implementation-level design + API draft + test plan, for the next collaborator to code against directly
**Status**: draft, not implemented. All code is quoted from `third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py @ origin/h200-cu130`.
---
## 0. TL;DR
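Change `SessionAwareCache`'s current semantics of **freeing a streaming session's entire KV in one shot** to:
1. Streaming-session decode output is **committed incrementally into the radix tree** at turn finish.
2. `SessionSlot` is reduced to **pure metadata** (holding only `last_node` + lock_ref state) and no longer owns a KV range exclusively.
3. `release_session` only calls dec_lock_ref and deletes the slot, **letting SGLang's standard radix LRU reclaim at block granularity**.
Expected benefit: eviction granularity drops from ~67K tokens per eviction to ~24 tokens (page_size tokens), and reseed frequency drops by an order of magnitude. It also turns the P-side radix tree into one that can be fed from outside, paving the way for [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md).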
---
## 1. Walkthrough of the current code
### 1.1 Key files and functions
`third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py`
| Function / field | Current semantics |
|---|---|
| `SessionSlot.req_pool_idx` | req_pool slot owned exclusively by the streaming session |
| `SessionSlot.kv_committed_len` | KV length committed when the previous turn finished (the part counted in cache_protected_len is already in the radix) |
| `SessionSlot.kv_allocated_len` | KV length currently allocated but **not in the radix** (the "session-exclusive tail") |
| `SessionSlot.cache_protected_len` | Protected boundary set when the first turn was committed to the radix |
| `match_prefix(streaming req)` | Slot hit → returns `req_to_token[req_pool_idx, :prefix_len]`, bypassing the radix |
| `cache_unfinished_req(streaming req)` | Subsequent turns → **skips inner entirely** (nothing enters the radix) |
| `cache_finished_req(streaming req)` | Calls `slot.save_from_req`; **does not call inner.cache_finished_req** |
| `release_session(sid)` | `dec_lock_ref(slot.last_node)` + `free(req_to_token[req_pool_idx, cache_protected_len:kv_allocated_len])` + reclaims the req_pool slot |
### 1.2 Why the current behavior is wrong (restated)
`[cache_protected_len, kv_allocated_len)` covers all decode output accumulated after the first turn entered the radix, plus the extends of subsequent turns. Measured on Inferact / SWE-Bench:
- `cache_protected_len` ≈ the first-turn boilerplate, ~12K
- `kv_allocated_len` accumulates to 50–100K
- each `release_session` frees 38–88K in one shot; this portion **never entered the radix**, so it cannot benefit from gradual leaf-by-leaf eviction
→ After a session is evicted, its full length must be re-prefilled from the client's original prompt and transferred in full over mooncake, which is equivalent to naive PD-disagg (see manifesto §1).
---
## 2. Target behavior
| Scenario | Current | Target |
|---|---|---|
| Session has accumulated 50K KV and D is full | `release_session` frees 38K in one shot | radix LRU evicts starting from the oldest leaf, ~24 tokens at a time |
| Evicted session returns | Must reseed 50K | Re-prefill only the evicted leaves (typically ≤ 5K) |
| Evicted session TTFT | 50–90K reseed ≈ 3–7s | 5K append-prefill ≈ 200ms |
| Sessions that are not evicted | Turns within a session are append-only | Still append-only (unchanged) |
| Direct-to-D fast-path hit rate | 91.6% (SWE-Bench) / 38% (E3 Inferact) | Should be ≥ 85% even under saturation |
---
## 3. Design
### 3.1 Trimming the SessionSlot fields
**After refactor**:
```python
@dataclass
class SessionSlot:
    virtual_node: _VirtualNode = field(default_factory=_VirtualNode)
    # Pointer into the radix tree: the deepest node owned by this session's
    # committed prefix. Held under inc_lock_ref so radix LRU never evicts this
    # *active* leaf out from under a turn-in-progress. Released by
    # release_session.
    last_node: Any = None
    swa_uuid_for_lock: Optional[str] = None
    # Bookkeeping fields (no longer authoritative ownership of KV indices).
    last_access_time: float = field(default_factory=time.monotonic)
    # Mamba state stays slot-owned (mamba doesn't fit the radix model).
    mamba_pool_idx: Any = None
    mamba_ping_pong_track_buffer: Any = None
    mamba_next_track_idx: Any = None
    mamba_last_track_seqlen: Any = None
    mamba_branching_seqlen: Any = None
```
**Removed**: `req_pool_idx`, `kv_committed_len`, `kv_allocated_len`, `cache_protected_len`, `swa_evicted_seqlen`. The ground truth for these fields is now maintained jointly by the radix tree and req_to_token_pool.
### 3.2 Reworking `cache_finished_req`
**After refactor**:
```python
def cache_finished_req(self, req: Req, is_insert: bool = True, **kwargs):
    if not _is_streaming(req):
        return self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)
    session_id = req.session.session_id
    slot = self.slots.setdefault(session_id, SessionSlot())
    # KEY CHANGE: always delegate to inner. This inserts the new tokens
    # (kv_committed_len .. fill_ids end) as radix-tree blocks. Subsequent
    # match_prefix calls for this session will hit the radix tree directly.
    result = self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)
    # Update slot bookkeeping only (no KV ownership).
    slot.last_node = req.last_node
    slot.swa_uuid_for_lock = req.swa_uuid_for_lock
    slot.last_access_time = time.monotonic()
    # Mamba state still goes through slot.
    slot.mamba_pool_idx = req.mamba_pool_idx
    ...
    return result
```
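**Invariants**:
- `inner.cache_finished_req` inserts the page_size-aligned KV in the range `[kv_committed_len_old, kv_committed_len_new)` into the radix. These semantics come from the standard SGLang implementation, so inner needs no change.
- `slot.last_node` now points to **the tail node of the session's committed prefix** and advances after every turn.
- `dec_lock_ref(old_last_node)` + `inc_lock_ref(new_last_node)` must be executed at each turn switch.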
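The last invariant is order-sensitive: taking the new lock before dropping the old one means the committed prefix is never left unlocked in between. The sketch below uses a toy flat lock table; `ToyLockTable` and `advance_last_node` are hypothetical names, and the real radix cache keeps lock_ref on tree nodes (where locking a node typically also protects its ancestors, which this flat table does not model).

```python
from typing import Dict, Hashable, Optional


class ToyLockTable:
    """Flat stand-in for radix lock_ref bookkeeping (not SGLang's API)."""

    def __init__(self) -> None:
        self.refs: Dict[Hashable, int] = {}

    def inc_lock_ref(self, node: Hashable) -> None:
        self.refs[node] = self.refs.get(node, 0) + 1

    def dec_lock_ref(self, node: Hashable) -> None:
        if self.refs.get(node, 0) <= 0:
            raise RuntimeError(f"dec_lock_ref without matching inc: {node!r}")
        self.refs[node] -= 1

    def evictable(self, node: Hashable) -> bool:
        return self.refs.get(node, 0) == 0


def advance_last_node(
    locks: ToyLockTable,
    old_node: Optional[Hashable],
    new_node: Hashable,
) -> Hashable:
    """Move a session's lock from old_node to new_node, inc before dec."""
    if new_node == old_node:
        return new_node          # nothing committed this turn; keep the lock
    locks.inc_lock_ref(new_node)  # protect the new tail first
    if old_node is not None:
        locks.dec_lock_ref(old_node)
    return new_node


locks = ToyLockTable()
node = advance_last_node(locks, None, "turn0-tail")
node = advance_last_node(locks, node, "turn1-tail")
print(locks.evictable("turn0-tail"), locks.evictable("turn1-tail"))
```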
### 3.3 Reworking `cache_unfinished_req`
Subsequent turns of a streaming session **no longer skip inner**. Reason: `match_prefix` now goes through the radix, so the intermediate chunked-prefill state must also be maintained by inner:
```python
def cache_unfinished_req(self, req: Req, **kwargs):
    if _is_streaming(req) and kwargs.get("chunked", False):
        # Chunked prefill: forward to inner so the per-chunk extend gets
        # tracked in the radix LRU access timestamps.
        ...
    self.inner.cache_unfinished_req(req, **kwargs)
```
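The chunked-handling details must keep the logic that rebuilds `prefix_indices` (see lines 215–225 of the current implementation), but the call to `inner.cache_unfinished_req` must not be skipped.
### 3.4 Reworking `match_prefix`
Reduced to **pure forwarding to inner**, since SessionSlot no longer holds KV pointers: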
```python
def match_prefix(self, params: MatchPrefixParams) -> MatchResult:
    # No more slot-fast-path. Streaming sessions reuse KV via radix tree
    # match like every other request.
    return self.inner.match_prefix(params)
```
Callers that need "this session's committed prefix length" now derive it from `inner.match_prefix(...).device_indices.shape[0]`.
### 3.5 Reworking `release_session`
**After refactor**:
```python
def release_session(self, session_id: str) -> int:
    slot = self.slots.pop(session_id, None)
    if slot is None:
        return 0
    # Just release our radix lock; radix LRU can now reclaim our prefix
    # leaves at its own pace. NO direct token_to_kv_pool free.
    if slot.last_node is not None:
        if slot.swa_uuid_for_lock is not None:
            self.inner.dec_lock_ref(
                slot.last_node,
                DecLockRefParams(swa_uuid_for_lock=slot.swa_uuid_for_lock),
            )
        else:
            self.inner.dec_lock_ref(slot.last_node)
    # Mamba state still needs explicit cleanup if present.
    if slot.mamba_pool_idx is not None:
        ...
    return 0  # "freed_tokens" no longer meaningful; radix LRU sheds lazily
```
### 3.6 Reworking `get_session_status` / `list_session_statuses`
The ground truth for `resident_tokens` now comes from the radix tree. inner needs to expose a helper:
```python
# In BasePrefixCache / RadixCache:
def tokens_under(self, node) -> int:
    """Count tokens in the path from root to `node` (inclusive)."""
    ...

# In SessionAwareCache:
def get_session_status(self, session_id: str) -> Optional[Dict[str, Any]]:
    slot = self.slots.get(session_id)
    if slot is None:
        return None
    resident_tokens = self.inner.tokens_under(slot.last_node) if slot.last_node else 0
    return {
        "session_id": session_id,
        "resident": resident_tokens > 0,
        "resident_tokens": int(resident_tokens),
        "last_access_time": float(slot.last_access_time),
    }
```
The capacity check in `admit_direct_append` switches to the radix ground truth for `resident_tokens` (removing any chance of `kv_committed_len` and `kv_allocated_len` disagreeing).
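A minimal sketch of what `tokens_under` might compute, written against a toy node type with `key` and `parent` attributes. `ToyNode` is a stand-in defined here; the real tree node's fields should be checked against the vendored `radix_cache.py` before relying on this shape.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyNode:
    """Stand-in for a radix-tree node: the tokens on the edge into it."""
    key: Tuple[int, ...] = ()
    parent: Optional["ToyNode"] = None


def tokens_under(node: Optional[ToyNode]) -> int:
    """Count tokens on the path from the root to `node`, inclusive."""
    total = 0
    while node is not None:
        total += len(node.key)
        node = node.parent
    return total


root = ToyNode()
first_turn = ToyNode(key=tuple(range(24)), parent=root)
second_turn = ToyNode(key=tuple(range(48)), parent=first_turn)
print(tokens_under(second_turn))
```

Walking parent pointers keeps the helper O(depth) and read-only, so it can be called from status endpoints without touching lock state.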
### 3.7 Accompanying changes to the SGLang scheduling path
See `schedule_batch.py:1572-1646`. The current streaming-session correction (introduced in commits b8e6f13 / 986f351) is built on SessionSlot owning an independent KV range. After the block-level refactor this correction path **no longer needs to exist at all**: the req's fill_ids / prefix_indices get consistent values directly from the inner radix `match_prefix`.
**Items to remove**:
- `schedule_batch.py:1572-1585`: the `actual_extend_len = max(0, len(fill_ids) - len(prefix_indices))` correction block.
- `schedule_batch.py:1646`: `assert seq_len - pre_len == req.extend_input_len` (after the refactor this invariant holds by construction).
- The latent landmine triggered by E3 ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2) disappears with it.
---
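## 4. Invariants (must be covered by the PR's self-tests)
| Inv | Statement |
|---|---|
| I1 | After `release_session(sid)`, the `match_prefix` behavior of the next request from the same session depends only on what is resident in the radix tree, not on the `slots` dict. |
| I2 | After the `cache_finished_req` call for any (session_id, turn_id), the radix tree must contain a root→leaf path covering all committed tokens of that turn (`tokens_under(slot.last_node)` never decreases). |
| I3 | `restore_to_req` must be **idempotent**: under chunked-prefill retry it may be called multiple times on the same req, and the final req state is equivalent. The current implementation achieves this by "not clearing slot fields"; after the refactor it is guaranteed by the pure-function nature of the radix `match_prefix`. |
| I4 | Requests without a streaming session (`req.session is None`) behave **unchanged**: every path short-circuits to inner. |
| I5 | After any turn ends, every `inc_lock_ref` on `slot.last_node` must have a matching `dec_lock_ref`, and `release_session` is the final release point. |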
---
## 5. Test plan (runnable without a GPU)
### 5.1 Unit tests (mock inner cache)
Write a `MockRadixCache(BasePrefixCache)` that records the sequence of all `cache_finished_req / cache_unfinished_req / match_prefix / evict / dec_lock_ref` calls. Then:
| Test | Assertion |
|---|---|
| `test_release_session_no_direct_free` | After calling `release_session`, the mock sees **no** direct `free(kv_indices)` call, only `dec_lock_ref` |
| `test_subsequent_turn_inserts_radix` | Simulate three `cache_finished_req` calls for turns 0 → 1 → 2 and assert each one triggers `inner.cache_finished_req` |
| `test_match_prefix_uses_inner` | Both streaming and non-streaming requests go only through `inner.match_prefix` |
| `test_restore_idempotent` | Simulate a chunked-prefill retry; two consecutive `match_prefix` calls return identical `device_indices` |
| `test_eviction_under_pressure_is_block_level` | Inject a "pool is full, must evict 24 tokens" state; assert `release_session` is not triggered and inner's LRU takes a single step |
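The first row of the table could be written along these lines. `RecordingInner` and `ToySessionCache` are simplified stand-ins invented for this sketch (the real test would subclass `BasePrefixCache` and exercise `SessionAwareCache`), so only the assertion pattern is meant to carry over.

```python
from typing import Any, Dict, List, Tuple


class RecordingInner:
    """Records every call so tests can assert on the call sequence."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Any]] = []

    def dec_lock_ref(self, node: Any, params: Any = None) -> None:
        self.calls.append(("dec_lock_ref", node))

    def free(self, kv_indices: Any) -> None:
        self.calls.append(("free", kv_indices))


class ToySessionCache:
    """Stand-in for the post-refactor release_session behavior."""

    def __init__(self, inner: RecordingInner) -> None:
        self.inner = inner
        self.slots: Dict[str, Any] = {}  # session_id -> last_node

    def release_session(self, session_id: str) -> int:
        last_node = self.slots.pop(session_id, None)
        if last_node is not None:
            self.inner.dec_lock_ref(last_node)  # unlock only, never free
        return 0


def test_release_session_no_direct_free() -> None:
    inner = RecordingInner()
    cache = ToySessionCache(inner)
    cache.slots["s1"] = "leaf-42"
    assert cache.release_session("s1") == 0
    assert inner.calls == [("dec_lock_ref", "leaf-42")]
    assert all(name != "free" for name, _ in inner.calls)
    # Releasing an unknown session is a no-op.
    assert cache.release_session("missing") == 0
    assert len(inner.calls) == 1
```

Recording calls rather than state keeps the test independent of how the real radix cache stores KV indices.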
### 5.2 Property-based tests
```python
@given(turns=lists(integers(min_value=24, max_value=2048), min_size=1, max_size=50))
def test_committed_tokens_monotone(turns):
    """tokens_under(slot.last_node) is monotonically non-decreasing across turns."""
    ...
```
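### 5.3 Integration smoke (needs a GPU, but lives in the sweep scripts)
Run `sweep_e2_kvc_v2_rdma.sh` with the same trace and configuration, and compare:
- total number of evictions (expected: 90 → < 10)
- average tokens per eviction (expected: 67K → < 500)
- TTFT p99 (expected: 1.28s → < 0.7s)
- direct-to-D hit rate (expected: ≥ 85%)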
---
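## 6. Effort and risks
### 6.1 Effort
| Work item | Estimate | Risk |
|---|---|---|
| §3.1–§3.6 SessionAwareCache rework | 2–3 | Requires familiarity with the radix-internal lock_ref / evict protocol |
| §3.7 schedule_batch cleanup | 0.5 | Code deletion only |
| §4 invariant unit tests | 2 | |
| §5.3 GPU smoke + data comparison | 2 | mooncake may still trigger the E2 cascading death; needs to be run together with the §S3 fix |
| **Total** | **~1** | |
### 6.2 Key risks
1. **Compatibility of `inner.cache_finished_req` with streaming-session reqs**: the standard SGLang radix currently assumes that a req reaching cache_finished_req has "fully completed prefill + decode". A streaming-session req still leaves an "unfinished conversation" behind at the end of each turn, so inner must not treat decode-only tokens as a discardable tail when inserting. This requires auditing the implementation of `radix_cache.py:cache_finished_req`.
2. **lock_ref ordering**: the `match_prefix` at the start of turn N+1 does inc_lock_ref(new_node), and the end of turn N does dec_lock_ref(old_node). If the order is reversed, the LRU can evict the freshly committed leaf under concurrency. Suggested assertion: `inc_lock_ref` must arrive before `dec_lock_ref`.
3. **chunked-prefill retry** (I3): SGLang's current `restore_to_req` avoids clearing slot fields precisely for this retry. After the refactor, confirm that the inner radix `match_prefix` is also idempotent under retry (the standard radix tree is, but write a test that pins this property down explicitly).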
---
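## 7. Relationship to the D→P sync work
Block-level evict is a **prerequisite** for [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md):
- D→P sync needs the P-side radix tree to **accept externally fed KV blocks**.
- The P-side radix currently assumes a single producer (the local worker's model output).
- Once the block-level refactor is done, streaming-session KV already flows through the standard radix path; letting the radix tree accept an additional "externally fed" producer then only extends the insert API rather than inventing a new storage path.
The two pieces can be done in sequence: block-level evict first, then D→P sync.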
---
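## 8. Minimal actions for the agent taking over
1. Fork a `feat/block-level-evict` branch (from `improve/audit-and-foundations` or `h200-cu130`).
2. Implement §3.1–§3.6.
3. Write the §5.1 + §5.2 unit tests.
4. Run the §5.3 smoke on 8×H100 / H200 and compare eviction frequency and TTFT p99.
5. If risk 1 of §6.2 materializes, check SGLang's `radix_cache.py` to see whether streaming-session reqs need an `is_session_active=True` flag to prevent "discarding the decode tail".
---
**Key takeaway**: keep the session as a lifecycle boundary, but do **not** let it act as the eviction boundary (hand that over to the radix LRU). This refactor resolves two blockers at once: "reseed is too frequent" and "the P-side radix cannot be fed externally".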


@@ -0,0 +1,247 @@
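# D→P Incremental KV Sync: Interface Contract and Rollout Plan
**Date**: 2026-05-12
**Prerequisites**: [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) (gap identification) + [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) (precondition)
**Nature**: cross-layer interface contract + formalization of the staleness budget + staged rollout
**Status**: draft. The `feat/d-to-p-sync` branch is currently empty; this document is the design doc that should land on it first.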
---
## 0. TL;DR
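50% of the reseed slow path's time is spent in P re-prefill, so **fixing the transfer segment (enabling RDMA) solves only half of the problem**. The only way to eliminate the long tail entirely is to let the P-side backup keep up incrementally with appends on the D side:
> D completes a turn on the direct-to-D path → asynchronously pushes the newly committed KV blocks back to the P-side radix → on the next reseed the P-side radix hits the complete prefix, so no re-prefill is needed, only a single P→D transfer.
This document gives the interface contracts for the three layers (mooncake / SGLang / agentic-pd-hybrid), a formal definition of a **staleness budget β**, and a four-stage rollout plan, so that this work can proceed decoupled from block-level eviction.
---
## 1. Staleness budget β: formal definition
Let `L_D(s, t)` be the committed prefix length of session `s` on D (the instantaneous value at time `t`), and let `L_P(s, t)` be the backup prefix length of the same session on P: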
```
staleness(s, t) := L_D(s, t) - L_P(s, t) ≥ 0
```
**Staleness budget β** is the upper bound the system commits to maintaining:
```
∀ s, ∀ t : staleness(s, t) ≤ β
```
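Intuition: the smaller β is, the more likely a reseed hits the P-side backup, and the reseed degenerates into a single P→D transfer plus a re-prefill of ≤ β tokens.
- **β = 0**: fully synchronous; D blocks waiting for P's ack on every committed block. High latency cost, not recommended.
- **β = ∞**: the current state (the P-side backup is forever the static seed-time snapshot).
- **β = one page (24 tokens)**: single-block sync. The theoretically optimal granularity, but every append on D triggers a D→P RPC.
- **β = O(append_len) (typically 1K–4K)**: batched sync. Recommended starting point; aggregate the decode output of one turn and push it as a batch.
- **β = O(turn_size) (typically ~50K)**: coarse-grained sync. The reseed bypass stops working and only transfer is reduced. Not advisable.
→ Recommended for rollout: β = `max(page_size, min(committed_in_turn, β_max))`, with `β_max` defaulting to 4096.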
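The recommended rule is a simple clamp, shown here as a sketch. The function names are hypothetical; the defaults (page_size = 24, β_max = 4096) are the values quoted above.

```python
def staleness(l_d: int, l_p: int) -> int:
    """staleness(s, t) := L_D(s, t) - L_P(s, t), which must be >= 0."""
    if l_p > l_d:
        raise ValueError("P backup cannot be ahead of D's committed prefix")
    return l_d - l_p


def rollout_beta(committed_in_turn: int,
                 page_size: int = 24,
                 beta_max: int = 4096) -> int:
    """β = max(page_size, min(committed_in_turn, β_max))."""
    return max(page_size, min(committed_in_turn, beta_max))


for committed in (10, 1500, 50_000):
    print(committed, rollout_beta(committed))
```

The lower clamp avoids syncing less than one page, and the upper clamp keeps a very long turn from pushing the budget into the coarse-grained regime the list above rules out.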
---
## 2. Interface contracts for the three layers
### 2.1 Mooncake layer: dual roles
**Current state** (see [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) §3):
- `MooncakeKVManager` is bound at initialization to a single role, `disaggregation_mode ∈ {PREFILL, DECODE}`.
- `MooncakeKVSender` is instantiated only in PREFILL mode, and `MooncakeKVReceiver` only in DECODE mode.
- `add_transfer_request` contains the hard constraint `assert disaggregation_mode == PREFILL`.
**Target interface**:
```python
# third_party/sglang/python/sglang/srt/disaggregation/base/conn.py
from enum import Enum


class KVRole(Enum):
    PREFILL = "prefill"
    DECODE = "decode"
    PREFILL_BACKUP_RECEIVER = "prefill_backup_receiver"  # new: P side receives D→P sync
    DECODE_BACKUP_SENDER = "decode_backup_sender"  # new: D side sends D→P sync


class BaseKVManager:
    roles: set[KVRole]  # replaces the former single-valued field; allows {PREFILL, DECODE}
```
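**New classes** (~400 LOC at the implementation layer):
| Class | Role | Key method |
|---|---|---|
| `DecodeKVSender` | D side pushes new post-append KV blocks back to P | `enqueue_sync(session_id, kv_blocks, target_p)` enqueues asynchronously and returns a `sync_id` |
| `PrefillKVReceiver` | P side receives D→P sync packets | `recv_loop()` background thread; each packet triggers a callback that injects into the radix tree |
**Bootstrap channel**: a second bootstrap socket, independent of the existing P→D channel, is needed (to avoid conflicts in buffer-pointer negotiation). Configuration:
- Disabled by default; enabled via the ServerArgs flag `--enable-d2p-sync`
- New port range `BOOTSTRAP_D2P_PORT_BASE = 22000`
### 2.2 SGLang layer: multi-producer radix extension
**Current state**: the P-side radix assumes a single producer (the local worker's model output). `RadixCache.cache_finished_req` internally takes KV indices straight from `req_to_token_pool[req_pool_idx, :]` and inserts them into the tree.
**Target interface** (after [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) is complete):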
```python
class RadixCache(BasePrefixCache):
    def insert_external(
        self,
        token_ids: Sequence[int],
        kv_tensor: torch.Tensor,
        *,
        source_worker_id: str,
        session_id: str,
        pin: bool = False,
    ) -> InsertExternalResult:
        """
        Insert KV blocks supplied by an external worker (D→P sync).

        Allocates fresh slots in token_to_kv_pool, copies kv_tensor into them,
        and threads the resulting indices through the radix tree exactly like
        cache_finished_req would for a local prefill.

        Invariants:
        - Same model layout (verified at handshake time, not per-call).
        - On collision with existing radix path, no-op for the shared prefix
          and only insert the diverging suffix.
        - Inserted nodes get lock_ref += 1 if `pin=True`, default False.
          D→P sync is best-effort; LRU is allowed to evict the inserted leaves.
        """
```
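**Key design decisions**:
| Decision | Options | Recommendation |
|---|---|---|
| KV index remapping | A) D sends its original indices and P remaps them; B) D sends a densely packed tensor and P allocates fresh slots | **B**: avoids leaking indices across workers |
| Failure handling | A) a D→P failure degrades to a full re-prefill; B) retry N times | **A**, plus falling back to the legacy path if a later reseed misses on P |
| Reference counting | Should KV synced into P be pinned? | **No pin**: the P-side LRU manages it naturally, so backups cannot crowd out production KV |
| Coordination with eviction | What if P is full when a sync arrives? | Let the sync insert trigger inner.evict, so it competes with locally produced KV under fair LRU |
| Multiple P instances per session | What if the router round-robins turns across different P's? | **Accept multi-source**: each P maintains its own backup, and reseed picks the one with the lowest staleness |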
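The collision invariant stated in the `insert_external` docstring (the shared prefix is a no-op, only the diverging suffix is inserted) can be illustrated with a toy token-level trie. This is a sketch only: `ToyRadix` is a hypothetical stand-in, not SGLang's `RadixCache`, and it ignores paging, KV tensors, and eviction.

```python
class ToyRadix:
    """Hypothetical token-level trie; each node is a dict keyed by token id."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert_external(self, token_ids: list) -> int:
        """Insert token_ids and return how many tokens were newly inserted."""
        node = self.root
        inserted = 0
        for tok in token_ids:
            if tok not in node:
                node[tok] = {}  # diverging suffix: create a new node
                inserted += 1
            node = node[tok]    # shared prefix: just walk down, no-op
        return inserted


tree = ToyRadix()
first = tree.insert_external([1, 2, 3, 4])      # nothing cached yet
second = tree.insert_external([1, 2, 3, 5, 6])  # shares the prefix [1, 2, 3]
repeat = tree.insert_external([1, 2, 3, 4])     # already fully resident
```

Only the tokens past the point of divergence are counted as inserted, which is the behavior the best-effort D→P sync relies on to avoid duplicating KV that P already holds.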
### 2.3 agentic-pd-hybrid layer: hooks and state machine
**New CLI flags**:
```bash
--enable-d2p-sync                       # off by default
--d2p-staleness-budget-tokens 4096      # β_max
--d2p-sync-batch-min-tokens 24          # trigger only once at least 1 page has accumulated
--d2p-sync-target-policy {last_p, round_robin, broadcast}
    # last_p: push back to the P that last seeded this session
    # broadcast: push to every P (flexible at reseed time, but bandwidth-heavy)
```
**New state fields** (`DirectSessionState` in `replay.py`):
```python
from dataclasses import dataclass, field


@dataclass
class DirectSessionState:
    ...
    # NEW: per-P backup view, populated by D->P sync callbacks.
    prefill_resident_tokens_by_p: dict[str, int] = field(default_factory=dict)
    last_d2p_sync_at: float | None = None
```
**Hook after `_invoke_session_direct` completes**:
```python
async def _invoke_session_direct(...):
    ...
    response = await self._stream_direct_to_d(...)
    if response.ok and self.config.enable_d2p_sync:
        new_committed = response.kv_committed_len
        prev_p_resident = max(session.prefill_resident_tokens_by_p.values(), default=0)
        staleness = new_committed - prev_p_resident
        if staleness >= self.config.d2p_sync_batch_min_tokens:
            target_p = self._choose_d2p_target(session)
            asyncio.create_task(
                self._issue_d2p_sync(session, target_p, prev_p_resident, new_committed)
            )
```
**Hook on the reseed path** (`_invoke_kvcache_seeded_router`):
```python
async def _invoke_kvcache_seeded_router(..., request):
    ...
    if self.config.enable_d2p_sync:
        # Probe P-side residency before issuing full re-prefill.
        probe = await self._probe_prefill_residency(session_id)
        # β_max; config name assumed to mirror --d2p-staleness-budget-tokens.
        beta_max = self.config.d2p_staleness_budget_tokens
        if probe.resident_tokens >= request.prefix_len - beta_max:
            # Use the up-to-date backup: skip re-prefill, just trigger P→D transfer.
            return await self._invoke_p_to_d_transfer_only(...)
    # Fall back to existing path.
    return await self._invoke_kvcache_seeded_router_legacy(...)
```
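The two threshold checks used by the hooks in this section can be isolated as pure functions, which makes them easy to unit-test. The sketch below is an assumption-laden illustration: the function names `should_issue_sync` and `can_skip_reprefill` are hypothetical, and the defaults are taken from the CLI flags listed earlier in §2.3.

```python
def should_issue_sync(new_committed: int, prev_p_resident: int, min_tokens: int = 24) -> bool:
    """True when D has committed at least min_tokens more than P currently holds."""
    return new_committed - prev_p_resident >= min_tokens


def can_skip_reprefill(resident_tokens: int, prefix_len: int, beta_max: int = 4096) -> bool:
    """True when the P-side backup lags the requested prefix by at most beta_max tokens."""
    return resident_tokens >= prefix_len - beta_max


# Hypothetical numbers for a session with a 50K-token prefix.
sync_needed = should_issue_sync(new_committed=50_000, prev_p_resident=48_000)
sync_not_needed = should_issue_sync(new_committed=50_000, prev_p_resident=49_990)
fresh_backup = can_skip_reprefill(resident_tokens=48_000, prefix_len=50_000)
stale_backup = can_skip_reprefill(resident_tokens=40_000, prefix_len=50_000)
```

A backup that is 2,000 tokens behind stays within the 4,096-token budget and can take the transfer-only path, while one that is 10,000 tokens behind falls back to the legacy reseed.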
---
## 3. Properties (to be proven)
### 3.1 Theorem 4 candidate (paper form)
*Assume the staleness budget β is maintained. For a session `s` that has accumulated length L on D and triggers a reseed after being evicted:*
```
reseed_cost(s) ≤ T_p2d(L) + T_prefill(min(β, L))
```
*where T_p2d is the P→D transfer time (~L · 4 ns/token under RDMA) and T_prefill is the prefill time (~50K tokens/s on H100 TP1 Qwen3-30B). When β ≪ L, the cost degenerates to being dominated by a single P→D transfer.*
**Compared with the baseline** (no D→P sync): `reseed_cost = T_p2d(L) + T_prefill(L − seed_size)`, where re-prefill dominates.
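As a worked example, the bound and the baseline can be evaluated with the per-token figures quoted above. This is a sketch under stated assumptions: L = 50,000, β = 4,096 and `seed_size = 0` are hypothetical inputs, the baseline prefill term is read as `L - seed_size`, and both cost models are the simple linear ones implied by the text.

```python
RDMA_NS_PER_TOKEN = 4          # from the text: T_p2d ≈ L · 4 ns/token
PREFILL_TOKENS_PER_S = 50_000  # from the text: ~50K tokens/s


def t_p2d(tokens: int) -> float:
    """P→D transfer time in seconds (linear model)."""
    return tokens * RDMA_NS_PER_TOKEN * 1e-9


def t_prefill(tokens: int) -> float:
    """Prefill time in seconds (linear model)."""
    return tokens / PREFILL_TOKENS_PER_S


def reseed_bound(length: int, beta: int) -> float:
    """Upper bound with D→P sync: T_p2d(L) + T_prefill(min(beta, L))."""
    return t_p2d(length) + t_prefill(min(beta, length))


def reseed_baseline(length: int, seed_size: int) -> float:
    """Baseline without D→P sync: T_p2d(L) + T_prefill(L - seed_size)."""
    return t_p2d(length) + t_prefill(length - seed_size)


bound_s = reseed_bound(50_000, 4_096)
baseline_s = reseed_baseline(50_000, 0)
```

Under these assumed inputs the bound is about 0.082 s and the baseline about 1.0 s, a reduction of roughly an order of magnitude.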
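### 3.2 Relation to Theorem 2
Theorem 2 only guarantees fast hits on the direct-to-D path. Theorem 4 additionally pushes the fallback cost on a fast-path miss down to the sub-second range, making it possible for KVC to beat DP at **every quantile**.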
---
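## 4. Four-phase rollout
| Phase | Scope | GPU requirement | Acceptance criteria |
|---|---|---|---|
| **P1** | block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) | 4×H100 smoke | average single eviction ≤ 500 tokens |
| **P2** | mooncake dual-role support + microbench (D→P single-packet RTT, bandwidth utilization) | single node + RDMA | P→D RTT < 50 ms (local); 16K-token block bandwidth at 50% of the theoretical ceiling |
| **P3** | SGLang `insert_external` + agentic-pd-hybrid hook (best-effort; reseed probe not yet wired in) | 4×H100 + RDMA | > 80% of triggered syncs complete within the same turn; no new failure mode introduced |
| **P4** | reseed probe wired in + end-to-end evaluation | 4×H100 + RDMA | single reseed < 0.5 s (vs. the current 37s); TTFT p99 < 0.5 s |
**Key decision point**: between P1 and P2, run an audit to confirm that the SGLang radix `insert_external` does not conflict with the streaming-session decode path. If a serious conflict is found, introduce a "P-only sync mode" as a placeholder and open it up once the architecture has stabilized.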
---
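## 5. Risks and mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Mooncake dual-role support breaks the existing one-way P→D path | E2 already exposed the mooncake "instance not alive" cascade; adding another channel may amplify it | In P2, start with an independent bootstrap channel + feature flag, and keep the disable path |
| D→P sync consumes D egress bandwidth and hurts direct-to-D append-prefill latency | Directly degrades the main path | Run sync on a low-priority QP (RDMA SL=0); batch-triggered, at most once per turn |
| The P radix fills with backups and squeezes out locally produced KV | P prefill slows down | Sync inserts are not pinned (§2.2); fair LRU competition |
| Coordinating P backup views is complex | The router must account for staleness when choosing target_p | Start with the `last_p` policy (recency-biased); observe the measured distribution before deciding whether to adopt `broadcast` |
| `insert_external` drifts from the upstream API when the SGLang patch is upgraded | Maintenance burden | Confine the API to our vendor-patch boundary, keep the upstream radix unpolluted, and write a contract test |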
---
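## 6. Decoupling from block-level eviction
| Work item | Depends on the other? |
|---|---|
| block-level eviction | Does not depend on D→P sync; can ship independently and reduces reseed frequency on its own |
| D→P sync | **Depends on** block-level eviction (needs the P radix to be the source of truth for streaming-session KV) |
| Doing both | Largest payoff: reseed frequency drops by an order of magnitude, and per-reseed time drops by an order of magnitude |
Rollout order: block-level eviction lands first; D→P sync then proceeds on the `feat/d-to-p-sync` branch. The two **should not** be merged into a single PR.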
---
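## 7. Minimal actions for the successor agent
1. Land this document on the `feat/d-to-p-sync` branch.
2. Once block-level eviction is on main, start phase P2 (mooncake dual-role support + microbench, tested in isolation from the SGLang main path).
3. In phase P3, add `insert_external` and the hook, landing on main disabled by default.
4. Decide the reseed probe policy (`last_p` vs `broadcast`) only after the P4 end-to-end evaluation.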
---
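**Key takeaway**: D→P incremental sync is not as simple as "adding one more network channel"; the crux is extending the P radix from a single producer to one that accepts best-effort external feeds. Block-level eviction is the prerequisite for this, so the two efforts can run one after the other but not in reverse order.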

docs/E1_E2_FIX_DESIGN_ZH.md Normal file

@@ -0,0 +1,137 @@
# E1 / E2 Failure Modes — Fix Design Space (no code changes)
**Status**: design proposal for review.
**Branch**: `h200-cu130`.
**Companion**: `docs/E1_E2_RESULTS_ZH.md` §5b–§5d for the forensic findings this design responds to.
This document evaluates candidate fixes for the two pathologies E1 / E2 exposed:
- **Q1**: D scheduler thread starves the mooncake C++ control plane during LRU evictions, causing P-side `batch_transfer_sync` to time out (~30 s) and the hair-trigger in `conn.py:1270` to permanently blacklist the D's mooncake_session_id.
- **Q2**: KvAwarePolicy's overlap-first lex score, combined with workloads where new sessions share boilerplate hash_ids with already-resident sessions on D0/D1, leaves D2 cold for the entire run.
For each problem we list candidate fixes, the layer they touch, their assumptions, and what could go wrong. **No code is committed** until a path is chosen.
---
## Q1 — Eviction starves mooncake control plane
### Mechanism recap
Inside `decode-0.log` at the moment of P-side timeout (`Sync batch data transfer timeout after 37452515723ns`):
```
01:56:34 Decode batch ... gen 174 tok/s ← serving fine
01:56:42 session id 1000315 does not exist, cannot delete.
01:56:42 Trimmed decode session cache via LRU. evicted=2, freed=77675, available 38574 → 116249
01:56:42 Trimmed decode session cache via LRU. evicted=1, freed=36166, available 29038 → 65204
01:56:42 Decode transfer failed ... ← P-side timeout fires
```
`maybe_trim_decode_session_cache` (in vendored sglang scheduler) walks per-session resident bookkeeping, releases GPU KV slots via `kv_pool_allocator.free()`, and updates `session_aware_cache` under lock. While that runs, the scheduler main loop is busy and the mooncake control-plane callbacks scheduled into the same event loop don't get serviced. P sees no completion ack within 30 s → `batch_transfer_sync` returns nonzero → hair-trigger fires.
### Design space
| # | Fix | Layer | Mechanism | Assumes | Risks |
|---|---|---|---|---|---|
| **Q1.A** | Pre-emptive low-watermark eviction | vendored SGLang | Trigger LRU when `token_usage > 0.7` in idle scheduler ticks, so admission rarely needs to evict inline. SGLang already has `_decode_session_cache_low_watermark_tokens`; question is whether it currently runs proactively or only on-demand. | Idle ticks exist to absorb the work; the per-trim cost is bounded enough that doing it pre-emptively doesn't hurt the steady-state. | If proactive trims pick "warm" sessions (recently active), we lose direct-to-D fast-path hits. Need careful watermark + LRU-priority tuning. |
| **Q1.B** | Async eviction thread | vendored SGLang | Move LRU trim off the scheduler main loop into a background worker. Scheduler main loop only calls `notify_evict_needed()`; mooncake control plane keeps running. | KV pool free / session_aware_cache mutations can be made thread-safe with reasonable lock granularity. | Largest blast radius. Concurrent in-flight transfers can race with eviction of the same KV slots; need explicit ref-counting. Harder to reason about correctness. |
| **Q1.C** | Bump mooncake transfer timeout | mooncake env / wheel patch | Set `MC_TRANSFER_TIMEOUT_NS` (or equivalent) from 30 s default → 120 s+, giving D's eviction more headroom before P gives up. | A real broken link won't go unnoticed for ≥120 s. | Pure defense-in-depth. Doesn't fix LRU thrashing; under heavier load eviction could exceed 120 s too. Slows real-failure detection. |
| **Q1.D** | Windowed hair-trigger | vendored SGLang `conn.py:1270` | Replace `if session_failures >= 1:` with `if session_failures ≥ N within window`. Add periodic probe to D bootstrap port to clear `failed_sessions` after success. | Transient stalls are recoverable; real deaths are not. | Changes core failure semantics. We may keep dispatching to a D that is actually slow-dying. Adds windowed-state bookkeeping to a stable codepath. |
| **Q1.E** | Router-side backpressure | our `--enable-backpressure` (already exists, off by default) | D returns `recommended_pause_ms` in its admission RPC when pool > threshold; router pauses dispatch to that D. Already implemented. | Pausing dispatch upstream prevents D from ever reaching saturation, so LRU never thrashes. | Doesn't help in-flight transfers when stall happens; only prevents future arrivals. Won't rescue requests already mid-mooncake when LRU fires. |
| **Q1.F** | Upstream load balance (= Q2 fix) | our `policies.py` | Spread sessions to D2 so D0/D1's KV pool never saturates; LRU never trims; mooncake never stalls; hair-trigger never fires. | Q2 fix is sound and the workload's KV demand fits into 3 D's evenly. | The LRU+mooncake interaction stays latent. A different workload that still imbalances (e.g. a few sessions much larger than others) could re-trigger. |
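To make option Q1.D concrete, here is a minimal sketch of a windowed failure counter that only flags a session after N failures inside a sliding window. It is illustrative only: `WindowedFailureTracker`, its parameters, and the default thresholds are hypothetical and are not part of the vendored `conn.py`.

```python
from collections import deque


class WindowedFailureTracker:
    """Hypothetical Q1.D sketch: flag a session only after N failures within a window."""

    def __init__(self, max_failures: int = 3, window_s: float = 60.0) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures: dict = {}  # session_id -> deque of failure timestamps

    def record_failure(self, session_id: str, now: float) -> bool:
        """Record a failure at time `now`; return True if the session should be marked failed."""
        stamps = self._failures.setdefault(session_id, deque())
        stamps.append(now)
        # Drop failures that have aged out of the window.
        while stamps and now - stamps[0] > self.window_s:
            stamps.popleft()
        return len(stamps) >= self.max_failures

    def record_success(self, session_id: str) -> None:
        """A successful probe or transfer clears the session's failure history."""
        self._failures.pop(session_id, None)


tracker = WindowedFailureTracker(max_failures=3, window_s=60.0)
single = tracker.record_failure("d1", now=0.0)    # one transient stall
spaced = tracker.record_failure("d1", now=100.0)  # the first failure has expired
burst = [tracker.record_failure("d2", now=t) for t in (0.0, 1.0, 2.0)]
```

Isolated or widely spaced failures no longer trip the blacklist, while a burst of three within the window still does. The risk noted in the table remains: a slowly dying D could stay below the threshold.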
### Recommendation for Q1
**Primary: Q1.F (do Q2 fix first).** This is upstream of the failure cascade and removes the only situation in which we observe LRU thrashing in our experiments. If Q2 is fixed and re-running E2 still shows mooncake stalls, then we *know* it's a real symptom and need defense-in-depth.
**Defense-in-depth (cheap): Q1.C (bump mooncake timeout).** Single env-var change, gives 4× safety margin, costs nothing. Safe to do regardless.
**Avoid for now: Q1.B and Q1.D.** Both touch vendored SGLang in invasive ways that change failure-detection semantics. Hold until Q1.F + Q1.C demonstrate they aren't enough.
**Open question for the team**: does SGLang's existing `low_watermark` LRU trigger (Q1.A) already run proactively? If we read the scheduler loop and find it only trims on demand, Q1.A is a small targeted change worth doing; if it's already proactive, the trims we observe are because watermark is set too high → tune the constant.
---
## Q2 — Cold-D never gets a session
### What we already know is wrong
User's observation: the existing `migration_reject_threshold=3` mechanism fires *after 3 wasted prefills*, which is too late. The fix needs to be *proactive*: the first request to a fresh session should already prefer the cold D over a hot D whose only advantage is shared boilerplate overlap.
### Design space
Let `assigned[D] = state.decode_assignment_counts[D]` and `inflight[D] = state.inflight_decode[D]`. Lex score is currently:
```
score(D) = (overlap + α·sticky, sticky, -inflight, -assigned)
```
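Because the score is a tuple, comparison is lexicographic and the first element decides whenever it differs. The sketch below shows why a cold D2 loses to a loaded D0/D1 as soon as boilerplate overlap is nonzero. All numbers are hypothetical (the 50-block overlap echoes the boilerplate size mentioned later in this document), and `alpha=1000` is an arbitrary placeholder for α.

```python
def lex_score(overlap: int, sticky: int, inflight: int, assigned: int, alpha: int = 1000) -> tuple:
    """Mirror of the score shape above: (overlap + α·sticky, sticky, -inflight, -assigned)."""
    return (overlap + alpha * sticky, sticky, -inflight, -assigned)


# A fresh session (sticky = 0) whose boilerplate overlaps sessions resident on D0/D1 only.
scores = {
    "D0": lex_score(overlap=50, sticky=0, inflight=12, assigned=20),
    "D1": lex_score(overlap=50, sticky=0, inflight=10, assigned=25),
    "D2": lex_score(overlap=0, sticky=0, inflight=0, assigned=0),
}
chosen = max(scores, key=scores.get)
```

D2 is idle and unassigned, yet it never wins: the `-inflight` and `-assigned` terms only break ties between workers that already have equal overlap.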
| # | Fix | Mechanism | Assumes | Risks |
|---|---|---|---|---|
| **Q2.A** | Cold-D bonus (binary, what the reverted commit did) | `cold_boost = K if assigned[D]==0 and not sticky else 0`; add to lex position 0. | Each D needs to be "popped" from cold once, after that the bonus disappears. | One-shot: only protects the first session per D. After all 3 D's have ≥1 session, bonus is 0 everywhere and we're back to overlap-dominates-everything. If new session pressure remains skewed (e.g. boilerplate keeps growing on D0/D1), we re-imbalance silently. |
| **Q2.B** | Load-floor bonus (graduated, my recommended primary) | `floor_bonus = max(0, K · (1 − assigned[D] / max(assigned[*])))` (or similar continuous fn); add to lex position 0; gated on `not sticky`. | "Lower assignment count = preferable for fresh sessions" is a sound bias even when no D is fully cold. | Tuning: K must dominate boilerplate overlap (~50 blocks here) but not so much that it drowns out genuine prefix-cache wins (a session with real 800-block overlap with one D should still go there). Suggest K ≈ 100×median(overlap_for_fresh_sessions). |
| **Q2.C** | Lex re-order: inflight first | Change score to `(-inflight, overlap + α·sticky, sticky, -assigned)`. | Idle D always wins ties → idle D2 wins fresh sessions immediately. | Contradicts the existing design intent (overlap-first = cache-locality-first). Hurts cache reuse when load *is* balanced. Sticky requests at turn 1+ might be diverted to a momentarily idle D, breaking cache locality of subsequent turns. |
| **Q2.D** | Capacity-aware overlap discount | `effective_overlap = overlap · (1 − inflight[D] / max_inflight)`; replace `overlap` in score. | Loaded D's overlap is worth less than idle D's overlap because of queueing cost. Matches what theory says about cache-vs-load tradeoff. | More complex than Q2.B; needs `max_inflight` estimate (per-D? global?). Harder to reason about and tune. Saves only marginal modeling correctness over Q2.B. |
| **Q2.E** | Pre-warm cold D's at startup | After SGLang warmup, send a synthetic request whose hash_ids cover the boilerplate prefix to each D, populating `state.resident[D]` evenly. | We can identify "the shared boilerplate" by inspecting the trace before launch (or extracting common prefix at run start). | Trace-aware / requires upstream knowledge. Doesn't help workloads with multiple distinct shared prefixes. Workload-coupled — feels brittle. |
| **Q2.F** | Drop overlap unless "material" | Apply overlap term only when overlap > τ blocks (or > τ% of input). | Tiny overlap doesn't actually save meaningful prefill work. | Hides imbalance instead of solving it. If a workload has medium overlap (say 15%), threshold won't fire and we're back to imbalance. Doesn't address the bigger issue. |
| **Q2.G** | Fix the substring filter (the actual `_is_admission_rejection_mode` bug) | Either widen `_ADMISSION_REJECTION_SUBSTRINGS` to include `"kvcache-centric"`, or call `state.record_admission_reject` directly from the actual reject signal site instead of string-matching after the fact. | Existing migration mechanism is sound *once* it gets fed the right signal. | User has explicitly said 3-reject threshold is too late. So Q2.G alone isn't enough. But it's still a real bug — fixing it is orthogonal cleanup. |
### Recommendation for Q2
**Primary: Q2.B (load-floor bonus, graduated).**
- Continuous, not binary one-shot like Q2.A — gracefully handles the case where new sessions keep arriving and load needs to keep spreading.
- Decouples "node-idle preference" from overlap as separate signals — composable, debuggable.
- Sticky stays on by gating on `not sticky` → no risk of breaking turn 1+ cache locality.
- Single knob (`K`) to tune.
**Orthogonal cleanup: Q2.G (fix the reject-substring filter).** Independent of Q2.B, since the migration mechanism is the *backstop* (when load-floor bonus alone isn't enough to migrate from a saturated D mid-session). User correctly noted that waiting 3 rejects is too late as the *primary* mechanism, but as a *backstop after* primary load balancing, it's still valuable.
**Avoid: Q2.C** (lex re-order destroys overlap-first design). **Avoid: Q2.E** (workload-coupled, brittle). **Q2.D / Q2.F** are reasonable but more complex than Q2.B with marginal gain.
### Concrete shape of Q2.B (for review, not for merge)
```python
# In KvAwarePolicy.select, replacing the current score line:
total_assigned = sum(state.decode_assignment_counts.values())
n_decoders = max(1, len(topology.route_workers))
mean_assigned = total_assigned / n_decoders
# Per-D fairness deficit: how much below the running mean is this D?
deficit = max(0, mean_assigned - state.decode_assignment_counts.get(worker.worker_id, 0))
floor_bonus = int(self.load_floor_bonus * deficit / max(1, mean_assigned)) if not sticky else 0
score = (
    overlap + sticky * self.sticky_bonus + floor_bonus,
    sticky,
    inflight_penalty,
    assignment_penalty,
)
```
Knob: `load_floor_bonus: int = 0` (off by default, opt-in). When set to e.g. 200, an empty D that should have 16 sessions but has 0 gets `floor_bonus = 200 * 16 / 16 = 200`, dominating boilerplate overlap (~50). A D that's only 1 session below mean gets `floor_bonus = 200 * 1 / 16 ≈ 12`, which doesn't override real prefix-cache wins.
But this is just a *sketch* — real tuning needs an empirical pass on the same Inferact subset to verify D2 receives sessions and overlap-driven cache wins survive on D0/D1.
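The arithmetic quoted above can be checked with a standalone version of the deficit formula. This is a sketch, not the policy code: the free function `floor_bonus` and the example assignment counts are hypothetical, and it assumes every decode worker has an entry in the counts dict (the snippet above uses `topology.route_workers` for the worker count instead).

```python
def floor_bonus(assigned: dict, worker_id: str, load_floor_bonus: int, sticky: bool) -> int:
    """Deficit-vs-mean bonus: scales with how far below the mean this worker is."""
    if sticky:
        return 0  # sticky sessions keep their cache locality
    n_decoders = max(1, len(assigned))
    mean_assigned = sum(assigned.values()) / n_decoders
    deficit = max(0, mean_assigned - assigned.get(worker_id, 0))
    return int(load_floor_bonus * deficit / max(1, mean_assigned))


# Empty D2 while the mean is 16 sessions per worker.
skewed = {"D0": 24, "D1": 24, "D2": 0}
cold = floor_bonus(skewed, "D2", 200, sticky=False)
hot = floor_bonus(skewed, "D0", 200, sticky=False)

# D2 only one session below a mean of 16.
nearly_even = {"D0": 17, "D1": 16, "D2": 15}
near = floor_bonus(nearly_even, "D2", 200, sticky=False)
pinned = floor_bonus(skewed, "D2", 200, sticky=True)
```

The cold worker receives the full bonus of 200, a worker one session below the mean receives 12, and workers at or above the mean (or sticky requests) receive 0, matching the figures in the paragraph above.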
### Validation plan if we go with Q2.B
1. Implement Q2.B + flag, default off.
2. Re-run E2 on the same `outputs/inferact_50sess.jsonl` subset with `--kvcache-load-floor-bonus 200`.
3. Check structural log: do D0/D1/D2 each get a non-trivial share of `session-d-binding.jsonl` rows?
4. Check failure rate: drop from 1054 → < 100? (Hypothesis: yes, because the LRU thrash that triggered the mooncake hair-trigger was downstream of D0/D1 saturation.)
5. Check direct-to-D rate: should stay similar or improve (load-balancing should not destroy cache reuse, since sticky still wins for known sessions).
6. Re-evaluate H1 with E1 vs the new E2.
---
## Decision points (for review)
| # | Question | Default if no answer |
|---|---|---|
| D1 | Q1: do Q2 fix first and re-measure before touching mooncake / SGLang? | **Yes** (recommended) |
| D2 | Q1: bump mooncake `MC_TRANSFER_TIMEOUT_NS` to 120 s as cheap defense-in-depth? | Yes |
| D3 | Q2: is Q2.B (load-floor bonus, graduated) the right shape, or should we pick a different option from the table? | Q2.B |
| D4 | Q2: also do Q2.G (fix the reject-substring filter) as orthogonal cleanup? | Yes |
| D5 | Q2.B: is the proposed deficit-vs-mean formula OK, or do you prefer a simpler "bonus = K · (max - mine) / max" form? | Defer |
| D6 | Q2.B: bonus magnitude K = 200 reasonable, or want to grid-search a few values? | Try 200 first |
| D7 | Validation: re-run E2 on same 50-session subset, or expand to 100 sessions for more headroom? | Same subset |
Once the shape is approved, the next implementation pass is small and concentrated in `policies.py` + `replay.py` + `cli.py` (no SGLang vendor changes needed for the primary fix).

docs/E1_E2_RESULTS_ZH.md Normal file

@@ -0,0 +1,416 @@
# E1 vs E2 Experiment Results — H200 + Driver 570
**Status**: E1 ✅ complete (2026-05-12 01:48 UTC, wall 1h29min). E2 ✅ complete (2026-05-12 03:22 UTC, wall 1h33min).
**Branch**: `h200-cu130`.
**Trace**: `outputs/inferact_50sess.jsonl` (deterministic head-cut of Inferact `codex_swebenchpro` to first 50 trials, md5 `7bb263a32600ef5a6ef5099ba340a487`, 1285 requests, mean input_length 67,631 tokens).
**Hardware**: 4× H200 80GB, driver 570.86.15 (cu12.8 API), Mellanox mlx5_60 RoCE 400 Gb/s NDR.
**Model**: Qwen3-30B-A3B-Instruct-2507 (TP1).
**Toolchain**: vendored SGLang 0.5.10 + cu12.8 nvcc local install (`~/cuda-12.8`) — see `docs/H200_DRIVER570_SETUP_ZH.md`.
---
## 1. Hypotheses being tested
From `docs/ONBOARDING_NEXT_AGENT_ZH.md` §3.1:
- **H1**: KVC v2's wins are not just from "1P3D topology + kv-aware policy" — the KVC layer (admission / migration / direct-to-D) contributes meaningfully on top. Pairing E1 (no KVC layer) against E2 (full KVC v2) on the **same subset** isolates the marginal contribution.
- **H2/H3**: Enabling real RDMA pushes TTFT p99 down from the reported 1.28s (TCP loopback) toward ~0.7s. Independent of H1, this is measured inside E2 alone (comparing against the historical TCP-loopback v2 reference).
---
## 2. E1 results — naive 1P3D + kv-aware + RDMA
**Configuration**: `mechanism=pd-disaggregation`, `policy=kv-aware`, 1P3D (GPU0=P, GPU1/2/3=D), `--force-rdma --ib-device mlx5_60`, `--concurrency-limit 32`, ts=1.
| Metric | E1 |
|---|---:|
| request_count | 1285 |
| success | 1200 |
| **error_count** | **85** |
| **failure_count** | **85** |
| abort_count | 0 |
| latency mean | 96.34 s |
| latency p50 | 93.21 s |
| latency p90 | 180.69 s |
| latency p99 | 219.46 s |
| ttft mean | 90.48 s |
| ttft p50 | 88.62 s |
| ttft p90 | 175.13 s |
| **ttft p99** | **207.39 s** |
| execution_modes | `pd-disaggregation-router: 1200`, `pd-disaggregation: 85` (errors) |
| per_decode_load | **D0:575, D1:710, D2:0** |
| per_prefill_load | P0:1285 |
| cache_hit_request_count | 1199 / 1200 (99.9%) |
### Key observations on E1
1. **D2 was never bound to a single session**. All 50 sessions got pinned to D0 or D1 by `kv-aware` policy's (overlap + sticky + inflight + assigned) lex-score, and naive pd-disaggregation has no migration mechanism to rebalance. Effective topology was **1P2D**, not 1P3D.
2. **Massive queueing**. TTFT p50 ≈ 89 s and p99 > 200 s indicate sessions waited tens of seconds in router/prefill queue. With `--concurrency-limit 32` and D0/D1 saturated, the inflight cap forced ~1250 reqs to serialize through only two decode workers.
3. **85 failures (6.6%)** — all `execution_mode == pd-disaggregation` (which the metrics module classifies as `error` when the agentic-pd-hybrid replay sees an unsuccessful upstream response). Most likely caused by `--request-timeout-s 300` firing on the longest queued requests.
4. **Cache hit 99.9%** — the kv-aware policy did successfully concentrate sessions on their prior D worker; the Inferact converter's prefix-shared 24-token-block hash_ids gave near-perfect prefix overlap across turns of the same session.
### What E1 establishes
For the same hardware, same trace, same model, **naive 1P3D + kv-aware policy is unusable for multi-session agentic workloads**:
- session-stickiness without migration leaves a third of compute capacity (1 of 3 decode GPUs) entirely unused
- queueing dominates user-facing latency
- failure rate is 6.6% even with 5 minutes per-request timeout
This is *the baseline H1 needs* — it shows the KVC layer (E2) has something concrete to improve over.
---
## 3. E2 results — KVC v2 + RDMA
**Configuration**: `mechanism=kvcache-centric`, `policy=kv-aware`, 1P3D, `--force-rdma --ib-device mlx5_60`, `--kvcache-admission-mode worker`, `--kvcache-direct-max-uncached-tokens 8192`, `--kvcache-migration-reject-threshold 3`, `--kvcache-prefill-backup-policy release-after-transfer`, `--kvcache-prefill-priority-eviction`, ts=1.
| Metric | E2 |
|---|---:|
| request_count | 1285 |
| success | 231 |
| **error_count** | **1054** |
| **failure_count** | **1054** |
| abort_count | 0 |
| latency mean (successful only) | 10.94 s |
| latency p50 | 7.44 s |
| latency p90 | 20.68 s |
| latency p99 | 64.73 s |
| ttft mean (successful only) | 1.76 s |
| ttft p50 | 0.43 s |
| ttft p90 | 6.56 s |
| **ttft p99** | **8.74 s** |
| execution_modes (succ.) | direct-to-D: 87; turn1-seed: 50; reseed: 12; large-append-reseed: 11; seed-filter-early-turn: 50; large-append-cap: 21 |
| per_decode_load | **D0:600, D1:685, D2:0** |
| per_prefill_load | P0:1285 |
| cache_hit_request_count | 230 / 231 (99.6 %) |
### Key observations on E2
1. **D2 still has zero bindings** — same root cause as E1. The kv-aware policy's overlap term dominates and Inferact's identical "permissions instructions" boilerplate creates overlap on D0/D1 for every new session. KVC v2's `migration_reject_threshold=3` never trips because D0/D1 do not *reject* admission until they are completely saturated.
2. **82 % failure rate (1054 / 1285)**. These are **not timeouts**: the actual root cause is a 3-layer cascade documented in §6. In short: 562 "no-space" admission rejects from D0/D1 → router falls back to seed/reseed paths that need mooncake → mooncake heartbeats drop ("Decode instance could be dead") → SGLang aborts the request → client sees `RuntimeError: generate stream ended before producing any token`.
3. **Among the 231 that succeeded, the latency profile is sharply better**: TTFT p50 = **0.43 s** vs E1's 88.62 s (E2/E1 = 0.5 %), latency p50 = **7.44 s** vs E1's 93.21 s (8 %). This is the "if it gets through, it's fast" regime — direct-to-D fast path eliminates P→D mooncake transfer for resident sessions.
4. **Direct-to-D fast path engaged 87 / 231 = 37.7 %** of successful requests. Lower than historical v2's 91.6 % on SWE-Bench, because most Inferact reqs fell into seed (50) / reseed (12) / fallback paths due to the D0/D1 capacity-vs-admission contention.
---
## 4. Comparison table — E1 vs E2
Numbers below are over **all 1285 requests** for E1 (since failure rate is small) but **only the 231 successful** for E2 (since the bulk failed before producing latency datapoints). This is **not a fair head-to-head**; see §6.
| Metric | E1 | E2 (succ only) | E2 / E1 |
|---|---:|---:|---:|
| Total reqs | 1285 | 1285 | |
| Successful | 1200 | **231** | 0.19× |
| **error_count** | 85 (6.6 %) | **1054 (82 %)** | **12.4× worse** |
| lat mean | 96.34 s | 10.94 s | 0.114 |
| lat p50 | 93.21 s | **7.44 s** | **0.080** |
| lat p90 | 180.69 s | 20.68 s | 0.114 |
| lat p99 | 219.46 s | 64.73 s | 0.295 |
| ttft mean | 90.48 s | 1.76 s | 0.019 |
| **ttft p50** | 88.62 s | **0.43 s** | **0.005** |
| ttft p90 | 175.13 s | 6.56 s | 0.037 |
| ttft p99 | 207.39 s | 8.74 s | 0.042 |
| per_decode_load | D0:575, D1:710, D2:0 | D0:600, D1:685, D2:0 | both 1P2D |
| direct-to-D % | N/A (no KVC) | 87/231 = 37.7 % | |
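The E2 / E1 column can be re-derived from the two preceding columns. A short sketch that recomputes a few rows follows; the values are copied from the table and the variable names are illustrative.

```python
e1 = {"lat_p50": 93.21, "ttft_p50": 88.62, "ttft_p99": 207.39}
e2 = {"lat_p50": 7.44, "ttft_p50": 0.43, "ttft_p99": 8.74}

# E2 / E1 ratio per metric, rounded to three decimals as in the table.
ratios = {name: round(e2[name] / e1[name], 3) for name in e1}

total_requests = 1285
failure_rate_e1 = round(85 / total_requests, 3)
failure_rate_e2 = round(1054 / total_requests, 3)
error_multiplier = round(1054 / 85, 1)
```

The recomputed ratios agree with the table (0.080, 0.005, 0.042), as do the failure rates of 6.6 % and 82 % and the 12.4× error multiplier.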
---
## 5. Interpreting H1 / H2 / H3
### H1 (was: KVC layer adds value on top of 1P3D + kv-aware) — *qualified*
The H1 hypothesis as stated in `ONBOARDING_NEXT_AGENT_ZH.md` predicted E2 would clearly win on most metrics. The reality is **bimodal**: the small subset of E2 requests that complete successfully are dramatically faster than E1, but a much larger fraction (82 %) of E2 requests fail outright (see §5b; these are not timeouts). Net throughput on this workload is *worse* for E2 than E1.
Two issues drove this:
1. The D2 cold-start pathology, whose root cause is documented in §3. Both runs are de facto 1P2D, not 1P3D.
2. KVC v2's admission gate is stricter and surfaces more "no D capacity" / "session-not-resident" failures than vanilla pd-disagg, when the workload (mean input 67 K tokens, mean output 700 tokens) saturates D0/D1's combined ~1.5 M KV pool.
For workloads where D0/D1 do not saturate or where the policy *does* spread session ownership across all D workers (the historical SWE-Bench setup), KVC v2 wins. The Inferact `codex_swebenchpro` subset breaks both assumptions.
### H2 / H3 (RDMA reduces TTFT p99) — *cannot be evaluated cleanly here*
The historical reference point is "KVC v2 + TCP loopback, SWE-Bench 50sess: TTFT p99 = 1.28 s". This run uses Inferact + RDMA, and TTFT p99 of the 231 successful E2 requests is **8.74 s** — much higher than the TCP baseline. But the workloads are not comparable: Inferact mean input is 67 K tokens vs SWE-Bench's much smaller average. Per-request prefill + transfer is roughly 5× longer here. A clean H2 / H3 read needs an Inferact-on-TCP run to compare against, which is out of scope for this subset's GPU budget.
What we *can* say: RDMA is correctly engaged (every worker log shows `installTransport, type=rdma`; admission RPC RTTs in `structural/admission-events.jsonl` are ~6 ms — consistent with one-hop RoCE).
---
## 5b. Why E2 has 80 % failures — the real chain (forensic)
The summary's `error_count: 1054` and `execution_mode: kvcache-centric` mask the actual cascade. Pulling the underlying `request-metrics.jsonl`, `structural/admission-events.jsonl`, and per-worker SGLang logs gives the full picture.
### Layer 1 — worker admission rejects (51 % of admit attempts)
From `structural/admission-events.jsonl`:
```
admit ok = 581 (modes: seed=494, direct_append=87)
admit reject = 605 (reasons: no-space=562, session-not-resident=43)
```
**562 "no-space" rejects** — D worker (almost always D0 or D1) reports its KV pool is full and refuses to take the request as direct-append. The router then re-routes the request to the seed/reseed path.
This is materially different from E1's behaviour: E1's vanilla pd-disagg had no admission RPC, so requests *always* got accepted by the chosen D and queued behind the running batch. E1 paid for that as a 90-second TTFT but never saw a "no-space" failure.
### Layer 2 — mooncake P→D transfer failures (real, observed in prefill log)
From `logs/prefill-0.log`:
```
[01:56:42] Prefill transfer failed for request rank=0 req.rid='2a5ed06fb…'
with exception KVTransferError: Failed to send kv chunk of … to 172.18.112.37:46067
[01:56:42] Prefill transfer failed for request rank=0 req.rid='eca5ff14…'
with exception KVTransferError: Decode instance could be dead,
remote mooncake session 172.18.112.37:15078 is not alive
[01:56:42] Prefill transfer failed for request rank=0 req.rid='7ed9827b…'
Decode instance could be dead, remote mooncake session ... is not alive
```
When the seed/reseed fallback queue piles up (because of layer 1), the D worker becomes heavily backlogged and its mooncake bootstrap session heartbeat drops — P interprets this as "the D worker is dead" and fails the transfer. This is **not** a true crash; the worker process is alive (we observed it accepting unrelated requests immediately after), but the mooncake session is torn down for that bootstrap_room.
### Layer 3 — client-visible error
From `request-metrics.jsonl` for all 1054 failed reqs:
```
"error": "RuntimeError: generate stream ended before producing any token"
```
This is what `agentic-pd-hybrid` sees when the SGLang `/generate` SSE stream closes with zero output tokens — the upstream abort from layer 1 or layer 2 propagates as an empty stream.
### The complete causal chain
```
Inferact shared "permissions instructions" boilerplate
  ↓
overlap term in kv-aware lex score never lets D2 win → D2 cold forever
  ↓
50 sessions all pinned to D0 / D1
  ↓
D0 / D1 KV pool saturates
  ↓
worker admission emits 562 × "no-space"                                ← Layer 1
  ↓
router falls back to seed/reseed path (needs P→D mooncake transfer)
  ↓
P→D transfer queue piles up; D mooncake heartbeat drops
  ↓
"Decode instance could be dead" → KVTransferError                      ← Layer 2
  ↓
SGLang aborts the req → SSE stream closes with 0 tokens
  ↓
agentic-pd-hybrid raises "generate stream ended ..." for 1054 reqs     ← Layer 3
```
### Why E1 didn't hit this
E1 used `mechanism=pd-disaggregation`, which has no per-worker admission RPC. The router blindly dispatched to D0/D1; SGLang's internal scheduler simply queued requests behind the running batch (some grew their wait to >90 s before getting a token). Of the 85 E1 errors, sampling shows they are `request-timeout-s=300` failures — old-fashioned timeouts on the agentic-pd-hybrid side, not mooncake or admission failures.
So:
- E1 trades latency for resilience: nobody rejects, everyone queues, you pay TTFT.
- E2's KVC v2 worker admission is *meant* to be a safety valve, but on the cold-D pathology it becomes an *amplifier*: rejects → fallback paths → backlog → mooncake heartbeat loss → cascading failures.
### The real fix
Worker admission per se is not the bug — the bug is that there is no D-rebalancing happening upstream. With balanced D load (e.g. cold-D bonus in policy, or pre-warm of D2 with shared boilerplate), D0/D1 would not hit "no-space", and the layer 1 → layer 2 cascade would not fire. The reseed long-tail TTFT (8.74 s p99 here) becomes the dominant cost — exactly the regime onboarding §3.1 H3 describes.
---
## 5c. Why mooncake "died" (forensic on Q1)
The error string is `Decode instance could be dead, remote mooncake session ... is not alive`, which sounds like the D worker process crashed. **It did not.** Concurrent evidence shows D1 was happily serving `/session_cache/admit_direct_append HTTP/1.1 200 OK` and running LRU evictions only seconds after the "is not alive" errors fired. The real mechanism is a hair-trigger failure check in the mooncake connection code.
### What the SGLang mooncake conn.py actually does
In `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1267-1276`:
```python
if ret != 0:  # one transfer slice failed
    with self.session_lock:
        self.session_failures[req.mooncake_session_id] += 1
        # Failures should never happen if the session is not dead,
        # if the session fails once, mark it as failed
        if self.session_failures[req.mooncake_session_id] >= 1:
            self.failed_sessions.add(req.mooncake_session_id)
            logger.error(f"Session {req.mooncake_session_id} failed.")
    ...
```
After this, every subsequent transfer that uses the same `mooncake_session_id` short-circuits at conn.py:1184:
```python
if req.mooncake_session_id in self.failed_sessions:
    self.record_failure(kv_chunk.room,
        f"Decode instance could be dead, remote mooncake session ... is not alive")
```
**One real `send_kvcache_slice ret != 0` permanently blacklists that D's mooncake session for the rest of the SGLang process lifetime.** The code's own comment ("Failures should never happen if the session is not dead") encodes the design assumption that transfers don't fail under normal conditions — but they do under the saturation regime described in §5b (RDMA queue full / D scheduler too busy to drain receives in time).
### Connecting back to Q1 timeline
Looking at decode-1.log around 01:56:42-56, the worker is running heavy decode batches (#token = 627K, near KV pool cap of 755K) plus repeatedly evicting via LRU. Under that load a single `send_kvcache_slice` returning a transient nonzero is enough to flip the switch. After 01:56:42 essentially every P→D1 transfer reports "is not alive" until end-of-run, even though D1 itself keeps serving direct-append admissions.
### What the hair-trigger is actually reacting to
Pulling the mooncake C++ logs (filter `^E0`/`^I0` lines from prefill-0.log) reveals the actual underlying error:
```
I0512 01:56:42.242062 transfer_engine_py.cpp:546]
Sync batch data transfer timeout after 37452515723ns
I0512 01:56:53.335597 transfer_engine_py.cpp:546]
Sync batch data transfer timeout after 30892690400ns
```
**37.45 s** and **30.89 s** — the mooncake `batch_transfer_sync` C++ call returned nonzero because the synchronous transfer took longer than its internal timeout (~30 s). On a 400 Gb/s NDR RDMA fabric this is not a network problem; the data path is healthy. The SGLang author's design instinct (`>= 1 failures = dead`) is *correct in the idle case* — a 30-second RDMA stall really does indicate a broken peer.
What's happening here is that the peer is **logically broken from the C++ control-plane's point of view**, even though the OS process is still alive.
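To pull these timeouts out of a log quickly, the following is a minimal sketch. It assumes the message format of the two sample lines above; the function name `timeout_seconds` is hypothetical and not part of the repo.

```python
import re

# Matches the mooncake C++ message shown above; the format is inferred
# from the two sample lines, not from the mooncake source.
TIMEOUT_RE = re.compile(r"Sync batch data transfer timeout after (\d+)ns")

def timeout_seconds(log_text: str) -> list[float]:
    """Return every sync-transfer timeout found in the log, in seconds."""
    return [int(ns) / 1e9 for ns in TIMEOUT_RE.findall(log_text)]

sample = """\
I0512 01:56:42.242062 transfer_engine_py.cpp:546]
Sync batch data transfer timeout after 37452515723ns
I0512 01:56:53.335597 transfer_engine_py.cpp:546]
Sync batch data transfer timeout after 30892690400ns
"""
print([round(s, 2) for s in timeout_seconds(sample)])  # [37.45, 30.89]
```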
### Why does the D side stall the control plane for 30 s?
Cross-referencing decode-0.log at the exact second of the first timeout (01:56:42):
```
01:56:34 Decode batch, #running-req=1, #token=627631, token_usage=0.83,
gen throughput=174.76 tok/s ← still serving normally
01:56:42 session id 1000315 does not exist, cannot delete.
01:56:42 session id 1000360 does not exist, cannot delete.
01:56:42 Trimmed decode session cache via LRU.
#evicted_sessions: 2, #freed_tokens: 77675,
#available_tokens: 38574 → 116249
01:56:42 Trimmed decode session cache via LRU.
#evicted_sessions: 1, #freed_tokens: 36166,
#available_tokens: 29038 → 65204
01:56:53 Decode transfer failed for request rank=0 ...
Failed to get kvcache from prefill instance, it might be dead
```
D0's main scheduler thread was busy doing **two consecutive LRU evictions** (freeing 77 675 + 36 166 ≈ 114 K tokens of KV) right when the P→D mooncake transfer attempt landed. Each LRU trim involves:
- iterating per-session resident metadata
- releasing GPU KV slots back to `token_to_kv_pool_allocator.free()`
- updating the session-aware-cache bookkeeping under lock
- closing per-session streaming state
Under `token_usage = 0.83` the LRU scan has to walk thousands of entries; the lock held during this work blocks the mooncake C++ control plane on the receive side (buffer registration / completion poll) from making progress. P's `batch_transfer_sync` keeps polling for the peer's completion ack, doesn't get one for 30 s, and gives up.
So the chain is:
```
D KV pool saturated by D2-cold-pinning (§5d)
D triggers heavy LRU eviction (114K tokens at a time)
D main scheduler thread starves mooncake C++ control plane for 30+ s
P's batch_transfer_sync returns nonzero (timeout)
P's hair-trigger marks D's whole mooncake_session_id "failed forever"
all subsequent reqs to that D blow up with "is not alive"
```
The hair-trigger threshold (`>= 1`) is structurally wrong for this regime — but it would not fire at all if the LRU thrash didn't happen, and the LRU thrash would not happen if the load were spread across all 3 D workers (§5d).
### Two layers of fix
| Layer | What | Cost |
|---|---|---|
| Root cause | Spread load to D2 so D0/D1's KV never saturate, LRU never thrashes. See §5d and the cold-D bonus implementation in `policies.py` (next commit). | Low — pure policy change |
| Defense in depth | In `mooncake/conn.py:1267-1276`, replace `>= 1` with a windowed threshold (e.g. ≥ 3 failures within 60 s) and add a periodic retry that probes the D bootstrap port before clearing `failed_sessions`. | Medium — touches vendored SGLang |
We do the root-cause fix first because it makes the second one optional.
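The windowed threshold in the defense-in-depth row could look roughly like the sketch below. It is not the vendored SGLang code: the class name `WindowedFailureTracker` and its parameters are hypothetical, and the periodic bootstrap-port probe is left out.

```python
import time
from collections import defaultdict, deque

class WindowedFailureTracker:
    """Mark a session failed only after `threshold` failures within `window_s`."""

    def __init__(self, threshold: int = 3, window_s: float = 60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._failures = defaultdict(deque)  # session_id -> failure timestamps

    def record_failure(self, session_id: str) -> bool:
        """Record one failed transfer; return True if the session should be marked failed."""
        now = self.clock()
        stamps = self._failures[session_id]
        stamps.append(now)
        # Drop failures that have aged out of the window.
        while stamps and now - stamps[0] > self.window_s:
            stamps.popleft()
        return len(stamps) >= self.threshold
```

With this shape, a single transient timeout during an LRU trim no longer blacklists the session; only repeated failures inside the window do.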
---
## 5d. Why no session ever migrated to D2 (forensic on Q2)
KVC v2's design (KVC_ROUTER_ALGORITHM §3.3) uses `state.session_d_rejects[(session_id, D)] += 1` after a rejection, then policy.select skips any D with `rejects >= migration_reject_threshold (=3)`. The mechanism is conceptually sound. The bug is in *which* failures count as rejections.
### The substring filter is too narrow
In `replay.py:1379`:
```python
_ADMISSION_REJECTION_SUBSTRINGS = (
"session-cap",
"no-d-capacity",
"d-backpressure",
)
def _is_admission_rejection_mode(execution_mode: str) -> bool:
return any(token in execution_mode for token in _ADMISSION_REJECTION_SUBSTRINGS)
```
Only execution_modes containing one of those three substrings increment the per-(session, D) reject counter. **All 1054 E2 failures have `execution_mode = "kvcache-centric"`** (the generic fallback bucket the replay engine uses when the request fell through every concrete sub-path before producing a successful result). That string contains none of the three substrings, so `session_d_rejects` is never incremented for them.
### Empirical confirmation
Counting from `structural/admission-events.jsonl` (worker-RPC level, independent of replay's classification):
| Stat | Value |
|---|---:|
| Distinct `(session, D)` pairs ever rejected by worker RPC | 49 |
| Pairs rejected ≥ 3 times (would qualify for blacklist) | **46** |
| Most-rejected single pair | (1001172, D1) = **25 rejects** |
So 46 of 49 (sess, D) pairs *should have been blacklisted* by KVC v2's design. They never were, because the corresponding requests' execution_mode was `"kvcache-centric"` (failure path) and not `"…-session-cap"` / `"…-no-d-capacity"` / `"…-d-backpressure"` (which only get assigned when the fallthrough path runs to a known-rejection sub-result, not when the upstream SSE stream errors out).
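A count like the one in the table can be reproduced with a few lines of Python. This is a sketch under assumptions: the field names `session_id`, `decode_worker` and `can_admit` are guesses at the `admission-events.jsonl` schema and need adjusting to the real file.

```python
import json
from collections import Counter

def count_rejected_pairs(lines, threshold=3):
    """Count (session, D) pairs rejected by the worker RPC, and those at or above threshold."""
    rejects = Counter()
    for line in lines:
        event = json.loads(line)
        # Field names are assumptions about the event schema.
        if event.get("can_admit") is False:
            rejects[(event["session_id"], event["decode_worker"])] += 1
    qualifying = sum(1 for n in rejects.values() if n >= threshold)
    return len(rejects), qualifying, rejects.most_common(1)
```

Called as `count_rejected_pairs(open(path))`, it returns the number of distinct rejected pairs, the number that reach the threshold, and the most-rejected pair.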
Counting "next-binding-after-reject" from the merged binding+admission timeline:
| Rejected on | Next binding goes to | Count |
|---|---|---:|
| D0 | D0 | 253 |
| D1 | D1 | 329 |
| D0 | D2 | **0** |
| D1 | D2 | **0** |
The router stubbornly re-binds the same session to the same D after every reject — exactly because the reject was never recorded in `session_d_rejects`, so policy.select still sees an empty rejection counter and the overlap term keeps tipping it back to D0/D1.
### The fix
Two paths, in increasing scope:
1. **Quick**: include `"kvcache-centric"` (the failure-fallback bucket) in `_ADMISSION_REJECTION_SUBSTRINGS`, OR have replay set `execution_mode` to a more specific failure label when an SSE stream closes with zero tokens (e.g. `"upstream-aborted"`) and add that to the substring set.
2. **Better**: don't rely on string-matching at all. Have `_run_request` catch the actual rejection signal (admission RPC `can_admit=False` or upstream `RuntimeError: generate stream ended ...`) and call `state.record_admission_reject(...)` directly at that point. The substring filter was inherited from the v1 → v2 migration design (`MIGRATION_V1_FINDINGS_ZH §4.1`) when only specific fallback paths set those names.
Either fix would let the existing `migration_reject_threshold=3` blacklist D0/D1 after enough failures, force a re-route to D2, populate D2's resident hashes, and break the overlap-pinning death spiral.
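The "better" path amounts to recording the reject where the signal is observed and letting the policy filter on the counter. A minimal sketch, assuming a standalone state object (the class `RejectState` and the `eligible` helper are hypothetical; only `session_d_rejects`, `record_admission_reject` and `migration_reject_threshold` are names from the text):

```python
from collections import defaultdict

class RejectState:
    """Per-(session, D) reject counter, updated where the reject signal is seen."""

    def __init__(self, migration_reject_threshold: int = 3):
        self.threshold = migration_reject_threshold
        self.session_d_rejects = defaultdict(int)

    def record_admission_reject(self, session_id, decode) -> None:
        # Called on can_admit=False or on a zero-token stream abort,
        # independent of whatever execution_mode string is assigned later.
        self.session_d_rejects[(session_id, decode)] += 1

    def eligible(self, session_id, decodes):
        """Decode workers the policy may still pick for this session."""
        return [d for d in decodes
                if self.session_d_rejects.get((session_id, d), 0) < self.threshold]
```

After three recorded rejects on D1, `eligible(session, ["D0", "D1", "D2"])` no longer contains D1, so the overlap term cannot pull the session back there.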
---
## 6. What this experiment actually shows
1. **The H200 + driver 570 + cu12.8 toolchain works for production-scale SGLang xPyD workloads.** Both runs completed without CUDA or driver errors, and the mooncake errors that did appear trace back to D-side saturation (§5c) rather than a broken fabric; failures are policy- and workload-level, not infrastructure.
2. **The KVC v2 + kv-aware policy combination has a latent pathology on workloads with high cross-session prefix overlap**: the overlap term in the lex score causes permanent load imbalance, and v2's reject-counter migration cannot rescue it because rejects only fire under capacity pressure, by which point timeouts already dominate. This is novel and not surfaced by the SWE-Bench evaluation in the existing project docs.
3. **For Inferact-like workloads, a cold-D bonus (e.g. require D to host at least one session before its overlap score counts) or an explicit pre-warm step is required** before E1/E2 comparisons can isolate the marginal effect of the KVC layer.
---
## 7. Reproducibility
- Trace: `outputs/inferact_50sess.jsonl`, md5 `7bb263a32600ef5a6ef5099ba340a487`, regenerable via `scripts/sample_trace_subset.py`.
- E1: `bash scripts/sweep_e1_naive_1p3d.sh` (1h 29 min wall)
- E2: `bash scripts/sweep_e2_kvc_v2_rdma.sh` (1h 33 min wall)
- Summary JSON paths:
- `outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_summary.json`
- `outputs/e2_kvc_v2_rdma_50sess/e2_kvc_v2_rdma_run1_summary.json`
- Per-request metrics JSONL alongside each summary, plus structural events under `*/structural/`.
---
## 8. Open follow-ups for the next agent
1. **Add a cold-D bonus** to `KvAwarePolicy.select` (e.g. positive constant for D with `state.resident[D] == ∅`) and re-run E2 on the same subset. Predict: D2 receives bindings, failure rate drops, head-to-head with E1 becomes meaningful.
2. **Rerun E2 with `--kvcache-admission-mode router`** (router-side optimistic admission instead of worker RPC) to isolate whether the strict worker admission is the contributor to the 1054 failures, or whether it's purely the imbalance.
3. **Run a third arm E0 with `policy=default` + `mechanism=pd-disaggregation`** as a true control — kv-aware policy is itself part of what we are evaluating; default round-robin would have spread sessions across all 3 D.
4. **Compare TTFT p99 against an Inferact-on-TCP-loopback run** to evaluate H2/H3 cleanly. Cost: 1 more E2-shaped sweep (~1.5 h).
5. **Investigate the 1054 E2 failures** in `request-metrics.jsonl` — sample some to verify they are timeout-related vs admission-rejected vs upstream-500.
---
## 4. Comparison table — pending
To be appended.
---
## 5. Open questions for the next iteration
- Are the 85 E1 errors all timeouts? `request-metrics.jsonl` rows with `error` execution_mode should be sampled to confirm. (Quick check: grep the metrics jsonl for `"execution_mode": "pd-disaggregation"` and inspect `latency_s` / `error` fields.)
- Does E2 reproduce the ~91% direct-to-D rate seen in the historical SWE-Bench v2 run, or does the Inferact workload push it lower? The session count is comparable (50 vs 52 there), but the per-session size distribution is very different (mean 33 turns × ~2KB context growth per turn).
- Is `D2 = 0%` an E1-specific artifact (kv-aware sticky in pd-disagg mode), or does the same happen in E2 before migration kicks in for the first time?

129
docs/E3_FINDINGS_ZH.md Normal file

@@ -0,0 +1,129 @@
# E3 — first run findings + bug exposure
**Status**: E3 first attempt aborted at ~16 min wall by SGLang assertion crash on decode-1. Partial data confirms the load-floor bonus works as designed; the crash is an independent vendored-SGLang bug exposed by E3's new routing pattern.
**Branch**: `h200-cu130`.
**Companion**: `docs/E1_E2_RESULTS_ZH.md`, `docs/E1_E2_FIX_DESIGN_ZH.md`.
---
## 1. What worked: load-floor bonus (K=200)
Within the first ~15 minutes of E3, before the crash:
| | E1 (run1) | E2 (run1) | E3 (run1, partial) |
|---|---:|---:|---:|
| total bindings | 1285 | 1186 admit attempts | 1001 |
| decode-0 bindings | 575 | 600 | 240 (24.0%) |
| decode-1 bindings | 710 | 685 | 536 (53.5%) |
| **decode-2 bindings** | **0** | **0** | **225 (22.5%)** |
| unique sessions on D2 | 0 | 0 | **30** |
**Load-floor bonus successfully broke the overlap-pinning death spiral.** D2 is finally getting traffic on Inferact's shared-boilerplate workload. The graduated formula (`K * deficit / mean`) plus the `not sticky` gate produces the intended behavior: fresh sessions land on under-loaded D's, established sessions keep going to their original D for cache locality.
This validates the Q2.B design from `docs/E1_E2_FIX_DESIGN_ZH.md` empirically — but only as far as the run got. End-to-end metrics (lat / TTFT / failure rate) are not interpretable yet because the worker died.
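For reference, the graduated bonus can be sketched as below. It assumes the form `K * max(0, mean - assigned) / max(1, mean)` and a per-D binding count as the load measure; how `policies.py` actually derives `assigned` and the sticky flag is not shown here, so treat both as assumptions.

```python
def load_floor_bonus(assigned: dict[str, int], d: str, k: float = 200.0,
                     sticky: bool = False) -> float:
    """Graduated bonus for under-loaded D workers; zero for sticky sessions."""
    if sticky:
        return 0.0  # established sessions keep their D for cache locality
    mean = sum(assigned.values()) / len(assigned)
    return k * max(0.0, mean - assigned[d]) / max(1.0, mean)

# E2-style binding counts: D2 is completely cold.
assigned = {"D0": 600, "D1": 685, "D2": 0}
print({d: round(load_floor_bonus(assigned, d), 1) for d in assigned})
```

With these counts only D2 receives a bonus, and it receives the full `K`, because its deficit equals the mean.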
## 2. The new crash: SGLang streaming-session correction leaves an invariant violated
At `01:51:21` (~5 min into the benchmark), decode-1 hit:
```
[01:51:21] Correcting streaming-session extend_input_len from 6648 to 0
(rid=6f4318e93dd543a49dbf19248cfc1e6f, session_id=1000195,
fill_len=6648, prefix_len=43459, kv_committed_len=43459)
[01:51:21] Scheduler hit an exception: AssertionError
at third_party/sglang/python/sglang/srt/managers/schedule_batch.py:1646
→ assert seq_len - pre_len == req.extend_input_len
```
### Mechanism
With `--enable-streaming-session`, SGLang's session_aware_cache hands the scheduler a request whose `fill_ids` is just the new tokens since the last turn (6648), while `prefix_indices` represents the already-cached prefix on this D (43459 tokens, per `prefix_len` in the log). When the prefix is longer than `fill_ids` (e.g., the new turn's input is short relative to the conversation history that's already in cache), this code path fires at `schedule_batch.py:1572-1585`:
```python
actual_extend_len = max(0, len(req.fill_ids) - len(req.prefix_indices))
if req.extend_input_len != actual_extend_len:
    logger.warning("Correcting streaming-session extend_input_len from %d to %d ...")
    req.set_extend_input_len(actual_extend_len)
```
So `req.extend_input_len` becomes `max(0, 6648 - 43459) = 0`.
Then at line 1588-1590:
```python
seq_lens = [len(r.fill_ids) for r in reqs] # 6648
prefix_lens = [len(r.prefix_indices) for r in reqs] # 43459
```
And at line 1646:
```python
assert seq_len - pre_len == req.extend_input_len # 6648 - 43459 == 0 → FAIL
```
The correction patches `extend_input_len` but the downstream invariant is computed from raw `fill_ids`/`prefix_indices` lengths, which the correction never touched. The arithmetic check is fundamentally incompatible with the corrected state.
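The mismatch is plain arithmetic and can be replayed with the numbers from the log line. This is a standalone illustration, not the scheduler code:

```python
fill_len, prefix_len = 6648, 43459  # values from the log line above

# What the correction writes into req.extend_input_len.
corrected_extend_len = max(0, fill_len - prefix_len)
# What the assertion recomputes from the raw lengths.
raw_difference = fill_len - prefix_len

print(corrected_extend_len)                    # 0
print(raw_difference)                          # -36811
print(raw_difference == corrected_extend_len)  # False, so the assert fires
```

The clamp to zero and the unclamped difference can only agree when `fill_len >= prefix_len`, which is exactly the case the correction does not handle.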
### Provenance
The streaming-session correction (`schedule_batch.py:1572-1585`) and the assertion site (line 1646) are both inside the project's SGLang vendor patches — `git log` on this file shows the patch came from commit `b8e6f13 feat(sglang): support decode session cache admission`. So this is a regression in the project's own SGLang fork, not upstream SGLang.
### Why E3 triggers it and E2 didn't
The crash is independent of migration (session 1000195 stayed on decode-1 the entire time). Two factors combined to expose it in E3:
1. **D1 was under more sustained load in E3** — 536 bindings on 17 unique sessions means high re-binding density per session, which means more concurrent turns of the same session at the scheduler, increasing the rate at which streaming-session corrections fire.
2. **Faster overall dispatch** — with D2 actually consuming work, the prefill→decode pipeline moves faster, so streaming-session entries reach the corrected state more often than in E2's saturated cap-out regime.
Both factors are consequences of the load-floor fix; the fix itself does not cause the crash. The crash is a pre-existing landmine in the vendored streaming-session code that E1 and E2 happened to avoid because their pipelines stalled before sessions accumulated enough committed prefix to trigger the correction.
---
## 3. Decision space for the fix
| # | Fix | Where | Change | Risk |
|---|---|---|---|---|
| **A** | Patch the assertion to match the corrected state | vendored SGLang `schedule_batch.py:1646` | Add: `if req.extend_input_len == 0 and len(req.fill_ids) < len(req.prefix_indices): continue` to skip degenerate reqs before iterating. | Local, scoped, doesn't touch correctness elsewhere. Need to handle the skipped reqs (set `was_skipped` flag, drop from batch). |
| **B** | Fix the correction site to also drop the req from the batch | vendored SGLang `schedule_batch.py:1572-1585` | When `actual_extend_len == 0` and req has nothing to extend, signal upstream to remove the req from this batch (defer or drop). | Slightly more invasive. The upstream call path needs to handle a "filtered" return. |
| **C** | Compute `seq_lens` and `prefix_lens` consistently with the correction | vendored SGLang `schedule_batch.py:1588-1590` | After correction, recompute `seq_lens = [len(r.fill_ids[:pre_len] + extension)]` or align both sides. | Risky; affects all downstream tensor sizing. |
| **D** | Workaround: disable session migration in E3 (the trigger combination) | our `cli` flag `--kvcache-migration-reject-threshold 0` | One-line config change in `sweep_e3_*.sh`. | Doesn't actually fix the crash — session 1000195 didn't migrate. May reduce but not eliminate. Might still hit it on a different session. |
| **E** | Workaround: disable streaming session | server flag, remove `--enable-streaming-session` | Sidesteps the entire correction path. | Loses KVC's direct-to-D fast path (the central perf win we measure). Defeats the experiment. |
### Recommendation
**Fix A** — patch `schedule_batch.py:1646` to skip the malformed req before asserting. It's the minimal-blast-radius change and matches the apparent intent of the correction (graceful handling of the degenerate state).
Concretely:
```python
# Just before the assertion at line ~1646
if req.extend_input_len == 0:
    # The streaming-session correction zeroed extend_input_len because
    # prefix_indices already covers fill_ids. Skip this req from the
    # extend batch — its KV is already committed; nothing to compute.
    skip_indices.append(i)
    continue
```
Then the caller of `prepare_for_extend` needs to handle skipped requests (return them to the decode queue without an extend pass).
**Avoid Fix D/E** — D doesn't address the root cause (the failing session didn't migrate), and E loses the entire reason we're running this experiment.
---
## 4. Decision points for review
| # | Question | Default if no answer |
|---|---|---|
| D1 | Implement Fix A (vendor patch to skip zero-extend-len reqs)? | **Yes** |
| D2 | Re-run E3 with same K=200, same subset, after the fix? | Yes |
| D3 | Add a structural log entry every time the correction fires so we can track its frequency? | Recommended |
| D4 | File this as a separate `feat(sglang)` commit on the branch so the patch and the failure case it fixes are traceable? | Yes |
---
## 5. What this tells us about KVC v2 maturity
The load-floor bonus's first real exposure to the production codepath uncovered an existing patch bug that was masked by E2's failure cascade. This is good news: the failure cascade in E2 was hiding *another* layer of breakage. Without rebalancing, sessions cap-out → cascade → never run long enough to commit deep prefixes → never hit the streaming-session correction → never crash. With rebalancing, sessions DO commit deep prefixes → trigger the correction → crash.
Each fix tends to expose the next-shallowest bug. This is expected for a stack of ~6 interacting subsystems (kv-aware policy, KVC admission, session_aware_cache, streaming session, mooncake transfer, prefill batch prep). The path forward is to keep patching, re-running, and pushing the failure boundary out.


@@ -0,0 +1,185 @@
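# Evaluation Protocol (Paper-quality)
**Date**: 2026-05-12
**Nature**: evaluation protocol specification covering every weakness M1-M6 listed in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.1
**Audience**: collaborators running experiments; paper authors; artifact reviewers
---
## 0. General principles
> Every number in the paper must be able to answer two questions:
> 1. **How large is the sampling error?** (bootstrap CI, N, std)
> 2. **Is it fair?** (same trial, same trace, same token cap, same timeout, paired)
Neither of the current sweep reports (`KVCACHE_CENTRIC_PROGRESS_ZH.md` / `V2_RESULTS_ZH.md`) satisfies either criterion. This document provides a compliant template.
---
## 1. Evaluation dimensions (one-to-one with M1-M6)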
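### 1.1 M1: Statistical significance
| Decision | Rule |
|---|---|
| `N`, minimum runs per config | **3** (headline numbers) / **5** (final ablation values) |
| Reported statistics | `mean ± std`, **plus a 2.5/97.5 bootstrap CI** |
| Multi-run aggregation | Append the per-request latencies of every run and bootstrap over the pooled set; do not take per-run means and then average them |
| Significance of differences | paired bootstrap p-value, ≥ 5000 samples |
| `N=1` allowed only for | smoke / sanity checks; **never in a headline table** |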
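The pooled bootstrap rule can be sketched as follows. This is a minimal percentile bootstrap using only the standard library; the function name `bootstrap_ci` and the sample latencies are illustrative and not taken from the repo's analysis scripts.

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=5000, alpha=0.05, stat=statistics.mean, seed=0):
    """Percentile bootstrap CI for `stat` over pooled per-request values."""
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(values)
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Pool per-request latencies from all runs first, then bootstrap once.
runs = [[1.2, 1.5, 0.9], [1.1, 1.8, 1.0], [1.3, 1.4, 2.2]]  # illustrative numbers
pooled = [x for run in runs for x in run]
lo, hi = bootstrap_ci(pooled)
```

Passing a percentile function as `stat` (for example one that returns the p99 of the resample) gives the CI for tail metrics with the same code path.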
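### 1.2 M2: Fair paired comparison
| Decision | Rule |
|---|---|
| trace fixity | Use the same `samples-*.jsonl` file; replay pins it with `--use-trace-as-sample` |
| timeout | Every mechanism uses the same `--request-timeout-s`; running one group at 600s and another at 300s is not allowed |
| token cap | Same `--max-input-len` (take the minimum across all baselines and truncate explicitly) |
| errors / aborts | Do **not** count only successful requests; list aborts and timeouts separately under `error_count`, and report metrics over the full set (errors included) or paired on the same trial mask |
| time window | `time_scale` must be consistent; no changes within one sweep |
| Worker count / GPU type | Must match; any topology difference must be annotated |
**Counter-example**: the current `E1 vs E2` table ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) §4) explicitly declares itself "not a fair head-to-head": E2 failed 80% of its requests, and its successful-only latency is compared against E1's full set. **A table like this cannot go into the paper as-is.**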
### 1.3 M3: Trace stratification
| Dimension | Suggested buckets |
|---|---|
| `turn_id` | `{1, 2-5, 6-20, 21+}` |
| `append_len` | `{≤128, 128-1K, 1K-8K, >8K}` |
| `overlap_ratio` | `{≤0.3, 0.3-0.7, >0.7}` |
| `inter_turn_gap_s` | `{≤5, 5-30, 30-300, >300}` |
| `input_len` | `{≤8K, 8K-64K, >64K}` |
**Reporting requirement**: beyond the headline numbers, provide at least one heatmap over `turn_id × append_len` so reviewers can see which slice the gain comes from.
**Counter-example**: the current Real Ali experiment reports -46% p50 only on the KVC-fit slice (high overlap + small append + 100% direct-eligible). That is an upper bound, not an average. A paired table on full Ali must be reported alongside it.
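A bucketing helper for the `turn_id × append_len` heatmap might look like the sketch below. It assumes that 1K means 1024 tokens and that each bucket's upper edge is inclusive; neither is stated in the table, and the helper names are hypothetical.

```python
import bisect

# Upper edges and labels copied from the table above.
TURN_EDGES, TURN_LABELS = [1, 5, 20], ["1", "2-5", "6-20", "21+"]
APPEND_EDGES, APPEND_LABELS = [128, 1024, 8192], ["≤128", "128-1K", "1K-8K", ">8K"]

def bucket(value, edges, labels):
    """Return the label of the first bucket whose upper edge is >= value."""
    return labels[bisect.bisect_left(edges, value)]

def stratum(turn_id: int, append_len: int) -> tuple[str, str]:
    """Map one request to its (turn_id bucket, append_len bucket) heatmap cell."""
    return (bucket(turn_id, TURN_EDGES, TURN_LABELS),
            bucket(append_len, APPEND_EDGES, APPEND_LABELS))

print(stratum(3, 6648))  # ('2-5', '1K-8K')
```

Grouping per-request latencies by the returned tuple gives one cell of the heatmap per key.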
### 1.4 M4: Baseline matrix
Run at least **2** of the following baselines:
| Baseline | Category | Library |
|---|---|---|
| vLLM + automatic prefix caching | same-model single-worker prefix cache | vLLM main |
| SGLang DP cache-aware (4×TP1) | current primary baseline | this repo's vendored SGLang |
| SGLang PD-disaggregation (kv-aware) | naive but cache-aware topology | this repo |
| DistServe | P/D disaggregation baseline | DistServe upstream |
| SplitWise | P/D split + adaptive routing | open-source impl |
| Mooncake-Master scheduler | contemporary design | mooncake-master |
**Additionally recommended**: run an "oracle" baseline that assumes `Σ.resident[d]` is perfectly known and admission never fails, as an upper-bound reference for KVC.
### 1.5 M5: Trace mix
| Trace | Purpose |
|---|---|
| Ali coding agent (full) | main results; includes single-turn dilution |
| Ali KVC-fit slice | demonstrates the KVC upper bound |
| SWE-Bench 50 sess | already available; multi-turn, high-overlap workload |
| ShareGPT | contrasting chat workload: short turns, low overlap. **Used to show that KVC does not regress on an ill-suited workload** |
| Inferact | tool-use-heavy agent workload |
| Mooncake trace | baseline trace for single-turn LLM serving |
| Synthetic adversarial | self-constructed: a burst of 100 new sessions seeding simultaneously, to verify robustness against mooncake death and of reset-on-success |
**Minimum mix**: Ali full + SWE-Bench + ShareGPT + Synthetic adversarial.
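### 1.6 M6: Hardware coverage
| Tier | Purpose |
|---|---|
| Single node, ≤ 8 GPU | all current results |
| Two nodes, NVLink + IB | validates cross-node D→P sync and mooncake behavior |
| 4-node cluster (≥ 16 GPU) | scaling numbers, cluster-scheduler assumptions |
| Heterogeneous (H100 + L40S) | topology-aware routing |
**Minimum mix**: single node 4×H200 + two nodes with NVLink + IB. The remaining two tiers can be left to future work.
---
## 2. Report templates
### 2.1 Main results table (Table 1)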
```
| Config | N | mean ± std | p50 [CI] | p90 [CI] | p99 [CI] | err% | timeout% |
|--------|---|------------|----------|----------|----------|------|----------|
```
Annotate with: trace name, time_scale, `max_input_len`, `request_timeout_s`, and all shared parameters.
### 2.2 Paired delta table
```
| Pair | N pairs | mean delta [CI] | p50 delta [CI] | wins / losses | p-value |
```
`N pairs` = number of trials that succeeded on both sides. `wins` = number of trials where `latency_kvc < latency_baseline`.
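The pairing rule can be made concrete with a small sketch. It assumes per-trial latencies keyed by trial id, with `None` marking a failed trial; that representation and the function name `paired_delta` are illustrative, not the format of `paired_compare.py`.

```python
def paired_delta(kvc: dict, baseline: dict) -> dict:
    """Compare per-trial latencies; a value of None marks a failed trial."""
    # Keep only trials present and successful on both sides.
    pairs = [
        (kvc[t], baseline[t])
        for t in kvc.keys() & baseline.keys()
        if kvc[t] is not None and baseline[t] is not None
    ]
    wins = sum(1 for k, b in pairs if k < b)
    losses = sum(1 for k, b in pairs if k > b)
    mean_delta = sum(k - b for k, b in pairs) / len(pairs) if pairs else float("nan")
    return {"n_pairs": len(pairs), "wins": wins, "losses": losses, "mean_delta": mean_delta}
```

Trials that failed on either side drop out of `n_pairs`, so the error and timeout columns of Table 1 must still be reported next to this table.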
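### 2.3 Stratified tables (Table 2)
One table per stratification dimension (§1.3).
### 2.4 Negative-result section (mandatory)
The paper must have a dedicated section listing:
- The concrete numbers where KVC is slower than the baseline on ShareGPT.
- The trace percentiles / slices where KVC does not win.
- The diagnostic chain for failed sweeps (mooncake death, E3 crash).
→ Reviewers who see honest negative results form a markedly better impression. The early draft in [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §4 can be expanded into this section.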
---
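## 3. Tool support (scripts this repo needs)
| Script | Status | Notes |
|---|---|---|
| `scripts/analysis/recompute_summary.py` | ✅ exists | fixes abort-polluted latency; the main data entry point for this protocol |
| `scripts/analysis/stratified.py` | ⏳ new on this branch | buckets by the §1.3 dimensions and emits tables |
| `scripts/analysis/paired_compare.py` | ⏳ new on this branch | paired bootstrap; emits the §2.2 table |
| `scripts/analysis/plot_*` | ✅ exists | TTFT PDF, GPU utilization, cache efficiency |
→ Once this branch's stratified + paired scripts land, collaborators running experiments can produce the tables with a single command.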
---
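## 4. Artifact requirements (SOSP/OSDI AE)
| Item | Standard |
|---|---|
| Dockerfile | a single `Dockerfile.artifact`; 4×A100/H100 is enough to launch |
| One-click script | `bash artifact/reproduce_main_table.sh`; produces Table 1 within 1 hour |
| Dataset | provide a subset of `outputs/sample-*.jsonl` (can fit within ~5GB); full Ali is obtained by following instructions |
| Reproducibility | bootstrap CIs that overlap the paper's count as reproduced; bit-exact results are not required |
| Documentation | `artifact/README.md`, listing the command behind each table / figure |
→ Prepare the artifact only after the roadmap's §M1 fixes are done.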
---
## 5. Self-check list (use before submitting a paper draft)
- [ ] Every table has N ≥ 3 and includes mean±std and a 95% CI.
- [ ] No "successful only" wording; all errors are counted in `err%`.
- [ ] All baselines use the same `max_input_len` / `request_timeout_s` / `time_scale`.
- [ ] At least 3 traces + 1 synthetic adversarial.
- [ ] At least 1 non-SGLang baseline.
- [ ] A negative-result section exists.
- [ ] Dilution data for KVC on single-turn workloads is included.
- [ ] Formal part: Algorithm 1/2/3 + Theorem 1/2, plus Theorem 4 once D→P sync is complete.
- [ ] Failure-mode forensics (mooncake death, E3 crash, cold-D) all appear in §Limitations or §Discussion.
---
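## 6. Link to the roadmap
- [ ] Phase A: implement this branch's `scripts/analysis/stratified.py` + `scripts/analysis/paired_compare.py` (no GPU needed).
- [ ] Phase B: use the new tools to regenerate stratified / paired tables from the existing `kvc-real-ali-iter-v1` 600-req/15min data and store them under `outputs/` (no GPU rerun needed).
- [ ] Phase C: run the ShareGPT + Synthetic adversarial baselines (GPU time ~12h).
- [ ] Phase D: pick 1 non-SGLang baseline (vLLM + prefix caching recommended) to complete M4 (GPU time ~24h).
---
**Key takeaway**: the current results "look like a win already", but once they are re-reported under this protocol the magnitude of the win will shrink, the winning slices will narrow, and the negative slices will surface. Every paper has to go through this; the earlier it is done, the less it costs.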

222
docs/FAILURE_MODES_ZH.md Normal file

@@ -0,0 +1,222 @@
# Failure-mode Taxonomy
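**Date**: 2026-05-13
**Nature**: consolidated list + diagnostic handbook
**Audience**: collaborators who hit a failure mid-experiment and need an immediate lookup; authors who need to cite it when writing the paper's §Limitations; the answer for reviewers who ask "why do you think it will be more stable this time"
This document organizes the failure modes identified so far into a table of "symptom → root cause → trigger → current mitigation → real fix". Every entry has a forensic link back to the original experiment doc.
---
## 0. TL;DR
Five identified failure modes, grouped by whether they block a paper claim:
| Class | Name | Blocks paper | Real fix |
|---|---|:---:|---|
| **A. Control-plane cascade** | Mooncake "instance not alive" cascade | ✅ | admission backoff + per-D pending-seed budget |
| **B. Routing bias** | Cold-D / overlap-pinning | ✅ | first-principles overlap term redefinition |
| **C. KV thrash** | Evict storm (session-level evict) | ✅ | [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) |
| **C'. KV thrash** | Reseed storm (concurrent large turn-1 seeds) | ✅ | per-D pending-seed budget + (frequency drops on its own once C is mitigated) |
| **D. Vendor invariant** | streaming-session correction invariant crash (E3) | ❌ (hotfix landed) | delete the correction path (after block-level evict is complete) |
A / B / C are the three that Milestone 1 must resolve; C' is a secondary cause of A; D has a temporary stopgap, but its root fix is tied to C.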
---
## 1. A: Mooncake "instance not alive" cascade
### 1.1 Symptoms
- Client side: `RuntimeError: generate stream ended before producing any token`
- D scheduler log: `[mooncake] Decode instance could be dead, dropping ...`
- Whole batches of requests are aborted; a single sweep drops from healthy to 80% failure within minutes ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) E2: 1054 / 1285 failed)
### 1.2 Root cause (forensic chain)
```
admission no-space (D KV pool full)
→ router immediately falls back to the seed/reseed path
→ multiple concurrent seeds hit mooncake P→D at once
→ P→D egress queues up; the handshake phase times out
→ mooncake marks the peer dead
→ SGLang aborts every in-flight req on the dead link
→ client sees a bulk generate-stream interruption
```
### 1.3 Trigger conditions
- D KV pool near full (≥ ρ·K_d, default 0.95)
- router fallback chain launches multiple reseeds within a millisecond-scale window
- mooncake heartbeat timeout (the default window is short)
### 1.4 Current mitigation
- `--kvcache-seed-min-turn-id=2` skips the large turn-1 seed and reduces the initial burst (main-branch stable config)
- `--mc-transfer-timeout=1800s` as the default (commit 905d671) reduces false "dead" verdicts
- `--request-timeout-s=180/300` keeps clients from seeing hour-long hangs, but does not stop the cascade itself
→ These treat the symptoms, not the cause. E2 still fails 80% on real 4×H200 NDR hardware ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md)).
### 1.5 Real fix (roadmap §S3)
1. **admission RPC backoff + jitter**: do not fall back immediately on rejection; give the D scheduler room to breathe.
2. **per-D pending-seed budget**: at most K seeds in the transfer queue at any moment; the excess queues up instead of bursting.
3. **Decouple mooncake heartbeat from admission**: the admission path no longer implies "peer alive".
4. **Close the loop on the backpressure pause hint** ([SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) §2.3, currently EXPERIMENTAL).
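Items 1 and 2 can be sketched together as below. This is a minimal illustration, not the router's code: `backoff_delay`, `PendingSeedBudget`, and all default values are hypothetical, and the jitter scheme shown is the common "full jitter" variant.

```python
import random
import threading

def backoff_delay(attempt: int, base_s: float = 0.05, cap_s: float = 2.0, rng=random) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))

class PendingSeedBudget:
    """At most `k` seeds in flight per D; extra seeds wait instead of bursting."""

    def __init__(self, decodes, k: int):
        self._slots = {d: threading.BoundedSemaphore(k) for d in decodes}

    def try_acquire(self, d: str) -> bool:
        """Take one seed slot on D without blocking; False means the budget is spent."""
        return self._slots[d].acquire(blocking=False)

    def release(self, d: str) -> None:
        """Return the slot once the P→D transfer has finished or failed."""
        self._slots[d].release()
```

A rejected admission would sleep for `backoff_delay(attempt)` before retrying, and a seed would only be dispatched after `try_acquire(d)` succeeds.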
---
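## 2. B: Cold-D / overlap-pinning
### 2.1 Symptoms
- N=k decode workers, but only ~k-1 actually carry traffic; some D has 0 bindings
- The per-D load histogram is heavily skewed (E2: D0:600 / D1:685 / **D2:0**)
- Overall throughput is bounded by the busiest D; raw latency is not necessarily worse, but capacity utilization is 33%+ worse
### 2.2 Root cause
Inferact / Ali coding agent traces open every session with ~12K of "system prompt + tool schema", and these 24-token blocks share hashes across all sessions. The kv-aware policy's `overlap` term treats them as "this D already holds these hashes", so the score pushes every new session toward D0/D1 (the first two to warm up); D2 always has 0 overlap, is never selected, and stays cold forever.
### 2.3 Trigger conditions
- multi-session workload + shared boilerplate prefix
- `migration_reject_threshold > 0` and rejects never fire (because D0/D1 are not yet full)
### 2.4 Current mitigation
`KvAwarePolicy.load_floor_bonus` (commit 93fce42):
```
floor_bonus = K * max(0, mean - assigned) / max(1, mean)
```
Measured in E3: D2's share of bindings rose from 0 to 22.5% ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §1).
→ This is a patch, not a fix. `K` is a magic number; when the number of boilerplate hashes exceeds `K / sticky_bonus`, the D still stays cold.
### 2.5 Real fix (roadmap §S5)
Redefine `overlap` as **"the number of prefix hashes that this session holds exclusively on this D"**:
```
exclusive_overlap(s, d) := |prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]|
```
Here `session_owned[s]` excludes hashes that other sessions also hold. Shared boilerplate hashes do not enter `exclusive_overlap`, so scores spread out naturally. This requires the D side to return a sketch (Bloom filter / minhash) of the per-session resident hash set in the `admit_direct_append` response.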
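The redefinition can be illustrated with exact sets. This sketch makes two simplifying assumptions: it uses plain Python sets instead of a Bloom filter or minhash sketch, and it approximates "hashes other sessions also hold" by the other sessions' prefix hashes. The function signature is hypothetical.

```python
def exclusive_overlap(session, d, prefix_hashes, resident):
    """Count hashes of `session` resident on `d` that no other session also holds."""
    shared = set()
    for other, hashes in prefix_hashes.items():
        if other != session:
            shared |= hashes
    session_owned = prefix_hashes[session] - shared
    return len(prefix_hashes[session] & resident[d] & session_owned)

prefix_hashes = {
    "s1": {"sys", "tool", "a1"},  # "sys" and "tool" are shared boilerplate
    "s2": {"sys", "tool", "b1"},
}
resident = {"D0": {"sys", "tool", "a1"}, "D2": set()}

print(len(prefix_hashes["s2"] & resident["D0"]))              # 2: plain overlap pins s2 to D0
print(exclusive_overlap("s2", "D0", prefix_hashes, resident))  # 0: boilerplate no longer counts
print(exclusive_overlap("s1", "D0", prefix_hashes, resident))  # 1: s1 keeps its locality
```

Under the plain overlap term a brand-new session `s2` scores 2 on D0 and 0 on D2; under the exclusive term both score 0, so load terms decide the placement.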
---
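## 3. C: Evict storm (session-level eviction)
### 3.1 Symptoms
- Under workloads that put D memory under pressure, a KV pool release spike of 30-90K tokens appears every 12 minutes
- The next request from the same session then triggers a `Reseed` (P re-prefills 50K + mooncake transfers 50K, 37s)
- The TTFT long tail is dominated entirely by this kind of reseed ([V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §3.2)
### 3.2 Root cause
`SessionAwareCache.release_session` calls `free([cache_protected_len, kv_allocated_len))` in one shot, i.e. it frees the entire session-exclusive tail. Measured in E3: 90 evictions, 67,726 tokens freed per eviction on average, 25 of 50 sessions affected ([KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) §0).
→ This contrasts sharply with the leaf-by-leaf incremental eviction of standard SGLang radix. This part of the KV never entered the radix tree, so it does not benefit from fine-grained LRU erosion.
### 3.3 Trigger conditions
- D KV pool near full
- `maybe_trim_decode_session_cache` is triggered by the scheduler (when `DecodePreallocQueue` detects `available_size() <= 0`)
### 3.4 Current mitigation
- `--kvcache-session-soft-cap=N` (main branch): limits the number of resident sessions on a D, so trimming starts earlier and the pool does not hit the ceiling
- `--kvcache-direct-max-uncached-tokens=8192` (v2): slows the rate at which the direct path consumes KV
→ Both only slow things down; neither solves the underlying problem that a single free is too large.
### 3.5 Real fix (roadmap §S1)
[BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md): have streaming-session decode output enter the radix tree via `inner.cache_finished_req` at the end of each turn → `release_session` degenerates to `dec_lock_ref` + slot removal → radix LRU erodes the cache one 24-token leaf at a time.
Expected: a single eviction drops from 67K to ≤ 500 tokens; reseed frequency drops by an order of magnitude.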
---
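## 4. C': Reseed storm (concurrent large turn-1 seeds)
### 4.1 Symptoms
- During workload startup (first 30-60s) all sessions hit turn 1 at the same time
- Multiple concurrent `Seed`s (each ~50-90K tokens) hit mooncake → overlaps with the §1 cascade
### 4.2 Root cause
During startup `resident[d]` is empty for every D in `KvAwarePolicy`, so all D scores are equal, but ε retries + per-trial admit do not prevent concurrency.
### 4.3 Trigger conditions
- Under trace replay at `time_scale=1`, sessions start simultaneously at the original arrival density
- No per-D pending-seed rate limit
### 4.4 Current mitigation
- `--kvcache-seed-min-turn-id=2`: skips the turn-1 seed entirely (main-branch stable config)
- Side effect: turn-1 KV injection is lost and turn 2 must reseed (which is actually more stable, because reseeds are spread out over time)
### 4.5 Real fix
- per-D pending-seed budget (same as §1.5 item 2)
- once §3.5 is done, eviction frequency drops on its own, which indirectly lowers reseed frequency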
---
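## 5. D: Streaming-session correction invariant crash (E3 landmine)
### 5.1 Symptoms
- D scheduler raises `AssertionError` at `schedule_batch.py:1646` (`seq_len - pre_len == req.extend_input_len`)
- The whole D worker process exits → router sees the peer as dead → §1 cascade
### 5.2 Root cause
[E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2: the streaming-session correction (commit b8e6f13) rewrites `extend_input_len` to `max(0, fill_len - prefix_len)`, but the downstream invariant is still computed from the raw fill_ids/prefix_indices. When `fill_len < prefix_len` (the prefix accumulated over multiple turns exceeds the current turn's increment), the invariant cannot be satisfied mathematically.
### 5.3 Trigger conditions
- The prefix a streaming session has committed across turns is longer than the current turn's new fill_ids
- E2 never reached this state because its pipeline was stalled; E3 fixed the cold-D bottleneck → faster pipeline → landmine exposed
### 5.4 Current mitigation
The pre-filter pass in commit 986f351 (drops such reqs at the entry to `prepare_for_extend`, so the client sees an error response instead of a worker crash). This is a stopgap.
### 5.5 Real fix
The entire correction path at `schedule_batch.py:1572-1646` becomes **structurally unnecessary** once the block-level eviction refactor is complete: [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.7 explains that after the refactor, fill_ids / prefix_indices consistency is guaranteed automatically by radix `match_prefix`.
→ Do not add more correction clauses; delete the whole segment.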
---
## 6. Failure diagnosis cheat sheet
When running a sweep, look up the table below:
| What you see | Most likely | Check first |
|---|---|---|
| Client `RuntimeError: generate stream ended before...` | §1 cascade | Search the D scheduler log for `instance could be dead` |
| One D shows `binding=0` while the other Ds are busy | §2 cold-D | `per_decode_load` histogram |
| TTFT p99 suddenly rises to the 5–8 s range | §3 evict storm | `release_session` call frequency + average tokens freed |
| High failure rate during sweep ramp-up, low in steady state | §4 reseed storm | Peak of the mooncake transfer queue in the first 30 s |
| D worker process exits abnormally | §5 invariant crash | Search the scheduler log for `AssertionError` or `extend_input_len` |
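The log-based rows of the table lend themselves to a small lookup helper. The sketch below uses only the strings quoted in the table; the function name and the idea of scanning line by line are assumptions, and the metric-based rows (§2, §3, §4) are deliberately left out because they cannot be detected from a single log line.

```python
from typing import List, Optional, Tuple

# (substring to look for, failure mode), in priority order.
SIGNATURES: List[Tuple[str, str]] = [
    ("extend_input_len", "§5 invariant crash"),
    ("AssertionError", "§5 invariant crash"),
    ("instance could be dead", "§1 cascade"),
    ("generate stream ended before", "§1 cascade"),
]


def triage(log_line: str) -> Optional[str]:
    """Returns the failure mode suggested by one log line, or None if nothing matches."""
    for needle, mode in SIGNATURES:
        if needle in log_line:
            return mode
    return None
```

A line that matches nothing returns `None`, which is the cue to fall back to the metric-based rows.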
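---
## 7. Link to the roadmap
- Items 1/3/4 of Milestone 1 in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) correspond to the real fixes for modes C / A / B in this table. After Milestone 1, §1–§4 of this table should all move from "unfixed" to "mitigated", and §5 should disappear outright.
- The paper's §Limitations must state the current situation honestly: "we identify five failure modes; A/C are addressed by this work, B/C' are partially addressed, D is a transient artifact of the in-progress refactor."
---
**Key takeaway**: manage failure modes as first-class artifacts. Giving every failure the five fields "symptom → root cause → trigger → mitigation → real fix" is a key tool for pushing a prototype to production grade. A reviewer who sees that you can enumerate your failures is far more convinced than one who only sees you beat the baseline.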


@@ -0,0 +1,270 @@
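# Environment setup for running this repository on H200 + driver 570 (with pitfall notes)
**Scope**: a 4× H200 node + NVIDIA driver `570.86.15` + the `kvc-debug-journey-v1-to-v4` branch of this repository or a later branch.
**Target reader**: the next SWE/research agent who gets a fresh H200 machine and needs to bring up vendored sglang 0.5.10 + mooncake RDMA + agentic-pd-hybrid quickly.
**Author status**: this document was finalized at `h200-cu130 @ initial commit`; the smoke test has passed over RDMA with 16 reqs / 0 errors.
---
## 0. TL;DR (5 lines)
1. **The "CUDA Version: 13.0" shown by `nvidia-smi` is a trap**: it is the upper bound of the runtime the driver can run via forward compatibility, not the driver's own API version. The driver API provided by driver `570.86.15` is **cu12.8**.
2. The `jit_kernel/` module of vendored sglang 0.5.10 uses `tvm_ffi` + ninja + the nvcc binary to compile each kernel on first call. The only system nvcc is in `/usr/local/cuda-13.0/bin/`; a .so compiled with cu13 has a NEEDED entry for `libcudart.so.13`, which driver 570 refuses to run → `cudaErrorInsufficientDriver`.
3. The fix is to **install a local cu12.8 toolkit into `$HOME/cuda-12.8`** (no root required) so that tvm_ffi uses the cu12.8 nvcc; the build output then has a NEEDED entry for `libcudart.so.12`, which driver 570 fully supports.
4. The mooncake wheel (`mooncake-transfer-engine 0.3.10.post2`) is also a cu12 build and needs `libcudart.so.12`, which is already provided inside the venv by the `nvidia-cuda-runtime-cu12` package.
5. Every shell **must `source scripts/setup_env.sh`** before running SGLang. This is already wrapped up in the script.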
---
## 1. One-time setup (about 25 min)
```bash
cd /path/to/agentic-pd-hybrid
# (1) Python environment (~3 min)
uv sync
# (2) Install the cu12.8 toolkit locally (~5 GB download + 5 min unpack = ~15-20 min)
mkdir -p /tmp/cuda_dl && cd /tmp/cuda_dl
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run
sh cuda_12.8.1_570.124.06_linux.run \
  --silent --toolkit --override \
  --installpath=$HOME/cuda-12.8 \
  --tmpdir=$HOME/tmp \
  --no-drm --no-man-page
# (3) Verify
$HOME/cuda-12.8/bin/nvcc --version # expect: release 12.8, V12.8.93
# (4) Go back to the repo root and source the env script (required in every shell)
cd /path/to/agentic-pd-hybrid
source scripts/setup_env.sh
```
`source scripts/setup_env.sh` should print:
```
agentic-pd-hybrid env ready:
CUDA_HOME=/home/<user>/cuda-12.8 (12.8, V12.8.93)
libcudart.so.12 at .../.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib
MC_TRANSFER_TIMEOUT=1800s
```
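**`MC_TRANSFER_TIMEOUT=1800` (30 min) replaces the mooncake default of 30 s.** The E2 forensic found that LRU eviction on the D side can starve the mooncake C++ control plane for 30+ s, which trips the hair-trigger at `conn.py:1270` and permanently blacklists the mooncake_session_id of the whole D. 1800 s leaves ample headroom: only a peer that has not answered in 30 minutes is truly a dead D. See `docs/E1_E2_RESULTS_ZH.md §5c` for details. `stack.py` also sets a default of the same name for worker subprocesses.
---
## 2. Smoke test (validates the whole chain)
Feed 16 synthetic requests to a 1P3D topology with real RDMA enabled. Only after this passes should you touch the E1/E2 experiments.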
```bash
# Assumes scripts/setup_env.sh has already been sourced
mkdir -p outputs/smoke_rdma
uv run --no-sync python -m agentic_pd_hybrid.cli make-small-append-trace \
--output outputs/smoke_rdma/mini_trace.jsonl \
--session-count 4 --turns-per-session 4 \
--initial-input-length 1024 --append-input-length 200 --output-length 50 \
--inter-turn-gap-s 2 --session-stagger-s 1
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
--trace outputs/smoke_rdma/mini_trace.jsonl \
--output-root outputs/smoke_rdma \
--mechanism pd-disaggregation --policy default \
--model-path /mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507 \
--prefill-workers 1 --decode-workers 3 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
--transfer-backend mooncake \
--force-rdma --ib-device mlx5_60 \
--gpu-budget 4 --time-scale 1 \
--concurrency-limit 4 --timeout-s 1800 --request-timeout-s 300 \
--session-sample-rate 1.0 --min-turns 1 --target-duration-s 600
```
**The first run is slow, 8-15 min**: model load 196 s + 5-10 JIT kernels at ~10-30 s each + warmup. Later runs take only ~3-5 min.
**Expected result**: `request_count=16, error=0, abort=0, failure=0, execution_modes={'pd-disaggregation-router': 16}`.
Each worker's log should contain `installTransport, type=rdma`, which shows that mooncake really uses RDMA rather than TCP loopback.
---
## 3. GPU ↔ RDMA HCA mapping (measured on this machine)
8 ConnectX HCAs, all ACTIVE / 400 Gb/s NDR / RoCE v2 (link_layer=Ethernet, GID Index 3). Mooncake automatically picks the preferred HCA by NUMA / PCIe affinity:
| GPU | preferred HCA | NUMA |
|---|---|---|
| cuda:0 | mlx5_60 | 0 |
| cuda:1 | mlx5_88 | 0 |
| cuda:2 | mlx5_98 | 1 |
| cuda:3 | mlx5_42 | 1 |
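The CLI's `--ib-device <name>` accepts a single device name and overrides it globally for all workers. The smoke test defaults to `mlx5_60`: the P worker on cuda:0 is NUMA-local, while the D workers on the other GPUs are cross-NUMA but still work. For optimal E1/E2 runs you could set the environment variable separately per P/D worker, but stack.py currently does not support a per-worker `MOONCAKE_DEVICE`; either all workers share one device, or you use mooncake auto-discovery (which requires changing `MC_MS_AUTO_DISC=0` back to 1).
All 8 HCAs: `mlx5_22, _27, _42, _60, _88, _98, _126, _135` (NUMA 0/1/0/0/0/1/0/1, mixed).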
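---
## 4. Pitfalls encountered (in chronological order)
### Pitfall 1: the "CUDA Version: 13.0" in `nvidia-smi` is misleading
The `nvidia-smi` header shows `Driver Version: 570.86.15 / CUDA Version: 13.0`, which suggests the machine supports cu13. **This is the upper bound of the CUDA runtime the driver can run via forward compatibility**, not the version of the driver's own API. The driver API of driver 570 tops out at cu12.8 (see NVIDIA's "CUDA Compatibility" matrix).
**How to check correctly**: run `torch.cuda.is_available()`. With a cu13 build of torch installed it reports `The NVIDIA driver on your system is too old (found version 12080)`. The returned `12080` is the driver's own API version (cu12.8).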
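The integer in that message follows CUDA's usual version encoding, `1000 * major + 10 * minor`. A minimal sketch for decoding it and comparing it against a runtime version (the function names are illustrative, not part of torch or this repository):

```python
from typing import Tuple


def decode_cuda_version(encoded: int) -> Tuple[int, int]:
    """Splits a CUDA version integer such as 12080 into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_can_run(runtime_encoded: int, driver_encoded: int) -> bool:
    """A runtime newer than the driver's API version fails without a compat layer."""
    return decode_cuda_version(runtime_encoded) <= decode_cuda_version(driver_encoded)
```

`decode_cuda_version(12080)` gives `(12, 8)`, so a cu13 runtime (`13000`) fails the check against this driver while a cu12.8 runtime passes.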
### Pitfall 2: patch differences between vendored sglang and pip sglang
The repository's `third_party/sglang/python/` is an SGLang 0.5.10 fork carrying the project's own patches. **The `sglang==0.5.10` on pip does not contain the core patches.** The concrete differences:
| Item | pip version | vendored version |
|---|---|---|
| `srt/managers/scheduler.py` | 3621 lines | 3938 lines |
| Occurrences of `admit_direct_append` | 2 | **11** |
| `DirectAppendAdmissionReqInput/Output` | absent | **present** (core RPC) |
| `_should_allow_local_prefill_on_decode` | absent | present |
| `maybe_trim_decode_session_cache` | absent | present |
| `decode_direct_waiting_queue` | absent | present |
**The vendored version is mandatory.** This branch has changed `sglang==0.5.10` in `pyproject.toml` to `sglang` + `[tool.uv.sources] sglang = { path = "third_party/sglang/python", editable = true }`; after `uv sync` the vendored version is installed in editable mode automatically.
Historically some sweep scripts switched at runtime with `PYTHONPATH=src:third_party/sglang/python`, but installing it into the venv through `uv.sources` is more thorough and cannot be silently shadowed by the pip sglang.
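### Pitfall 3: switching to cu13 is a dead end
When the driver 570 incompatibility surfaced, the first idea was to install a cu13 PyTorch. What was tried:
1. Point `[[tool.uv.index]]` in `pyproject.toml` at `https://download.pytorch.org/whl/cu130`
2. Make the same change in the vendored sglang's `pyproject.toml` (the root project's sources are not propagated to a transitive editable dependency)
3. `uv sync` successfully installed `torch==2.9.1+cu130` and `nvidia-{nccl,nvjitlink,nvshmem,cusparselt,nvtx}-cu13`
4. **But driver 570 does not support the cu13 runtime**: `torch.cuda.is_available()=False`, and CUDA init reports `driver too old (12080)`
→ The cu13 path needs **driver 580+**. We have no root access and other people are using the machine, so this was abandoned. This branch has been rolled back to the cu12 stack (pyproject is clean).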
### Pitfall 4: `--disable-overlap-schedule` is not enough
The first smoke run crashed at `resolve_future_token_ids.cuh:49` on the `event_loop_overlap_disagg_prefill` path, so an overlap-mode-specific JIT kernel problem was suspected.
cli.py added `--disable-overlap-schedule` to the PD workers and the event loop switched to `event_loop_normal_disagg_prefill`, but **it then crashed in a different kernel, `fused_inplace_qknorm`**, with exactly the same error code (`cudaErrorInsufficientDriver`).
→ The problem is not overlap-specific: **the whole vendored sglang `jit_kernel/` module is incompatible with driver 570**. Any JIT kernel crashes at the `cudaOccupancyMaxActiveBlocksPerMultiprocessor` call in `runtime.cuh:21` (the driver feature version check fails during CUDA runtime initialization).
Keeping `--disable-overlap-schedule` does no harm and avoids similar overlap-path-specific problems later. This branch keeps it in `cli.py:_topology_from_args`.
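### Pitfall 5: pip sgl_kernel and vendored sglang/jit_kernel/ are two separate systems
`pip install sglang-kernel` provides `.venv/lib/.../sgl_kernel/{flash_ops,flashmla_ops,spatial_ops}.abi3.so`, which are AOT precompiled artifacts.
`third_party/sglang/python/sglang/jit_kernel/` is **another, separate JIT module** built into vendored SGLang 0.5.10 and compiled at runtime with tvm_ffi. The smoke run crashed in the vendored jit_kernel, so **downgrading pip sgl_kernel does not help** (0.4.0 / 0.4.1 were tested and crash the same way).
### Pitfall 6: the `nvidia-cuda-nvcc-cu12` PyPI package does not install an nvcc binary
Once the cu13 nvcc was identified as the root cause, the first reaction was to install the cu12 nvcc package from PyPI: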
```bash
uv pip install nvidia-cuda-nvcc-cu12==12.8.93
```
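After installation, `find .venv -name nvcc` **returns nothing**: this PyPI package only installs `ptxas` and `nvvm/`, and **has no nvcc binary** (NVIDIA does not ship nvcc on PyPI because of distribution restrictions).
→ A complete nvcc must come from NVIDIA's official `.run` installer or from apt. The `.run` installer can install into a user-writable path without root, which is the route this repository takes.
### Pitfall 7: tvm_ffi invokes nvcc through ninja
The vendored sglang `jit_kernel/` uses `tvm_ffi.cpp.extension`, whose source is at `~/.local/lib/python3.12/site-packages/tvm_ffi/cpp/extension.py`. The key path: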
```python
def _find_cuda_home() -> str:
    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    if cuda_home is None:
        nvcc_path = shutil.which("nvcc")
        if nvcc_path is not None:
            cuda_home = str(Path(nvcc_path).parent.parent)
    ...
```
It then builds the ninja file:
```
nvcc = {_find_cuda_home()}/bin/nvcc
```
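**Setting `CUDA_HOME=$HOME/cuda-12.8` is enough to hook the whole compile chain**, and `scripts/setup_env.sh` already does so.
JIT build artifacts are cached in `~/.cache/tvm-ffi/sgl_kernel_jit_*/*.so`. If anything was previously compiled with the cu13 nvcc, run `rm -rf ~/.cache/tvm-ffi/sgl_kernel_jit_*` first and then rebuild with cu12.8.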
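To see which nvcc a given environment would end up with, the lookup order from the excerpt can be mirrored in a few lines. This is a sketch modeled on that excerpt, not tvm_ffi's actual code path beyond what is quoted; `resolve_nvcc` is a hypothetical helper that takes the environment as an explicit mapping so it can be checked without touching the real shell.

```python
import shutil
from pathlib import Path
from typing import Mapping, Optional


def resolve_nvcc(env: Mapping[str, str]) -> Optional[Path]:
    """Returns the nvcc path implied by env, following CUDA_HOME, CUDA_PATH, then PATH."""
    cuda_home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    if cuda_home is None:
        # Fall back to whichever nvcc is first on PATH, as in the excerpt.
        nvcc_path = shutil.which("nvcc", path=env.get("PATH"))
        if nvcc_path is None:
            return None
        cuda_home = str(Path(nvcc_path).parent.parent)
    return Path(cuda_home) / "bin" / "nvcc"
```

Calling `resolve_nvcc(os.environ)` before launching a sweep is a cheap way to confirm that the shell was sourced and the cu12.8 toolkit, not `/usr/local/cuda-13.0`, will be used.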
### Pitfall 8: the mooncake import path does not match the onboarding document
The environment check in `docs/ONBOARDING_NEXT_AGENT_ZH.md` §3.3 says:
```python
from mooncake_transfer_engine import TransferEngine
```
But the actual import path of the PyPI `mooncake-transfer-engine 0.3.10.post2` wheel is:
```python
from mooncake.engine import TransferEngine
```
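The first attempt with `from mooncake_transfer_engine` raises `ModuleNotFoundError`. **The ONBOARDING document should be updated** (this branch leaves onboarding untouched; the decision is left to the main agent).
### Pitfall 9: importing mooncake.engine requires libcudart.so.12
In a fresh shell (without sourcing setup_env.sh), `from mooncake.engine import TransferEngine` fails with: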
```
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
```
mooncake's `engine.so` is a cu12 build that dynamically links `libcudart.so.12`. The library exists in the venv but has to be exposed through LD_LIBRARY_PATH. `scripts/setup_env.sh` already adds it.
### Pitfall 10: the Inferact dataset schema does not match what agentic-pd-hybrid expects
`huggingface.co/datasets/Inferact/codex_swebenchpro_traces` is in ShareGPT format (`{"from": "human/gpt", "value": "<text>"}`) and contains no token counts, hash_ids, or timestamps.
`agentic-pd-hybrid` expects JSONL with `chat_id, parent_chat_id, timestamp, input_length, output_length, type, turn, hash_ids[]`.
→ `scripts/convert_inferact_to_trace.py` has been written for this: it tokenizes (with the model's own tokenizer), cuts 24-token blocks with a rolling hash, and synthesizes timestamps. 610 trials × 33 turns take about 37min to process and yield 20,230 reqs (exactly matching the "20,230 total LLM calls" in the Inferact README).
The output is `outputs/inferact_codex_swebenchpro.jsonl` (1.3 GB, excluded from the repository by `.gitignore`).
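For readers unfamiliar with block hashing, the idea behind `hash_ids[]` can be sketched as a chained hash over fixed-size token blocks, so that two prompts share leading `hash_ids` exactly as long as they share a token prefix. The hash function, the 64-bit truncation, and dropping the trailing partial block are assumptions made for this illustration; the source does not show how `convert_inferact_to_trace.py` actually does it.

```python
import hashlib
from typing import List, Sequence

BLOCK_TOKENS = 24  # block size stated in the text


def block_hash_ids(token_ids: Sequence[int], block: int = BLOCK_TOKENS) -> List[int]:
    """Chained hash per full block: each id depends on the block and on all blocks before it."""
    hash_ids: List[int] = []
    prev = b""
    full_len = len(token_ids) - len(token_ids) % block  # ignore the partial tail (assumption)
    for start in range(0, full_len, block):
        chunk = token_ids[start:start + block]
        digest = hashlib.sha256(prev + ",".join(map(str, chunk)).encode()).digest()
        hash_ids.append(int.from_bytes(digest[:8], "big"))
        prev = digest
    return hash_ids
```

Two requests that share their first 48 tokens get the same first two ids and diverge from the third block on, which is what lets a router estimate prefix overlap from `hash_ids` alone.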
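### Pitfall 11: sampling defaults to `--session-sample-rate 0.01`
`benchmark-live` performs sampling internally before it runs. The default is 1%, meaning only about 1 session is drawn out of 50. With the mini smoke trace, 4 sessions × 1% = 0 → `ValueError: Sampling produced no requests`.
→ The smoke test command explicitly adds `--session-sample-rate 1.0 --target-duration-s 600`.
---
## 5. Notes for the next agent
Before running an E1 / E2 sweep, **the first thing to do in every shell** is: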
```bash
cd /path/to/agentic-pd-hybrid
source scripts/setup_env.sh
```
Then use the sweep scripts from ONBOARDING §3 (take `scripts/sweep_ts1_migration_v2.sh` as the template). Note the following machine-specific changes:
1. **MODEL path**: change it to `/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507` (the `/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/...` path in onboarding does not exist).
2. **TRACE path**: `outputs/qwen35-swebench-50sess.jsonl` does not exist; use `outputs/inferact_codex_swebenchpro.jsonl` (produced once the converter has finished).
3. **`--ib-device`**: pick `mlx5_60` (NUMA-local to cuda:0) or choose according to the experiment; the `mlx5_0` in onboarding does not exist on this machine.
4. **Keep `--disable-overlap-schedule` in cli.py**; do not remove it. In theory the cu12.8 toolchain should let overlap run as well, but it has not been verified that the overlap path has no other latent problems, so keeping the flag is zero-cost insurance.
---
## Appendix A: code changes on this branch
- `pyproject.toml`: the sglang dependency now uses a `[tool.uv.sources]` path source pointing at `third_party/sglang/python` (editable).
- `src/agentic_pd_hybrid/cli.py:_topology_from_args`: automatically adds `--disable-overlap-schedule` to prefill/decode workers.
- `scripts/setup_env.sh`: env wrapper, to be `source`d once per shell.
- `scripts/convert_inferact_to_trace.py`: Inferact ShareGPT → agentic-pd-hybrid JSONL schema converter.
- `docs/H200_DRIVER570_SETUP_ZH.md`: this document.
## Appendix B: artifacts excluded by `.gitignore`
- `outputs/inferact_codex_swebenchpro.jsonl` (1.3 GB): converter output; regenerate with `scripts/convert_inferact_to_trace.py`
- `outputs/smoke_rdma/` (mini trace + smoke run artifacts)
- `third_party/codex_swebenchpro_traces/` (209 MB, HF dataset download): re-download with `hf download Inferact/codex_swebenchpro_traces --repo-type dataset --local-dir third_party/codex_swebenchpro_traces`
- `~/cuda-12.8/`: cu12.8 toolkit; reinstall with step (2) of §1
- `.venv/`: rebuild with `uv sync`

docs/INDEX_ZH.md Normal file

@@ -0,0 +1,119 @@
# Documentation index
**Purpose**: let any collaborator find the document they need within 10 minutes, and let reviewers know what to read first.
---
## 0. Three documents if you are short on time
Reading these in order is enough to join the discussion:
1. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md): current project progress, weak points, roadmap.
2. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md): formalization of the algorithm (Algorithm 1/2/3 + Theorem 1/2).
3. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §0 + §6: current win/lose snapshot of v2.
---
## 1. By topic
### 1.1 Progress / status
| Document | Content |
|---|---|
| [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) | Cross-branch consolidation + roadmap (the main entry point of this branch) |
| [PROJECT_OVERVIEW.md](PROJECT_OVERVIEW.md) | Project goals + terminology distinguishing the three mechanisms (pd-disagg / pd-colo / kvcache-centric) |
| [ONBOARDING_NEXT_AGENT_ZH.md](ONBOARDING_NEXT_AGENT_ZH.md) | 30-minute onboarding handbook for the successor agent (from `kvc-debug-journey-v1-to-v4`) |
### 1.2 Algorithm / formalization
| Document | Content |
|---|---|
| [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md) | Algorithm 1 (Route) / 2 (Admit) / 3 (Dispatch) + Theorem 1 (no starvation) + Theorem 2 (lower bound on fast-path hits) |
| [MIGRATION_V1_FINDINGS_ZH.md](MIGRATION_V1_FINDINGS_ZH.md) | Measurements of the v1 thrashing pathology + why reset-on-success is the key fix |
### 1.3 Experiment results
| Document | Content |
|---|---|
| [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) | SWE-Bench, 50 sessions, ts=1: v2 vs 4DP CA with 6/8 wins, plus why TTFT p99 lags |
| [V2_RESULTS_ZH.md](V2_RESULTS_ZH.md) | Original v2 results report (headline numbers are slightly optimistic; read the deep analysis alongside it) |
| [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) | E1 (naive 1P3D + kv-aware) vs E2 (KVC v2) on H200 + RDMA; forensic of the 80% failure in E2 |
| [E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) | E3 (+load-floor bonus): an SGLang patch invariant crash triggered at 16 min |
| [E1_E2_FIX_DESIGN_ZH.md](E1_E2_FIX_DESIGN_ZH.md) | Fix design for Q1 (mooncake death) + Q2 (cold D2) |
### 1.4 Key design discussions in progress
| Document | Content |
|---|---|
| [KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) | Architecture-level reflection: session-level eviction conflicts with the KVC continuity design |
| [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) | Concrete API / steps / test plan for the block-level eviction refactor (new on this branch) |
| [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) | Timeline of the reseed slow path + forensic of the D→P sync gap |
| [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md) | Interface contract, staleness budget, and rollout phases for D→P sync (new on this branch) |
### 1.5 Evaluation / methodology
| Document | Content |
|---|---|
| [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md) | Paper-quality evaluation protocol (N, CI, paired, stratify, baseline list, trace mix); new on this branch |
| [REFACTOR_PLAN_V1_ZH.md](REFACTOR_PLAN_V1_ZH.md) | Why we switched from ts=10 to ts=1 |
| [TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md](TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md) | List of structural problems from the ts=10 era (mostly superseded) |
### 1.6 Engineering debt / failure modes
| Document | Content |
|---|---|
| [SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) | Classification inventory of the 785 lines of vendored SGLang patch (MUST-HAVE / WORKAROUND / EXPERIMENTAL / INSTRUMENTATION); new on this branch |
| [FAILURE_MODES_ZH.md](FAILURE_MODES_ZH.md) | Diagnosis + mitigation + real fix for 5 failure modes (mooncake cascade / cold-D / evict storm / reseed storm / E3 invariant); new on this branch |
### 1.7 Environment
| Document | Content |
|---|---|
| [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md) | H200 + driver 570 + cu12.8 environment setup + 11 lessons learned |
### 1.8 Archive (historical reference only)
The content under `docs/archive/` has been superseded by newer documents and need not be read:
- `AGENTIC_FIT_ANALYSIS_ZH.md`, `STRUCTURAL_VALIDATION_REPORT_ZH.md`: early ts=10 analyses.
- `KVCACHE_CENTRIC_PROGRESS_ZH.md`: early project snapshot.
- `KVC_DEBUG_JOURNEY_V1_TO_V5.md`, `V5_PROFILE_INVESTIGATION_ZH.md`: tuning notes from v1–v5.
- `REFACTOR_PLAN_ZH.md`: v0 refactor plan.
- `SWEBENCH_EXPERIMENT_*.md`: early experiment logs.
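---
## 2. Recommended reading paths by role
### 2.1 I am a newly assigned SWE/research agent
1. Start with the 3 documents in §0 of this index.
2. Then read [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3 (weak points) + §5 (GPU-free work list).
3. Pick a Milestone 1 sub-item and start. `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` and `docs/D_TO_P_SYNC_CONTRACT_ZH.md` are the two engineering tracks that are already prepared.
### 2.2 I am a paper reviewer / pre-reading for review
1. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md): algorithm + theorems.
2. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md): core measured comparison + the limitations we identified ourselves.
3. [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md): ablation on real hardware + RDMA (including the forensic of the 80% failure in E2, which shows that we can explain our failures).
4. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3: the weak points and future work we list ourselves (nothing is hidden).
### 2.3 I am a student who wants to reproduce the experiments
1. [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md).
2. [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md): which sweeps to run and under which protocol to compare.
3. `scripts/sweep_ts1_migration_v2.sh`: main v2 sweep; `scripts/sweep_e1_naive_1p3d.sh` / `scripts/sweep_e2_kvc_v2_rdma.sh`: E1/E2 ablation.
### 2.4 I want to look at the control plane and admission
1. `src/agentic_pd_hybrid/policies.py`: `KvAwarePolicy.select` is the implementation of Algorithm 1.
2. `src/agentic_pd_hybrid/replay.py`: `_invoke_session_direct` / `_invoke_kvcache_seeded_router` are the orchestration of Algorithm 3.
3. `third_party/sglang/python/sglang/srt/managers/scheduler.py`: `_admit_direct_append` on the D side is the implementation of Algorithm 2.
---
## 3. Maintenance conventions for this index
- Every new design / experiment doc must get a row in the §1 tables of this index.
- When a document is archived (moved to `docs/archive/`), delete its entry here or mark it "archived".
- This index carries no substantive content and only serves navigation; any in-depth explanation lives in the documents it points to.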


@@ -0,0 +1,228 @@
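# KVC Eviction Granularity: Design Review (architecture level)
**Date**: 2026-05-12
**Status**: architecture review / pending design discussion
**Companion**: `docs/E1_E2_RESULTS_ZH.md`, `docs/E3_FINDINGS_ZH.md`, `docs/E1_E2_FIX_DESIGN_ZH.md`
**Branch**: `h200-cu130`
This document is a high-level architectural reflection after the E2 → E3 iterations; **it is not yet another fix design**. Over the previous E2 → E3 rounds I kept adding local patches (load-floor bonus, Fix A skip-zero-extend, tuning migration_reject_threshold, and so on), but the E3 measurements force us to admit that, seen as a whole, these patches trace **a trajectory in which KVC degrades toward DP / naive PD-disagg**.
---
## 0. TL;DR
1. **The value proposition of KVC** is "a session is pinned to one D, KV accumulates continuously across turns, and the direct-to-D fast path gives 0.04 s TTFT".
2. **On trim, `SessionAwareCache.release_session` frees the entire session-exclusive tail at once**: in E3 a single trim freed **67,726 tokens** on average (samples: 35K / 38K / 40K / 86K / 87K), not "a few leaf blocks".
3. When an evicted session comes back, it must **re-prefill 50-90K from the client's original prompt** plus a 5-9 GB mooncake transfer → **exactly the same as naive PD-disagg**.
4. → In the saturation regime, KVC's cache continuity design is cancelled out by its own eviction. **Session-level eviction conflicts with KVC's design intent.**
5. The right direction is not to pile on patches but to **change the eviction granularity**: have the decode output of a streaming session **progressively commit into the radix tree**, and let SGLang's standard block-level LRU shed the oldest leaves. SessionSlot reduces to pure metadata.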
---
## 1. What we got right and what we missed
### KVC's design promise (from `KVC_ROUTER_ALGORITHM.md` §1)
| Property | Design intent |
|---|---|
| Session pinning | Session `s` is pinned to the single D `pin[s]`; all turns of the same session accumulate KV on that D |
| Direct-to-D fast path | `req.session ∈ M_d ∧ append_len ≤ τ_append ∧ cap_ok` → only the new tokens are appended, **without a P→D mooncake transfer** |
| TTFT advantage | append-only path TTFT ≈ 40 ms (historical v2 fast-path p50 on SWE-Bench) |
| Concentrated rather than fragmented cache | The cache of one session is concentrated on one D, giving a high hit rate |
### What the measurements actually show (E3, killed at 1h12min)
| Metric | Measured | Deviation from the design promise |
|---|---:|---|
| Number of evictions | **90** | The design assumes "once bound, a session keeps accumulating" |
| Average freed per eviction | **67,726 tokens** | Not "a few leaf blocks" but the whole session tail |
| Total freed | **6,095,375 tokens** | In 1h12min, trashed KV worth ≈ 8 session-pool capacities |
| Sessions that triggered a reseed | 25 / 50 (50%) | Each of these sessions paid one 50-90K re-prefill per evict-revisit |
| Average time per reseed | 3-7 s (P prefill + mooncake) | On par with naive PD-disagg |
**E1 for comparison**: 0 evictions, 0 retracts, all 50 sessions completed. E1 used the `pd-disaggregation` mechanism with **no KVC layer and no admission RPC**, yet it preserved cache continuity (router-side stickiness keeps a session where it is).
> **Irony**: E1 (naive 1P3D + kv-aware policy) **unexpectedly** comes closer to KVC's design intent than E3 (KVC v2 + load-floor + RDMA), because E1 has no admission feedback loop and therefore nothing triggers those 90 session-level evictions.
---
## 2. Why session-level eviction is wrong
### Measured semantics of `release_session` (`session_aware_cache.py:250-281`)
```python
def release_session(self, session_id: str):
    slot = self.slots.pop(session_id, None)
    ...
    if slot.last_node is not None:
        self.inner.dec_lock_ref(slot.last_node, ...)  # release the radix lock
    if slot.is_holding_kv:
        start = slot.cache_protected_len
        end = slot.kv_allocated_len
        if start < end:
            kv_indices = self.req_to_token_pool.req_to_token[
                slot.req_pool_idx, start:end
            ]
            self.token_to_kv_pool_allocator.free(kv_indices)  # explicitly free one KV range
    ...
```
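`[cache_protected_len, kv_allocated_len)` is the **session-exclusive tail**: everything accumulated after the first turn was committed to the radix tree, i.e. the decode output plus the extends of later turns. On the Inferact workload:
- `cache_protected_len` ≈ the boilerplate part committed by the first turn (~12K)
- `kv_allocated_len` ≈ 50-100K (accumulated over multiple turns)
- **Freed range = 38-88K**
This KV **never entered the radix tree**, so it also misses out on the incremental shedding of radix block-level LRU. `release_session` cuts it all in one stroke.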
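The 38-88K figure is simply the width of that half-open range. A minimal sketch of the arithmetic, using the approximate values quoted above (the function name is illustrative):

```python
def freed_tokens(cache_protected_len: int, kv_allocated_len: int) -> int:
    """Tokens freed by one release: the width of [cache_protected_len, kv_allocated_len)."""
    return max(0, kv_allocated_len - cache_protected_len)


# Approximate Inferact values from the text: ~12K protected, 50-100K allocated.
low = freed_tokens(12_000, 50_000)
high = freed_tokens(12_000, 100_000)
```

`low` and `high` evaluate to 38,000 and 88,000 tokens, which brackets the measured per-eviction samples of 35K to 87K.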
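### The essential difference from SGLang's standard radix LRU
SGLang's standard `inner.evict()` (the `base_prefix_cache.py` interface, implemented by RadixCache):
```
Sort nodes by last_access_time and evict starting from the leaves (evicting an interior node would break the tree structure)
Each step frees the KV indices of one leaf node
Nodes with lock_ref > 0 cannot be evicted
```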
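The three rules above can be modeled on a toy tree. The sketch below is a simplified illustration, not SGLang's `RadixCache` code: `Node`, `add_child`, and `evict` are hypothetical names, and a real radix cache keys children by token sequences rather than by a string name.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    num_tokens: int
    last_access_time: float
    lock_ref: int = 0
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def add_child(parent: Node, child: Node) -> Node:
    child.parent = parent
    parent.children[child.name] = child
    return child


def evict(root: Node, required_tokens: int) -> List[str]:
    """Evicts unlocked leaves, oldest first, until required_tokens are freed."""
    # Collect the current leaves with an explicit stack.
    heap = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children.values())
        if node is not root and not node.children:
            heap.append((node.last_access_time, id(node), node))
    heapq.heapify(heap)

    freed = 0
    evicted: List[str] = []
    while heap and freed < required_tokens:
        _, _, node = heapq.heappop(heap)
        if node.lock_ref > 0:
            continue  # locked nodes are never evicted
        freed += node.num_tokens
        evicted.append(node.name)
        parent = node.parent
        del parent.children[node.name]
        # A parent that just lost its last child becomes an eviction candidate.
        if parent is not root and not parent.children:
            heapq.heappush(heap, (parent.last_access_time, id(parent), parent))
    return evicted
```

On a chain `root → a → b → c` of 24-token nodes, asking for 30 tokens removes `c` and then `b` and leaves `a` intact, so a returning session only loses its tail instead of the whole range.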
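**Property comparison**:
| | session-level (current) | block-level (SGLang radix) |
|---|---|---|
| Granularity of one release | Whole session tail (35-87K) | One leaf node (~24 tokens / page-size) |
| Recent prefix retained | ❌ All lost | ✅ Retained (recent access → newer timestamp → not evicted first) |
| Evict-revisit cost | 50-90K re-prefill | Only the evicted leaf portion (≪ 50K) |
| Relation to session lifecycle | Tightly coupled (it is the lifecycle exit action) | Decoupled (lifecycle only manages lock_ref) |
### How it came to this: SessionAwareCache mixes two responsibilities
By design, `SessionAwareCache` carries **two responsibilities that should be separate**:
1. **Session lifecycle tracking** (reasonable): a streaming session reuses KV across multiple reqs, so the fields `(req_pool_idx, kv_committed_len, kv_allocated_len, last_node)` must be kept between turns and restored into the next turn's req.
2. **Eviction granularity decision** (the problem): it treats the session as the smallest unit of eviction and bypasses the leaf-by-leaf incremental shedding of SGLang's standard LRU.
The second responsibility should never have lived in SessionAwareCache. The SGLang radix cache can already handle block-level LRU, provided the session's KV has actually entered the radix tree. But **because the session-exclusive tail is not committed into the radix tree**, radix LRU cannot see it, and release_session can only free it in one large chunk.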
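---
## 3. The overall trajectory of our earlier patches
Reviewed along the commit timeline, each step appears to fix the issue at hand, yet the overall direction is a degradation from KVC to DP:
| Iteration | Change | Local goal | Big-picture impact |
|---|---|---|---|
| E2 baseline | mechanism=kvcache-centric, worker admission | Produce the KVC v2 headline numbers | D2 cold + cascade → 1054 failures (KVC's design premise collapses) |
| E3 load-floor bonus | Spread fresh sessions evenly onto D2 | Remove the cold-start bias | Triggers migration → 25 sessions reseed → exposes the eviction granularity problem |
| E3 → Fix A | Fix the `fill_ids<prefix_indices` invariant in vendored SGLang `prepare_for_extend` | Fix the decode-1 assertion crash | Patches a local bug; does not touch the eviction design |
| **My earlier proposal: disable migration** | `--kvcache-migration-reject-threshold 0` | "Keep sessions from moving" | **Would degrade KVC into pd-disagg + load-floor** (the admission RPC is still there but migration has no effect) |
| **Even earlier proposal: disable admission** | Turn off the admission RPC | "Save the RPC overhead" | **Directly cuts KVC's direct-to-D fast path** (Algorithm 2 of KVC_ROUTER_ALGORITHM.md §3.2 would no longer exist) |
Each time, the user correctly stopped further degradation. **Nobody was examining the root problem, eviction granularity**, until now.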
---
## 4. The right direction (rough sketch)
**Core idea**: have the decode output of a streaming session **progressively commit into the radix tree**, let SGLang's standard radix LRU shed the oldest leaves, and reduce SessionSlot to pure metadata.
### 4.1 Target behavior
| Scenario | Current behavior | Target behavior |
|---|---|---|
| A session has accumulated 50K of KV and the D is full | release_session frees 38K at once (the whole session-exclusive tail) | radix LRU evicts the oldest leaf (possibly the tail of the first-turn boilerplate, ~24 tokens) |
| An evicted session comes back | Must reseed 50K (P prefill + mooncake) | Re-prefills only the evicted leaf portion (e.g. ~5K) |
| TTFT impact on an evicted session | 50-90K reseed = 3-7 s | 5K append-prefill = ~200 ms |
| Sessions that are not evicted | Later turns of the session are append-only | Same, append-only (unchanged) |
| KVC fast-path hit rate | 91.6% (historical SWE-Bench) / 38% (E3 Inferact, because of evict-revisit) | Should stay above 85% even under saturation |
### 4.2 Required refactor scope
Ordered by dependency; each step can be done independently but they are coupled:
1. **Incrementally insert streaming-session decode output into the radix tree** (vendored SGLang)
- Current: decode output accumulates along `kv_allocated_len`, but the radix tree only records up to `cache_protected_len`
- Change: when each turn finishes, insert the new decode tail into the radix tree through the radix `cache_finished_req` path
- Impact: a streaming session has a continuously growing chain in the radix tree, one node per 24-token block
- Touches: the insert path of `radix_cache.py`, the cache_finished_req hook of `schedule_batch.py`, SessionSlot.save_from_req
2. **Reduce SessionSlot to pure metadata**
- Current: SessionSlot owns `req_pool_idx` plus the KV indices in the range `[cache_protected_len, kv_allocated_len)`
- Change: SessionSlot only holds `last_node` (pointing to a node in the radix tree) and the lock_ref state, and does not manage a KV range directly
- Impact: `restore_to_req` rebuilds the req state from radix `match_prefix` instead of reusing req_pool_idx directly
3. **Change `release_session` to only dec_lock_ref + delete the slot metadata**
- Current: it also frees the KV in the range `[cache_protected_len, kv_allocated_len)`
- Change: only dec_lock_ref → let radix LRU evict naturally
- Impact: `maybe_trim_decode_session_cache` no longer "releases by session" and instead uses SGLang's existing `tree_cache.evict(required_tokens)`
4. **Base the capacity check of `admit_direct_append` on the radix-resident length**
- Current: `current_tokens = session.resident_tokens` (from SessionSlot)
- Change: `current_tokens` = the length actually committed for the session in the radix tree = `match_prefix(session.last_node).matched_length`
- Impact: the "uncached = input - radix-resident" estimate used by admission becomes more precise; in the evict-revisit case admission reflects "only part was lost" rather than "everything was lost"
5. **Redesign the streaming-session correction in `prepare_for_extend`**
- Current: the fill_ids/prefix_indices invariant patched by Fix A is a complex fixup built around the session-exclusive tail
- Change: if SessionSlot no longer owns a separate KV range, the whole correction path needs to be rewritten or may no longer be necessary
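### 4.3 Relation to the D→P sync of onboarding §4.4
The incremental D→P sync described in `docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` §4 is a fix **for the cost of a reseed itself** (keep the P-side backup up to date so that P does not have to re-prefill on reseed).
The eviction granularity described in §4 of this document is a fix **for how often reseeds are triggered** (avoid evicting a whole session range at once, so fewer evict-revisits occur).
**The two are orthogonal and complementary**:
- Evict-granularity fix alone: reseed frequency drops, but the occasional reseed is still slow
- D→P sync alone: each reseed is faster, but reseeds are still triggered frequently
- Both: reseeds almost disappear, and when one does occur it is fast
Each is on the order of 1-2 weeks of engineering, and they can start in parallel.
### 4.4 This is not a local patch
Note that nothing in the §4.2 list is a local change such as "tune a hyperparameter" or "add a CLI flag". This is a redesign of the invariants of vendored SGLang's internal data structures; it cannot be solved with a more precise K value or a wider substring filter.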
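---
## 5. Things we should stop doing (anti-patterns)
To keep the next agent from going down the same local-patch path:
1. **Do not keep adjusting `migration_reject_threshold`**: this parameter only controls "how long after a reject to switch D" and has nothing to do with eviction granularity. Lowering it makes migration more frequent → more reseeds → worse. Raising it → the blacklist becomes permanent (the v1 thrashing problem).
2. **Do not disable migration**: it would degrade KVC to sticky pd-disagg and lose v2's overall reset-on-success design.
3. **Do not disable admission**: it would cut the direct-to-D fast path, KVC's only differentiating advantage.
4. **Do not keep tuning `_decode_session_cache_low_watermark_tokens`**: raising it makes LRU more aggressive → more evictions → worse. Lowering it keeps LRU from firing → the pool runs into retract decode → worse. It only treats symptoms.
5. **Do not add more `_ADMISSION_REJECTION_SUBSTRINGS`**: the string filter bug fixed earlier (Q2 forensic) made the migration counter actually increment, which in turn exposed the reseed cost of migration itself. Fixing that bug was right, but it showed that the migration mechanism itself has negative returns in the saturated regime.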
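---
## 6. Recommended decision points
| # | Question | Recommendation |
|---|---|---|
| D1 | Accept this document's diagnosis (session-level eviction is the root problem)? | **Yes** |
| D2 | Pause the E1/E2/E3 ablation thread and focus on the §4.2 refactor? | **Yes** (the current path spends GPU time confirming known conclusions) |
| D3 | Do the refactor on the vendored SGLang mainline (kvc-debug-journey-v1-to-v4) or on a new branch? | New branch `feat/block-level-evict` (isolates risk) |
| D4 | Start the D→P sync of §4.3 at the same time (the `feat/d-to-p-sync` branch is already reserved)? | Depends on team bandwidth |
| D5 | How should the paper describe this externally before the refactor is done? | State that "the eviction behavior of the v2 series in the saturation regime is an identified limitation; a fix is proposed in §future-work" |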
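---
## 7. Handover to the next agent
**If you take over the §4.2 refactor**, read in this order:
1. `KVC_ROUTER_ALGORITHM.md` §2-3: KVC's design intent
2. §2.1, §2.2 of this document: measured eviction behavior
3. Vendored SGLang `mem_cache/radix_cache.py`: implementation details of the standard radix LRU
4. Vendored SGLang `mem_cache/session_aware_cache.py`: the current SessionSlot design
5. Vendored SGLang `managers/schedule_batch.py`: how prepare_for_extend uses session state
6. `docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` §4: engineering scope of D→P sync (complementary work)
**Key invariant**: SessionSlot.restore_to_req must stay idempotent (a failed chunked prefill may retry several times). Any refactor must test this invariant.
**Key testing pattern**: unit-test the behavior of a streaming session under LRU pressure. Concretely: inject a fake `inner.evict()` that returns a state in which some leaves have been evicted, and assert that SessionSlot.restore_to_req still returns a valid req state (no assertion raised, and a reasonable re-prefill length).
---
**Key takeaway**: our first 3 rounds of patches all addressed secondary problems exposed by saturation (cold-D bias, the admission string bug, streaming-session correction edge cases), but **the real primary problem is that SessionAwareCache mixes session lifecycle tracking with the eviction granularity decision**. A session is a lifecycle boundary and **should not be an eviction boundary**. Eviction should be handed back to the block-level radix LRU that SGLang already does well.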


@@ -0,0 +1,165 @@
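# Vendored SGLang Patch: Classification Inventory
**Date**: 2026-05-13
**Baseline**: clean SGLang v0.5.10 snapshot @ `bded083`
**Current HEAD**: `origin/h200-cu130` + this branch (785 lines added / 17 lines deleted / 10 files)
**Purpose**: let reviewers and the next collaborator see at a glance which patches are core mechanism, which are workarounds, and which can be retired after the refactor. Corresponds to the engineering-debt item in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.2 / §S6.
---
## 0. TL;DR
| Category | Files | Lines (est.) | Fate |
|---|---:|---:|---|
| MUST-HAVE: core mechanism (Algorithm 1/2/3, streaming session lifecycle, admit RPC) | 6 | ~450 | Kept long-term; the core of the paper's claim |
| WORKAROUND: patches for identified latent problems, to be retired after the refactor | 2 | ~150 | Largely deleted once the block-level eviction refactor is complete |
| EXPERIMENTAL: features whose loop is not closed; the paper does not depend on them | 1 | ~60 | Can be retired or kept as a future-work hook |
| INSTRUMENTATION: diagnostics / logging | 1 | ~50 | Kept, but should be isolated into a debug build |
| MINOR: miscellaneous | 1 | ~3 | Does not affect decisions |
**Key guidance**: when the block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) is complete, the ~150 lines in the WORKAROUND category should be deleted along with it. The `schedule_batch.py` invariant landmine triggered by E3 is a product of this path; the right answer is to fix the eviction granularity, not the engine.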
---
## 1. Per-file inventory
### 1.1 `mem_cache/session_aware_cache.py`: MUST-HAVE *(pending refactor)*
| Item | Content | Introduced in | Category |
|---|---|---|---|
| `SessionSlot` dataclass | Metadata for reusing KV across turns of a streaming session | b8e6f13 | MUST-HAVE |
| `last_access_time` field | Needed for LRU decisions | 6e5ed8d | MUST-HAVE |
| Streaming branches of `match_prefix` / `cache_finished_req` / `cache_unfinished_req` | Session reuse fast path | b8e6f13 | **MUST-HAVE → pending refactor** (semantics change substantially after block-level eviction) |
| `release_session` calling `free(kv_indices)` directly | Returns KV in one shot when the session exits | b8e6f13 | **WORKAROUND → replace** (after the refactor it only calls `dec_lock_ref`) |
| `slot_held_tokens` / `get_session_status` / `list_session_statuses` | Status queries | 6e5ed8d | MUST-HAVE |
**Note**: this file is the hub of the KVC design. It is exactly what the block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.1–§3.6) reworks. The 5 KV-ownership fields of `SessionSlot` (`req_pool_idx` / `kv_committed_len` / `kv_allocated_len` / `cache_protected_len` / `swa_evicted_seqlen`) should be removed after the refactor; this part **will be marked separately in the commit message** to make rollback easy.
### 1.2 `managers/scheduler.py`: mixed categories
The Algorithm 2 implementation on the D worker side, containing several independent patches. Classified at line level:
| Function / line range | Content | Category | When it can be retired |
|---|---|---|---|
| `admit_direct_append(...)` | D-side admission RPC handler for Algorithm 2 | **MUST-HAVE** | Never (core of the paper) |
| `_should_allow_local_prefill_on_decode(req)` | Decides whether a decode worker accepts a local append-prefill without bootstrap | **MUST-HAVE** | Never |
| `_decode_session_cache_low_watermark_tokens()` | Reads the watermark parameter | **WORKAROUND** | Replaced by radix LRU after block-level eviction |
| `_decode_session_cache_target_available_tokens()` | Computes the target number of available tokens | **WORKAROUND** | Same as above |
| `maybe_trim_decode_session_cache(...)` | Proactively trims sessions, triggering `release_session` | **WORKAROUND** | Same as above; after the refactor radix LRU sheds naturally and trim is no longer needed |
| `_compute_backpressure_pause_hint(...)` | Pause hint for the router | **EXPERIMENTAL** | The signal loop is not closed ([REAL_ALI_KVC_EXPERIMENT_LOG_ZH.md](../docs/archive/) §4.3), roadmap §S10; can be kept as a future-work hook |
| `_compute_pool_breakdown_for_diagnostics()` | Pool state snapshot for `/server_info` | **INSTRUMENTATION** | Kept long-term, but gating it behind a flag is recommended |
### 1.3 `managers/schedule_batch.py`: WORKAROUND (to be deleted)
| Item | Content | Introduced in | Category |
|---|---|---|---|
| streaming-session `extend_input_len` correction (lines ~1572–1585) | Sets extend_input_len to 0 when `fill_ids < prefix_indices` | b8e6f13 | **WORKAROUND** |
| pre-filter pass dropping `fill_ids < prefix_indices` reqs | Hotfix after E3 triggered the assertion (commit 986f351) | 986f351 | **WORKAROUND** |
| Tolerance logic for the invariant assert `seq_len - pre_len == req.extend_input_len` | Companion to the correction | b8e6f13 | **WORKAROUND** |
**All** ~85 lines **should be deleted as a whole** once the block-level eviction refactor is complete. `BLOCK_LEVEL_EVICTION_DESIGN_ZH §3.7` explains that after the refactor this invariant necessarily holds by construction, so the correction path does not need to exist. The E3 landmine ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2) is a product of this workaround.
### 1.4 `managers/session_controller.py`: MUST-HAVE

| Item | Content | Category |
|---|---|---|
| streaming session lifecycle hooks (open / close / admit signal) | Lets the P/D workers know when a streaming session starts and ends | MUST-HAVE |
| session ID routing | Lets the admission RPC find the correct SessionSlot | MUST-HAVE |

Not retired.
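### 1.5 `managers/io_struct.py`: MUST-HAVE

| Item | Content | Category |
|---|---|---|
| `AdmitDirectAppendReqInput` / `AdmitDirectAppendReqOutput` | Request / response message types for the admit RPC | MUST-HAVE |
| backpressure pause hint field | Optional field on the same messages | EXPERIMENTAL |

The EXPERIMENTAL field can be folded into the MUST-HAVE messages to preserve compatibility; by itself it creates no pressure to retire anything.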
### 1.6 `managers/tokenizer_communicator_mixin.py`: MUST-HAVE

Communicator-side glue for the admit RPC (19 lines). Not retired.

### 1.7 `entrypoints/http_server.py`: MUST-HAVE

Registration of the `/admit_direct_append` HTTP endpoint (6 lines).
### 1.8 `disaggregation/decode.py`: mixed categories

| Item | Content | Category |
|---|---|---|
| `DecodeReqToTokenPool`: relaxed `assert len(reusing) <= 1` | Local append-prefill reuses multiple `req_pool_idx` values within one batch | **MUST-HAVE** |
| `DecodePreallocQueue` adds `refresh_allocatable_tokens` plus a `maybe_trim_decode_session_cache` trigger | Proactively trims sessions when the pool is full | **WORKAROUND**; after the refactor radix LRU sheds load naturally |
| `--disaggregation-decode-allow-local-prefill` flag | Server-side opt-in for local append-prefill | **MUST-HAVE** |

The ~30 lines of trim-trigger logic should be deleted after the refactor.
### 1.9 `server_args.py`: MUST-HAVE

| Item | Content | Category |
|---|---|---|
| `--radix-eviction-policy priority` option | Required by the E1/E2 experiments | MUST-HAVE |
| `--disaggregation-decode-allow-local-prefill` flag | See §1.8 | MUST-HAVE |

13 lines, all CLI interface extensions. Not retired.

### 1.10 `disaggregation/mooncake_transfer_engine.py`: MINOR

A 3-line tweak; not a decision point.
---
## 2. Summary by category

### 2.1 MUST-HAVE (keep)

6 files, 450 lines.

- `admit_direct_append` main path (Algorithm 2): scheduler + io_struct + tokenizer_communicator_mixin + http_server + session_controller
- `SessionSlot` main path (streaming session lifecycle): most fields of session_aware_cache, plus session_controller
- CLI / server interface: server_args, and `allow_local_prefill` in decode.py

### 2.2 WORKAROUND (delete after the block-level eviction refactor)

2.5 files, 150 lines.

- the token-free path of `session_aware_cache.release_session`
- `_decode_session_cache_*_watermark_tokens` + `maybe_trim_decode_session_cache` in `scheduler.py`
- the streaming-session correction + drop-pre-filter (E3 landmine hotfix) in `schedule_batch.py`
- the trim trigger inside `DecodePreallocQueue` in `decode.py`

These patches exist because of the current architecture; after the refactor they should be deleted wholesale rather than patched bug by bug.

### 2.3 EXPERIMENTAL (loop not closed)

60 lines.

- backpressure pause hint (`_compute_backpressure_pause_hint` + the io_struct field): can stay as a hook for a future control-plane feedback mechanism; retire it if it is still not wired up after 1 month.

### 2.4 INSTRUMENTATION (keep long term, but behind a flag)

50 lines.

- `_compute_pool_breakdown_for_diagnostics` + the related `/server_info` fields: consider adding an `--enable-diagnostic-pool-snapshot` flag so the prod path does not carry the diagnostic overhead.

### 2.5 MINOR

3 lines. Ignore.
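
As a quick sanity check on the per-category counts in §2.1 to §2.5, the tally below sums them and prints each category's share. It is a minimal sketch: the numbers are copied from this section, and the names `CATEGORY_LINES` and `category_shares` are illustrative only.

```python
# Line counts copied from §2.1-§2.5; the names here are illustrative.
CATEGORY_LINES = {
    "MUST-HAVE": 450,
    "WORKAROUND": 150,
    "EXPERIMENTAL": 60,
    "INSTRUMENTATION": 50,
    "MINOR": 3,
}


def category_shares(lines_by_category: dict[str, int]) -> dict[str, float]:
    """Return each category's fraction of the summed line count."""
    total = sum(lines_by_category.values())
    return {name: count / total for name, count in lines_by_category.items()}


total = sum(CATEGORY_LINES.values())
print(f"total classified lines: {total}")
for name, share in category_shares(CATEGORY_LINES).items():
    print(f"{name:16s} {share:6.1%}")
```

The classified counts sum to 713, so the shares are relative to that sum rather than to the 785-line figure quoted for the whole vendored patch.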
---
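## 3. Maintenance conventions

1. **Every new SGLang change must be recorded in this table**: prefix the commit message with `feat(sglang): ...` / `fix(sglang): ...` and state in the PR description which §2 category it falls into.
2. **Do not overwrite upstream files directly**: every patch must apply to v0.5.10 with `git apply` (keep hunk headers clean).
3. **Delete the docs together with the WORKAROUND**: the same PR that completes the refactor should strike the corresponding rows from this document.
4. **Do not push EXPERIMENTAL into the main path**: patches whose loop is not closed must be disabled by default.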
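
Convention 4, together with the flag suggested in §2.4, amounts to an opt-in switch that defaults to off. The sketch below illustrates that shape with a standalone `argparse` parser. It is not SGLang's `ServerArgs`; the flag is only the name proposed in §2.4, and `pool_snapshot` with its fields is hypothetical.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # store_true makes the default False, so the diagnostic path is opt-in.
    parser.add_argument("--enable-diagnostic-pool-snapshot", action="store_true")
    return parser


def pool_snapshot(enabled: bool, used_tokens: int, total_tokens: int) -> dict | None:
    """Hypothetical diagnostic hook: does no work unless explicitly enabled."""
    if not enabled:
        return None
    return {"used_tokens": used_tokens, "free_tokens": total_tokens - used_tokens}


off = build_parser().parse_args([])
on = build_parser().parse_args(["--enable-diagnostic-pool-snapshot"])
print(pool_snapshot(off.enable_diagnostic_pool_snapshot, 1200, 4096))
print(pool_snapshot(on.enable_diagnostic_pool_snapshot, 1200, 4096))
```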
---
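## 4. Link to the roadmap

- Milestone 1 ([AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §4) carries out the block-level eviction refactor, after which **all of §2.2 should disappear**. This is an objective measure of how complete the refactor is.
- Milestone 2 splits the control plane into layers (§4.8); the §2.3 backpressure pause hint should be either enabled or retired, not left dangling.
- Milestone 3 introduces learning-based admission (§4.15); the §2.1 `admit_direct_append` interface should stay stable, with policy replacement happening on the router side rather than on D.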
---
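**Key takeaway**: the 785 lines of vendored SGLang are not a monolithic black box. Two thirds are core mechanism (required for the paper) and one third is workaround for the current architecture (deletable wholesale after the refactor). With this table, a reviewer can tell at a glance "which parts are the paper's real contribution and which are temporary scaffolding for the current prototype".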


@@ -7,7 +7,7 @@ requires-python = ">=3.12"
dependencies = [
"httpx>=0.28.1",
"mooncake-transfer-engine",
"sglang==0.5.10",
"sglang",
]
[project.scripts]
@@ -20,5 +20,21 @@ build-backend = "setuptools.build_meta"
[tool.setuptools.packages.find]
where = ["src"]
[dependency-groups]
# Pure-Python unit tests. Install via:
# uv sync --group test
# These tests deliberately import only the algorithm-layer modules
# (policies, trace, topology) so they run without SGLang / GPU / CUDA.
test = [
"pytest>=8.0",
]
[tool.uv]
prerelease = "allow"
[tool.uv.sources]
sglang = { path = "third_party/sglang/python", editable = true }
[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-q"


@@ -0,0 +1,225 @@
#!/usr/bin/env python3
"""Paired latency comparison with bootstrap CI.

Implements docs/EVALUATION_PROTOCOL_ZH.md §2.2 (M2 fix): when comparing
mechanism A vs B on the same trace, the only honest comparison is paired
on same-trial-mask. This script joins two metrics.jsonl by request_id,
keeps the rows where BOTH sides succeeded, and reports paired deltas
with 95% bootstrap CIs.

Differences from the existing `compare_no_error.py`:
  - works on raw metrics.jsonl, not pre-aggregated summary.json
  - bootstrap CIs (not just point estimates)
  - reports paired-mask size + per-side failure counts so the reader
    sees how many rows were dropped from the comparison

Usage:
    scripts/analysis/paired_compare.py \
        --baseline outputs/run-dp/request-metrics.jsonl \
        --candidate outputs/run-kvc/request-metrics.jsonl
    scripts/analysis/paired_compare.py ... --bootstrap 5000 --seed 42
    scripts/analysis/paired_compare.py ... --json > paired.json

stdlib only: no scipy/numpy. Runs without GPU and without SGLang.
"""
from __future__ import annotations

import argparse
import json
import math
import random
import sys
from pathlib import Path


def _load(path: Path) -> dict[str, dict]:
    out: dict[str, dict] = {}
    with path.open() as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            rid = row.get("request_id")
            if rid is None:
                continue
            out[rid] = row
    return out


def _ok(row: dict) -> bool:
    return row.get("error") is None and row.get("latency_s") is not None


def _quantile(values: list[float], q: float) -> float:
    if not values:
        return float("nan")
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q
    lo = math.floor(pos)
    hi = math.ceil(pos)
    if lo == hi:
        return s[lo]
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def _stats(deltas: list[float]) -> dict[str, float]:
    if not deltas:
        return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
    return {
        "mean": sum(deltas) / len(deltas),
        "p50": _quantile(deltas, 0.50),
        "p90": _quantile(deltas, 0.90),
        "p99": _quantile(deltas, 0.99),
    }


def _bootstrap_ci(
    deltas: list[float], statistic, n_boot: int, rng: random.Random
) -> tuple[float, float]:
    """Return (lo, hi) 95% percentile-bootstrap CI for `statistic(deltas)`."""
    # Fewer than 2 points or no resamples: there is nothing to index into.
    if len(deltas) < 2 or n_boot < 1:
        return (float("nan"), float("nan"))
    n = len(deltas)
    samples = []
    for _ in range(n_boot):
        # resample with replacement
        resample = [deltas[rng.randrange(n)] for _ in range(n)]
        samples.append(statistic(resample))
    samples.sort()
    lo = samples[int(0.025 * (n_boot - 1))]
    hi = samples[int(0.975 * (n_boot - 1))]
    return (lo, hi)


def compare(
    baseline: dict[str, dict],
    candidate: dict[str, dict],
    *,
    metric: str,
    n_boot: int,
    seed: int,
) -> dict:
    common_ids = set(baseline.keys()) & set(candidate.keys())
    paired_ids = [
        rid for rid in common_ids if _ok(baseline[rid]) and _ok(candidate[rid])
    ]
    paired_ids.sort()
    # Per-side failure counts within the intersection. A request that failed
    # on both sides is counted once in each total.
    base_fail = sum(1 for rid in common_ids if not _ok(baseline[rid]))
    cand_fail = sum(1 for rid in common_ids if not _ok(candidate[rid]))

    deltas = []
    wins = losses = ties = 0
    for rid in paired_ids:
        b = baseline[rid].get(metric)
        c = candidate[rid].get(metric)
        if b is None or c is None:
            continue
        d = float(c) - float(b)
        deltas.append(d)
        if d < 0:
            wins += 1
        elif d > 0:
            losses += 1
        else:
            ties += 1

    rng = random.Random(seed)
    stats = _stats(deltas)
    ci_mean = _bootstrap_ci(deltas, lambda x: sum(x) / len(x), n_boot, rng)
    ci_p50 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.50), n_boot, rng)
    ci_p90 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.90), n_boot, rng)
    return {
        "metric": metric,
        "baseline_size": len(baseline),
        "candidate_size": len(candidate),
        "intersection_size": len(common_ids),
        "paired_size": len(paired_ids),
        "baseline_fail_in_common": base_fail,
        "candidate_fail_in_common": cand_fail,
        "delta_stats": stats,
        "delta_mean_ci95": ci_mean,
        "delta_p50_ci95": ci_p50,
        "delta_p90_ci95": ci_p90,
        "wins_candidate": wins,
        "losses_candidate": losses,
        "ties": ties,
    }


def _fmt(x: float | None, w: int = 6) -> str:
    if x is None or (isinstance(x, float) and math.isnan(x)):
        # Pad to the same width as a formatted number so columns stay aligned.
        return "nan".rjust(w)
    return f"{x:+{w}.3f}"


def render(result: dict) -> str:
    s = result["delta_stats"]
    mlo, mhi = result["delta_mean_ci95"]
    p5lo, p5hi = result["delta_p50_ci95"]
    p9lo, p9hi = result["delta_p90_ci95"]
    n = result["paired_size"]
    lines = [
        f"# paired comparison ({result['metric']})",
        "",
        f"baseline rows: {result['baseline_size']}",
        f"candidate rows: {result['candidate_size']}",
        f"intersection (rid): {result['intersection_size']}",
        f"paired (both ok): {result['paired_size']}",
        f" baseline fails in common: {result['baseline_fail_in_common']}",
        f" candidate fails in common: {result['candidate_fail_in_common']}",
        "",
        "## delta (candidate - baseline); negative = candidate is faster",
        "",
        "| stat | value | 95% CI |",
        "|---|---:|---:|",
        f"| mean | {_fmt(s['mean'])} | [{_fmt(mlo)}, {_fmt(mhi)}] |",
        f"| p50 | {_fmt(s['p50'])} | [{_fmt(p5lo)}, {_fmt(p5hi)}] |",
        f"| p90 | {_fmt(s['p90'])} | [{_fmt(p9lo)}, {_fmt(p9hi)}] |",
        # No bootstrap CI is computed for p99.
        f"| p99 | {_fmt(s['p99'])} | n/a |",
        "",
        f"win/loss/tie: {result['wins_candidate']} / {result['losses_candidate']} / {result['ties']} (of {n})",
    ]
    return "\n".join(lines)


def _json_safe(obj):
    """Replace NaN floats with None so the output is strict JSON.

    json.dump only calls its `default` hook for objects it cannot serialize,
    so a `default=` lambda never sees NaN floats; sanitize up front instead.
    """
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, dict):
        return {k: _json_safe(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [_json_safe(v) for v in obj]
    return obj


def main() -> None:
    p = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
    p.add_argument("--baseline", required=True, type=Path)
    p.add_argument("--candidate", required=True, type=Path)
    p.add_argument(
        "--metric",
        default="latency_s",
        choices=["latency_s", "ttft_s", "tpot_s"],
        help="which per-request field to compare (default: latency_s)",
    )
    p.add_argument("--bootstrap", type=int, default=2000)
    p.add_argument("--seed", type=int, default=20260512)
    p.add_argument("--json", action="store_true")
    args = p.parse_args()

    baseline = _load(args.baseline)
    candidate = _load(args.candidate)
    if not baseline or not candidate:
        print("empty input on one side", file=sys.stderr)
        sys.exit(1)
    result = compare(
        baseline, candidate,
        metric=args.metric, n_boot=args.bootstrap, seed=args.seed,
    )
    if args.json:
        json.dump(_json_safe(result), sys.stdout, indent=2)
        sys.stdout.write("\n")
    else:
        print(render(result))


if __name__ == "__main__":
    main()
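
The pairing and percentile-bootstrap steps of the script above can be tried on a toy input without any metrics files. This is a minimal sketch, not the script itself: the two in-memory dicts, their request IDs, and the helper names `paired_deltas` and `bootstrap_mean_ci` are made up for illustration.

```python
from __future__ import annotations

import random


def paired_deltas(baseline: dict[str, dict], candidate: dict[str, dict], metric: str) -> list[float]:
    """candidate - baseline for request IDs that succeeded on both sides."""
    def ok(row: dict) -> bool:
        return row.get("error") is None and row.get(metric) is not None

    ids = sorted(set(baseline) & set(candidate))
    return [
        float(candidate[i][metric]) - float(baseline[i][metric])
        for i in ids
        if ok(baseline[i]) and ok(candidate[i])
    ]


def bootstrap_mean_ci(deltas: list[float], n_boot: int, seed: int) -> tuple[float, float]:
    """Percentile 95% CI for the mean, resampling with replacement."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        sum(deltas[rng.randrange(n)] for _ in range(n)) / n for _ in range(n_boot)
    )
    return means[int(0.025 * (n_boot - 1))], means[int(0.975 * (n_boot - 1))]


# Toy data: r3 failed on the baseline side, so it is dropped from the pairing.
baseline = {
    "r1": {"latency_s": 2.0, "error": None},
    "r2": {"latency_s": 3.0, "error": None},
    "r3": {"latency_s": 4.0, "error": "timeout"},
    "r4": {"latency_s": 5.0, "error": None},
}
candidate = {
    "r1": {"latency_s": 1.5, "error": None},
    "r2": {"latency_s": 3.5, "error": None},
    "r3": {"latency_s": 1.0, "error": None},
    "r4": {"latency_s": 4.0, "error": None},
}

deltas = paired_deltas(baseline, candidate, "latency_s")
lo, hi = bootstrap_mean_ci(deltas, n_boot=1000, seed=42)
print(deltas, lo, hi)
```

Dropping `r3` is the point of the paired mask: the candidate's fast result on a request the baseline never completed would otherwise inflate the apparent gain.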

227
scripts/analysis/stratified.py Executable file

@@ -0,0 +1,227 @@
#!/usr/bin/env python3
"""Stratified latency / TTFT reporter for paper-quality evaluation.

Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix): every headline
number must be accompanied by a stratified breakdown so reviewers can
see which slice the gains come from.

Buckets the request rows from one or more metrics.jsonl files along:
  - turn_id       : {1, 2-5, 6-20, 21+}
  - input_length  : {<=8K, 8K-64K, >64K}
  - overlap_ratio : {<=0.3, 0.3-0.7, >0.7}
  - append_tokens : input_length - observed_overlap_blocks * BLOCK_SIZE

For each bucket, reports:
  - n (total rows in bucket)
  - n_ok (rows with no error and latency_s set)
  - latency_s mean / p50 / p90 / p99
  - ttft_s mean / p50 / p90 / p99
  - err_pct (1 - n_ok/n)

Usage:
    scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl \
        [outputs/<other-run>/request-metrics.jsonl ...]
    scripts/analysis/stratified.py --dim turn_id outputs/<run>/request-metrics.jsonl
    scripts/analysis/stratified.py --json outputs/<run>/request-metrics.jsonl > strat.json

stdlib only: no pandas/numpy. Runs without GPU and without SGLang.
"""
from __future__ import annotations

import argparse
import json
import math
import sys
from collections import defaultdict
from pathlib import Path
from typing import Iterable

BLOCK_SIZE = 24  # SGLang radix block, matches docs/KVC_ROUTER_ALGORITHM.md §2

TURN_BUCKETS: list[tuple[str, tuple[int, int]]] = [
    ("turn=1", (1, 1)),
    ("turn=2-5", (2, 5)),
    ("turn=6-20", (6, 20)),
    ("turn=21+", (21, 10**9)),
]
INPUT_BUCKETS: list[tuple[str, tuple[int, int]]] = [
    ("input<=8K", (0, 8 * 1024)),
    ("input=8K-64K", (8 * 1024 + 1, 64 * 1024)),
    ("input>64K", (64 * 1024 + 1, 10**9)),
]
OVERLAP_BUCKETS: list[tuple[str, tuple[float, float]]] = [
    ("overlap<=0.3", (0.0, 0.3)),
    ("overlap=0.3-0.7", (0.3, 0.7)),
    ("overlap>0.7", (0.7, 1.0001)),
]
APPEND_BUCKETS: list[tuple[str, tuple[int, int]]] = [
    ("append<=128", (0, 128)),
    ("append=128-1K", (129, 1024)),
    ("append=1K-8K", (1025, 8 * 1024)),
    ("append>8K", (8 * 1024 + 1, 10**9)),
]
DIM_BUCKETS: dict[str, list[tuple[str, tuple]]] = {
    "turn_id": TURN_BUCKETS,
    "input_length": INPUT_BUCKETS,
    "overlap_ratio": OVERLAP_BUCKETS,
    "append_tokens": APPEND_BUCKETS,
}


def _quantile(values: list[float], q: float) -> float:
    """Linear-interpolation quantile, stdlib only."""
    if not values:
        return float("nan")
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q
    lo = math.floor(pos)
    hi = math.ceil(pos)
    if lo == hi:
        return s[lo]
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def _stats(values: list[float]) -> dict[str, float]:
    if not values:
        return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
    return {
        "mean": sum(values) / len(values),
        "p50": _quantile(values, 0.50),
        "p90": _quantile(values, 0.90),
        "p99": _quantile(values, 0.99),
    }


def _bucket_for(value: float | int, buckets: list[tuple[str, tuple]]) -> str:
    # Bounds are inclusive and the first match wins, so a value sitting on a
    # shared boundary (e.g. overlap 0.3) lands in the lower bucket.
    for label, (lo, hi) in buckets:
        if lo <= value <= hi:
            return label
    return "OOB"


def _classify(row: dict, dim: str) -> str:
    if dim == "turn_id":
        return _bucket_for(int(row.get("turn_id", 0)), TURN_BUCKETS)
    if dim == "input_length":
        return _bucket_for(int(row.get("input_length", 0)), INPUT_BUCKETS)
    if dim == "overlap_ratio":
        inp = max(1, int(row.get("input_length", 0)))
        cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
        ratio = min(1.0, cached / inp)
        return _bucket_for(ratio, OVERLAP_BUCKETS)
    if dim == "append_tokens":
        inp = int(row.get("input_length", 0))
        cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
        return _bucket_for(max(0, inp - cached), APPEND_BUCKETS)
    raise ValueError(f"Unknown dim: {dim}")


def load_rows(paths: Iterable[Path]) -> list[dict]:
    rows: list[dict] = []
    for path in paths:
        with path.open() as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                rows.append(json.loads(line))
    return rows


def stratify(rows: list[dict], dim: str) -> dict[str, dict]:
    by_bucket: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_bucket[_classify(row, dim)].append(row)
    output: dict[str, dict] = {}
    # Only the declared buckets are reported; rows classified as "OOB"
    # (e.g. a missing turn_id defaulting to 0) are left out of the output.
    for label, _ in DIM_BUCKETS[dim]:
        bucket_rows = by_bucket.get(label, [])
        n = len(bucket_rows)
        ok = [r for r in bucket_rows if r.get("error") is None and r.get("latency_s") is not None]
        n_ok = len(ok)
        lat = [float(r["latency_s"]) for r in ok]
        ttft = [float(r["ttft_s"]) for r in ok if r.get("ttft_s") is not None]
        output[label] = {
            "n": n,
            "n_ok": n_ok,
            "err_pct": (n - n_ok) / n if n else 0.0,
            "latency_s": _stats(lat),
            "ttft_s": _stats(ttft),
        }
    return output


def render_table(name: str, stats: dict[str, dict]) -> str:
    lines = [
        f"## stratified by {name}",
        "",
        "| bucket | n | n_ok | err% | lat mean | lat p50 | lat p90 | lat p99 | ttft mean | ttft p50 | ttft p90 | ttft p99 |",
        "|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|",
    ]
    for label, _ in DIM_BUCKETS[name]:
        s = stats[label]
        lat = s["latency_s"]
        ttft = s["ttft_s"]
        lines.append(
            "| {label} | {n} | {n_ok} | {err:.1%} | "
            "{lm:.3f} | {l50:.3f} | {l90:.3f} | {l99:.3f} | "
            "{tm:.3f} | {t50:.3f} | {t90:.3f} | {t99:.3f} |".format(
                label=label,
                n=s["n"],
                n_ok=s["n_ok"],
                err=s["err_pct"],
                lm=lat["mean"],
                l50=lat["p50"],
                l90=lat["p90"],
                l99=lat["p99"],
                tm=ttft["mean"],
                t50=ttft["p50"],
                t90=ttft["p90"],
                t99=ttft["p99"],
            )
        )
    return "\n".join(lines)


def _json_safe(obj):
    """Replace NaN floats with None so the output is strict JSON.

    json.dump only calls its `default` hook for objects it cannot serialize,
    so a `default=` lambda never sees NaN floats; sanitize up front instead.
    """
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, dict):
        return {k: _json_safe(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [_json_safe(v) for v in obj]
    return obj


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
    parser.add_argument("metrics_paths", nargs="+", type=Path)
    parser.add_argument(
        "--dim",
        choices=list(DIM_BUCKETS.keys()) + ["all"],
        default="all",
        help="stratification dimension (default: all four)",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="emit JSON instead of markdown tables",
    )
    args = parser.parse_args()

    rows = load_rows(args.metrics_paths)
    if not rows:
        print("no rows loaded", file=sys.stderr)
        sys.exit(1)
    dims = list(DIM_BUCKETS.keys()) if args.dim == "all" else [args.dim]
    result = {dim: stratify(rows, dim) for dim in dims}
    if args.json:
        json.dump(_json_safe(result), sys.stdout, indent=2)
        sys.stdout.write("\n")
        return
    header_paths = ", ".join(str(p) for p in args.metrics_paths)
    print(f"# stratified report ({len(rows)} rows from {header_paths})\n")
    for dim in dims:
        print(render_table(dim, result[dim]))
        print()


if __name__ == "__main__":
    main()
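
The overlap-ratio bucketing in the script above is easiest to see on a few hand-picked rows. The sketch below re-implements only the bucket lookup and the ratio computation; the helper names `bucket_for` and `overlap_ratio` and the sample `(input_length, observed_overlap_blocks)` pairs are illustrative assumptions, not part of the commit.

```python
from __future__ import annotations

BLOCK_SIZE = 24  # same block size the script assumes
OVERLAP_BUCKETS = [
    ("overlap<=0.3", (0.0, 0.3)),
    ("overlap=0.3-0.7", (0.3, 0.7)),
    ("overlap>0.7", (0.7, 1.0001)),
]


def bucket_for(value: float, buckets: list[tuple[str, tuple[float, float]]]) -> str:
    """Inclusive bounds, first match wins; values outside every range are OOB."""
    for label, (lo, hi) in buckets:
        if lo <= value <= hi:
            return label
    return "OOB"


def overlap_ratio(input_length: int, overlap_blocks: int) -> float:
    """Cached tokens over input tokens, clamped to 1.0."""
    return min(1.0, overlap_blocks * BLOCK_SIZE / max(1, input_length))


# (input_length, observed_overlap_blocks) pairs chosen to hit the boundaries.
samples = [(240, 3), (240, 5), (240, 7), (240, 10)]
labels = [bucket_for(overlap_ratio(n, b), OVERLAP_BUCKETS) for n, b in samples]
for (n, b), label in zip(samples, labels):
    print(f"input={n} blocks={b} ratio={overlap_ratio(n, b):.2f} -> {label}")
```

Because adjacent buckets share their boundary value and the first match wins, a ratio of exactly 0.3 or 0.7 is reported in the lower of the two buckets.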


@@ -0,0 +1,189 @@
"""Convert Inferact codex_swebenchpro_traces (ShareGPT) to agentic-pd-hybrid trace JSONL.

Output schema (one JSON object per line, matching src/agentic_pd_hybrid/trace.py):
    chat_id, parent_chat_id, timestamp, input_length, output_length, type, turn, hash_ids

Each trial in the input becomes one session. Each (human, gpt) pair within a trial
becomes one turn. The prefix at turn N is the concatenation of all (human, gpt) pairs
from turns 0..N-1 plus the current human message; this mirrors how agentic coding
agents grow context across calls.

hash_ids are derived per 24-token block via sha256 of the block's comma-joined token
IDs plus the previous block's hash, which gives stable, deterministic, prefix-shared
hashes across turns of the same session.
"""
from __future__ import annotations

import argparse
import hashlib
import json
import sys
import time
from pathlib import Path

BLOCK_TOKEN_BUDGET = 24


def _block_hash(text: str, prev_hash: int) -> int:
    h = hashlib.sha256(text.encode("utf-8") + prev_hash.to_bytes(8, "big")).digest()
    # Keep the top 8 bytes and clear the sign bit so the value fits a signed int64.
    return int.from_bytes(h[:8], "big") & 0x7FFFFFFFFFFFFFFF


def _build_hash_ids(token_ids: list[int]) -> list[int]:
    out: list[int] = []
    prev = 0
    for start in range(0, len(token_ids), BLOCK_TOKEN_BUDGET):
        block = token_ids[start : start + BLOCK_TOKEN_BUDGET]
        block_repr = ",".join(str(t) for t in block)
        prev = _block_hash(block_repr, prev)
        out.append(prev)
    return out


def _pair_turns(conv: list[dict]) -> list[tuple[str, str]]:
    """Pair consecutive (human, gpt) messages. Skip malformed."""
    pairs: list[tuple[str, str]] = []
    i = 0
    while i + 1 < len(conv):
        a, b = conv[i], conv[i + 1]
        if (
            isinstance(a, dict)
            and isinstance(b, dict)
            and a.get("from") == "human"
            and b.get("from") == "gpt"
        ):
            pairs.append((str(a.get("value", "")), str(b.get("value", ""))))
            i += 2
        else:
            i += 1
    return pairs


def convert(
    input_path: Path,
    output_path: Path,
    *,
    tokenizer_path: str,
    max_trials: int | None,
    inter_turn_gap_s: float,
    session_stagger_s: float,
    request_type: str,
) -> None:
    from transformers import AutoTokenizer

    print(f"loading tokenizer from {tokenizer_path}", file=sys.stderr)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
    print(f"loading {input_path}", file=sys.stderr)
    data = json.loads(input_path.read_text())
    if max_trials is not None:
        data = data[:max_trials]
    print(f"{len(data)} trials to process", file=sys.stderr)

    next_chat_id = 1_000_000
    written = 0
    skipped_trials = 0
    t0 = time.time()
    with output_path.open("w", encoding="utf-8") as out_f:
        for trial_idx, trial in enumerate(data):
            conv = trial.get("conversations") or []
            turns = _pair_turns(conv)
            if not turns:
                skipped_trials += 1
                continue
            base_ts = trial_idx * session_stagger_s
            ts = base_ts
            parent_chat_id = -1
            prefix_text = ""
            for turn_idx, (human, assistant) in enumerate(turns):
                # Input at this turn = full prior context + current human message.
                current_text = (
                    prefix_text + ("\n\n[USER]\n" if prefix_text else "[USER]\n") + human
                )
                input_ids = tokenizer.encode(current_text, add_special_tokens=False)
                input_length = len(input_ids)
                output_ids = tokenizer.encode(assistant, add_special_tokens=False)
                output_length = max(1, len(output_ids))
                hash_ids = _build_hash_ids(input_ids)
                chat_id = next_chat_id
                next_chat_id += 1
                record = {
                    "chat_id": chat_id,
                    "parent_chat_id": parent_chat_id,
                    "timestamp": round(ts, 6),
                    "input_length": input_length,
                    "output_length": output_length,
                    "type": request_type,
                    "turn": turn_idx,
                    "hash_ids": hash_ids,
                }
                out_f.write(json.dumps(record) + "\n")
                written += 1
                parent_chat_id = chat_id
                ts += inter_turn_gap_s
                prefix_text = current_text + "\n\n[ASSISTANT]\n" + assistant
            if (trial_idx + 1) % 20 == 0:
                elapsed = time.time() - t0
                rate = (trial_idx + 1) / elapsed if elapsed > 0 else 0
                eta = (len(data) - trial_idx - 1) / rate if rate > 0 else 0
                print(
                    f" trial {trial_idx + 1}/{len(data)} reqs={written} "
                    f"rate={rate:.1f} trial/s eta={eta:.0f}s",
                    file=sys.stderr,
                )
    elapsed = time.time() - t0
    print(
        f"done: wrote {written} requests across {len(data) - skipped_trials} sessions "
        f"({skipped_trials} trials skipped, empty conversations) in {elapsed:.1f}s "
        f"to {output_path}",
        file=sys.stderr,
    )


def main() -> None:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument(
        "--input",
        type=Path,
        default=Path("third_party/codex_swebenchpro_traces/codex_swebenchpro.json"),
    )
    p.add_argument("--output", type=Path, required=True)
    p.add_argument(
        "--tokenizer",
        default="/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507",
        help="Path or HF id for the tokenizer. Default matches v2 sweep model.",
    )
    p.add_argument(
        "--max-trials",
        type=int,
        default=None,
        help="Cap number of trials processed (useful for smoke / quick tests).",
    )
    p.add_argument("--inter-turn-gap-s", type=float, default=2.5)
    p.add_argument("--session-stagger-s", type=float, default=1.0)
    p.add_argument("--request-type", default="chat")
    args = p.parse_args()
    args.output.parent.mkdir(parents=True, exist_ok=True)
    convert(
        input_path=args.input,
        output_path=args.output,
        tokenizer_path=args.tokenizer,
        max_trials=args.max_trials,
        inter_turn_gap_s=args.inter_turn_gap_s,
        session_stagger_s=args.session_stagger_s,
        request_type=args.request_type,
    )


if __name__ == "__main__":
    main()
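
The reason the chained block hash yields prefix-shared `hash_ids` can be checked without a tokenizer. The sketch below copies the hashing scheme from the converter and feeds it made-up token IDs; `shared_prefix_blocks` and the sample sequences are illustrative assumptions.

```python
from __future__ import annotations

import hashlib

BLOCK = 24  # tokens per block, as in the converter


def block_hash(text: str, prev_hash: int) -> int:
    digest = hashlib.sha256(text.encode("utf-8") + prev_hash.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF


def build_hash_ids(token_ids: list[int]) -> list[int]:
    """One hash per block; each hash also covers every earlier block via prev."""
    out: list[int] = []
    prev = 0
    for start in range(0, len(token_ids), BLOCK):
        block = token_ids[start : start + BLOCK]
        prev = block_hash(",".join(str(t) for t in block), prev)
        out.append(prev)
    return out


def shared_prefix_blocks(a: list[int], b: list[int]) -> int:
    """Number of leading hash_ids the two requests have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Turn 1 ends exactly on a block boundary (48 tokens = 2 full blocks).
turn1 = list(range(100, 148))
turn2 = turn1 + list(range(500, 530))  # 78 tokens: blocks of 24, 24, 24, 6
h1, h2 = build_hash_ids(turn1), build_hash_ids(turn2)
print(len(h1), len(h2), shared_prefix_blocks(h1, h2))

# Turn 1 ends mid-block (50 tokens): the partial third block is not reusable.
turn1_partial = list(range(100, 150))
turn2_partial = turn1_partial + [7] * 10
print(shared_prefix_blocks(build_hash_ids(turn1_partial), build_hash_ids(turn2_partial)))
```

Only full blocks that are identical in both requests match, so a turn that ends mid-block contributes its trailing partial block to the uncached part of the next turn.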


@@ -0,0 +1,81 @@
"""Deterministically slice the first N sessions of an agentic-pd-hybrid trace.

Method: scan in file order, count records whose `parent_chat_id == -1` (= a
session's turn 0), and write every record until the (N+1)-th such record is
seen. The selection uses no RNG and no hashing: re-running on the same input
produces a byte-identical output. Used to derive matched subsets for paired
sweeps (E1 vs E2) without spending GPU hours on the full trace.

Usage:
    uv run --no-sync python scripts/sample_trace_subset.py \
        --input outputs/inferact_codex_swebenchpro.jsonl \
        --output outputs/inferact_50sess.jsonl \
        --sessions 50
"""
from __future__ import annotations

import argparse
import hashlib
import json
import sys
from pathlib import Path


def slice_first_n_sessions(input_path: Path, output_path: Path, n_sessions: int) -> dict:
    sessions_seen = 0
    requests_written = 0
    input_length_sum = 0
    output_length_sum = 0
    min_in = float("inf")
    max_in = 0
    with input_path.open("r", encoding="utf-8") as f_in, output_path.open(
        "w", encoding="utf-8"
    ) as f_out:
        for line in f_in:
            if not line.strip():
                continue  # tolerate blank lines instead of failing in json.loads
            rec = json.loads(line)
            if rec["parent_chat_id"] == -1:
                sessions_seen += 1
                if sessions_seen > n_sessions:
                    break
            f_out.write(line)
            requests_written += 1
            il = int(rec["input_length"])
            input_length_sum += il
            output_length_sum += int(rec["output_length"])
            if il < min_in:
                min_in = il
            if il > max_in:
                max_in = il
    # Fingerprint of the written subset so paired sweeps can verify they used
    # the same file.
    h = hashlib.md5(output_path.read_bytes()).hexdigest()
    return {
        "sessions": min(sessions_seen, n_sessions),
        "requests": requests_written,
        "input_length_mean": input_length_sum / max(1, requests_written),
        "input_length_min": int(min_in) if min_in != float("inf") else 0,
        "input_length_max": max_in,
        "output_length_mean": output_length_sum / max(1, requests_written),
        "output_md5": h,
    }


def main() -> None:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument(
        "--input",
        type=Path,
        default=Path("outputs/inferact_codex_swebenchpro.jsonl"),
    )
    p.add_argument("--output", type=Path, required=True)
    p.add_argument("--sessions", type=int, default=50)
    args = p.parse_args()
    args.output.parent.mkdir(parents=True, exist_ok=True)
    stats = slice_first_n_sessions(args.input, args.output, args.sessions)
    print(json.dumps(stats, indent=2), file=sys.stderr)


if __name__ == "__main__":
    main()
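
The slicing rule is simple enough to exercise in memory. The sketch below applies the same "stop at the (N+1)-th session root" logic to a list of records instead of a file; `first_n_sessions` and the sample records are hypothetical, and like the script it relies on each session's records being contiguous in file order, which is how the converter writes them.

```python
from __future__ import annotations


def first_n_sessions(records: list[dict], n_sessions: int) -> list[dict]:
    """Keep records in order until the (n_sessions + 1)-th session root appears."""
    kept: list[dict] = []
    seen = 0
    for rec in records:
        if rec["parent_chat_id"] == -1:  # a session's first turn
            seen += 1
            if seen > n_sessions:
                break
        kept.append(rec)
    return kept


# Three sessions: (1, 2), (3, 4), and (5).
records = [
    {"chat_id": 1, "parent_chat_id": -1},
    {"chat_id": 2, "parent_chat_id": 1},
    {"chat_id": 3, "parent_chat_id": -1},
    {"chat_id": 4, "parent_chat_id": 3},
    {"chat_id": 5, "parent_chat_id": -1},
]

subset = first_n_sessions(records, 2)
print([rec["chat_id"] for rec in subset])
```

Asking for more sessions than the trace contains simply returns every record, which matches the script's `min(sessions_seen, n_sessions)` in its summary.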

44
scripts/setup_env.sh Executable file

@@ -0,0 +1,44 @@
#!/usr/bin/env bash
# Source this file in every shell that will run agentic-pd-hybrid.
#
# source scripts/setup_env.sh
#
# Why all three are needed:
# - CUDA_HOME / PATH point tvm_ffi (vendor sglang JIT compiler) at cu12.8 nvcc.
# Without this it falls back to /usr/local/cuda-13.0/bin/nvcc and the
# resulting .so links libcudart.so.13 which driver 570 (cu12.8 API) rejects
# with cudaErrorInsufficientDriver.
# - LD_LIBRARY_PATH must expose libcudart.so.12 for mooncake.engine (cu12 wheel)
# AND ~/cuda-12.8/lib64 for tvm_ffi compile-time linker searches.
#
# See docs/H200_DRIVER570_SETUP_ZH.md for the full rationale.
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
if [ ! -x "$HOME/cuda-12.8/bin/nvcc" ]; then
echo "ERROR: $HOME/cuda-12.8/bin/nvcc not found." >&2
echo "Install cu12.8 toolkit first (see docs/H200_DRIVER570_SETUP_ZH.md §3)." >&2
return 1 2>/dev/null || exit 1
fi
if [ ! -f "$REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12" ]; then
echo "ERROR: venv libcudart.so.12 missing. Run 'uv sync' from $REPO_ROOT." >&2
return 1 2>/dev/null || exit 1
fi
export CUDA_HOME="$HOME/cuda-12.8"
export PATH="$HOME/cuda-12.8/bin:$PATH"
export LD_LIBRARY_PATH="$REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib:$HOME/cuda-12.8/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# Mooncake batch_transfer_sync C++ timeout (seconds). Default in mooncake is
# 30 s; a single LRU eviction sweep on a saturated D scheduler can exceed
# that and cause the hair-trigger blacklist in conn.py:1270 to permanently
# mark the D's mooncake_session_id "failed". 1800 s = 30 min gives us
# headroom while still detecting genuinely broken peers eventually.
# See docs/E1_E2_RESULTS_ZH.md §5c and docs/E1_E2_FIX_DESIGN_ZH.md Q1.C.
export MC_TRANSFER_TIMEOUT="${MC_TRANSFER_TIMEOUT:-1800}"
echo "agentic-pd-hybrid env ready:"
echo " CUDA_HOME=$CUDA_HOME ($(nvcc --version | grep release | sed 's/.*release //'))"
echo " libcudart.so.12 at $REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib"
echo " MC_TRANSFER_TIMEOUT=${MC_TRANSFER_TIMEOUT}s"

82
scripts/sweep_e1_naive_1p3d.sh Executable file

@@ -0,0 +1,82 @@
#!/usr/bin/env bash
# E1 — naive 1P3D + kv-aware + RDMA, ts=1
#
# Tests hypothesis H1 from ONBOARDING_NEXT_AGENT_ZH §3.1: separate the
# contribution of "1P3D topology + kv-aware policy" from "KVC layer
# (admission / migration / direct-to-D)".
#
# Mechanism = pd-disaggregation (no KVC layer); policy = kv-aware.
# Topology = 1P3D, RDMA on (mlx5_60 = cuda:0 NUMA-local).
#
# Prerequisites:
# - source scripts/setup_env.sh (sets CUDA_HOME etc.)
# - outputs/inferact_codex_swebenchpro.jsonl exists
# (run scripts/convert_inferact_to_trace.py if not)
#
# Usage:
# bash scripts/sweep_e1_naive_1p3d.sh
#
# Override defaults via env:
# MODEL=/path TRACE=path OUTPUT=path IB_DEVICE=mlx5_XX bash scripts/sweep_e1_naive_1p3d.sh
set -euo pipefail
cd "$(dirname "$0")/.."
if [ -z "${CUDA_HOME:-}" ]; then
echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
exit 1
fi
MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
OUTPUT=${OUTPUT:-outputs/e1_naive_1p3d_kvaware_rdma_50sess}
IB_DEVICE=${IB_DEVICE:-mlx5_60}
if [ ! -f "$TRACE" ]; then
  echo "ERROR: trace not found at $TRACE" >&2
  # The default TRACE is the 50-session subset, which sample_trace_subset.py produces.
  echo "Run: uv run --no-sync python scripts/sample_trace_subset.py --output $TRACE --sessions 50" >&2
  exit 1
fi
mkdir -p "$OUTPUT"
LOG="$OUTPUT/sweep.log"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
log "=== E1: naive 1P3D kv-aware + RDMA, ts=1 ==="
log "MODEL=$MODEL"
log "TRACE=$TRACE ($(wc -l < "$TRACE") requests)"
log "OUTPUT=$OUTPUT"
log "IB_DEVICE=$IB_DEVICE"
label=e1_naive_1p3d_kvaware_run1
log ""
log "=== [E1] $label starting ==="
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
--trace "$TRACE" \
--output-root "$OUTPUT" \
--mechanism pd-disaggregation \
--policy kv-aware \
--model-path "$MODEL" \
--prefill-workers 1 --decode-workers 3 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
--transfer-backend mooncake \
--force-rdma --ib-device "$IB_DEVICE" \
--gpu-budget 4 \
--time-scale 1 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 1800 \
--request-timeout-s 300 2>&1 | tee -a "$LOG"
run_dir=$(ls -td "$OUTPUT"/pd-disaggregation-*/ 2>/dev/null | head -1)
log "=== [E1] $label COMPLETED, artifacts at $run_dir ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
log "=== summary saved to $OUTPUT/${label}_summary.json ==="
fi

90
scripts/sweep_e2_kvc_v2_rdma.sh Executable file

@@ -0,0 +1,90 @@
#!/usr/bin/env bash
# E2 — KVC v2 + RDMA, ts=1
#
# Tests hypotheses H2/H3 from ONBOARDING_NEXT_AGENT_ZH §3.1: validate
# that enabling real RDMA pushes TTFT p99 from the reported 1.28s
# (TCP loopback) down toward ~0.7s (still expected to lose to DP 0.43s
# because re-prefill segment of reseed slow-path remains).
#
# Mechanism = kvcache-centric; policy = kv-aware; topology = 1P3D.
# All --kvcache-* tuning flags from sweep_ts1_migration_v2.sh
# (reset-on-success + threshold 8192). RDMA on (mlx5_60).
#
# Uses the same outputs/inferact_50sess.jsonl as E1 — see
# scripts/sample_trace_subset.py — so the two runs are paired.
#
# Prerequisites:
# - source scripts/setup_env.sh
# - E1 must already have completed (releases GPUs)
#
# Usage:
# bash scripts/sweep_e2_kvc_v2_rdma.sh
set -euo pipefail
cd "$(dirname "$0")/.."
if [ -z "${CUDA_HOME:-}" ]; then
echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
exit 1
fi
MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
OUTPUT=${OUTPUT:-outputs/e2_kvc_v2_rdma_50sess}
IB_DEVICE=${IB_DEVICE:-mlx5_60}
if [ ! -f "$TRACE" ]; then
echo "ERROR: trace not found at $TRACE" >&2
echo "Run: uv run --no-sync python scripts/sample_trace_subset.py --output $TRACE --sessions 50" >&2
exit 1
fi
mkdir -p "$OUTPUT"
LOG="$OUTPUT/sweep.log"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
log "=== E2: KVC v2 + RDMA, ts=1 ==="
log "MODEL=$MODEL"
log "TRACE=$TRACE ($(wc -l < "$TRACE") requests)"
log "OUTPUT=$OUTPUT"
log "IB_DEVICE=$IB_DEVICE"
label=e2_kvc_v2_rdma_run1
log ""
log "=== [E2] $label starting ==="
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
--trace "$TRACE" \
--output-root "$OUTPUT" \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path "$MODEL" \
--prefill-workers 1 --decode-workers 3 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
--transfer-backend mooncake \
--force-rdma --ib-device "$IB_DEVICE" \
--gpu-budget 4 \
--time-scale 1 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 1800 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--kvcache-migration-reject-threshold 3 \
--kvcache-direct-max-uncached-tokens 8192 2>&1 | tee -a "$LOG"
run_dir=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1)
log "=== [E2] $label COMPLETED, artifacts at $run_dir ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
log "=== summary saved to $OUTPUT/${label}_summary.json ==="
fi


@@ -0,0 +1,105 @@
#!/usr/bin/env bash
# E3 — KVC v2 + RDMA + load-floor bonus, ts=1
#
# Validates the load-floor bonus fix proposed in
# docs/E1_E2_FIX_DESIGN_ZH.md §Q2.B. Identical to E2 except:
# --kvcache-load-floor-bonus 200
#
# Pair-wise vs E1 (no KVC layer) and E2 (KVC v2 without bonus) on the
# exact same outputs/inferact_50sess.jsonl subset.
#
# Hypotheses being tested:
# H1 (load balance): D2 should now receive non-trivial bindings
# (E1/E2 had 0 — see E1_E2_RESULTS_ZH.md §5d).
# H2 (failure rate): mooncake batch_transfer_sync timeouts should
# stop firing because D0/D1 KV pool no longer
# saturates → no LRU thrash → control plane no
# longer starves. E2 had 1054 failures; expect
# ≤ E1's 85.
# H3 (TTFT): the 231 successful E2 reqs had TTFT p50 = 0.43s,
# well under E1's 88.6s. With the failure cascade
# removed, these should generalize to most reqs.
#
# Prerequisites:
# - source scripts/setup_env.sh
# (sets CUDA_HOME, MC_TRANSFER_TIMEOUT=1800, etc.)
# - outputs/inferact_50sess.jsonl exists (md5 7bb263a32600ef5a6ef5099ba340a487)
# - Previous sweep done; GPUs idle.
#
# Usage:
# bash scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
#
# Override defaults via env:
# LOAD_FLOOR_BONUS=500 bash scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
set -euo pipefail
cd "$(dirname "$0")/.."
if [ -z "${CUDA_HOME:-}" ]; then
echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
exit 1
fi
MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
OUTPUT=${OUTPUT:-outputs/e3_kvc_v2_loadfloor_rdma_50sess}
IB_DEVICE=${IB_DEVICE:-mlx5_60}
LOAD_FLOOR_BONUS=${LOAD_FLOOR_BONUS:-200}
if [ ! -f "$TRACE" ]; then
echo "ERROR: trace not found at $TRACE" >&2
echo "Run: uv run --no-sync python scripts/sample_trace_subset.py --output $TRACE --sessions 50" >&2
exit 1
fi
mkdir -p "$OUTPUT"
LOG="$OUTPUT/sweep.log"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
log "=== E3: KVC v2 + RDMA + load-floor bonus K=$LOAD_FLOOR_BONUS, ts=1 ==="
log "MODEL=$MODEL"
log "TRACE=$TRACE ($(wc -l < "$TRACE") requests)"
log "OUTPUT=$OUTPUT"
log "IB_DEVICE=$IB_DEVICE"
log "MC_TRANSFER_TIMEOUT=${MC_TRANSFER_TIMEOUT:-unset (launcher defaults to 1800 s for mooncake)}"
label=e3_kvc_v2_loadfloor_run1
log ""
log "=== [E3] $label starting ==="
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
--trace "$TRACE" \
--output-root "$OUTPUT" \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path "$MODEL" \
--prefill-workers 1 --decode-workers 3 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
--transfer-backend mooncake \
--force-rdma --ib-device "$IB_DEVICE" \
--gpu-budget 4 \
--time-scale 1 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 1800 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--kvcache-migration-reject-threshold 3 \
--kvcache-direct-max-uncached-tokens 8192 \
--kvcache-load-floor-bonus "$LOAD_FLOOR_BONUS" 2>&1 | tee -a "$LOG"
run_dir=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1)
log "=== [E3] $label COMPLETED, artifacts at $run_dir ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
log "=== summary saved to $OUTPUT/${label}_summary.json ==="
fi


@@ -48,6 +48,7 @@ class BenchmarkConfig:
enable_backpressure: bool = False
backpressure_max_pause_s: float = 2.0
kvcache_migration_reject_threshold: int = 3
kvcache_load_floor_bonus: int = 0
sample_profile: str = "default"
min_initial_input_tokens: int | None = None
max_initial_input_tokens: int | None = None
@@ -200,6 +201,7 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
enable_backpressure=config.enable_backpressure,
backpressure_max_pause_s=config.backpressure_max_pause_s,
kvcache_migration_reject_threshold=config.kvcache_migration_reject_threshold,
kvcache_load_floor_bonus=config.kvcache_load_floor_bonus,
)
if config.request_timeout_s is not None:
replay_config = replace(
@@ -261,6 +263,7 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
"enable_backpressure": config.enable_backpressure,
"backpressure_max_pause_s": config.backpressure_max_pause_s,
"kvcache_migration_reject_threshold": config.kvcache_migration_reject_threshold,
"kvcache_load_floor_bonus": config.kvcache_load_floor_bonus,
"sample_profile": config.sample_profile,
"min_initial_input_tokens": config.min_initial_input_tokens,
"max_initial_input_tokens": config.max_initial_input_tokens,


@@ -270,6 +270,19 @@ def main() -> None:
"See REFACTOR_PLAN_V1 §6.2 / TEAM_REPORT §2.1."
),
)
replay.add_argument(
"--kvcache-load-floor-bonus",
type=int,
default=0,
help=(
"Graduated bonus added to lex-score position 0 for under-loaded D "
"workers (gated on not-sticky so turn-1+ requests still stick). "
"Magnitude scales as K * max(0, mean - assigned[D]) / mean. "
"Set above max expected cross-session boilerplate overlap "
"(Inferact ~50 → use 200). 0 disables. "
"See docs/E1_E2_FIX_DESIGN_ZH.md §Q2."
),
)
sample = subparsers.add_parser(
"sample-sessions",
@@ -521,6 +534,19 @@ def main() -> None:
"See REFACTOR_PLAN_V1 §6.2 / TEAM_REPORT §2.1."
),
)
benchmark.add_argument(
"--kvcache-load-floor-bonus",
type=int,
default=0,
help=(
"Graduated bonus added to lex-score position 0 for under-loaded D "
"workers (gated on not-sticky so turn-1+ requests still stick). "
"Magnitude scales as K * max(0, mean - assigned[D]) / mean. "
"Set above max expected cross-session boilerplate overlap "
"(Inferact ~50 → use 200). 0 disables. "
"See docs/E1_E2_FIX_DESIGN_ZH.md §Q2."
),
)
benchmark.add_argument(
"--sample-profile",
choices=["default", "small-append"],
@@ -607,6 +633,7 @@ def main() -> None:
enable_backpressure=args.enable_backpressure,
backpressure_max_pause_s=args.backpressure_max_pause_s,
kvcache_migration_reject_threshold=args.kvcache_migration_reject_threshold,
kvcache_load_floor_bonus=args.kvcache_load_floor_bonus,
)
results = asyncio.run(replay_trace(config))
print(
@@ -754,6 +781,7 @@ def main() -> None:
enable_backpressure=args.enable_backpressure,
backpressure_max_pause_s=args.backpressure_max_pause_s,
kvcache_migration_reject_threshold=args.kvcache_migration_reject_threshold,
kvcache_load_floor_bonus=args.kvcache_load_floor_bonus,
sample_profile=args.sample_profile,
min_initial_input_tokens=args.min_initial_input_tokens,
max_initial_input_tokens=args.max_initial_input_tokens,
@@ -848,6 +876,8 @@ def _topology_from_args(args: argparse.Namespace):
force_rdma=args.force_rdma,
trust_remote_code=not args.no_trust_remote_code,
ib_device=args.ib_device,
prefill_extra_server_args=("--disable-overlap-schedule",),
decode_extra_server_args=("--disable-overlap-schedule",),
direct_extra_server_args=("--enable-streaming-session",),
)
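
The `--kvcache-load-floor-bonus` definition above is registered twice with identical text, once on `replay` and once on `benchmark`. A minimal sketch of one way to keep the two copies from drifting, assuming a hypothetical helper `add_load_floor_bonus_arg` and a toy parser (the subcommand name `replay` and the shortened help string are illustrative, not taken from `cli.py`):

```python
import argparse


def add_load_floor_bonus_arg(parser: argparse.ArgumentParser) -> None:
    """Hypothetical helper: register the flag once per subparser."""
    parser.add_argument(
        "--kvcache-load-floor-bonus",
        type=int,
        default=0,
        help="Load-floor bonus K for under-loaded D workers. 0 disables.",
    )


def build_parser() -> argparse.ArgumentParser:
    # Toy parser; the real CLI defines many more flags per subcommand.
    parser = argparse.ArgumentParser(prog="demo-cli")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in ("replay", "benchmark-live"):
        add_load_floor_bonus_arg(subparsers.add_parser(name))
    return parser


args = build_parser().parse_args(
    ["benchmark-live", "--kvcache-load-floor-bonus", "200"]
)
print(args.command, args.kvcache_load_floor_bonus)  # benchmark-live 200
```

Both subcommands then share one default and one help string, so a later change to the recommended value only has to be made in one place.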


@@ -152,6 +152,49 @@ class StickyDecodePolicy:
)
CandidateScore = tuple[int, int, int, int]


def score_candidate(
    *,
    overlap: int,
    sticky: bool,
    inflight: int,
    assigned: int,
    mean_assigned: float,
    sticky_bonus: int,
    load_floor_bonus: int,
) -> CandidateScore:
    """Pure scoring function for KvAwarePolicy (Algorithm 1 in KVC_ROUTER_ALGORITHM.md).

    Returns the 4-tuple compared lexicographically by `select()` to pick the
    best D. Extracted as a top-level function so unit tests can exercise it
    without constructing topology/state objects.

    Score tuple positions:
      0: overlap + sticky_bonus*sticky + floor_bonus  (primary, KV reuse aware)
      1: sticky                                       (tie-1, session locality)
      2: -inflight                                    (tie-2, prefer low load)
      3: -assigned                                    (tie-3, prefer rarely-picked)

    Load-floor bonus is gated on `not sticky` (turn-1+ sessions continue to
    stick to their original D). The boost magnitude scales linearly with the
    D's deficit relative to the running mean of decode_assignment_counts,
    truncated to an integer:

        floor_bonus = int(load_floor_bonus * max(0, mean - assigned) / mean)

    When mean == 0 (warmup) the bonus is 0 for all candidates (lex tiebreak
    falls through to iteration order).

    See docs/E1_E2_FIX_DESIGN_ZH.md §Q2 for the load-floor design and
    docs/KVC_ROUTER_ALGORITHM.md §3.1 for the lex-score formalism.
    """
    floor_bonus = 0
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        deficit = max(0.0, mean_assigned - assigned)
        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
    return (primary, int(sticky), -inflight, -assigned)


@dataclass(frozen=True)
class KvAwarePolicy:
name: str = "kv-aware"
@@ -161,6 +204,12 @@ class KvAwarePolicy:
# 0 disables the mechanism. Default 3 picked empirically to allow brief
# transient saturation without panicking, but to reroute persistent starvation.
migration_reject_threshold: int = 3
# Load-floor bonus: see score_candidate() docstring for the exact formula.
# Set above the max cross-session boilerplate overlap you expect (so fresh
# sessions reach under-loaded D's even at 0 overlap), but below the
# magnitude of "real" prefix overlap (so a warm D still wins for its own
# session). 0 disables.
load_floor_bonus: int = 0
def select(
self,
@@ -172,9 +221,12 @@ class KvAwarePolicy:
prefill_worker_id = state.next_prefill_worker_id(topology)
session = state.session_state.get(request.session_id)
n_route_workers = max(1, len(topology.route_workers))
total_assigned = sum(state.decode_assignment_counts.values())
mean_assigned = total_assigned / n_route_workers
best_decode_worker_id: str | None = None
best_score: tuple[int, int, int, int] | None = None
candidates_considered = 0
best_score: CandidateScore | None = None
for worker in topology.route_workers:
# Migration: skip workers that have rejected this session too many times.
# If all candidates get filtered (degenerate case), fall through to
@@ -185,16 +237,17 @@ class KvAwarePolicy:
)
if rejects >= self.migration_reject_threshold:
continue
candidates_considered += 1
overlap = _overlap_blocks(request, state, worker.worker_id)
sticky = int(session is not None and session.last_decode_worker == worker.worker_id)
inflight_penalty = -state.inflight_decode.get(worker.worker_id, 0)
assignment_penalty = -state.decode_assignment_counts.get(worker.worker_id, 0)
score = (
overlap + sticky * self.sticky_bonus,
sticky,
inflight_penalty,
assignment_penalty,
score = score_candidate(
overlap=_overlap_blocks(request, state, worker.worker_id),
sticky=(
session is not None
and session.last_decode_worker == worker.worker_id
),
inflight=state.inflight_decode.get(worker.worker_id, 0),
assigned=state.decode_assignment_counts.get(worker.worker_id, 0),
mean_assigned=mean_assigned,
sticky_bonus=self.sticky_bonus,
load_floor_bonus=self.load_floor_bonus,
)
if best_score is None or score > best_score:
best_score = score
@@ -223,14 +276,22 @@ class KvAwarePolicy:
)
def create_policy(name: str, *, migration_reject_threshold: int = 3) -> RoutingPolicy:
def create_policy(
name: str,
*,
migration_reject_threshold: int = 3,
load_floor_bonus: int = 0,
) -> RoutingPolicy:
normalized = name.strip().lower()
if normalized == "default":
return DefaultPolicy()
if normalized == "sticky":
return StickyDecodePolicy()
if normalized in {"kv-aware", "kv_aware", "kv"}:
return KvAwarePolicy(migration_reject_threshold=migration_reject_threshold)
return KvAwarePolicy(
migration_reject_threshold=migration_reject_threshold,
load_floor_bonus=load_floor_bonus,
)
raise ValueError(f"Unsupported policy: {name}")
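
To see why the bonus breaks overlap-pinning, the scoring rule can be replayed on a toy router. This is a minimal sketch under stated assumptions: `score_candidate` is copied from the hunk above, while `route_fresh_sessions`, the three worker names, the fixed `inflight=0`, and the rule that a worker becomes "warm" with 50 blocks of shared boilerplate after its first session are all illustrative simplifications, not the real `select()` loop.

```python
def score_candidate(*, overlap, sticky, inflight, assigned, mean_assigned,
                    sticky_bonus, load_floor_bonus):
    # Same arithmetic as the function added in this diff.
    floor_bonus = 0
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        deficit = max(0.0, mean_assigned - assigned)
        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
    return (primary, int(sticky), -inflight, -assigned)


def route_fresh_sessions(n_sessions, load_floor_bonus, boilerplate_overlap=50):
    """Route n turn-0 sessions (never sticky) and return per-worker counts."""
    workers = ["decode-0", "decode-1", "decode-2"]
    assigned = {w: 0 for w in workers}
    warm = set()  # workers already holding the shared boilerplate prefix
    for _ in range(n_sessions):
        mean = sum(assigned.values()) / len(workers)
        best, best_score = None, None
        for w in workers:
            score = score_candidate(
                overlap=boilerplate_overlap if w in warm else 0,
                sticky=False,
                inflight=0,  # assumption: requests finish instantly
                assigned=assigned[w],
                mean_assigned=mean,
                sticky_bonus=1,
                load_floor_bonus=load_floor_bonus,
            )
            if best_score is None or score > best_score:
                best, best_score = w, score
        assigned[best] += 1
        warm.add(best)
    return assigned


print(route_fresh_sessions(9, load_floor_bonus=0))
# {'decode-0': 9, 'decode-1': 0, 'decode-2': 0}
print(route_fresh_sessions(9, load_floor_bonus=200))
# {'decode-0': 3, 'decode-1': 3, 'decode-2': 3}
```

With the bonus disabled, the first worker to cache the boilerplate wins every later fresh session on position 0, so the `-assigned` tiebreak never gets a chance to rebalance. With K=200 the cold workers outscore the 50-block boilerplate overlap until the counts even out.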


@@ -111,6 +111,11 @@ class ReplayConfig:
# KvAwarePolicy skips that D for the session (forcing migration). Default 3.
# Set 0 to disable. See REFACTOR_PLAN_V1 §6.2.
kvcache_migration_reject_threshold: int = 3
# Load-floor bonus magnitude for KvAwarePolicy: graduated boost added to
# under-loaded D workers to break overlap-pinning imbalance on workloads
# with shared cross-session prefix. 0 disables. See
# docs/E1_E2_FIX_DESIGN_ZH.md §Q2.
kvcache_load_floor_bonus: int = 0
structural_log_dir: Path | None = None
@@ -198,6 +203,7 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
policy = create_policy(
config.policy_name,
migration_reject_threshold=config.kvcache_migration_reject_threshold,
load_floor_bonus=config.kvcache_load_floor_bonus,
)
state = RoutingState.create(config.topology)
state_lock = asyncio.Lock()


@@ -201,6 +201,14 @@ def _build_process_env(topology: SingleNodeTopology) -> dict[str, str]:
# Default to TCP when RDMA is not forced (e.g. loopback on same node)
env.setdefault("MOONCAKE_PROTOCOL", "tcp")
# Mooncake C++ batch_transfer_sync default timeout is 30 s, which can
# fire as a false positive when a saturated D scheduler thread is busy
# with LRU eviction (see docs/E1_E2_RESULTS_ZH.md §5c). Default to 1800 s
# so the hair-trigger blacklist in conn.py:1270 doesn't latch on
# transient stalls. Caller can override via shell env (setup_env.sh).
if topology.transfer_backend == "mooncake":
env.setdefault("MC_TRANSFER_TIMEOUT", "1800")
repo_root = Path(__file__).resolve().parents[2]
python_paths = [
str(repo_root / "src"),
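
The `setdefault` call in this hunk means a value exported by the caller always wins over the 1800 s default. A minimal sketch of that precedence, assuming a hypothetical `build_env` that takes the inherited environment as a plain dict (the real `_build_process_env` takes a topology object and sets many more variables):

```python
def build_env(base_env: dict[str, str], transfer_backend: str) -> dict[str, str]:
    """Hypothetical reduction of the launcher's env setup to one variable."""
    env = dict(base_env)  # copy so the caller's mapping is not mutated
    if transfer_backend == "mooncake":
        # setdefault only writes when the key is absent, so a value
        # exported in the shell (e.g. via setup_env.sh) is preserved.
        env.setdefault("MC_TRANSFER_TIMEOUT", "1800")
    return env


print(build_env({}, "mooncake")["MC_TRANSFER_TIMEOUT"])                             # 1800
print(build_env({"MC_TRANSFER_TIMEOUT": "60"}, "mooncake")["MC_TRANSFER_TIMEOUT"])  # 60
print("MC_TRANSFER_TIMEOUT" in build_env({}, "other-backend"))                      # False
```

A sweep script that logs the shell's value before launch will therefore report the variable as unset even though the worker processes run with 1800.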

39
tests/README.md Normal file

@@ -0,0 +1,39 @@
# Tests
Pure-Python unit + property tests for the algorithm layer. These tests do
**not** import SGLang and do **not** need a GPU — they validate the routing
algorithm (Algorithm 1/2/3 in `docs/KVC_ROUTER_ALGORITHM.md`) and its
theorems against the pure functions extracted from `policies.py`.
## Run
```bash
uv sync --group test
uv run pytest
```
Or, without uv:
```bash
pip install pytest
PYTHONPATH=src pytest tests
```
## Scope
- `test_policy_scoring.py` — Algorithm 1 lex-score properties (overlap
dominates sticky, load-floor gating, tie-breakers).
- `test_no_starvation.py` — Theorem 1: bounded retries before some D either
accepts or the least-rejected D is forced through the degenerate path.
Future:
- block-level eviction `MockRadixCache` tests (see
`docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` §5).
- D→P sync `staleness_budget` property tests (see
`docs/D_TO_P_SYNC_CONTRACT_ZH.md` §1).
## Why no integration tests here
Anything that needs SGLang, mooncake, or a real model is an integration
test and must run on hardware. Those tests live as `scripts/sweep_*.sh`
under the evaluation protocol in `docs/EVALUATION_PROTOCOL_ZH.md`.

0
tests/__init__.py Normal file

66
tests/_fixtures.py Normal file

@@ -0,0 +1,66 @@
"""Lightweight fixtures for algorithm-layer tests.
Builds minimal TraceRequest / SingleNodeTopology / RoutingState instances
without invoking build_single_node_topology() (which validates GPU budgets
we don't care about in unit tests).
"""
from __future__ import annotations
from agentic_pd_hybrid.topology import SingleNodeTopology, WorkerSpec
from agentic_pd_hybrid.trace import TraceRequest
def make_topology(decode_count: int = 3, prefill_count: int = 1) -> SingleNodeTopology:
prefill_workers = tuple(
WorkerSpec(
role="prefill",
ordinal=i,
gpu_ids=(i,),
host="127.0.0.1",
port=30000 + i,
)
for i in range(prefill_count)
)
decode_workers = tuple(
WorkerSpec(
role="decode",
ordinal=i,
gpu_ids=(prefill_count + i,),
host="127.0.0.1",
port=31000 + i,
)
for i in range(decode_count)
)
return SingleNodeTopology(
model_path="/dev/null/test-model",
prefill_workers=prefill_workers,
decode_workers=decode_workers,
direct_workers=(),
router_host="127.0.0.1",
router_port=8000,
transfer_backend="mooncake",
trust_remote_code=True,
)
def make_request(
*,
session_id: str = "sess-1",
turn_id: int = 0,
hash_ids: tuple[int, ...] = (),
input_length: int = 1024,
output_length: int = 64,
) -> TraceRequest:
return TraceRequest(
request_id=f"{session_id}-t{turn_id}",
session_id=session_id,
chat_id=int(turn_id),
parent_chat_id=-1 if turn_id == 0 else int(turn_id - 1),
timestamp_s=float(turn_id),
input_length=input_length,
output_length=output_length,
request_type="user",
turn_id=turn_id,
hash_ids=hash_ids,
)

150
tests/test_no_starvation.py Normal file

@@ -0,0 +1,150 @@
"""Theorem 1 — no permanent starvation under bounded retries.
Reference: docs/KVC_ROUTER_ALGORITHM.md §4.1.
For any session s with τ_reject ≥ 1, after at most |D| · τ_reject
consecutive admission rejects on s, the routing policy MUST still
return a valid decision (via the degenerate "least-rejected D"
fallback). The session cannot be permanently starved at the policy
layer.
We can't exercise the full Dispatch loop here (it lives in replay.py and
needs HTTP, mooncake, etc.). What we CAN test is the policy-layer
guarantee: after K = |D| · τ_reject reject bumps, select() never raises
and never returns a worker that's both blacklisted *and* has positive
overlap (the degenerate path chooses by least-rejected).
This is the property-layer companion to test_policy_scoring.py's
quantitative checks.
"""
from __future__ import annotations
from agentic_pd_hybrid.policies import KvAwarePolicy, RoutingState
from ._fixtures import make_request, make_topology
def test_select_returns_valid_decision_under_full_blacklist():
"""Bump all (s, d) reject counters past τ_reject. select() must still
pick a worker (degenerate fallback, no exception, no None)."""
topology = make_topology(decode_count=3)
state = RoutingState.create(topology)
request = make_request(session_id="s-stuck", turn_id=0)
policy = KvAwarePolicy(migration_reject_threshold=3)
# Pre-fill the blacklist for every D.
for worker in topology.route_workers:
for _ in range(3):
state.record_admission_reject(request.session_id, worker.worker_id)
decision = policy.select(request=request, topology=topology, state=state)
assert decision.decode_worker_id is not None
assert decision.decode_worker_id in {w.worker_id for w in topology.route_workers}
def test_bounded_retries_to_force_degenerate_path():
"""Theorem 1: at most |D| · τ_reject rejects suffice to either exhaust
every D or to force the degenerate fallback. Simulate the worst case
where an adversary immediately rejects whichever D the policy picks."""
topology = make_topology(decode_count=4)
state = RoutingState.create(topology)
request = make_request(session_id="s-worst", turn_id=0)
threshold = 3
policy = KvAwarePolicy(migration_reject_threshold=threshold)
seen_decoders: set[str] = set()
max_retries = len(topology.route_workers) * threshold
for retry in range(max_retries):
decision = policy.select(request=request, topology=topology, state=state)
seen_decoders.add(decision.decode_worker_id)
# Adversary: this D rejects this session.
state.record_admission_reject(request.session_id, decision.decode_worker_id)
# After |D|·τ_reject rejects every D must be blacklisted, so the next
# select() takes the degenerate "least-rejected" branch and STILL
# returns a valid worker.
final = policy.select(request=request, topology=topology, state=state)
assert final.decode_worker_id in {w.worker_id for w in topology.route_workers}
# And we should have explored every D over the bounded retries — the
# algorithm cannot trap a session on a single D when all are rejecting.
assert seen_decoders == {w.worker_id for w in topology.route_workers}
def test_least_rejected_d_chosen_when_all_blacklisted():
"""When every D is past threshold, the degenerate fallback chooses the
one with the *fewest* rejects (Algorithm 1, line 4)."""
topology = make_topology(decode_count=3)
state = RoutingState.create(topology)
request = make_request(session_id="s-lr", turn_id=0)
policy = KvAwarePolicy(migration_reject_threshold=3)
# Skew rejections: decode-0 has 5, decode-1 has 10, decode-2 has 3.
# All are >= threshold=3, so the filter wipes out every candidate.
# The fallback should pick decode-2 (smallest rejection count).
workers = list(topology.route_workers)
bumps = {workers[0].worker_id: 5, workers[1].worker_id: 10, workers[2].worker_id: 3}
for wid, n in bumps.items():
for _ in range(n):
state.record_admission_reject(request.session_id, wid)
decision = policy.select(request=request, topology=topology, state=state)
assert decision.decode_worker_id == workers[2].worker_id
def test_other_session_unaffected_by_blacklist():
"""Algorithm 1's filter is per-(session, D), not per-D. Session A's
rejects must not influence session B's routing."""
topology = make_topology(decode_count=2)
state = RoutingState.create(topology)
policy = KvAwarePolicy(migration_reject_threshold=3)
# Blacklist decode-0 for session A.
workers = list(topology.route_workers)
for _ in range(3):
state.record_admission_reject("session-A", workers[0].worker_id)
# Session B sees a clean slate — should be able to pick decode-0
# (which is the iteration-order winner under empty state).
decision_b = policy.select(
request=make_request(session_id="session-B"),
topology=topology,
state=state,
)
# decode-0 wins iteration-order tiebreak when all scores are (0,0,0,0).
assert decision_b.decode_worker_id == workers[0].worker_id
def test_threshold_zero_disables_blacklist():
"""migration_reject_threshold=0 means the migration mechanism is off:
every D stays a candidate regardless of its reject count."""
topology = make_topology(decode_count=2)
state = RoutingState.create(topology)
request = make_request(session_id="s-no-mig")
policy = KvAwarePolicy(migration_reject_threshold=0)
workers = list(topology.route_workers)
# Pile a huge number of rejects on decode-0.
for _ in range(100):
state.record_admission_reject(request.session_id, workers[0].worker_id)
decision = policy.select(request=request, topology=topology, state=state)
# decode-0 should still be eligible; with empty overlap/sticky/inflight,
# iteration order picks decode-0 first.
assert decision.decode_worker_id == workers[0].worker_id
def test_reject_counter_only_grows_on_record():
"""RoutingState.record_admission_reject is the ONLY mutator for the
counter. select() must not silently bump it."""
topology = make_topology(decode_count=2)
state = RoutingState.create(topology)
request = make_request(session_id="s-clean")
policy = KvAwarePolicy()
for _ in range(5):
policy.select(request=request, topology=topology, state=state)
# No explicit record_admission_reject -> all counters stay zero.
assert sum(state.session_d_rejects.values()) == 0
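
The filter-then-fallback behaviour these tests pin down can be sketched without any topology or state objects. This is an illustration under assumptions, not the code in `policies.py`: `pick_decode_worker` is a hypothetical stand-in, scores are passed in precomputed, and treating `threshold <= 0` as "filter disabled" is inferred from `test_threshold_zero_disables_blacklist`.

```python
def pick_decode_worker(
    scores: dict[str, tuple[int, int, int, int]],
    rejects: dict[str, int],
    threshold: int,
) -> str:
    """Hypothetical stand-in for the per-session filter + fallback."""
    eligible = [
        w for w in scores
        if threshold <= 0 or rejects.get(w, 0) < threshold
    ]
    if eligible:
        # max() returns the first maximal item: iteration-order tiebreak.
        return max(eligible, key=lambda w: scores[w])
    # Degenerate path: every D is blacklisted, take the least-rejected one.
    return min(scores, key=lambda w: rejects.get(w, 0))


workers = [f"decode-{i}" for i in range(4)]
scores = {w: (0, 0, 0, 0) for w in workers}
rejects: dict[str, int] = {}
threshold = 3
seen = []
for _ in range(len(workers) * threshold):
    picked = pick_decode_worker(scores, rejects, threshold)
    seen.append(picked)
    rejects[picked] = rejects.get(picked, 0) + 1  # adversary rejects every pick

print(sorted(set(seen)))  # ['decode-0', 'decode-1', 'decode-2', 'decode-3']
print(pick_decode_worker(scores, rejects, threshold))  # decode-0
```

After `|D| * threshold` rejects every worker has been tried and blacklisted, yet the final call still returns a worker, which is the policy-layer half of Theorem 1.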


@@ -0,0 +1,189 @@
"""Unit tests for Algorithm 1 (KvAwarePolicy score_candidate).
Reference: docs/KVC_ROUTER_ALGORITHM.md §3.1. The lex-score is
(overlap + sticky_bonus*sticky + floor_bonus,
sticky,
-inflight,
-assigned)
These tests pin down the qualitative properties that the algorithm's
correctness arguments rely on. They run without SGLang/GPU.
"""
from __future__ import annotations
from agentic_pd_hybrid.policies import score_candidate
def _score(**overrides):
"""Helper: build a score with all defaults and per-test overrides."""
args = dict(
overlap=0,
sticky=False,
inflight=0,
assigned=0,
mean_assigned=0.0,
sticky_bonus=1,
load_floor_bonus=0,
)
args.update(overrides)
return score_candidate(**args)
# -- Determinism ----------------------------------------------------------------
def test_score_is_pure():
"""Same kwargs must produce the same tuple (no hidden state)."""
a = _score(overlap=3, sticky=True, inflight=1, assigned=7)
b = _score(overlap=3, sticky=True, inflight=1, assigned=7)
assert a == b
def test_score_returns_4_tuple():
s = _score()
assert isinstance(s, tuple)
assert len(s) == 4
assert all(isinstance(x, int) for x in s)
# -- Primary term: overlap dominates sticky --------------------------------------
def test_overlap_strictly_dominates_pure_sticky():
    """Theorem-2 building block: overlap strictly greater than sticky_bonus
    on a non-sticky D wins against a sticky-only D with zero overlap. With
    sticky_bonus=1 that means overlap >= 2; at overlap == 1 the primaries tie
    and the sticky D wins on tie-1."""
    overlap = _score(overlap=2, sticky=False)
    sticky_only = _score(overlap=0, sticky=True)
    assert overlap > sticky_only
def test_overlap_plus_sticky_beats_overlap_alone():
"""Two D's with equal overlap: sticky one wins (sticky_bonus contributes
to primary AND wins tie-1)."""
sticky_d = _score(overlap=5, sticky=True)
fresh_d = _score(overlap=5, sticky=False)
assert sticky_d > fresh_d
# -- Tie breakers ----------------------------------------------------------------
def test_tiebreaker_inflight_lower_wins():
"""Equal primary & sticky: prefer the D with fewer in-flight requests."""
low = _score(overlap=3, sticky=False, inflight=0, assigned=10)
high = _score(overlap=3, sticky=False, inflight=5, assigned=10)
assert low > high
def test_tiebreaker_assigned_lower_wins():
"""Equal primary & sticky & inflight: prefer rarely-picked D."""
rare = _score(overlap=3, sticky=False, inflight=2, assigned=1)
frequent = _score(overlap=3, sticky=False, inflight=2, assigned=99)
assert rare > frequent
def test_tiebreaker_strict_lex_order():
    """Sticky beats non-sticky on tie-1 even if the non-sticky D has lower
    inflight (the lex order is strict: position 1 outranks positions 2/3)."""
    # sticky_bonus=1 is added to position 0, so giving both D's the same
    # overlap would let the sticky D win on position 0 (5 > 4) before tie-1
    # is ever consulted. Force equal primaries by lowering the sticky D's
    # overlap by one.
    sticky_busy_eq_primary = _score(overlap=3, sticky=True, inflight=10, assigned=10)
    fresh_idle_eq_primary = _score(overlap=4, sticky=False, inflight=0, assigned=0)
    # Now equal primary (3+1=4 vs 4). Sticky wins position 1.
    assert sticky_busy_eq_primary > fresh_idle_eq_primary
# -- Load-floor bonus ------------------------------------------------------------
def test_load_floor_disabled_by_default():
"""load_floor_bonus=0 → no contribution to primary."""
s = _score(overlap=0, sticky=False, mean_assigned=10, assigned=0)
assert s[0] == 0
def test_load_floor_gated_off_when_sticky():
"""Even with load_floor_bonus>0, sticky D does NOT receive the boost.
Otherwise a session would migrate away from its warm D under load."""
sticky_under_loaded = _score(
overlap=0, sticky=True, mean_assigned=10, assigned=0, load_floor_bonus=200
)
# primary = overlap(0) + sticky_bonus(1) + floor(0) = 1
assert sticky_under_loaded[0] == 1
def test_load_floor_zero_when_mean_zero():
"""Warmup case: mean_assigned=0 -> no D gets boost -> degenerate to lex
tiebreak by iteration order."""
s = _score(
overlap=0, sticky=False, mean_assigned=0, assigned=0, load_floor_bonus=200
)
assert s[0] == 0
def test_load_floor_proportional_to_deficit():
"""floor_bonus = K * deficit / mean. assigned=0, mean=10, K=200 -> 200."""
s_zero = _score(
overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
)
s_half = _score(
overlap=0, sticky=False, mean_assigned=10, assigned=5, load_floor_bonus=200
)
s_full = _score(
overlap=0, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
)
# deficit = max(0, 10-0)=10 -> bonus = int(200*10/10) = 200
# deficit = max(0, 10-5)=5 -> bonus = int(200*5/10) = 100
# deficit = max(0, 10-10)=0 -> bonus = 0
assert s_zero[0] == 200
assert s_half[0] == 100
assert s_full[0] == 0
def test_load_floor_does_not_underflow_when_overloaded():
"""assigned > mean -> deficit clamped to 0, no negative bonus."""
s = _score(
overlap=0, sticky=False, mean_assigned=10, assigned=50, load_floor_bonus=200
)
assert s[0] == 0
# -- Routing intent: real overlap beats load-floor bonus -------------------------
def test_real_prefix_overlap_beats_load_floor_on_warm_d():
"""E1_E2_FIX_DESIGN_ZH §Q2: load_floor should be set such that
real per-session prefix overlap outweighs the cold-D bonus.
With overlap=800 (a per-session prefix) and load_floor_bonus=200,
a warm D (high overlap, possibly high load) should still win against
a cold D with floor bonus."""
warm = _score(
overlap=800, sticky=True, mean_assigned=10, assigned=10, load_floor_bonus=200
)
cold = _score(
overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
)
# warm primary = 800 + 1 + 0 = 801. cold primary = 0 + 0 + 200 = 200.
assert warm[0] == 801
assert cold[0] == 200
assert warm > cold
def test_boilerplate_overlap_loses_to_load_floor_for_cold_d():
"""Same §Q2: load_floor should beat cross-session boilerplate overlap.
If load_floor_bonus=200 and the worst-case boilerplate overlap is ~50,
a fresh cold D should still win against a slightly-warm-from-boilerplate D."""
warm_boilerplate = _score(
overlap=50, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
)
cold_under_loaded = _score(
overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
)
# warm_boilerplate primary = 50 + 0 + 0 = 50 (assigned=mean, no deficit).
# cold_under_loaded primary = 0 + 0 + 200 = 200.
assert cold_under_loaded > warm_boilerplate
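
The two routing-intent tests use a fully cold D (assigned=0), where the bonus equals K. For a D that is only partly behind, the bonus shrinks with the deficit fraction. A short worked example of the formula, assuming K=200, a running mean of 100 assignments, and a boilerplate overlap of 50 (`floor_bonus` is a local reimplementation for illustration):

```python
def floor_bonus(k: int, mean: float, assigned: int) -> int:
    """Load-floor term for a non-sticky D, as in score_candidate()."""
    if k <= 0 or mean <= 0:
        return 0
    return int(k * max(0.0, mean - assigned) / mean)


for assigned in (0, 50, 74, 75, 90):
    print(assigned, floor_bonus(200, 100.0, assigned))
# 0 200
# 50 100
# 74 52
# 75 50
# 90 20
```

Under these numbers the bonus exceeds the 50-block boilerplate overlap only while the worker lags the mean by more than 25%, that is, K * deficit / mean > overlap. Closer to the mean, position 0 no longer favours the lagging D on the bonus alone.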


@@ -1564,6 +1564,74 @@ class ScheduleBatch(ScheduleBatchDisaggregationDecodeMixin):
# For DLLM, we use a separate forward mode
self.forward_mode = ForwardMode.DLLM_EXTEND
# Pre-filter pass: drop streaming-session reqs whose committed prefix
# is strictly longer than fill_ids. The streaming-session correction below would
# set extend_input_len = max(0, fill_len - prefix_len) = 0 for these
# reqs, but the downstream invariant at the per-req loop
# (`assert seq_len - pre_len == req.extend_input_len`) is computed from
# raw fill_ids/prefix_indices lengths and has no path to be satisfied
# when fill_len < prefix_len. Treat the condition as upstream state
# inconsistency, abort the affected reqs (so the client sees an error
# response instead of the worker crashing), and continue with the
# remaining batch. See docs/E3_FINDINGS_ZH.md for the failure mode
# this guards against.
if self.reqs:
kept_reqs = []
for req in self.reqs:
if (
req.session is not None
and req.session.streaming
and len(req.fill_ids) < len(req.prefix_indices)
):
logger.error(
"Dropping streaming-session req with fill_ids shorter than "
"prefix_indices (rid=%s, session_id=%s, fill_len=%d, "
"prefix_len=%d, kv_committed_len=%d). Upstream state "
"inconsistency would crash prepare_for_extend's invariant; "
"aborting this req. See docs/E3_FINDINGS_ZH.md.",
req.rid,
req.session.session_id,
len(req.fill_ids),
len(req.prefix_indices),
req.kv_committed_len,
)
req.finished_reason = FINISH_ABORT(
message=(
"streaming-session inconsistency: fill_ids "
f"({len(req.fill_ids)}) < prefix_indices "
f"({len(req.prefix_indices)})"
),
)
else:
kept_reqs.append(req)
if len(kept_reqs) != len(self.reqs):
self.reqs = kept_reqs
if not self.reqs:
# Whole batch filtered. Set empty tensor / list state so
# downstream callers (model_runner.forward, batch_result handlers)
# see a valid no-op batch and skip the model pass cleanly.
_pin = is_pin_memory_available(self.device)
empty_long = torch.zeros(0, dtype=torch.int64, pin_memory=_pin).to(
self.device, non_blocking=True
)
empty_int = torch.zeros(0, dtype=torch.int32, pin_memory=_pin).to(
self.device, non_blocking=True
)
self.input_ids = empty_long
self.req_pool_indices = empty_int
self.seq_lens = empty_long
self.seq_lens_cpu = torch.zeros(0, dtype=torch.int64)
self.orig_seq_lens = empty_int
self.prefix_lens = []
self.extend_lens = []
self.extend_num_tokens = 0
self.out_cache_loc = empty_int
self.input_embeds = None
self.multimodal_inputs = []
self.token_type_ids = None
return
# Init tensors
reqs = self.reqs
for req in reqs:
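
The reason the pre-filter is needed can be shown with plain integers. A minimal sketch, assuming a hypothetical `FakeReq` record that carries only the two lengths and a streaming flag (no torch, no real `Req` or `ScheduleBatch`): clamping `extend_input_len` to zero cannot satisfy an invariant computed from the raw lengths once `fill_len < prefix_len`.

```python
from dataclasses import dataclass


@dataclass
class FakeReq:
    """Hypothetical stand-in for a request: lengths only."""
    rid: str
    streaming: bool
    fill_len: int
    prefix_len: int
    aborted: bool = False


def invariant_holds(req: FakeReq) -> bool:
    # The correction clamps the extend length at zero ...
    extend_input_len = max(0, req.fill_len - req.prefix_len)
    # ... but the invariant compares against the raw difference.
    return req.fill_len - req.prefix_len == extend_input_len


def prefilter(reqs: list[FakeReq]) -> list[FakeReq]:
    """Abort inconsistent streaming reqs, keep everything else."""
    kept = []
    for req in reqs:
        if req.streaming and req.fill_len < req.prefix_len:
            req.aborted = True
        else:
            kept.append(req)
    return kept


batch = [
    FakeReq("ok", streaming=True, fill_len=120, prefix_len=100),
    FakeReq("bad", streaming=True, fill_len=90, prefix_len=100),
]
print([invariant_holds(r) for r in batch])  # [True, False]
print([r.rid for r in prefilter(batch)])    # ['ok']
```

Every request that survives the filter satisfies the invariant, so the per-request loop further down no longer has an input that can trip the assertion.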

615
uv.lock generated

@@ -2,15 +2,33 @@ version = 1
revision = 3
requires-python = ">=3.12"
resolution-markers = [
"python_full_version >= '3.14' and sys_platform == 'win32'",
"python_full_version >= '3.14' and sys_platform == 'emscripten'",
"python_full_version >= '3.14' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and sys_platform == 'win32'",
"python_full_version == '3.13.*' and sys_platform == 'emscripten'",
"python_full_version == '3.13.*' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and sys_platform == 'win32'",
"python_full_version < '3.13' and sys_platform == 'emscripten'",
"python_full_version < '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'win32'",
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'win32'",
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'win32'",
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'win32'",
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
[options]
@@ -30,7 +48,7 @@ dependencies = [
requires-dist = [
{ name = "httpx", specifier = ">=0.28.1" },
{ name = "mooncake-transfer-engine" },
{ name = "sglang", specifier = "==0.5.10" },
{ name = "sglang", editable = "third_party/sglang/python" },
]
[[package]]
@@ -457,7 +475,8 @@ source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "loguru" },
{ name = "pydantic" },
{ name = "torch" },
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
{ name = "transformers" },
]
sdist = { url = "https://files.pythonhosted.org/packages/98/c0/8fb99aa86bc538d3a025749633d1d0105d849b35eb240ba7ba30e22de49b/compressed_tensors-0.15.1a20260409.tar.gz", hash = "sha256:a9a477691c2887bc8d2c46aef82aa60c85fe1f014cacb2218b423904aff04f4d", size = 238217, upload-time = "2026-04-09T21:21:52.922Z" }
@@ -565,8 +584,8 @@ name = "decord2"
version = "3.3.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/51/c3/fbc81c2cc18b2b7ca8a3a26ca2e8dfa243a2c7f5c4431f4b3839a8f12f0a/decord2-3.3.0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:3a67fb644041a031bc3f21b2e1adcf92b9742d980bd90f3bc45396c2a0ddcbfa", size = 25036754, upload-time = "2026-04-06T18:09:46.005Z" },
@@ -664,7 +683,8 @@ dependencies = [
{ name = "einops" },
{ name = "nvidia-cutlass-dsl" },
{ name = "quack-kernels" },
{ name = "torch" },
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
{ name = "torch-c-dlpack-ext" },
{ name = "typing-extensions" },
]
@@ -699,7 +719,8 @@ dependencies = [
{ name = "packaging" },
{ name = "requests" },
{ name = "tabulate" },
{ name = "torch" },
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
{ name = "tqdm" },
]
sdist = { url = "https://files.pythonhosted.org/packages/cc/95/81eafb78574312db79ef7144a4e77f2fee015343f413ef3000f279c8a118/flashinfer_python-0.6.7.post2.tar.gz", hash = "sha256:924cb1788d0335225293eea384da40f40daa6b4e32b6a5ebc214ab679b4e2125", size = 6509418, upload-time = "2026-04-04T07:10:25.516Z" }
@@ -904,34 +925,34 @@ wheels = [
[[package]]
name = "hf-xet"
version = "1.5.0.dev1"
version = "1.5.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/c9/b5/73db543ba19129c23b2ca52d837373eb4243f0332130093f31b3ecc6739f/hf_xet-1.5.0.dev1.tar.gz", hash = "sha256:a21c9c85869ee122747543dd93471826cc0e9b5f61b11411aabd4adf72e345b1", size = 823729, upload-time = "2026-04-17T08:22:19.349Z" }
sdist = { url = "https://files.pythonhosted.org/packages/74/d8/5c06fc76461418326a7decf8367480c35be11a41fd938633929c60a9ec6b/hf_xet-1.5.0.tar.gz", hash = "sha256:e0fb0a34d9f406eed88233e829a67ec016bec5af19e480eac65a233ea289a948", size = 837196, upload-time = "2026-05-06T06:18:15.583Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/79/c1/15fb7a67b1fad51b0d3e3a4e0a33ac2fca8197da842a922bf2f707521915/hf_xet-1.5.0.dev1-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:41abc1601e9449c57880c203332221bc571a9c85154c1789a740259781ba9596", size = 6903797, upload-time = "2026-04-17T08:21:38.028Z" },
{ url = "https://files.pythonhosted.org/packages/c5/a6/66924109da0089c803a0b42eeccd37f321906b0224bad6c220e46a9f6ad2/hf_xet-1.5.0.dev1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:045c43a49776d1dc9836ee0782e85fecbd2e85a6f55ebc39a4a14eb9c83fc004", size = 6570723, upload-time = "2026-04-17T08:21:35.605Z" },
{ url = "https://files.pythonhosted.org/packages/ad/19/c9d51b5512eae52dd3b6eac5f02552cfe78156410e71e1e3d1295f778a0c/hf_xet-1.5.0.dev1-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:908325bf4e53209dfe56d99a5cfed63907e677a32b1ba1f000cd72a8290871e4", size = 63298006, upload-time = "2026-04-17T08:21:12.867Z" },
{ url = "https://files.pythonhosted.org/packages/66/a7/1781b5a465fb4cce525a96c8bf7719583d115eaf2ea4d4ef560a394801a2/hf_xet-1.5.0.dev1-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:d51c3c20460012540dca4094615b74e1b757a7d702910149c7b8175eda91567a", size = 58640118, upload-time = "2026-04-17T08:21:07.745Z" },
{ url = "https://files.pythonhosted.org/packages/38/ef/2c02f7602b94b0f0454f66f9f52e7f37edaf81c3ccfa57073c17ee7e57d8/hf_xet-1.5.0.dev1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:36d45543060cfda059a910cfa702fe2221cba88a49401d9359ae442ccb6fe8e7", size = 59133723, upload-time = "2026-04-17T08:21:51.701Z" },
{ url = "https://files.pythonhosted.org/packages/7d/76/732941c4ce0c0f5991ec1962a1848325a4ee11da2942c2f85100b68cba28/hf_xet-1.5.0.dev1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:3363073f1abc0a55027ba5e666bbdd0147681e856ed3ddda083428f8d81786cf", size = 60269392, upload-time = "2026-04-17T08:21:56.95Z" },
{ url = "https://files.pythonhosted.org/packages/c3/22/65e1146977ddb940136ccd932675425a2fa1a13aef2a35fa54b969e07d77/hf_xet-1.5.0.dev1-cp313-cp313t-win_amd64.whl", hash = "sha256:aa93dcb1271a3cd2846ab07f9e37f27280604dd5c50ea299050553a4fe6fd60d", size = 3993380, upload-time = "2026-04-17T08:22:23.592Z" },
{ url = "https://files.pythonhosted.org/packages/eb/8c/71bc286a6d52a53682c669abeea1d4dd3f320812d9c1816f8d71ad4e99ba/hf_xet-1.5.0.dev1-cp313-cp313t-win_arm64.whl", hash = "sha256:7928c15eef205aaa1786e63294331f184152e8e7d9f0f352047bf1b590f540cd", size = 3851055, upload-time = "2026-04-17T08:22:21.556Z" },
{ url = "https://files.pythonhosted.org/packages/3c/79/42bace8f9651276eb96463b2ad275f6b53fe2b22ba3c5ea7f1819b580785/hf_xet-1.5.0.dev1-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:11a00f8ec39f69c3cd32fb8980b86c91945aaf0588667079994edda9fa2e3cb2", size = 6897594, upload-time = "2026-04-17T08:21:47.543Z" },
{ url = "https://files.pythonhosted.org/packages/c1/b0/7d950c8f68280c1907b146e848e244eec054300769b6645455cf92075094/hf_xet-1.5.0.dev1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:d333be26f91cbfa573d24005c5502ce48eb19ec416982ebd5cf8212cdb549942", size = 6569370, upload-time = "2026-04-17T08:21:45.24Z" },
{ url = "https://files.pythonhosted.org/packages/be/20/60828b7429397f5fe417e312b3b222f97a3293e129977c7d6c1fe07b14cc/hf_xet-1.5.0.dev1-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:44ca5ad2a82c60f1b749a65e361c006fa8c9feaab703e4c9e72b5ff830dca1f6", size = 63253090, upload-time = "2026-04-17T08:21:32.004Z" },
{ url = "https://files.pythonhosted.org/packages/71/54/3fc89b6e47e9e43b86613e32c1cccb8cdeaaa5b19a99decc41d6b57f0d65/hf_xet-1.5.0.dev1-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:df5ba34b731c0be6eb5290cd46adb7b245583bdbf271f87caed60f3a3f65e859", size = 58659612, upload-time = "2026-04-17T08:21:27.084Z" },
{ url = "https://files.pythonhosted.org/packages/18/76/2165625d83309a38dd2b91ce3b7ccb0384151f7f205b033575849b996546/hf_xet-1.5.0.dev1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:c4661dd045f6d59f838119423948d9cec06ac498ac09a869f7df4abbe70f01aa", size = 59152315, upload-time = "2026-04-17T08:22:11.349Z" },
{ url = "https://files.pythonhosted.org/packages/ef/b1/e0effd9fb1acbd142c6e9345db171254f953a701b16799b815535cae771c/hf_xet-1.5.0.dev1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:2b07f87bb1d21cde3889d684f194e0c6047091c94b54c3e52d1b80e738d016ed", size = 60228716, upload-time = "2026-04-17T08:22:16.177Z" },
{ url = "https://files.pythonhosted.org/packages/aa/9e/73921723685e27f6b54a016374894d69fb06eb0452fe7b7ada12b54b32fd/hf_xet-1.5.0.dev1-cp314-cp314t-win_amd64.whl", hash = "sha256:bb81277c04fcd49a4c3e93bc5bcf1d33a9604b32085f3f7e95f52edb9c2deca6", size = 3994035, upload-time = "2026-04-17T08:22:31.471Z" },
{ url = "https://files.pythonhosted.org/packages/4c/7f/a2f422bb7d3050760d0aae59f4999dbfcb84708b822432f2d5bc3dd76234/hf_xet-1.5.0.dev1-cp314-cp314t-win_arm64.whl", hash = "sha256:724fa6f5f644295de503e6cdb1b1c96a7ad2512db6a641daa32b0f33888e88f7", size = 3851354, upload-time = "2026-04-17T08:22:29.647Z" },
{ url = "https://files.pythonhosted.org/packages/85/fa/6c404999f13892e8ef2b75ec07af0b118fa1241a7bd278f6b93d61063746/hf_xet-1.5.0.dev1-cp37-abi3-macosx_10_12_x86_64.whl", hash = "sha256:5a180160a120357cabc0cd60167864f110bb8f0b1c38b71e0a93cde13839475e", size = 6907817, upload-time = "2026-04-17T08:21:42.228Z" },
{ url = "https://files.pythonhosted.org/packages/ad/d1/6c828e215079a436d6e916d30248093b7b3ea911e4e6d40b954d21089fc8/hf_xet-1.5.0.dev1-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:8701d2e1268c78a1c3cd0e4480b74c0a505cfa864269308efae9d73d0e2203f9", size = 6577425, upload-time = "2026-04-17T08:21:40.097Z" },
{ url = "https://files.pythonhosted.org/packages/e3/c9/2b93ba287824948450ddf64e2596220b58633d019dda278c12abadbf7bb5/hf_xet-1.5.0.dev1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e5480448001f9e59046ac4c463f2e25fb652066605dd183a82d2b5625b939487", size = 63137387, upload-time = "2026-04-17T08:21:21.775Z" },
{ url = "https://files.pythonhosted.org/packages/dc/b5/c74899d4da67155db8b4f9d8b21110a919d969a15b75aceaec9502c8e7c3/hf_xet-1.5.0.dev1-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:14e9773ade3fb48dcfa9f493c8ed065704dd3031d29a5a289fed58b8223f2409", size = 58503933, upload-time = "2026-04-17T08:21:17.434Z" },
{ url = "https://files.pythonhosted.org/packages/27/42/d9d511d425696a8b54cf67af0d3de0f8564f81f81e046b107a967f35f00e/hf_xet-1.5.0.dev1-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:21accf171949d78b18099bf57a4e8490db1ad88c0a4e907f8930c78ffe21f47d", size = 59035994, upload-time = "2026-04-17T08:22:01.526Z" },
{ url = "https://files.pythonhosted.org/packages/8c/b6/49afbe73752f8d176231e49bc02b8b3fe96284ba82d856481c598b5343f4/hf_xet-1.5.0.dev1-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:07d8ec5c300a7ce3a39fa8598024992f6d2fcfa167b71cc0cde07abdcd05ca01", size = 60139405, upload-time = "2026-04-17T08:22:06.759Z" },
{ url = "https://files.pythonhosted.org/packages/98/ab/e243e97ba2d5e55c848cdb5622466300990d2d0380c4456132d209ce1252/hf_xet-1.5.0.dev1-cp37-abi3-win_amd64.whl", hash = "sha256:ad32cfd5aa66bdf922b7f8eb9a94eb9f64a8f68a31ffede803060b44bd4060f8", size = 4004017, upload-time = "2026-04-17T08:22:27.78Z" },
{ url = "https://files.pythonhosted.org/packages/f7/08/645da274ebe22d06a1ad103667deae75eb658e2b8e493f3a04a8ab140e2d/hf_xet-1.5.0.dev1-cp37-abi3-win_arm64.whl", hash = "sha256:2093091921534e51e13cbeb956550cded7b97aa7ba1d774123c21d9b06f06231", size = 3859306, upload-time = "2026-04-17T08:22:25.602Z" },
{ url = "https://files.pythonhosted.org/packages/68/9b/6912c99070915a4f28119e3c5b52a9abd1eec0ad5cb293b8c967a0c6f5a2/hf_xet-1.5.0-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:7d70fe2ce97b9db73b9c9b9c81fe3693640aec83416a966c446afea54acfae3c", size = 4023383, upload-time = "2026-05-06T06:17:53.947Z" },
{ url = "https://files.pythonhosted.org/packages/0f/6d/9563cfde59b5d8128a9c7ec972a087f4c782e4f7bac5a85234edfd5d5e49/hf_xet-1.5.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:73a0dae8c71de3b0633a45c73f4a4a5ed09e94b43441d82981a781d4f12baa42", size = 3792751, upload-time = "2026-05-06T06:17:51.791Z" },
{ url = "https://files.pythonhosted.org/packages/07/a5/ed5a0cf35b49a0571af5a8f53416dad1877a718c021c9937c3a53cb45781/hf_xet-1.5.0-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:a60290ec57e9b71767fba7c3645ddafdd0759974b540441510c629c6db6db24a", size = 4456058, upload-time = "2026-05-06T06:17:40.735Z" },
{ url = "https://files.pythonhosted.org/packages/60/fb/3ae8bf2a7a37a4197d0195d7247fd25b3952e15cb8a599e285dfaa6f52b3/hf_xet-1.5.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:e5de0f6deada0dada870bb376a11bcd1f08abf3a968a6d118f33e72d1b1eb480", size = 4250783, upload-time = "2026-05-06T06:17:38.412Z" },
{ url = "https://files.pythonhosted.org/packages/a2/9b/8bae40d4d91525085137196e84eb0ed49cf65b5e96e5c3ecdadd8bd0fac2/hf_xet-1.5.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:c799d49f1a5544a0ef7591c0ee75e0d6b93d6f56dc7a4979f59f7518d2872216", size = 4445594, upload-time = "2026-05-06T06:18:04.219Z" },
{ url = "https://files.pythonhosted.org/packages/13/59/c74efbbd4e8728172b2cc72a2bc014d2947a4b7bdced932fbd3f5da1a4e5/hf_xet-1.5.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:2baea1b0b989e5c152fe81425f7745ddc8901280ba3d97c98d8cdece7b706c60", size = 4663995, upload-time = "2026-05-06T06:18:06.1Z" },
{ url = "https://files.pythonhosted.org/packages/73/32/8e1e0410af64cda9b139d1dcebdc993a8ff9c8c7c0e2696ae356d75ccc0d/hf_xet-1.5.0-cp313-cp313t-win_amd64.whl", hash = "sha256:526345b3ed45f374f6317349df489167606736c876241ba984105afe7fd4839d", size = 3966608, upload-time = "2026-05-06T06:18:19.74Z" },
{ url = "https://files.pythonhosted.org/packages/fc/34/a8febc8f4edbea8b3e21b02ebc8b628679b84ba7e45cde624a7736b51500/hf_xet-1.5.0-cp313-cp313t-win_arm64.whl", hash = "sha256:786d28e2eb8315d5035544b9d137b4a842d600c434bb91bf7d0d953cce906ad4", size = 3796946, upload-time = "2026-05-06T06:18:17.568Z" },
{ url = "https://files.pythonhosted.org/packages/2a/20/8fc8996afe5815fa1a6be8e9e5c02f24500f409d599e905800d498a4e14d/hf_xet-1.5.0-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:872d5601e6deea30d15865ede55d29eac6daf5a534ab417b99b6ef6b076dd96c", size = 4023495, upload-time = "2026-05-06T06:18:01.94Z" },
{ url = "https://files.pythonhosted.org/packages/32/6a/93d84463c00cecb561a7508aa6303e35ee2894294eac14245526924415fe/hf_xet-1.5.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:9929561f5abf4581c8ea79587881dfef6b8abb2a0d8a51915936fc2a614f4e73", size = 3792731, upload-time = "2026-05-06T06:18:00.021Z" },
{ url = "https://files.pythonhosted.org/packages/9d/5a/8ec8e0c863b382d00b3c2e2af6ded6b06371be617144a625903a6d562f4b/hf_xet-1.5.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:f7b7bbae318e583a86fb21e5a4a175d6721d628a2874f4bd022d0e660c32a682", size = 4456738, upload-time = "2026-05-06T06:17:49.574Z" },
{ url = "https://files.pythonhosted.org/packages/c5/ca/f7effa1a67717da2bcc6b6c28f71c6ca648c77acaec4e2c32f40cbe16d85/hf_xet-1.5.0-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:cf7b2dc6f31a4ea754bb50f74cde482dcf5d366d184076d8530b9872787f3761", size = 4251622, upload-time = "2026-05-06T06:17:47.096Z" },
{ url = "https://files.pythonhosted.org/packages/65/f2/19247dba3e231cf77dec59ddfb878f00057635ff773d099c9b59d37812c3/hf_xet-1.5.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:8dbcbab554c9ef158ef2c991545c3e970ddd8cc7acdcd0a78c5a41095dab4ded", size = 4445667, upload-time = "2026-05-06T06:18:11.983Z" },
{ url = "https://files.pythonhosted.org/packages/7f/64/6f116801a3bcfb6f59f5c251f48cadc47ea54026441c4a385079286a94fa/hf_xet-1.5.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:5906bf7718d3636dc13402914736abe723492cb730f744834f5f5b67d3a12702", size = 4664619, upload-time = "2026-05-06T06:18:13.771Z" },
{ url = "https://files.pythonhosted.org/packages/5c/e8/069542d37946ed08669b127e1496fa99e78196d71de8d41eda5e9f1b7a58/hf_xet-1.5.0-cp314-cp314t-win_amd64.whl", hash = "sha256:5f3dc2248fc01cc0a00cd392ab497f1ca373fcbc7e3f2da1f452480b384e839e", size = 3966802, upload-time = "2026-05-06T06:18:28.162Z" },
{ url = "https://files.pythonhosted.org/packages/f9/91/fc6fdec27b14d04e88c386ac0a0129732b53fa23f7c4a78f4b83a039c567/hf_xet-1.5.0-cp314-cp314t-win_arm64.whl", hash = "sha256:b285cea1b5bab46b758772716ba8d6854a1a0310fed1c249d678a8b38601e5a0", size = 3797168, upload-time = "2026-05-06T06:18:26.287Z" },
{ url = "https://files.pythonhosted.org/packages/3d/fb/69ff198a82cae7eb1a69fb84d93b3a3e4816564d76817fe541ddc96874eb/hf_xet-1.5.0-cp37-abi3-macosx_10_12_x86_64.whl", hash = "sha256:dad0dc84e941b8ba3c860659fe1fdc35c049d47cce293f003287757e971a8f56", size = 4030814, upload-time = "2026-05-06T06:17:57.933Z" },
{ url = "https://files.pythonhosted.org/packages/9b/ff/edcc2b40162bef3ff78e14ab637e5f3b89243d6aee72f5949d3bb6a5af83/hf_xet-1.5.0-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:fd6e5a9b0fdac4ed03ed45ef79254a655b1aaab514a02202617fbf643f5fdf7a", size = 3798444, upload-time = "2026-05-06T06:17:55.79Z" },
{ url = "https://files.pythonhosted.org/packages/49/4d/103f76b04310e5e57656696cc184690d20c466af0bca3ca88f8c8ea5d4f3/hf_xet-1.5.0-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:3531b1823a0e6d77d80f9ed15ca0e00f0d115094f8ac033d5cae88f4564cc949", size = 4465986, upload-time = "2026-05-06T06:17:44.886Z" },
{ url = "https://files.pythonhosted.org/packages/c4/a2/546f47f464737b3edbab6f8ddb57f2599b93d2cbb66f06abb475ccb48651/hf_xet-1.5.0-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:9a0ee58cd18d5ea799f7ed11290bbccbe56bdd8b1d97ca74b9cc49a3945d7a3b", size = 4259865, upload-time = "2026-05-06T06:17:42.639Z" },
{ url = "https://files.pythonhosted.org/packages/95/7f/1be593c1f28613be2e196473481cd81bfc5910795e30a34e8f744f6cac4f/hf_xet-1.5.0-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:1e60df5a42e9bed8628b6416af2cba4cba57ae9f02de226a06b020d98e1aab18", size = 4459835, upload-time = "2026-05-06T06:18:08.026Z" },
{ url = "https://files.pythonhosted.org/packages/aa/b2/703569fc881f3284487e68cda7b42179978480da3c438042a6bbbb4a671c/hf_xet-1.5.0-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:4b35549ce62601b84da4ff9b24d970032ace3d4430f52d91bcbb26c901d6c690", size = 4672414, upload-time = "2026-05-06T06:18:09.864Z" },
{ url = "https://files.pythonhosted.org/packages/af/37/1b6def445c567286b50aa3b33828158e135b1be44938dde59f11382a500c/hf_xet-1.5.0-cp37-abi3-win_amd64.whl", hash = "sha256:2806c7c17b4d23f8d88f7c4814f838c3b6150773fe339c20af23e1cfaf2797e4", size = 3977238, upload-time = "2026-05-06T06:18:23.621Z" },
{ url = "https://files.pythonhosted.org/packages/62/94/3b66b148778ee100dcfd69c2ca22b57b41b44d3063ceec934f209e9184ce/hf_xet-1.5.0-cp37-abi3-win_arm64.whl", hash = "sha256:b6c9df403040248c76d808d3e047d64db2d923bae593eb244c41e425cf6cd7be", size = 3806916, upload-time = "2026-05-06T06:18:21.7Z" },
]
[[package]]
@@ -1635,9 +1656,15 @@ name = "numpy"
version = "2.3.5"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version < '3.13' and sys_platform == 'win32'",
"python_full_version < '3.13' and sys_platform == 'emscripten'",
"python_full_version < '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'win32'",
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
sdist = { url = "https://files.pythonhosted.org/packages/76/65/21b3bc86aac7b8f2862db1e808f1ea22b028e30a225a34a5ede9bf8678f2/numpy-2.3.5.tar.gz", hash = "sha256:784db1dcdab56bf0517743e746dfb0f885fc68d948aba86eeec2cba234bdf1c0", size = 20584950, upload-time = "2025-11-16T22:52:42.067Z" }
wheels = [
@@ -1703,12 +1730,24 @@ name = "numpy"
version = "2.4.4"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and sys_platform == 'win32'",
"python_full_version >= '3.14' and sys_platform == 'emscripten'",
"python_full_version >= '3.14' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and sys_platform == 'win32'",
"python_full_version == '3.13.*' and sys_platform == 'emscripten'",
"python_full_version == '3.13.*' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'win32'",
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'win32'",
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'win32'",
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
sdist = { url = "https://files.pythonhosted.org/packages/d7/9f/b8cef5bffa569759033adda9481211426f12f53299629b410340795c2514/numpy-2.4.4.tar.gz", hash = "sha256:2d390634c5182175533585cc89f3608a4682ccb173cc9bb940b2881c8d6f8fa0", size = 20731587, upload-time = "2026-03-29T13:22:01.298Z" }
wheels = [
@@ -1771,42 +1810,116 @@ wheels = [
name = "nvidia-cublas-cu12"
version = "12.8.4.1"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/dc/61/e24b560ab2e2eaeb3c839129175fb330dfcfc29e5203196e5541a4c44682/nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:8ac4e771d5a348c551b2a426eda6193c19aa630236b418086020df5ba9667142", size = 594346921, upload-time = "2025-03-07T01:44:31.254Z" },
]
[[package]]
name = "nvidia-cublas-cu12"
version = "12.9.1.4"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/82/6c/90d3f532f608a03a13c1d6c16c266ffa3828e8011b1549d3b61db2ad59f5/nvidia_cublas_cu12-12.9.1.4-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:7a950dae01add3b415a5a5cdc4ec818fb5858263e9cca59004bb99fdbbd3a5d6", size = 575006342, upload-time = "2025-06-05T20:04:16.902Z" },
]
[[package]]
name = "nvidia-cuda-cupti-cu12"
version = "12.8.90"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/f8/02/2adcaa145158bf1a8295d83591d22e4103dbfd821bcaf6f3f53151ca4ffa/nvidia_cuda_cupti_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:ea0cb07ebda26bb9b29ba82cda34849e73c166c18162d3913575b0c9db9a6182", size = 10248621, upload-time = "2025-03-07T01:40:21.213Z" },
]
[[package]]
name = "nvidia-cuda-cupti-cu12"
version = "12.9.79"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/b4/78/351b5c8cdbd9a6b4fb0d6ee73fb176dcdc1b6b6ad47c2ffff5ae8ca4a1f7/nvidia_cuda_cupti_cu12-12.9.79-py3-none-manylinux_2_25_aarch64.whl", hash = "sha256:791853b030602c6a11d08b5578edfb957cadea06e9d3b26adbf8d036135a4afe", size = 10077166, upload-time = "2025-06-05T20:01:01.385Z" },
]
[[package]]
name = "nvidia-cuda-nvrtc-cu12"
version = "12.8.93"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/05/6b/32f747947df2da6994e999492ab306a903659555dddc0fbdeb9d71f75e52/nvidia_cuda_nvrtc_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl", hash = "sha256:a7756528852ef889772a84c6cd89d41dfa74667e24cca16bb31f8f061e3e9994", size = 88040029, upload-time = "2025-03-07T01:42:13.562Z" },
]
[[package]]
name = "nvidia-cuda-nvrtc-cu12"
version = "12.9.86"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/64/eb/c2295044b8f3b3b08860e2f6a912b702fc92568a167259df5dddb78f325e/nvidia_cuda_nvrtc_cu12-12.9.86-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:096d4de6bda726415dfaf3198d4f5c522b8e70139c97feef5cd2ca6d4cd9cead", size = 44528905, upload-time = "2025-06-05T20:02:29.754Z" },
]
[[package]]
name = "nvidia-cuda-runtime-cu12"
version = "12.8.90"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/0d/9b/a997b638fcd068ad6e4d53b8551a7d30fe8b404d6f1804abf1df69838932/nvidia_cuda_runtime_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:adade8dcbd0edf427b7204d480d6066d33902cab2a4707dcfc48a2d0fd44ab90", size = 954765, upload-time = "2025-03-07T01:40:01.615Z" },
]
[[package]]
name = "nvidia-cuda-runtime-cu12"
version = "12.9.79"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/bc/e0/0279bd94539fda525e0c8538db29b72a5a8495b0c12173113471d28bce78/nvidia_cuda_runtime_cu12-12.9.79-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:83469a846206f2a733db0c42e223589ab62fd2fabac4432d2f8802de4bded0a4", size = 3515012, upload-time = "2025-06-05T20:00:35.519Z" },
]
[[package]]
name = "nvidia-cudnn-cu12"
version = "9.10.2.21"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "nvidia-cublas-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "nvidia-cublas-cu12", version = "12.8.4.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "nvidia-cublas-cu12", version = "12.9.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/fa/41/e79269ce215c857c935fd86bcfe91a451a584dfc27f1e068f568b9ad1ab7/nvidia_cudnn_cu12-9.10.2.21-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:c9132cc3f8958447b4910a1720036d9eff5928cc3179b0a51fb6d167c6cc87d8", size = 705026878, upload-time = "2025-06-06T21:52:51.348Z" },
{ url = "https://files.pythonhosted.org/packages/ba/51/e123d997aa098c61d029f76663dedbfb9bc8dcf8c60cbd6adbe42f76d049/nvidia_cudnn_cu12-9.10.2.21-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:949452be657fa16687d0930933f032835951ef0892b37d2d53824d1a84dc97a8", size = 706758467, upload-time = "2025-06-06T21:54:08.597Z" },
]
@@ -1830,58 +1943,160 @@ wheels = [
name = "nvidia-cufft-cu12"
version = "11.3.3.83"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
dependencies = [
{ name = "nvidia-nvjitlink-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/1f/13/ee4e00f30e676b66ae65b4f08cb5bcbb8392c03f54f2d5413ea99a5d1c80/nvidia_cufft_cu12-11.3.3.83-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:4d2dd21ec0b88cf61b62e6b43564355e5222e4a3fb394cac0db101f2dd0d4f74", size = 193118695, upload-time = "2025-03-07T01:45:27.821Z" },
]
[[package]]
name = "nvidia-cufft-cu12"
version = "11.4.1.4"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
dependencies = [
{ name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/9b/2b/76445b0af890da61b501fde30650a1a4bd910607261b209cccb5235d3daa/nvidia_cufft_cu12-11.4.1.4-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:1a28c9b12260a1aa7a8fd12f5ebd82d027963d635ba82ff39a1acfa7c4c0fbcf", size = 200822453, upload-time = "2025-06-05T20:05:27.889Z" },
]
[[package]]
name = "nvidia-cufile-cu12"
version = "1.13.1.3"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/bb/fe/1bcba1dfbfb8d01be8d93f07bfc502c93fa23afa6fd5ab3fc7c1df71038a/nvidia_cufile_cu12-1.13.1.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:1d069003be650e131b21c932ec3d8969c1715379251f8d23a1860554b1cb24fc", size = 1197834, upload-time = "2025-03-07T01:45:50.723Z" },
]
[[package]]
name = "nvidia-cufile-cu12"
version = "1.14.1.1"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/b9/d2/110af3a1f77999d5eebf6ffae5d2305ab839e53c76eec3696640cc25b35d/nvidia_cufile_cu12-1.14.1.1-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:8dea77590761e02cb6dd955a57cb6414c58aa3cb1b7adbf9919869a11509cf65", size = 1135994, upload-time = "2025-06-05T20:06:03.952Z" },
]
[[package]]
name = "nvidia-curand-cu12"
version = "10.3.9.90"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/fb/aa/6584b56dc84ebe9cf93226a5cde4d99080c8e90ab40f0c27bda7a0f29aa1/nvidia_curand_cu12-10.3.9.90-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:b32331d4f4df5d6eefa0554c565b626c7216f87a06a4f56fab27c3b68a830ec9", size = 63619976, upload-time = "2025-03-07T01:46:23.323Z" },
]
[[package]]
name = "nvidia-curand-cu12"
version = "10.3.10.19"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/14/1c/2a45afc614d99558d4a773fa740d8bb5471c8398eeed925fc0fcba020173/nvidia_curand_cu12-10.3.10.19-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:de663377feb1697e1d30ed587b07d5721fdd6d2015c738d7528a6002a6134d37", size = 68292066, upload-time = "2025-05-01T19:39:13.595Z" },
]
[[package]]
name = "nvidia-cusolver-cu12"
version = "11.7.3.90"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
dependencies = [
{ name = "nvidia-cublas-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "nvidia-cusparse-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "nvidia-nvjitlink-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "nvidia-cublas-cu12", version = "12.8.4.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "nvidia-cusparse-cu12", version = "12.5.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/85/48/9a13d2975803e8cf2777d5ed57b87a0b6ca2cc795f9a4f59796a910bfb80/nvidia_cusolver_cu12-11.7.3.90-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:4376c11ad263152bd50ea295c05370360776f8c3427b30991df774f9fb26c450", size = 267506905, upload-time = "2025-03-07T01:47:16.273Z" },
]
[[package]]
name = "nvidia-cusolver-cu12"
version = "11.7.5.82"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
dependencies = [
{ name = "nvidia-cublas-cu12", version = "12.9.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "nvidia-cusparse-cu12", version = "12.5.10.65", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/03/99/686ff9bf3a82a531c62b1a5c614476e8dfa24a9d89067aeedf3592ee4538/nvidia_cusolver_cu12-11.7.5.82-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:62efa83e4ace59a4c734d052bb72158e888aa7b770e1a5f601682f16fe5b4fd2", size = 337869834, upload-time = "2025-06-05T20:06:53.125Z" },
]
[[package]]
name = "nvidia-cusparse-cu12"
version = "12.5.8.93"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
dependencies = [
{ name = "nvidia-nvjitlink-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
{ name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/c2/f5/e1854cb2f2bcd4280c44736c93550cc300ff4b8c95ebe370d0aa7d2b473d/nvidia_cusparse_cu12-12.5.8.93-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:1ec05d76bbbd8b61b06a80e1eaf8cf4959c3d4ce8e711b65ebd0443bb0ebb13b", size = 288216466, upload-time = "2025-03-07T01:48:13.779Z" },
]
[[package]]
name = "nvidia-cusparse-cu12"
version = "12.5.10.65"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
dependencies = [
{ name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/5e/6f/8710fbd17cdd1d0fc3fea7d36d5b65ce1933611c31e1861da330206b253a/nvidia_cusparse_cu12-12.5.10.65-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:221c73e7482dd93eda44e65ce567c031c07e2f93f6fa0ecd3ba876a195023e83", size = 366359408, upload-time = "2025-06-05T20:07:42.501Z" },
]
[[package]]
name = "nvidia-cusparselt-cu12"
version = "0.7.1"
source = { registry = "https://pypi.org/simple" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/73/b9/598f6ff36faaece4b3c50d26f50e38661499ff34346f00e057760b35cc9d/nvidia_cusparselt_cu12-0.7.1-py3-none-manylinux2014_aarch64.whl", hash = "sha256:8878dce784d0fac90131b6817b607e803c36e629ba34dc5b433471382196b6a5", size = 283835557, upload-time = "2025-02-26T00:16:54.265Z" },
{ url = "https://files.pythonhosted.org/packages/56/79/12978b96bd44274fe38b5dde5cfb660b1d114f70a65ef962bcbbed99b549/nvidia_cusparselt_cu12-0.7.1-py3-none-manylinux2014_x86_64.whl", hash = "sha256:f1bb701d6b930d5a7cea44c19ceb973311500847f81b634d802b7b539dc55623", size = 287193691, upload-time = "2025-02-26T00:15:44.104Z" },
]
@@ -1929,6 +2144,7 @@ name = "nvidia-nccl-cu12"
version = "2.27.5"
source = { registry = "https://pypi.org/simple" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/bb/1c/857979db0ef194ca5e21478a0612bcdbbe59458d7694361882279947b349/nvidia_nccl_cu12-2.27.5-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:31432ad4d1fb1004eb0c56203dc9bc2178a1ba69d1d9e02d64a6938ab5e40e7a", size = 322400625, upload-time = "2025-06-26T04:11:04.496Z" },
{ url = "https://files.pythonhosted.org/packages/6e/89/f7a07dc961b60645dbbf42e80f2bc85ade7feb9a491b11a1e973aa00071f/nvidia_nccl_cu12-2.27.5-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:ad730cf15cb5d25fe849c6e6ca9eb5b76db16a80f13f425ac68d8e2e55624457", size = 322348229, upload-time = "2025-06-26T04:11:28.385Z" },
]
@@ -1936,15 +2152,34 @@ wheels = [
name = "nvidia-nvjitlink-cu12"
version = "12.8.93"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/f6/74/86a07f1d0f42998ca31312f998bd3b9a7eff7f52378f4f270c8679c77fb9/nvidia_nvjitlink_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl", hash = "sha256:81ff63371a7ebd6e6451970684f916be2eab07321b73c9d244dc2b4da7f73b88", size = 39254836, upload-time = "2025-03-07T01:49:55.661Z" },
]
[[package]]
name = "nvidia-nvjitlink-cu12"
version = "12.9.86"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/97/bc/2dcba8e70cf3115b400fef54f213bcd6715a3195eba000f8330f11e40c45/nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:994a05ef08ef4b0b299829cde613a424382aff7efb08a7172c1fa616cc3af2ca", size = 39514880, upload-time = "2025-06-05T20:10:04.89Z" },
]
[[package]]
name = "nvidia-nvshmem-cu12"
version = "3.3.20"
source = { registry = "https://pypi.org/simple" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/92/9d/3dd98852568fb845ec1f7902c90a22b240fe1cbabda411ccedf2fd737b7b/nvidia_nvshmem_cu12-3.3.20-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:0b0b960da3842212758e4fa4696b94f129090b30e5122fea3c5345916545cff0", size = 124484616, upload-time = "2025-08-04T20:24:59.172Z" },
{ url = "https://files.pythonhosted.org/packages/3b/6c/99acb2f9eb85c29fc6f3a7ac4dccfd992e22666dd08a642b303311326a97/nvidia_nvshmem_cu12-3.3.20-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:d00f26d3f9b2e3c3065be895e3059d6479ea5c638a3f38c9fec49b1b9dd7c1e5", size = 124657145, upload-time = "2025-08-04T20:25:19.995Z" },
]
@@ -1952,10 +2187,28 @@ wheels = [
name = "nvidia-nvtx-cu12"
version = "12.8.90"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/a2/eb/86626c1bbc2edb86323022371c39aa48df6fd8b0a1647bc274577f72e90b/nvidia_nvtx_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:5b17e2001cc0d751a5bc2c6ec6d26ad95913324a4adb86788c944f8ce9ba441f", size = 89954, upload-time = "2025-03-07T01:42:44.131Z" },
]
[[package]]
name = "nvidia-nvtx-cu12"
version = "12.9.79"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/c4/e4/82155e4aaedb41621087ba219c95e99c5e417f37a7649b4fb6ec32dcb14d/nvidia_nvtx_cu12-12.9.79-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:d1f258e752294acdb4f61c3d31fee87bd0f60e459f1e2f624376369b524cd15d", size = 86120, upload-time = "2025-06-05T20:02:51.838Z" },
]
[[package]]
name = "openai"
version = "2.6.1"
@@ -2072,7 +2325,8 @@ dependencies = [
{ name = "pydantic" },
{ name = "referencing" },
{ name = "requests" },
{ name = "torch" },
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
{ name = "tqdm" },
{ name = "typing-extensions" },
]
@@ -2893,7 +3147,8 @@ source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "apache-tvm-ffi" },
{ name = "nvidia-cutlass-dsl" },
{ name = "torch" },
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
{ name = "torch-c-dlpack-ext" },
]
sdist = { url = "https://files.pythonhosted.org/packages/73/34/bcc87d1ee53cf245bf58ea563b276b9bd86a405bda5a42e7bd1386db9941/quack_kernels-0.3.11.tar.gz", hash = "sha256:d589417476030fb62e70730c4bd0732339a04b8bb91fd49bf4cc70e20a27170b", size = 246675, upload-time = "2026-04-20T01:08:12.269Z" }
@@ -3315,8 +3570,7 @@ wheels = [
[[package]]
name = "sglang"
version = "0.5.10"
source = { registry = "https://pypi.org/simple" }
source = { editable = "third_party/sglang/python" }
dependencies = [
{ name = "aiohttp" },
{ name = "anthropic" },
@@ -3369,7 +3623,8 @@ dependencies = [
{ name = "soundfile" },
{ name = "tiktoken" },
{ name = "timm" },
{ name = "torch" },
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
{ name = "torch-memory-saver" },
{ name = "torchao" },
{ name = "torchaudio" },
@@ -3382,10 +3637,118 @@ dependencies = [
{ name = "watchfiles" },
{ name = "xgrammar" },
]
sdist = { url = "https://files.pythonhosted.org/packages/c8/4e/bd00d332098337ae13fa783a13258935d568dd5b7e1fd9df205184145224/sglang-0.5.10.tar.gz", hash = "sha256:db78367f41a1f385f8624a10e9506b671e788f9943978df6a37a486867c1edc7", size = 4700833, upload-time = "2026-04-05T23:57:27.556Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/1f/ee/f7a946162ed538f47a1c5542f93410e5bf9a0c4ca6021d4000e6f9b87f7d/sglang-0.5.10-py3-none-any.whl", hash = "sha256:ac8855a5d57dac8831fee526bca5212f1ae451f378e2ab08b3baecbc4deb4076", size = 6064398, upload-time = "2026-04-05T23:57:25.28Z" },
[package.metadata]
requires-dist = [
{ name = "accelerate", marker = "extra == 'test'" },
{ name = "addict", marker = "extra == 'diffusion'", specifier = "==2.4.0" },
{ name = "addict", marker = "extra == 'test'" },
{ name = "aiohttp" },
{ name = "anthropic", specifier = ">=0.20.0" },
{ name = "apache-tvm-ffi", specifier = ">=0.1.5,<0.2" },
{ name = "av", marker = "(platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'armv7l' and sys_platform == 'linux')" },
{ name = "av", marker = "extra == 'diffusion'", specifier = "==16.1.0" },
{ name = "bitsandbytes", marker = "extra == 'test'" },
{ name = "blobfile", specifier = "==3.0.0" },
{ name = "build" },
{ name = "cache-dit", marker = "extra == 'diffusion'", specifier = "==1.3.0" },
{ name = "checkpoint-engine", marker = "extra == 'checkpoint-engine'", specifier = "==0.1.2" },
{ name = "cloudpickle", marker = "extra == 'diffusion'", specifier = "==3.1.2" },
{ name = "compressed-tensors" },
{ name = "cuda-python", specifier = "==12.9" },
{ name = "datasets" },
{ name = "decord2", marker = "(platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'armv7l' and sys_platform == 'linux')" },
{ name = "diff-cover", marker = "extra == 'test'" },
{ name = "diffusers", marker = "extra == 'diffusion'", specifier = "==0.37.0" },
{ name = "einops" },
{ name = "expecttest", marker = "extra == 'test'" },
{ name = "fastapi" },
{ name = "flash-attn-4", specifier = ">=4.0.0b4" },
{ name = "flashinfer-cubin", specifier = "==0.6.7.post2" },
{ name = "flashinfer-python", specifier = "==0.6.7.post2" },
{ name = "gguf" },
{ name = "imageio", marker = "extra == 'diffusion'", specifier = "==2.36.0" },
{ name = "imageio-ffmpeg", marker = "extra == 'diffusion'", specifier = "==0.5.1" },
{ name = "interegular" },
{ name = "ipython" },
{ name = "jsonlines", marker = "extra == 'test'" },
{ name = "llguidance", specifier = ">=0.7.11,<0.8.0" },
{ name = "lm-eval", extras = ["api"], marker = "extra == 'test'", specifier = ">=0.4.9.2" },
{ name = "matplotlib", marker = "extra == 'test'" },
{ name = "mistral-common", specifier = ">=1.9.0" },
{ name = "modelscope" },
{ name = "moviepy", marker = "extra == 'diffusion'", specifier = ">=2.0.0" },
{ name = "msgspec" },
{ name = "ninja" },
{ name = "numpy" },
{ name = "nvidia-cutlass-dsl", specifier = ">=4.4.1" },
{ name = "nvidia-ml-py" },
{ name = "openai", specifier = "==2.6.1" },
{ name = "openai-harmony", specifier = "==0.0.4" },
{ name = "opencv-python-headless", marker = "extra == 'diffusion'", specifier = "==4.10.0.84" },
{ name = "opentelemetry-api", marker = "extra == 'tracing'" },
{ name = "opentelemetry-exporter-otlp", marker = "extra == 'tracing'" },
{ name = "opentelemetry-exporter-otlp-proto-grpc", marker = "extra == 'tracing'" },
{ name = "opentelemetry-sdk", marker = "extra == 'tracing'" },
{ name = "orjson" },
{ name = "outlines", specifier = "==0.1.11" },
{ name = "packaging" },
{ name = "pandas", marker = "extra == 'test'" },
{ name = "parameterized", marker = "extra == 'test'" },
{ name = "partial-json-parser" },
{ name = "peft", marker = "extra == 'test'", specifier = ">=0.18.0" },
{ name = "pillow" },
{ name = "polars", marker = "extra == 'test'" },
{ name = "prometheus-client", specifier = ">=0.20.0" },
{ name = "psutil" },
{ name = "py-spy" },
{ name = "pybase64" },
{ name = "pydantic" },
{ name = "pytest", marker = "extra == 'test'" },
{ name = "pytest-cov", marker = "extra == 'test'" },
{ name = "python-multipart" },
{ name = "pyyaml", marker = "extra == 'diffusion'", specifier = "==6.0.1" },
{ name = "pyzmq", specifier = ">=25.1.2" },
{ name = "quack-kernels", specifier = ">=0.3.0" },
{ name = "ray", extras = ["default"], marker = "extra == 'ray'", specifier = ">=2.54.0" },
{ name = "remote-pdb", marker = "extra == 'diffusion'", specifier = "==2.1.0" },
{ name = "requests" },
{ name = "runai-model-streamer", marker = "extra == 'diffusion'", specifier = ">=0.15.7" },
{ name = "runai-model-streamer", extras = ["azure", "gcs", "s3"], marker = "extra == 'runai'", specifier = ">=0.15.7" },
{ name = "scikit-image", marker = "extra == 'diffusion'", specifier = "==0.25.2" },
{ name = "scipy" },
{ name = "sentence-transformers", marker = "extra == 'test'" },
{ name = "sentencepiece" },
{ name = "setproctitle" },
{ name = "sglang", extras = ["diffusion"], marker = "extra == 'all'" },
{ name = "sglang", extras = ["test"], marker = "extra == 'dev'" },
{ name = "sglang", extras = ["tracing"], marker = "extra == 'all'" },
{ name = "sglang-kernel", specifier = "==0.4.1" },
{ name = "smg-grpc-servicer", specifier = ">=0.5.0" },
{ name = "soundfile", specifier = "==0.13.1" },
{ name = "st-attn", marker = "platform_machine != 'aarch64' and platform_machine != 'arm64' and extra == 'diffusion'", specifier = "==0.0.7" },
{ name = "tabulate", marker = "extra == 'test'" },
{ name = "tiktoken" },
{ name = "timm", specifier = "==1.0.16" },
{ name = "torch", marker = "platform_machine != 'aarch64' and platform_machine != 'x86_64'", specifier = "==2.9.1" },
{ name = "torch", marker = "platform_machine == 'aarch64'", specifier = "==2.9.1", index = "https://download.pytorch.org/whl/cu129" },
{ name = "torch", marker = "platform_machine == 'x86_64'", specifier = "==2.9.1", index = "https://pypi.org/simple" },
{ name = "torch-memory-saver", specifier = "==0.0.9" },
{ name = "torchao", specifier = "==0.9.0" },
{ name = "torchaudio", specifier = "==2.9.1" },
{ name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l') or sys_platform != 'linux'", specifier = "==0.9.1" },
{ name = "torchvision" },
{ name = "tqdm" },
{ name = "transformers", specifier = "==5.3.0" },
{ name = "trimesh", marker = "extra == 'diffusion'", specifier = ">=4.0.0" },
{ name = "uvicorn" },
{ name = "uvloop" },
{ name = "vsa", marker = "platform_machine != 'aarch64' and platform_machine != 'arm64' and extra == 'diffusion'", specifier = "==0.0.4" },
{ name = "watchfiles" },
{ name = "xatlas", marker = "extra == 'diffusion'" },
{ name = "xgrammar", specifier = "==0.1.32" },
]
provides-extras = ["checkpoint-engine", "runai", "diffusion", "ray", "tracing", "test", "dev", "all"]
[[package]]
name = "sglang-kernel"
@@ -3574,7 +3937,8 @@ dependencies = [
{ name = "huggingface-hub" },
{ name = "pyyaml" },
{ name = "safetensors" },
{ name = "torch" },
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
{ name = "torchvision" },
]
sdist = { url = "https://files.pythonhosted.org/packages/94/f6/4d7a8c261341fa6ad281920618739f2a650f41043afcedb570f24e99a776/timm-1.0.16.tar.gz", hash = "sha256:a3b8130dd2cb8dc3b9f5e3d09ab6d677a6315a8695fd5264eb6d52a4a46c1044", size = 2339999, upload-time = "2025-06-26T17:09:44.208Z" }
@@ -3612,30 +3976,50 @@ wheels = [
name = "torch"
version = "2.9.1"
source = { registry = "https://pypi.org/simple" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'win32'",
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
"python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'win32'",
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
"python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'win32'",
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
"python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
dependencies = [
{ name = "filelock" },
{ name = "fsspec" },
{ name = "jinja2" },
{ name = "networkx" },
{ name = "nvidia-cublas-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cuda-cupti-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cuda-nvrtc-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cuda-runtime-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "filelock", marker = "platform_machine != 'aarch64'" },
{ name = "fsspec", marker = "platform_machine != 'aarch64'" },
{ name = "jinja2", marker = "platform_machine != 'aarch64'" },
{ name = "networkx", marker = "platform_machine != 'aarch64'" },
{ name = "nvidia-cublas-cu12", version = "12.8.4.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cuda-cupti-cu12", version = "12.8.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cuda-nvrtc-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cuda-runtime-cu12", version = "12.8.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cudnn-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cufft-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cufile-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-curand-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cusolver-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cusparse-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cufft-cu12", version = "11.3.3.83", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cufile-cu12", version = "1.13.1.3", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-curand-cu12", version = "10.3.9.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cusolver-cu12", version = "11.7.3.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cusparse-cu12", version = "12.5.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-cusparselt-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-nccl-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-nvjitlink-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-nvshmem-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "nvidia-nvtx-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "setuptools" },
{ name = "sympy" },
{ name = "nvidia-nvtx-cu12", version = "12.8.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "setuptools", marker = "platform_machine != 'aarch64'" },
{ name = "sympy", marker = "platform_machine != 'aarch64'" },
{ name = "triton", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "typing-extensions" },
{ name = "typing-extensions", marker = "platform_machine != 'aarch64'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/0f/27/07c645c7673e73e53ded71705045d6cb5bae94c4b021b03aa8d03eee90ab/torch-2.9.1-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:da5f6f4d7f4940a173e5572791af238cb0b9e21b1aab592bd8b26da4c99f1cd6", size = 104126592, upload-time = "2025-11-12T15:20:41.62Z" },
@@ -3660,12 +4044,61 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/db/2b/f7818f6ec88758dfd21da46b6cd46af9d1b3433e53ddbb19ad1e0da17f9b/torch-2.9.1-cp314-cp314t-win_amd64.whl", hash = "sha256:c88d3299ddeb2b35dcc31753305612db485ab6f1823e37fb29451c8b2732b87e", size = 111163659, upload-time = "2025-11-12T15:23:20.009Z" },
]
[[package]]
name = "torch"
version = "2.9.1+cu129"
source = { registry = "https://download.pytorch.org/whl/cu129" }
resolution-markers = [
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'win32'",
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
"python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'win32'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
"python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'win32'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
"python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]
dependencies = [
{ name = "filelock", marker = "platform_machine == 'aarch64'" },
{ name = "fsspec", marker = "platform_machine == 'aarch64'" },
{ name = "jinja2", marker = "platform_machine == 'aarch64'" },
{ name = "networkx", marker = "platform_machine == 'aarch64'" },
{ name = "nvidia-cublas-cu12", version = "12.9.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-cuda-cupti-cu12", version = "12.9.79", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-cuda-nvrtc-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-cuda-runtime-cu12", version = "12.9.79", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-cudnn-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-cufft-cu12", version = "11.4.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-cufile-cu12", version = "1.14.1.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-curand-cu12", version = "10.3.10.19", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-cusolver-cu12", version = "11.7.5.82", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-cusparse-cu12", version = "12.5.10.65", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-cusparselt-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-nccl-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-nvshmem-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "nvidia-nvtx-cu12", version = "12.9.79", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "setuptools", marker = "platform_machine == 'aarch64'" },
{ name = "sympy", marker = "platform_machine == 'aarch64'" },
{ name = "triton", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
{ name = "typing-extensions", marker = "platform_machine == 'aarch64'" },
]
wheels = [
{ url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:c501c66fe5b0e2fc70f9d8a18e17a265f92ad1d1009dba03f5938d2f15a9066f", upload-time = "2026-01-26T17:26:29Z" },
{ url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:ab44cf28e6ca2df679f0845fb4b950c81834431218840ca01c0a1583892a0986", upload-time = "2026-01-26T17:26:26Z" },
{ url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:794482180a4f2d92a960f470fcd47e066dbe2eeb27816880e618d3ce031805f7", upload-time = "2026-01-26T17:26:04Z" },
{ url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp314-cp314-manylinux_2_28_aarch64.whl", hash = "sha256:4559e1254e2c8e1a337758626d1cf33ca5a5ded3509fa012070334bf886b686b", upload-time = "2026-01-26T17:25:38Z" },
{ url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:cbe8955514ace826d3638a5d5dc1faa2f9dda1de4de74941d2e86b1a0859477c", upload-time = "2026-01-26T17:25:36Z" },
]
[[package]]
name = "torch-c-dlpack-ext"
version = "0.1.5"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "torch" },
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/37/de/921b6491efce5c389a5ef9bbed3d2d6660005840dae488124173180859ab/torch_c_dlpack_ext-0.1.5.tar.gz", hash = "sha256:d06f0357d575d22a168cc77acb9020fc4bae30968ceb6718a055dcbe92bacabe", size = 12913, upload-time = "2026-01-12T11:25:08.484Z" }
wheels = [
@@ -3706,7 +4139,8 @@ name = "torchaudio"
version = "2.9.1"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "torch" },
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/f1/83/71cbadd7b66753818b5775f2088bad4f721d581de276996df4968000a626/torchaudio-2.9.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:7581ef170794c599aed55918e00d0acd9e5c9a0f19400c9a9a840955180365c5", size = 808098, upload-time = "2025-11-12T15:26:01.408Z" },
@@ -3755,7 +4189,8 @@ dependencies = [
{ name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13'" },
{ name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13'" },
{ name = "pillow" },
{ name = "torch" },
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/f0/af/18e2c6b9538a045f60718a0c5a058908ccb24f88fde8e6f0fc12d5ff7bd3/torchvision-0.24.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:e48bf6a8ec95872eb45763f06499f87bd2fb246b9b96cb00aae260fda2f96193", size = 1891433, upload-time = "2025-11-12T15:25:03.232Z" },
@@ -3827,10 +4262,15 @@ name = "triton"
version = "3.5.1"
source = { registry = "https://pypi.org/simple" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/db/53/2bcc46879910991f09c063eea07627baef2bc62fe725302ba8f46a2c1ae5/triton-3.5.1-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:275a045b6ed670dd1bd005c3e6c2d61846c74c66f4512d6f33cc027b11de8fd4", size = 159940689, upload-time = "2025-11-11T17:51:55.938Z" },
{ url = "https://files.pythonhosted.org/packages/f2/50/9a8358d3ef58162c0a415d173cfb45b67de60176e1024f71fbc4d24c0b6d/triton-3.5.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d2c6b915a03888ab931a9fd3e55ba36785e1fe70cbea0b40c6ef93b20fc85232", size = 170470207, upload-time = "2025-11-11T17:41:00.253Z" },
{ url = "https://files.pythonhosted.org/packages/f1/ba/805684a992ee32d486b7948d36aed2f5e3c643fc63883bf8bdca1c3f3980/triton-3.5.1-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:56765ffe12c554cd560698398b8a268db1f616c120007bfd8829d27139abd24a", size = 159955460, upload-time = "2025-11-11T17:52:01.861Z" },
{ url = "https://files.pythonhosted.org/packages/27/46/8c3bbb5b0a19313f50edcaa363b599e5a1a5ac9683ead82b9b80fe497c8d/triton-3.5.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f3f4346b6ebbd4fad18773f5ba839114f4826037c9f2f34e0148894cd5dd3dba", size = 170470410, upload-time = "2025-11-11T17:41:06.319Z" },
{ url = "https://files.pythonhosted.org/packages/84/1e/7df59baef41931e21159371c481c31a517ff4c2517343b62503d0cd2be99/triton-3.5.1-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:02c770856f5e407d24d28ddc66e33cf026e6f4d360dcb8b2fabe6ea1fc758621", size = 160072799, upload-time = "2025-11-11T17:52:07.293Z" },
{ url = "https://files.pythonhosted.org/packages/37/92/e97fcc6b2c27cdb87ce5ee063d77f8f26f19f06916aa680464c8104ef0f6/triton-3.5.1-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0b4d2c70127fca6a23e247f9348b8adde979d2e7a20391bfbabaac6aebc7e6a8", size = 170579924, upload-time = "2025-11-11T17:41:12.455Z" },
{ url = "https://files.pythonhosted.org/packages/14/f9/0430e879c1e63a1016cb843261528fd3187c872c3a9539132efc39514753/triton-3.5.1-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f617aa7925f9ea9968ec2e1adaf93e87864ff51549c8f04ce658f29bbdb71e2d", size = 159956163, upload-time = "2025-11-11T17:52:12.999Z" },
{ url = "https://files.pythonhosted.org/packages/a4/e6/c595c35e5c50c4bc56a7bac96493dad321e9e29b953b526bbbe20f9911d0/triton-3.5.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d0637b1efb1db599a8e9dc960d53ab6e4637db7d4ab6630a0974705d77b14b60", size = 170480488, upload-time = "2025-11-11T17:41:18.222Z" },
{ url = "https://files.pythonhosted.org/packages/41/1e/63d367c576c75919e268e4fbc33c1cb33b6dc12bb85e8bfe531c2a8bd5d3/triton-3.5.1-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:8932391d7f93698dfe5bc9bead77c47a24f97329e9f20c10786bb230a9083f56", size = 160073620, upload-time = "2025-11-11T17:52:18.403Z" },
{ url = "https://files.pythonhosted.org/packages/16/b5/b0d3d8b901b6a04ca38df5e24c27e53afb15b93624d7fd7d658c7cd9352a/triton-3.5.1-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:bac7f7d959ad0f48c0e97d6643a1cc0fd5786fe61cb1f83b537c6b2d54776478", size = 170582192, upload-time = "2025-11-11T17:41:23.963Z" },
]
@@ -4029,7 +4469,8 @@ dependencies = [
{ name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13'" },
{ name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13'" },
{ name = "pydantic" },
{ name = "torch" },
{ name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
{ name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
{ name = "transformers" },
{ name = "triton", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
{ name = "typing-extensions" },