Commit Graph

43 Commits

Author SHA1 Message Date
tim
ef4dc81ea9 docs(experiments): forensic explanation for E2 80% failure rate
Pulling admission-events.jsonl, prefill-0.log, and request-metrics
sampling shows the 1054 failures are NOT timeouts as initially
assumed. They are a 3-layer cascade:

  L1: 562 "no-space" + 43 "session-not-resident" worker admission
      rejects (51% of all admit attempts) because D0/D1 KV pools
      saturate while D2 stays empty.
  L2: rejects re-route to seed/reseed which need mooncake P→D KV
      transfer; the backlog drops mooncake heartbeats and prefill-0
      logs "Decode instance could be dead, remote mooncake session
      ... is not alive".
  L3: SGLang aborts the request, SSE stream closes with 0 tokens,
      agentic-pd-hybrid raises "generate stream ended before
      producing any token" (the literal error string for all 1054).

E1 didn't hit this because pd-disaggregation has no admission RPC —
sessions just queue behind the running batch, paying TTFT instead
of failing. KVC v2's worker admission is supposed to be a safety
valve; on the cold-D pathology it becomes a failure amplifier.

The real fix is upstream D rebalancing (cold-D bonus or pre-warm),
not relaxing admission.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:38:49 +08:00
tim
3db2d84df8 docs(experiments): E2 complete — qualified H1 with a surprise
E2 finished 1h33min wall. Headline contrast on the matched Inferact
50-session subset:

E1 (naive 1P3D + kv-aware + RDMA):
  1200/1285 succ, lat p50=93s p99=219s, TTFT p50=89s p99=207s
E2 (KVC v2 + RDMA):
   231/1285 succ, lat p50= 7.4s p99=65s, TTFT p50=0.43s p99=8.7s

E2 is 12.4× worse on failure rate but roughly 200× better on TTFT p50 among
the requests that did complete. Both runs leave D2 entirely unused
for the same structural reason: Inferact's shared "permissions
instructions" boilerplate makes overlap dominate the kv-aware lex
score, and v2's migration mechanism only fires on capacity rejects
which never reach D2. The 1054 E2 timeouts are downstream of that
imbalance, not a v2 bug per se.

The doc closes with five concrete follow-ups for the next agent —
cold-D bonus, router-mode admission, default-policy control arm,
TCP-loopback comparison, failure mode forensics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 03:23:33 +08:00
tim
e3e5c45ed4 docs(experiments): E2 mid-run finding — D2 stays cold in KVC v2 too
Same pathological imbalance E1 showed reproduces in E2: D2 has zero
bindings at 33% POSTs in. Root cause is structural, not a KVC v2 bug:
all 50 Inferact sessions begin with identical "permissions
instructions" boilerplate, so the converter assigns them identical
first-block hash_ids. kv-aware policy's overlap term (lex-score
position 0) makes any already-resident D dominate a fresh D
unconditionally, and v2's migration only activates on admission
rejects which never fire because D0/D1 KV pools have headroom. The
H1 conclusion is qualified: KVC v2 helps per-request work (direct-
to-D fast path) but does not rebalance D worker load on workloads
with shared cross-session prefixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 02:08:00 +08:00
tim
631b2c8847 docs(experiments): E1 results — naive 1P3D + kv-aware confirms H1 baseline
E1 finished 1h29min wall on the 50-session Inferact subset. Headline:
1200/1285 succeeded, latency p50=93s p99=219s, TTFT p50=89s p99=207s,
85 timeouts. Decode-2 was never bound to a single session — all 50
sessions stuck to decode-0/1 by kv-aware policy stickiness with no
migration to rebalance, so effective topology was 1P2D, not 1P3D.
This is exactly the failure mode H1 predicts naive pd-disaggregation
should exhibit, giving E2 (full KVC v2 with migration) a concrete
baseline to improve against.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 01:49:52 +08:00
tim
ad8aaa8c5a feat(experiments): E2 sweep — KVC v2 + RDMA on the matched subset
KVC v2 config from sweep_ts1_migration_v2.sh (reset-on-success +
direct-append threshold 8192) layered on top of the RDMA-enabled
mooncake stack, against the same outputs/inferact_50sess.jsonl
subset that E1 uses. Pair-wise contrast tests H1 (KVC layer marginal
contribution on top of 1P3D + kv-aware) and H2/H3 (RDMA reducing
reseed slow-path tail).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:49:53 +08:00
tim
bb9cc249cd feat(experiments): E1 sweep on 50-session deterministic subset
scripts/sample_trace_subset.py — file-order head-cut that takes the
first N sessions of a converted trace. No RNG, no hashing — same
input yields byte-identical output (the included assertion compares
md5 across two runs).
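
A minimal sketch of the head-cut idea, not the actual contents of
sample_trace_subset.py. The `session_id` field name and both helper names
are hypothetical, and the sketch assumes later requests from an
already-admitted session are kept even when they are interleaved with
requests from sessions past the cut:

```python
import hashlib
import json
from typing import Dict, Iterable, List


def head_cut_sessions(lines: Iterable[str], n_sessions: int,
                      session_key: str = "session_id") -> List[str]:
    """Keep every request of the first `n_sessions` sessions, in file order."""
    admitted: Dict[str, bool] = {}  # insertion-ordered set of admitted sessions
    out: List[str] = []
    for line in lines:
        if not line.strip():
            continue
        sid = json.loads(line)[session_key]  # hypothetical field name
        if sid not in admitted:
            if len(admitted) >= n_sessions:
                continue  # session is past the cut: drop all of its requests
            admitted[sid] = True
        out.append(line if line.endswith("\n") else line + "\n")
    return out


def md5_of(lines: List[str]) -> str:
    """Digest of the emitted subset, used to compare two runs."""
    return hashlib.md5("".join(lines).encode("utf-8")).hexdigest()
```

Because the function uses no RNG and no set iteration, two calls on the
same input produce the same lines and therefore the same digest.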

scripts/sweep_e1_naive_1p3d.sh — E1 of ONBOARDING_NEXT_AGENT_ZH §3.1:
mechanism=pd-disaggregation, policy=kv-aware, 1P3D, RDMA on
(mlx5_60). Defaults to outputs/inferact_50sess.jsonl so E1 and E2
can share the exact same subset; override via TRACE= env var to run
on the full 20,230-request trace.

Reproducing the subset:
  uv run --no-sync python scripts/sample_trace_subset.py \
    --input outputs/inferact_codex_swebenchpro.jsonl \
    --output outputs/inferact_50sess.jsonl \
    --sessions 50
  # expected output_md5: 7bb263a32600ef5a6ef5099ba340a487
  # 1285 requests, mean input_length 67631 tokens

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:21:36 +08:00
tim
b55371fe69 docs: H200 + driver 570 setup guide + 11 lessons learned
Captures the full debugging journey of getting vendored SGLang 0.5.10
+ mooncake RDMA running on a 4×H200 node with the older driver
570.86.15. Driver 570's actual API is cu12.8 — nvidia-smi's
"CUDA Version: 13.0" header is a forward-compat ceiling, not the
driver's own version — and that single misreading drove most of the
detours. Lessons cover: pip vs vendor sglang divergence, why cu13
switching was a dead end (mooncake is cu12-only by wheel, driver 570
can't run cu13 anyway), why --disable-overlap-schedule alone isn't
enough, why pip nvidia-cuda-nvcc-cu12 doesn't ship the nvcc binary,
and how tvm_ffi's ninja-driven nvcc invocation makes CUDA_HOME the
single hook point that fixes everything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:10:14 +08:00
tim
d11a66d11b feat(scripts): cu12.8 env wrapper + Inferact trace converter
setup_env.sh: source-able shell snippet that points tvm_ffi (vendor
sglang JIT compiler) at $HOME/cuda-12.8/bin/nvcc and exposes both
libcudart.so.12 (for mooncake.engine, a cu12 wheel) and cu12.8 lib64
(for tvm_ffi compile-time linker) on LD_LIBRARY_PATH. Without this,
JIT-compiled kernels NEEDED libcudart.so.13 and driver 570 rejected
them at every JIT call.

convert_inferact_to_trace.py: turns Inferact codex_swebenchpro_traces
(ShareGPT {"from","value"} pairs) into the chat_id/parent_chat_id/
turn/hash_ids JSONL schema replay.py expects. Tokenizes with the
model's own tokenizer, builds prefix-sharing 24-token block hashes,
synthesizes timestamps. Output cross-checks 20,230 LLM calls — exactly
matches the Inferact README count for 610 successful trials.
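
A sketch of what prefix-sharing block hashes can look like, under stated
assumptions: each full 24-token block is hashed together with the digest
of the preceding block, digests are mapped to dense integer ids, and a
trailing partial block is dropped. The function name, the registry, and
the use of SHA-256 are illustrative choices, not taken from
convert_inferact_to_trace.py:

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK_TOKENS = 24  # block size stated in the commit message


def prefix_block_hash_ids(token_ids: Sequence[int],
                          registry: Dict[str, int],
                          block_tokens: int = BLOCK_TOKENS) -> List[int]:
    """Map each full block to an id that depends on all preceding tokens."""
    ids: List[int] = []
    parent = ""  # digest of the prefix so far; empty before the first block
    for b in range(len(token_ids) // block_tokens):
        block = token_ids[b * block_tokens:(b + 1) * block_tokens]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # setdefault hands out the next dense id the first time a digest is seen
        ids.append(registry.setdefault(parent, len(registry)))
    return ids
```

With this chaining, two sessions that open with the same boilerplate get
identical leading hash_ids and diverge only from the first differing block.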

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:10:06 +08:00
tim
a418aafeed feat(stack): pin PD workers to --disable-overlap-schedule
On a node with driver 570.86.15 (cu12.8 driver API ceiling), SGLang's
overlap event loop hits cudaErrorInsufficientDriver inside
event_loop_overlap_disagg_prefill → resolve_future_token_ids JIT
kernel. Switching to the normal event loop sidesteps this specific
codepath. The flag is harmless on newer drivers and remains a useful
default until overlap is independently re-validated on this hardware.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:09:56 +08:00
tim
e874b1f055 feat(env): install vendored SGLang via uv path source
Replace pip-resolved sglang==0.5.10 with an editable install from
third_party/sglang/python. The vendored fork carries patches the pip
release does not (admit_direct_append RPC types, _should_allow_local_
prefill_on_decode, maybe_trim_decode_session_cache, backpressure pause
hint) — KVC routing depends on them, so the vendored copy must be the
import target, not just on PYTHONPATH at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:09:50 +08:00
kzlin
7590e55189 docs: archive deprecated docs to docs/archive/, drop E1 from onboarding
Two cleanups:

1. Drop "E1: naive 1P3D default" experiment from the onboarding manual.
   GPU hours are precious; naive 1P3D + policy=default has near-certain
   loss on multi-turn cache hit (it's round-robin without prefix awareness),
   so the comparison doesn't add information vs E1=naive 1P3D kv-aware.
   The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC
   v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial /
   5.5h parallel. Updated:
   - §0 TL;DR ("3 groups" -> "2 groups")
   - §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware)
   - §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop)
   - §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2)
   - §6 decision table + expected-range table
   - §7 FAQ ("3 runs E1-E3" -> "2 runs E1-E2")
   - §9 deliverables

2. Move 8 deprecated docs to docs/archive/:
     AGENTIC_FIT_ANALYSIS_ZH.md         (ts=10 era analysis; superseded)
     STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded)
     KVC_DEBUG_JOURNEY_V1_TO_V5.md      (v1-v5 sweep process notes)
     V5_PROFILE_INVESTIGATION_ZH.md     (v5 1Hz polling investigation)
     REFACTOR_PLAN_ZH.md                (v0 plan; superseded by V1)
     KVCACHE_CENTRIC_PROGRESS_ZH.md     (earliest 2026-04-27 progress)
     SWEBENCH_EXPERIMENT_PROGRESS.md    (early SWE trace setup)
     SWEBENCH_EXPERIMENT_RESULTS.md     (early SWE result snapshot)

   All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS /
   REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from
   `docs/FOO.md` to `docs/archive/FOO.md` via sed pass.

   Added `docs/archive/README.md` explaining what each archived doc is
   and when (if ever) to reopen it. Designed so a new reader hitting
   the archive dir immediately knows it's not required reading.

After this commit the active docs in docs/ are 9 files (down from 17),
which should make the onboarding doc's "Level 1 / Level 2 / Level 3"
classification self-evident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:40:35 +08:00
kzlin
5a2fb8799c docs(kvc): onboarding manual for the next SWE agent
A single self-contained reading manual designed to bring a fresh agent
(LLM or human) to current-state proficiency in 30 min of reading +
30 min of environment validation, then have them run the next round of
ablation experiments without re-litigating questions already settled.

Structure:
  §0 TL;DR -- what you are inheriting in 5 lines
  §1 Reading order, tiered into Must-Read / On-Demand / Archive,
     with reasons for each
  §2 Current-state snapshot: trace/hardware/branches + claims verified
     + hypotheses pending
  §3 The three ablation experiments (E1/E2/E3) with full CLI flag
     specifications and environment-validation checklist
  §4 Known gotchas (8 of them) with symptoms and fixes -- the most
     important section to skim before you start
  §5 CLI cheatsheet: run experiments / read data / plot / git
  §6 Result-analysis checklist: numbers to collect, expected ranges
  §7 FAQ for likely stuck-points
  §8 Anti-patterns: what NOT to do
  §9 Two specific deliverables the main agent expects back
  Appendix A: file location lookup table
  Appendix B: commit lookup table (by intent)

Goals encoded into the doc:
- Frame "your job is ablation, not new development" -- the new agent
  should not be tempted to start D->P sync work; that goes on the
  feat/d-to-p-sync branch in a separate phase.
- Make abort-accounting / max-input-len / mooncake-TCP-default
  pitfalls extremely visible up front so they don't get repeated.
- Provide expected-result ranges so a 2x deviation is treated as a
  config check, not a "finding".
- Make the critic-vs-production framing explicit so the new agent
  knows when an audit-style "MAJOR" is actually a design intent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:31:08 +08:00
kzlin
506d360160 fix(figures): GPU utilization figure annotation/headroom polish
Bar-overlap fix: extend ylim by 35-45% above the tallest bar to give the
"P GPU only sees 328 requests" and "P GPU does 1.07M tokens" annotations
clean white-bbox space above the bars instead of crashing into the KVC D
bars at x=1. Move both annotation xytext positions to x=2.4 (left panel)
and x=5.5 (right panel) so the arrows pull away from the orange P bar
toward the center of the panel.

Group labels (KVC 1P3D / DP 4-way CA) kept in axes-fraction bboxes at
y=1.02; subplot titles raised to pad=24 to leave room.

Note: a small visual collision between the bboxed group labels and the
subplot-title second line remains in the rendered output (acknowledged
in the prior conversation). Acceptable for now; full layout rework is
deferred. The annotation-vs-bar overlap (the original blocker) is fixed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:28:39 +08:00
kzlin
c01d6101d6 docs(kvc): freeze reseed slow-path audit + three reviewer challenges
Standalone reference document capturing the v2 reseed slow-path forensic
audit before opening the feat/d-to-p-sync branch. Designed to be quoted
directly by future paper drafts and to prevent the team from relitigating
the same questions verbally.

Contents:

§1. The three team-member challenges that disproved "capacity-backup will
    save the slow path" (each with code citation and verdict):
    1) P pool can't fit all backups -- replay.py:1618-1620 caps backup
       count at 1 for sessions with ~50K peak input.
    2) P's backup is a stale snapshot -- 49K of direct-to-D append work
       never flows through P. _commit_prefill_backup_residency
       (replay.py:1483) is only called from seed/reseed paths;
       direct-to-D path (replay.py:2719) never touches P-side state.
    3) When D evicts, old KV is freed directly (no D->P dump).
       session_aware_cache.release_session only calls
       kv_pool_allocator.free().

§2. End-to-end reseed timeline (t=0 to t=4550ms) with code citations
    showing exactly where each component sits. P-side re-prefill =
    1.5-3s, mooncake transfer = 1.5-4s, both contributing 50/50 to
    total reseed cost.

§3. Table of "looks like D->P but isn't" code locations -- every
    candidate found during forensic search ruled out with line citations.

§4. Specification of what D->P incremental sync would require:
    mooncake bidirectional roles (~400 LOC), D-side append commit hook
    (easy), P-side radix tree multi-producer extension (the real blocker),
    agentic-pd-hybrid replay.py hooks. Estimated 1-2 weeks engineering.

§5. Confirmation via `git ls-remote origin --refs` that author has NOT
    secretly implemented D->P on another branch -- only main + this
    working branch exist on the server.

§6. Roadmap for the upcoming feat/d-to-p-sync branch.

Appendices: code position crosswalk, related commits, paper section
suggestions.

This document is referenced by V2_DEEP_ANALYSIS_ZH §4.2 and by
KVC_ROUTER_ALGORITHM §9 Open Question 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:20:34 +08:00
kzlin
9ccd853066 docs(kvc): correct reseed cost decomposition + flag D->P sync gap
After an independent Opus-agent forensic audit, the previous "(c) incremental
fetch (large engineering effort, not implemented)" line in V2_DEEP_ANALYSIS
§4.2 understated the gap. The audit confirmed:

- No D->P KV transfer code exists in the framework at any layer
  (agentic_pd_hybrid orchestration, vendored SGLang disaggregation,
  or mooncake transport).
- Mooncake MooncakeKVManager has a hard role split: PREFILL = sender,
  DECODE = receiver-only loop. `add_transfer_request` asserts the
  disaggregation_mode is PREFILL.
- The BaseKVSender / BaseKVReceiver abstraction has no bidirectional slot.
- session_aware_cache.release_session only calls kv_pool_allocator.free()
  on eviction -- no serialization, no outbound network call.
- _commit_prefill_backup_residency is only called from the seed/reseed
  path (_invoke_kvcache_seeded_router). direct-to-D path never updates
  P-side backup state.
- "capacity-backup" policy semantics: it only skips the close on P after
  reseed -- the backup is the seed-time static snapshot, never refreshed
  by D-side append-prefill activity.

V2_DEEP_ANALYSIS §4.2:
- Decomposed the 3-7s reseed cost into the P-side re-prefill segment
  (1.5-3s, dominant) and the P->D mooncake transfer segment (1.5-4s).
- Quantified the realistic effect of enabling RDMA: only the transfer
  segment shrinks, reseed reduces to 1.7-3.2s, TTFT p99 ~0.7s, still
  loses to DP's 0.43s.
- Replaced the throwaway "(c) incremental fetch" line with a full
  paragraph explaining what D->P sync would require, why it's the
  largest engineering gap, and that the blocker is SGLang's radix-tree
  single-producer assumption, not the network layer.

KVC_ROUTER_ALGORITHM §9:
- Refined Open Question 3 (RDMA) to clarify it only helps the transfer
  segment, not the re-prefill segment.
- Added Open Question 4: D->P incremental KV sync as the central
  future-work contribution gap, with cited evidence for why it doesn't
  currently exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:07:14 +08:00
kzlin
517677d7f2 docs(kvc): add GPU-utilization and cache-efficiency figures (rebut critic)
Two figures inserted into V2_DEEP_ANALYSIS §4.5 and §4.4 respectively, to
visually rebut the two critic-agent claims that we argued in prose were
design intent, not deficiencies.

(1) gpu_utilization.png  -- §4.5  "P GPU is wasted 90% of the time"
  Two-panel side-by-side:
    Left  (request count view, the naive reading): KVC P = 328 reqs (7.4%),
          KVC D = ~1450 each, DP = ~1100 each. P "looks idle."
    Right (compute work view, the honest reading): KVC P does 1.07M tokens
          of prefill, comparable to each KVC D worker's ~0.80M. P is a
          low-frequency high-cost safety net, not idle capacity.
  Bonus finding: KVC's total compute (3.47M tokens across 4 GPUs) is 33%
  LESS than DP's (5.17M). Same GPUs, less work done. That's the affinity
  win.

(2) cache_efficiency.png  -- §4.4  "Cache concentration is not policy win"
  Two-panel side-by-side. The setup: KVC has 21% LESS total KV pool
  (276K vs 351K tokens) yet caches MORE per request.
    Left  (cache hit rate vs turn number): KVC's session-affinity lets
          hit rate accumulate with turns; DP's hash + radix-LRU causes
          a mid-turn drift around turns 8-25 where KVC = 97.0% vs DP
          = 95.8% (1.24pp gap). Shows mechanism, not just outcome.
    Right (ECDF of per-request uncached tokens, log x): KVC's distribution
          concentrates near zero (50% < 187 tokens), DP's is spread
          (50% < 781 tokens). At uncached = 500 tokens threshold, KVC
          has 74% of requests below, DP has 31%.
  → smaller pool, better retention, less per-request work. Direct empirical
  rebuttal to "fragmentation is architectural, not policy."
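
For reference, the right-panel statistic ("fraction of requests below an
uncached-token threshold") can be sketched as follows. The helper names
are hypothetical and this is not the code in plot_cache_efficiency.py:

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def ecdf(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return sorted values and the cumulative fraction at each of them."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def fraction_below(values: Sequence[float], threshold: float) -> float:
    """Share of requests whose uncached-token count is strictly below threshold."""
    xs = sorted(values)
    return bisect_left(xs, threshold) / len(xs)
```

For example, `fraction_below(uncached_tokens, 500)` is the quantity that
the panel reports at the 500-token threshold.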

Bundled scripts (rerunnable):
- scripts/analysis/plot_gpu_utilization.py
- scripts/analysis/plot_cache_efficiency.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:04:49 +08:00
kzlin
c5519066de docs(kvc): add TTFT probability density figure (KVC v2 vs 4DP)
Adds a two-panel TTFT PDF comparison plot inserted as a new V2_DEEP_ANALYSIS
§3.4 ("TTFT probability density comparison: bimodal vs unimodal"). Single-percentile numbers
(p50 / p99) hide the qualitative difference between the two distributions;
the figure makes it visible at a glance.

Left panel (linear x in [0, 0.6]s, body):
  KVC has a sharp peak at ~40ms (the direct-to-D fast path).
  DP has a broad peak around 50-200ms (full prefill per request).
  Annotated with p50 and p90 markers for each side.

Right panel (log x in [10ms, 10s], full range):
  KVC is visibly bimodal: a tall fast-path peak plus a small reseed tail
  around 1-5s.
  DP is unimodal: a single broad peak with shorter tail.
  Annotated with p99 callouts pointing to each tail.

KDE: scipy.stats.gaussian_kde, bandwidth=0.15 for the body (Scott's rule
oversmooths the sharp fast-path peak), log10-transformed for the full-range
panel so the bimodal structure is visible.

Bundled:
- scripts/analysis/plot_ttft_pdf.py -- rerunnable when v2 / DP data change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:46:27 +08:00
kzlin
b5af19583b docs(kvc): replace v2 path breakdown tables with generated figures
V2_DEEP_ANALYSIS §3.1 (execution_mode distribution) and §3.2 (path-level
latency vs DP) had hand-typed tables with approximate latencies (e.g.
"~1.0s") and required readers to mentally compare 5+ rows × 5 columns.
Both sections now reference generated PNG figures derived directly from
the v2 + DP metrics.jsonl files.

§3.1 figure (v2_execution_mode_distribution.png):
  Horizontal bar chart, log x-axis. 4076 direct-to-D fast-path requests
  (green) dwarf the rest by ~30x; the long tail of slow / fallback /
  failure modes is visible at one glance. Counts and percentages
  annotated on each bar.

§3.2 figure (v2_path_level_latency.png):
  Grouped bar chart, log y-axis. Per-path TTFT p50 / TTFT p99 / Lat p50
  with exact numeric labels (no more "~1.0s" approximations). Sample
  counts annotated below each path. Quick visual reads:
   - KVC fast path TTFT p50 41ms vs DP 92ms (2.2x faster)
   - KVC reseed TTFT p99 5.12s vs DP 0.43s (12x slower) -- the cost
   - KVC no-d-capacity TTFT p99 7.65s (worst case)

Bundled:
- scripts/analysis/plot_v2_path_breakdown.py -- the script that
  generates both figures; rerunnable when v2 data changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:38:43 +08:00
kzlin
37e9caa431 docs(kvc): production-decision reframe + formal router algorithm spec
After the critic-agent audit, V2_DEEP_ANALYSIS had drifted into an
audit-grade "5 wins / 1 loss / 3 draws" framing that mistook KVC's
deliberate design motifs (cache concentration via session affinity;
prefill-GPU idle as TTFT-stability trade-off) for "comparison
unfairness." This commit corrects the framing back to a production-
decision lens and adds a paper-track formal specification of the
router algorithm.

V2_DEEP_ANALYSIS_ZH.md changes:
- §0 TL;DR: lead with "online coding agent serving should pick
  KVC 1P3D"; the only real cost is TTFT p99 long-tail (3x DP) from
  the 8.3% mooncake reseed path, mitigable with real RDMA.
- §4 restructured into three buckets:
    real costs (TTFT p99 tail, abort accounting now fixed),
    counter-arguments to the critic (cache concentration and idle
      prefill GPU are design intent, not deficits),
    methodology to-do (naive-1P3D control, v2 N>=2 determinism).
- §6 replaces "5/1/3 rescoring" with production decision rationale:
  KVC wins on 6 latency/TTFT metrics + lower failure rate; pays
  TTFT p99 tail; lists workloads where DP would reverse the call.
- §8 decision points: D1 recommends Yes (accept v2 as milestone);
  D8 added: paper motif "KVC trades P idle for TTFT stability."

KVC_ROUTER_ALGORITHM.md (new, paper-track, Chinese narrative + English
algorithm boxes / variable names / theorems for direct paper reuse):
- Problem formulation, system model, full notation
- Algorithm 1 Route: lexicographic-tuple scoring on
    (overlap+alpha*sticky, sticky, -inflight, -assigned)
- Algorithm 2 Admit: D-worker autonomous admission deciding
    Direct / Seed / Reseed / reject (with reason)
- Algorithm 3 Dispatch: end-to-end orchestration with reset-on-success
    (the v2-specific fix that eliminates v1's self-amplifying thrashing)
- Theorem 1 (no permanent starvation) and Theorem 2 (fast-path
    determinism), each with a proof sketch
- Comparison table vs vanilla pd-disagg / DP cache-aware
- Anti-patterns ("what KVC explicitly is NOT")
- Open questions for reviewers
- Suggested paper citation phrasing
- Appendix A: algorithm-step to source-file:line crosswalk
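
The lexicographic scoring in Algorithm 1 can be sketched with plain tuple
comparison. This is an illustration, not the code in policies.py: the
`DecodeWorker` type, the field meanings (overlap as a count of resident
prefix blocks, sticky as 0/1), and the default `alpha` are assumptions:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecodeWorker:
    name: str
    overlap: int   # assumed: prefix blocks already resident on this worker
    sticky: int    # assumed: 1 if the session is currently bound here, else 0
    inflight: int  # requests currently running on this worker
    assigned: int  # sessions assigned to this worker so far


def route(workers: List[DecodeWorker], alpha: float = 1.0) -> DecodeWorker:
    """Pick the worker with the largest (overlap+alpha*sticky, sticky, -inflight, -assigned)."""
    def score(w: DecodeWorker):
        # Tuples compare position by position, so load only breaks ties
        # between workers with equal overlap and stickiness.
        return (w.overlap + alpha * w.sticky, w.sticky, -w.inflight, -w.assigned)
    return max(workers, key=score)
```

Under these assumptions a worker with any resident overlap beats an empty
worker regardless of load, which matches the cold-D behavior that the
E1/E2 commits describe.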

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:29:18 +08:00
kzlin
5eac9b4f6b fix(metrics): exclude aborted requests from latency/ttft/tpot stats
The old filter `if row.latency_s is not None` accepted SGLang's fast
input-length-aborts (latency_s ~ 0.08s, finish_reason='abort/BadRequest')
as if they were successful zero-cost requests. This deflated mean/p50
of any run where the model rejected oversized inputs.

Impact on existing comparisons (ts=1 4-run validation + v2):
  KVC v2 has 40 aborts + 5 ReadTimeouts (was reported as just 5);
  DP 4w  has 67 aborts (was reported as 5).
Both runs have abort behavior; the asymmetry (40 vs 67) is purely from
SGLang's mem-fraction-derived max-input-len: KVC decode-only worker gets
~10 GB free GPU mem -> max-input=92098, DP fused worker gets ~9 GB ->
max-input=87811, because DP also needs chunked-prefill workspace.

The KVC-vs-DP latency-win direction holds and widens slightly under the
fixed filter (lat mean delta: -0.8% -> -1.4%); see V2_DEEP_ANALYSIS_ZH
§4.3 for the recomputed table.

Changes:
- metrics.py: new _is_failed_request(row) helper; latency/ttft/tpot
  stats now exclude both errors and aborts. New summary fields
  abort_count and failure_count expose the counts directly.
- scripts/analysis/recompute_summary.py: re-derives summary.json from
  existing metrics.jsonl using the fixed code, with optional --diff
  against the old buggy summary for inspection.
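
A minimal sketch of the filter change, assuming a row shape with
`latency_s`, `finish_reason`, and `error` fields. Only `latency_s`,
`finish_reason`, `_is_failed_request`, `abort_count`, and `failure_count`
are named in this commit; the `Row` type, the `error` field, and
`summarize` are hypothetical:

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Row:
    latency_s: Optional[float]
    finish_reason: Optional[str] = None
    error: Optional[str] = None  # hypothetical field for transport errors


def _is_aborted(row: Row) -> bool:
    return (row.finish_reason or "").startswith("abort")


def _is_failed_request(row: Row) -> bool:
    """Errors, missing latencies, and aborts are all excluded from stats."""
    return row.error is not None or row.latency_s is None or _is_aborted(row)


def summarize(rows: List[Row]) -> Dict[str, Optional[float]]:
    ok = [r.latency_s for r in rows if not _is_failed_request(r)]
    return {
        "latency_p50_s": median(ok) if ok else None,
        "abort_count": sum(1 for r in rows if _is_aborted(r)),
        "failure_count": sum(1 for r in rows if _is_failed_request(r)),
    }
```

With the old `latency_s is not None` filter, a 0.08s abort would enter the
latency sample as if it were a very fast success; here it is counted in
`abort_count` instead.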

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:29:18 +08:00
kzlin
0c25168cad docs(kvc): v2 deep analysis vs TEAM_REPORT baseline
Post-v2 audit consolidating ts=1 validation + v1 thrashing + v2 win, plus
critic-agent adversarial review of the v2 vs 4DP comparison.

Headline outcomes:
- TEAM_REPORT §1 (session pin starvation) fully fixed by v2 migration +
  reset-on-success; direct-to-D 42.8% -> 91.6%.
- TEAM_REPORT §2/§3/§5 (LRU, backpressure, admission RPC) are absorbed by
  ts=1 natural drain time, not mechanism-fixed -- will resurface under
  ts=10/longer traces/higher concurrency.
- TEAM_REPORT §6 (ts=10 distortion) confirmed and locked as precondition;
  TEAM_REPORT §8 (N=1 unreliable) rewritten to "high-pressure N>=3, normal N=1".

Three new problems exposed by adversarial review:
- TTFT p99: KVC 1.285s vs DP 0.427s (KVC 3.0x worse) -- cherry-picked out of
  the V2_RESULTS_ZH.md headline table. Root cause: 8.3% non-direct path pays
  3-7s mooncake reseed cost on 50-90K-token KV transfer.
- Error accounting asymmetry: DP has 67 fast-aborts (not 5) at ~0.08s each
  counted in latency stats; KVC's 5 ReadTimeouts excluded entirely. Root
  cause: --max-input-len 87811 (DP) vs 92098 (KVC) + metrics.py:124 filter.
- Topology mismatch: KVC 1P3D's prefill GPU is idle 91.7% of the time
  (only ~373/4449 requests use seed/P path); 4DP CA has all 4 GPUs at full
  utilization. Plus: no naive 1P3D control exists in the repo -- cannot
  isolate KVC-layer contribution from 1P3D-topology contribution.

Re-scored headline: 5 KVC wins / 1 DP win / 3 draws -- still net positive
but not the "7/8 wins" framing the V2_RESULTS_ZH.md claims.

Recommended follow-ups (ROI order):
1. naive 1P3D ts=1 N=1 control (critic's only CRITICAL finding)
2. v2 N=2/N=3 to verify ts=1 determinism with new code paths
3. symmetric error accounting recompute + DP max-input-len = 92098 rerun

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 11:17:00 +08:00
kzlin
2ec0debef4 feat(kvc): session migration with reset-on-success + direct-append threshold tuning
KVC v2 beats 4DP at ts=1 same-scale on 7/8 metrics:
  TTFT mean -24%, p50 -54%, p90 -64%; lat mean -0.8%, p50 -12.6%, p90 -0.7%.
  Direct-to-D rate jumped 42.8% -> 91.7%. REFACTOR_PLAN_V1 scenario C achieved.

Two-knob fix:
- reset-on-success blacklist decay: clear (sess, D) reject counter on
  successful direct-to-D path. Eliminates v1 thrashing where session 6880
  was stable on decode-1 for 70 turns then collapsed to 75 D-changes after
  cumulative transient pressure tripped the permanent blacklist.
- bump --kvcache-direct-max-uncached-tokens default 2048 -> 8192 via CLI flag.
  41% of v1 fallbacks were 'real-large-append' (>2048 token append); raising
  the threshold lets these go through the direct-to-D fast path.
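
The reset-on-success knob can be sketched as a per-(session, D) counter
that is cleared on a successful direct-to-D request. The class name,
method names other than `record_admission_reject`, and the default
threshold of 3 are hypothetical; this is not the code in policies.py or
replay.py:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class RejectTracker:
    def __init__(self, migration_reject_threshold: int = 3):  # default is illustrative
        self.threshold = migration_reject_threshold
        self.session_d_rejects: Dict[Tuple[str, str], int] = defaultdict(int)

    def record_admission_reject(self, session: str, worker: str) -> None:
        self.session_d_rejects[(session, worker)] += 1

    def record_direct_success(self, session: str, worker: str) -> None:
        # reset-on-success: transient pressure no longer accumulates forever
        self.session_d_rejects.pop((session, worker), None)

    def is_blacklisted(self, session: str, worker: str) -> bool:
        return self.session_d_rejects.get((session, worker), 0) >= self.threshold

    def eligible(self, session: str, workers: List[str]) -> List[str]:
        ok = [w for w in workers if not self.is_blacklisted(session, w)]
        if ok:
            return ok
        # degenerate fallback: every D is over threshold, pick the least-rejected
        return [min(workers, key=lambda w: self.session_d_rejects.get((session, w), 0))]
```

Without the `record_direct_success` reset, isolated rejects spread over
many turns would eventually trip the threshold, which is the thrashing
pattern described for session 6880.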

Code:
- policies.py: RoutingState.session_d_rejects counter + KvAwarePolicy
  migration_reject_threshold; degenerate fallback picks least-rejected D.
- replay.py: record_admission_reject + reset-on-success in _run_request;
  _fallthrough_reason classifies turn-2+ fall-throughs as session-not-resident
  / real-large-append / etc, replacing misleading 'large-append' suffix
  (TEAM_REPORT §2.7).
- cli.py + benchmark.py: --kvcache-migration-reject-threshold flag wiring.

Docs:
- REFACTOR_PLAN_V1_ZH.md: forward-looking plan after ts=1 validation.
- MIGRATION_V1_FINDINGS_ZH.md: v1 thrashing root-cause analysis.
- V2_RESULTS_ZH.md: v2 results, scenario C achievement, attribution.
- TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md: comprehensive team report.

Scripts:
- sweep_ts1_kvc_n3_plus_dp.sh: ts=1 baseline (KVC 1P3D N=3 + 4DP CA).
- sweep_ts1_migration_v1.sh / v2.sh: validation runs.
- analyze_ts1_validation.py: 4-way comparison analyzer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:18:13 +08:00
kzlin
1d51704dad docs(kvc): agentic-fit analysis, refactor plan, validation report
Three new docs covering the structural-fit investigation:

- AGENTIC_FIT_ANALYSIS_ZH.md: §1-§7 of structural design issues that
  surface KVC vs vanilla DP gap on real agentic workloads (SWE 50sess).
  Quantifies session pinning, LRU shortfall, P-side imbalance,
  time-scale distortion, etc., with code citations and N=3 rerun data.

- REFACTOR_PLAN_ZH.md: KISS-edition refactor plan. After verifying the
  original "estimate inflation" and "resident_blocks aging" claims were
  not real bugs, scope shrinks to one code change (backpressure) plus a
  4-run smoke sweep within an 8h budget.

- STRUCTURAL_VALIDATION_REPORT_ZH.md: validates §1-§7 claims using
  existing v5 baseline rerun data + 8DP CA baseline. Each claim labeled
  fully-supported / indirect / retracted with the data source. Notes
  that backpressure E2E validation is pending GPU smoke run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:30:11 +08:00
kzlin
7affb565b2 feat(kvc): add backpressure smoke sweep + analyzer (and v6 p1 profile script)
scripts/sweep_backpressure_smoke.sh: 4-run smoke matrix (KVC baseline /
KVC + backpressure / KVC + backpressure @ time-scale=1 / DP @
time-scale=1) designed to fit ~3-4h GPU budget. Validates §3 backpressure
implementation and partially probes §7 time-scale distortion.

scripts/analysis/analyze_backpressure_smoke.py: consumes the new
structural/* jsonl files plus request-metrics; emits headline metrics,
backpressure histograms, admission probe stats, and per-session pinning
distribution.

scripts/sweep_tp1_v6_p1_profile.sh: pre-existing v6 P1 profile sweep
script (was untracked; included for completeness).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:29:56 +08:00
kzlin
c47adaf8e3 feat(kvc): honor admission backpressure hints + structural event logging
Replay-side changes paired with the SGLang admission hint:

- DecodeResidencyState gains pause_until_s; admission probe parses
  recommended_pause_ms and updates the per-D pause window.
- _wait_for_decode_pause is invoked at request entry points
  (_invoke_router, _invoke_session_direct) so requests stall before
  hitting a saturated D instead of timing out via mooncake.
- New CLI flags: --enable-backpressure (default off, baseline preserved),
  --backpressure-max-pause-s (cap on per-request sleep, default 2s).

Structural instrumentation written under <run_dir>/structural/:
- admission-events.jsonl: every admission probe (RTT, queue_depth,
  pause_ms, available_tokens, evicted_count)
- backpressure-events.jsonl: every actual pause sleep
- session-d-binding.jsonl: per-request policy decision

Used to validate the structural claims documented separately.
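
A minimal sketch of the pause-window bookkeeping described above, for
illustration only. DecodePauseState, apply_pause_hint, pause_duration_s and
wait_for_decode_pause are hypothetical stand-ins, not the replay.py code; the
real path may be async. Only pause_until_s, recommended_pause_ms and the 2s
default cap come from this message.

```python
import time
from dataclasses import dataclass


@dataclass
class DecodePauseState:
    """Hypothetical stand-in for the pause part of DecodeResidencyState."""
    pause_until_s: float = 0.0


def apply_pause_hint(state, recommended_pause_ms, now_s):
    """Extend the per-D pause window from an admission-probe hint."""
    if recommended_pause_ms > 0:
        candidate = now_s + recommended_pause_ms / 1000.0
        # Never shorten a window that an earlier probe already set.
        state.pause_until_s = max(state.pause_until_s, candidate)


def pause_duration_s(state, now_s, max_pause_s=2.0):
    """Remaining pause, clamped to [0, max_pause_s]."""
    remaining = state.pause_until_s - now_s
    return min(max(remaining, 0.0), max_pause_s)


def wait_for_decode_pause(state, max_pause_s=2.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Stall the caller before it hits a saturated D; returns the sleep length."""
    delay = pause_duration_s(state, clock(), max_pause_s)
    if delay > 0:
        sleep(delay)
    return delay
```

The clock and sleep parameters are injectable so the cap can be checked
without actually sleeping.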

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:29:46 +08:00
kzlin
ca4b64c79a feat(sglang): expose backpressure pause hint in admit_direct_append
Add `recommended_pause_ms` field to DirectAppendAdmissionReqOutput so D
can advise callers when its transfer queue is heavy or KV pool is near
capacity. The hint is computed from transfer_queue_depth,
retracted_queue_depth, and post-trim token_usage; thresholds are simple
heuristics (>0.90 usage, >=8 queue depth, retracted>0).

Default behavior is unchanged for callers that ignore the field.
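
A rough sketch of how such a hint could be derived, assuming each tripped
threshold adds a fixed pause. The three thresholds are the ones listed above;
the function name, the additive rule and base_pause_ms=100 are illustrative
assumptions, not the SGLang patch.

```python
def recommended_pause_ms(token_usage, transfer_queue_depth, retracted_queue_depth,
                         usage_threshold=0.90, queue_threshold=8,
                         base_pause_ms=100):
    """Return a pause hint in milliseconds; 0 means no backpressure."""
    tripped = 0
    if token_usage > usage_threshold:            # KV pool near capacity
        tripped += 1
    if transfer_queue_depth >= queue_threshold:  # transfer queue is heavy
        tripped += 1
    if retracted_queue_depth > 0:                # requests were retracted
        tripped += 1
    # Assumption: the hint scales linearly with the number of tripped signals.
    return tripped * base_pause_ms
```

A caller that ignores the returned value behaves exactly as before, which is
the compatibility property the message describes.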

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:29:30 +08:00
kzlin
4978c0d0cd profile(kvc): rewrite v5+profile report after critic audit + P0/P1 instrument
Hostile audit of the original report flagged three load-bearing errors:

1. held_tokens semantics were inverted. session_held_tokens() at
   session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len)
   per slot, i.e. slot-private (NOT in radix tree). So "other = cap - held -
   avail" actually CONTAINS the radix-tree protected prefix cache (likely the
   single biggest component for shared agentic prefixes), not just running
   batch + in-flight as the original report claimed.

2. Admission-race causal hypothesis for the 415 EXP2+profile errors is
   contradicted by the data: 414/415 errors have kv_transfer_blocks > 0 — they
   passed admission and died downstream ("generate stream ended before
   producing any token", raised by the client when a 200 response had an empty
   stream).

3. The polling confound was dismissed too quickly. Mode counts shift ~1:1
   (session-cap-fb -356 / kvcache-centric +406), and /server_info is not a
   passive read — it dispatches into the scheduler main loop and iterates
   every session slot.

Plus: per-D error% was confounded by sticky session affinity (only 18 unique
sessions cause 415 errors, decode-3 had 0 errors only because no high-error
session landed there); decile 10 "recovery" was an equal-time binning
artifact (24.5% under equal-count); v5 vs v5+profile time gap was 21h not
6h; p50/p90 latency comparison is N=1.

Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction
with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4).

Action items split into P0 (verify, must do first) and P1 (instrument):

P0 — scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2
(no polling, identical config to the original v5 run) to test whether the
9-error baseline result is reproducible. If 3 runs give ~9 errors and
profile gives 415, polling is the leading suspect. Currently running
in background.

P1 — scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only
"pool_breakdown" dict to /server_info covering: radix_evictable_tokens,
radix_protected_tokens, slot_private_held_tokens, session_slot_count,
running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens},
prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these,
"unaccounted = cap - sum(known)" exposes true leakage. replay.py captures
all fields into the per-tick row; analyzer prints the decomposition and
gracefully handles old timeseries (prints "P1 instrument absent").
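
The decomposition can be sketched as below. This is an illustration, not the
analyzer: it assumes the listed token fields are disjoint, and it takes free
tokens as a separate available_tokens input, which is a hypothetical
parameter. Field names are expanded from the list above.

```python
KNOWN_TOKEN_FIELDS = (
    "radix_evictable_tokens",
    "radix_protected_tokens",
    "slot_private_held_tokens",
    "running_batch_kv_tokens",
    "transfer_queue_tokens",
    "prealloc_queue_tokens",
    "retracted_queue_tokens",
)


def unaccounted_tokens(capacity_tokens, row, available_tokens=0):
    """Return cap - sum(known), or None when the P1 instrument is absent."""
    breakdown = row.get("pool_breakdown")
    if breakdown is None:
        # Old timeseries rows carry no pool_breakdown dict.
        return None
    known = available_tokens + sum(
        breakdown.get(field, 0) for field in KNOWN_TOKEN_FIELDS
    )
    return capacity_tokens - known
```

A persistently large positive result would point at capacity that none of
the known buckets explains.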

Mock-tested end-to-end. SGLang patch is read-only and does not affect
admission/scheduling. Old v5+profile data still analyzes correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:29:21 +08:00
kzlin
51f5386691 profile(kvc): add D KV pool timeseries poller + analyzer for v6 root-cause
v5 dropped errors but pushed session-cap fallback to 46-51%. Before adding
v6 mitigations we need to attribute that capacity loss to one of:
  (a) active sessions — real footprint
  (b) idle-evictable sessions — LRU not aggressive enough
  (c) prefill backup blocks / in-flight / fragmentation — release timing

Without this it's all guessing. Plumb a 1Hz poller into replay that hits
each P/D worker's /server_info, captures session_cache + memory_usage, and
writes a per-worker time-series JSONL to <run_dir>/d-pool-timeseries.jsonl.
Off by default (--pool-poll-interval-s 0); v5+profile sweep enables it at
1.0s. Per-tick HTTP cost is ~8 parallel /server_info calls — negligible
relative to the 50min run.

Analyzer (scripts/analysis/analyze_pool_timeseries.py) decomposes each D's
capacity into active_held / idle_evictable / other (= cap-held-avail, the
backup-blocks bucket) / free, and reports session residency churn across
workers as a starvation/thrashing signal.

Mock-tested poller end-to-end (cancellation clean, file flushed, sessions
captured); analyzer validated against synthetic timeseries.

Next: run scripts/sweep_tp1_v5_optD_profile.sh on hardware (~90min), then
analyze results to pick a v6 direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:04:21 +08:00
kzlin
6572d7f3f4 docs: add v5 chapter (Option D worker-mode admission) and rename to V1_TO_V5
v5 sweep (sweep_tp1_v5_optD.sh) lands the previously-deferred Option D:
worker admission_mode authoritative for direct_append + seed + reseed,
bypassing replay's local _decode_session_soft_cap.

Key findings now documented:
- errors collapse from 9-10% to 0.2% (mooncake timeouts gone)
- session-cap fallback rises 33-35% -> 46-51% — D's true KV pool is the
  binding constraint, not replay's estimator; v4's "low fallback" was
  hiding capacity overruns as transfer-timeout errors
- direct-to-D subset latency unchanged from v4 (admission overhead negligible)
- new bottleneck: D's physical KV pool — points v6 at prefill backup release
  timing, priority eviction tuning, chunked seed, cross-D session migration,
  and real RDMA

Also adds a 5th lesson on errors-vs-fallback reciprocity and updates the
code index with the v5 endpoint extension and new CLI knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:13:25 +08:00
kzlin
6e5ed8da80 feat(kvc): Option D - delegate seed/reseed admission to D worker
v4 (cap=16) saw 35% session-cap fallback because the local soft_cap
min(16, usable / target) evaluates to 1-2 for large agentic inputs.
The cap was hit not because D was full but because replay's heuristic
underestimated capacity.
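
To make the arithmetic concrete, a small sketch of the cap formula quoted
above. The function name, the integer division and the token counts in the
usage note are illustrative assumptions; only min(16, usable / target) comes
from this message.

```python
def decode_session_soft_cap(usable_tokens, target_tokens, hard_cap=16):
    """Sessions a D worker may hold under the local heuristic."""
    if target_tokens <= 0:
        return hard_cap
    # Capacity-based cap: how many target-sized sessions fit in usable KV.
    return min(hard_cap, usable_tokens // target_tokens)
```

With a hypothetical 120,000 usable tokens, a 4,000-token target gives
min(16, 30) = 16, while a 60,000-token agentic input gives min(16, 2) = 2, so
the hard cap of 16 is never the binding term for large inputs.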

This change makes worker admission_mode authoritative for ALL paths:

SGLang side:
- io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field
  ("direct_append" | "seed", default "direct_append" preserves prior
  behavior).
- scheduler.py:admit_direct_append: when mode == "seed", skip the
  resident-on-D requirement and run the same capacity check + LRU
  eviction (maybe_trim_decode_session_cache) that direct_append uses.
  This lets D atomically decide if a new session can be admitted based
  on actual token_to_kv_pool_allocator state.

Replay side (replay.py):
- _query_decode_direct_admission gains a `mode` parameter.
- _reserve_decode_session_capacity: in worker admission_mode, the
  seed/reseed branch now queries D with mode="seed" and trusts the
  result, instead of estimating capacity from the residency snapshot.
- _should_admit_new_decode_session: in worker mode, skip the local
  soft_cap pre-check and let D decide. Same-D session fast-path is
  preserved.

Effects:
- Local hardcoded cap of 16 is bypassed under worker mode; D's real
  KV pool size is the only constraint.
- LRU eviction runs in D's process atomically with admission, so
  starvation (the v3 bimodal "lucky vs starved sessions" pattern)
  should resolve.

scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D
configs as v4 with the new admission path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:40:03 +08:00
kzlin
74194e660a docs: v4 final results, error analysis, and updated journey
Add v4 sweep results and post-mortem analysis showing:

- direct-to-D path: 54.3% (1P7D) / 58.0% (2P6D) of requests now use
  KVC cleanly. P50=0.5s and TTFT P50=0.043s; this path beats baseline
  8DP across the board (P50 -24%, TTFT P50 -54%, TTFT P90 -79%).

- Overall vs baseline (errors+truncated excluded):
  v4 2P6D P50=0.85s vs baseline 0.66s (28% slower).
  Reason is not errors -- 35% of requests still hit
  fallback-large-append-session-cap, where capacity-based
  cap = usable_tokens / target_tokens evaluates to 1-2 (not 16)
  for large agentic inputs.

- 9-10% errors on KVC variants are mooncake TCP transfer timeouts,
  not SGLang logic bugs. Prefill log shows
  "Failed to send kv chunk ... 32s timeout ... session not alive".
  Errors concentrate in turns >= 31 (large inputs), after the run is more
  than 44.8% complete.

Track:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: append v4 results table,
  per-mode breakdown, and error root cause.
- scripts/analysis/{analyze_v3,analyze_v4,analyze_errors,compare_no_error}.py
- outputs/qwen3-30b-tp1-v{3,4}*/exp*_summary.json (force-added,
  small JSON; metrics.jsonl excluded due to size).
- outputs/qwen3-30b-tp1-v{3,4}*/sweep_results.txt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:34:01 +08:00
kzlin
c9d350b372 docs: KVC v1-v4 debug journey + raise session soft_cap to 16
Document the iterative debugging from v1 (broken KVC) through v4
(routing fixed + session cap raised), with code-level analysis of
the two main bugs encountered:

1. v2 root cause (mis-diagnosed previously as `allow_local_prefill`):
   `--policy default` for KVC mechanism caused replay's round-robin
   policy and the PD router's round-robin to diverge, sending requests
   with `session_params` to a D worker that did not have the session
   open. Resulted in 56-61% truncation with finish_reason
   "session id X does not exist".
   Fix: use `--policy kv-aware` (sweep_tp1_v3_kvaware.sh) so replay
   emits `x-smg-target-worker` and PD router uses consistent_hashing.

2. v3 new bottleneck: `pd-router-fallback-large-append-session-cap`
   dominated 52-65% of requests. Root cause was hardcoded
   `min(4, ...)` in `_decode_session_soft_cap`. With 7 D workers x 4
   sessions = 28 slots for 52 trace sessions, ~24 sessions starved
   permanently (bimodal direct-to-D rate of 0% or 99%).
   Fix: raise the cap to 16 (replay.py).

Also includes the v3 finding that direct-to-d-session path P50=0.495s
and TTFT P50=0.043s already beats the 8-way DP baseline (0.65s/0.093s):
the KVC core mechanism works when fallback paths are avoided.

Files:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: full journey + code location index
- docs/SWEBENCH_EXPERIMENT_{PROGRESS,RESULTS}.md: prior session notes
- scripts/sweep_tp1_v{2,3,4}*.sh: experiment driver scripts
- src/agentic_pd_hybrid/replay.py: cap 4 -> 16, audit fields
- src/agentic_pd_hybrid/pd_router.py: strip session_params from prefill
- src/agentic_pd_hybrid/metrics.py: truncated_request_count

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 21:10:41 +08:00
e9062b1d6e Document PD baseline comparison 2026-04-25 17:29:27 +00:00
c928c7db23 Add transfer queue admission knobs 2026-04-25 17:29:15 +00:00
fe583fb413 Document kvcache-centric experiment progress 2026-04-25 16:01:31 +00:00
13bb31a446 Add kvcache-centric profiling and admission controls 2026-04-25 16:00:52 +00:00
08b13d22bc docs: rewrite project docs in concise chinese 2026-04-24 12:41:52 +00:00
5bdc0ed4f0 docs: document sglang maintenance workflow 2026-04-24 12:31:32 +00:00
b8e6f13c20 feat(sglang): support decode session cache admission 2026-04-24 12:30:41 +00:00
bded08301f chore: vendor sglang v0.5.10 snapshot 2026-04-24 12:29:36 +00:00
78f0d15221 docs: document project design and status 2026-04-24 12:17:55 +00:00
4bca741f32 feat: add agentic pd hybrid benchmark prototype 2026-04-24 12:17:46 +00:00
d2fe014db7 chore: initialize repo hygiene 2026-04-24 12:17:40 +00:00