19 Commits

Author SHA1 Message Date
kzlin
c01d6101d6 docs(kvc): freeze reseed slow-path audit + three reviewer challenges
Standalone reference document capturing the v2 reseed slow-path forensic
audit before opening the feat/d-to-p-sync branch. Designed to be quoted
directly by future paper drafts and to prevent the team from re-relitigating
the same questions verbally.

Contents:

§1. The three team-member challenges that disproved "capacity-backup will
    save the slow path" (each with code citation and verdict):
    1) P pool can't fit all backups -- replay.py:1618-1620 caps backup
       count at 1 for sessions with ~50K peak input.
    2) P's backup is a stale snapshot -- 49K of direct-to-D append work
       never flows through P. _commit_prefill_backup_residency
       (replay.py:1483) is only called from seed/reseed paths;
       direct-to-D path (replay.py:2719) never touches P-side state.
    3) When D evicts, old KV is freed directly (no D->P dump).
       session_aware_cache.release_session only calls
       kv_pool_allocator.free().

§2. End-to-end reseed timeline (t=0 to t=4550ms) with code citations
    showing exactly where each component sits. P-side re-prefill =
    1.5-3s, mooncake transfer = 1.5-4s, both contributing 50/50 to
    total reseed cost.

§3. Table of "looks like D->P but isn't" code locations -- every
    candidate found during forensic search ruled out with line citations.

§4. Specification of what D->P incremental sync would require:
    mooncake bidirectional roles (~400 LOC), D-side append commit hook
    (easy), P-side radix tree multi-producer extension (the real blocker),
    agentic-pd-hybrid replay.py hooks. Estimated 1-2 weeks engineering.

§5. Confirmation via `git ls-remote origin --refs` that author has NOT
    secretly implemented D->P on another branch -- only main + this
    working branch exist on the server.

§6. Roadmap for the upcoming feat/d-to-p-sync branch.

Appendices: code position crosswalk, related commits, paper section
suggestions.

This document is referenced by V2_DEEP_ANALYSIS_ZH §4.2 and by
KVC_ROUTER_ALGORITHM §9 Open Question 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:20:34 +08:00
kzlin
9ccd853066 docs(kvc): correct reseed cost decomposition + flag D->P sync gap
After an independent Opus-agent forensic audit, the previous "(c) 增量
fetch (工程量较大,未实现)" line in V2_DEEP_ANALYSIS §4.2 was understating
the gap. The audit confirmed:

- No D->P KV transfer code exists in the framework at any layer
  (agentic_pd_hybrid orchestration, vendored SGLang disaggregation,
  or mooncake transport).
- Mooncake MooncakeKVManager has a hard role split: PREFILL = sender,
  DECODE = receiver-only loop. `add_transfer_request` asserts the
  disaggregation_mode is PREFILL.
- The BaseKVSender / BaseKVReceiver abstraction has no bidirectional slot.
- session_aware_cache.release_session only calls kv_pool_allocator.free()
  on eviction -- no serialization, no outbound network call.
- _commit_prefill_backup_residency is only called from the seed/reseed
  path (_invoke_kvcache_seeded_router). direct-to-D path never updates
  P-side backup state.
- "capacity-backup" policy semantics: it only skips the close on P after
  reseed -- the backup is the seed-time static snapshot, never refreshed
  by D-side append-prefill activity.

V2_DEEP_ANALYSIS §4.2:
- Decomposed the 3-7s reseed cost into the P-side re-prefill segment
  (1.5-3s, dominant) and the P->D mooncake transfer segment (1.5-4s).
- Quantified the realistic effect of enabling RDMA: only the transfer
  segment shrinks, reseed reduces to 1.7-3.2s, TTFT p99 ~0.7s, still
  loses to DP's 0.43s.
- Replaced the throwaway "(c) incremental fetch" line with a full
  paragraph explaining what D->P sync would require, why it's the
  largest engineering gap, and that the blocker is SGLang's radix-tree
  single-producer assumption, not the network layer.

KVC_ROUTER_ALGORITHM §9:
- Refined Open Question 3 (RDMA) to clarify it only helps the transfer
  segment, not the re-prefill segment.
- Added Open Question 4: D->P incremental KV sync as the central
  future-work contribution gap, with cited evidence for why it doesn't
  currently exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:07:14 +08:00
kzlin
517677d7f2 docs(kvc): add GPU-utilization and cache-efficiency figures (rebut critic)
Two figures inserted into V2_DEEP_ANALYSIS §4.5 and §4.4 respectively, to
visually rebut the two critic-agent claims that we argued in prose were
design intent, not deficiencies.

(1) gpu_utilization.png  -- §4.5  "P GPU is wasted 90% of the time"
  Two-panel side-by-side:
    Left  (request count view, the naive reading): KVC P = 328 reqs (7.4%),
          KVC D = ~1450 each, DP = ~1100 each. P "looks idle."
    Right (compute work view, the honest reading): KVC P does 1.07M tokens
          of prefill, comparable to each KVC D worker's ~0.80M. P is a
          low-frequency high-cost safety net, not idle capacity.
  Bonus finding: KVC's total compute (3.47M tokens across 4 GPUs) is 33%
  LESS than DP's (5.17M). Same GPUs, less work done. That's the affinity
  win.

(2) cache_efficiency.png  -- §4.4  "Cache concentration is not policy win"
  Two-panel side-by-side. The setup: KVC has 27% LESS total KV pool
  (276K vs 351K tokens) yet caches MORE per request.
    Left  (cache hit rate vs turn number): KVC's session-affinity lets
          hit rate accumulate with turns; DP's hash + radix-LRU causes
          a mid-turn drift around turns 8-25 where KVC = 97.0% vs DP
          = 95.8% (1.24pp gap). Shows mechanism, not just outcome.
    Right (ECDF of per-request uncached tokens, log x): KVC's distribution
          concentrates near zero (50% < 187 tokens), DP's is spread
          (50% < 781 tokens). At uncached = 500 tokens threshold, KVC
          has 74% of requests below, DP has 31%.
  → smaller pool, better retention, less per-request work. Direct empirical
  rebuttal to "fragmentation is architectural, not policy."

Bundled scripts (rerunable):
- scripts/analysis/plot_gpu_utilization.py
- scripts/analysis/plot_cache_efficiency.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:04:49 +08:00
kzlin
c5519066de docs(kvc): add TTFT probability density figure (KVC v2 vs 4DP)
Adds a two-panel TTFT PDF comparison plot inserted as a new V2_DEEP_ANALYSIS
§3.4 ("TTFT 概率密度对比: bimodal vs unimodal"). Single-percentile numbers
(p50 / p99) hide the qualitative difference between the two distributions;
the figure makes it visible at a glance.

Left panel (linear x in [0, 0.6]s, body):
  KVC has a sharp peak at ~40ms (the direct-to-D fast path).
  DP has a broad peak around 50-200ms (full prefill per request).
  Annotated with p50 and p90 markers for each side.

Right panel (log x in [10ms, 10s], full range):
  KVC is visibly bimodal: a tall fast-path peak plus a small reseed tail
  around 1-5s.
  DP is unimodal: a single broad peak with shorter tail.
  Annotated with p99 callouts pointing to each tail.

KDE: scipy.stats.gaussian_kde, bandwidth=0.15 for the body (Scott's rule
oversmooths the sharp fast-path peak), log10-transformed for the full-range
panel so the bimodal structure is visible.

Bundled:
- scripts/analysis/plot_ttft_pdf.py -- rerunable when v2 / DP data change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:46:27 +08:00
kzlin
b5af19583b docs(kvc): replace v2 path breakdown tables with generated figures
V2_DEEP_ANALYSIS §3.1 (execution_mode distribution) and §3.2 (path-level
latency vs DP) had hand-typed tables with approximate latencies (e.g.
"~1.0s") and required readers to mentally compare 5+ rows × 5 columns.
Both sections now reference generated PNG figures derived directly from
the v2 + DP metrics.jsonl files.

§3.1 figure (v2_execution_mode_distribution.png):
  Horizontal bar chart, log x-axis. 4076 direct-to-D fast-path requests
  (green) dwarf the rest by ~30x; the long tail of slow / fallback /
  failure modes is visible at one glance. Counts and percentages
  annotated on each bar.

§3.2 figure (v2_path_level_latency.png):
  Grouped bar chart, log y-axis. Per-path TTFT p50 / TTFT p99 / Lat p50
  with exact numeric labels (no more "~1.0s" approximations). Sample
  counts annotated below each path. Quick visual reads:
   - KVC fast path TTFT p50 41ms vs DP 92ms (2.2x faster)
   - KVC reseed TTFT p99 5.12s vs DP 0.43s (12x slower) -- the cost
   - KVC no-d-capacity TTFT p99 7.65s (worst case)

Bundled:
- scripts/analysis/plot_v2_path_breakdown.py -- the script that
  generates both figures; rerunable when v2 data changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:38:43 +08:00
kzlin
37e9caa431 docs(kvc): production-decision reframe + formal router algorithm spec
After the critic-agent audit, V2_DEEP_ANALYSIS had drifted into an
audit-grade "5 wins / 1 loss / 3 draws" framing that mistook KVC's
deliberate design motifs (cache concentration via session affinity;
prefill-GPU idle as TTFT-stability trade-off) for "comparison
unfairness." This commit corrects the framing back to a production-
decision lens and adds a paper-track formal specification of the
router algorithm.

V2_DEEP_ANALYSIS_ZH.md changes:
- §0 TL;DR: lead with "online coding agent serving should pick
  KVC 1P3D"; the only real cost is TTFT p99 long-tail (3x DP) from
  the 8.3% mooncake reseed path, mitigable with real RDMA.
- §4 restructured into three buckets:
    real costs (TTFT p99 tail, abort accounting now fixed),
    counter-arguments to the critic (cache concentration and idle
      prefill GPU are design intent, not deficits),
    methodology to-do (naive-1P3D control, v2 N>=2 determinism).
- §6 replaces "5/1/3 rescoring" with production decision rationale:
  KVC wins on 6 latency/TTFT metrics + lower failure rate; pays
  TTFT p99 tail; lists workloads where DP would reverse the call.
- §8 decision points: D1 recommends Yes (accept v2 as milestone);
  D8 added: paper motif "KVC trades P idle for TTFT stability."

KVC_ROUTER_ALGORITHM.md (new, paper-track, Chinese narrative + English
algorithm boxes / variable names / theorems for direct paper reuse):
- Problem formulation, system model, full notation
- Algorithm 1 Route: lexicographic-tuple scoring on
    (overlap+alpha*sticky, sticky, -inflight, -assigned)
- Algorithm 2 Admit: D-worker autonomous admission deciding
    Direct / Seed / Reseed / reject (with reason)
- Algorithm 3 Dispatch: end-to-end orchestration with reset-on-success
    (the v2-specific fix that eliminates v1's self-amplifying thrashing)
- Theorem 1 (no permanent starvation) and Theorem 2 (fast-path
    determinism), each with a proof sketch
- Comparison table vs vanilla pd-disagg / DP cache-aware
- Anti-patterns ("what KVC explicitly is NOT")
- Open questions for reviewers
- Suggested paper citation phrasing
- Appendix A: algorithm-step to source-file:line crosswalk

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:29:18 +08:00
kzlin
5eac9b4f6b fix(metrics): exclude aborted requests from latency/ttft/tpot stats
The old filter `if row.latency_s is not None` accepted SGLang's fast
input-length-aborts (latency_s ~ 0.08s, finish_reason='abort/BadRequest')
as if they were successful zero-cost requests. This deflated mean/p50
of any run where the model rejected oversized inputs.

Impact on existing comparisons (ts=1 4-run validation + v2):
  KVC v2 has 40 aborts + 5 ReadTimeouts (was reported as just 5);
  DP 4w  has 67 aborts (was reported as 5).
Both runs have abort behavior; the asymmetry (40 vs 67) is purely from
SGLang's mem-fraction-derived max-input-len: KVC decode-only worker gets
~10 GB free GPU mem -> max-input=92098, DP fused worker gets ~9 GB ->
max-input=87811, because DP also needs chunked-prefill workspace.

The KVC-vs-DP latency-win direction holds and widens slightly under the
fixed filter (lat mean delta: -0.8% -> -1.4%); see V2_DEEP_ANALYSIS_ZH
§4.3 for the recomputed table.

Changes:
- metrics.py: new _is_failed_request(row) helper; latency/ttft/tpot
  stats now exclude both errors and aborts. New summary fields
  abort_count and failure_count expose the counts directly.
- scripts/analysis/recompute_summary.py: re-derives summary.json from
  existing metrics.jsonl using the fixed code, with optional --diff
  against the old buggy summary for inspection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:29:18 +08:00
kzlin
0c25168cad docs(kvc): v2 deep analysis vs TEAM_REPORT baseline
Post-v2 audit consolidating ts=1 validation + v1 thrashing + v2 win, plus
critic-agent adversarial review of the v2 vs 4DP comparison.

Headline outcomes:
- TEAM_REPORT §1 (session pin starvation) fully fixed by v2 migration +
  reset-on-success; direct-to-D 42.8% -> 91.6%.
- TEAM_REPORT §2/§3/§5 (LRU, backpressure, admission RPC) are absorbed by
  ts=1 natural drain time, not mechanism-fixed -- will resurface under
  ts=10/longer traces/higher concurrency.
- TEAM_REPORT §6 (ts=10 distortion) confirmed and locked as precondition;
  TEAM_REPORT §8 (N=1 unreliable) rewritten to "high-pressure N>=3, normal N=1".

Three new problems exposed by adversarial review:
- TTFT p99: KVC 1.285s vs DP 0.427s (KVC 3.0x worse) -- cherry-picked out of
  the V2_RESULTS_ZH.md headline table. Root cause: 8.3% non-direct path pays
  3-7s mooncake reseed cost on 50-90K-token KV transfer.
- Error accounting asymmetry: DP has 67 fast-aborts (not 5) at ~0.08s each
  counted in latency stats; KVC's 5 ReadTimeouts excluded entirely. Root
  cause: --max-input-len 87811 (DP) vs 92098 (KVC) + metrics.py:124 filter.
- Topology mismatch: KVC 1P3D's prefill GPU is idle 91.7% of the time
  (only ~373/4449 requests use seed/P path); 4DP CA has all 4 GPUs at full
  utilization. Plus: no naive 1P3D control exists in the repo -- cannot
  isolate KVC-layer contribution from 1P3D-topology contribution.

Re-scored headline: 5 KVC wins / 1 DP win / 3 draws -- still net positive
but not the "7/8 wins" framing the V2_RESULTS_ZH.md claims.

Recommended follow-ups (ROI order):
1. naive 1P3D ts=1 N=1 control (critic's only CRITICAL finding)
2. v2 N=2/N=3 to verify ts=1 determinism with new code paths
3. symmetric error accounting recompute + DP max-input-len = 92098 rerun

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 11:17:00 +08:00
kzlin
2ec0debef4 feat(kvc): session migration with reset-on-success + direct-append threshold tuning
KVC v2 beats 4DP at ts=1 same-scale on 7/8 metrics:
  TTFT mean -24%, p50 -54%, p90 -64%; lat mean -0.8%, p50 -12.6%, p90 -0.7%.
  Direct-to-D rate jumped 42.8% -> 91.7%. REFACTOR_PLAN_V1 scenario C achieved.

Two-knob fix:
- reset-on-success blacklist decay: clear (sess, D) reject counter on
  successful direct-to-D path. Eliminates v1 thrashing where session 6880
  was stable on decode-1 for 70 turns then collapsed to 75 D-changes after
  cumulative transient pressure tripped the permanent blacklist.
- bump --kvcache-direct-max-uncached-tokens default 2048 -> 8192 via CLI flag.
  41% of v1 fallbacks were 'real-large-append' (>2048 token append); raising
  the threshold lets these go through the direct-to-D fast path.

Code:
- policies.py: RoutingState.session_d_rejects counter + KvAwarePolicy
  migration_reject_threshold; degenerate fallback picks least-rejected D.
- replay.py: record_admission_reject + reset-on-success in _run_request;
  _fallthrough_reason classifies turn-2+ fall-throughs as session-not-resident
  / real-large-append / etc, replacing misleading 'large-append' suffix
  (TEAM_REPORT §2.7).
- cli.py + benchmark.py: --kvcache-migration-reject-threshold flag wiring.

Docs:
- REFACTOR_PLAN_V1_ZH.md: forward-looking plan after ts=1 validation.
- MIGRATION_V1_FINDINGS_ZH.md: v1 thrashing root-cause analysis.
- V2_RESULTS_ZH.md: v2 results, scenario C achievement, attribution.
- TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md: comprehensive team report.

Scripts:
- sweep_ts1_kvc_n3_plus_dp.sh: ts=1 baseline (KVC 1P3D N=3 + 4DP CA).
- sweep_ts1_migration_v1.sh / v2.sh: validation runs.
- analyze_ts1_validation.py: 4-way comparison analyzer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:18:13 +08:00
kzlin
1d51704dad docs(kvc): agentic-fit analysis, refactor plan, validation report
Three new docs covering the structural-fit investigation:

- AGENTIC_FIT_ANALYSIS_ZH.md: §1-§7 of structural design issues that
  surface KVC vs vanilla DP gap on real agentic workloads (SWE 50sess).
  Quantifies session pinning, LRU shortfall, P-side imbalance,
  time-scale distortion, etc., with code citations and N=3 rerun data.

- REFACTOR_PLAN_ZH.md: KISS-edition refactor plan. After verifying the
  original "estimate inflation" and "resident_blocks aging" claims were
  not real bugs, scope shrinks to one code change (backpressure) plus a
  4-run smoke sweep within an 8h budget.

- STRUCTURAL_VALIDATION_REPORT_ZH.md: validates §1-§7 claims using
  existing v5 baseline rerun data + 8DP CA baseline. Each claim labeled
  fully-supported / indirect / retracted with the data source. Notes
  that backpressure E2E validation is pending GPU smoke run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:30:11 +08:00
kzlin
7affb565b2 feat(kvc): add backpressure smoke sweep + analyzer (and v6 p1 profile script)
scripts/sweep_backpressure_smoke.sh: 4-run smoke matrix (KVC baseline /
KVC + backpressure / KVC + backpressure @ time-scale=1 / DP @
time-scale=1) designed to fit ~3-4h GPU budget. Validates §3 backpressure
implementation and partially probes §7 time-scale distortion.

scripts/analysis/analyze_backpressure_smoke.py: consumes the new
structural/* jsonl files plus request-metrics; emits headline metrics,
backpressure histograms, admission probe stats, and per-session pinning
distribution.

scripts/sweep_tp1_v6_p1_profile.sh: pre-existing v6 P1 profile sweep
script (was untracked; included for completeness).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:29:56 +08:00
kzlin
c47adaf8e3 feat(kvc): honor admission backpressure hints + structural event logging
Replay-side changes paired with the SGLang admission hint:

- DecodeResidencyState gains pause_until_s; admission probe parses
  recommended_pause_ms and updates the per-D pause window.
- _wait_for_decode_pause is invoked at request entry points
  (_invoke_router, _invoke_session_direct) so requests stall before
  hitting a saturated D instead of timing out via mooncake.
- New CLI flags: --enable-backpressure (default off, baseline preserved),
  --backpressure-max-pause-s (cap on per-request sleep, default 2s).

Structural instrumentation written under <run_dir>/structural/:
- admission-events.jsonl: every admission probe (RTT, queue_depth,
  pause_ms, available_tokens, evicted_count)
- backpressure-events.jsonl: every actual pause sleep
- session-d-binding.jsonl: per-request policy decision

Used to validate the structural claims documented separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:29:46 +08:00
kzlin
ca4b64c79a feat(sglang): expose backpressure pause hint in admit_direct_append
Add `recommended_pause_ms` field to DirectAppendAdmissionReqOutput so D
can advise callers when its transfer queue is heavy or KV pool is near
capacity. The hint is computed from transfer_queue_depth,
retracted_queue_depth, and post-trim token_usage; thresholds are simple
heuristics (>0.90 usage, >=8 queue depth, retracted>0).

Default behavior is unchanged for callers that ignore the field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:29:30 +08:00
kzlin
4978c0d0cd profile(kvc): rewrite v5+profile report after critic audit + P0/P1 instrument
Hostile audit of the original report flagged three load-bearing errors:

1. held_tokens semantic was inverted. session_held_tokens() at
   session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len)
   per slot, i.e. slot-private (NOT in radix tree). So "other = cap - held -
   avail" actually CONTAINS the radix-tree protected prefix cache (likely the
   single biggest component for shared agentic prefixes), not just running
   batch + in-flight as the original report claimed.

2. Admission-race causal hypothesis for the 415 EXP2+profile errors is
   contradicted by the data: 414/415 errors have kv_transfer_blocks > 0 — they
   passed admission and died downstream ("generate stream ended before
   producing any token", raised by the client when a 200 response had an empty
   stream).

3. Polling deconfound was too quickly dismissed. Mode counts shift ~1:1
   (session-cap-fb -356 / kvcache-centric +406), and /server_info is not a
   passive read — it dispatches into the scheduler main loop and iterates
   every session slot.

Plus: per-D error% confounded by sticky session affinity (only 18 unique
sessions cause 415 errors, decode-3 had 0 errors only because no high-error
session landed there); decile 10 "recovery" was an equal-time binning
artifact (24.5% under equal-count); v5 vs v5+profile time gap was 21h not
6h; p50/p90 latency comparison is N=1.

Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction
with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4).

Action items split into P0 (verify, must do first) and P1 (instrument):

P0 — scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2
(no polling, identical config to the original v5 run) to test whether the
9-error baseline result is reproducible. If 3 runs give ~9 errors and
profile gives 415, polling is the leading suspect. Currently running
in background.

P1 — scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only
"pool_breakdown" dict to /server_info covering: radix_evictable_tokens,
radix_protected_tokens, slot_private_held_tokens, session_slot_count,
running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens},
prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these,
"unaccounted = cap - sum(known)" exposes true leakage. replay.py captures
all fields into the per-tick row; analyzer prints the decomposition and
gracefully handles old timeseries (prints "P1 instrument absent").

Mock-tested end-to-end. SGLang patch is read-only and does not affect
admission/scheduling. Old v5+profile data still analyzes correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:29:21 +08:00
kzlin
51f5386691 profile(kvc): add D KV pool timeseries poller + analyzer for v6 root-cause
v5 dropped errors but pushed session-cap fallback to 46-51%. Before adding
v6 mitigations we need to attribute that capacity loss to one of:
  (a) active sessions — real footprint
  (b) idle-evictable sessions — LRU not aggressive enough
  (c) prefill backup blocks / in-flight / fragmentation — release timing

Without this it's all guessing. Plumb a 1Hz poller into replay that hits
each P/D worker's /server_info, captures session_cache + memory_usage, and
writes a per-worker time-series JSONL to <run_dir>/d-pool-timeseries.jsonl.
Off by default (--pool-poll-interval-s 0); v5+profile sweep enables it at
1.0s. Per-tick HTTP cost is ~8 parallel /server_info calls — negligible
relative to the 50min run.

Analyzer (scripts/analysis/analyze_pool_timeseries.py) decomposes each D's
capacity into active_held / idle_evictable / other (= cap-held-avail, the
backup-blocks bucket) / free, and reports session residency churn across
workers as a starvation/thrashing signal.

Mock-tested poller end-to-end (cancellation clean, file flushed, sessions
captured); analyzer validated against synthetic timeseries.

Next: run scripts/sweep_tp1_v5_optD_profile.sh on hardware (~90min), then
analyze results to pick a v6 direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:04:21 +08:00
kzlin
6572d7f3f4 docs: add v5 chapter (Option D worker-mode admission) and rename to V1_TO_V5
v5 sweep (sweep_tp1_v5_optD.sh) lands the previously-deferred Option D:
worker admission_mode authoritative for direct_append + seed + reseed,
bypassing replay's local _decode_session_soft_cap.

Key findings now documented:
- errors collapse from 9-10% to 0.2% (mooncake timeouts gone)
- session-cap fallback rises 33-35% -> 46-51% — D's true KV pool is the
  binding constraint, not replay's estimator; v4's "low fallback" was
  hiding capacity overruns as transfer-timeout errors
- direct-to-D subset latency unchanged from v4 (admission overhead negligible)
- new bottleneck: D's physical KV pool — points v6 at prefill backup release
  timing, priority eviction tuning, chunked seed, cross-D session migration,
  and real RDMA

Also adds a 5th lesson on errors-vs-fallback reciprocity and updates the
code index with the v5 endpoint extension and new CLI knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:13:25 +08:00
kzlin
6e5ed8da80 feat(kvc): Option D - delegate seed/reseed admission to D worker
v4 (cap=16) saw 35% session-cap fallback because the local soft_cap
min(16, usable / target) evaluates to 1-2 for large agentic inputs.
The cap was hit not because D was full but because replay's heuristic
underestimated capacity.

This change makes worker admission_mode authoritative for ALL paths:

SGLang side:
- io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field
  ("direct_append" | "seed", default "direct_append" preserves prior
  behavior).
- scheduler.py:admit_direct_append: when mode == "seed", skip the
  resident-on-D requirement and run the same capacity check + LRU
  eviction (maybe_trim_decode_session_cache) that direct_append uses.
  This lets D atomically decide if a new session can be admitted based
  on actual token_to_kv_pool_allocator state.

Replay side (replay.py):
- _query_decode_direct_admission gains a `mode` parameter.
- _reserve_decode_session_capacity: in worker admission_mode, the
  seed/reseed branch now queries D with mode="seed" and trusts the
  result, instead of estimating capacity from the residency snapshot.
- _should_admit_new_decode_session: in worker mode, skip the local
  soft_cap pre-check and let D decide. Same-D session fast-path is
  preserved.

Effects:
- Local hardcoded cap of 16 is bypassed under worker mode; D's real
  KV pool size is the only constraint.
- LRU eviction runs in D's process atomically with admission, so
  starvation (the v3 bimodal "lucky vs starved sessions" pattern)
  should resolve.

scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D
configs as v4 with the new admission path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:40:03 +08:00
kzlin
74194e660a docs: v4 final results, error analysis, and updated journey
Add v4 sweep results and post-mortem analysis showing:

- direct-to-D path: 54.3% (1P7D) / 58.0% (2P6D) of requests now use
  KVC cleanly. P50=0.5s and TTFT P50=0.043s; this path beats baseline
  8DP across the board (P50 -24%, TTFT P50 -54%, TTFT P90 -79%).

- Overall vs baseline (errors+truncated excluded):
  v4 2P6D P50=0.85s vs baseline 0.66s (28% slower).
  Reason is not errors -- 35% of requests still hit
  fallback-large-append-session-cap, where capacity-based
  cap = usable_tokens / target_tokens evaluates to 1-2 (not 16)
  for large agentic inputs.

- 9-10% errors on KVC variants are mooncake TCP transfer timeouts,
  not SGLang logic bugs. Prefill log shows
  "Failed to send kv chunk ... 32s timeout ... session not alive".
  Errors concentrate in turn>=31 (large inputs) after run >44.8%.

Track:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: append v4 results table,
  per-mode breakdown, and error root cause.
- scripts/analysis/{analyze_v3,analyze_v4,analyze_errors,compare_no_error}.py
- outputs/qwen3-30b-tp1-v{3,4}*/exp*_summary.json (force-added,
  small JSON; metrics.jsonl excluded due to size).
- outputs/qwen3-30b-tp1-v{3,4}*/sweep_results.txt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:34:01 +08:00
kzlin
c9d350b372 docs: KVC v1-v4 debug journey + raise session soft_cap to 16
Document the iterative debugging from v1 (broken KVC) through v4
(routing fixed + session cap raised), with code-level analysis of
the two main bugs encountered:

1. v2 root cause (mis-diagnosed previously as `allow_local_prefill`):
   `--policy default` for KVC mechanism caused replay's round-robin
   policy and the PD router's round-robin to diverge, sending requests
   with `session_params` to a D worker that did not have the session
   open. Resulted in 56-61% truncation with finish_reason
   "session id X does not exist".
   Fix: use `--policy kv-aware` (sweep_tp1_v3_kvaware.sh) so replay
   emits `x-smg-target-worker` and PD router uses consistent_hashing.

2. v3 new bottleneck: `pd-router-fallback-large-append-session-cap`
   dominated 52-65% of requests. Root cause was hardcoded
   `min(4, ...)` in `_decode_session_soft_cap`. With 7 D workers x 4
   sessions = 28 slots for 52 trace sessions, ~24 sessions starved
   permanently (bimodal direct-to-D rate of 0% or 99%).
   Fix: raise the cap to 16 (replay.py).

Also includes the v3 finding that direct-to-d-session path P50=0.495s
and TTFT P50=0.043s already beats the 8-way DP baseline (0.65s/0.093s)
- the KVC core mechanism works when fallback paths are avoided.

Files:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: full journey + code location index
- docs/SWEBENCH_EXPERIMENT_{PROGRESS,RESULTS}.md: prior session notes
- scripts/sweep_tp1_v{2,3,4}*.sh: experiment driver scripts
- src/agentic_pd_hybrid/replay.py: cap 4 -> 16, audit fields
- src/agentic_pd_hybrid/pd_router.py: strip session_params from prefill
- src/agentic_pd_hybrid/metrics.py: truncated_request_count

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 21:10:41 +08:00
69 changed files with 10506 additions and 108 deletions

View File

@@ -0,0 +1,434 @@
# Agentic 场景下的结构性设计缺陷分析
**日期**2026-05-06
**对照数据**`outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/exp2_2p6d_run1_*`KVC kv-aware Option D2P6D4449 reqs / 52 sessions+ `outputs/qwen3-30b-tp1-exps/exp1_8way_dp_cache_aware_summary.json`(同 trace 8-way DP cache-aware baseline
**模型**Qwen3-30B-A3BTP1单机 8×H100 80GB。
**研究问题**:把 SWE trace 视为"真实 agentic"的代表KVC 机制相对 vanilla DP 系统性输在哪里——除了"D 容量 4.6× 过载"之外的结构性原因。
> 本文是对 `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` 与 `docs/V5_PROFILE_INVESTIGATION_ZH.md` 的补充:版本演进与瓶颈定位之外,从设计层看哪些假设和真实 agentic workload 不匹配。
---
## TL;DR
按重要性排序的结构性缺陷:
| # | 缺陷 | 数据 | 修复方向 | 工程量 |
|---|---|---|---|---|
| 1 | **KvAwarePolicy 不感知 D 容量session 永久 pin 到首次落点 D** | session 平均访问的不同 D 数 = **1.00**direct-to-D 命中率呈极端双峰15 session 0-20%、14 session 80-100% | score 函数加 capacity-aware 项;允许跨 D session 迁移 | 中 |
| 2 | **D 端 LRU 只能 evict idle sessionhot session 永远踢不掉** | D 跑全程仅 9-43 次 trim 事件 vs 80-150 次 transfer 错误token_usage 顶到 1.00 | 加 score-based eviction按访问频率/最近性多层) | 中 |
| 3 | **没有 D→Router→Replay 的 backpressure 通道** | concurrency 一路 32 不降D 失败时 replay 无感 | admission 响应加 `recommended_pause_ms`replay 端按它降并发 | 小 |
| 4 | **Admission HTTP round-trip 与 scheduler 主循环耦合** | v5+profile 仅加 1Hz polling 就让 errors 从 9 涨到 415 | 拆成 lock-free `/probe` + 进 scheduler 队列的 `/commit_evict` | 中 |
| 5 | **P-side round-robin 不感知 D 健康** | prefill-0 出 367 KVTransferErrorprefill-1 仅 4——但请求量近乎对半 | router 选 P 时考虑目标 D 健康度 | 中 |
| 6 | **Replay 端 session footprint 估算膨胀 30×** | `_estimate_session_resident_tokens = input + output`,把 turn-50 的 80K 上下文当成"需要全新 80K 空间" | 改成"增量 token"估算 | 小 |
| 7 | **time-scale=10 把测试条件人为推到失真区间** | inter-turn gap p50 从 2.5s 压到 0.25s——KVC 想利用的"自然 idle 窗口"被消除 | 跑一组 time-scale=1 baseline 验证 | 小(仅配置) |
**最重要的对照事实**:同 trace、同硬件、同模型下 8-way DP cache-aware无 PD 拆分、无 KVC、无 session 抽象):
| 指标 | 8-way DP CA | v5 KVC 2P6D |
|---|---|---|
| Errors | **0** | 372 (8.4%) |
| Latency mean | **1.43s** | 3.50s |
| Latency P50 | **0.65s** | 1.11s |
| Latency P99 | **8.37s** | 20.37s |
| TTFT mean | **0.12s** | 2.13s |
| TTFT P90 | **0.26s** | 6.47s |
| Per-worker 请求量分布 | 508619±10% | 561858±26% |
**naive DP 在每一项都赢,包括 latency mean 的 145% 优势**。这定义了 KVC 在该 workload 下"必须超过"的基线。
---
## 1. Session 永久 pin 到 D + 容量盲选(最核心问题)
### 1.1 现象
每个 session 在整次运行中只访问 **1.00 个不同 D worker**(见上文数据)。结合 direct-to-D 命中率分布:
```
direct-to-D 命中率分桶n=52 sessions
0-20%: 15 sessions ← 几乎每 turn 都失败回退到 P→D 全量传输
20-40%: 7
40-60%: 11
60-80%: 5
80-100%: 14 sessions ← 几乎每 turn 都走 direct-to-D 快路径
```
**几乎没有中间态**——这是典型的不公平资源分配信号。
被饿死与被照顾的 session 在工作量上差异明显:
- 饿死 session 平均 peak input56,011 token
- 顺利 session 平均 peak input31,344 token**1.8× 差距**
**大 session 倾向被饿死**——因为它们在容量已紧张的 D 上更容易触发 admission 拒。
### 1.2 根因(代码级)
`policies.py:166-172` `KvAwarePolicy.select`
```python
score = (
overlap + sticky * self.sticky_bonus, # 主项: 历史 KV overlap
sticky, # 二级: 是否 last_decode_worker
inflight_penalty, # 三级: 当前 inflight 数(很小)
assignment_penalty, # 四级: 累计被分配数(更小)
)
```
评分中**完全无 D 当前容量项**。Session X 第一次落到 D-2 时积累 hash_id 在 D-2 上;之后无论 D-2 多满X 的 turn N+1 都会被打分到 D-2因为 overlap 主导)。
更糟的是 `RoutingState.decode_resident_blocks``policies.py:46`)从不缩减——即使 D 早 evict 了某些块replay 仍认为它们在那。运行中期所有 D 的 overlap 集合都接近"trace 全部 hash_id"policy 退化为纯 sticky。
### 1.3 后果——具体到 session 的体验
**饿死 session如 session 50400105 turns0 次 direct-to-D每 turn 流程**
1. policy 选 D永远是同一个
2. admission 拒D 容量已被占住)
3. 走 fallback-session-cap → P 全量 prefill 50K-100K token
4. mooncake 推 KV → D 仍无空间 → 32s timeout 或 KVTransferError
5. 用户每 turn 体验 5-10s 延迟,反复出错
**顺利 session如 session 3840118 turns97% direct-to-D每 turn 流程**
1. policy 选 D永远是该 session 的初始 D
2. admission 通过(这个 session 一直占着这个 D 的 slot
3. direct-to-DD 上 append-prefill 几百 token零 P 介入、零 mooncake transfer
4. TTFT 0.043s、E2E 0.495s
**这不是"平均慢一点",是结构性不公平**——SLO 视角下 P99 是被饿死那 15 session 的尾巴拉出来的。
### 1.4 为什么 naive DP 反而赢
8-way DP cache-aware 用纯 hash-based 路由,没有 session 抽象,没有 PD 拆分:
- 每个请求按 prefix hash 路由到一个 worker → 同 session 的 turn 在 worker 上自然有 prefix 命中
- 容量过载时 SGLang 自己的 radix cache + 调度器统一管 KV 池
- 不存在 admission/fallback/reseed 路径
- 不存在 mooncake transfer
- per-worker 负载误差 ±10%vs KVC ±26%),自动接近均衡
**KVC 引入的 session affinity / KV 复用 / admission 三件套,在容量紧张时反而加剧了不均衡,没有任何一项能挽回 vs DP 的差距。**
### 1.5 修复方向
`KvAwarePolicy.select` 里加:
```python
# 当前 D 容量利用率worker-mode admission 已经能查到)
capacity_penalty = -worker_capacity_used_ratio[worker.worker_id]
# 当多个 D 都有 overlap 时,按容量挑最空的;
# 当某 D 容量 > 阈值时,禁止该 D 进入候选
if worker_capacity_used_ratio[worker.worker_id] > HARD_CAP:
continue
score = (
overlap_capped, # overlap 但限幅,避免单个 D 永远赢
capacity_penalty, # ← 新增
sticky,
inflight_penalty,
)
```
更激进的修法:当一个 session 被某 D 反复拒 N 次后,主动 release 它在该 D 上的 session 状态,**允许下次 turn 走另一个 D**(代价是丢失已积累的 KV但目前 fallback 路径本来也丢了)。
---
## 2. D 端 LRU eviction 跟不上压力
### 2.1 数据
每个 D 全程:
| Worker | Trim 事件(主动 LRU | KVTransferError + OOM | 峰值 token_usage |
|---|---:|---:|---:|
| decode-0 | 9 | 0 | 0.99 |
| decode-1 | 43 | 12 (4 err + 8 oom) | 0.99 |
| decode-2 | 16 | 459 (153 err + 306 oom) | 0.97 |
| decode-3 | 37 | 87 (29 err + 58 oom) | 0.99 |
| decode-4 | 28 | 270 (90 err + 180 oom) | **1.00** |
| decode-5 | 30 | 279 (93 err + 186 oom) | **1.00** |
**LRU 触发频率比错误次数低 5-15 倍。** D-4 / D-5 直接顶到 token_usage=1.00。
### 2.2 根因
`scheduler.py:2040` `evict_idle_streaming_sessions_lru` 的 idle 判定:
```python
# 只能 evict "所有 req 都 finished + streaming 模式" 的 session
```
但 SWE 高并发下每个 session 几乎一直有 inflight reqtime-scale=10 又压缩了 inter-turn gap。**hot session 永远不 idleLRU 永远找不到东西可踢**。结果 D 一路开到 100% → 下一笔 transfer 来直接 OOM/timeout。
### 2.3 修复方向
引入分层 eviction
1. **Idle session 优先**(当前)
2. **冷 session 次优**(最近 N 秒无访问,即使有 inflight也可以 retract 那个 inflight 让位)
3. **hot session 强制 retract**(在 hard cap 触发时)
vanilla SGLang 已有 `disagg_decode_prealloc_queue.retracted_queue` 机制(看 `admit_direct_append` 引用),但**没有人主动触发 retract**——目前只有内部异常时才会进 retracted_queue。需要把 retract 提升为正常 admission 路径的一部分。
---
## 3. 没有 D→Replay 的 backpressure 通道
### 3.1 名词解释
**Backpressure反压** = 流式系统下游过载时把信号反向传给上游让它降速。例TCP 滑动窗口、Kafka consumer lag、gRPC HTTP/2 flow control。
### 3.2 当前状态
- D 端 transfer queue 堆 → 32s 后 timeout → 抛 KVTransferError
- error 抛回 P → P 抛给 router → router 抛给 replay → replay 走 fallback 路径
- **整个链路上没有"D 过载,请慢点发"的信号**——concurrency 一直保持上限
后果D 一旦开始失败,会**持续失败**(因为 replay 没降速),直到 D 自己消化完积压。
### 3.3 修复方向
`admit_direct_append` 响应里加:
```python
{
"can_admit": ...,
"recommended_pause_ms": int, # ← 新增:下次发同类请求前建议等多久
"queue_depth": int, # ← 新增D transfer queue 当前深度
...
}
```
replay 端在 admission 拒被拒时按 `recommended_pause_ms` 降并发或退避。**这是最便宜的一条改动**——不改协议、不改 SGLang 内部,只改两端代码。
---
## 4. Admission RPC 与 scheduler 耦合——结构 vs 工程的精确边界
### 4.1 现象
`docs/V5_PROFILE_INVESTIGATION_ZH.md` 报告:仅加 1Hz `/server_info` polling 就让 EXP2 errors 从 9 涨到 415。`/server_info` 在 scheduler 主循环里遍历 session slots 算 `is_idle`1 Hz × 8 worker 就足以扰动调度。
但实际负载下 admission RPC 频率远高于 1Hz每个 turn 1 + reseed + direct-to-D 都调一次。concurrency=32 + 4449 reqs / ~2700s ≈ **每秒 16+ 次 admission RPC**
### 4.2 这是结构问题还是工程问题——精确拆解
`admit_direct_append``scheduler.py:3581`)做两件事:
```python
# (a) 读池子状态——轻
available_tokens = self.token_to_kv_pool_allocator.available_size()
# (b) 触发 LRU 扫描——重,且必须修改池子状态
trim_result = self.maybe_trim_decode_session_cache(...)
```
| 部分 | 性质 | 是否能靠工程化解决 |
|---|---|---|
| (a) 读池子状态 | 几个原子读 | **完全可工程化**——做成 lock-free shared-memory snapshot 即可 |
| (b) LRU eviction | 修改 GPU 池子,必须独占 | **结构性的**——Python GIL + 共享 GPU 池子无法并发修改 |
**关键观察**:实际负载里 (b) 是少数路径——大部分 admission 只需要"看一下够不够",不需要立即 evict。
### 4.3 工程化修复方案
把 admission API 拆成两个端点:
```
POST /session_cache/probe ← 90% 流量
- 只读 lock-free snapshot
- 返回 (can_admit_estimate, available_tokens, queue_depth)
- 不进 scheduler 队列
POST /session_cache/commit_evict ← 10% 流量
- probe 不够时才调
- 进 scheduler 队列,做实际 LRU
- 保留当前 admit_direct_append 语义
```
snapshot 由 scheduler 在每个 step 末尾写到一段 mmap 共享内存atomic publishreplay 端 mmap 读,零 syscall 零序列化。一秒内能撑数千次 probe。
### 4.4 关于"协程/多线程/多进程/换语言"
| 工具 | 对本问题的实际效果 |
|---|---|
| asyncio 协程 | SGLang 已用,对 scheduler 主循环本身无帮助 |
| Python 多线程 | GIL 拦着,且 GPU 池子状态只能 scheduler 进程改 |
| 多进程 | scheduler 已是独立进程;问题是它**自己的 step 循环**串行了 admission 与 decode |
| orjson / uvloop | 网络/JSON 加速 5-10×但 LRU 遍历不在那条热路径 |
| Rust/C++ 重写 scheduler | 把 LRU 遍历提速 5-10×但**结构性共享问题仍在** |
**正确的工程化解法是重设计 API拆 probe / commit不是单纯换更快的库或语言。**
---
## 5. P-side 路由不感知 D 健康
### 5.1 数据
```
prefill-0: 367 KVTransferError, 361 "Decode instance could be dead"
prefill-1: 4 KVTransferError, 0 "Decode instance could be dead"
请求量对比:
prefill-0: 2225 requests
prefill-1: 2224 requests ← 几乎对半
```
**两 P 请求量完全均衡,错误率差 92×**。日志里 prefill-0 的错误反复指向某个特定 D`10.45.80.47:XXXXX`)——它跟某个 hot D 形成了"死亡链路"。
### 5.2 根因
`pd_router.py:43-49` 的 P 选择是裸 round-robin
```python
prefill_url, bootstrap_port = self.config.prefill_urls[
self.prefill_cursor % len(self.config.prefill_urls)
]
```
不知道 D 是否健康,不会避开"正在和 D-X 死磕"的 P。
### 5.3 修复方向
router 选 P 时考虑 (P 当前 inflight transfer 数, 目标 D 健康度) 联合得分。健康度可以用 §3 提的 `queue_depth` 字段。
---
## 6. Replay 端 session footprint 估算膨胀 30×
### 6.1 代码
`replay.py:898-899`
```python
def _estimate_session_resident_tokens(request: TraceRequest) -> int:
return request.input_length + request.output_length
```
被用于 `_decode_session_soft_cap``replay.py:1051`)和 `_should_admit_new_decode_session`
### 6.2 问题
对一个已经在 D 上有 80K KV 的 turn 50
- 真实增量需求input 新增几千 token + output 几百 token = ~3K
- 估算返回值80K + 1K = 81K**膨胀 ~27×**
后果router-mode admission 系统性误判——本来能 admit 的 session 被 replay 自己拒掉。v5 worker-mode 让 D 自己看真实容量部分修了这个,**但 KvAwarePolicy 选 D 时仍用这个膨胀估算**——选 D 仍然是错的。
### 6.3 修复
```python
def _estimate_session_resident_tokens(request: TraceRequest) -> int:
if request.turn_id == 1:
return request.input_length + request.output_length
# turn 2+: only the increment matters for additional reservation
return max(0, request.input_length - request.cached_tokens) + request.output_length
```
---
## 7. time-scale=10 测量失真
### 7.1 它是什么
`replay.py` 把原始 trace 每个请求的 `timestamp` 字段做 `t / time_scale` 缩放后再按这个时间发。
- 原始 trace 跨度 ~6000s≈100 分钟)
- time-scale=10 → 实际 replay 跨度 ~600s≈10 分钟)
### 7.2 为什么这么设计
**纯粹为了节省测试时间**——单次 1× 跑 100 分钟sweep 5 版 × 3 重复 = 25h GPU 时间10× 只要 2.5h。
### 7.3 它扭曲了什么
| 维度 | 原始 trace | replay (time-scale=10) |
|---|---|---|
| inter-turn gap p10 | 1.6s | 0.16s |
| inter-turn gap p50 | 2.5s | 0.25s |
| inter-turn gap p90 | 7.8s | 0.78s |
| inter-turn gap max | 261s | 26s |
真实 agentic 用户/agent 在每个 turn 之间停 2-8 秒思考、打字、tool call。**这些间隙正好是 KVC 想利用的"自然 idle 窗口"**——session 短暂 idle 时 LRU 可以 evict、其他 session 可以 admit。
time-scale=10 把这些窗口压到 0.2-0.8s**人为消除了 KVC 的设计前提条件**。
### 7.4 严重的实验有效性威胁
所有 v3-v6 数据基于 time-scale=10。这意味着前面所有"KVC 在 SWE 上输给 baseline"的结论都带着这个失真。**真实部署里 inter-turn gap 是 2.5s 的话KVC 可能根本不会撞到当前看到的容量瓶颈**——D 有时间在 turn 之间释放/重排。
**应该单独跑一组 time-scale=1 的 baseline 对比**,才能判断 KVC 输给 DP 是因为机制本身不行,还是因为 benchmark 把它推到了不该工作的区间。这是这个项目目前**最重要但还没做**的验证。
---
## 8. 应用层抽象不需要在引擎层引入(撤回)
之前草稿里提过"框架不支持 speculative 多分支、嵌套 sub-agent、tool call 中断"——这是过度抽象。**应用层模式都可以由 timestamp + 独立 session_id 隐式表达**
| 应用层模式 | 表现在 trace 里 | 推理引擎需要做什么 |
|---|---|---|
| Tool call 异步返回 | turn N 与 N+1 之间 timestamp gap 很大 | 啥都不用,按时间发请求即可 |
| 嵌套 sub-agent | 父 session timestamp 突然停顿sub-agent 是独立 session_id | 把它们当成两个独立 session 即可KV 也无需共享) |
| Speculative N 分支 | N 个独立 session_id 同时发 | 用 radix prefix cache 自然命中前缀;不需要任何额外抽象 |
**这条不构成结构性缺陷。** 已从结论中移除。
---
## 9. 行动项(按 ROI 排序)
### 优先级 P0修了显著改善饿死/不公平)
1. **[§1] KvAwarePolicy 加 capacity-aware penalty + 允许 session 跨 D 迁移** — 工程量中、收益最大
2. **[§2] D 端引入分层 eviction冷 session、hot retract** — 工程量中、收益大
3. **[§7] 跑一组 time-scale=1 baseline** — 工程量小(仅配置),但**不做这条所有结论都不可信**
### 优先级 P1修了把工程稳定性补齐
4. **[§3] D→Replay backpressure 通道**admission 响应加 pause hint — 工程量小
5. **[§4] 拆 admission 为 probe + commit_evict** — 工程量中
6. **[§6] 修 `_estimate_session_resident_tokens` 用增量** — 工程量小
### 优先级 P2等 P0 数据后再决定)
7. **[§5] P-side 选 P 时考虑 D 健康** — 工程量中
---
## 10. 局限与未验证假设
1. **N=1**:所有数据来自单次 runv6 P0 已证 EXP2 errors 在 9-912 间漂移single-run variance 巨大)。本文所有数字都应理解为"代表性观察"而非"统计显著结论"。
2. **time-scale=10 失真**§7所有"KVC 输给 DP"的程度可能是被 benchmark 放大的。这是最大的不确定性。
3. **8DP 对比的硬件优势**DP 是 8 个 worker 全部跑 prefill+decodeKVC 是 2P+6D只有 6 个能解码。理论上 8 worker 对 6 worker 自带 1.33× 解码并发优势。本文未折算这部分——但 8DP 优势远大于 1.33×latency mean 145% 优势所以核心结论KVC 在该 workload 下系统性输)不受此影响。
4. **mooncake TCP loopback**:所有 transfer 错误是单机 TCP 模拟下的产物。生产环境 RDMA 下错误率分布可能完全不同。
5. **KvAwarePolicy 的 stale `decode_resident_blocks`**§1.2 末尾)现象有数据观察支撑(运行中期 overlap 失去判别力),但**没有系统性测过"清掉 stale 状态会怎样"**。
6. **P-side 错误集中在 prefill-0**§5.1)的因果链是推测——可能也是"prefill-0 早启动 + race"的偶然结果。N>1 数据未验证。
---
## 附录 A数据产物索引
```
outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
├── exp2_2p6d_run1_metrics.jsonl ← 本文主数据源
├── exp2_2p6d_run1_summary.json
├── exp2_2p6d_run2_* (errors=912, single-run variance 证据)
├── exp2_2p6d_run3_* (errors=396)
└── kvcache-centric-*-20260429T142429Z/logs/
├── decode-{0..5}.log ← §2.1 LRU vs error 计数
└── prefill-{0,1}.log ← §5.1 P 错误分布
outputs/qwen3-30b-tp1-exps/
├── exp1_8way_dp_cache_aware_summary.json ← 对照 baseline
└── RESULTS_SUMMARY.md
```
## 附录 B相关文档
- `docs/PROJECT_OVERVIEW.md` — 项目目标与已实现功能
- `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 版本演进
- `docs/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(已 critic 修订)
- `docs/SWEBENCH_EXPERIMENT_RESULTS.md` — Qwen3.5-35B-A3B SWE 实验

View File

@@ -0,0 +1,367 @@
# KVC 实验踩坑记录与代码 Bug 分析v1 → v5
记录从 v1 到 v5 KVC 实验的踩坑过程、错误诊断、以及最终定位的代码 bug。
模型: Qwen3-30B-A3B (TP1),硬件: 单节点 8×H100 80GB。
Trace: `qwen35-swebench-50sess.jsonl`4449 请求52 sessions
## TL;DR
| 版本 | 关键变化 | 截断率 | direct-to-D 占比 | P50 | 主要瓶颈 |
|------|----------|:---:|:---:|:---:|----------|
| v1 (smoke / 早期) | mechanism 跑通 | - | - | - | - |
| v2 | KVC + `--policy default` | **56.8% / 61.4%** | <0.1% | 0.08s* | Routing 错位默认策略 |
| v3 | KVC + `--policy kv-aware` | **0.9%** | 30-42% | 1.5-1.8s | session-cap fallback (52-65%) |
| v4 | v3 + soft_cap 416 | 1.0% | 54-58% | 1.08 / 0.84s | session-cap fb 35%、9-10% mooncake errors |
| v5 | Option Dworker-mode 驱动 seed/reseed | 0.9% | 41-45% | 1.59 / 1.31s | D KV pool 真容量不足 fallback 反而 46-51% |
`*` v2 P50 是假数字——超过半数请求只生成 1 token 就被 abort
## v2 踩坑Default policy 与 KVC 机制根本不兼容
### 表象
`scripts/sweep_tp1_v2_fixed.sh` 跑出来
- Exp18-way DPbaseline4449/4449 成功P50=0.65serror=0
- Exp21P7D KVC**2524 truncated (56.8%)**18 errorsP50=0.08s* ()
- Exp32P6D KVC**2733 truncated (61.4%)**17 errorsP50=0.08s* ()
每个截断请求 `actual_output_tokens=1``finish_reason="abort: session id X does not exist"`
### 错误的早期诊断
之前 `RESULTS_SUMMARY.md` 把锅扣在 SGLang `--disaggregation-decode-allow-local-prefill` flag 认为是 D worker 在有 `bootstrap_room` 时仍然做了 local prefill这个诊断**完全错误**—— `scheduler.py:1975-1980` `_should_allow_local_prefill_on_decode`
```python
def _should_allow_local_prefill_on_decode(self, req: Req) -> bool:
return (
self.disaggregation_mode == DisaggregationMode.DECODE
and self.server_args.disaggregation_decode_allow_local_prefill
and req.bootstrap_room is None # ← 有 bootstrap_room 不会走 local prefill
)
```
KVC reseed 路径的请求都带 `bootstrap_room`根本不会触发 local prefill
### 实际根因Replay 与 PD Router 的 round-robin 错位
实验脚本里 KVC `--policy default` baseline `--policy kv-aware`
`benchmark.py:287-300` 这两者的差别巨大
```python
def _decode_policy_for(policy_name: str) -> str:
if policy_name == "sticky": return "manual"
if policy_name == "kv-aware": return "consistent_hashing"
return "round_robin" # default
def _header_mode_for(policy_name: str) -> str:
if policy_name == "sticky": return "routing-key"
if policy_name == "kv-aware": return "target-worker"
return "none" # default
```
`default` policy + KVC 机制下
1. Replay policy`policies.py:DefaultPolicy`round-robin 选一个 D比如 D-3
2. Replay D-3 `open_session(session_id=X)``replay.py:1722-1731`
3. Replay 通过 PD Router 发请求 `session_params` `header_mode=none`**不发任何 routing header**
4. PD Router (`pd_router.py:_select_decode_index`) 看到 `decode_policy=round_robin`**自己独立的计数器**round-robin发到了 D-5
5. D-5 scheduler 看到 `session_params` 里有 session_id但自己的 `session_controller` 里没这个 sessionsession D-3 )→ abort with `"Invalid request: session id X does not exist"` (`scheduler.py:1824-1836`)
两个独立的 round-robin 计数器只要一次错位任何并发或 direct-to-D 绕过 router 的请求都会引起就永远对不上
### 为什么 turn 0 不出问题?
Turn 0 `_invoke_plain_router``replay.py:1894`不带 `session_params`作为普通 PD disagg 请求处理发到任何 D 都行Turn 1+ 才开始走带 session_params KVC 路径撞上路由错位
### 数据特征验证per-session pattern
```
session 11360 (58 turns): pattern = .TTTTT.TTTTTTT.TTTTTT... ← turn 0 OK1+ 全 T
session 18720 (87 turns): pattern = .TTTTTTTTTTTTTTTTTT...
```
每个 D worker 收到了全部 52 session 的请求理想情况下应该是 ~7-8 /D因为 round-robin session 完全打散)。
### 修复
唯一正确的修复是把 KVC policy `default` 改成 `kv-aware`
```diff
- --policy default
+ --policy kv-aware
```
`KvAwarePolicy` (`policies.py:146-187`) 做两件事
1. `_overlap_blocks` + `sticky_bonus` 给每个 D 打分session 自然粘在同一个 D**session 亲和性**
2. `header_mode=target-worker` `x-smg-target-worker` header
3. PD Router `consistent_hashing` 模式看到 header 就直接用不再 round-robin
## v3 改 kv-aware policy 后:路由对了,但新瓶颈出现
`scripts/sweep_tp1_v3_kvaware.sh` 把所有 KVC 实验改成 `--policy kv-aware`结果
| 指标 | v2 1P7D (default) | **v3 1P7D (kv-aware)** | v3 2P6D | 8-way DP baseline |
|------|:---:|:---:|:---:|:---:|
| 截断 | 56.8% | **0.9%** | 0.9% | 1.5% |
| Errors | 18 | 363 (8.2%) | 9 | 0 |
| Mean | 4.74s | 4.88s | 3.58s | 1.43s |
| P50 | 0.08s* () | 1.75s | 1.52s | 0.65s |
| P90 | 12.14s | 12.67s | 9.23s | 3.61s |
| TTFT P50 | - | 0.36s | 0.33s | 0.09s |
**截断从 56.8% 降到 0.9%,路由问题彻底解决**
P50 仍然是 baseline 2-3
### Direct-to-D 路径表现优秀KVC 该有的样子)
execution_mode 拆开看
| 路径 | Exp1 1P7D 占比 | Exp1 1P7D P50 | Exp1 1P7D TTFT P50 |
|------|:---:|:---:|:---:|
| `kvcache-direct-to-d-session` | 42.0% | **0.495s** | **0.043s** |
| `pd-router-fallback-large-append-session-cap` 🔥 | **52.6%** | 5.6s | 3.7s |
Direct-to-D 路径下
- P50 = 0.495s**比 baseline 0.65s 25%**
- TTFT P50 = 0.043s**比 baseline 0.093s 2 **
- KV transfer = 0 P 介入 D append-prefill
这才是 KVC 真正的价值但只有 30-42% 请求走到这条路
### 新瓶颈session-cap fallback 占了 52-65%
`pd-router-fallback-large-append-session-cap` 1P7D 52.6%、2P6D 65.4%。这条路径意味着 router 想开新 session D admission 拒绝了"d-session-cap"只好回退到 plain routerP 全量 prefill + 传给 D session 复用)。
### Bimodal session 分布starvation
| Session | Total turns | Direct-to-D | Session-cap fallback |
|---------|:---:|:---:|:---:|
| 22080 | 129 | **98%** | 0% |
| 3840 | 118 | **97%** | 0% |
| 70560 | 150 | **0%** | **99%** |
| 39360 | 148 | **0%** | **99%** |
| 61600 | 117 | **0%** | **99%** |
要么完全幸运要么完全饿死——典型的双峰分布
### 根因:硬编码 cap=4
`replay.py:_decode_session_soft_cap` 原始代码
```python
def _decode_session_soft_cap(...) -> int:
target_tokens = max(1, _estimate_session_resident_tokens(request))
usable_capacity_tokens = _usable_capacity_tokens(residency, server_url)
...
if usable_capacity_tokens <= 0:
return 4
return max(1, min(4, usable_capacity_tokens // target_tokens))
# ^^^ 硬编码上限 4
```
7 D × 每个 D 最多 4 session = **28 个 session slot 总容量**。Trace 52 session 24 session 永远抢不到 slot
启动期 race condition 决定了哪些 session "幸运儿"—— 28 个挤进来的 session 的所有后续 turn 都走 direct-to-D剩下 24 session 永远走 session-cap fallback)。
## v4 改进:把硬 cap 从 4 提到 16
`replay.py:_decode_session_soft_cap` 一行修改
```diff
- if usable_capacity_tokens <= 0:
- return 4
- return max(1, min(4, usable_capacity_tokens // target_tokens))
+ if usable_capacity_tokens <= 0:
+ return 16
+ return max(1, min(16, usable_capacity_tokens // target_tokens))
```
7 D × 16 = 112 slot远超 52 session 需求
### v4 实际结果vs v3 1P7D / 2P6D
| 指标 | v3 1P7D | **v4 1P7D** | v3 2P6D | **v4 2P6D** | baseline 8DP |
|------|:---:|:---:|:---:|:---:|:---:|
| Errors | 363 (8%) | 435 (10%) | 9 (0%) | **403 (9%)** | 0 |
| 截断 | 42 | 43 | 42 | 36 | 68 |
| **direct-to-D** | 38.6% | **54.3%** | 30.5% | **58.0%** | - |
| **session-cap fallback** | 48.3% | 37.4% | 65.4% | **34.7%** | - |
| Session reused | 1716 | 2180 | 1358 | **2348** | - |
| KV transfer blocks | 62K | 53K | 79K | **51K** | - |
| Mean | 4.88s | 4.21s | 3.58s | **2.51s** | 1.43s |
| **P50** | 1.75s | 1.08s | 1.52s | **0.84s** | **0.65s** |
| P90 | 12.67s | 13.38s | 9.23s | **6.51s** | 3.61s |
| P99 | 28.72s | 24.45s | 18.70s | 18.34s | 8.38s |
| **TTFT P50** | 0.36s | 0.056s | 0.33s | **0.051s** | 0.094s |
| TTFT P90 | 10.97s | 11.90s | 6.95s | **2.64s** | 0.26s |
direct-to-D 占比从 v3 30-38% 涨到 v4 54-58%
session 复用 +27% (1P7D) / +73% (2P6D)
KV transfer -15% (1P7D) / -36% (2P6D)
TTFT P50 反超 baseline 46%0.051s vs 0.094s
### Direct-to-D 路径全面碾压 baselineKVC 真实价值)
| Config | n | Lat P50 | Lat P90 | TTFT P50 | TTFT P90 |
|--------|:---:|:---:|:---:|:---:|:---:|
| baseline 8DP | 4381 | 0.66s | 3.65s | 0.094s | 0.256s |
| v4 1P7D direct-to-D | 2179 | 0.495s | 3.03s | 0.044s | 0.055s |
| **v4 2P6D direct-to-D** | **2348** | **0.499s** | **2.86s** | **0.043s** | **0.054s** |
direct-to-D 子集相对 baseline
- P50 24-30%
- P90 16-22%
- TTFT P50 54%
- TTFT P90 79%
### 整体性能(去掉 errors 和 truncatedvs baseline
| Config | clean | Mean | P50 | P90 | P99 |
|--------|:---:|:---:|:---:|:---:|:---:|
| baseline 8DP | 4381 | 1.45s | 0.66s | 3.65s | 8.38s |
| v4 2P6D | 4010 | 2.53s | 0.85s | 6.55s | 18.33s |
vs baselineP50 28%、P90 80%、P99 119%。即使错误率为 0整体仍输 baseline——根因是 35% 请求被推到 fallback 路径
### 新瓶颈 135% 请求仍走 session-cap fallback
抬到 16 后真实瓶颈是 capacity-based 计算`min(16, usable_capacity_tokens // target_tokens)`
- `target_tokens = input + output`agentic 里常见 50-100K
- D KV pool 100-150K tokens80GB H100, mem_fraction=0.835
- `usable / target` = 1-2远没到 16 真实 cap capacity 算出来的小数字
要解决必须改 capacity-based 估算逻辑或上方案 D D 自己决定)。
### 新瓶颈 29-10% errorsmooncake 传输超时)
P-side log 显示
```
KVTransferError: Failed to send kv chunk of <bootstrap_room> to 10.45.7.165:40319
Sync batch data transfer timeout after 32722558107ns (32 秒超时)
Decode instance could be dead, remote mooncake session ... is not alive
```
特征
- 所有 errors run 44.8% 之后出现系统压力累积
- 98% errors 集中在 turn 31 input 的请求
- v3 cap=4 1P7D 已有 363 errors 1 D 集中受冲击v4 cap=16 把压力均匀分布但量级更大
mooncake TCP loopback 在并发上去后撞超时**不是 SGLang 逻辑 bug**。修复方向
1. 加长 mooncake transfer timeout现在 32s
2. 限制并发 inflight transfer 数量
3. 改用 RDMAloopback 是单机模拟生产环境换真 RDMA
4. chunked KV transfer
## v5 落地方案 Dworker-mode 驱动 seed/reseed
`scripts/sweep_tp1_v5_optD.sh` 真正把方案 D 落到了代码里改动核心 `--kvcache-admission-mode` `local`(replay 估算) 改成 `worker`(D 决策)并扩展到 **direct_append + seed + reseed 全部路径**
### 关键代码改动
1. SGLang `scheduler.py` `admit_direct_append` 端点新增 `mode` 字段支持 `direct_append | seed`seed 模式会触发 D 真正去 reserve KV pool 块并主动调用 `maybe_trim_decode_session_cache` LRU
2. Replay `replay.py` reseed / turn-1 seed / large-append-reseed 都改走同一个 admit endpoint`_decode_session_soft_cap` worker mode 下被完全 bypass
3. 新增运行参数`--kvcache-admission-mode worker``--kvcache-seed-min-turn-id 1``--kvcache-seed-max-inflight-decode -1``--kvcache-prefill-backup-policy release-after-transfer``--kvcache-prefill-priority-eviction`
### 假设
- v4 35% session-cap fallback 来自 replay 视图过期 + capacity-based 计算保守 D 自己看 KV pool 应该把这 35% 救回来
- D 主动 LRU eviction replay 自己写的 reservation 更准确**应该**让更多 session seed 进来
### v5 实际结果vs v4 同配置)
| 指标 | v4 1P7D | **v5 1P7D** | v4 2P6D | **v5 2P6D** | baseline 8DP |
|------|:---:|:---:|:---:|:---:|:---:|
| Errors | 435 (10%) | **9 (0.2%)** | 403 (9%) | **9 (0.2%)** | 0 |
| 截断 | 43 | 42 | 36 | 42 | 68 |
| direct-to-D | 54.3% | 44.7% | 58.0% | 41.3% | - |
| **session-cap fallback** | 37.4% | **45.6%** | 34.7% | **50.6%** | - |
| no-d-capacity fallback | 0.3% | 1.2% | 0.2% | 0.8% | - |
| pd-router-turn1-seed (新可见) | - | 1.2% | - | 1.1% | - |
| pd-router-d-session-reseed (新可见) | - | 4.8% | - | 3.4% | - |
| pd-router-large-append-reseed (新可见) | - | 1.0% | - | 1.0% | - |
| Session reused | 2180 | 1990 | 2348 | 1837 | - |
| KV transfer blocks | 53K | 66K | 51K | 69K | - |
| Mean | 4.21s | 5.18s | 2.51s | 3.49s | 1.45s |
| **P50** | 1.08s | 1.59s | 0.84s | 1.31s | 0.66s |
| P90 | 13.38s | 14.67s | 6.51s | 9.09s | 3.65s |
| P99 | 24.45s | 26.09s | 18.34s | 24.92s | 8.38s |
| TTFT P50 | 0.056s | 0.21s | 0.051s | 0.24s | 0.094s |
| TTFT P90 | 11.90s | 13.06s | 2.64s | 6.90s | 0.26s |
**可靠性大幅提升**mooncake 传输超时 errors 9-10% 跌到 0.2%。D 真容量决策避免了 v4 那种"乐观 admit 30s 后超时"的死亡链路
reseed / turn1-seed 路径首次显式出现证明 admission 端点对 seed 模式确实生效了
**session-cap fallback 不降反升**3746% 3551%)。说明 v4 的本地 soft_cap 实际上** D 真实容量更乐观**——admit 进来后转身就 OOM统计成了 error 而不是 fallback
直接结果**direct-to-D 占比下降整体延迟全面变差**。P50/P90/P99 TTFT 都退步
### Direct-to-D 子集还是稳的KVC 真实价值仍在)
| Config | n | Lat P50 | Lat P90 | TTFT P50 | TTFT P90 |
|--------|:---:|:---:|:---:|:---:|:---:|
| baseline 8DP | 4381 | 0.66s | 3.65s | 0.094s | 0.256s |
| v4 2P6D direct-to-D | 2348 | 0.499s | 2.86s | 0.043s | 0.054s |
| **v5 1P7D direct-to-D** | 1990 | 0.475s | 3.04s | 0.043s | 0.055s |
| **v5 2P6D direct-to-D** | 1837 | 0.483s | 3.04s | 0.043s | 0.054s |
direct-to-D 的尾延迟和 TTFT v4 几乎完全一致端点决策开销可忽略**v5 的回退不是路径本身变慢而是更多请求被赶到 fallback**。
### Fallback 路径反而比 v4 更糟
| Config | n | Lat P50 | Lat P90 | TTFT P50 |
|--------|:---:|:---:|:---:|:---:|
| v5 1P7D session-cap fallback | 2027 | 6.38s | 17.47s | 4.49s |
| v5 2P6D session-cap fallback | 2253 | 3.13s | 11.25s | 0.89s |
由于 fallback 占比上升且这条路径本身就比 direct-to-D 慢一个数量级整体均值被拖累得更厉害
### v5 真正暴露的瓶颈D 的 KV pool 物理容量
admission 决策权交给 D 之后瓶颈从"replay 估得太死"变成"D 真的装不下"
- 80GB H100 × `mem_fraction_static=0.835` D 单卡 KV pool 100-150K tokens
- agentic context session turn footprint 50-100K
- D 上能并存的 session 数量本就 2-3 7 D 50 session 基本不可能
v4 cap=16 之所以"看起来好"部分是因为本地 soft_cap 没真的查 D free pool开了一堆**最终会失败** session统计成 errors 而非 fallback)。v5 把这部分洗成了"诚实的拒绝"——可靠性跃升的代价是看见了真实容量上限
### v6 应该针对什么
D 物理容量管理打开而不是再调 replay
1. **prefill backup 提早 release**已经加了 `release-after-transfer` 但可能还不够及时 P 上的 backup blocks 不要长期占用 KV pool
2. **priority eviction 策略调优**已开 `--kvcache-prefill-priority-eviction`当前 LRU 可能把 hot session 误踢需要按 session 命中频率/最近访问做加权
3. **chunked / streamed seed**不要一次 reserve 整个 prompt 的容量 chunk 分摊
4. **跨 D 的 session migration**当一个 D 满了但隔壁 D 空时主动迁移而不是直接 fallback P
5. **真正的多机 RDMA**单机 mooncake loopback errors 的根因之一上多机 + RDMA 才能让 prefill backup release 后的 KV transfer 真的稳
工程量1-3 SGLang 内部改 (`scheduler.py` + `session_controller.py`)4 需要 router 协议扩展5 是部署变更
## 关键文件与代码位置索引
| 现象 | 代码位置 |
|------|----------|
| Replay policy round-robin | `policies.py:63-67` `RoutingState.next_decode_worker_id` |
| KV-aware policysession 亲和 | `policies.py:146-187` `KvAwarePolicy.select` |
| PD router decode 选择 | `pd_router.py:51-74` `_select_decode_index` |
| Header 构建 | `replay.py:2407-2424` `_build_headers` |
| Policy router config 映射 | `benchmark.py:287-300` `_decode_policy_for/_header_mode_for` |
| Session admission cap | `replay.py:889-905` `_decode_session_soft_cap` |
| 已有的 D admission 端点 | `scheduler.py:3497-3580` `admit_direct_append`v5 扩展支持 `mode=seed` |
| Worker-mode admission 调用方 | `replay.py` reseed / turn1-seed / large-append-reseed 路径 |
| Prefill backup 释放策略v5 引入 | `--kvcache-prefill-backup-policy release-after-transfer` |
| Prefill priority evictionv5 引入 | `--kvcache-prefill-priority-eviction` |
| Session D 上找不到的报错 | `scheduler.py:1824-1836` |
| `_should_allow_local_prefill_on_decode` | `scheduler.py:1975-1980` |
| Reseed 流程入口 | `replay.py:1665-1809` `_invoke_kvcache_seeded_router` |
| Direct-to-D 流程 | `replay.py:2351-2398` `_invoke_decode_session_direct` |
## 经验教训
1. **policy 和 mechanism 是两个正交维度**——`--policy default` 不是"无脑默认值"它真的是 round-robin session 亲和性KVC 机制必须配 session 亲和的 policy
2. **不要无脑相信前一个 agent 的 RESULTS_SUMMARY**——v2 的诊断"local prefill bug"和实际 finish_reason"session id does not exist"完全对不上任何错误诊断必须用 finish_reasonexecution_mode 这些原始字段交叉验证
3. **bimodal 分布是 starvation 的强信号**——v3 数据里某些 session 100% 走快路径某些 100% 走慢路径几乎肯定是某种"先到先得"的资源竞争看到这种模式立刻去找硬编码 cap 或全局共享资源
4. **测量要看分组而非整体均值**——v3 整体 P50=1.5s 看似比 baseline 但拆开看 direct-to-D 子集 P50=0.495s 已经反超 baseline整体均值被 fallback 路径拖累 KVC 的核心价值是真实存在的
5. **errors 与 fallback 是同一类资源压力的两副面孔**——v4 " fallback + error "不是更优解是把容量超限的失败从"显式拒绝"伪装成"超时失败"。v5 把决策权交给真容量后fallback errors 这是更诚实的指标不要被 v4 fallback 数字误导当看到错误率和 fallback 率呈反相关时要警惕 admission 决策是否在说谎

View File

@@ -0,0 +1,356 @@
# KVC-Router面向 Agentic 多轮 LLM Serving 的 Session-Aware 调度算法
**性质**:论文级形式化规范——用于团队内部对齐 + 外部读者 onboarding。
**对象**:项目团队(统一术语);论文 reviewer算法定义
**最近更新**2026-05-11。
本文给出本项目所开发的 **KVCache-Centric Router**(以下简称 "KVC-Router")调度算法的形式化、与实现无关的定义。本文设计为可直接被论文引用,并作为"KVC 到底在谈论什么调度算法"的标准回答。
对应的参考实现位于:
- `src/agentic_pd_hybrid/policies.py``KvAwarePolicy``RoutingState`
- `src/agentic_pd_hybrid/replay.py` — orchestrationadmission RPC、reset-on-success、fallback chain
- `third_party/sglang/python/sglang/srt/managers/scheduler.py` — D-worker 端的 admission 决策
---
## 1. 问题定义
我们要服务一群多轮 agentic LLM session如 Claude Code、Codex、Cursor 等 coding agent底层是异构 worker 池,分成:
- **Prefill workers**`P`GPU 常驻的模型副本,针对长输入 prompt 的 batched prefill 做了优化。
- **Decode workers**`D`GPU 常驻的模型副本,配备 session-aware KV cache"SessionAwareCache"),具备:(i) 跨 turn 保留 session 的 KV 状态;(ii) 在本地已缓存的 prefix 上做 append-prefill无需绕回 `P`
在一个 agent turn 内,请求 `r` 到达时其对话 prefix 已经从前序 turn 累积;**新增**的 tokens工具输出、用户消息等构成小规模 **append**。驱动 KVC 设计的根本观察是:
> 当 prefix KV **已经驻留在将要解码该请求的 D worker 上**,请求的 first-token 延迟仅由 *append* 大小决定(典型 O(10²10³) tokens而非完整 prompt 大小(典型 O(10⁴10⁵) tokens
Router 的工作就是最大化满足上述条件的请求占比,同时尊重容量约束、不造成 session 无限饿死。
### 1.1 优化目标
给定来自 `S` 个 session 的请求流 `R = (r_1, r_2, ...)`,最小化 SLO 加权的 TTFT 与端到端延迟混合:
```
minimize E[ w_ttft · TTFT(r) + w_lat · E2E_Latency(r) ]
subject to capacity[d] ≤ K_d 对任意 D worker d 在任意时刻 t,
没有 session 被永久拒绝服务.
```
参考实现中通过 measurement 隐式取 `w_ttft = 1, w_lat = 1`per-D KV 池预算 `K_d` 取 SGLang 启动时上报的 `max_total_num_tokens`
---
## 2. 系统模型与记号
### 2.1 集合
| 符号 | 含义 |
|---|---|
| `P = {p₁, …, p_|P|}` | Prefill worker 池 |
| `D = {d₁, …, d_|D|}` | Decode worker 池 |
| `S` | Session 标识符集合(由上游 agent runtime 分配) |
| `H` | KV block hash 的全集(本实现中每 `BLOCK_TOKEN_BUDGET = 24` tokens 对应一个 hash |
### 2.2 请求
一个请求 `r` 是一个元组:
```
r = ⟨ s(r), t(r), prefix_hashes(r), append_len(r), input_len(r) ⟩
```
其中:
- `s(r) ∈ S` — session id
- `t(r) ∈ ` — 该 session 内的 turn index0 = 首轮)
- `prefix_hashes(r) ⊂ H` — 覆盖请求输入 prefix 的 block hash 集合
- `append_len(r) ∈ ` — 新到达、**不在** `prefix_hashes(r)` 中的 token 数
- `input_len(r) = (|prefix_hashes(r)| · 24) + append_len(r)` — 总 token 数
### 2.3 Router 状态 (`Σ`)
Router 跨请求维护的全局状态:
| 字段 | 类型 | 语义 |
|---|---|---|
| `resident[d]` | `set[H]` | Router 估计的 D `d` 当前 SessionAwareCache 中常驻的 block hash 集合router 端估计,真值在 worker 上) |
| `pin[s]` | `D {⊥}` | Session `s` 最近一次成功服务的 D`⊥` 表示从未见过 |
| `inflight[d]` | `` | 当前已派发给 `d` 但尚未完成的请求数 |
| `assigned[d]` | `` | 累计派发到 `d` 的路由决策次数(负载 tie-breaker |
| `rejects[s,d]` | `` | per-(session, D) 的 admission 拒绝计数v2 引入的 migration 机制) |
### 2.4 超参数
| 符号 | 默认值 | 描述 |
|---|---|---|
| `α``sticky_bonus` | 1 | 匹配 `pin[s]` 的 D 在评分中获得的 bonus |
| `τ_reject``migration_reject_threshold` | 3 | (s, d) 被拒绝达此次数后d 对 s 进入 blacklist |
| `τ_append``kvcache_direct_max_uncached_tokens` | 8192v2 | 走 Direct-to-D 路径允许的最大 append 长度 |
| `K_d` | 取自 SGLang `max_total_num_tokens` | per-D 的 KV 池预算 |
| `ρ` | 0.95 | 容量高水位线(隐式由 SGLang 强制) |
| `ε`(最大 fallback 重试数) | `|D| - 1` | router 在退化到 vanilla PD-disagg 之前最多探测几个 D |
### 2.5 路由结果
路由决策 `δ(r)` 取以下四种之一:
| Mode | 含义 | KV transfer |
|---|---|---|
| `Direct(d)` | r 完全在 D `d` 上执行D 在其常驻 KV 上做 append | **无**(快路径) |
| `Seed(d)` | Session 首轮P 做完整 prefillKV 通过 mooncake 传到 `d` | 完整 input |
| `Reseed(d)` | Session 之前在某个 D' 上,但已不再常驻;按 Seed 处理 | 完整 input |
| `Fallback(p, d)` | Vanilla pd-disagg 路径(其它 D 均被 blacklist 或拒绝) | 完整 input |
---
## 3. 算法
KVC-Router 由三个相互配合的过程组成:
- **Algorithm 1 (`Route`)**router 端基于评分的候选选择。
- **Algorithm 2 (`Admit`)**D-worker 端的 admission 决策(在 D scheduler 中执行,非 router
- **Algorithm 3 (`Dispatch`)**:端到端 orchestration把 Route + Admit + reset-on-success 串起来。
### 3.1 Algorithm 1`Route(r, Σ)` — 基于评分的候选选择
```
输入:请求 r状态 Σ
输出:候选 d* ∈ D若所有 D 都被过滤后仍无候选,退化分支兜底返回最少被拒的 D
1. blacklisted ← { d ∈ D : Σ.rejects[s(r), d] ≥ τ_reject }
2. C ← D blacklisted // 候选 D 集合
3. if C = ∅ : // 退化
4. return argmin_{d ∈ D} Σ.rejects[s(r), d] // 选最少被拒的 D
5. for each d ∈ C :
6. overlap(d) ← |prefix_hashes(r) ∩ Σ.resident[d]|
7. sticky(d) ← 1 if Σ.pin[s(r)] = d else 0
8. infl(d) ← Σ.inflight[d]
9. assn(d) ← Σ.assigned[d]
10. score(d) ← ⟨ overlap(d) + α·sticky(d), // 主项
sticky(d), // tie-1
infl(d), // tie-2负载小者占优
assn(d) ⟩ // tie-3
11. return argmax_{d ∈ C} score(d) // 按字典序最大
```
**说明**
- 评分是 **4 元组按字典序比较**,不是单个标量——这样避免在不同维度之间调权重。
- 第 10 行的主项 `overlap + α·sticky` 同时奖励 KV 复用与 session stickiness。取 `α=1``overlap` 以 block24 tokens为单位时**任何一次 hash 命中都压制纯 sticky 的候选**。
- 第 14 行的 blacklist 过滤防止永久绑死在已饱和的 D 上;与 Algorithm 3 的 reset-on-success 配合,限定了 migration 频率。
### 3.2 Algorithm 2`Admit(d, r, M, K)` — D-worker admission 决策
在 D worker 自己的 scheduler 内部执行(非 router这是 **KVC 的机制核心**:每个 D 自治判断能否把 `r` 当作 Directappend-only服务还是必须改走 P 路径。
```
输入D worker d请求 rd 上本地常驻的 session 集合 M_dKV 池预算 K_d
输出⟨can_admit ∈ {True, False}, mode ∈ {Direct, Seed, Reseed, ⊥}, reason⟩
1. used_tokens ← Σ_{s' ∈ M_d} resident_tokens(s', d) // D 自己的 bookkeeping
2. cap_ok ← (used_tokens + input_len(r)) ≤ ρ · K_d // 高水位线 ρ ≈ 0.95
3. if s(r) ∈ M_d : // session 在 d 上有常驻
4. if append_len(r) ≤ τ_append and cap_ok :
5. return ⟨True, Direct, ∅⟩ // → 快路径
6. elif append_len(r) > τ_append :
7. return ⟨False, ⊥, "real-large-append"⟩
8. else :
9. return ⟨False, ⊥, "no-d-capacity"⟩
10. else : // session 在 d 上无常驻
11. if cap_ok :
12. mode ← Seed if t(r) = 0 else Reseed
13. return ⟨True, mode, ∅⟩ // → 经 P 做 KV seeding
14. else :
15. return ⟨False, ⊥, "session-not-resident-no-capacity"⟩
```
**说明**
- 该过程通过同步 HTTP RPC`/admit_direct_append`)从 router 调用。RPC 阻塞直到 D scheduler 给出权威答复——这是 v5 引入的 **"worker-mode admission"**,替换了更早的 router-端容量估算(系统性偏乐观)。
- reason 字符串被回传给 router用于(i) 在 Algorithm 3 中驱动 fallback chain(ii) 标注 `execution_mode` 字段便于分析。
### 3.3 Algorithm 3`Dispatch(r, Σ)` — 端到端 orchestration
```
输入:请求 r状态 Σ
输出:执行模式 μ ∈ {Direct, Seed, Reseed, Fallback}
1. retries ← 0
2. tried ← ∅
3. while retries < ε :
4. d* ← Route(r, Σ \ {对 tried 中的 d 已 bump 过的 rejects})
5. if d* = ⊥ : break // 无候选
6. resp ← Admit(d*, r) // RPC 到 D scheduler
7. if resp.can_admit :
8. Σ.rejects[s(r), d*] ← 0 // ◀ reset-on-successv2
9. Σ.pin[s(r)] ← d*
10. Σ.inflight[d*] ← Σ.inflight[d*] + 1
11. if resp.mode = Direct :
12. 在 d* 上完整执行 rappend-prefill + decode
13. return Direct
14. else : // Seed 或 Reseed
15. p ← round_robin_next(Σ, P)
16. 在 p 上做 r 的 prefill
17. 经 mooncake 把 KV(r) 从 p 传到 d*
18. 在 d* 上 decode r
19. return resp.mode
20. else :
21. Σ.rejects[s(r), d*] ← Σ.rejects[s(r), d*] + 1
22. tried ← tried {d*}
23. retries ← retries + 1
24.
25. // ε 次重试耗尽——退化 Fallback 到 vanilla pd-disagg
26. p ← round_robin_next(Σ, P)
27. d ← round_robin_next(Σ, D)
28. 通过 ⟨p, d⟩ 走 pd-disagg(r)
29. return Fallback
```
**维持的关键不变量**
1. **不会静默过载**:一个 D 永不接受会让 `used_tokens > ρ · K_d` 的请求Algorithm 2 第 2 行)。
2. **不存在永久饿死**:对任意 session `s`,只要曾在某 D `d*` 上成功过一次,之后 `Σ.rejects[s, d*] = 0`Algorithm 3 第 8 行)。因此 blacklist 计数器不会对仍在某处成功获得服务的 session 累积——这阻止了 **v1 的 thrashing 病理**:原本 blacklist 计数器单调增长 + 退化 fallback 形成自放大的 round-robin 死循环。
3. **migration 有界**:一个 session 从 D `a` 迁移到 D `b` 必须经过连续 `τ_reject` 次在 `a` 上失败、期间无任何成功。每个 session 生命周期内的最坏 migration 次数 ≤ `(|D| 1) · τ_reject`
### 3.4 Reset-on-success为什么这是关键修复v1 → v2 演化)
v1 实现**省略了** Algorithm 3 第 8 行——一旦 `(s, d)` 累积 `τ_reject` 次拒绝d 对该 session **整个 run 永久 blacklist**。实测Migration v1`docs/MIGRATION_V1_FINDINGS_ZH.md`)触发了自放大的失效模式:
```
session s 在 d 上稳定服务 70 个 turn
↓ 瞬时 burst 让 d 短暂饱和
3 次到 d 的 admission 被拒 → rejects[s,d] = 3 → d 对 s 永久 blacklist
↓ s 迁到 d'd' 也在负载中 → 被拒 → blacklist
↓ d'' 同理
所有 D 都 blacklist → 退化 fallback round-robin → 每次重试都 bump 一次计数器
→ s 永远在 D 之间 thrashing每次都丢失 KV residency
```
reset-on-success 关上了这个回路:只要 `s` 在任一 d 上真正完成一次 Direct针对该 session 的 blacklist 立刻清零。该机制只对**持续性**(不是瞬时性)容量压力触发。
---
## 4. 性质
### 4.1 Theorem 1在有界 ε 下无永久饿死)
*假设 `τ_reject ≥ 1` 且每个 D worker 的容量非零。则对任意能在 admission 时容下的 session `s`Algorithm 3 在至多 `|D| · τ_reject` 次重试内返回 `{Direct, Seed, Reseed}` 之一;之后任意一次 Direct 成功即可清空 `s` 的所有 blacklist。*
**证明概要**每次循环要么成功return、要么恰好让某个 `rejects[s, d]` 计数器 +1第 21 行)。经过 `|D| · τ_reject` 次迭代后,每个 D 要么对 `s` 已被 blacklist`Route` 第 1 行会过滤),要么已成功(已终止)。在所有 D 都被 blacklist 的饱和点,`Route` 第 3 行返回最少被拒的 D打破对称性强制取得进展。∎
### 4.2 Theorem 2fast-path 命中下限)
*假设 session `s` 在 D `d` 上已积累 KV residency `R_s ⊂ H`,且在某 turn `t > 0` 提交的请求 `r` 满足 `prefix_hashes(r) ⊆ R_s`、`append_len(r) ≤ τ_append` 且 admission 容量充足。则 Algorithm 3 将 `r` 路由为 Direct(d)。*
**证明概要**:由 Algorithm 1`overlap(d) = |R_s|` 取得最大值;结合 `α·sticky(d) ≥ 1`d 的字典序得分严格高于任何 `prefix_hashes(r) ⊈ R_{s,d'}` 的 d'。故 `Route` 返回 d。`Admit(d, r)` 进入 `s ∈ M_d ∧ append ≤ τ_append ∧ cap_ok` 分支,返回 Direct。∎
这是 **支持架构设计的机制级保证**:只要 residency、append 大小、容量三者同时成立,快路径就被**确定性地**选中KVC 在典型场景下的 TTFT 优势是结构性属性,不是概率性。
### 4.3 复杂度
每个请求:
- `Route``O(|D|)`(每个候选 D 算一次 score。生产规模下 `|D| ≤ 8`,主要开销在 Python 层,≪ 1 ms。
- `Admit`D scheduler 内部 O(1)(查自己的 bookkeeping无全局锁
- Router 层的单请求总开销:`O(|D|)` 计算 + 1 次到目标 D 的 HTTP RTTloopback 亚毫秒,跨机数据中心约 1 ms
---
## 5. 与 baseline 的对比
| 性质 | Vanilla pd-disagg | DPcache-aware | **KVC-Router**(本文) |
|---|---|---|---|
| P/D 分离 | 是(`|P| + |D|` GPU | 否(每个 worker fused P+D | 是 |
| 跨 turn cache locality | 无(每个请求都 P→D 传 KV | 仅在单 fused worker 内部走 hash prefix 路由 | session 钉在某 D 上,本地 append-prefill |
| 同 session cache 集中度 | 无 | 散到 `|D|` 个 worker每个占 1/|D| | 集中在一个 D整段常驻 |
| 最坏 turn-2 prefill 工作量 | 完整 input 经 P→mooncake→D | 在目标 worker 上做完整 prefill带 prefix cache 命中) | 本地 `append_len ≤ τ_append` tokens |
| 容量感知 admission | 无router 盲发) | 隐式靠 worker 队列深度 | 显式的 per-D `Admit()` 决策 |
| Migration 机制 | N/A | N/A | 带 reset-on-success 的 reject-counter blacklist |
| Idle prefill 成本 | 是——P 永远在算 | 否 | 是——P 只在 cache miss 时启用(本工作 SWE-Bench 评测下约 8% 请求) |
KVC 的关键架构权衡:**用 P 端 GPU 闲置换 D 端 TTFT 稳定性**。在 per-session cache 复用率高的 agentic workload 上Inferact 的 Codex trace 报告 94.2% cache hit我们的 SWE-Bench replay 实测 91.6% Direct 命中),这个交换显著有利。在 session 短或 cache hit 低的 workload 上权衡反转、DP 胜出。
---
## 6. 符号速查表
| 符号 | 含义 |
|---|---|
| `P, D` | Prefill / Decode worker 池 |
| `s(r), t(r)` | 请求 r 的 session id 与 turn index |
| `prefix_hashes(r)` | r 输入 prefix 的 KV block hash |
| `append_len(r)` | r 中新增(未缓存)部分的 token 数 |
| `Σ.resident[d]` | Router 对 d 缓存 block 集合的估计 |
| `Σ.pin[s]` | session s 最近一次成功的 D |
| `Σ.rejects[s,d]` | per-(s,d) 的 admission 拒绝计数 |
| `α` | sticky bonus 权重(默认 1 |
| `τ_reject` | migration 阈值(默认 3 |
| `τ_append` | Direct 路径允许的 max append 大小v2 默认 8192 |
| `K_d` | D worker d 的 KV 池预算 |
| `ρ` | 容量高水位(默认 0.95 |
| `ε` | fallback 重试上限(默认 `|D| 1` |
| `δ(r)` | 路由决策:`Direct(d)` / `Seed(d)` / `Reseed(d)` / `Fallback(p, d)` |
---
## 7. 本工作评测中实际使用的默认参数
| 参数 | 取值 | 说明 |
|---|---|---|
| `|P|, |D|` | 1, 31P3D 配置) | 单机 4× H100 80GB |
| `α` | 1 | |
| `τ_reject` | 3 | |
| `τ_append` | 8192 | v2 调优后取值v0/v1 用 2048 |
| `K_d` | 92104 tokens | SGLang 按 `mem_fraction_static=0.835` 自动算出 |
| `ρ` | 隐式 ~0.95 | 由 SGLang 的 `max_total_num_tokens` 强制 |
| `ε` | 2 | `|D| 1 = 2` |
| 每次 run 的 session 数 | 52 | SWE-Bench 50sess trace |
| 总请求数 | 4449 | |
| Time-scale | 1.0(真实 trace 时序) | |
| 并发 | 32 | |
---
## 8. Anti-patternsKVC **不**是什么)
1. **KVC 不仅仅是 kv-aware routing**。DP 和 KVC 都可以跑 `kv-aware` policyKVC 在此之上加了三件事:(i) session 钉定,(ii) worker 端 admission(iii) 带 reset-on-success 的 migration。如果在比较 "KVC vs DP" 时缺这三个要素的任何一个,**测的就不是 KVC 与 DP 的差异**。
2. **KVC 在 policy 项里不直接感知容量**`Route` 不查 per-D 容量;容量感知完全经由 `Admit` 拒绝来传导。我们刻意做了这层分层——把容量判断放进 `Route` 会引入"换 D"的决策空间,导致 orphan KV 滞留问题。
3. **KVC 不保证 load balance**。一个 session 若能舒服地装在某个 D 上,可能永远钉在那里,而其它 D 大部分时间空闲。在低容量压力下这是设计意图;高压力下 Theorem 1 的 migration 会触发再均衡。
4. **`Fallback` 不是"降级路径"**。它和 vanilla pd-disagg 请求结构性等价延迟特征相同。KVC 的价值在于让 Fallback 占比在典型 agentic workload 下 ≪ 10%。
---
## 9. 公开问题reviewer 关注点)
以下问题在当前评测中尚未解决,主动列出以保持透明:
1. **Session 钉定相对于纯 P/D disaggregation 的边际贡献是多少?** 需要 `naive 1P3D` 对照实验vanilla SGLang xPyD不带 KVC 层)——仓库当前缺失(见 `docs/V2_DEEP_ANALYSIS_ZH.md §4.7`)。
2. **Algorithm 3 在更高压下行为如何**(例如 ts=10 加速、session 数 ≫ |D|·K_d/peak_input当前 ts=1 评测对应真实 agentic 区间,但算法在更高负载下的鲁棒性未经实验验证。
3. **真 RDMA 下的 reseed 代价**:本次评测的 37 s reseed 延迟由两段组成——P 端 re-prefill1.5-3s+ P→D mooncake transfer1.5-4s。当前 sweep 用的是 TCP loopback启用 IB/RoCE节点有 mlx5_0/_1 @ 200 Gb/s × 2 active需在 sweep 加 `--force-rdma --ib-device mlx5_0`)只能压缩 transfer 段到 ~200ms**不动 re-prefill 段**。预期 TTFT p99 从 1.28s 降到 ~0.7s(仍输 DP 0.43s)。待独立验证。
4. **D→P 增量 KV 同步(核心 future-work 缺口)**reseed 长尾的真正消除需要让 P 端 backup 跟上 D 的 direct-to-D append 增长。经独立 forensic 审查,**当前代码、vendored SGLang、mooncake 三层均无 D→P KV transfer 实现**mooncake `MooncakeKVManager` 是 PREFILL=sender / DECODE=receiver 的硬角色分支(`add_transfer_request` 上有 `assert disaggregation_mode == PREFILL` 硬约束),`BaseKVSender` / `BaseKVReceiver` 抽象无 bidirectional slot`session_aware_cache.release_session` 在驱逐时只调 `kv_pool_allocator.free()` 无出站,`_commit_prefill_backup_residency` 唯一 caller 是 seed/reseed 路径;`capacity-backup` policy 的真实语义只是"reseed 完不关 P streaming session"——backup 是 seed-time 的静态快照,不随 direct-to-D append 同步。要实现 D→P 增量同步,工程量 ~1-2 周,最难的不是 mooncake 加 D-sender / P-receiver 角色(~400 LOC而是 **SGLang radix tree 改成允许从外部 worker 喂数据**——radix cache 当前假设单一生产者(本 worker model 输出)。这是论文里最值得做的 contribution 之一。
5. **v2 代码路径下的确定性**v0 代码库的 ts=1 N=3 categorical 确定性已经证实;新增的 reset-on-success 分支与 threshold=8192 路径未被独立 re-validate。两个额外的 N=1 run 即可解决。
---
## 10. 论文引用建议
论文中提到本算法时建议表述:
> "We use the KVC-Router scheduling algorithm (Algorithms 13 of [our paper], formally defined in our supplementary materials). The router selects a decode worker by lexicographic scoring on `(overlap+α·sticky, sticky, inflight, assigned)` (Algorithm 1), defers the admission decision to the chosen worker via a synchronous RPC (Algorithm 2), and maintains a per-(session, decode worker) rejection counter that is reset on every successful Direct admission (Algorithm 3). This last detail — reset-on-success — is what distinguishes our v2 from the unstable v1 implementation that exhibits self-amplifying session thrashing."
---
**附录 A — 算法步骤到代码实现的对照**
| 算法步骤 | 文件 | 符号 |
|---|---|---|
| `Route` 第 511 行 | `policies.py:189202` | `KvAwarePolicy.select` 内层循环 |
| `Route` 第 14 行blacklist 过滤 + 退化分支) | `policies.py:182187, 204211` | `migration_reject_threshold``select` 的 fallback |
| `Admit` | `third_party/sglang/python/sglang/srt/managers/scheduler.py` | `handle_admit_direct_append_request` |
| `Dispatch` 第 8 行reset-on-success | `replay.py: _run_request` | finish 路径中的 reset |
| `Dispatch` 第 21 行(记录 reject | `replay.py: _run_request` | `state.record_admission_reject(...)` |
| 超参数 `τ_append` | CLI flag | `--kvcache-direct-max-uncached-tokens` |
| 超参数 `τ_reject` | CLI flag | `--kvcache-migration-reject-threshold` |

View File

@@ -0,0 +1,283 @@
# Migration v1 实验发现blacklist 永久性导致 thrashing
**日期**2026-05-08
**状态**v1 run 进行中(~23% 完成时的中期分析)
**前置文档**
- `docs/REFACTOR_PLAN_V1_ZH.md` §6.2v1 设计)
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` §2.1§1 starvation claim
**触发**v1 实现的 session migrationrejection blacklist 机制)部署后,观测到 session-level thrashing——某些 session 在 3 个 D 之间 round-robin 高达 75-116 次。本文记录中期数据、根因诊断、v2 设计。
---
## 0. TL;DR
1. **v1 修复了 §1 starvation 但引入了新的 thrashing 失效模式**——不是 admission 过严,是 blacklist 永久累积的设计 bug
2. **核心证据**session 6880 在 decode-1 上稳定 70 turns然后某瞬时 burst 把 reject 计数累积到阈值,被永久 blacklist之后陷入 3-D 间 round-robin 死循环
3. **85% admission 拒绝是 `session-not-resident`**——非 D 真容量问题,而是迁移后"新 D 第一次见你"的正常语义
4. **v2 设计**reset-on-success 让 reject 计数在成功 turn 后清零,只有**持续**失败才迁移
5. **深层观察**baseline 的"100% pin 但稳定"可能比"分布均匀但 thrashing"更好——糟糕的优化可能比不优化还糟
---
## 1. v1 实施回顾
### 1.1 改动文件
- `src/agentic_pd_hybrid/policies.py``RoutingState.session_d_rejects` Counter`KvAwarePolicy.migration_reject_threshold=3` skip blacklisted Ddegenerate fallback 选最少拒的 D
- `src/agentic_pd_hybrid/replay.py``_run_request` 末尾 `state.record_admission_reject(sess, D)`(基于 execution_mode 子串匹配);`_fallthrough_reason``pd-router-fallback-large-append-*` 拆成 `session-not-resident` / `real-large-append` / 等
- CLI / benchmark wiring
### 1.2 v1 假设(事后看部分错误)
- "reject 计数 + 阈值 3 = 容忍短期波动 + 持续失败迁移" ← **错**counter 永久增长导致迁移成必然
- "迁移到新 D 后 session 在新 D 稳定下来" ← **部分错**,迁移到的新 D 也很可能很快 reject
- "session-not-resident 不会触发计数" ← **大致对**,但下游 fallback 可能间接触发
---
## 2. 中期数据1023/4449 reqs~23%
### 2.1 头部指标 vs baseline
| 指标 | baseline kvc_1p3d_run1 | v1中期 |
|---|---:|---:|
| Per-D 调用分布 | 1502/1445/1502±3.8%| 796/785/779**±1.1%**,更均衡)|
| Per-D 峰值 token_usage | 0.99/0.99/0.99 | 0.31/0.30/0.00**容量充裕**,未顶到 1.00|
| KVTransferError | 5全程| 6中期趋势相近|
| 已见 sessions | 52全程| 29中期|
**好的方面**
- 负载均衡度跃升±26%→±1.1% if normalized
- D 容量从未饱和——§2 假设的"D drain time"机制配合 ts=1 充分发挥
- 0 sessions 永久 stuck 在饿死状态
### 2.2 Migration 触发情况(已见 29 sessions
| 类别 | 数量 | 占比 |
|---|---:|---:|
| 仍 pin 在 1 个 D | 9 | 31% |
| 触碰 2 个 D | 3 | 10% |
| **触碰所有 3 个 D** | **17** | **59%** |
**D-切换次数分布**
- mean = 26 次/session
- median = 16 次
- **max = 116 次**
- 15 sessions 切换 >10 次(明显 thrashing
- **6 sessions 切换 >50 次**(严重 thrashing
---
## 3. 根因诊断session 6880 的轨迹
### 3.1 数据
```
turn 0-70: 全部在 decode-1 (71-turn 稳定 streak) ← §1 baseline 行为
turn 71-150: 在 3 个 D 间剧烈 thrashing
decode-0: 26 个短 streak
decode-1: 25 个短 streak
decode-2: 25 个短 streak
平均 streak 长度 = 2 turns
total streaks = 76
```
### 3.2 解读
**前 70 turn 完美稳定**session 6880 在 decode-1 上正常运行 70 个 turn每次都成功是 baseline §1 "100% pin" 的复现——稳定但不公平(其他 session 没分到 decode-1 的资源)。
**第 71 turn 后崩溃**
1. 某个瞬时 burst其他 session 的活动?)让 decode-1 短暂饱和
2. session 6880 在 decode-1 上连续 3 次被 admission 拒(`no-space``d-session-cap`
3. v1 的 `state.session_d_rejects[(6880, decode-1)]` 累积到 3 → blacklist
4. policy 改选 decode-0 → 同样发生 → blacklist
5. 改选 decode-2 → 同样 → blacklist
6. **3 D 全部 blacklisted** → degenerate fallback 在 3 D 间 round-robin
7. 每次 round-robin 又触发新 reject → 计数继续涨 → 永远在 thrashing 死循环
### 3.3 admission 数据交叉验证
中期 1932 admission events 解构:
| mode × can_admit × reason | count |
|---|---:|
| `direct_append, True, None` | 1721成功|
| `direct_append, False, session-not-resident` | **62** |
| `seed, True, None` | 142成功|
| `seed, False, no-space` | **11** |
**只有 11 个 "no-space" 才是真容量拒绝**(占总 admission 的 0.6%。62 个 "session-not-resident" 是迁移后"新 D 第一次见你"的正常语义。
但因为 v1 用 `_is_admission_rejection_mode` 通过 execution_mode 子串匹配,下游 fallback chain 会把 `session-not-resident` 也间接累积到计数器fallback 链路本身可能触发 session-cap
---
## 4. 设计 bug 三层
### 4.1 Bug 1blacklist 永久性
```python
# policies.py 当前实现
if rejects >= self.migration_reject_threshold:
continue # skip this D forever
```
`session_d_rejects[(sess, D)]` 是单调递增 Counter。一旦达到阈值**永远**被 skip。但 D 的容量是动态的——70 个 turn 后短暂饱和不代表它后续不能服务这个 session。
### 4.2 Bug 2degenerate fallback 加剧问题
当所有 D 都被 blacklist
```python
best_decode_worker_id = min(
(w.worker_id for w in topology.route_workers),
key=lambda wid: state.session_d_rejects.get((sess, wid), 0),
)
```
选"最少被拒"的 D。但每次 fallback 又增加该 D 的计数 → 下次选另一个 D → 形成完美 round-robin永远走不出 thrashing。
### 4.3 Bug 3信号归并粗糙
`_is_admission_rejection_mode` 子串匹配 `session-cap` / `no-d-capacity` / `d-backpressure`,但执行链路可能这样:
```
direct_append → session-not-resident85% 占比,正常迁移后语义)
→ fallback 试 seed
→ seed admit ok142/153 = 93%)→ execution_mode = pd-router-d-session-reseed-*(不计 reject
→ seed no-space11/153 = 7%)→ execution_mode = pd-router-fallback-X-no-d-capacity计 reject
```
绝大多数 fallback 不会触发 reject 计数。但 thrashing 一旦开始,很容易踩到那 7% no-space 路径calculator 增长一次。15+ 次 thrashing 后,单 D 计数累到 3 完全可能。
**所以设计 bug 不在信号粗糙,而在永久累积 + degenerate round-robin。**
---
## 5. 深层观察:稳定 vs 公平的 trade-off
| | baselinev0| v1 |
|---|---|---|
| 公平性 | 18/52 永久饿死 | 0 永久饿死 |
| 稳定性 | 100% pin结构稳定| 6/29 严重 thrashing |
| Per-D 负载均衡 | ±26% | ±1.1% |
| 大 session 体验 | 慢但稳定(每 turn 都走 fallback ~1.0s| 不稳定 + 频繁 D 切换 + 丢 KV state |
**预想反直觉的结果**v1 在头部指标per-D 均衡)赢,但在 session 体验可能输——
- baseline 的 fallback 路径有稳定 ~1s latency
- v1 的 thrashing session 每次 D 切换都 close 旧 session、丢 KV、新 D 上重新建立——有可能 latency 反而更高
需要等 run 结束的 lat mean / TTFT mean 数据验证。**糟糕的优化可能比不优化还糟。**
---
## 6. v2 设计
按 ROI 排序的修复层。**先做 #1,验证后再决定是否需要 #2/#3**。
### 6.1 v2-fix-1reset-on-success最高 ROI
```python
# replay.py _run_request 末尾,在 state.finish 后
if execution.execution_mode == "kvcache-direct-to-d-session":
# 这次 direct-to-D 成功 = D-X 仍能服务这个 session
# 清零累积的 reject 计数(消除永久 blacklist
state.session_d_rejects[(request.session_id, decision.decode_worker_id)] = 0
```
**预测效果**
- session 6880 在 decode-1 上 70 个成功 turn 把计数反复清零
- 即使中间出现 1-2 次瞬时 reject下次成功立刻清零
- 只有**持续**失败reject 后 reject 后 reject没有夹杂 success才能累到阈值
- 真饿死的 session如 35680/39360 input >92K才会触发迁移
**工程量**~5 行代码 + 1 个 smoke + 1 个完整 run~5.5h
### 6.2 v2-fix-2sliding window如果 #1 不够)
`Counter` 改成 `dict[(sess, D), deque[float]]` 存最近 K 次拒绝时间戳。判断时用最近 N 秒(或 N 个 turn内的次数。
更稳健但更复杂。**若 #1 已能彻底解决 thrashing跳过此项。**
### 6.3 v2-fix-3reject 类型分离(如果 #1 + #2 不够)
把 admission reason 显式传到 _run_request区分
- `no-space` / `session-cap` / `backpressure` → 计 reject
- `session-not-resident` → 不计
需改 `ExecutionResult``admission_reject_reason` 字段,并在 fallback 链路传递。**不在第一轮**——先看 #1 是否够用。
### 6.4 v2 应保留的 v1 设计
- 阈值 3不变
- `record_admission_reject` 的子串匹配(不变)
- 新 fallback labels`session-not-resident` 等)(不变)
- degenerate fallback 选最少拒的 D不变但因为 reset-on-success 几乎不会触发到此分支)
---
## 7. 实验计划
| 阶段 | 动作 | 时间 |
|---|---|---|
| 1 | 等 v1 run 完成ETA ~16:30| 自然 |
| 2 | 跑 analyzer 量化 v1 thrashing 实际代价 | 5 min |
| 3 | 实现 v2-fix-1reset-on-success| 30 min |
| 4 | smoke test | 10 min |
| 5 | 完整 v2 runKVC 1P3D ts=1 N=1| ~5.5h |
| 6 | 三方对比baseline / v1 / v2 | 30 min |
| 7 | 决定是否需要 v2-fix-2 / v2-fix-3 | |
---
## 8. 三方对比预测(待数据验证)
| 指标 | baselinev0| v1thrashing| **v2self-healing 预测)** |
|---|---:|---:|---:|
| Errors | 5 | ? | 2-5仅 35680/39360 等真容量超限)|
| Per-D 均衡 | ±26% | **±1.1%** | ±5-10%(部分 pin 仍 sticky|
| Direct-to-D rate | 42.8% | ?(可能因 thrash 反而下降)| **65-75%**(持续 affinity转换 §1 fallback|
| Lat mean | 1.574s | ?(可能因 thrash 上升)| **1.30-1.45s**(达到 4DP 1.443s 水平)|
| TTFT mean | 0.244s | ? | **0.10-0.15s** |
| 最大 D-switches/session | 0 | 116 | <10仅真饿死 session|
| Sessions 永久饿死 | 18 | 0 | 2-3仅真容量超限|
预测核心v2 应该结合 baseline 的稳定性70-turn streak 应保留+ v1 的公平性无永久饿死消除 v1 thrashing 副作用
---
## 9. 局限与未验证
1. **v1 中期数据 (23%) 推测**完整数据可能改变 thrashing 严重性的判断
2. **session 6880 trajectory 的崩溃机理是推断**基于 admission events 数据 + streak 模式但没有直接日志证明 reject 计数何时跨阈值需要在 v2 instrument 输出
3. **reset-on-success 的预测效果未验证**基于"70 turn 成功" + "1-2 次瞬时 reject" 的假设如果 burst 持续多 turn仍可能跨阈值
4. **可能还有未发现的设计 bug**v2 也许还会暴露新问题
5. **三方对比需 same trace + same scale + same ts=1**baseline 已有 N=3v1/v2 N=1ts=1 确定性 N=1 可信
---
## 10. 给 TEAM_REPORT 和 REFACTOR_PLAN_V1 的更新建议
完成 v2 验证后
1. `TEAM_REPORT` §3 ts=1 验证更新章节加入 §3.3 "Migration mechanism evolution: v0 v1 v2"
2. `REFACTOR_PLAN_V1` §6.2 标注实施反思——预设的 "rejection blacklist" 设计漏掉了 reset-on-success 这条
3. 在新文档 `docs/POLICY_DESIGN_PRINCIPLES_ZH.md` 提炼出原则"任何会累积的代价机制必须配 healing/decay 机制否则会陷入 self-amplifying 失效模式"
---
## 附录 A本文数据来源
| 章节 | 数据源 |
|---|---|
| §2 | `outputs/qwen3-30b-tp1-ts1-migration-v1/kvcache-centric-*/` 中期日志 |
| §3.1 | `structural/session-d-binding.jsonl` turn 序列 |
| §3.3 | `structural/admission-events.jsonl` mode/reason 交叉表 |
## 附录 B相关代码位置
| 内容 | 位置 |
|---|---|
| RoutingState.session_d_rejects | `src/agentic_pd_hybrid/policies.py:46` |
| KvAwarePolicy.select 跳过 blacklisted D | `src/agentic_pd_hybrid/policies.py:155-162` |
| Degenerate fallback 选最少拒的 D | `src/agentic_pd_hybrid/policies.py:184-192` |
| record_admission_reject 触发位置 | `src/agentic_pd_hybrid/replay.py:359-364`_run_request |
| _is_admission_rejection_mode 子串集合 | `src/agentic_pd_hybrid/replay.py` `_ADMISSION_REJECTION_SUBSTRINGS` |
| _fallthrough_reason 分类 | `src/agentic_pd_hybrid/replay.py` `_fallthrough_reason` |

385
docs/REFACTOR_PLAN_V1_ZH.md Normal file
View File

@@ -0,0 +1,385 @@
# Refactor Plan v1基于 ts=1 验证后的重构方向
**日期**2026-05-08
**前置文档**
- `docs/REFACTOR_PLAN_ZH.md`v0已被本文 supersede——v0 的 backpressure 切入点结论已撤回)
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md`(包含 §1-§7 结构性问题清单)
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md`ts=10 数据下的早期验证)
**触发**`outputs/qwen3-30b-tp1-ts1-validation/` 4 个 run 完成KVC 1P3D × N=3 + 4DP CA × 1全部 ts=1
**目的**:把 ts=1 验证结果落到具体的重构决策——哪些事必须做、哪些事不要再做、KVC 项目本身是否需要重新定义价值主张
---
## 0. TL;DR
1. **ts=10 失真是真的,影响 5-10×**——KVC 在 ts=10 灾难性输 DP 是 benchmark artifact不是机制本身有问题
2. **ts=1 同 scale 下 KVC ≈ DP**lat mean 差 9%TTFT 差 47%errors 双 0
3. **TEAM_REPORT 的 §1session pin 不公平)是真问题,但代价从 6× 降到 ~2×**——仍是唯一值得做的 KVC 优化
4. **TEAM_REPORT 的 §2/§3/§4/§5 大多是 ts=10 高压 artifact**——ts=1 下要么不显著、要么自然吸收
5. **N=1 不可信是 ts=10 现象**——ts=1 下系统在 categorical 层面完全确定routing/admission/errors 三次 run 完全相同)
**项目落到情景 BKVC ≈ DP**——三种 forward 路径任团队决策(见 §6
---
## 1. ts=1 验证数据
### 1.1 实验配置
| 项 | 值 |
|---|---|
| Trace | `outputs/qwen35-swebench-50sess.jsonl`4449 reqs / 52 sessions |
| 模型 | Qwen3-30B-A3B-Instruct-2507TP1 |
| 硬件 | 单机 4× H100 80GB原始 ts=10 实验是 8 GPU本次缩配 |
| Time-scale | 1真实 trace 时序inter-turn gap p50 = 2.5s |
| Concurrency | 32 |
| KVC 配置 | 1P3Dpolicy=kv-awareadmission=workerseed-min-turn=1prefill-priority-eviction |
| DP 配置 | 4-way colopolicy=kv-awarecache-aware |
| 输出根 | `outputs/qwen3-30b-tp1-ts1-validation/` |
### 1.2 Headline 对比
| Metric | KVC 1P3D ts=1N=3 均值)| 4DP ts=1 | Delta |
|---|---:|---:|---:|
| **真实 mechanism errors** | **0** | **0** | 平 |
| 报告 errors口径不一致见 §1.3 | 5 | 0 | |
| Lat mean | 1.574s | **1.443s** | DP 优 9% |
| Lat p50 | 0.810s | **0.659s** | DP 优 19% |
| Lat p90 | 3.796s | **3.641s** | DP 优 4% |
| Lat p99 | 8.722s | **8.433s** | DP 优 3% |
| TTFT mean | 0.244s | **0.129s** | DP 优 47% |
| TTFT p50 | 0.122s | **0.090s** | DP 优 26% |
| TTFT p90 | 0.572s | **0.252s** | DP 优 56% |
| Per-worker spread | ±3.8% (3D) | ±3.1% (4 direct) | 接近 |
### 1.3 KVC 5 errors 的真实身份
DP 的同 5 个 (sess, turn) 也"失败"——但 metrics 口径不同:
```
KVC: 计入 error_count
DP: metrics 记 error=OK + finish_reason={'type':'abort', 'message':'Input length (X) exceeds the maximum allowed length (87811)'}
```
| sess | turn | input_len | KVC max | DP max |
|---|---:|---:|---:|---:|
| 35680 | 132 | 91600 | 92098 (✓) | 87811 (✗) |
| 35680 | 133 | 92335 | 92098 (✗) | 87811 (✗) |
| 39360 | 137 | 91700 | 92098 (✓) | 87811 (✗) |
| 39360 | 138 | 92003 | 92098 (✓) | 87811 (✗) |
| 39360 | 139 | 92135 | 92098 (✗) | 87811 (✗) |
**两边都拒同样的请求**——区别只在于 KVC 在 P 端拒KV 池满、DP 在 prefill 端拒max-input limit。**真实 mechanism 错误率KVC 0 / DP 0**。
### 1.4 ts=1 的确定性
KVC N=3 三次 run 跨 4449 records
| 维度 | 跨 run 差异 |
|---|---|
| `execution_mode` | **0 / 4449** records 不同 |
| `assigned_decode_node` | **0 / 4449** records 不同 |
| Errors5 个 sess/turn 对) | **完全相同** |
| 18 starved + 16 lucky session | **完全相同** |
| Per-D load (1502/1445/1502) | **完全相同** |
| Lat mean | 1.574 / 1.573 / 1.574**0.06%** 漂移)|
| Lat p50 | 0.811 / 0.809 / 0.812**0.4%** 漂移)|
| 单 request lat | abs p90 diff = 25ms |
**结论**:低压 / ts=1 区间下 KVC 系统在 categorical 层面(路由 / admission / 失败位置)**完全确定**,仅低层数值有 model 计算微抖动。
---
## 2. 对 TEAM_REPORT §1-§7 的修订
| § | TEAM_REPORT 原 claim | TEAM_REPORT 原优先级 | ts=1 验证后状态 | **修订优先级** |
|---|---|---|---|---|
| §2.1 | session pin + 容量盲选 → 25% 饿死 | **P0** | ✅ 结构性问题仍在18/52 session 永久 pin但代价从 6× 慢降到 ~2× | **P0**(唯一值得做的 KVC 优化)|
| §2.2 | D-side LRU 跟不上 → 8% errors | **P0** | ⚠️ D 仍瞬时顶到 token_usage=1.00,但**ts=1 下 drain time 自然吸收**——0 KVTransferError 雪崩vs ts=10 369 次) | **降级 P3**drain time 已解决症状)|
| §2.3 | 无 backpressure 通道 | P1已实现| ❌ ts=1 下 transfer cascade 不存在backpressure 无作用对象 | **冷藏**(代码留着,但默认 off|
| §2.4 | P-side round-robin 不感知 D 健康 → prefill-0/-1 错误差 180× | P1 | ⚠️ 1P 配置不可测ts=10 现象**高度怀疑也是 artifact**(错误本身在 ts=1 消失) | **存疑 / 重测后再说** |
| §2.5 | admission RPC 进 scheduler 主循环 → 1Hz polling 让 errors ↑46× | P2 | ❌ 是 ts=10 高压时的现象ts=1 下不显著 | **冷藏** |
| §2.6 | time-scale=10 失真 → 所有 KVC vs DP 结论可能被放大 | **P0** | ✅ **完全证实**74× errors↓, 8.7× TTFT↓, 7× per-D spread↓ | **DONE作为前置条件锁定** |
| §2.7 | execution_mode 标签命名错位 | P1 | ✅ 仍存在;本次 ts=1 又发现 `error_count` 在 KVC vs DP 口径不一致 | **P1**(纯 labeling 修复,~半天)|
| §2.8 | N=1 不可信 → 实验必 N≥3 | P2 | ⚠️ **是 ts=10 高压现象**——ts=1 下 N=1 categorical 完全确定 | **改写规则**:高压 N≥3 / 常规 N=1 |
| §2.9 | microbench 把 KVC 失效条件全规避 | | 仍成立 | **保留观察**(实验设计原则)|
---
## 3. v0 REFACTOR_PLAN 回顾
### 3.1 v0 做对的
- **唯一代码改动选 backpressure**:作为对 §2.3 的最小验证手段是合理的
- **预算 KISS**:用 8h GPU 验证 §1-§7思路正确
- **明确"P0 是 time-scale=1 baseline"**v0 的 §1 末尾就指出 "time-scale=1 验证为 P0 待办"——本次实验正是把这条做了
### 3.2 v0 的核心误判
| v0 假设 | 实际 |
|---|---|
| backpressure 是 §3 的最小验证 → 也是修复 | ts=1 下 §3 的症状transfer cascade不存在backpressure 无效 |
| 8h 预算够跑 ts=1 baseline + backpressure smoke | ts=1 单 run 5.5h4 run 全跑要 22h实际跑了 22h |
| §1 / §2 的修复"超出 KISS 边界",先验证不修 | 验证后发现 §1 是**唯一**值得做的真问题,应该早点把它纳入 |
### 3.3 v0 的 backpressure 代码命运
代码保留(`--enable-backpressure` 默认 off原因
- 不删除是因为如果未来跑高压 / 大 trace / 真 RDMA 失败回归到类 ts=10 区间,可能仍有用
- 但**不部署、不启用、不文档化为推荐配置**——避免给以后看到代码的人误导
---
## 4. 修订后的优先级矩阵
```
必做 建议做 不做
──────── ──────── ────────
ts=1 必修 §1 capacity-aware (空) §2 / §3 / §4 / §5
policy + migration 的 ts=10 fix
ts=1 nice §2.7 metrics 标签 (空) §2.8 N≥3 严苛规则
to have 统一口径 (改成"高压 N≥3"
文档 §3 写入 TEAM v0 标记 superseded ts=10 数据归档
REPORT 更新 (但保留可追溯性)
```
**唯一进入"必做工程"列表的是 §1**。其他全是文档或冷藏。
---
## 5. KVC vs DP 拆分到 path-level 看真实差距
理解 §1 的 ROI 必须先看 path-level不是整体均值
### 5.1 KVC 内部 path 性能(来自 ts=1 N=3 一致数据)
| Path | n | 占比 | Lat p50 | TTFT p50 |
|---|---:|---:|---:|---:|
| `kvcache-direct-to-d-session`(快路径)| 1903 | **42.8%** | **0.475s** | **0.042s** |
| `pd-router-fallback-large-append-session-cap`(慢路径)| 2409 | **54.2%** | 1.04s | 0.32s |
| `pd-router-turn1-seed`(每 session 一次)| 52 | 1.2% | 0.375s | 0.057s |
| 其余 | 85 | 1.8% | 多种 | 多种 |
### 5.2 DP 全部 path单一
| Path | n | 占比 | Lat p50 | TTFT p50 |
|---|---:|---:|---:|---:|
| `dp-colo-router` | 4449 | 100% | 0.659s | **0.090s** |
### 5.3 路径级对比
| | KVC direct | KVC fallback | DP |
|---|---|---|---|
| Lat p50 | **0.475s**(赢 DP 28%| 1.04s(输 DP 58%| 0.659s |
| TTFT p50 | **0.042s**(赢 DP 53%| 0.317s(输 DP 252%| 0.090s |
**事实陈述**
- KVC 快路径 **明显快于** DP无 P 介入、无 mooncake transfer
- KVC 慢路径 **明显慢于** DPP→D transfer 开销没法摊到 turn 内)
- 当前 quick:slow = 42.8% : 54.2%——慢路径多 → 整体输 DP 9-47%
- 如果能把比例反过来到 70:25 或更好KVC 整体会赢 DP
**§1 的本质就是"为什么有 54% 进了慢路径"**——因为 18/52 session 被 pin 在容量紧张的 D 上,每次 admission 都拒。
---
## 6. 三种 forward 路径
> **更新2026-05-09**:情景 C **已实现**——见 `docs/V2_RESULTS_ZH.md`。下面三个分支保留作历史记录。
>
> | 情景 | 描述 | 状态 |
> |---|---|---|
> | A | KVC < DP接受现状转维护 | 不适用 |
> | B | KVC ≈ DP重新定义价值主张 | 不适用 |
> | **C** | **KVC > DP优化拉大差距** | **✓ 实现v2 在 7/8 头部指标击败 4DPTTFT mean -24%, p50 -54%, p90 -64%lat mean -0.8%, p50 -12.6%** |
>
> 关键修复:(1) reset-on-success blacklist decay消除 v1 thrashing(2) `--kvcache-direct-max-uncached-tokens` 2048→8192让 41% 大 append 走 direct-to-D 快路径。direct-to-D rate 从 baseline 42.8% 升到 v2 91.7%。
### 6.1 选项 A接受现状项目转维护
**判断**KVC 在 ts=1 + 同 scale 下 ≈ DP9% 慢、47% TTFT 慢),但**也没灾难性输**。如果项目目标是"验证 KV-aware routing 在 agentic 上是否可行",答案是 **可行但收益不显著**
**操作**
- 写 TEAM_REPORT §3 总结 ts=1 实验
- 把 ts=1 数据 + 4 个 run 归档到 `RESULTS_FROZEN_TS1.md`
- KVC 代码保留但标记 "experimental, not recommended for production"
- 团队转下一个项目方向(不是本文范围)
**成本**1 周文档收尾。
**风险**:放弃了 §1 修复后可能的 KVC > DP 上限。
### 6.2 选项 B做 §1目标让 KVC > DP
**判断**5.3 节的路径分析表明 KVC 快路径已经赢 DP如果把饿死 session 救回快路径KVC 整体可能赢 DP。
**具体改动**
#### 6.2.1 capacity-aware policy`policies.py:166-172`
当前评分(无容量项):
```python
score = (
overlap + sticky * self.sticky_bonus,
sticky,
inflight_penalty,
assignment_penalty,
)
```
提议改为:
```python
# 新增D 当前容量利用率(从 worker-mode admission 已能查到)
capacity_used = worker_capacity_used_ratio.get(worker.worker_id, 0.0)
# Hard cap容量 > X 时禁止该 D 进入候选
if capacity_used > HARD_CAP_THRESHOLD: # e.g. 0.85
continue
score = (
overlap_capped, # 原 overlap但限幅避免单个 D 永远赢
-capacity_used, # 新增二级排序项:偏好空闲 D
sticky,
inflight_penalty,
)
```
#### 6.2.2 session migration`replay.py` 或 policy 层)
当 session X 在 D-A 上连续被 admission 拒 N 次(如 N=3
- 主动 release X 在 D-A 上的 session state
- 允许下次 turn 把 X 路由到另一个 D
- 代价:丢失 D-A 上已积累的 KV——但 fallback 路径本来也丢了,**净收益正**
#### 6.2.3 metric 修复(`replay.py`
把"`pd-router-fallback-large-append-*`" 标签按真实原因细分:
- `session-not-resident-on-pinned-D`§1 主因)
- `real-large-append`>2048 阈值§2.7
- `session-was-evicted`(被 LRU 踢过)
- `session-cap-rejected`worker admission 拒)
让以后看 metrics 的人不再被名字误导。
#### 6.2.4 验证
- 每改动跑 KVC 1P3D ts=1 N=1categorical 确定,不需要 N=3
- 对比 baseline run1已有数据
- 关键指标:`kvcache-direct-to-d-session` 占比、整体 lat mean、TTFT mean
- 目标direct-to-D rate 从 42.8% 升到 > 70%、整体 lat 追平或赢 DP
**成本**3 天编码 + 5 天测试 + 2 天文档 ≈ 2 周。
**风险**
- session migration 可能导致 thrashA→B→A→B需要冷却时间机制
- capacity HARD_CAP 阈值需要 sweep 找最优
- 改完仍可能不赢 DP理论上限不知道
### 6.3 选项 C保留 KVC但寻找 KVC 真正赢的工作点
**判断**:当前 SWE-Bench 50 sessions × 30B 模型 × 4 GPU 是一个特定工作点。KVC 的设计初衷是"长 multi-turn session 的 KV 复用"——可能在某些其他工作点有显著优势。
**候选工作点**
- **更长 session>200 turns**:复用收益更大
- **更小模型(如 7B / 14B**mooncake transfer 占比更大KVC 节省更明显
- **更大 trace>200 sessions**DP 的 prefix cache 命中率会下降KVC 的 session-aware 优势放大
- **真实 RDMA非 mooncake TCP loopback**transfer 更快KVC 的 P→D 开销更小
**操作**
- 设计 1-2 个新 micro/macro benchmark
- 跑 KVC vs DP 对比
- 找到差距 > 30% 的工作点KVC 赢 / 输都是数据)
**成本**~1 个月trace 设计 + benchmark + 分析)。
**风险**:可能找不到 KVC 显著赢的工作点。
---
## 7. 推荐组合
按风险 / 收益排序:
1. **必做**(无论选 A/B/C
-`TEAM_REPORT §3 ts=1 验证更新`
-`metrics 标签口径`§2.7 + KVC/DP error_count 一致化)
- **冷藏 backpressure 代码**(不删但默认 off
- 把 v0 REFACTOR_PLAN 标 superseded
2. **强烈推荐**:选项 B 的 §6.2.1capacity-aware policy hard cap
- 工程量小(~1 天编码 + 1 天测试)
- 验证 §1 修复的真实收益是否如预测
- 如果 direct-to-D rate 不显著提升 → 把 §6.2.2 也加上
- 如果还不行 → 接受现状走选项 A
3. **看团队带宽**:选项 C 的工作点探索
- 不与 §6.2 冲突,可以并行
- 找到一个 KVC 真正赢的工作点会极大改变项目价值主张
---
## 8. 应该砍掉的事(明确列表)
| 事 | 砍的理由 |
|---|---|
| backpressure smoke sweepv0 计划的 4 run | ts=1 下 backpressure 无作用对象 |
| §2.5 admission API probe/commit 拆分 | 高压才显著,等找到 KVC 高压 workload 再说 |
| §2.2 D-side 分层 LRU evictionhot retract | drain time 自然吸收 |
| §2.4 P-side D-health-aware routing | 1P 测不出ts=10 现象高度存疑 |
| 大量 instrumentadmission-events / pool timeseries | 已经够了,先用现有数据 |
| 任何 ts=10 区间的优化 | 那是 benchmark artifact 主导的区间,不代表真实部署 |
| N≥3 实验作为硬规则 | 改写为"高压 N≥3常规 N=1 即可" |
---
## 9. 风险与未验证的假设
1. **4DP ts=1 是 N=1**:虽然 KVC ts=1 是确定性的DP 是新机制 N=1理论上需要 N≥3 验证。但 DP 在 ts=10 也是 0 errors / 1.43s mean行为相对 KVC 更稳定N=1 风险较小。**如选项 B 推进,建议补 N=2**。
2. **2 个 input-too-long session 是 trace 数据问题**:这两个 session35680、39360在 turn 132+ / 137+ 才超过 input limit。可能是 trace 生成时没控制好 max input。**应该独立把这两个 session 从 trace 移除或截断后重跑作为对照**。
3. **4 GPU 缩配 vs 8 GPU 原始**:本次 1P3D / 4DP 数据无法跨 8 GPU 原始数据直接比,需要在结论中明确。但 ts=1 + 同 scale 内部对比是干净的。
4. **mooncake TCP loopback**:所有 transfer 在单机 TCP 模拟下进行。生产 RDMA 下 KVC 的 transfer 开销可能显著降低KVC 优势可能扩大——这是 **选项 C 的一个候选维度**
5. **§1 修复是否真能让 direct-to-D 上升到 70%+ 是预测**:实际可能受 hash overlap 限制(即使 D 容量充裕,没有 prefix overlap 就走不了 direct-to-D。**需要 §6.2 验证后才知道天花板**。
6. **input-limit error 的 metrics 口径修复影响以后所有比较**:注意修改后 ts=10 历史数据的 error_count 也需要重算(或在分析时显式补偿)。
---
## 10. 决策点(需要团队确认)
请审阅后回答:
| # | 决策 | 选项 |
|---|---|---|
| D1 | 选哪条 forward 路径? | A维护/ B修 §1/ C探索 workload/ B+C |
| D2 | 写 TEAM_REPORT §3 ts=1 验证更新章节? | Yes / No |
| D3 | 把 v0 REFACTOR_PLAN 标 superseded | Yes / No |
| D4 | 删除 backpressure 代码 vs 冷藏? | 删 / 冷藏(默认 off|
| D5 | 修 metrics 标签口径§2.7 + error_count 一致化)? | Yes / No |
| D6 | 是否补 4DP ts=1 N=2 / N=3 做更稳的 baseline | Yes / No |
| D7 | 是否把 sess 35680 / 39360 从 trace 移除做"干净" baseline | Yes / No |
---
## 附录 A本文数据来源
| 章节 | 数据源 |
|---|---|
| §1.2-§1.4 | `outputs/qwen3-30b-tp1-ts1-validation/{kvc_1p3d_run{1,2,3},dp4}_{summary.json,metrics.jsonl}` |
| §1.4 跨 run 一致性 | per-record diff via `scripts/analysis/analyze_ts1_validation.py` + 临时 diff 脚本 |
| §5 path-level | metrics.jsonl 按 `execution_mode` 分组 |
| §2 §1-§7 修订 | `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` 原数据 + ts=1 新数据交叉对比 |
## 附录 B相关文档
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` — §1-§7 原结构性问题清单
- `docs/REFACTOR_PLAN_ZH.md` — v0 重构计划(本文 supersede
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析§1-§7 来源)
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
- `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 演进
- `docs/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(已 critic 修订)
- `scripts/sweep_ts1_kvc_n3_plus_dp.sh` — 本次 4 run sweep 脚本
- `scripts/analysis/analyze_ts1_validation.py` — 本次分析脚本
---
**作者注**:本文偏决策导向。如果要写更技术的 §1 capacity-aware policy 实现细节,应该在 D1 决策为 B 之后单独出一份 `IMPL_CAPACITY_AWARE_POLICY.md`

123
docs/REFACTOR_PLAN_ZH.md Normal file
View File

@@ -0,0 +1,123 @@
# Refactor Plan v0极简版
**日期**2026-05-06
**目标**:用最小改动 + 轻量实验,验证 `docs/AGENTIC_FIT_ANALYSIS_ZH.md` 提出的结构性缺陷是否真实存在、影响多大。
**预算**8h GPU 时间(约 4-6 次 ~30-60 min smoke run
**KISS 边界**:不动 SGLang `scheduler.py` 主循环结构;不引入新 mooncake 协议;不实现 cross-D session migration不做 admission probe/commit 拆分;不动 LRU eviction 策略。
## 计划结论(与用户已确认的)
回审 plan-v0 时发现两个原 Phase 1 改动**都不是 bug**
- `_estimate_session_resident_tokens` 返回 full prompt 是设计如此——所有需要"增量"的 call site 都已经做 `target - current` 减法(`replay.py:1247-1254``:1393-1394``:1490-1491`)。
- `decode_resident_blocks` 不缩减只是浪费几 MB 内存,**不影响 routing 决策**SWE trace 的 hash_ids 是 session-uniquepolicy 仍能正确选 D
最终极简版只做一件代码改动(**加 backpressure**+ 大量 instrumentation。
## 唯一代码改动Backpressure 信号
### 改动点 1SGLang `admit_direct_append` 响应增加两个字段
文件:`third_party/sglang/python/sglang/srt/managers/io_struct.py``scheduler.py`
```python
@dataclass
class DirectAppendAdmissionReqOutput:
... # 已有字段保留
recommended_pause_ms: int = 0 # 新增
queue_depth: int = 0 # 新增
```
`scheduler.py:admit_direct_append` 末尾计算 hint
```python
def _compute_backpressure_pause_hint(self) -> float:
depth = len(self.disagg_decode_transfer_queue.queue)
if depth < 8:
return 0.0
return min(2000.0, depth * 100.0) # 简单线性
```
### 改动点 2replay 端按 hint 退避
文件:`src/agentic_pd_hybrid/replay.py`
- `DecodeResidencyState` 新增 `pause_until_s: dict[str, float]`
- `_query_decode_direct_admission` 解析响应里的 `recommended_pause_ms`,更新 `pause_until_s[server_url] = now + pause_ms / 1000`
- 在调 `_invoke_router` / `_invoke_decode_session_direct` 前检查 `pause_until_s[decode_url]`,若 `now < pause_until` 则 sleep 到该时刻
### 改动点 3新 CLI flag
`src/agentic_pd_hybrid/cli.py``benchmark.py`
```
--enable-backpressure # 默认 false保留 baseline 行为
```
### 改动点 4观测日志
每个 run dir 新增三个 jsonl
- `admission-events.jsonl`:每次 admission RPCtimestamp, session, D, can_admit, queue_depth, pause_ms, latency_s, available_tokens, evicted_session_count
- `backpressure-events.jsonl`:每次实际 sleeptimestamp, D, sleep_ms, queue_depth_at_signal
- `session-d-binding.jsonl`:每个 session 第一次 open 在某 D 时记录timestamp, session, D, turn_id
## 实验矩阵8h 预算内)
按"先做 anchor再做单变量对照"排序。每行右侧是预估机时。
| ID | 配置 | 目的 | 机时 |
|---|---|---|---|
| **E0 (existing)** | v5 baselinetime-scale=10无 backpressure | Anchor已存在 `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/run1` | 0 |
| **E1** | v5 + backpressure ONtime-scale=10全 trace | 验证 Claim §3backpressure 是否能消除 KVTransferError 雪崩) | ~50 min |
| **E2** | v5 baselinetime-scale=1**短 trace**(前 12 sessions ≈ 1000 reqs | 验证 Claim §7time-scale=10 失真);不开 backpressure | ~60 min |
| **E3** | 8DP CAtime-scale=1同 E2 trace | E2 的对照——真实时序下 KVC 是否仍输 DP | ~60 min |
| **E4** | v5 + backpressuretime-scale=1同 E2 trace | backpressure 在真实时序下还有用吗? | ~60 min |
| **E5**(备选) | v5 baselinetime-scale=10**concurrency=4**,全 trace | 验证 Claim §1高并发是不是必要条件 | ~50 min |
4-5 个 run~3-5h。剩余预算给失败重跑/分析。
## 实验目标——回到 §1-§7 一一对照
| 文档 § | Claim | 由哪个 exp 证伪/支持 | 需要的指标 |
|---|---|---|---|
| §1 | Session 永久 pin + 容量盲选造成双峰 | 已有 E0 数据足够 | direct-to-D rate per session distribution |
| §2 | LRU 跟不上压力 | 已有 E0 logs 足够 + E1 看 backpressure 之后 trim/error 比例变化 | trim 事件数 vs OOM 数 |
| §3 | 没 backpressure 是雪崩源 | E0 vs E1 | KVTransferError 数、P99 latency |
| §4 | admission RPC 干扰 scheduler | 不在本轮实验范围(需要 admission probe 拆分才能验,不做) | |
| §5 | P-side 不感知 D 健康 | 已有 E0 logs 足够prefill-0 vs prefill-1 错误数) | per-P KVTransferError |
| §6 | (已撤回) | | |
| §7 | time-scale=10 失真 | E0 vs E2同 KVC不同 time-scaleE2 vs E3同 time-scaleKVC vs DP | latency 分布、direct-to-D rate |
## Final 实验报告交付
跑完后输出 `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md`,按 §1-§7 每条给出:
- **Claim 字面**
- **数据证据**(哪个 exp、哪个 metric
- **结论**:成立 / 部分成立 / 推翻
- **影响量化**:数字差异
- **不确定性**N=1 风险、其他 confounder
## 不做的事KISS 边界)
| 想做但不做 | 理由 |
|---|---|
| 跑 N=3 重复 | 8h 装不下single-run 可看大方向 |
| 全 sweep 参数 | 只调 time-scale 和 backpressure 一个 boolean |
| 改 LRU eviction | 不在本轮范围 |
| Cross-D migration | 不在本轮范围 |
| Admission probe/commit 拆分 | 不在本轮范围 |
| P-side D-health routing | 不在本轮范围 |
| 修两个"非 bug"estimate / aging | 验证后非真实 bug |
## 预期失败路径
- **GPU 资源紧张**smoke trace 进一步压缩(前 8 sessions / 600 reqs
- **time-scale=1 跑超 1.5h**:截断到 600s 内能完成的部分
- **backpressure 配错**:先用 sleep_ms = depth * 100 简单线性;调不通就回滚到 0无 backpressure
- **SGLang patch 编译错**:所有 patch 在 io_struct.py 和 scheduler.py 的少量行内,可单独 git restore
---
接下来:实现 → 跑 smoke → 写报告。

View File

@@ -0,0 +1,368 @@
# Reseed 慢路径现状与 D→P KV 同步缺口
**日期**2026-05-11
**对象**:项目团队 + 后续 paper reviewer
**性质**:基线现状落盘 + future-work 缺口定位
**前置文档**
- `docs/V2_DEEP_ANALYSIS_ZH.md` §3.2 §4.2reseed 路径在 v2 数据中的表现)
- `docs/KVC_ROUTER_ALGORITHM.md` §3 §9算法形式化 + open questions
**目的**:把"v2 的 reseed slow path 为什么慢、能不能用现有机制治、还差什么"三个问题落盘成单一参考文档,让团队不必再口头反复对齐,让论文 future-work 章节有可引用的基础。
---
## 0. TL;DR
1. KVC v2 在 SWE-Bench 测试中 8.3% 请求走非 direct-to-D 的 reseed/fallback 路径,**单次 reseed 实测 3-7s**TTFT p99 = 1.28s 全部来自这条路径)。
2. 启用真 RDMA节点有 mlx5_0/_1 @ 200 Gb/s × 2 active能把 reseed 的 transfer 段(~1.5-4s压到 ~200-400ms但**对 re-prefill 段(~1.5-3s无效**。预期 reseed 总时间从 3-7s 降到 1.7-3.2sTTFT p99 ~0.7s**仍输 DP0.43s**。
3. 真正消除 reseed 长尾必须实现 **D→P 增量 KV 同步**——让 P 端 backup 跟上 D 在 direct-to-D append 路径上累积的 KV避免 reseed 时重新跑 prefill kernel。
4. 经 Opus agent 独立 forensic 审查commit `9ccd853`+ 全分支 git 检索:**当前代码、vendored SGLang、mooncake 三层均无 D→P 实现**,作者也没有在其它分支偷偷开发——仓库总共只有 main旧 baseline+ kvc-debug-journey-v1-to-v4本工作分支两个分支main 还落后我们 18 个 commit。
5. `--kvcache-prefill-backup-policy capacity-backup` 这个 flag 看起来像 D→P 同步但**不是**——它的真实语义只是"reseed 完不关 P streaming session"P 端 KV 仍是 seed-time 的**静态快照**,不随 direct-to-D append 而增长。
6. 实现 D→P 增量同步的工程量 ~1-2 周最难的不是网络层mooncake 加 D-sender / P-receiver 角色 ~400 LOC而是 **SGLang radix tree 改成允许从外部 worker 喂数据**——radix cache 当前假设单一生产者。
---
## 1. 团队成员的三个质疑关键框架paper 引用建议保留原话)
这三条质疑出自 v2 完成后的对话审查,**直接戳穿了"启用 capacity-backup 就能消除 slow path"的一厢情愿**。每条都有代码层证据支持,**全部成立**。
### 质疑一P 节点的 pool 塞得下所有 backup 的 KV cache 吗?
**回答塞不下max 同时 backup ~1-2 个大 session。**
代码证据(`src/agentic_pd_hybrid/replay.py:1618-1620`
```python
max_backup_sessions = max(1, capacity_tokens // max(1, target_tokens * 2))
max_backup_sessions = min(max_backup_sessions, 4)
```
按 SWE workload 实测代入:
- P 池 `capacity_tokens` ≈ 92,104 tokensSGLang 启动时按 mem_fraction_static 自动分配)
- 典型 session peak input `target_tokens` ≈ 50,000-80,000 tokens
- 计算:`92K // (50K × 2) = 0``max(1, 0) = 1`
-**P 最多同时 backup 1 个大 session**
对照小 session
- target 20K`92K // 40K = 2` → backup 上限 2 个
- target 10K`92K // 20K = 4` → backup 上限 4 个(达到代码硬上限)
**capacity-backup 在真实 agentic 长 context workload 下只能救少数 session不是全员保险。**
### 质疑二P 上的 backup 是陈旧快照——49K 的 append 内容根本没经过 P
**回答:完全正确,这是 capacity-backup 设计上的致命缺陷。**
**用户提供的反例场景**(已成为 paper 中描述 slow path 的标准例子):
```
turn 0: P 做 prefill 1K tokens → 经 mooncake 传到 D → P 留 1K backup
turn 1-50: 全部走 direct-to-DD 上做 append-prefillKV 在 D 上从 1K 增长到 50K
↑↑↑ 关键:这 49K 的 append 内容tool 输出、user 消息、模型生成)
**从未流经 P 节点**。P 端 backup 锁在 1K 状态。
turn 51: D 出于某种原因(容量、迁移、显式驱逐)拒绝 → 触发 reseed
→ 即使 P 上有 backup也只是 turn-0 的 1K
→ 实际需要 D 上重建的是 50K当前完整 context
→ P 必须从 prompt 重新 prefill 49K 的差额
→ capacity-backup 节省的 compute 仅 ~2%
```
**代码证据**(独立 Opus agent forensic 审查commit `9ccd853`
1. 唯一更新 `session.prefill_resident_tokens` 的函数是 `_commit_prefill_backup_residency``replay.py:1483`
2. 这个函数的唯一 caller 是 `_invoke_kvcache_seeded_router``replay.py:2208`)—— 即 seed/reseed 路径
3. `_invoke_session_direct``replay.py:2719`direct-to-D 路径)只更新 `session.opened` / `resident_tokens` / `last_trace_request`**从不触碰任何 P 端字段**
4. `_commit_prefill_backup_residency` 内部用 `_estimate_session_resident_tokens(request)` 取的是**完整 request 的预估**,不是 append delta——所以连 bookkeeping 层面都不假设有增量更新
**`capacity-backup` 的真实语义只是"reseed 完之后跳过 `_close_prefill_session`"**`replay.py:2221`P 端 streaming session 保持 open 状态、KV 留在 P 的 radix tree 中。但**不存在任何机制让这份 KV 跟上 D 端的 append 增长**。
### 质疑三D 触发 reseed 后,本机旧 session 的 KV cache 是不是清空了P 做完 re-prefillKV 推到哪里?
**回答:是的,旧 KV 直接 free 掉P 重新 prefill 完之后推到 router 选的新 target D可能同 D可能换 D。中间没有"先 dump 到 P 再清"的快捷方式。**
#### D 端驱逐时的 KV 处理
代码证据(`replay.py:_close_decode_session`1539-1569 行;`session_aware_cache.py:release_session`250-276 行):
```python
# replay.py 端
async def _close_decode_session(..., evicting_for_capacity=False):
if not session.opened:
return
await _close_streaming_session(...) # 给 D 发关闭信号
# 从 D 的 resident bookkeeping 里删掉这个 session
session.opened = False
session.resident_tokens = 0
if evicting_for_capacity and not session.prefill_opened:
residency.decode_evictions_without_prefill_backup += 1
# SGLang 端session_aware_cache.py
def release_session(self, session_id):
# 解锁引用 + 直接 free KV slots
self.token_to_kv_pool_allocator.free(kv_indices)
# ↑ 没有序列化、没有外发、没有 D→P 通道
```
**D 驱逐 = 把 KV slot 直接归还给 token pool 分配器。完全没有任何 outbound 网络调用。**
#### Reseed 时 P→D 的目标选择
驱逐之后的 reseed 路径(`_invoke_kvcache_seeded_router``replay.py:2101`)走的是与 turn 0 完全一样的 P-mediated seeding
```
1. KvAwarePolicy.select() 选择一个 target D'(可能是同一个 D也可能因 migration 换 D
2. _invoke_kvcache_seeded_router 在 D' 上 open 一个 streaming session
3. 给 P 发完整 prompt → SGLang pd-router 让 P 做完整 prefill
4. P 的 prefill 完成后通过 mooncake 把 KV 一次性推到 D'
5. D' 上接收完毕session 重建完成decode 继续
```
**所以 P 做完 re-prefill 的 KV 推到 KvAwarePolicy 选的 target D'**——可能是:
- 同一个 D驱逐后重新接受
- 另一个 D如果 reject 计数累积触发 migration详见 KVC_ROUTER_ALGORITHM §3.3
无论哪种,**旧 D 的旧 KV 在新 KV 到达之前就已经被 free**。没有 D→D 的直接迁移路径,没有"先 dump 到 P 再推回"的快捷路径。
---
## 2. Reseed 路径的完整 step-by-step 现状
把上面三个质疑串成端到端流程,以下是 v2 当前 reseed 路径的**完整**操作序列。每一步都标注实测耗时与代码位置。
### 触发条件
下列任一发生时 router 走 reseed 路径(详见 `KVC_ROUTER_ALGORITHM.md §3.3`
- D 端 `Admit()` 返回 `can_admit=False`,原因为 `no-d-capacity` / `session-not-resident` / 等
- KvAwarePolicy.select 返回的 D 不再持有该 sessionmigration 触发)
- v1/v2 的 reject counter 累积让所有 D 都被 blacklist极少触发由 reset-on-success 保护)
### 端到端时间线
```
t=0 上游 agent 发出 turn N 请求input ~50Kappend ~2K
t=~5ms Router 的 KvAwarePolicy.select() 选 target D'O(|D|) Python 评分)
t=~10ms Router → D' 发 admit_direct_append RPC
t=~30ms D' 返回 can_admit=False, reason="session-not-resident"
或 "no-d-capacity"Algorithm 3 bump rejects[s, D']++
fallback chain 最多再试 ε-1 个 D对应 ε ~30ms 总额)
t=~100ms 所有 D 都被拒 / 选不到适合 D路径退化到 seeded router
t=~110ms Router 转 _invoke_kvcache_seeded_router
t=~120ms [可选] capacity-backup policy 下_reserve_prefill_backup_capacity()
检查 P 池容量,若不够先 LRU 驱逐别的 P backup session
t=~150ms P 上 open streaming sessionHTTP /session/open
t=~200ms 发完整 prompt 到 SGLang pd-router → 路由到 P
t=~250ms P 开始 prefill
↓ ←←← 大头 1P-side re-prefill 段
↓ P 必须 prefill 完整 ~50K tokens
↓ 即使 capacity-backup 开着P 的 backup 只有 turn-0 的 ~1K
↓ radix prefix cache 命中前 1K剩余 49K 重算
↓ 实测耗时:~1.5-3s @ Qwen3-30B TP1
t=~2000ms P 完成 prefillKV 进入 mooncake transfer 队列
t=~2050ms mooncake 开始 P→D' transfer
↓ ←←← 大头 2P→D mooncake transfer 段
↓ KV 张量 ~5-9 GB50K tokens × 2 bytes/token × layers × heads...
↓ **TCP loopback** 实测耗时:~1.5-4s
↓ ↑↑↑ 当前 sweep 未启用 RDMA走的是单机 lo 设备
↓ 若启用 IB RDMA @ 200 Gb/s理论 200-400ms
t=~4500ms transfer 完成D' 上 session 重建好
t=~4510ms D' 开始 decode小幅度 append-prefill 余下的 ~2K append + 生成)
t=~4550ms 首个 token 出来 → TTFT 测点
```
**单次 reseed 总耗时3-7s**(中位 ~2.5s 来自较小 sessionp99 ~7.7s 来自最大 session。**re-prefill 段与 transfer 段大致五五开**,受 session 大小影响。
### 这就是为什么 v2 的 TTFT p99 = 1.28s
8.3% slow path 走的是上面这条流水线,其中 reseed 路径(`pd-router-d-session-reseed`)单独占 3.4%150/4449 请求),构成 KVC TTFT p99 长尾的主要贡献。
---
## 3. 已审查的所有"看起来像 D→P 但其实不是"的代码
下面这些在搜索时容易误判成 D→P 实现,**全部经独立 audit 排除**
| 文件:行 | 看起来像 | 实际是 |
|---|---|---|
| `replay.py:1483 _commit_prefill_backup_residency` | "把 backup 提交到 P" | bookkeeping 函数,更新 `session.prefill_resident_tokens` 计数字段。不传输任何 KV 数据,只在 seed/reseed 完成后被调用。 |
| `replay.py:1572 _reserve_prefill_backup_capacity` | "预留 backup 空间" | 检查 P 池可用空间并按 LRU 驱逐别的 backup session 腾位置。不传 KV只调整 reservation 计数。 |
| `cli.py:182 --kvcache-prefill-backup-policy` | "backup 策略" | 只决定 reseed 完成后是否 `_close_prefill_session`。capacity-backup = 保留 P 端 streaming session 不关release-after-transfer = 立刻关闭。**两种策略下 P 的 KV 都是 seed-time 的静态快照**。 |
| `session_aware_cache.py:release_session` | "释放 session可能含外发" | 仅调 `kv_pool_allocator.free(kv_indices)`。零网络调用。 |
| `disaggregation/decode.py: start_decode_thread` | "decode 端线程,可能有出站" | 纯 receiver loop。处理入站 `AUX_DATA / CHUNK_READY / STAGING_REQ / KVPoll.Success`**没有出站 KV 传输分支**。 |
| `disaggregation/mooncake/conn.py:1563` | "传输请求添加" | `assert disaggregation_mode == PREFILL`——硬约束,只有 P 端能调。 |
| `mooncake.MooncakeKVSender` / `MooncakeKVReceiver` | "双向 sender / receiver" | 强角色化Sender 只在 PREFILL 模式实例化Receiver 只在 DECODE 模式。`BaseKVManager` 抽象无 bidirectional slot。 |
| `pd-router-d-session-reseed-after-eviction` execution_mode | "走 backup 的快路径" | 实际还是走完整 `_invoke_kvcache_seeded_router`P 完整 prefill + 完整 mooncake transfer只是 `_eviction_suffix()` 在 execution_mode 字符串末尾加了 "-after-prefill-backed-eviction" 标签。**没有任何 fast-path 优化**。v2 中仅 2/4449 请求走到这个标签。 |
---
## 4. D→P 增量同步:要做的是什么
完整 D→P 增量同步的设计目标:**让 P 端的 backup KV 在 direct-to-D append 完成后异步追上 D 端的 KV让 reseed 退化为单次 P→D transfer无需 P re-prefill**。
### 抽象数据流
```
当前:
direct-to-D append: D 本地 append-prefillP 端 backup 锁住不变
reseed: P re-prefill 完整 50K + P→D transfer 完整 50K
目标:
direct-to-D append: D 本地 append-prefill**同时**异步把新增的 KV 块推回 P
reseed: P→D' transfer 完整 50K (already up-to-date)
无需 P re-prefill
```
### 实现层面要改的事
按工程难度排序:
#### 4.1 Mooncake 双角色化(中等难度,~400 LOC
- `BaseKVSender` / `BaseKVReceiver` 抽象保留,但允许同一 worker 同时实例化两种角色
- `MooncakeKVManager.__init__` 把 PREFILL / DECODE 分支改成"role set",允许 worker 同时持有 sender 和 receiver
- 新增 `DecodeKVSender`D 端用于把 append KV 推回 P
- 新增 `PrefillKVReceiver`P 端用于接收 D 的 append KV
- 引入第二个 bootstrap channel避免与原 P→D 通道在 buffer pointer 协商上冲突)
#### 4.2 D 端 append commit hook容易
- 每次 `direct-to-D-session` 完成后,识别新写入的 KV 块D scheduler 在 commit 时知道)
- 入队 D→P 传输(异步,不阻塞 next request
- 标记 backup 是否成功送达 P用于后续 reseed 决策)
#### 4.3 P 端 radix tree 多生产者扩展(**最难,工程量主体**
**这是真正的架构 blocker**。SGLang 的 P 端 radix cache 当前假设:
- 单一生产者(本 worker 的 model 输出)
- 树插入只在 prefill / decode 完成时发生
- KV 索引由本 worker 的 token_to_kv_pool_allocator 分配
要让 P 接收 D 喂来的 KV 块,需要:
- 扩展 radix tree 节点的写入路径,允许"外部供给的 KV + token 序列"被插入
- 处理 KV 索引重映射D 的 slot 号在 P 上无意义)
- 处理 reference counting同一 session 可能既被本 worker 用、又被 D 喂回更新)
- 处理 eviction policy 协调P 端 radix LRU 不应让"被 D 喂入的 backup"先被驱逐)
- 处理 KV 数据格式的跨 worker 兼容(同样的 model layout应该是 trivial但需要测试
#### 4.4 agentic-pd-hybrid 端 hook容易
- `_invoke_session_direct` 完成后,新增一步:触发 D→P 同步 RPC异步
- `_invoke_kvcache_seeded_router` 在 reseed 触发前先 probe P 是否有 up-to-date backup若有跳过 re-prefill只做 P→D transfer
- 新增 CLI flag `--enable-d-to-p-sync`,默认 off保留 baseline 行为
- 新增 structural log channel 记录 D→P 同步事件 / 失败 / 延迟
### 实现完毕后的预期收益
| 指标 | 当前 (v2) | RDMA only | RDMA + D→P sync |
|---|---:|---:|---:|
| reseed re-prefill 段 | 1.5-3s | 1.5-3s不变 | **~0**(已有 up-to-date backup |
| reseed transfer 段 | 1.5-4s | 0.2-0.4s | 0.2-0.4s |
| reseed 总耗时 | 3-7s | 1.7-3.4s | **0.2-0.4s** |
| TTFT p99 | 1.285s | ~0.7s | **~0.4-0.5s**(与 DP 接近或胜过) |
| 8.4% slow path 占比 | 不变 | 不变 | 可能保持但单次代价大幅下降 |
→ 这就是 paper 里 future-work 应当声明的**"完整版 KVC 才能真正在 TTFT 全分位数上击败 DP"** 的路径。
---
## 5. 仓库分支审查(确认无作者私下实现)
`git ls-remote origin --refs` 完整结果:
```
9ccd853... refs/heads/kvc-debug-journey-v1-to-v4 ← 本工作分支(含本文档)
e9062b1... refs/heads/main ← baseline落后我们 18 commit
```
- **服务器只有 2 个分支****0 个 tag****0 个隐藏 ref**
- main 是更老的 baseline`_commit_prefill_backup_residency` 等同名函数,但语义与本工作分支一致——都是静态 backup无 D→P 同步
- 全 git 历史搜索 `D->P / d-to-p / decode.*prefill.*transfer / kv.*pushback / kv.*sync / incremental / mirror` 关键词,**唯一命中是 commit `9ccd853`**(本文档相关的 doc 改动)
- 唯一 remote 是 `origin``git@ipads.se.sjtu.edu.cn:wangjh/agentic-pd-hybrid.git`),无 upstream / fork
**作者没有在其它分支偷偷实现 D→P**。这块工作是真空。
---
## 6. 下一步
按 ROI 排序:
### 必做(落地下一阶段)
1. **新开 `feat/d-to-p-sync` 分支** 从当前 `kvc-debug-journey-v1-to-v4` 起步
2. 写设计文档 `docs/D_TO_P_SYNC_DESIGN_ZH.md`
- 包括上面 §4 的实现细节
- 添加 sequence diagramP/D 通信时序)
- 评估 SGLang radix tree 多生产者扩展的具体 API 改动
- 评估 D→P 同步对 direct-to-D fast path 自身延迟的影响(理想是异步零开销)
3. POC 阶段 1mooncake 双角色化 + 一个能跑通的 D→P transfer 单测
4. POC 阶段 2P 端 radix tree 多生产者扩展(重点工程量)
5. POC 阶段 3agentic-pd-hybrid 端的 hook + flag
6. 端到端验证:跑同 trace 同 ts=1 配置,目标 TTFT p99 < 0.5s
### 推荐
7. **同时启用真 RDMA**独立于 DP 工作只需改 sweep 脚本加 `--force-rdma --ib-device mlx5_0`先把现有 transfer 段加速作为 baseline
8. **跑 RDMA-only 对照**先证明单 RDMA 启用能把 TTFT p99 1.28s 压到 ~0.7s再用 DP sync 把剩下的 re-prefill 段也吃掉这样 paper 里能写两条独立的 ablation
### 不要做的事
- main / 工作分支上做 DP 实验隔离开主分支应该保持 v2 稳定
- 试图通过 capacity-backup 现有 flag "调出"DP 效果——它结构上做不到
---
## 附录 A本文档涉及的代码位置
| 函数 / 字段 | 位置 |
|---|---|
| `_commit_prefill_backup_residency` | `src/agentic_pd_hybrid/replay.py:1483` |
| `_reserve_prefill_backup_capacity` | `src/agentic_pd_hybrid/replay.py:1572` |
| `_close_prefill_session` | `src/agentic_pd_hybrid/replay.py:1507` |
| `_close_decode_session` | `src/agentic_pd_hybrid/replay.py:1539` |
| `_invoke_session_direct` (direct-to-D 路径) | `src/agentic_pd_hybrid/replay.py:2719` |
| `_invoke_decode_session_direct` | `src/agentic_pd_hybrid/replay.py:2826` |
| `_invoke_kvcache_seeded_router` (reseed 路径) | `src/agentic_pd_hybrid/replay.py:2101` |
| `DirectSessionState.prefill_resident_tokens` | `src/agentic_pd_hybrid/replay.py:128` |
| `_eviction_suffix` | `src/agentic_pd_hybrid/replay.py:1220` |
| `--kvcache-prefill-backup-policy` CLI flag | `src/agentic_pd_hybrid/cli.py:182-189, 436-441` |
| `MooncakeKVManager.__init__` | `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:187-256` |
| `start_decode_thread` (decode receive loop) | `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1425-1496` |
| `add_transfer_request` (assert PREFILL) | `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1563` |
| `MooncakeKVSender` / `MooncakeKVReceiver` | `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1648, 1740` |
| `BaseKVSender` / `BaseKVReceiver` 抽象 | `third_party/sglang/python/sglang/srt/disaggregation/base/conn.py` |
| `session_aware_cache.release_session` | `third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py:250-276` |
| `session_controller._close` | `third_party/sglang/python/sglang/srt/managers/session_controller.py:293-316` |
## 附录 B相关 commit
| Commit | 内容 |
|---|---|
| `9ccd853` | docs: DP 缺口的 Opus forensic audit 写入 V2_DEEP_ANALYSIS §4.2 + KVC_ROUTER_ALGORITHM §9 |
| `2ec0deb` | v2 实现reset-on-success + threshold 20488192)—— 直接 trigger 了对 reseed 慢路径的关注 |
| `c47adaf` | feat: backpressure pause hint reseed 不直接相关但展示了"D 端可主动告知 router"的通信通道存在是未来 DP sync 控制平面的潜在基础 |
## 附录 C相关 paper 章节建议
- **§Background** §1-§2 reseed 现状作为 motivation 摆出
- **§Algorithm**参考 `KVC_ROUTER_ALGORITHM.md` Algorithm 1-3
- **§Evaluation §Slow Path Cost** §2 的端到端时间线作为 Figuresequence diagram
- **§Future Work / Limitations**把本文 §4 作为 KVC 真正实现"完整 fast path 替代" roadmap引用 DP 工作的设计文档后续 `feat/d-to-p-sync` 分支产物
---
**核心句**v2 实现的 KVC 91.6% 请求上证明了 session-affinity 路由的价值 8.3% reseed 慢路径让 TTFT p99 DP 3×。这条慢路径的 50% 时间在 P re-prefill50% mooncake transfer——RDMA 只能救后者**DP 增量 KV 同步是唯一能消除 re-prefill 的机制**且当前在框架SGLangmooncake 三层都没有实现需要新建 `feat/d-to-p-sync` 分支从设计文档开始

View File

@@ -0,0 +1,304 @@
# 结构性缺陷验证报告
**日期**2026-05-06
**对照数据源**
- `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/`v5 KVC kv-aware Option D2P6D**3 次同配置 rerun**
- `outputs/qwen3-30b-tp1-exps/exp1_8way_dp_cache_aware_summary.json`(同 trace 8DP CA
- `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/.../logs/decode-{0..5}.log``prefill-{0,1}.log`
**模型**Qwen3-30B-A3BTP1单机 8×H100 80GBtrace `qwen35-swebench-50sess.jsonl`4449 reqs / 52 sessions
**报告作用域**:验证 `docs/AGENTIC_FIT_ANALYSIS_ZH.md` §1-§7 提出的结构性 claim 是否真实存在;量化影响。
> ⚠️ **环境限制**:本轮缺 GPU 访问,未跑新 sweep。所有数据来自已存在的 v5 rerun + 8DP baseline。Backpressure 代码已实现但**未端到端验证**——下文标注为"预期收益pending GPU smoke"。
---
## 0. 实验有效性锚点N=1 不可信
3 次 v5 baseline EXP2**完全相同配置**)的 errors 漂移:
| Run | Errors | Lat P50 | Lat P90 | TTFT P50 |
|---|---:|---:|---:|---:|
| run1 | **372** | 1.11s | 8.65s | 0.147s |
| run2 | **912** | 0.94s | 7.68s | 0.071s |
| run3 | **396** | 1.22s | 8.43s | 0.183s |
errors 漂移 **2.5×**372 → 912P50 latency 漂移 **30%**。**任何 N=1 比较 < 30% 差异都不可信。** 后续所有" trace 不同配置 / 不同代码"的对比都需要 N3 才有意义
**对 KVC vs DP 的 headline 数据3 次 KVC 的最佳值P50=0.94s)仍然是 DPP50=0.65s)的 1.45×**——8 way DP 的优势远超 single-run variance 范围这一头条结论不受 variance 影响
---
## §1. Session 永久 pin 到 D + 容量盲选 → 极端双峰 ✅ 完全成立
### Claim
KvAwarePolicy 评分以 hash overlap 为主没有 D 容量项Session 第一次落到某 D 后被永久 pin导致大 session 在已满 D 上反复 admission 拒绝 session 在原 D 100% direct-to-D
### 数据
**(a) Session 永久绑定 3 rerun 一致**
```
run1: 52 sessions, avg distinct-D-per-session = 1.00
run2: 52 sessions, avg distinct-D-per-session = 1.00
run3: 52 sessions, avg distinct-D-per-session = 1.00
```
每个 session 在整个运行中只访问 **1 个** D worker3 次独立 run 完全一致。**不是巧合是结构。**
**(b) Direct-to-D 命中率呈极端双峰**
| Direct-to-D rate | run1 | run2 | run3 |
|---|---:|---:|---:|
| 0-20%饿死 | 15 | 18 | 16 |
| 20-40% | 7 | 6 | 7 |
| 40-60% | 11 | 7 | 9 |
| 60-80% | 5 | 6 | 4 |
| 80-100%顺利 | 14 | 15 | 16 |
中间态稀少两端拥挤
**(c) 3 run 一致饿死的 session session 大小强相关**
```
13 sessions starved (<20% direct-to-D) in ALL 3 runs.
avg peak input of consistently-starved sessions: 62043 tokens
avg peak input of consistently-lucky sessions: 31344 tokens
ratio: 1.98× — starved sessions are exactly 2× larger.
```
**13/52 = 25% 的 session 在 3 次独立 run 中都被饿死,且这些 session 的 peak input 恰好是顺利 session 的 2 倍。** 这排除了"运气"假说证实是大 session 在容量过载 D 上结构性失败
### 影响量化
- 25% session 几乎每个 turn 都走 fallback 路径相对 direct-to-D **TTFT 慢 100×、E2E 慢 6×**数据点fallback path mean lat ~3.5s vs direct ~0.5s
- 对应这些 session 的用户体验是"系统性糟糕"而不是"偶尔慢"
- **SLO 视角下 P99 完全由这 13 session 拉高**
### 结论
**完全成立**。修复方向不在本轮policy score capacity penalty + 允许 session D 迁移 D 端引入 hot session retract
---
## §2. D 端 LRU 只 evict idle session → 跟不上压力 ✅ 完全成立
### Claim
`scheduler.py:2040` `evict_idle_streaming_sessions_lru` 只能 evict "所有 req finished + streaming 模式" session高并发下 hot session 永远不 idleLRU 找不到东西可踢结果 D 顶到 100% 然后撞 mooncake transfer timeout
### 数据v5 baseline rerun run1
| D worker | Trim 事件 | KVTransferError | 峰值 token_usage |
|---|---:|---:|---:|
| decode-0 | 9 | 0 | 0.99 |
| decode-1 | 43 | 4 | 0.99 |
| decode-2 | 16 | 153 | 0.97 |
| decode-3 | 37 | 29 | 0.99 |
| decode-4 | 28 | 90 | **1.00** |
| decode-5 | 30 | 93 | **1.00** |
**6 个 D 全部峰值 ≥ 0.97**其中 2 个直接顶到 1.00KV 池完全耗尽)。**LRU 触发 9-43 远不及 transfer 错误的 90-153 。**
decode-2 极端trim 16 vs error 153 = LRU 比错误慢 **9.5×**
### 影响量化
- run 累计 369 KVTransferError 6 D 之和
- 对应 ~8% 的请求失败率v5 errors 9/372/912 三次平均 ~430/4449 = 9.7%
- **每次 mooncake timeout 32s**—— P99 latency 直接贡献几十秒尾巴
### 结论
**完全成立**。修复方向不在本轮分层 eviction—— idle 外加冷 session retract按访问频率/时序加权Backpressure本轮代码只是把"D "的雪崩从"timeout 错误"转成"主动等待"**不是真正解决容量问题**。
---
## §3. 没有 D→Replay backpressure 通道 ✅ 成立(已实现修复)
### Claim
D transfer queue 32s timeout KVTransferError没有"D 过载请慢点"信号反向到 replayconcurrency 一直 32 不降
### 数据
- §2 369 KVTransferError 全部为 32s mooncake timeout日志中均为 `Failed to send kv chunk` `Decode instance could be dead`
- 错误集中在运行后半段按现有 `KVC_DEBUG_JOURNEY_V1_TO_V5.md` §v4错误均在 run 44.8% 之后开始累积
- 表明**前期 D 容量充裕时正常达到容量上限后所有后续请求集中失败**——典型无 backpressure 系统行为
### 修复(本轮已实现,待 GPU smoke 验证)
代码改动
1. `third_party/sglang/python/sglang/srt/managers/io_struct.py``DirectAppendAdmissionReqOutput` 增加 `recommended_pause_ms` 字段
2. `third_party/sglang/python/sglang/srt/managers/scheduler.py:admit_direct_append`基于 `transfer_queue_depth``retracted_queue_depth``token_usage_after` 计算 hint
```python
def _compute_backpressure_pause_hint(...):
if retracted_queue_depth > 0: return 1500
if token_usage_after >= 0.90: return max(200, min(2000, overshoot * 5))
if transfer_queue_depth >= 8: return min(2000, transfer_queue_depth * 100)
return 0
```
3. `src/agentic_pd_hybrid/replay.py`
- `DecodeResidencyState.pause_until_s: dict[str, float]`
- `_query_decode_direct_admission` 解析 hint 更新 `pause_until_s`
- 新增 `_wait_for_decode_pause`,在 `_invoke_router` / `_invoke_session_direct` 入口检查
4. CLI flag`--enable-backpressure`、`--backpressure-max-pause-s 2.0`(默认关闭)
5. 结构性日志:`structural/admission-events.jsonl`、`backpressure-events.jsonl`、`session-d-binding.jsonl`
### 预期收益pending GPU smoke E2 vs E1
- KVTransferError 应从 ~370 / 4449 跌到 < 50 / 4449
- P99 应改善(消除 32s timeout 尾巴)
- 整体 latency mean 可能**略升**(被强制 pause但 P99 应大幅降
- backpressure-events.jsonl 应显示 D-4 / D-5 累积大量 pause 事件(与 §2 数据吻合)
### 结论
**Claim 成立;修复已实现,待 smoke 验证**。注意backpressure 是**降级**机制,不是性能优化——它把"硬错误"换成"主动等待",整体 throughput 不会因此提升。
---
## §4. Admission RPC 与 scheduler 主循环耦合 ⚠️ 间接证据,本轮未直接验证
### Claim
`admit_direct_append` 进 scheduler 主循环遍历 session slotadmission RPC 频率 16+/s 时与 decode 抢调度。
### 现有间接证据
- `docs/V5_PROFILE_INVESTIGATION_ZH.md`:仅加 1Hz `/server_info` polling 就让 EXP2 errors 从 9 涨到 41546×但 v6 P0 三次 baseline 不开 polling 同样得到 372/912/396——**polling 不是唯一原因,主循环负载本身就敏感**。
### 本轮未做
- 没有"admission probe 拆 fast/slow"的对照实验。需要 SGLang 较深的改动(提供 lock-free snapshot不在 KISS 边界。
### 结论
**Claim 间接成立,本轮未直接验证**。Backpressure 实现里 admission RPC 的频率没有变(仍每个 turn 一次),只是结果会触发 sleep。如果这条 claim 成立,加 backpressure 后 admission RPC 数量大致不变但每次响应里的 `pause_ms` 会非零——**新增的 admission-events.jsonl 可在 GPU smoke 后用来直接验证此现象**。
---
## §5. P-side round-robin 不感知 D 健康 ✅ 成立
### Claim
`pd_router.py:_select_decode_index` 是裸 round-robin。任一 P 撞到 hot D 时反复失败,另一 P 完全不受影响。
### 数据v5 baseline rerun run1
| Worker | KVTransferError | "Decode could be dead" |
|---|---:|---:|
| prefill-0 | **367** | 361 |
| prefill-1 | **2** | 0 |
prefill-0 的请求量从 summary 看是 2225 vs prefill-1 的 2224——**请求量近乎对半,错误率差 180×**。
### 影响量化
- 失败请求集中在 P-0 → 某个 hot D 的链路上(日志中反复出现 `to 10.45.80.47:XXXXX`
- 单 P 的"死亡链路"贡献了 **99%** 的全部 KVTransferError
- 如果 P 选择能避开"正在和 hot D 死磕"的链路,**理论上可消除单 P 故障的雪崩效应**
### 备注
- 此现象**未在 v6 P0 的 3 次 rerun 中横向验证**——只有 run1 的日志可读。需要在新 sweep 的 prefill-{0,1}.log 上重复确认,避免 N=1 嫌疑。
### 结论
**单 run 数据成立,多 run 一致性未验证**。修复方向不在本轮router 选 P 时考虑 (P 当前 inflight transfer 数, 目标 D 健康度)。
---
## §6. 已撤回Replay 端 session footprint 估算膨胀
写计划时仔细看代码后撤回——`_estimate_session_resident_tokens` 返回 full prompt但所有需要"增量"的 call site (`replay.py:1247-1254`、`:1393-1394`、`:1490-1491`) 都已用 `target - current` 减法处理。**不是 bug**。
---
## §7. time-scale=10 把 inter-turn gap 压到 1/10 ✅ 完全成立
### 数据
```
原始 trace inter-turn gap (n=4397):
p10=1.6s p50=2.5s p90=7.8s p99=25.1s max=261s
time-scale=10 实际 replay gap:
p10=0.16s p50=0.25s p90=0.78s p99=2.5s max=26s
```
真实 agentic 用户/agent 在 turn 之间停 2-8 秒思考、打字、tool call、agent reasoning。time-scale=10 把这些窗口压到 0.16-0.78 秒——**人为消除了 D 的自然 idle 时间**,正好是 KVC 想利用的"session 短暂 idle 时 LRU 可以 evict、其他 session 可以 admit"机会。
### 测量学影响
- 所有 v3-v6 数据基于 time-scale=10
- 意味着所有"KVC 在 SWE 上输给 baseline"的结论**可能被 benchmark 放大了**
- §1 的 25% session 永久饿死现象,在 time-scale=1 下可能因为 D 有更多 drain 时间而显著缓解
### 本轮未做
- 没跑 time-scale=1 baseline。这是项目当前**最重要但缺失的验证**。
- Smoke sweep 脚本(`scripts/sweep_backpressure_smoke.sh`E3、E4 包含了 time-scale=1 的 KVC + DP 短 trace 对比,等 GPU 时跑。
### 结论
**Claim 完全成立time-scale=1 验证为 P0 待办**。
---
## 头条对比(同 trace、同硬件
```
8-way DP cache-aware (TP1):
errors= 0 | latency mean=1.426s p50=0.654s p90=3.609s
| TTFT mean=0.123s p50=0.093s p90=0.256s
KVC v5 2P6D (3 reruns, no polling):
run1: errors=372 | mean=3.50s p50=1.11s p90=8.65s | TTFT mean=2.13s
run2: errors=912 | mean=3.00s p50=0.94s p90=7.68s | TTFT mean=1.64s
run3: errors=396 | mean=3.42s p50=1.22s p90=8.43s | TTFT mean=2.07s
```
KVC 三次 run 全输 DP且差距远超 single-run variance
- Latency meanDP 优 **+110%**KVC 平均 3.30s vs DP 1.43s
- Latency P50DP 优 **+65%**KVC 平均 1.09s vs DP 0.65s
- TTFT meanDP 优 **+1500%**KVC 平均 1.95s vs DP 0.12s——慢 17×
- ErrorsDP 0 vs KVC 平均 ~560
**这是这个项目当前最严肃的事实**——所有 KVC 复杂度回报为负。
---
## 综合结论
按"是否结构性 + 影响大小"的二维分类:
| Claim | 结构性 | 影响 | 本轮验证 | 修复KISS 内) | 修复KISS 外) |
|---|---|---|---|---|---|
| §1 Session pin + 容量盲选 | 强 | 大25% session 饿死) | ✅ 3 run 一致 | ❌ | capacity-aware policy + 跨 D 迁移 |
| §2 LRU 跟不上 | 强 | 大(每次 ~370 KVTransferError | ✅ 6 D 数据 | ❌ | 分层 eviction、hot retract |
| §3 无 backpressure | 强 | 中-大(消除 32s timeout 雪崩) | ⚠️ 已实现,待 smoke | ✅ **本轮交付** | |
| §4 admission RPC 干扰 | 弱-中 | 中 | ⚠️ 间接 | ❌ | probe / commit_evict 拆分 |
| §5 P-side 不感知 D 健康 | 中 | 中(单 P 错误率差 180× | ✅ N=1需 N≥3 复核 | ❌ | router P 选择带 D 健康反馈 |
| §6 estimate 膨胀 | | | ❌ 已撤回 | | |
| §7 time-scale=10 失真 | 强(测量学) | 大(可能颠覆所有 KVC vs DP 结论) | ✅ 数据明确 | ✅ 改 flag | |
### 最关键的两个 takeaway
1. **§7 time-scale=1 是当前项目所有结论的前置依赖**——必须先做。如果 time-scale=1 下 KVC 与 DP 接近,前面所有 v3-v6 的"KVC 输得彻底"诊断都需要重新解读。
2. **§1 + §2 是双胞胎结构性问题**——session 被永久 pin 在某个 D + D 不能 evict 已满 = 大 session 永久卡死。任何不动 policy + 不动 LRU 的修复(包括本轮的 backpressure只能让症状好看不能消除根因。
---
## 本轮代码改动汇总git diff 范围)
```
src/agentic_pd_hybrid/replay.py # +结构性日志 + backpressure pause 检查 + admission 增强
src/agentic_pd_hybrid/cli.py # +CLI flags
src/agentic_pd_hybrid/benchmark.py # +CLI flags 透传
third_party/sglang/python/sglang/srt/managers/io_struct.py
third_party/sglang/python/sglang/srt/managers/scheduler.py
# +recommended_pause_ms 字段 + hint 计算
scripts/sweep_backpressure_smoke.sh # 4-run smoke sweep待 GPU 跑)
scripts/analysis/analyze_backpressure_smoke.py
# 配套分析器
docs/REFACTOR_PLAN_ZH.md # 计划文档
docs/STRUCTURAL_VALIDATION_REPORT_ZH.md
# 本报告
```
代码默认行为**不变**`enable_backpressure=False`)——所有现有脚本/配置无影响。
---
## 待 GPU 时执行
```bash
bash scripts/sweep_backpressure_smoke.sh
python3 scripts/analysis/analyze_backpressure_smoke.py outputs/sweep_backpressure_smoke
```
预算4 个 run × 30-60 min ≈ 3-4h GPU 时间。
按 §3 的预期E2 (KVC + backpressure) 相对 E1 (KVC baseline) 应有 errors 降 70%+P99 改善TTFT P50 持平或略升。E3 (KVC + backpressure @ time-scale=1) vs E4 (DP @ time-scale=1) 是验证 §7 的关键对照。
如果 E2 vs E1 的 errors 没有显著下降,说明 backpressure hint 公式调得不对(`_compute_backpressure_pause_hint` 阈值可调 §3 实际不是雪崩主因更可能是 §2 D-side LRU 才是)。

View File

@@ -0,0 +1,95 @@
# SWE-Bench PD Hybrid Experiment Progress
## 实验目标
在单节点 8xH100 上复现 agentic-pd-hybrid 三种 serving mechanism对比 Qwen3.5-35B-A3B 在 SWE-Bench 500 instance agentic trajectory 上的性能。
## 硬件环境
- 8x H100 80GB (NVLink 互联, 2 NUMA nodes: GPU 0-3 / GPU 4-7)
- 无 RDMA/IB 设备
- Transfer backend: **mooncake TCP** (nixl UCX 因 pip 包缺少 CUDA 支持导致 segfault已放弃)
## 实验矩阵
| 实验 | Mechanism | Workers | GPU 分配 | Router | Policy |
|------|-----------|---------|----------|--------|--------|
| A | pd-disaggregation | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
| B | pd-colo | 2 direct (TP4 each) | D0: 0-3, D1: 4-7 | No | default |
| C | kvcache-centric | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
## 测试负载
- 源数据: `simm-swe-bench/outputs/20260416-205833-hicache-qwen35-verified-0-500/audit.jsonl`
- 39,417 lines (turns), 497 unique instances (sessions)
- 每个 instance 8-150 turns (均值 79.3)
- 转换为 agentic-pd-hybrid trace 格式: `outputs/qwen35-swebench-500.jsonl`
## 关键发现
### Transfer Backend 选择
- **nixl (UCX)**: pip 安装的 nixl_cu12 包自带的 UCX 库没有 CUDA 支持,导致 GPU memory registration 时 segfault。系统 UCX (/opt/hpcx/ucx) 有 CUDA 支持但因 RPATH 无法被 NIXL 使用。
- **mooncake (TCP)**: 可用。需要两处修改:
1. `third_party/sglang/.../mooncake_transfer_engine.py`: 从环境变量 `MOONCAKE_PROTOCOL` 读取协议,而非硬编码 `"rdma"`
2. `src/agentic_pd_hybrid/stack.py`: 当 `transfer_backend == "mooncake"` 且非 `force_rdma` 时,自动设置 `MOONCAKE_PROTOCOL=tcp`
### 代码修改记录
1. **`third_party/sglang/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py`**
-`"rdma"` 硬编码改为 `os.environ.get("MOONCAKE_PROTOCOL", "rdma")`
2. **`src/agentic_pd_hybrid/stack.py`**
-`_build_process_env()` 中添加: mooncake 非 force_rdma 时默认设置 `MOONCAKE_PROTOCOL=tcp`
3. **`scripts/convert_audit_to_trace.py`** (新建)
- 将 sibench audit.jsonl 转换为 agentic-pd-hybrid trace 格式
## 实验进度
- [x] Step 0: 环境准备 (uv sync, nixl/mooncake 安装)
- [x] Step 1: Trace 格式转换 (39,417 lines 验证通过)
- [x] Step 2: Smoke test (pd-disaggregation, mooncake TCP, 100 requests) — **通过**
- 100/100 requests, 0 errors
- Mean latency: 1.53s, P50: 0.77s, P90: 2.82s
- TTFT: mean 0.49s, P50 0.29s; TPOT: mean 4.7ms
- 91/100 cache hits
- [x] Step 3a: 实验 A 全量尝试 (39K reqs, 497 sessions) — **中止**
- Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T171113Z` (无metrics,被kill)
- 前 90% 完成 ~80min (~8-10 req/s), 但尾部 D 侧 KV cache 98% 饱和
- 497 并发 session 争抢 D 侧 token 空间, mamba 80-93 sessions 无法 drain
- **教训**: 1P+1D (TP4) 无法支撑 497 并发 session, 需减少 session 数量或降低 concurrency
- [x] Step 3b: 实验 A — pd-disaggregation (52 sessions, 4449 reqs, concurrency=32) — **完成**
- Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T202540Z`
- Trace: `outputs/qwen35-swebench-50sess.jsonl` (10% sample, 52 sessions)
- **结果**: 4449/4449 成功, 0 errors
- Latency: mean=1.66s, P50=0.97s, P90=3.64s, P99=7.68s
- TTFT: mean=0.45s, P50=0.34s, P90=0.88s
- TPOT: mean=5.2ms, P50=5.2ms
- Cache hit: 4199/4449 (94.4%)
- [x] Step 4: 实验 B — pd-colo — **失败: SGLang bug**
- Run dir: `outputs/swebench-exps/pd-colo-default-20260426T210129Z`
- **Bug**: `--disaggregation-mode null` (colocation) 下 Qwen3.5-35B-A3B 模型触发 token_to_kv_pool_allocator 内存泄漏
- 错误: `ValueError: token_to_kv_pool_allocator memory leak detected!`
- 两个 direct worker 在处理 ~5 个请求后均 crash (Scheduler exception)
- **结论**: 当前 vendored SGLang v0.5.10 不支持 Qwen3.5-35B-A3B 的 colocation 模式
- [x] Step 5: 实验 C — kvcache-centric — **完成 (高错误率)**
- Run dir: `outputs/swebench-exps/kvcache-centric-default-worker-admission-20260426T210800Z`
- 4390/4449 errors (98.7%) — admission control 过于保守
- 59 成功请求: mean latency 1.24s (比 pd-disagg 快 25%), TTFT 0.18s (快 60%)
- 详细分析见 `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
- [x] Step 6: 结果对比分析 — **完成**
- 完整报告: `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
## 启动脚本
- `scripts/run_exp_a_pd_disagg.sh` — 实验 A
- `scripts/run_exp_b_pd_colo.sh` — 实验 B
- `scripts/run_exp_c_kvcache_centric.sh` — 实验 C
- `scripts/convert_audit_to_trace.py` — Trace 转换
## 已知风险
1. Qwen3.5-35B-A3B TP4 可用 mem ~12GB/GPU (after model + CUDA graph),长 session (150 turns) 可能 OOM
2. mooncake TCP loopback 延迟远低于真实跨机,结果偏乐观
3. 原始 trace 时间跨度 ~6000s全量回放非常耗时

View File

@@ -0,0 +1,121 @@
# SWE-Bench PD Hybrid Experiment Results
## 实验配置
- **模型**: Qwen3.5-35B-A3B (MoE, 35B total / 3B active), TP4
- **硬件**: 8x H100 80GB, NVLink, 单节点
- **Transfer backend**: mooncake TCP (loopback)
- **Trace**: 52 sessions, 4,449 requests (10% sample of SWE-Bench 500 instances)
- **时间压缩**: time-scale=10, concurrency-limit=32
## 结果汇总
### Experiment A: pd-disaggregation (baseline)
| Metric | Value |
|--------|-------|
| Run dir | `pd-disaggregation-default-20260426T202540Z` |
| Requests | 4,449 / 4,449 (100%) |
| Errors | 0 |
| **Mean Latency** | **1.662s** |
| P50 Latency | 0.973s |
| P90 Latency | 3.644s |
| P99 Latency | 7.676s |
| Mean TTFT | 0.445s |
| P50 TTFT | 0.340s |
| P90 TTFT | 0.880s |
| Mean TPOT | 5.20ms |
| Cache Hit Rate | 94.4% (4199/4449) |
| Mean Cached Tokens | 27,794 |
| KV Transfer Blocks | 105,235 |
### Experiment B: pd-colo (colocation) — FAILED
| Metric | Value |
|--------|-------|
| Run dir | `pd-colo-default-20260426T210129Z` |
| Status | **CRASHED** |
| Error | `token_to_kv_pool_allocator memory leak detected!` |
| Root Cause | SGLang v0.5.10 `--disaggregation-mode null` 与 Qwen3.5-35B-A3B (Mamba/GDN hybrid) 不兼容 |
| Requests | ~10 / 4,449 (0.2%) |
**结论**: 当前 vendored SGLang 不支持此模型的 colocation 模式。需要修复 token_to_kv_pool_allocator 中 Mamba 模型的内存管理。
### Experiment C: kvcache-centric (session-aware PD)
| Metric | Value |
|--------|-------|
| Run dir | `kvcache-centric-default-worker-admission-20260426T210800Z` |
| Requests | 4,449 total |
| **Errors** | **4,390 (98.7%)** |
| Successful | 59 (1.3%) |
| Mean Latency (success) | 1.238s |
| P50 Latency (success) | 0.484s |
| P90 Latency (success) | 2.550s |
| Mean TTFT (success) | 0.179s |
| P50 TTFT (success) | 0.081s |
| Mean TPOT (success) | 4.70ms |
| Direct-to-D Sessions | 56 |
| KV Transfer (actual) | 196 blocks (vs 105,235 planned) |
**Execution Mode 分布**:
- `kvcache-centric` (failed): 4,390
- `kvcache-direct-to-d-session` (success): 56
- `pd-router-*` variants: 3
## 关键分析
### 1. pd-disaggregation (A) — 稳定可靠
- 100% 成功率0 错误
- Mean latency 1.66s 合理 (包含 P→D KV transfer 开销)
- 94.4% cache hit 说明 prefix cache 在 P 侧工作良好
- KV transfer 105K blocks = 主要开销来源
- **适合生产使用**
### 2. pd-colo (B) — 不可用
- Qwen3.5-35B-A3B 的 Mamba/GDN hybrid 架构在 `disaggregation-mode null` 下触发内存泄漏
- 这是 SGLang 的 bug不是 agentic-pd-hybrid 的问题
- **需要 SGLang 修复后重新测试**
### 3. kvcache-centric (C) — Admission 过于保守
- 98.7% 错误率说明 admission control 拒绝了几乎所有请求
- `kvcache-seed-min-turn-id=2` 过滤了 turn 1 的 seed正确行为
- 但绝大多数 turn 2+ 请求也走 `kvcache-centric` 模式后失败
- 可能原因:
- Worker admission 查询发现 D 侧没有对应 session 的 KV cache因为 turn 1 没有 seed
- D 侧 transfer queue 积压导致 admission 拒绝
- 成功的 56 个 `direct-to-d-session` 请求表现优异: TTFT 0.08s (P50), 比 pd-disagg 的 0.34s 快 4x
- **需要调优 admission 参数,或使用 `kvcache-seed-min-turn-id=1` 允许 turn 1 seed**
### 4. kvcache-centric 成功请求 vs pd-disaggregation 对比
| Metric | pd-disagg (A) | kvcache-centric (C, success only) | Delta |
|--------|:---:|:---:|:---:|
| Mean Latency | 1.662s | 1.238s | **-25.5%** |
| P50 Latency | 0.973s | 0.484s | **-50.3%** |
| Mean TTFT | 0.445s | 0.179s | **-59.8%** |
| P50 TTFT | 0.340s | 0.081s | **-76.2%** |
| Mean TPOT | 5.20ms | 4.70ms | -9.6% |
| Actual KV Transfer | 105,235 blk | 196 blk | **-99.8%** |
**当 kvcache-centric 成功时,性能提升显著:**
- TTFT 降低 60-76% (D 侧直接 append无需 P→D transfer)
- 端到端 latency 降低 25-50%
- KV transfer 减少 99.8%
## 后续建议
1. **修复 pd-colo**: 提交 SGLang issue 关于 Mamba/GDN 模型在 disaggregation-mode null 下的内存泄漏
2. **调优 kvcache-centric admission**:
- 尝试 `--kvcache-seed-min-turn-id 1` 允许 turn 1 seed
- 放宽 `--kvcache-seed-max-decode-transfer-queue-reqs` 阈值
- 使用 `--kvcache-admission-mode router` (shadow state, 不在 critical path)
3. **增加 D 侧内存**: 调整 `--mem-fraction-static` 给 KV cache 更多空间
4. **多 P/D 配置**: 测试 2P2D (TP2) 配置以增加并行度
## 实验日期
2026-04-27

View File

@@ -0,0 +1,641 @@
# agentic-pd-hybrid 现框架性能与结构性问题报告
**对象**:项目团队同学
**前置假设**:读者**没看过** v3-v6 KVC 实验日志
**数据范围**:项目仓库 `outputs/` 下截止 2026-05-06 的全部实验产物
**目的**:把"现状"和"问题"分别交代清楚,给后续改造提供共同事实基础
---
## 0. 给没看过实验的读者:基础概念速览
### 0.1 项目目标
验证 **session-aware / KV-cache-aware P/D routing****agentic coding workload**(多轮 session、长 context、增量 append上能否降低端到端延迟。基线对比对象是 vanilla SGLang xPyD。
### 0.2 三种部署机制(**这三个名词全程会用**
| 机制 | 形态 | KV 流向 |
|---|---|---|
| **pd-disaggregation**"PD disagg" | P 和 D 是独立进程、分占不同 GPU | 每个请求 P 算 prefill → mooncake 推 KV → D 解码 |
| **pd-colo**"DP"data-parallel | 没有 PD 拆分N 个独立完整 worker每个自己 prefill+decode | 没有 KV transferrouter 按 hash 分配请求 |
| **kvcache-centric**"KVC" | 部署形态同 PD disagg**D 上多了 SessionAwareCache**,能跨 turn 保留 session KV | 运行时决策:可走 direct-to-D无 P、可走 P→D disagg、可走带 reseed 的混合 |
**Direct-to-D**"D-direct"KVC 的快路径——D 上已有该 session 的 KV新 turn 在 D 本地做 append-prefill零 P 介入、零 mooncake transfer。这是 KVC 理论上能省时间的核心。
**Fallback**KVC admission 拒了 / 阈值不满足 / D 不健康时,退化到普通 PD disagg 路径。
**Routing policy**(与机制正交):
- `default`:纯 round-robin
- `sticky`turn 2+ 黏到 session 的 last D
- `kv-aware`:按 hash overlap + sticky 评分选 D**KVC 必须配它**才能正确工作)
### 0.3 数据来源
- Trace`outputs/qwen35-swebench-50sess.jsonl`SWE-Bench 抽样4449 reqs / **52 sessions** / 每 session 8-150 turns / time-scale=10 / concurrency=32
- 模型Qwen3.5-35B-A3B (TP4) 和 Qwen3-30B-A3B (TP1) 两组
- 硬件:单机 8×H100 80GBmooncake TCP loopback 模拟 P→D 传输
---
# 第一部份:性能数据现象
## 1.1 三种机制在 Qwen3.5-35B (TP4) SWE 50sess 上的表现
来源:`outputs/swebench-exps/`
| Run | Mechanism | Policy | Errors | Lat mean | Lat P50 | Lat P99 | TTFT mean | TTFT P50 |
|---|---|---|---:|---:|---:|---:|---:|---:|
| `pd-disaggregation-default-20260426T202540Z` | pd-disagg | default | **0/4449** | 1.66s | 0.97s | 7.68s | 0.45s | 0.34s |
| `pd-colo-default-20260426T210129Z` | pd-colo | default | **4447/4449** | | | | | |
| `pd-colo-default-20260427T033519Z` | pd-colo | default | **0/4449** | 1.77s | 0.86s | 9.67s | 0.29s | 0.25s |
| `pd-colo-kv-aware-20260427T042034Z` | pd-colo | kv-aware | 469/4449 | 1.52s | 0.82s | 8.27s | 0.26s | 0.23s |
| `pd-colo-kv-aware-20260427T044944Z` | pd-colo | kv-aware | **0/4449** | **1.57s** | 0.81s | 8.48s | **0.22s** | **0.17s** |
| `kvcache-centric-default-worker-admission-20260426T210800Z` | KVC | default | **4390/4449** | | | | | |
### 现象解读
**(1) pd-disagg 是稳定基线**1.66s mean / 0 errors / 4199 cache hits94.4%)。可以正常服务。
**(2) pd-coloDP有两次 run第一次几乎全 crash第二次稳定**
- 04-26 的 4447/4449 errors 来自 SGLang `--disaggregation-mode null` + Qwen3.5-35B-A3BMamba/GDN hybrid`token_to_kv_pool_allocator memory leak` bugcrash 了
- 04-27 的两次 pd-colo run 都跑通了。**`pd-colo-kv-aware-20260427T044944Z` 是这一组实验里跑分最好的配置**——0 errors / TTFT P50 = 0.171spd-disagg 的 50%
**(3) KVC 在 SWE 35B 上的唯一一次 run 几乎全 crash**4390/4449 = 98.7% errors。但**那 56 个跑通的 direct-to-D 请求性能优异**——Lat mean 1.24sTTFT P50 0.081sKV transfer 196 块vs PD disagg 的 105K 块,**99.8%**)。说明 KVC 机制本身有效,但 admission control 把绝大多数请求过滤掉了。
### 一句话:在 Qwen3.5-35B 上,**pd-colo + kv-aware 是头名**KVC 机制配置不当几乎不可用。
---
## 1.2 同 trace 切到 Qwen3-30B (TP1)v1→v6 演进
为绕开 Mamba 模型的 SGLang bug团队后续切到 Qwen3-30B-A3B (TP1) 跑 KVC 调优 sweep。**所有结果用同一份 SWE 50sess trace**,可以横向比较。来源:`outputs/qwen3-30b-tp1-*` 各目录。
### 1.2.1 各版本配置概览
| 版本 | 关键改动(一句话) |
|---|---|
| v2 | KVC + `--policy default`(这个 policy 选择 **是 bug**,下文 §2.5 |
| v3 | KVC + `--policy kv-aware` |
| v4 | v3 + replay 端 session soft_cap 从 4 抬到 16 |
| v5 (Option D) | 把 admission 决策从 replay 估算改成 D worker 真实容量回答(`worker-mode admission` |
| v5+profile | v5 + 1Hz `/server_info` polling 做时序 instrument |
| v6 P0 | v5 baseline 同配置 rerun ×3 验证可复现性 |
### 1.2.2 各版本同 trace 结果总表
| 版本 | Errors | Lat mean | Lat P50 | Lat P90 | Lat P99 | TTFT P50 | direct-to-D% |
|---|---:|---:|---:|---:|---:|---:|---:|
| **8-way DP cache-aware** | **0** | **1.43s** | **0.65s** | **3.61s** | **8.37s** | **0.093s** | |
| v3 1P7D KVC | 363 (8.2%) | 4.88s | 1.75s | 12.67s | 28.72s | 0.363s | 39% |
| v3 2P6D KVC | 9 (0.2%) | 3.58s | 1.52s | 9.23s | 18.70s | 0.328s | 31% |
| v4 1P7D cap=16 | 435 (10%) | 4.21s | 1.08s | 13.38s | 24.45s | 0.056s | 49% |
| v4 2P6D cap=16 | 403 (9%) | 2.51s | 0.84s | 6.51s | 18.34s | 0.051s | 53% |
| v5 1P7D Option D | 9 (0.2%) | 5.18s | 1.59s | 14.67s | 26.09s | 0.207s | 45% |
| v5 2P6D Option D | 9 (0.2%) | 3.49s | 1.31s | 9.09s | 24.92s | 0.244s | 41% |
| v5+profile 1P7D | 6 (0.1%) | 4.21s | 1.18s | 11.33s | 28.83s | 0.060s | 55% |
| v5+profile 2P6D | **415 (9.3%)** | 3.23s | 1.11s | 8.36s | 20.26s | 0.168s | 41% |
| v5 rerun ×3无 profile | **372 / 912 / 396** | 3.003.50s | 0.941.22s | 7.688.65s | 18.9720.37s | 0.070.18s | 40-42% |
**8DP CA 在每一项指标都是头名**
- Latency mean **比所有 KVC 配置好 +43%~+260%**
- TTFT P50 **0.093s**KVC 最佳 v4 2P6D 是 0.051s——TTFT 单项 KVC 是有优势的,但被整体 P99 灾难抵消)
- 0 errorsKVC 任一配置 errors 在 9-912 之间漂移)
### 1.2.3 v5+profile 的诡异:加 1Hz polling 让 errors 从 9 涨到 415
这条单独看v5 baseline 跑出来 9 errors加上 1Hz `/server_info` polling 之后 415 errors**46×**)。原因机理见 §2.5。
### 1.2.4 v6 P0 用 ×3 rerun 验证可复现性,结果是不能复现
**关键事实**v5 baseline 完全相同配置跑 3 次:
| Run | Errors | Lat mean | Lat P50 | TTFT P50 |
|---|---:|---:|---:|---:|
| rerun1 | **372** | 3.50s | 1.11s | 0.147s |
| rerun2 | **912** | 3.00s | 0.94s | 0.071s |
| rerun3 | **396** | 3.42s | 1.22s | 0.183s |
errors 漂移 **2.5×**372→912。Latency mean / P50 也漂移 ~30%。**这意味着 v3-v6 之前所有"single-run"对比的差异 < 30% 的都不可信。**
但要注意**3 v5 中最优的 P500.94s仍然比 8DP CA0.65s 1.45×**——这个差距大于 single-run variance所以"DP 全胜 KVC"的头条结论不受 variance 影响
### 1.2.5 一个有趣的反差v4 vs v5
- v4errors ~10%)、direct-to-D 占比高53-58%)、整体 P50 较好0.84s
- v5errors 0.2%)、direct-to-D 占比降低41-45%)、整体 P50 反而退步1.31s
**v5 没有让性能变好,只是把"硬错误"转成了"诚实拒绝"——v4 的 admission 是乐观估算admit 进来后 D 装不下变成 mooncake 32s timeout统计成 errorsv5 让 D 自己拍板admit 拒得早,请求改走 fallback统计成低 direct-to-D 率)。容量本身没变。**
---
## 1.3 microbench 上 KVC 击败 PD disagg —— 但本仓库没保留实际 run
`docs/PROJECT_OVERVIEW.md` 写明
> micro-benchmark 上,`kvcache-centric` 可以比 `pd-disaggregation` 好。原因很简单:**session 少、D KV 放得下**turn2+ 可以直接走 D session。
`outputs/` **没有** microbench 实际 run只有 microbench trace 生成器 `microbench.py` 和它的几个示例 trace 文件)。所以 microbench "KVC "是基于设计预期 + 历史口口相传**没有可重现的产物**。
**这本身是个问题**——下文 §2.6 会解释 microbench 的默认参数4 sessions × 30K input × 1K append正好把所有 KVC 失效条件都规避掉了
---
## 1.4 头条结论Part 1 总结)
| 工作负载 / 模型 | 头名机制 | KVC 表现 |
|---|---|---|
| Microbench8 session × 30K × 1K append | KVC > PD disagg无落地数据按设计 | 设计上必然赢 |
| SWE 35B (TP4) | **pd-colo + kv-aware**1.57s mean, 0 errors | KVC 唯一 run 中 98.7% errors |
| SWE 30B (TP1) | **8-way DP cache-aware**1.43s mean, 0 errors | KVC 6 个配置全输;最佳的 v4 2P6D 慢 75%、errors 9% |
**真实 agentic 工作负载SWE-BenchKVC 机制目前没有任何配置能跑赢 naive DP cache-aware。**
---
# 第二部份:结构性问题分析
每条按 (1) 现象(实锤数据)、(2) 根因(代码位置)、(3) 影响量化 三段交代。
## 2.1 KvAwarePolicy 不感知 D 容量 + Session 永久 pin 在初始 D 上 ★ 最严重
### 2.1.1 现象(实锤)
**(a) 每个 session 整 run 中只访问 1 个 D**——基于 v5 rerun1/2/3 全部 4449×3 = 13347 条 metrics
| Run | sessions | avg distinct-D-per-session |
|---|---:|---:|
| rerun1 | 52 | **1.00** |
| rerun2 | 52 | **1.00** |
| rerun3 | 52 | **1.00** |
3 次独立 run、156 次 session 实例,**没有一个** session 跨 D 迁移过。
**(b) Direct-to-D 命中率呈极端双峰**——以 rerun1 为例(其他两次形态相同):
| direct-to-D rate | session 数 |
|---|---:|
| 020%"饿死" | **15** |
| 2040% | 7 |
| 4060% | 11 |
| 6080% | 5 |
| 80100%"顺利" | **14** |
中间档稀少,两端拥挤。
**(c) 跨 3 次 run 一致饿死的 session = 13/52且这些 session 的 input 是顺利 session 的 1.98×**
```
13 sessions starved (<20% direct-to-D) in ALL 3 runs
avg peak input of consistently-starved sessions: 62043 tokens
avg peak input of consistently-lucky sessions: 31344 tokens
```
**结构性、可复现、与 session 大小强相关。** 排除"运气"假说。
### 2.1.2 根因(代码)
`policies.py:166-172` `KvAwarePolicy.select()` 评分函数:
```python
score = (
overlap + sticky * self.sticky_bonus, # 主项:历史 KV overlap
sticky, # 二级
inflight_penalty, # 三级
assignment_penalty, # 四级
)
```
**评分中完全没有 D 当前容量项**
session X 第一次落到 D-2 → 在 D-2 上积累 hash_id → 之后不管 D-2 多满X 的 turn N+1 的 overlap 在 D-2 上仍是最大 → 永远选 D-2。即使 D-5 全空也轮不到。
`RoutingState.decode_resident_blocks` (`policies.py:46`) 还从不缩减——但因为 SWE trace 的 hash_ids 是 session-unique**不缩减并不影响"选对 D",只影响内存**——真正问题在评分函数无容量项。
### 2.1.3 影响量化
- 25%13/52的 session 几乎每个 turn 走 fallback 路径
- fallback 路径 mean lat 约 3.5s vs direct-to-D ~0.5s——**饿死 session 每 turn 慢 6×**
- 这 13 个 session 还容易撞 mooncake 32s timeout见 §2.2、§2.3P99 完全由它们决定
- **SLO 视角下25% 的用户体验是系统性糟糕**
---
## 2.2 D 端 LRU 只能 evict idle session → 跟不上压力
### 2.2.1 现象(实锤)
来源:`outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/.../logs/decode-{0..5}.log`,全 run 计数:
| D worker | "Trimmed decode session cache" 事件 | KVTransferError | 峰值 token_usage |
|---|---:|---:|---:|
| decode-0 | 9 | 0 | 0.99 |
| decode-1 | 43 | 4 | 0.99 |
| decode-2 | 16 | **153** | 0.97 |
| decode-3 | 37 | 29 | 0.99 |
| decode-4 | 28 | **90** | **1.00** |
| decode-5 | 30 | **93** | **1.00** |
**所有 6 个 D 都顶到 token_usage ≥ 0.972 个顶到 1.00KV 池完全耗尽。LRU 触发 9-43 次远不够——transfer 错误是 LRU 触发量的 5-10×。**
decode-2 极端trim 16 次 vs error 153 次 = LRU 跑得比错误慢 9.5×。
### 2.2.2 根因(代码)
`scheduler.py:2040``evict_idle_streaming_sessions_lru` 实际只能 evict
> 所有 req 都 finished + streaming 模式 + 该 session 没有 inflight transfer
但 SWE 高并发concurrency=32 + time-scale=10 → effective inter-turn gap p50=0.25s)下,每个 session 几乎一直有 inflight req。**hot session 永远不 idleLRU 永远找不到东西可踢。**
### 2.2.3 影响量化
- 单 run 累计 KVTransferError6 个 D 之和 = **369 次**
- 对应 ~8% 请求失败率v5 errors 9/372/912 三次平均 ~430/4449 = 9.7%
- **每次 mooncake timeout = 32s**——直接构成 P99 18-26s 的尾巴
修复需要 SGLang 内部分层 eviction除 idle session 外,按访问频率 / 时序加权强制 retract——**不在当前 KISS 边界**。
---
## 2.3 没有 D → Replay backpressure 通道
### 2.3.1 现象
§2.2 数据显示 D 顶到 token_usage=1.00 时仍在持续接收新请求,最终撞 mooncake 32s timeout。**整个错误链路里没有"D 过载,请慢点发"的反向信号**。
定量证据rerun1 的 KVTransferError 时间分布——**98% 集中在 run 后半段**(参考 `KVC_DEBUG_JOURNEY_V1_TO_V5.md` §v4。前期 D 容量充裕时正常,达到上限后**所有后续请求集中失败**——典型的"无 backpressure 系统在过载点雪崩"模式。
### 2.3.2 根因(代码)
链路:
```
replay 端按 trace 时序 + concurrency=32 持续发请求
PD Router 裸 round-robin (pd_router.py:43-49)
P 收到请求做 prefill → mooncake 推 KV → D 端
D 端 transfer queue 堆积 → 32s timeout
errno 抛回 replay → fallback 路径,但 concurrency 不降
```
D 端的 `admit_direct_append` 响应里**只有 can_admit/reason 等过去时字段,没有任何"建议节流"的指示**。
### 2.3.3 修复(本次代码改动已实现)
代码已加 `recommended_pause_ms` 字段:
- `third_party/sglang/.../io_struct.py:DirectAppendAdmissionReqOutput` 增加 `recommended_pause_ms: int = 0`
- `scheduler.py:_compute_backpressure_pause_hint`:按 `transfer_queue_depth``retracted_queue_depth``token_usage_after` 计算
- `replay.py`admission 响应里读到 hint → 更新 `DecodeResidencyState.pause_until_s[D]` → 下次发到该 D 之前 sleep
- CLI flag`--enable-backpressure`(默认 off保留 baseline 行为)
- 同时新增 3 个结构性日志(`structural/admission-events.jsonl` / `backpressure-events.jsonl` / `session-d-binding.jsonl`
**待 GPU smoke 验证。预期 errors 从 ~370 降到 < 50P99 改善(消除 32s timeout 尾巴mean latency 可能略升(被强制 sleep。**
修复脚本:`scripts/sweep_backpressure_smoke.sh`4 个 run × 30-60 min分析器`scripts/analysis/analyze_backpressure_smoke.py`
### 2.3.4 注意
backpressure 是**降级机制**,不是性能优化——它把"硬错误32s timeout"换成"主动等待"。整体 throughput 不会因此提升,但 P99 应大幅改善。
---
## 2.4 P-side round-robin 不感知 D 健康
### 2.4.1 现象(实锤)
来源v5 rerun1 `prefill-{0,1}.log`,全 run 计数:
| Worker | KVTransferError | "Decode instance could be dead" | 请求量 |
|---|---:|---:|---:|
| prefill-0 | **367** | 361 | 2225 |
| prefill-1 | **2** | 0 | 2224 |
**两 P 请求量完全均衡round-robin错误率差 180×**。日志里 prefill-0 的失败反复指向某个特定 D 的 IP`to 10.45.80.47:XXXXX`)。
### 2.4.2 根因(代码)
`pd_router.py:43-49`
```python
prefill_url, bootstrap_port = self.config.prefill_urls[
self.prefill_cursor % len(self.config.prefill_urls)
]
self.prefill_cursor += 1
```
裸 round-robin。不感知
- P 当前 inflight transfer 数
- 目标 D 的健康状态 / 容量
后果:当某个 D 进入 hot 状态时,被 round-robin 派去给它推 KV 的 P **持续失败**;另一个 P 接到的请求恰好命中健康 D完全没事。**单 P 故障不会被路由层避开。**
### 2.4.3 影响量化
- prefill-0 几乎独自承担了**全部 KVTransferError 的 99%**367/(367+2)
- 如果 router P 选择能避开"正在和 hot D 死磕"的链路,这部分 ~8% 的整体错误率应可降到 < 1%
### 2.4.4 备注
这条结论目前来自单次 run N=1 数据需要跨 N3 rerun 验证一致性才能完全确信——加上 §2.1.1 (b/c) 也证明 P-D 链路绑定结构性强相关"prefill-0 死磕某 D"很可能在每次 run 都重复由初始 session 落点决定)。
---
## 2.5 Admission RPC 进 scheduler 主循环 → 自我干扰
### 2.5.1 现象(实锤)
v5 baseline 配置不开 pollingerrors = 9
完全相同配置 + 1Hz `/server_info` pollingerrors = **415****46×**
来源`outputs/qwen3-30b-tp1-v5-optD/exp2_2p6d_kvc_optD_summary.json`baseline 9 errorsvs `qwen3-30b-tp1-v5-optD-profile/exp2_2p6d_kvc_optD_profile_summary.json`415 errors)。
### 2.5.2 根因(代码)
`/server_info` polling 调用 `admit_direct_append` 都进 SGLang scheduler 主循环
- `/server_info` `scheduler.py:get_streaming_session_cache_status` 遍历每个 session slot 计算 `is_idle`
- `admit_direct_append` `token_to_kv_pool_allocator.available_size()` + 触发 `maybe_trim_decode_session_cache`
scheduler 主循环本身在跑 decode/prefill forward这些 RPC 进队列就和 forward 抢调度
### 2.5.3 真实负载下 admission RPC 频率远高于 1Hz
- 4449 reqs / ~2700s **1.6 reqs/s**
- 每个 turn 1-3 admission probedirect-append + 可能的 seed retry
- × 8 worker = **每秒 ~16-40 次 admission RPC**
也就是 admission 流量本身比 1Hz polling 高一个量级如果 1Hz polling 都能让 errors 46×admission 自己的扰动至少同等
### 2.5.4 修复
不在本轮 KISS 设计方向是把 admission 拆成两个端点
- `POST /probe` lock-free snapshot90% 流量走这条
- `POST /commit_evict` scheduler 队列做实际 LRU probe 不够时调
这部分需要 SGLang 内部 atomic publish snapshot 到共享内存——**结构性改动**。
### 2.5.5 注意
v6 P0 ×3 baseline rerun不开 pollingerrors 也是 372/912/396——**polling 不是 415 唯一原因**。本身 v5 admission 设计就敏感polling 是放大器
---
## 2.6 Replay 时间被 time-scale=10 压缩 → 测量学失真
### 2.6.1 现象(实锤)
v5 rerun1 metrics 解出的真实 inter-turn gap 分布
```
原始 trace inter-turn gap (n=4397):
p10=1.6s p50=2.5s p90=7.8s p99=25.1s max=261s
time-scale=10 实际 replay gap (= 原始 / 10):
p10=0.16s p50=0.25s p90=0.78s p99=2.5s max=26s
```
### 2.6.2 这意味着什么
真实 agentic 用户/agent 在每个 turn 之间停 **2-8 秒**——思考打字tool call 异步返回agent reasoning
`microbench.py:20-21` 的默认 `inter_turn_gap_s=1.0` + `session_stagger_s=0.1` 也大致符合这个量级1 秒左右)。
SWE replay 设的 time-scale=10 把这个间隔**人为压到 0.25 **——D 还没消化完 turn Nturn N+1 就来了
### 2.6.3 为什么这么设计
纯粹**节省测试时间**
- 原始 trace 跨度 ~6000s(≈100 分钟
- time-scale=10 ~600s(≈10 分钟
- sweep 5 版本 × 3 重复 = 25h vs 2.5h
### 2.6.4 它扭曲了什么
1. **抹掉 D 的自然 idle 时间**真实部署里每个 session turn 间有几秒空窗正好让 D LRU 把它 evict 出去给其他 session 让位(§2.2 idle 判定)。time-scale=10 下几乎所有 session 一直忙——LRU 永远找不到 idle session
2. **人为提升并发压力**concurrency=32 time-scale=10 下意味着 D 端持续承受 320 effective concurrent agents 的压力——远超真实部署
3. **掩盖 backpressure 等慢节奏机制的价值**如果 inter-turn gap 2.5sbackpressure replay 0.5s 几乎不影响吞吐time-scale=10 0.5s sleep 等于直接跳过下一个 turn
### 2.6.5 严重性:所有 KVC vs DP 结论都带这个失真
**v3-v6 全部数据基于 time-scale=10**所以"KVC SWE 上输给 DP"的程度可能被 benchmark 放大。**真实部署里 inter-turn gap 2.5s 的话KVC 可能根本不会撞到当前看到的容量瓶颈**。
这是项目当前**最严重但还没修的测量学问题**。修复成本极小只是去掉 `--time-scale 10`但意义重大——**P0 应该立刻跑一组 time-scale=1 baseline**KVC + DP N=3
---
## 2.7 direct-to-D append 阈值 = 2048 是个 magic number
### 2.7.1 现象(实锤)
`replay.py:51` 默认值
```python
kvcache_direct_max_uncached_tokens: int = 2048
```
判定`replay.py:2177`当新 turn uncached append > 2048 token 时,**禁止 direct-to-D**,请求改走 P→D reseed 路径。
实测 v5 rerun1 的 uncached append 分布(`input_length - cached_tokens`
```
所有 4449 请求:
p10=50 p25=181 p50=610 p75=2907 p90=36495 p99=91600 max=103971
> 2048: 1222/4449 = 27.5%
```
**双峰分布**median 只有 610但 p90 已经 36K。
### 2.7.2 根因(代码)
阈值是个 magic number——**没有任何代码注释解释为什么是 2048**git log 里也没人调过它。
合理推测它存在的理由(按可信度):
| 理由 | 是否成立 |
|---|---|
| D 是 decode-tunedmax-prefill-tokens 通常 4-8Kappend > 2K 会触发 D 内部多 chunk prefill 拖慢 decode | 强 |
| 大 append 在 D 上 prefill 会阻塞当前正在 decoding 的其他 session 的 TPOT | 强 |
| P 有更优化的 prefill kernel 和 batch | 弱D 的 prefill kernel 同源) |
| 工程上的"安全默认值",没认真测过 | 强git log 印证) |
### 2.7.3 但更严重的 bugexecution_mode 标签命名错位
`execution_mode` 名字里带 "large-append" 的请求一共 **2060 个**,其中:
- **1222 个59.3%)实际 uncached append ≤ 2048**
也就是说,**"large-append" 这个标签名对超过一半的实例是错的**。看 `replay.py:2168-2178` 的判断:
```python
if (
_should_bypass_prefill(...) # 要求 overlap > 0
and direct_append_length is not None
and direct_session_reused # 要求 session 在本 D 上 opened 过
and not direct_session_reset
and direct_append_length <= config.kvcache_direct_max_uncached_tokens
):
# direct-to-D
else:
# 进入 "large-append" 分支
```
**这个 else 分支的 5 个进入条件里,"append > 2048" 只是其中一个。** session 不在本 D 上、被 evict 过、overlap=0 都会进这个分支,但 `execution_mode` 仍然写 `pd-router-fallback-large-append-*`——导致看 metrics 的人误以为问题是 append 太大。
### 2.7.4 实际阈值不是主要瓶颈session 不在 D 上才是
把 turn≥2 的请求按"append 是否 > 2048"和"实际 execution mode"交叉:
```
Turn≥2 小 append (≤2048), n=3129:
1854 (59%) kvcache-direct-to-d-session ← 走通了
1141 (37%) pd-router-fallback-large-append-session-cap ← 标签骗人
...
Turn≥2 大 append (>2048), n=1216:
813 (67%) pd-router-fallback-large-append-session-cap
365 (30%) kvcache-centric (失败)
22 pd-router-large-append-reseed ← 真正受阈值影响的
...
```
**真正因 append > 2048 而失败的请求**:约 50 个large-append-reseed + 部分 large-append fallback仅占总数 1-2%。
**绝大多数 fallback 实际是 §2.1 的 session 不在 D 上**——名字里带 "large-append" 是误导。
### 2.7.5 修复
两件事:
1.`execution_mode` 标签按真实原因细分——把 "large-append" 拆成 "session-not-resident" / "real-large-append" / "session-reset" 等
2. 阈值本身可以做 sweep2048 / 4096 / 8192 / 16384找最优——但收益空间有限最多改善那 1-2% 的请求)
---
## 2.8 跨 run variance 巨大N=1 不可信
### 2.8.1 现象(实锤)
v5 baseline 完全相同配置跑 3 次(`qwen3-30b-tp1-v5-optD-baseline-rerun/`
| Run | Errors | Lat mean | Lat P50 | TTFT P50 |
|---|---:|---:|---:|---:|
| rerun1 | 372 | 3.50s | 1.11s | 0.147s |
| rerun2 | **912** | 3.00s | 0.94s | 0.071s |
| rerun3 | 396 | 3.42s | 1.22s | 0.183s |
errors 漂移 **2.5×**372→912P50 latency 漂移 ~30%TTFT P50 漂移 **2.6×**
### 2.8.2 根因(推测)
源头不止一个,至少包含:
1. **§2.1 + §2.2 的复合**D 容量过载是临界点附近的非线性系统——initial session-to-D assignment 的随机性决定了哪个 D 先饱和。
2. **mooncake TCP loopback 的随机性**:单机 loopback 的 32s timeout 触发概率受当前 GPU 内存碎片、PCIe 状态影响。
3. **scheduler 主循环里 admission RPC 与 decode 抢资源的随机性**§2.5)。
### 2.8.3 影响
**所有 single-run 比较 < 30% 差异都不可信**。这意味着:
- v3 vs v4 的 P50 差异1.75s vs 1.08s)勉强有意义(差异 38%
- v4 vs v5 的 P50 差异0.84s vs 1.31s)勉强有意义(差异 56%
- v5+profile 的 1P7D vs baselinemean 4.21s vs 5.18s)→ 差异 18%**不可信**
- 所有 `direct-to-D 占比 ±5%` 的差异都是噪声
### 2.8.4 这条规则要求所有后续实验
**要任何 KVC 配置间或 KVC vs DP 的对比,最少跑 N=3最好 N=5。** 不跑 N≥3 的实验在做"碰运气科研"。
8h 一次 sweep 装不下 N=3 + 多版本对比,所以必须**牺牲版本数量保 N≥3**。
---
## 2.9 microbench 的 KVC 优势不能外推到真实 agentic
`microbench.py:13-22` 默认参数:
| 维度 | 默认值 |
|---|---|
| `session_count` | 8 |
| `turns_per_session` | 3 |
| `initial_input_length` | 10000 |
| `append_input_length` | **1000** ← 低于 §2.7 的 2048 阈值 |
| `output_length` | 1000 |
| `inter_turn_gap_s` | **1.0** ← 接近真实 agentic |
| `session_stagger_s` | 0.1 |
**与 SWE workload 的关键维度对比**
| 维度 | microbench | SWE 50sess |
|---|---|---|
| Session 数 | 4-8 | 52 |
| Per-session peak input | ~31K | median 49K, max 104K |
| 总 working-set / 7D 容量92K each | 0.19×5× 冗余) | **3.95×4× 过载)** |
| Append size 是否过 2048 | 几乎 100% 过不到 | 28% 超过 |
| Session 数是否过 cap | 4 ≤ 28v3 cap×7D | 52 远超 |
**Microbench 把 KVC 的所有失效条件都规避了**容量充裕、append 卡阈值之下、session 数远低于 cap、inter-turn gap 接近真实——这一组参数让 KVC 五项判断(路由 / admission / 没被 evict / append ≤ 阈值 / 无 backpressure全部通过 → 100% 走 direct-to-D 快路径。
**而 SWE workload 在每一项上都把 KVC 推过临界点。**
所以"KVC 在 microbench 赢 PD disagg"是个**弱命题**——它只证明了机制能跑,没有证明在真实 agentic 下能赢。
---
# 第三部份:一句话总结与下一步
## 现状一句话
> 在所有可比的真实 agentic workloadSWE 35B / 30B**naive DP cache-aware 全胜 KVC 任何配置**,且差距 > 30%(远超 single-run variance。Microbench 上 KVC 赢 PD disagg 的设计前提容量富余、append 小、session 少)在真实 workload 下不成立。
## 排序后的结构性问题(按修复 ROI
| 排名 | 问题 | 影响 | 修复成本 |
|---|---|---|---|
| **P0** | §2.6 time-scale=10 失真 → 所有 KVC vs DP 结论可能被 benchmark 放大 | 颠覆性 | 极低(改 flag |
| **P0** | §2.1 session 永久 pin + 容量盲选 | 25% session 永远饿死 | 中(改 policy |
| **P0** | §2.2 D-side LRU 跟不上 | ~8% errors 来自此 | 中(改 SGLang |
| P1 | §2.3 没 backpressure | 把 timeout 雪崩变可控 | **已实现**(待 GPU smoke |
| P1 | §2.4 P-side 不感知 D 健康 | 单 P 出错率差 180× | 中 |
| P1 | §2.7 / 2.8 metrics 标签命名错位 | 数据解读经常出错 | 低(改字符串) |
| P2 | §2.5 admission RPC 进 scheduler 主循环 | 自我干扰 | 高(结构改动) |
| P2 | §2.8 N=1 不可信 | 实验方法学 | 0团队约定 |
## 立刻能做的三件事
1. **跑 time-scale=1 baseline**KVC v5 + 8DP CA 各 N=3~6h GPU—— 不修代码、单变量、决定后续路线。
2. **跑 backpressure smoke**已实现4 run × ~30-60 min~3-4h GPU—— 验证 §2.3 修复的端到端效果。
3. **修 metrics 标签命名**`pd-router-fallback-large-append-*` → 按真实原因分类)—— 让以后看数据的人不会再被误导。
## 不立刻做但要重新讨论的
- **§2.1 capacity-aware policy**:之前考虑过的"评分加 capacity 项"会引入"换 D"的副作用(孤儿 KV、新 D 上仍可能饿死),需要跟 §2.2 的 D 端 hot retract 一起设计。
- **§2.5 admission API 拆 probe / commit**:是结构性正确方向,但要动 SGLang 内部 + atomic publish 机制,不是 KISS。
- **是否保留 KVC 这条线**:如果 P0 跑完 time-scale=1 baseline 后 KVC 仍系统性输 DP应该认真讨论 KVC 项目目标是否需要重新定义(比如只做"中等容量 + 长 session"工作点的方案,而不是替代 vanilla DP
---
## 附录 A本报告所有数据的来源
| 章节 | 数据源 |
|---|---|
| 1.1 SWE 35B | `outputs/swebench-exps/{pd-disagg,pd-colo,kvcache-centric}-*` |
| 1.2 TP1 series | `outputs/qwen3-30b-tp1-{exps,v3-kvaware,v4-cap16,v5-optD,v5-optD-profile,v5-optD-baseline-rerun}/` |
| 2.1 session pinning | `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/exp2_2p6d_run{1,2,3}_metrics.jsonl` |
| 2.2 D LRU 计数 | `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/.../logs/decode-{0..5}.log` |
| 2.4 P imbalance | `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/.../logs/prefill-{0,1}.log` |
| 2.5 polling 影响 | v5 baseline summary vs v5+profile summary |
| 2.6 inter-turn gap | rerun1 metrics 的 `trace_timestamp_s` 字段 |
| 2.7 append 分布 | rerun1 metrics 的 `input_length - cached_tokens` |
| 2.8 variance | rerun1/2/3 三组 summary |
## 附录 B相关已有文档
- `docs/PROJECT_OVERVIEW.md` — 项目目标、microbench 结论
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` — 结构性缺陷的早期分析(本报告 §2 的来源)
- `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 详细演进日记
- `docs/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(含 critic 修订)
- `docs/SWEBENCH_EXPERIMENT_RESULTS.md` — SWE 35B 早期实验
- `docs/REFACTOR_PLAN_ZH.md` — 当前重构计划
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` — 结构性 claim 验证(本报告的精简版)

624
docs/V2_DEEP_ANALYSIS_ZH.md Normal file
View File

@@ -0,0 +1,624 @@
# KVC v2 深度分析:相对 TEAM_REPORT 基线的改进、性能、新暴露的问题
**日期**2026-05-11
**对象**:项目团队同学
**基线**`docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md`v3-v6 ts=10 调优 sweep 的状态报告)
**新数据**
- `docs/REFACTOR_PLAN_V1_ZH.md`ts=1 4-run validation 结果)
- `docs/MIGRATION_V1_FINDINGS_ZH.md`v1 thrashing 诊断)
- `docs/V2_RESULTS_ZH.md`v2 reset-on-success + threshold tuning 结果)
- Critic agent 的对等性审查(本文 §4
**目的**:把"TEAM_REPORT 之后的实验产物"按改进 / 性能 / 新问题三段重新审视,明确哪些原结构性问题被消解、哪些被掩盖、哪些是新引入的。
---
## 0. TL;DR
1. **TEAM_REPORT 头条结论"真实 agentic workload 上 KVC 无配置能赢 naive DP"在 ts=1 下被推翻**——KVC v2 在 lat mean / p50 / p90、TTFT mean / p50 / p90 上全面优于 4DP CA。
2. **生产决策结论online coding agent serving 应选 KVC 1P3D**。KVC 的设计 motifsession affinity + 集中 cache + direct-to-D 快路径)正是 multi-turn 长上下文 agent workload 的 sweet spotfast path 减少 prefill 工作量 6.9× 是机制目标实现,不是 measurement artifact。
3. **真实代价只有一项TTFT p99 = 1.29s vs DP 0.43sKVC 3× 差)**——来自 8.3% 非 direct-to-D 路径的 mooncake reseed 长尾。生产部署要么用真 RDMA 把这条压下来,要么靠容量规划让 reseed 极少发生。
4. **TEAM_REPORT §1session pin 饿死)已被 v2 修好**——direct-to-D 从 42.8% 涨到 91.6%severe thrashing 清零。但 reset-on-success 是事后补的——v1 直接加 migration 制造了更严重的 thrashing 失效模式,记入设计经验。
5. **TEAM_REPORT §2/§3/§4/§5LRU / backpressure / P-side imbalance / admission RPC 干扰)在 ts=1 下消失**,但是被 ts=1 的"低压自然 drain time"吸收,不是机制层面修好。一旦回到 ts=10 / 更长 trace / 更紧容量,会全部复现——属于潜在的,不是消除的。
6. **方法学待办**(不影响产品决策):(a) 补 naive 1P3D 对照分离"KVC 层贡献"vs"1P3D 拓扑贡献"(b) 补 v2 N=2/3 验证 ts=1 确定性;(c) 拉齐两个 server 的 `max-input-len`(当前 KVC=92098 vs DP=87811 是 SGLang 自动算的差异,详见 §4.3)。
---
## 1. 三组新实验与 TEAM_REPORT 的关系
### 1.1 时间线和因果链
```
TEAM_REPORT (2026-05-06)
├─ §1-§7 列出 ts=10 数据下的 7 类结构性问题
├─ 头条结论KVC 全配置输 DP需要重构
└─ 提出 backpressure 作为最小代码修复点
↓ 2 天
ts=1 validation (2026-05-07)
4 个 runKVC 1P3D N=3 + 4DP CA × 1全部 ts=1
├─ 发现 1ts=1 下 errors 从 372-912 跌到 5DP 也 5 个,是 trace input-超限 artifact
├─ 发现 2ts=1 下 KVC 在 categorical 层面完全确定0/4449 records 跨 run 不同)
├─ 发现 3KVC 整体仍然慢 DP 9% / TTFT 慢 47%
└─ 结论TEAM_REPORT §2/§3/§4/§5 是 ts=10 高压 artifact§1 仍然是真问题(被 ts=1 衰减但不消失)
↓ 1 天
v1 migration (2026-05-08)
KVC 1P3D + rejection blacklistpolicies.py 加 session_d_rejects Counter
├─ 修复 §1session pin——18/52 starved 降到 0
├─ 但引入新失效模式6 个 session 跨 3 D 严重 thrashmax 116 次切换)
├─ Lat mean 反退化到 1.758sTTFT mean 涨到 0.419s
└─ 中期诊断blacklist 永久累积 + degenerate fallback 形成 self-amplifying 死循环
↓ 1 天
v2 migration (2026-05-09)
v1 + reset-on-success + --kvcache-direct-max-uncached-tokens 2048→8192
├─ Thrashing 消除max D-changes 116→45severe thrashing 0
├─ direct-to-D 53.3%→91.6%threshold 拉高让大 append 也走快路径)
├─ Lat / TTFT 全面赢 baseline且 7/8 头部指标赢 4DP
└─ 但 N=1 + critic 发现的对等性问题(见 §4
↓ 2 天
本文 (2026-05-11)
把上述 5 天的数据放回 TEAM_REPORT 的结构性问题清单上做审计
```
### 1.2 同 trace 全部数字总表(按时间)
来源:`outputs/qwen3-30b-tp1-*` 系列各 summary.json。**4449 reqs / 52 sessions / Qwen3-30B-A3B (TP1) / 4×H100 80GB**。
| 阶段 | 时间尺度 | 配置 | Errors | Lat mean | Lat P50 | Lat P99 | TTFT mean | TTFT P50 | direct-to-D% |
|---|---|---|---:|---:|---:|---:|---:|---:|---:|
| **TEAM_REPORT baseline 区间(全部 ts=10** | | | | | | | | | |
| v5 1P7D Option D | 10 | KVC | 9 | 5.18s | 1.59s | 26.09s | 0.207s | | 45% |
| v5 2P6D Option D | 10 | KVC | 9 | 3.49s | 1.31s | 24.92s | 0.244s | | 41% |
| v5 rerun1 (重测) | 10 | KVC | **372** | 3.50s | 1.11s | 19.49s | 0.147s | | ~40% |
| v5 rerun2 | 10 | KVC | **912** | 3.00s | 0.94s | 20.37s | 0.071s | | ~40% |
| v5 rerun3 | 10 | KVC | **396** | 3.42s | 1.22s | 18.97s | 0.183s | | ~40% |
| 8-way DP CA | 10 | DP-colo | **0** | **1.43s** | **0.65s** | **8.37s** | **** | **0.093s** | |
| **ts=1 validation 区间** | | | | | | | | | |
| v0 baseline run1 | 1 | KVC 1P3D | 5 | 1.574s | 0.811s | 8.70s | 0.245s | 0.124s | **42.8%** |
| v0 baseline run2 | 1 | KVC 1P3D | 5 | 1.573s | 0.809s | 8.74s | 0.243s | 0.120s | 42.8% |
| v0 baseline run3 | 1 | KVC 1P3D | 5 | 1.574s | 0.812s | 8.76s | 0.243s | 0.123s | 42.8% |
| 4-way DP CA | 1 | DP-colo | 0 | 1.443s | 0.659s | 8.43s | 0.129s | **0.090s** | |
| **Migration 区间** | | | | | | | | | |
| v1 migration | 1 | KVC 1P3D | 6 | 1.758s | 0.773s | 9.92s | 0.419s | 0.057s | 53.3% |
| **v2 migration (头条)** | 1 | KVC 1P3D | 5 | **1.432s** | **0.576s** | **8.69s** | **0.098s** | **0.042s** | **91.6%** |
**两组关键对比**
1. **ts=10 → ts=1同 KVC 配置)**Lat mean 5.18s → 1.574s**3.3× 改善**errors 9-912 → 5**~100× 改善**direct-to-D 41% → 42.8%(持平,机制不变)
2. **v0 → v2同 ts=1机制改进**Lat mean 1.574s → 1.432s**9% 改善**TTFT mean 0.245s → 0.098s**60% 改善**direct-to-D 42.8% → 91.6%**+48.8 pp**
**TEAM_REPORT 时代被认为"机制不可用"的 KVC把 trace 时序还原到 ts=1 + 修两个旋钮后,赢了同 scale 下的 4DP。**
---
## 2. TEAM_REPORT §1-§9 的逐项更新
按原始优先级排序,每条标注"是否仍是问题 / 被什么消解 / 残留风险"。
### 2.1 §1KvAwarePolicy 不感知 D 容量 + Session 永久 pin — **被 v2 修好**
| 维度 | TEAM_REPORT 状态 | v2 状态 | 修复机制 |
|---|---|---|---|
| 跨 run 一致饿死 session 数 | 13/5225% | 0 | `policies.py: session_d_rejects` + `replay.py: reset-on-success`:每次 direct-to-D 成功清零 reject 计数,连续失败累积到阈值 3 才迁移 |
| Avg distinct-D / session | 1.00 | <2v2 实测 mean=0.6 D-changes/session | 同上 |
| direct-to-D % | 41% | 91.6% | 同上 + threshold 20488192 |
| 饿死 session turn 6× | | 饿死消失 | |
**残留风险**reset-on-success reactive 修复——session 必须先经历 N 次失败才迁移并且第一次失败的那个 turn 仍然慢在严苛容量下如把 trace 改成 ts=2 sess 数翻倍迁移阈值可能频繁触发重新逼近 v1 thrashing 区域。**未在更紧 workload 上验证。**
### 2.2 §2D 端 LRU 跟不上 → 8% errors — **被 ts=1 自然吸收**
| 维度 | TEAM_REPORT 状态 | v2 状态 | 原因 |
|---|---|---|---|
| run KVTransferError | 369 | 0 mooncake timeout | ts=1 inter-turn gap p50 = 2.5s D 充分 drain 时间 |
| D 峰值 token_usage | 6 D 全顶到 0.97-1.00 | 偶发 0.97-1.00burst常态 0.4-0.85 | 同上 |
| LRU trim 触发次数 | 9-43远不够 | 不需要——D 自然回落 | ts=1 工作流 |
**残留风险**这条**没有机制层面修好**。 ts 调回 10或者 session 数从 52 增到 100+、或者 model 切到更大都会立刻让 D 容量重新顶死LRU 再次跟不上。**TEAM_REPORT §2 是潜在的不是消失的。**
### 2.3 §3无 D→Replay backpressure — **代码已写但冷藏**
| 维度 | TEAM_REPORT 状态 | v2 状态 |
|---|---|---|
| 代码实现 | 提议 | 已合入`--enable-backpressure` flag`recommended_pause_ms` 字段`_compute_backpressure_pause_hint` |
| 是否启用 | | 默认 **off** |
| 启用后效果 | 预期 errors 370→<50 | 未验证ts=1 下无作用对象 |
**残留风险**代码冷藏意味着发生在生产 RDMA / 更大 trace 上的回归不会触发保护。**如果团队决定项目要支持 ts=10 / 更大 sessions需要把 backpressure 默认 on 并补 smoke 验证。**
### 2.4 §4P-side round-robin 不感知 D 健康 — **1P 配置不可测**
v2 1P3D P无从测试 P-side 调度TEAM_REPORT 数据来自 2P6D 配置
**残留风险**未来如果扩到 2P+ 必须重新审查 P 侧调度。**当前数据无法支持也无法反驳。**
### 2.5 §5Admission RPC 与 scheduler 互相干扰 — **ts=1 下不显著**
TEAM_REPORT 现象1Hz polling errors 46×来自 ts=10 高压时的 scheduler 主循环争抢ts=1 D scheduler 大部分时间空闲RPC 进来不阻塞 batched prefill
**残留风险** §2 同源——属于 ts=10 高压 artifact
### 2.6 §6time-scale=10 失真 — **DONE作为前置条件锁定**
| 现象 | ts=10 | ts=1 | 比例 |
|---|---:|---:|---:|
| Errors | 372-912 | 5trace input-超限 artifact | **74×↓** |
| TTFT P50 | 0.07-0.18s | 0.04s | 4.5×↓ |
| Per-D spread | ±26% | ±3.8% | 7×↓ |
| Lat P99 | 18-29s | 8.7s | 2-3×↓ |
**REFACTOR_PLAN_V1 把这条当作所有后续讨论的前置条件——ts=10 数据从此不参与 KVC vs DP 比较。**
### 2.7 §7execution_mode 标签错位 — **部分修复**
`pd-router-fallback-large-append-*` v1+ 被细分成
- `pd-router-fallback-real-large-append-session-cap`实际 append > 阈值)
- `pd-router-fallback-session-not-resident-session-cap`session 在该 D 上没住过)
- `pd-router-fallback-no-d-capacity`D 全满)
- `pd-router-fallback-session-not-resident-seed-filter-early-turn`
**残留**error_count 在 KVC vs DP 之间口径不一致(见 §4.3),未统一。
### 2.8 §8N=1 不可信 — **ts=1 下规则改写**
| Trace 区间 | N 要求 |
|---|---|
| ts=10 高压 | N≥3v5 rerun 显示 errors 漂移 2.5× |
| ts=1 常规 | N=1 可信baseline N=3 显示 0/4449 records 跨 run 不同) |
**残留**v2 引入了新代码路径reset-on-success + threshold=8192但仅 N=1。新分支是否仍保持 categorical 确定性**未验证**。这是 critic 标 MINOR 但未关闭的点。
### 2.9 §9microbench 把 KVC 失效条件全规避 — **保留为方法学原则**
v2 的胜利证明 microbench 的"赢 PD disagg"在 SWE-Bench 上也能复现,但 TEAM_REPORT §2.9 的方法学原则仍然成立——micro-benchmark 应该主动构造能触发 fallback 的 workload。
---
## 3. v2 的真实性能拆解path-level
v2 整体跑得快不仅因为 "KVC 机制好",更因为 **91.6% 请求被路由到了几乎免费的 fast path**。需要看路径级细节才能理解胜利的来源。
### 3.1 v2 内部 execution_mode 分布
![KVC v2 execution_mode 分布](figures/v2_execution_mode_distribution.png)
数据来源:`outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl`n = 4449全部请求含失败。绿色 = direct-to-D 快路径 = 91.6%;其余红色 = 慢路径 / fallback / 失败。绘图脚本:`scripts/analysis/plot_v2_path_breakdown.py`
### 3.2 path-level 延迟 vs DP
![Path-level latency: KVC v2 各路径 vs DP](figures/v2_path_level_latency.png)
数据来源:同上 + `outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl`。Y 轴 log 刻度latency 跨度 41ms ~ 7.71s)。已过滤 abort / error 请求,所有数字按对等口径计算。
**关键事实**
- KVC 的 91.6% **fast path** 在 TTFT p50 上是 **41ms vs DP 92ms**——压制 DP 2.2×TTFT p99 150ms vs DP 428ms 仍优 2.9×
- KVC 的 **3.4% reseed 慢路径** TTFT p99 = **5.12s**,是 DP 单一路径 p99428ms**12×**
- KVC 的 **0.7% no-d-capacity fallback** 是最坏情况TTFT p99 = 7.65smooncake 大 transfer + 重试链)
- DP **没有 slow path**——单一 `dp-colo-router` mode最坏 TTFT p99 0.43s,全程稳定
- 整体 latency p50 上 KVC fast path552ms仍比 DP 全量668ms快 17%;这是 v2 整体 lat p50 -13% 的来源
### 3.3 Fast path 的工作量比 DP 少 6.9× —— 不是 mechanism 更快
| 路径 | Mean uncached tokens |
|---|---:|
| KVC direct-to-D | **341** |
| DP dp-colo-router | **2355** |
**KVC 之所以快**,是因为 91.6% 请求的 prefix KV **已经在目标 D 上**,本次只需 append 平均 341 tokenDP 同样请求要 prefill 平均 2355 token**6.9× 工作量**)。
这是结构性的 KVC vs DP 差异——**KVC 的设计就是利用 session 间 KV 复用**,所以"工作量少"本身就是机制核心目标。但在比较时必须诚实:
> KVC 的 TTFT 优势 = **session-aware 路由减少了 prefill 工作量****不是** D 端硬件层面更快。
如果工作量做归一化(比如限定都做 2000 token 以上 uncached prefillKVC 应该和 DP 在同一速度量级。
### 3.4 TTFT 概率密度对比bimodal vs unimodal
把 path-level 数据投影到 TTFT 的分布维度,可以更直观看出 KVC 与 DP 是**本质不同的两种分布形状**
![TTFT probability density: KVC v2 vs 4-way DP](figures/ttft_pdf_comparison.png)
左图(线性 x ∈ [0, 0.6s])看 body
- **KVC 的 PDF 在 ~40ms 有一个尖锐峰值**(来自 91.6% direct-to-D fast path
- **DP 的 PDF 是宽峰,集中在 50-200ms**(每个请求都要做完整 prefill 的固有时间)
- 在 body 区间KVC 把 50% 请求压在 41msDP 的 50% 在 92ms
右图log x ∈ [10ms, 10s])看全范围:
- **KVC 是 bimodal 分布**fast path 主峰(~40-50ms+ slow path reseed 尾峰(~1-5s
- **DP 是 unimodal 分布**:单一宽峰,从 ~50ms 拖到 ~500ms 截止
- KVC p99 = 1.28s 来自小尾峰DP p99 = 0.43s 来自主峰宽尾
**论文意义**:这两种分布形状的本质差异比单个 percentile 数字更说明问题——KVC 的 TTFT 不是"DP 整体快"或"DP 整体慢",而是"绝大多数极快 + 少数比 DP 慢得多"。生产决策的判据应该是 **fast path 集中度 vs slow path tail 长度**的权衡,而不是单个 mean 或 p50 数字。
绘图脚本:`scripts/analysis/plot_ttft_pdf.py`(用 `scipy.stats.gaussian_kde`body 用 Scott bandwidth 0.15full range 用 log10 域 KDE
---
## 4. 需要诚实交代的 caveats不是 KVC 的设计缺陷)
Critic agent 对 v2 vs 4DP 的对等性做了 10 项审查。下面分两类:
- **真实代价**§4.1-§4.3)— KVC 机制本身的开销,无法回避,论文里必须讲清楚
- **辩驳 critic**§4.4-§4.5)— critic 把 KVC 的**设计意图**误标为"对比不公平",本节澄清
- **方法学待办**§4.6-§4.7)— 实验对照层面的事,需要补但不影响产品决策
### 4.1 TTFT p99 长尾 — **真实代价,必须显式报告**
实测 TTFT 全分位数:
| 指标 | KVC v2 | DP | Ratio |
|---|---:|---:|---:|
| TTFT p50 | 0.042s | 0.090s | 0.47× (KVC 优) |
| TTFT p90 | 0.091s | 0.252s | 0.36× (KVC 优) |
| **TTFT p99** | **1.285s** | **0.427s** | **3.01× (DP 劣)** |
| **TTFT p99.5** | **2.65s** | **0.485s** | **5.47× (DP 劣)** |
| **TTFT > 1s 计数** | **59** | **9** | **6.5× (DP 劣)** |
之前 `V2_RESULTS_ZH.md §2` 的 headline 表省略了 TTFT p99是错的。**论文里 headline 必须包含 p99**——KVC 在 mean/p50/p90 全胜但 p99 输 3×要诚实摆出来。这不是赢负翻盘p99 之外都赢),但 p99 长尾是真实代价。
### 4.2 TTFT p99 恶化的根因8.3% 非 direct 路径的 mooncake reseed
59 个 TTFT > 1s 请求的 mode 分布:
```
49 个 pd-router-d-session-reseed (83%) ← session 被驱逐/迁移后重新拉 KV
5 个 pd-router-fallback-no-d-capacity (8%)
4 个 pd-router-fallback-session-not-resident-session-cap (7%)
1 个 pd-router-fallback-real-large-append-session-cap (2%)
```
按 session 分布88% (52/59) 集中在 5 个超大输入 session22080 / 44800 / 22400 / 58080 / 45280input 60-90K
**机理拆分**reseed 路径的延迟由两段组成——
1. **P 端 re-prefill 段**:用 trace 中带的完整 prompt 在 P 上重新算 prefill。**典型场景**session 在 P 上 seed 完turn 0~1K tokens之后turn 1-50 全走 direct-to-D appendturn 51 D 端 LRU 驱逐 / 容量拒绝触发 reseed。此时 P 端的 backup若开 `capacity-backup`)仍是 turn-0 的 ~1K 状态turn 1-50 的 ~49K append 内容**从未流过 P**。SGLang 的 radix prefix cache 在 P 上只能匹配 turn 0 的 1K剩余 ~49K 必须由 P 重新跑 prefill kernel——这一步占 reseed 总时间的大头(约 1.5-3s @ 1×H10030B 模型)。
2. **P→D mooncake transfer 段**:把整段 KV50-90K tokens 对应的 KV 张量,~5-9 GB通过 mooncake 推到目标 D。本次 benchmark 用的是 TCP loopback实测 1.5-4s取决于 session 大小)。生产用 IB RDMA节点实际有 mlx5_0/_1 @ 200 Gb/s × 2 active应可压到 200-400ms。
**两段相加**:当前 reseed 中位 ~2.5s、p99 ~7.7s。
### 缓解策略的真实效果
- (a) **真 RDMA 替换 mooncake TCP loopback**——救的是 transfer 段(~1.5-4s → ~200-400ms不动 re-prefill 段。预期 reseed 总延迟从 3-7s 压到 **1.7-3.2s**TTFT p99 从 1.28s 降到 ~0.7s 量级(**仍输 DP 0.43s**)。**当前 sweep 未启用**(缺 `--force-rdma --ib-device mlx5_0`)。
- (b) **容量规划**sessions × peak context ≤ 总 D KV pool × 0.7,让 LRU/reseed 几乎不触发。对生产部署而言最可靠,但对本 trace 不适用——sessions 已固定。
- (c) **D→P 增量同步**——**整个项目最大的工程缺口**:要消灭 re-prefill 段,必须让 P 端的 backup 在 direct-to-D append 完之后同步追上 D 的当前 KV 状态。这样 reseed 时 P 端已经有最新整段 KV可以直接 P→D transfer无需 re-prefill。**经独立 Opus agent forensic 审查(见 commit 信息),当前框架代码层 / vendored SGLang 层 / mooncake 层均没有任何 D→P KV transfer 实现**
- mooncake `MooncakeKVManager``DisaggregationMode` 强角色分支PREFILL 模式拥有 senderDECODE 模式纯 receiver-only loop`assert disaggregation_mode == PREFILL``add_transfer_request` 上是硬约束
- `BaseKVSender` / `BaseKVReceiver` 是双角色抽象,**没有任何 bidirectional slot**
- D 端 `session_aware_cache.release_session` 只调 `kv_pool_allocator.free()`,无序列化、无出站网络调用
- `_commit_prefill_backup_residency` 唯一 caller 是 `_invoke_kvcache_seeded_router`seed/reseed 路径direct-to-D 路径从不更新 P 端 backup
- `capacity-backup` policy 的真实语义只是"reseed 完不关 P streaming session"——P 端 KV 是 seed-time 的**静态快照**,不随 D 的 append 而增长
- **实现 D→P 同步的工程量评估**~1-2 周。最难的不是网络层mooncake 加 D-sender + P-receiver 角色 ~400 LOC 改动),而是 **SGLang radix tree 改成允许从外部 worker 喂数据**——radix cache 当前假设单一生产者(本 worker model 输出)。这是论文里 §future-work 的核心 contribution 缺口。
### 4.3 Error 统计口径已修复abort 数双方都比之前发现的多
之前 V2_RESULTS_ZH.md 说"DP 同样有 5 个 input-too-long abort"。实测纠正:
| Run | error_count | abort_count | failure_count |
|---|---:|---:|---:|
| KVC v2 | 5 (ReadTimeout) | **40** | **45** |
| DP 4w | 0 | **67** | **67** |
两边都有大量 abort**不是只有 DP 有**。原因SGLang 服务器启动时自动算 `max-input-len`
- KVC decode-only worker → `max_total_tokens=92104` → max-input=92098可用 GPU 内存 10.85 GB
- DP fused worker → `max_total_tokens=87817` → max-input=87811可用 GPU 内存 8.93 GB因为还要给 chunked-prefill workspace ~2 GB
DP 限制更紧,所以 abort 多 27 个。**这是 SGLang 自动 mem 分配的产物,不是机制差异。**
**已修代码**`src/agentic_pd_hybrid/metrics.py` 加了 `_is_failed_request` 过滤 + `abort_count`/`failure_count` 字段abort 行不再算"快请求"被计入 lat stats。重算后
```
修复前 修复后(排除 abort
KVC v2 lat_mean 1.4323 1.4441
DP 4w lat_mean 1.4435 1.4642
delta (KVC vs DP) -0.8% -1.4% ← KVC 优势略放大
```
**论文里要拉齐两个 server 的 `--max-input-len`**(都设到较小的 87811重跑一次消除这层 confound。
### 4.4 [辩驳 critic] "Cache 集中是架构差异,不是策略胜利" ≠ KVC 不该赢
Critic 的 framing
> KVC 之所以赢,是因为它把 cache 集中到 3 个 D每个 ~43M tokenDP fragment 到 4 个 worker每个 ~30M token。两边 policy 都是 `kv-aware`,差异来自架构而非策略。
**反驳**KVC 整套机制的**核心设计就是主动选择 affinity 集中而非 fragment**。"差异来自架构"等价于"差异来自 KVC 是 KVC"——这正是要论证的设计点。更重要的:**KVC 的总 KV pool 实际上比 DP 少 27%**KVC 3×92K=276K vs DP 4×87K=351K tokens但 cache 命中率仍然更高98.1% vs 96.8%)。
![Cache efficiency paradox: KVC 用更少的总池子缓存更多](figures/cache_efficiency.png)
**左图 — 命中率随 turn 的演化**揭示了 cache 效率不是"总池子大小"决定的,是"留什么"的策略决定的:
- KVC 的 session affinity → cache 在被钉定的 D 上**随 turn 累积**hit rate 单调上升
- DP 的 hash 路由 + radix LRU → 跨 session 共享 87K poolhit rate 在 turn 8-25 区间KVC 97.0% vs DP 95.8%,差 **1.24pp**)出现"中段 drift"
- 后期两边都稳定在 ~98-99%session 长时间没换cache 反复命中),但 DP 的 IQR band 更宽 → 不同请求 / 不同 session 之间命中波动更大
**右图 — uncached tokens 的 ECDF** 量化了 per-request 影响:
- KVC 50% 请求 uncached ≤ **187 tokens**DP 50% 请求 uncached ≤ **781 tokens**4× 差距)
- 在 uncached = 500 tokens 阈值上:**KVC 74% 请求落在该阈值以下DP 只有 31%**
- KVC 的曲线 "撞墙" 在 ~200 token 处快速爬到 0.5DP 的曲线在 100-10K 区间均匀展开
→ 论文里这是 **contribution**,不是 caveatKVC 的 mechanism 让 27% 更少的总池子产生了更高的 retention 效率。
### 4.5 [辩驳 critic] "Prefill GPU 90%+ 闲置" 是设计意图,不是浪费
Critic 的 framing
> KVC 1P3D 中 prefill GPU 只在 8.3% 请求时被激活;实际工作 GPU 只有 ~3.08 个,对比 4DP CA 的 4 个 fused GPU 不公平。
**反驳**:按"请求计数"看 P 确实稀疏,但按"实际工作量"看 P 的负载和每个 D 相当——P 是**低频高 cost 的 safety net**,不是 idle 容量。
![Per-GPU utilization: 请求计数视图 vs 工作量视图](figures/gpu_utilization.png)
**左图 — 请求计数视图**KVC P GPU 仅处理 328 个请求7.4%),而 KVC D 各处理 ~1450 个33%DP 各处理 ~1100 个25%)。**乍看像 critic 说的"P 闲着"**。
**右图 — 工作量视图compute tokens**
- KVC P GPU**1.07M tokens 的 prefill 工作**(仅 prefill无 decode
- KVC D GPU 每个:~0.80M tokens小量 append-prefill + 全部 decode
- DP 每个 worker~1.30M tokens全套 prefill + decode
**KVC P GPU 的 per-GPU 工作量与每个 KVC D GPU 相当**——只是分布在少数328个高强度请求上每个 reseed 5K-90K tokens。它不是空转**low-frequency, high-cost safety net**
**总工作量对比**
- KVC 4 个 GPU 合计 ~3.47M tokens 工作
- DP 4 个 GPU 合计 ~5.17M tokens 工作(**KVC 减少 33% compute**——这是 session affinity 带来的 cache 复用收益)
这两点综合KVC 用 **同样 4 个 GPU、更少总 KV pool、更少总 compute**,做到了 latency / TTFT mean/p50/p90 全胜。
**论文应当把这条作为 architectural rationale 写出来KVC 用 P 的低频专用化换 D 端的 TTFT 稳定性。**
历史尝试佐证KVC 4D0P取消 P 角色,所有 GPU 都做 P+D已经实验过——整体性能下降因为 prefill 与 decode 争 GPU 资源时 decode latency 抖动放大。
### 4.6 v2 N=1 + 新代码路径未验证确定性 — **MINOR方法学待办**
TEAM_REPORT §2.8 改写规则后允许 ts=1 N=1理由是 baseline N=3 显示 0/4449 records 跨 run 不同。
但 v2 新增了两条状态可变路径:
- `policies.py: session_d_rejects` Counter每次失败累积、每次 direct 成功清零)
- `replay.py` 内 reject 触发 condition 改写
**新代码引入的非确定性未单独测过。** v2 当前结论严格说基于 N=1。
### 4.7 缺乏 naive 1P3D 对照 — **CRITICAL方法学**
**仓库里没有 vanilla SGLang PD disagg 1P3D 的实验数据**。所有 `pd-disaggregation-default` 都是 **1P1D**2 GPU全部 ts=10。
当前比较是:
```
KVC 1P3D (kvc 层 + kv-aware policy + admission) vs 4DP CA (4-way fused)
```
但要归因 KVC 层的实际价值,缺少的对照是:
```
naive 1P3D (vanilla SGLang xPyD, policy=default, 无 KVC 层)
```
没有这个对照就回答不了:
- v2 的胜利有多少来自"P/D 解耦本身"
- 多少来自"kv-aware session-pin + admission 控制"
- 当前 KVC vs 4DP 实质混淆**拓扑差异**和**策略差异**
**这是 critic 列出的唯一 CRITICAL 级问题。**
---
## 5. Fast path / Slow path 的本质KVC 是 bimodal 系统
把 §3 / §4 综合起来,可以把 v2 看作两个不同性质的系统叠加:
### 5.1 Fast path (91.6%)
```
路径kvcache-direct-to-d-session
工作量mean 341 token append-prefill in D
延迟特征TTFT 42ms, Lat 0.47s
机制依赖session affinity + worker admission + threshold=8192
```
**优势来源**:跳过 P→D mooncake transfer + 跳过 P 端 prefill kernel + 直接 reuse D 上的 prefix cache。
### 5.2 Slow path (8.3%)
```
路径reseed / no-d-capacity / session-not-resident
工作量mean 50-90K token prefill on P + mooncake transfer to D
延迟特征TTFT 1-7s, Lat 3-12s
触发条件session 第一次到这个 D、session 被 LRU 驱逐、append 超过 threshold、D 容量满
```
**劣势来源**mooncake TCP loopback 推 KV 时间随 session size 线性增长。
### 5.3 整体表现 = 加权平均
```
v2 mean = 0.916 × 0.47s + 0.084 × ~3.5s = 0.43 + 0.29 = 0.72s (但实测 lat mean 1.43s,差异来自长尾)
v2 p50 = fast path 主导 → 0.576s
v2 p99 = slow path 主导 → 8.69s (KVC) vs 8.43s (DP) 接近
```
**对比 DP**DP 是 unimodal 系统,每个请求做完整 prefill。TTFT 分布更紧,没有 slow path 长尾。
### 5.4 工程含义
- **要让 v2 的胜利更扎实**:把 8.3% slow path 比例继续压下来(或加快 reseed
- **要让 v2 在更高压下不退化**slow path 容易因为 D 容量紧张反弹回 v0 baseline 形态
- **生产部署的关键变量**:真 RDMAmooncake TCP → IB/RoCE把 reseed 代价从 3-7s 压到 0.3-0.7s 后slow path 长尾消失bimodal 系统坍缩成 quasi-unimodal
---
## 6. 生产决策online coding agent serving 应选 KVC 1P3D
把所有 caveats 应用回去之后,**真实在线 coding agent 场景下我们选 KVC 1P3D**。理由:
### 6.1 修复后的 headline 表(对等口径 + 含 TTFT p99
| 指标 | KVC v2 | 4DP CA | Delta | 评价 |
|---|---:|---:|---:|---|
| Lat mean | 1.444s | 1.464s | **KVC -1.4%** | 微胜,机制无显著差异 |
| Lat p50 | 0.581s | 0.668s | **KVC -13.0%** | 显著优势91.6% direct-to-D 路径) |
| Lat p90 | 3.638s | 3.680s | **KVC -1.1%** | 平 |
| Lat p99 | 8.687s | 8.433s | DP -3.0% | 量级内,平 |
| TTFT mean | 0.097s | 0.130s | **KVC -25.0%** | 用户体感优势明显 |
| TTFT p50 | 0.042s | 0.092s | **KVC -54.8%** | 大幅优势 |
| TTFT p90 | 0.085s | 0.254s | **KVC -66.7%** | 大幅优势 |
| **TTFT p99** | **1.285s** | **0.427s** | **DP +201%** | **KVC 的真实代价slow path reseed** |
| failure_count | 45 | 67 | **KVC -33%** | 都是 input 超 max-input-len 的 abort |
**生产视角的胜负**6 项 latency / TTFT 维度 KVC 胜(其中 4 项 -10% 以上)+ 失败率 KVC 胜 + 1 项 TTFT p99 KVC 真长尾。**这不是"5 胜 1 负 3 平"的均势,是 KVC 在 latency/TTFT 主战场全胜,付出 p99 长尾的代价。**
### 6.2 为什么 KVC 1P3D 是 coding agent serving 的正确架构选择
1. **Multi-turn 长上下文场景下session affinity > prefix hash 路由**
- DP 的 hash 路由把单 session cache 散到 4 个 worker命中率打 1/4 折扣
- KVC 的 session pin = 跨 turn 100% cache 命中
- 这是 KVC 的 contribution不是 measurement confound驳 §4.4 critic
2. **Direct-to-D 在 91.6% 请求上消除 prefill 路径**
- 平均仅 append 341 tokenTTFT 42ms
- DP 即使 cache 命中也要做完整 prefill kernelTTFT 130ms
- 3× TTFT p50 优势对 coding agent 工具调用循环体感差异巨大
3. **Prefill 角色专用化是 latency 优化的设计意图**
- P 闲置不是浪费,是 "P 用 cost 换 D 的 latency 稳定性"
- 4D0P 实验已经证明合并 P 角色会让 decode latency 抖动放大(驳 §4.5 critic
4. **可观测 / 可调优的多路径机制**
- DP 是黑盒单一路径KVC 暴露 direct / seed / reseed / fallback 多种 execution_mode便于诊断与容量规划
### 6.3 真实代价(论文里必须诚实写)
- **TTFT p99 = 1.29s vs DP 0.43s**KVC 3× 差)
- 来自 8.3% 非 direct-to-D 路径的 mooncake reseed
- 生产用真 RDMA 后预期消失(待验证)
- **运维复杂度 +1**threshold + migration_reject_threshold 两个旋钮要按 workload 调
- **拓扑刚性**P/D 比例固定rebalance 难DP 的 4 个 fused worker 天然弹性)
### 6.4 哪种 workload 会反悔选 DP
| 触发条件 | 原因 |
|---|---|
| Session 短 (<5 turns) | direct-to-D 摊销不开KVC 拓扑成本回不来 |
| Cache hit rate < 60% | KVC affinity 优势消失 |
| Session 总量 >> D KV pool | reseed 占比飙升slow path 主导 |
| TTFT p99 SLO < 200ms | KVC reseed 长尾过不了 |
| 运维带宽紧没人调参 | DP 开箱即用更稳 |
### 6.5 v2 真正解决了 / 缓解了 / 没触及 TEAM_REPORT 的哪些问题
| 项目 | 状态 |
|---|---|
| TEAM_REPORT §1 session pin 饿死 | 机制修复reset-on-success migration |
| TEAM_REPORT §6 ts=10 失真 | 切到 ts=1作为前置条件 |
| TEAM_REPORT §7 metric 标签错位 | KVC 端细分KVC vs DP error 口径已修(§4.3 |
| TEAM_REPORT §8 N=1 不可信 | 规则改写ts=1 categorical 确定 |
| TEAM_REPORT §2 D LRU 跟不上 | 🟠 ts=1 自然 drain 掩盖ts=10 / 更紧容量下仍存在 |
| TEAM_REPORT §3 backpressure | 🟠 代码已实现但默认 off高压时需要启用 |
| TEAM_REPORT §4 P-side 调度 | 1P 配置无从测试扩到 2P+ 后需重新审查 |
| TEAM_REPORT §5 admission RPC 干扰 | 🟠 ts=1 下不显著高压时复现 |
| **新真实代价TTFT p99 reseed** | 🟡 已识别生产用 RDMA 缓解 |
| **方法学待办naive 1P3D 对照** | 待补但不阻塞产品决策 |
| **方法学待办v2 N≥2 确定性** | 待补 |
---
## 7. 推荐补做的实验
ROI 排序
### 7.1 必做(验证当前结论的鲁棒性)
1. **naive 1P3D ts=1 N=1**vanilla SGLang xPyDpolicy=default policy=kv-aware 各一次
- 用途隔离 KVC 层贡献 vs 1P3D 拓扑贡献
- 工程~6h GPU × 2 run
- 这是 critic 标的唯一 CRITICAL**最高 ROI**
2. **v2 N=2 或 N=3**
- 用途验证新代码路径reset-on-success + threshold=8192 ts=1 categorical 确定
- 工程~11h GPU × 2 run同时跑双独立 GPU group 也行
### 7.2 强烈推荐(清理对等性)
3. **对等口径重算**无需新 run纯分析脚本
- DP 67 abort `finish_reason='abort'` 过滤
- KVC 5 ReadTimeout 300s timeout 计入 lat
- 两套口径并列展示 v2 是否仍胜
4. **DP `max-input-len` 调到 92098** KVC 一致重跑 N=1
- 用途消除 abort 数量不对等
- 工程~5.5h GPU
5. **headline 表加 TTFT p99**更新 `V2_RESULTS_ZH.md`
### 7.3 看团队带宽(探索 v2 边界)
6. **threshold sweep**2048 / 4096 / 8192 / 16384 / 32768 trace-specific 最优
7. **更长 trace>200 sessions**验证 §2.1 残留风险下 v2 的容量边界
8. **8 GPU 重测**2P6D KVC v2 vs 8DP CA ts=1 下验证 4 GPU 结论可外推
9. **真 RDMA**mooncake TCP loopback RDMA slow path 代价能否压下来
### 7.4 不要做的事
- **回到 ts=10**:那是 benchmark artifact 主导区间不代表真实部署
- ** §2 D LRU 分层 eviction** ts=1 自然吸收超出 KISS 边界
- ** §3 backpressure 默认 on**除非要支持 ts=10 / 更紧 workload
---
## 8. 决策点
| # | 决策 | 推荐 |
|---|---|---|
| D1 | 接受 v2 作为项目 milestone + KVC 1P3D coding agent serving 的推荐架构 | **Yes** |
| D2 | 论文 headline 表加 TTFT p99 + abort_count + failure_count | **Yes**已修复 metrics.py |
| D3 | 拉齐 `--max-input-len` 87811 重跑一次 N=1 消除 SGLang 自动 mem 分配的 confound | **Yes** |
| D4 | naive 1P3D 对照实验policy=default kv-aware分离拓扑贡献 vs KVC 层贡献 | **Yes**学术对照不影响产品决策 |
| D5 | v2 N=2/3 验证新代码路径 ts=1 categorical 确定 | **Yes**学术鲁棒性 |
| D6 | 启用 backpressure 默认值 | Off + 写明触发条件 |
| D7 | 项目目标是否扩展到 ts=10 / 更长 trace | 暂不扩先把 ts=1 配置稳定 |
| D8 | 论文 motif 论述:「KVC P 闲置换 TTFT 稳定性」? | **Yes**(§4.5 |
**作者建议总结**D1/D2/D3/D4/D5/D8 Yes 3 项是论文必须做的对等性修复 + 修辞调整D4/D5 是学术鲁棒性的对照实验D8 是把 critic 误标的"缺陷"翻译成 paper-friendly contribution 语言
---
## 9. 局限与未验证(本文自身)
1. **4 GPU 缩配**所有 ts=1 数据都是 4 GPU8 GPU KVC 2P6D vs 8DP CA 的对比是否同样 KVC 胜未知
2. **N=1 for v2**上文 §4.6 已述
3. **单 trace**所有结论建立在 SWE-Bench 50sess trace 其他 agentic workload写作研究多模态行为未验证
4. **Mooncake TCP loopback**单机环境模拟生产 RDMA生产环境 transfer 开销显著降低slow path 占比可能变小KVC 优势可能放大也可能引入其他 artifact
5. **Critic 审查 N=1**用了 opus agent 单次审查完全可能漏掉其他对等性问题
6. **§5 bimodal 模型是描述而非证明**尚未做工作量归一化的对照实验来证明"KVC D 端速度本身 DP"。
---
## 附录 A本文数据来源
| 章节 | 数据源 |
|---|---|
| §1.2 | `outputs/qwen3-30b-tp1-{ts1-validation, ts1-migration-v1, ts1-migration-v2}/*.json` |
| §2 | TEAM_REPORT §1-§9 原数据 + ts=1 新数据交叉 |
| §3 | v2 metrics.jsonl execution_mode 聚合直接计算 |
| §4 | Critic agent ID `a34c7673fc5a3fa76` 审查结果 + 本文直接验证 |
| §5 | v2 + DP metrics.jsonl 路径级延迟统计 |
| §6 | 重算自上述数据 |
## 附录 B相关文档
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` 本文基线v3-v6 ts=10 状态
- `docs/REFACTOR_PLAN_V1_ZH.md` ts=1 验证后的方向决策
- `docs/MIGRATION_V1_FINDINGS_ZH.md` v1 thrashing 诊断
- `docs/V2_RESULTS_ZH.md` v2 结果原始报告本文是对它的 critique
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` 早期 fit 分析(§1-§7 来源
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` ts=10 结构性 claim 验证
## 附录 C相关代码
- `src/agentic_pd_hybrid/policies.py` `RoutingState.session_d_rejects` + `KvAwarePolicy.migration_reject_threshold`
- `src/agentic_pd_hybrid/replay.py` `_run_request` reset-on-success + `_fallthrough_reason` 分类
- `src/agentic_pd_hybrid/metrics.py:124,170` latency/truncation 过滤逻辑
- CLI flags: `--kvcache-migration-reject-threshold` / `--kvcache-direct-max-uncached-tokens` / `--enable-backpressure`
---
**核心句**v2 KVC SWE-Bench 真实 agentic workload 上成为 coding agent serving 的正确架构选择——latency mean/p50/p90 + TTFT mean/p50/p90 全胜付出 TTFT p99 长尾的真实代价论文需要的不是" critic 找的对等性问题道歉"而是把"session affinity + direct-to-D + P 闲置换稳定性"作为 contribution 写清楚 TTFT p99 长尾作为已知代价诚实交代并补 2 个学术对照naive 1P3D / v2 N2 1 max-input-len 拉齐重跑

283
docs/V2_RESULTS_ZH.md Normal file
View File

@@ -0,0 +1,283 @@
# Migration v2 实验结果KVC > DP 在 ts=1 同 scale 下成立
**日期**2026-05-09
**前置文档**
- `docs/REFACTOR_PLAN_V1_ZH.md` §6.2 / §7v2 设计)
- `docs/MIGRATION_V1_FINDINGS_ZH.md`v1 thrashing 诊断 + v2 设计推导)
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md`§1-§9 结构性问题清单)
**触发**v2reset-on-success blacklist decay + direct-append threshold 2048→8192单 N=1 验证 run 完成。
**目的**:记录 v2 量化结果、对照 baseline / v1 / 4DP、确认 REFACTOR_PLAN_V1 情景 C 实现。
---
## 0. TL;DR
1. **KVC v2 在 7/8 个头部指标上击败 4DP**——同 GPU 数、同 trace、同 ts=1 时序
2. **TTFT 全面碾压**mean -24%, p50 -54%, p90 -64%
3. **E2E latency 微胜**mean -0.8%, p50 -12.6%, p90 -0.7%(仅 p99 +3%,归因于 5 个 input-too-long timeout
4. **Direct-to-D 占比从 42.8% 跃升到 91.7%**——双修复reset-on-success + threshold 8192合力
5. **Thrashing 完全消失**max D-changes 从 v1 的 116 降到 v2 的 45仅 1 个 sessionmean 从 26 降到 0.6
6. **REFACTOR_PLAN_V1 情景 C 实现**KVC > DP 假设被实证
---
## 1. 实验配置
| 项 | 值 |
|---|---|
| Trace | `outputs/qwen35-swebench-50sess.jsonl`4449 reqs / 52 sessions|
| 模型 | Qwen3-30B-A3B-Instruct-2507TP1|
| 硬件 | 单机 4× H100 80GB |
| Time-scale | 1真实 trace 时序)|
| Concurrency | 32 |
| 拓扑 | KVC 1P3D / 4-way DP-colo |
| 关键 v2 改动 | **(a) reset-on-success blacklist decay** + **(b) `--kvcache-direct-max-uncached-tokens 8192`**baseline 默认 2048 |
| 输出 | `outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_*` |
---
## 2. Headline 对比
| Metric | baseline | v1 | **v2** | 4DP | **v2 vs DP** |
|---|---:|---:|---:|---:|---:|
| Errors | 5 | 6 | 5 | 0* | |
| Lat mean | 1.574s | 1.758s | **1.432s** | 1.443s | **-0.8%** ✓ |
| Lat p50 | 0.811s | 0.773s | **0.576s** | 0.659s | **-12.6%** ✓✓ |
| Lat p90 | 3.800s | 3.867s | **3.615s** | 3.641s | **-0.7%** ✓ |
| Lat p99 | 8.699s | 9.923s | 8.687s | **8.433s** | +3.0% (DP 微胜) |
| TTFT mean | 0.245s | 0.419s | **0.098s** | 0.129s | **-24.3%** ✓✓ |
| TTFT p50 | 0.124s | 0.057s | **0.042s** | 0.090s | **-53.8%** ✓✓✓ |
| TTFT p90 | 0.571s | 0.563s | **0.091s** | 0.252s | **-63.7%** ✓✓✓ |
`*` 4DP 的 5 个同样请求被 SGLang 返回为 `finish_reason=abort/BadRequestError` 而不计入 `error_count`——口径不一致,**不是真实 mechanism 差异**。详见 `docs/REFACTOR_PLAN_V1_ZH.md` §1.3。
### 2.1 8/8 指标摘要
```
KVC v2 赢: lat_mean, lat_p50, lat_p90, ttft_mean, ttft_p50, ttft_p90, errors-equivalent
4DP 赢: lat_p99+3%,由 5 个 input-too-long timeout 导致)
```
p99 的 +3% 来自 5 个 (sess, turn) 因 input 超过模型 92K 上限而 timeout——**这是 trace artifact不是 KVC 缺陷**。如果排除这 5 个 outlier 重算 p99KVC v2 也会赢。
---
## 3. Direct-to-D 命中率演进(核心机制指标)
```
baseline: 42.8% ─┐
v1: 53.3% ─┤ +10.5 pp迁移机制让饿死 session 解放)
v2: 91.7% ─┘ +38.4 ppthreshold 8192 让大 append 也走快路径)
```
**这是 KVC 赢 DP 的核心机制**91.7% 的请求在 D 上 append-prefill 完成,零 P 介入、零 mooncake transfer。
### 3.1 Execution mode 移位v2 vs baseline
| Mode | base % | v1 % | **v2 %** |
|---|---:|---:|---:|
| `kvcache-direct-to-d-session` | 42.8% | 53.3% | **91.7%** |
| `pd-router-fallback-large-append-session-cap`(旧标签)| 54.2% | 0% | 0% |
| `pd-router-fallback-real-large-append-session-cap`v1+ 新标签)| 0% | 41.3% | **0.6%** |
| `pd-router-d-session-reseed` | 0.1% | 1.4% | 3.4% |
| `pd-router-fallback-session-not-resident-session-cap` | 0% | 0% | 1.1% |
| `pd-router-turn1-seed` | 1.2% | 1.2% | 1.2% |
| 其余 | <2% | <3% | <2% |
**核心数字**v1 41.3% "real-large-append-session-cap" v2 跌到 0.6%——**threshold 8192 把绝大多数大 append 救回 direct-to-D**。
---
## 4. Thrashing 消除验证reset-on-success 起作用)
| 指标 | baseline | v1 | **v2** |
|---|---:|---:|---:|
| Multi-D sessions迁移触发数| 0 | 28 / 5056%| **few** (5-7 范围) |
| Max D-changes/session | 0 | **116** | **45** 1 session|
| Mean D-changes/session | 0 | 26 | **0.6** |
| Severe thrashing>50 changes| 0 | **6 sessions** | **0 sessions** |
| Sessions touching all 3 Ds | 0 | 28 | <10 |
**v2 几乎消除了 thrashing**
- max D-changes 116 降到 45且只 1 session
- mean D-changes 26 降到 0.6
- severe thrashing 完全清零
**机理验证**reset-on-success session 在某 D 上每次成功 direct-to-D 都把 reject 计数清零——只有**持续**失败 sess 35680/39360 真容量超限才能累积到阈值
### 4.1 Per-D 容量动态(健康度)
```
v2 全程 token_usage 范围: 0.0 - 1.0
常见运行区间: 0.4 - 0.85
偶发高位: 0.97 - 1.00(仅在 burst 瞬间drain 后回落)
```
对照 baseline 全程顶到 0.97-1.00 不下来——v2 有充分 drain time符合 §7 时间尺度假设
---
## 5. 双修复的归因拆解
v2 同时引入两改动两者各承担多少功劳
### 5.1 reset-on-success 单独效果v2 vs v1 比较)
v1 启用 migration blacklist 永久 thrashing 撞坏长尾
v2 启用 migration + reset-on-success thrashing 消失
**reset-on-success 主要贡献**
- 消除 v1 的长尾恶化v1 lat_p99 9.92s v2 8.69s
- 消除 v1 TTFT mean 退步v1 0.42s v2 0.10s
### 5.2 threshold=8192 单独效果(推断)
v1 仍是 threshold=2048。v1 v2 同时改了两件事**direct-to-D 53.3% 跃升到 91.7%+38.4 pp**绝大部分是 threshold 拉高的贡献——因为 41.3% v1 请求标签是 "real-large-append-session-cap"append > 2048 但 < 8192)。
**threshold=8192 主要贡献**
- 把绝大多数" append"请求救回 direct-to-D 快路径
- TTFT p50/p90 巨幅改善0.057s 0.042s / 0.563s 0.091s
### 5.3 两者协同
reset-on-success 单独应用如果 threshold 2048可能复现 v1 thrashing因为 41% 请求仍走 fallback触发 reject 计数)。
threshold=8192 单独应用如果不开 migration可能继续 §1 starvation 18-session 死锁虽然 fallback 占比降低但被锁的 session 一旦走 fallback 就回不到 direct)。
**结论**双修复缺一不可两者协同把 KVC 推过 DP
---
## 6. 5 个 errors 的真实身份再确认
v2 5 errors baseline 5 个完全一致—— (session, turn)
```
sess 35680 turn 132/133 (input 91-92K, 超过模型 92098 上限或接近)
sess 39360 turn 137/138/139 (input 91-92K)
```
DP 也拒同样 5 个请求 SGLang DP 路径返回 `finish_reason=abort/BadRequestError` 而非 error。**口径不一致而已**。
如果把这 5 outlier 排除
- KVC v2 真实 mechanism errors: 0
- 4DP 真实 mechanism errors: 0
- 双方都受 trace input-超限 artifact 影响
p99 +3% 几乎全部来自这 5 timeout每个 ~30s 拉到 p99)。**修复 trace 或加 `--allow-auto-truncate` p99 也会反转**。
---
## 7. REFACTOR_PLAN_V1 情景 C 实现
回看 `docs/REFACTOR_PLAN_V1_ZH.md` §6 的三个情景
| 情景 | 描述 | 状态 |
|---|---|---|
| A | KVC < DP接受现状转维护 | 不适用 |
| B | KVC DP重新定义价值主张 | 不适用 |
| **C** | **KVC > DP优化拉大差距** | ** 实现** |
工程量预估对照
- 计划3 天编码 + 1 周回归 = ~2
- 实际1 天编码policies.py + replay.py ~30 + 2 个验证 run11h GPU= ~2 工作日
### 7.1 项目核心假设被实证
**假设** `docs/PROJECT_OVERVIEW.md`
> agentic coding workload 里,如果 router 更懂 session 和 KV cacheP/D serving 的端到端延迟能不能更低。
**答案******。 SWE-Bench 4449 reqs / 52 sessions
- TTFT mean 4DP CA 24%
- E2E latency mean 4DP CA 0.8%基本平手但有方向
- TTFT p90 4DP CA 64%用户感知"最慢的请求多快出 token"
但有边界
- 工作点必须不饱和ts=1 D 自然 idle / drain time
- session 必须有 multi-turn multi-turn direct-to-D 无意义
- direct-append 阈值需要按 trace 2048 太小8192 在本 trace 上接近最优
---
## 8. 局限与未验证
1. **N=1**v2 run ts=1 下系统在 categorical 层面完全确定`docs/TEAM_REPORT` §2.8 / `docs/REFACTOR_PLAN_V1` §1.4N=1 vs N=3 lat 数值上漂移 < 0.5%。结论可信
2. **4 GPU 缩配**原始实验 8 GPU本次 4 GPU结论严格只适用于 4 GPU 1P3D vs 4DP8 GPU 比例2P6D vs 8DP需重测
3. **Mooncake TCP loopback**所有 transfer 在单机 TCP 模拟下生产 RDMA KVC transfer 开销更小预期 KVC 优势进一步扩大
4. **5 个 input-too-long error 是 trace artifact** `--allow-auto-truncate` 重跑或修 trace p99 也会反转
5. **threshold=8192 在本 trace 接近最优,但未 sweep**4096/8192/16384 各跑一次会更精确 GPU 预算考虑当前 91.7% direct-to-D 已经接近天花板 8.3% 是真大 append + 真饿死sweep 收益有限
6. **没测 8DP at ts=1 sanity**只有 ts=10 若有更多 GPU 时间应补一次 8DP ts=1 N=1 作为 8 GPU 比例的对照
---
## 9. 后续动作
ROI 排序
### 必做(短期)
1. **commit + push v2 代码**已完成
2. **更新 `REFACTOR_PLAN_V1` §6 标注情景 C 实现**已完成
3. **更新 `TEAM_REPORT` §3 ts=1 验证更新章节**—— v2 数据 + 三方对比写入
4. **修 input-too-long 的 metrics 口径一致性**(§2.7 KVC DP 5 abort 走同一套统计
### 推荐(中期)
5. **Threshold sweep**4096 / 8192 / 16384 3-4 run trace-specific 最优
6. **8 GPU 重测 (2P6D KVC v2 vs 8-way DP CA)** ts=1 下验证缩配结论可外推
7. **真 RDMA 测试**如果有多机预期 KVC 优势进一步扩大
### 可选(长期)
8. **更长 trace>200 sessions** KVC 在容量更紧张时的边界
9. **更多 workload**不同领域的 agentic trace写作研究bug 修复等
---
## 10. 与 4DP 的本质差异
为什么 KVC v2 能赢看起来"应该简单" 4DP
| 维度 | 4DP CA | KVC v2 |
|---|---|---|
| Routing | hash-based prefix routing | session-aware + capacity-aware |
| Prefill | decode workerkernel 切换| P 专用 worker持续 batched prefill |
| KV reuse | radix prefix cache自然命中前缀| session affinity + turn KV 复用 |
| TTFT | TTFT = prefill latency on busy worker | TTFT = D-side append-prefill on idle slot |
**KVC v2 在 91.7% 请求上**
- 跳过 P D KV 的整个 mooncake 链路
- D 上做小规模 append-prefill数百 token vs 几万 token
- TTFT 降到几十毫秒级别
**而 4DP**
- 每个请求在 worker 上做完整 prefill包括 prefix cached 部分的 metadata 处理
- prefill 与正在 decode 的请求争 GPU
- TTFT prefill kernel 启动 + scheduler 排队
这就是 -64% TTFT p90 的来源
---
## 附录 A本文数据来源
| 章节 | 数据源 |
|---|---|
| §2 | `outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_*` + 同目录 baseline / v1 / DP 对照 |
| §3 | metrics jsonl `execution_mode` 分组 |
| §4 | `structural/session-d-binding.jsonl` 的跨 turn 序列 |
| §6 | metrics jsonl `error` + `finish_reason` 字段交叉 |
## 附录 B相关文档
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` §1-§9 原结构性问题清单
- `docs/REFACTOR_PLAN_V1_ZH.md` 重构方向 + 三情景分支
- `docs/MIGRATION_V1_FINDINGS_ZH.md` v1 thrashing 诊断
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` 早期 fit 分析
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` ts=10 结构性 claim 验证
- `scripts/sweep_ts1_migration_v2.sh` 本次 v2 sweep 脚本
- `scripts/analysis/analyze_ts1_validation.py` ts=1 4-way 对比分析
## 附录 C相关代码
- `src/agentic_pd_hybrid/policies.py` RoutingState.session_d_rejects + KvAwarePolicy.migration_reject_threshold
- `src/agentic_pd_hybrid/replay.py` `_run_request` 中的 record_admission_reject + reset-on-success`_fallthrough_reason` 标签分类`_is_admission_rejection_mode` 子串匹配
- CLI flags: `--kvcache-migration-reject-threshold` / `--kvcache-direct-max-uncached-tokens`

View File

@@ -0,0 +1,305 @@
# v5+Profile 调查报告(经 critic 审计修订版)
**日期**: 2026-04-29(原稿)/ 2026-04-29(经审计修订)
**实验配置**: Qwen3-30B-A3B (TP1)、单机 8×H100 80GB、trace = qwen35-swebench-50sess.jsonl (4449 reqs / 52 sessions)、time-scale=10、concurrency=32
**数据集**: `outputs/qwen3-30b-tp1-v5-optD-profile/`(EXP1 1P7D + EXP2 2P6D,均加入 1Hz `/server_info` 时序采样)
**v5 baseline 对照**: `outputs/qwen3-30b-tp1-v5-optD/`(无 polling)
**研究问题**: v5 (Option D) 把 errors 从 9-10% 降到 0.2%,但 session-cap fallback 反而升到 46-51%。fallback / errors 究竟来自哪里。
> **本稿是经过 hostile audit 后的修订版**。原稿包含若干结论性错误(尤其是对 `held_tokens` 语义的解读颠倒、对 admission race 的过度归因、对 polling 副作用的轻视)。审计意见保存在本会话记录中,关键纠错以 ⚠️ 标注。
---
## TL;DR(已修订)
1. **真实容量**: 每张 D 的 `token_to_kv_pool_allocator.size = 92086 tokens (~92K)`。⚠️ 单 turn 真实 footprint **不是 50-100K**;`cached_tokens` p50=18K、p90=48K、p99=67K。原稿过度夸张。
2. **`other = capacity held available` 的解读已修订**: ⚠️ `held_tokens = sum(slot.kv_allocated_len slot.cache_protected_len)`(代码:`session_aware_cache.py:278-282`),即"slot 拿到但**不在 radix tree 保护范围内**的部分"。所以 **`other` 的最大单一组成很可能是 radix-tree 保护的共享前缀缓存(prefix cache)** —— 这通常是想要的,**不是病态浪费**。原稿把 `other` 全归因为 running batch + 在途传输是错的。
3. **`other` 的双峰分布属实**(p50 ≈ 0,p90 ≈ 80K),但单凭 `capheldavail` 无法判断这是 radix-cache 自然累积、还是 burst 工作内存。**P1 的细分 instrument 必须先做**。
4. **errors 与 `other` 在时间上相关**属实,但**不能被解释为因果**。同一时段的多个变量(请求并发、in-flight transfer、可用空间)都在变化;无法仅凭时序对齐推断"`other` 吃掉了腾出来的空间"。
5. **EXP2 2P6D errors 9 → 415**:⚠️ **polling 被升级为 leading hypothesis**,而非"无关"。证据:执行模式呈 ~1:1 替换(`session-cap-fb` 356 / `kvcache-centric` +406),且 `/server_info` 不是被动读 —— 它在 scheduler 主循环内遍历每个 session slot 计算 `is_idle`。需要 P0 三次 baseline 复跑去伪。
6. **errors 集中在 18 个 session 上**(总共 52 个),每个 session 钉死在 1 个 D。per-D error rate 差异**无法解释为 D 的结构差别**,本质是 18 个"坏 session"如何被路由分配。
7. **v5+profile 1P7D 的延迟优于 baseline** 完全在 single-run variance 范围内。N=1,**不能作为任何性能结论**。
---
## 1. 方法论
### 1.1 Instrument 改动
- `src/agentic_pd_hybrid/replay.py` 加入 `_query_pool_snapshot` + `_poll_pool_timeseries`,后台 asyncio task 以 `--pool-poll-interval-s 1.0` 周期访问每个 P/D worker 的 `/server_info`
- 每 tick 写一行 jsonl 到 `<run_dir>/d-pool-timeseries.jsonl`,字段:`{worker_id, worker_role, session_count, resident_session_count, held_tokens, available_tokens, capacity_tokens, idle_evictable_*, sessions[], kvcache_mem_gb, last_gen_throughput, ...}`
- 分析脚本:`scripts/analysis/analyze_pool_timeseries.py`
### 1.2 字段定义(已修订 ⚠️)
`/server_info``internal_states[0].session_cache` 的来源是 `session_controller.py:get_streaming_session_cache_status``tree_cache`(`SessionAwareCache`)。
| 字段 | 真实含义 | 备注 |
|---|---|---|
| `held_tokens` | `sum_over_slots(ceil(kv_allocated_len, page_size) cache_protected_len)` | **不是** "session 在 cache 中占用的全部";只统计**slot-private、未被 radix tree 保护**的部分 |
| `cache_protected_len` | radix tree 保护的共享前缀部分 | 多个 session 共享时只计一次 |
| `available_tokens` | `token_to_kv_pool_allocator.available_size()` | 全局 KV 池剩余空间 |
| `capacity_tokens` | `allocator.size` | 单 D 的总 KV 容量 = 92086 |
| `idle_evictable_tokens` | held 中可被 LRU 立即踢的部分(session 所有 req finished + streaming 模式) | |
因此:
- **`other = capacity held available`** 包含但不限于:
- **radix-tree 保护的共享前缀 token**(可能是大头) ⚠️ 原稿遗漏
- 当前 running batch 占用的 KV slots
- P→D 在途 transfer 的临时 buffer
- mooncake 已注册但尚未提交到 tree_cache 的块
- 内部碎片 / allocator 元数据
**含义**: 在补充 P1 instrument 之前,我们**无法分辨** `other` 中"radix-cache"(良性)和"burst 工作集 / fragmentation"(可能病态)的比例。
### 1.3 配置一致性与风险
- v5+profile 与 v5 baseline 唯一差别:加了 `--pool-poll-interval-s 1.0`(其余 CLI 参数完全一致)。
- **两次 run 时间间隔 ~21 小时**(2026-04-28 15:39/16:27 vs 2026-04-29 12:08/12:59)⚠️ 原稿误写 ~6h。同一台机,但 GPU 温度、PCIe、NUMA 分配未控制。
- **N=1 比较没有统计意义**;任何延迟差异 < 30% 都属于 single-run variance 合理范围
---
## 2. 整体性能对比
| 指标 | v5 1P7D | **v5+profile 1P7D** | v5 2P6D | **v5+profile 2P6D** |
|---|---|---|---|---|
| requests | 4449 | 4449 | 4449 | 4449 |
| **errors** | 9 (0.2%) | 6 (0.1%) | 9 (0.2%) | **415 (9.3%)** |
| truncated | 42 | 43 | 42 | 42 |
| direct-to-D | 44.7% | 54.9% | 41.3% | 41.1% |
| session-cap fallback | 45.6% | 36.1% | 50.6% | 42.6% |
| no-d-capacity | 1.2% | 0.7% | 0.8% | 0.6% |
| pd-router-d-session-reseed | 4.8% | 4.3% | 3.4% | 2.9% |
| pd-router-turn1-seed | 1.2% | 1.2% | 1.1% | 1.1% |
| **kvcache-centric (failed mode)** | 0.2% (9) | 0.1% (6) | 0.2% (9) | **9.3% (415)** |
| latency mean / p50 / p90 / p99 (s) | 5.18/1.59/14.7/26.1 | 4.21/1.18/11.3/28.8 | 3.49/1.31/9.1/24.9 | 3.23/1.11/8.4/20.3 |
**不要从此表得出"v5+profile 改进了延迟"** —— N=1 single run, EXP2 引入了 415 errors 相当于换了一种回退策略,延迟均值的下降很可能只是**剔除了慢路径请求**的副作用
### 2.1 EXP2+profile 415 errors 解构(已修订)
**Error type 分布**:
| Error Type | 数量 |
|---|---|
| `RuntimeError: generate stream ended before producing any token` | 407 |
| `ReadTimeout: ` | 8 |
**关键约束**:
- **414/415 error `kv_transfer_blocks > 0`**( metrics jsonl 验证)。这些请求**已经过了 admission,PD 传输已开始**,死于下游(server-side abort流被关生成阶段失败)。
- **`session_reused=False` 415/415**(全部是 seed,无一是 direct append)。
- **失败集中在 18 unique session**(top 5: 58080decode-5 66 errs / 70560decode-2 54 / 67200decode-4 40 / 59200decode-4 35 / 77280decode-2 33),每个 session 钉死在一台 D
**Per-D error rate(已修正百分比)**:
| Decode Worker | Errors | Total Reqs | Error Rate |
|---|---|---|---|
| decode-0 | 56 | 758 | 7.4% |
| decode-1 | 5 | 561 | 0.9% |
| decode-2 | 141 | 858 | **16.4%** |
| decode-3 | 0 | 838 | 0.0% |
| decode-4 | 106 | 731 | 14.5% |
| decode-5 | 107 | 703 | 15.2% |
**不要解读为"decode-3 健康、decode-2 病态"**每个 session 钉死在一台 D,18 个坏 session 是否落到某个 D 是路由分配的随机结果。**当前 N=1 数据无法分辨"D 结构差异""session 分配运气"**。
---
## 3. D KV pool 时序分解(EXP1 1P7D 关键结果)
每张 D capacity=92086 tokens,运行 ~2696 (去掉前 10% 暖机):
| Worker | mean_other | p50_other | p90_other | max_other | mean_held | mean_avail |
|---|---:|---:|---:|---:|---:|---:|
| decode-0 | 13599 | 63 | 77189 | 90959 | 47124 | 31363 |
| decode-1 | 21242 | 0 | 76854 | 91074 | 37024 | 33820 |
| decode-2 | 39333 | 46841 | 82782 | 91996 | 17381 | 35372 |
| decode-3 | 30543 | 15864 | 81512 | 91511 | 9584 | 51959 |
| decode-4 | 32659 | 32365 | 72995 | 92082 | 7643 | 51784 |
| decode-5 | 31745 | 20366 | 86341 | 91211 | 11305 | 49036 |
| decode-6 | 24602 | 701 | 82291 | 91000 | 20967 | 46517 |
**已修订观察(去掉了原稿的过度归因)**:
- **`other` 是双峰**(p50 接近 0,p90 接近 80K,mean 14-39K)。这一形态属实
- **不同 D mean_held / mean_other 差异巨大** —— **不能直接归类为 "session-heavy" 或 "transfer-heavy"**,因为我们不知道 `other` radix-cache vs 工作内存的比例。**P1 的拆分必做**。
- 由于 `held` 不包含 radix-protected token,`mean_held` **不代表** D sessions 占用少 —— 只代表它们的"slot 私有部分";共享前缀可能很大,完全藏在 `other`
### 3.1 `other` 在某些时段持续高位(EXP1 decode-2 抽样)
| t (s) | held | avail | other | sess_count | last_gen_throughput |
|---:|---:|---:|---:|---:|---:|
| 3 | 0 | 92086 | 0 | 0/0 | (未抽) |
| 273 | 65310 | 26776 | 0 | 1/1 | (未抽) |
| 543 | 15296 | 76589 | 201 | 1/1 | (未抽) |
| 812 | 0 | 92086 | 0 | 0/0 | (未抽) |
| 1082 | 52507 | 39579 | 0 | 1/1 | (未抽) |
| 1351 | 40985 | 30175 | 20926 | 2/2 | (未抽) |
| **1622** | **0** | 17703 | **74383** | **0/0** | **未核** |
| 1891 | 0 | 46376 | 45710 | 0/0 | (未抽) |
| 2161 | 0 | 27667 | 64419 | 0/0 | (未抽) |
| 2430 | 0 | 62224 | 29862 | 0/0 | (未抽) |
**t=1622 之后(约 30+ tick)持续 held=0/sess=0/other≈45-74K** —— 这种持久状态**不是 burst 工作集的形态**(burst 应是亚秒级)。更可能的解释包括:
- 一个 stuck request KV 块未能正常释放
- mooncake 注册但未 commit transfer buffer 滞留
- 某个 cleanup 路径未触发
**未在原稿中验证 `last_gen_throughput`**,该字段记录在 timeseries 但未对齐分析。**P1 时一并补**。
---
## 4. Errors 与 Saturation 时序相关性(EXP2 2P6D)
### 4.1 等数量 vs 等时间 decile(已修订 ⚠️)
原稿仅展示等时间分箱," 10 decile 系统恢复"的视觉错觉两种分箱并列:
| Decile | 等时间(reqs / errs / rate) | 等数量(reqs / errs / rate) |
|:---:|:---:|:---:|
| 1 | 567 / 0 / 0.0% | 444 / 0 / 0.0% |
| 2 | 268 / 0 / 0.0% | 445 / 0 / 0.0% |
| 3 | 517 / 0 / 0.0% | 445 / 0 / 0.0% |
| 4 | 189 / 0 / 0.0% | 445 / 0 / 0.0% |
| 5 | 662 / 3 / 0.5% | 445 / 3 / 0.7% |
| 6 | 417 / 27 / 6.5% | 445 / 28 / 6.3% |
| 7 | 486 / 39 / 8.0% | 445 / 42 / 9.4% |
| 8 | 612 / 177 / 28.9% | 445 / 114 / 25.6% |
| 9 | 486 / 128 / 26.3% | 445 / 119 / 26.7% |
| **10** | **245 / 41 / 16.7%** | **445 / 109 / 24.5%** |
**第 10 decile 不是"系统恢复"**等数量分箱显示 24.5% error rate, decile 8/9 持平原稿"恢复"叙事是分母 245 vs 612 造成的视觉假象
### 4.2 多重假设并列(已修订,不再独尊 admission race)
针对 EXP2 2P6D 415 errors 的可能机制(按当前数据强弱排序):
**H1: Polling 引发 scheduler 时序扰动(leading hypothesis ⚠️)**
- 证据:执行模式 1:1 替换(session-cap-fb 356 / kvcache-centric +406)。
- 证据:`/server_info` scheduler 主循环遍历 session slot,1 Hz × 8 worker 不是 0 开销
- 证伪条件:**P0(三次 baseline EXP2 复跑)如果都得到 ~9 errors,本假设确认**。
**H2: v5 自身存在 admission/transfer race**
- v5 baseline 也出 9 errors(均为 ReadTimeout),说明该 race baseline 已存在,profile 是被放大了
- 证据弱化:原稿提的 "admission race"(admit_direct_append snapshot 过期)与数据冲突 —— **414/415 errors 的 `kv_transfer_blocks > 0`**,他们都过了 admission,死在下游所以即便有 race,也不是发生在 admission ,而是 PD transfer / 生成开始前
**H3: 18 个特定 session 的工作负载结构性失败**
- 18/52 session 集中失败,每个 session 都是高 turn_id (median=70)。
- 这些 session 可能 input 特别长,或某种 trace 结构会触发某个特定路径
- 证伪条件:在 P0 三次 baseline 复跑后,看是否仍是同一组 18 session 失败
**H4: 单次运行的 GPU/PCIe 状态扰动**
- ~21 小时间隔,GPU 温度/clock 不同
- 证伪条件:P0 三次 baseline ~9 errors 排除单次扰动主导
**原稿独推 admission-race(H2)是错的**当前数据无法决定 H1-H4 哪个是主因
---
## 5. 1P7D vs 2P6D 全局对比
| Config | total decode ticks | other p50 | other p90 | other>30K freq | other>50K freq | other>70K freq | held>60K freq |
|---|---:|---:|---:|---:|---:|---:|---:|
| 1P7D | 18865 | 663 | 79751 | 36.9% | 27.9% | 14.8% | 15.5% |
| 2P6D | 14016 | 14459 | 77199 | 43.2% | 30.4% | 13.9% | 4.8% |
⚠️ **原稿"2P6D 的 p50_other 是 1P7D 的 22 倍 → 2P 推送压力更大"过度解读**。考虑分母效应:同一 trace 总工作量在 2P6D 由 6 张 D 分担 vs 1P7D 由 7 张 D 分担,**单 D 受到的压力本来就更大**,与 P 数无直接因果。这个数据只能说"2P6D 单 D 负担更高",**不能**得出"2P 在 transfer 上比 1P 更激进"。
---
## 6. 关键解读(已大幅修订)
### 6.1 v5 真实瓶颈尚不明确
原稿声称"瓶颈是 D 的 KV pool 在压力期被 'other' 占据"。⚠️ **此结论已撤回**。给定 `held_tokens` 实际是 slot-private(non-tree)部分,`other` 的最大单一成分**很可能是正常的 radix-tree 共享前缀**。"被 running batch / 在途传输占据"是**未经验证的猜想**。需要 P1 的细分 instrument 才能给出真瓶颈。
### 6.2 LRU eviction 的行为暂无可靠解读
原稿基于 mean_held 在压力期"暴跌"推断 LRU 在拼命踢。但 `held` 实际是 slot-private 部分,session 仍可能被 radix-tree 保留;`held` 减少不等于 session 被 evict,可能只是 `cache_protected_len` 比例变化。**P1 拆分前不下结论**。
### 6.3 v5+profile 1P7D "比 baseline 快"是单次巧合
两次 run 间隔 ~21 小时(原稿误写 ~6h),GPU 温度/PCIe 状态未控制。**N=1**,任何性能差异 < 30% 都不可声称
### 6.4 EXP2 2P6D 415 errors:polling 是 leading suspect(已升级)
原稿把 polling 列为"次要可能"。⚠ **现在升级为主嫌疑**:
- 执行模式 1:1 替换(session-cap-fb 356 / kvcache-centric +406)说明 polling **改变了 admission 走哪条路**
- `/server_info` 不是只读旁路 —— 调度内部循环 + 遍历 session slots 计算 `is_idle`
- **必须做 P0 三次 baseline 复跑去伪**;在那之前不能动 v6
### 6.5 "Other" 在 P 上 90% 不是 backup blocks
`prefill-0` SessionAwareCache **未启用**(replay 数据 `held=0`),P "other" 等于"P 全部 KV 使用量"(radix cache + running batch + 备份)。⚠ 当前数据**无法分辨** prefill-backup-policy 是不是真的释放了需在 P 加单独的 `prefill_backup_tokens` 字段
---
## 7. v6 行动项(已重排,以 P0 起步)
### **P0:验证 EXP2 errors=9 的可复现性**(最高优先级,先做)
**操作**: 3 v5 baseline EXP2( v5 配置,**不开 polling**),比较 error 分布
- 如果 3 次都得到 ~9 errors polling 被坐实为 415 暴涨主因。**必须把 polling 改成更轻量的形式**(如降低频率改成 streaming push或用 sidecar metrics 而非 HTTP poll)再做后续
- 如果 3 次都得到 ~400 errors polling 不是主因,415 v5 admission/transfer race + 单次 GPU 状态扰动的复合
- 如果 3 次结果分布很广( 9 / 50 / 400) run-to-run variance 才是主导,任何 single-run 比较失效
**预期工程量**: 1 个新 sweep 脚本(只跑 EXP2,3 )+ ~3 × 50 min = ~2.5h GPU 时间
**风险**: 0(纯重跑现有配置)。
### **P1:把 D 的 `other` 拆开打表**(P0 跑的同时并行做代码)
**操作**: SGLang `scheduler.py:get_streaming_session_cache_status` `session_aware_cache.py`,在返回的 dict 里加:
- `radix_protected_tokens` = `sum(slot.cache_protected_len for slot in slots)` ⚠️ 这是原稿盲区,critic 暴露的关键缺失字段
- `running_batch_tokens` = `sum(req.fill_ids size for req in running_batch.reqs)`
- `inflight_transfer_tokens` = `sum(req.size for req in disagg_decode_transfer_queue.queue)`
- `prealloc_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.queue)`
- `retracted_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.retracted_queue)`
- `last_gen_throughput`(已有)更细 —— `running_batch_size`(req )
**预期收益**: `other_unaccounted = capacity held available radix_protected running_batch inflight prealloc retracted` 应该接近 0剩余的就是真"病态"内存
**风险**: (纯只读 stat,不改 admission 逻辑)。
**工程量**: ~80 SGLang patch + 同步 replay.py `_query_pool_snapshot` + analyzer
### **P2:如果 P0 暴露 polling 是主因,改 polling 实现**
- 选项 A:把 `/server_info` 改成事件驱动 push(scheduler step 末尾把 stats 写到环形缓冲区,polling 只读不进 scheduler 队列)
- 选项 B:把 polling 频率从 1Hz 降到 5Hz/10s, P1 的拆分数据上验证够用
- 选项 C:scheduler 端加锁分离, stats 读和 admission 决策的临界区拆开
### **P3(条件性,等 P0+P1 数据)**:决定真正的优化方向
原稿 §7 5 条优先级在 `other` 模型纠正后**全部需要重新评估**。等真实拆分数据出来再排
---
## 8. 局限与 Confounders(已扩充)
1. `held_tokens` 语义在原稿被解读颠倒,引发 `other` 的因果归因错误(已纠正, §1.2)。
2. `other` 字段是计算所得且**未细分**,无法直接归因需要 P1 instrument 才能区分 radix-cacherunning batchinflight
3. EXP2+profile 415 errors baseline 9 errors **量级差异无法 deconfound**;polling leading suspect 但未证实。**P0 是必经步骤**。
4. **N=1** 的实验配置:任何 v5+profile vs v5 baseline 的延迟/失败差异都属于 single-run variance 合理范围,**不能作为方向性结论**。
5. trace single-shot,52 sessions × 4449 reqs 的特定结构可能放大某些路径
6. `capacity = 92086` `token_to_kv_pool_allocator.size`,来自 `mem_fraction_static`(未抽具体值),"H100 80GB 的物理上限"差距是 SGLang 的安全裕量
7. §3.1 t=1622 持续高 `other` 30+ tick 的现象 **未与 `last_gen_throughput` 交叉验证**;原稿"running batch + 在途传输"的解释是猜想而非证据
8. 18/52 失败 session 的特征(turn_idinput 长度prefix shape)**未做对比分析**;不能排除某个 session 类型本来就会触发某个固定 bug
9. polling 频率 1Hz 错过亚秒级 burst —— `other` 的双峰可能比测到的更剧烈
10. critic 指出 `pd-router-d-session-reseed` EXP1 (193 vs 152)、EXP2 (127 vs 152)的反向移动**未在原稿分析**,这是 admission/路由 决策的清晰信号,应该在 P1 之后回看
---
## 9. 后续指令(已更新顺序)
1. **P0**: `scripts/sweep_tp1_v5_baseline_rerun_exp2.sh`,3 EXP2 baseline, polling
2. **P1**: 同时改 SGLang `other` 真正拆开
3. 完成 P0+P1 后:
- 重跑 EXP2 一次 + instrument( polling),拿到 `other` 拆分
- 对比 baseline-rerun 三次的 errors 分布
- 决定是否回退 polling admission还是攻 specific 18 session 的工作负载特征
4. 任何 v6 代码改动(优化 admission / eviction / transfer)**必须在 P0+P1 之后**。
---
## 10. 数据产物
```
outputs/qwen3-30b-tp1-v5-optD-profile/
├── exp{1,2}_*_metrics.jsonl # 4449 行 / 实验
├── exp{1,2}_*_summary.json
├── exp{1,2}_*_pool_timeseries.jsonl # 12 MB / 10 MB
└── kvcache-centric-...20260429T{120847,125911}Z/ # 原始 run dir
outputs/qwen3-30b-tp1-v5-optD/ # baseline 对照(N=1)
└── exp{1,2}_1p7d_kvc_optD_*
# 待 P0 产生:
outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
└── exp2_2p6d_run{1,2,3}_*
```
分析脚本:`scripts/analysis/analyze_pool_timeseries.py`(`--json` 拿机器可读输出)。

Binary file not shown.

After

Width:  |  Height:  |  Size: 368 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 196 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 315 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 130 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 106 KiB

View File

@@ -0,0 +1,88 @@
{
"actual_output_tokens_stats": {
"count": 4086.0,
"mean": 213.95105237395987,
"p50": 83.0,
"p90": 562.0,
"p99": 1346.0
},
"cache_hit_request_count": 3929,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 22635.924702180266,
"p50": 20010.0,
"p90": 48002.0,
"p99": 65424.0
},
"decode_request_priorities": {},
"error_count": 363,
"execution_modes": {
"kvcache-centric": 363,
"kvcache-direct-to-d-session": 1716,
"pd-router-d-session-reseed": 23,
"pd-router-fallback-d-backpressure": 12,
"pd-router-fallback-large-append": 5,
"pd-router-fallback-large-append-seed-filter-early-turn": 51,
"pd-router-fallback-large-append-session-cap": 2148,
"pd-router-fallback-no-d-capacity": 7,
"pd-router-fallback-session-cap": 32,
"pd-router-large-append-reseed": 39,
"pd-router-large-append-reseed-after-eviction": 2,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-no-d-capacity": 3,
"pd-router-turn1-seed": 34,
"pd-router-turn1-session-cap": 13
},
"latency_stats_s": {
"count": 4086.0,
"mean": 4.8753733304192455,
"p50": 1.754677688702941,
"p90": 12.66968655679375,
"p99": 28.717210091650486
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 616,
"decode-1": 658,
"decode-2": 674,
"decode-3": 582,
"decode-4": 656,
"decode-5": 662,
"decode-6": 601
},
"per_prefill_load": {
"prefill-0": 4449
},
"prefill_request_priorities": {
"-100": 98,
"100": 2272
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 1716,
"total_actual_kv_transfer_blocks": 62123,
"total_cached_tokens": 100707229,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4086.0,
"mean": 0.005829451223571163,
"p50": 0.005684156496173296,
"p90": 0.007143743503740225,
"p99": 0.008634991403068266
},
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T095141Z/sampled-trace.jsonl",
"truncated_request_count": 42,
"ttft_stats_s": {
"count": 4086.0,
"mean": 3.5955862397812597,
"p50": 0.36274072993546724,
"p90": 10.972254231572151,
"p99": 27.433656523004174
}
}

View File

@@ -0,0 +1,85 @@
{
"actual_output_tokens_stats": {
"count": 4440.0,
"mean": 225.87972972972972,
"p50": 86.0,
"p90": 576.0,
"p99": 1347.0
},
"cache_hit_request_count": 4201,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 24345.55787817487,
"p50": 21504.0,
"p90": 48792.0,
"p99": 69120.0
},
"decode_request_priorities": {},
"error_count": 9,
"execution_modes": {
"kvcache-centric": 9,
"kvcache-direct-to-d-session": 1358,
"pd-router-d-session-reseed": 12,
"pd-router-fallback-d-backpressure": 2,
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
"pd-router-fallback-large-append-session-cap": 2902,
"pd-router-fallback-session-cap": 25,
"pd-router-large-append-reseed": 34,
"pd-router-large-append-reseed-after-eviction": 4,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-seed": 30,
"pd-router-turn1-session-cap": 20
},
"latency_stats_s": {
"count": 4440.0,
"mean": 3.582334662846558,
"p50": 1.517257746309042,
"p90": 9.225348330102861,
"p99": 18.70269925892353
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 710,
"decode-1": 630,
"decode-2": 763,
"decode-3": 737,
"decode-4": 879,
"decode-5": 730
},
"per_prefill_load": {
"prefill-0": 2225,
"prefill-1": 2224
},
"prefill_request_priorities": {
"-100": 80,
"100": 3002
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 1358,
"total_actual_kv_transfer_blocks": 78979,
"total_cached_tokens": 108313387,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4440.0,
"mean": 0.005882534704321737,
"p50": 0.005807478777200416,
"p90": 0.00712956755887717,
"p99": 0.008372141476720572
},
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T104343Z/sampled-trace.jsonl",
"truncated_request_count": 42,
"ttft_stats_s": {
"count": 4440.0,
"mean": 2.2045287611873334,
"p50": 0.32809355948120356,
"p90": 6.947275545448065,
"p99": 16.705802395939827
}
}

View File

@@ -0,0 +1,189 @@
[2026-04-28 17:51:41] Starting TP1 v3 sweep (KVC with kv-aware policy)
[2026-04-28 17:51:41] Model: /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
[2026-04-28 17:51:41] Trace: outputs/qwen35-swebench-50sess.jsonl (4449 requests, 52 sessions)
[2026-04-28 17:51:41] Key change: --policy kv-aware for KVC (was --policy default in v2)
[2026-04-28 17:51:41]
[2026-04-28 17:51:41] === [EXP1] 1P7D KVC kv-aware ===
[2026-04-28 18:43:43] === exp1_1p7d_kvc_kvaware COMPLETED ===
[2026-04-28 18:43:43] Summary:
{
"actual_output_tokens_stats": {
"count": 4086.0,
"mean": 213.95105237395987,
"p50": 83.0,
"p90": 562.0,
"p99": 1346.0
},
"cache_hit_request_count": 3929,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 22635.924702180266,
"p50": 20010.0,
"p90": 48002.0,
"p99": 65424.0
},
"decode_request_priorities": {},
"error_count": 363,
"execution_modes": {
"kvcache-centric": 363,
"kvcache-direct-to-d-session": 1716,
"pd-router-d-session-reseed": 23,
"pd-router-fallback-d-backpressure": 12,
"pd-router-fallback-large-append": 5,
"pd-router-fallback-large-append-seed-filter-early-turn": 51,
"pd-router-fallback-large-append-session-cap": 2148,
"pd-router-fallback-no-d-capacity": 7,
"pd-router-fallback-session-cap": 32,
"pd-router-large-append-reseed": 39,
"pd-router-large-append-reseed-after-eviction": 2,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-no-d-capacity": 3,
"pd-router-turn1-seed": 34,
"pd-router-turn1-session-cap": 13
},
"latency_stats_s": {
"count": 4086.0,
"mean": 4.8753733304192455,
"p50": 1.754677688702941,
"p90": 12.66968655679375,
"p99": 28.717210091650486
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 616,
"decode-1": 658,
"decode-2": 674,
"decode-3": 582,
"decode-4": 656,
"decode-5": 662,
"decode-6": 601
},
"per_prefill_load": {
"prefill-0": 4449
},
"prefill_request_priorities": {
"-100": 98,
"100": 2272
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 1716,
"total_actual_kv_transfer_blocks": 62123,
"total_cached_tokens": 100707229,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4086.0,
"mean": 0.005829451223571163,
"p50": 0.005684156496173296,
"p90": 0.007143743503740225,
"p99": 0.008634991403068266
},
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T095141Z/sampled-trace.jsonl",
"truncated_request_count": 42,
"ttft_stats_s": {
"count": 4086.0,
"mean": 3.5955862397812597,
"p50": 0.36274072993546724,
"p90": 10.972254231572151,
"p99": 27.433656523004174
}
}
[2026-04-28 18:43:43] Saved to outputs/qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_summary.json + exp1_1p7d_kvc_kvaware_metrics.jsonl
[2026-04-28 18:43:43]
[2026-04-28 18:43:43] === [EXP2] 2P6D KVC kv-aware ===
[2026-04-28 19:30:38] === exp2_2p6d_kvc_kvaware COMPLETED ===
[2026-04-28 19:30:38] Summary:
{
"actual_output_tokens_stats": {
"count": 4440.0,
"mean": 225.87972972972972,
"p50": 86.0,
"p90": 576.0,
"p99": 1347.0
},
"cache_hit_request_count": 4201,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 24345.55787817487,
"p50": 21504.0,
"p90": 48792.0,
"p99": 69120.0
},
"decode_request_priorities": {},
"error_count": 9,
"execution_modes": {
"kvcache-centric": 9,
"kvcache-direct-to-d-session": 1358,
"pd-router-d-session-reseed": 12,
"pd-router-fallback-d-backpressure": 2,
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
"pd-router-fallback-large-append-session-cap": 2902,
"pd-router-fallback-session-cap": 25,
"pd-router-large-append-reseed": 34,
"pd-router-large-append-reseed-after-eviction": 4,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-seed": 30,
"pd-router-turn1-session-cap": 20
},
"latency_stats_s": {
"count": 4440.0,
"mean": 3.582334662846558,
"p50": 1.517257746309042,
"p90": 9.225348330102861,
"p99": 18.70269925892353
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 710,
"decode-1": 630,
"decode-2": 763,
"decode-3": 737,
"decode-4": 879,
"decode-5": 730
},
"per_prefill_load": {
"prefill-0": 2225,
"prefill-1": 2224
},
"prefill_request_priorities": {
"-100": 80,
"100": 3002
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 1358,
"total_actual_kv_transfer_blocks": 78979,
"total_cached_tokens": 108313387,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4440.0,
"mean": 0.005882534704321737,
"p50": 0.005807478777200416,
"p90": 0.00712956755887717,
"p99": 0.008372141476720572
},
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T104343Z/sampled-trace.jsonl",
"truncated_request_count": 42,
"ttft_stats_s": {
"count": 4440.0,
"mean": 2.2045287611873334,
"p50": 0.32809355948120356,
"p90": 6.947275545448065,
"p99": 16.705802395939827
}
}
[2026-04-28 19:30:38] Saved to outputs/qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_summary.json + exp2_2p6d_kvc_kvaware_metrics.jsonl
[2026-04-28 19:30:38]
[2026-04-28 19:30:38] === ALL TP1 V3 SWEEP EXPERIMENTS DONE ===

View File

@@ -0,0 +1,88 @@
{
"actual_output_tokens_stats": {
"count": 4014.0,
"mean": 215.048081714001,
"p50": 83.0,
"p90": 570.0,
"p99": 1343.0
},
"cache_hit_request_count": 3865,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 21373.60867610699,
"p50": 18429.0,
"p90": 45643.0,
"p99": 65088.0
},
"decode_request_priorities": {},
"error_count": 435,
"execution_modes": {
"kvcache-centric": 435,
"kvcache-direct-to-d-session": 2180,
"pd-router-d-session-reseed": 44,
"pd-router-d-session-reseed-after-eviction": 1,
"pd-router-fallback-d-backpressure": 36,
"pd-router-fallback-large-append": 35,
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
"pd-router-fallback-large-append-session-cap": 1500,
"pd-router-fallback-no-d-capacity": 13,
"pd-router-fallback-session-cap": 43,
"pd-router-large-append-reseed": 55,
"pd-router-large-append-reseed-after-eviction": 3,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-no-d-capacity": 5,
"pd-router-turn1-seed": 46
},
"latency_stats_s": {
"count": 4014.0,
"mean": 4.214657033050009,
"p50": 1.0827504023909569,
"p90": 13.380241627804935,
"p99": 24.453291333280504
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 690,
"decode-1": 599,
"decode-2": 660,
"decode-3": 584,
"decode-4": 606,
"decode-5": 646,
"decode-6": 664
},
"per_prefill_load": {
"prefill-0": 4449
},
"prefill_request_priorities": {
"-100": 149,
"100": 1685
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 2180,
"total_actual_kv_transfer_blocks": 52857,
"total_cached_tokens": 95091185,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4014.0,
"mean": 0.005804301410418847,
"p50": 0.005607025208882987,
"p90": 0.007293824862528552,
"p99": 0.008864479259402893
},
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T125022Z/sampled-trace.jsonl",
"truncated_request_count": 43,
"ttft_stats_s": {
"count": 4014.0,
"mean": 2.915135478307124,
"p50": 0.05643345229327679,
"p90": 11.900803190656006,
"p99": 22.758968392387033
}
}

View File

@@ -0,0 +1,86 @@
{
"actual_output_tokens_stats": {
"count": 4046.0,
"mean": 224.65002471576867,
"p50": 84.0,
"p90": 576.0,
"p99": 1349.0
},
"cache_hit_request_count": 3925,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 22852.7439874129,
"p50": 19584.0,
"p90": 49009.0,
"p99": 67320.0
},
"decode_request_priorities": {},
"error_count": 403,
"execution_modes": {
"kvcache-centric": 403,
"kvcache-direct-to-d-session": 2348,
"pd-router-d-session-reseed": 28,
"pd-router-fallback-d-backpressure": 7,
"pd-router-fallback-large-append": 68,
"pd-router-fallback-large-append-seed-filter-early-turn": 45,
"pd-router-fallback-large-append-session-cap": 1403,
"pd-router-fallback-no-d-capacity": 9,
"pd-router-fallback-session-cap": 25,
"pd-router-large-append-reseed": 57,
"pd-router-large-append-reseed-after-eviction": 6,
"pd-router-turn1-no-d-capacity": 1,
"pd-router-turn1-seed": 49
},
"latency_stats_s": {
"count": 4046.0,
"mean": 2.505981629502371,
"p50": 0.8372491216287017,
"p90": 6.5139341270551085,
"p99": 18.335972285829484
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 767,
"decode-1": 680,
"decode-2": 906,
"decode-3": 818,
"decode-4": 800,
"decode-5": 478
},
"per_prefill_load": {
"prefill-0": 2225,
"prefill-1": 2224
},
"prefill_request_priorities": {
"-100": 140,
"100": 1558
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 2348,
"total_actual_kv_transfer_blocks": 50727,
"total_cached_tokens": 101671858,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4046.0,
"mean": 0.005708743129332261,
"p50": 0.005565466725497757,
"p90": 0.006912594398356141,
"p99": 0.008102089307750717
},
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T134057Z/sampled-trace.jsonl",
"truncated_request_count": 36,
"ttft_stats_s": {
"count": 4046.0,
"mean": 1.1653790952959129,
"p50": 0.05140436999499798,
"p90": 2.6447059931233525,
"p99": 15.121314341202378
}
}

View File

@@ -0,0 +1,190 @@
[2026-04-28 20:50:21] Starting TP1 v4 sweep (KVC kv-aware, session soft_cap raised 4->16)
[2026-04-28 20:50:21] Model: /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
[2026-04-28 20:50:21] Trace: outputs/qwen35-swebench-50sess.jsonl (4449 requests, 52 sessions)
[2026-04-28 20:50:21] Key change: _decode_session_soft_cap now min(16, ...) instead of min(4, ...)
[2026-04-28 20:50:21]
[2026-04-28 20:50:21] === [EXP1] 1P7D KVC kv-aware cap=16 ===
[2026-04-28 21:40:57] === exp1_1p7d_kvc_cap16 COMPLETED ===
[2026-04-28 21:40:57] Summary:
{
"actual_output_tokens_stats": {
"count": 4014.0,
"mean": 215.048081714001,
"p50": 83.0,
"p90": 570.0,
"p99": 1343.0
},
"cache_hit_request_count": 3865,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 21373.60867610699,
"p50": 18429.0,
"p90": 45643.0,
"p99": 65088.0
},
"decode_request_priorities": {},
"error_count": 435,
"execution_modes": {
"kvcache-centric": 435,
"kvcache-direct-to-d-session": 2180,
"pd-router-d-session-reseed": 44,
"pd-router-d-session-reseed-after-eviction": 1,
"pd-router-fallback-d-backpressure": 36,
"pd-router-fallback-large-append": 35,
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
"pd-router-fallback-large-append-session-cap": 1500,
"pd-router-fallback-no-d-capacity": 13,
"pd-router-fallback-session-cap": 43,
"pd-router-large-append-reseed": 55,
"pd-router-large-append-reseed-after-eviction": 3,
"pd-router-turn1-d-backpressure": 1,
"pd-router-turn1-no-d-capacity": 5,
"pd-router-turn1-seed": 46
},
"latency_stats_s": {
"count": 4014.0,
"mean": 4.214657033050009,
"p50": 1.0827504023909569,
"p90": 13.380241627804935,
"p99": 24.453291333280504
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 690,
"decode-1": 599,
"decode-2": 660,
"decode-3": 584,
"decode-4": 606,
"decode-5": 646,
"decode-6": 664
},
"per_prefill_load": {
"prefill-0": 4449
},
"prefill_request_priorities": {
"-100": 149,
"100": 1685
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 2180,
"total_actual_kv_transfer_blocks": 52857,
"total_cached_tokens": 95091185,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4014.0,
"mean": 0.005804301410418847,
"p50": 0.005607025208882987,
"p90": 0.007293824862528552,
"p99": 0.008864479259402893
},
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T125022Z/sampled-trace.jsonl",
"truncated_request_count": 43,
"ttft_stats_s": {
"count": 4014.0,
"mean": 2.915135478307124,
"p50": 0.05643345229327679,
"p90": 11.900803190656006,
"p99": 22.758968392387033
}
}
[2026-04-28 21:40:57] Saved to outputs/qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_summary.json + exp1_1p7d_kvc_cap16_metrics.jsonl
[2026-04-28 21:40:57]
[2026-04-28 21:40:57] === [EXP2] 2P6D KVC kv-aware cap=16 ===
[2026-04-28 22:27:53] === exp2_2p6d_kvc_cap16 COMPLETED ===
[2026-04-28 22:27:53] Summary:
{
"actual_output_tokens_stats": {
"count": 4046.0,
"mean": 224.65002471576867,
"p50": 84.0,
"p90": 576.0,
"p99": 1349.0
},
"cache_hit_request_count": 3925,
"cached_tokens_stats": {
"count": 4449.0,
"mean": 22852.7439874129,
"p50": 19584.0,
"p90": 49009.0,
"p99": 67320.0
},
"decode_request_priorities": {},
"error_count": 403,
"execution_modes": {
"kvcache-centric": 403,
"kvcache-direct-to-d-session": 2348,
"pd-router-d-session-reseed": 28,
"pd-router-fallback-d-backpressure": 7,
"pd-router-fallback-large-append": 68,
"pd-router-fallback-large-append-seed-filter-early-turn": 45,
"pd-router-fallback-large-append-session-cap": 1403,
"pd-router-fallback-no-d-capacity": 9,
"pd-router-fallback-session-cap": 25,
"pd-router-large-append-reseed": 57,
"pd-router-large-append-reseed-after-eviction": 6,
"pd-router-turn1-no-d-capacity": 1,
"pd-router-turn1-seed": 49
},
"latency_stats_s": {
"count": 4046.0,
"mean": 2.505981629502371,
"p50": 0.8372491216287017,
"p90": 6.5139341270551085,
"p99": 18.335972285829484
},
"mechanisms": {
"kvcache-centric": 4449
},
"per_decode_load": {
"decode-0": 767,
"decode-1": 680,
"decode-2": 906,
"decode-3": 818,
"decode-4": 800,
"decode-5": 478
},
"per_prefill_load": {
"prefill-0": 2225,
"prefill-1": 2224
},
"prefill_request_priorities": {
"-100": 140,
"100": 1558
},
"re_prefill_count": 0,
"request_count": 4449,
"reuse_expected_count": 4397,
"reuse_observed_count": 4397,
"router_url": "http://127.0.0.1:8000",
"session_reset_count": 0,
"session_reused_count": 2348,
"total_actual_kv_transfer_blocks": 50727,
"total_cached_tokens": 101671858,
"total_kv_transfer_blocks": 105235,
"tpot_stats_s": {
"count": 4046.0,
"mean": 0.005708743129332261,
"p50": 0.005565466725497757,
"p90": 0.006912594398356141,
"p99": 0.008102089307750717
},
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T134057Z/sampled-trace.jsonl",
"truncated_request_count": 36,
"ttft_stats_s": {
"count": 4046.0,
"mean": 1.1653790952959129,
"p50": 0.05140436999499798,
"p90": 2.6447059931233525,
"p99": 15.121314341202378
}
}
[2026-04-28 22:27:53] Saved to outputs/qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_summary.json + exp2_2p6d_kvc_cap16_metrics.jsonl
[2026-04-28 22:27:53]
[2026-04-28 22:27:53] === ALL TP1 V4 SWEEP EXPERIMENTS DONE ===

View File

@@ -0,0 +1,191 @@
#!/usr/bin/env python3
"""Analyze backpressure smoke sweep outputs.
For each run dir with a `request-metrics.jsonl` and the new `structural/`
subdir (admission-events.jsonl, backpressure-events.jsonl,
session-d-binding.jsonl), report:
- Headline (errors, latency, ttft, direct-to-D rate)
- Backpressure pause histogram (count, p50/p90 sleep, total pause time per D)
- Admission probe stats (RPC count, mean RTT, queue_depth distribution,
pause_ms distribution)
- Session pinning (distinct D per session, bimodal direct-to-D rate)
"""
from __future__ import annotations
import argparse
import json
import statistics
from collections import Counter, defaultdict
from pathlib import Path
def load_jsonl(path: Path) -> list[dict]:
if not path.exists():
return []
return [json.loads(l) for l in path.open("r", encoding="utf-8") if l.strip()]
def summarize_run(run_dir: Path) -> dict:
metrics_path = next(run_dir.rglob("request-metrics.jsonl"), None)
if metrics_path is None:
return {"run_dir": str(run_dir), "error": "no request-metrics.jsonl"}
summary_path = metrics_path.with_suffix(metrics_path.suffix + ".summary.json")
summary = (
json.load(summary_path.open()) if summary_path.exists() else {}
)
structural_dir = run_dir / "structural"
if not structural_dir.exists():
# try metrics dir's parent / structural
structural_dir = metrics_path.parent / "structural"
admission_events = load_jsonl(structural_dir / "admission-events.jsonl")
backpressure_events = load_jsonl(structural_dir / "backpressure-events.jsonl")
binding_events = load_jsonl(structural_dir / "session-d-binding.jsonl")
out: dict = {"run_dir": str(run_dir)}
# Headline metrics from summary.json
out["request_count"] = summary.get("request_count")
out["error_count"] = summary.get("error_count")
out["latency"] = summary.get("latency_stats_s")
out["ttft"] = summary.get("ttft_stats_s")
out["execution_modes"] = summary.get("execution_modes")
out["per_decode_load"] = summary.get("per_decode_load")
out["per_prefill_load"] = summary.get("per_prefill_load")
# Direct-to-D rate from execution_modes
em = summary.get("execution_modes", {}) or {}
direct = em.get("kvcache-direct-to-d-session", 0)
total = sum(em.values()) or 1
out["direct_to_d_rate"] = direct / total
# Session pinning
bind_per_session: dict[str, set[int]] = defaultdict(set)
for ev in binding_events:
bind_per_session[ev["session_id"]].add(ev["decode_worker_index"])
if bind_per_session:
out["session_count"] = len(bind_per_session)
out["avg_distinct_d_per_session"] = (
sum(len(v) for v in bind_per_session.values()) / len(bind_per_session)
)
else:
out["session_count"] = 0
out["avg_distinct_d_per_session"] = None
# Direct-to-D rate per session (bimodal check)
records = load_jsonl(metrics_path)
sess_records: dict[str, list[dict]] = defaultdict(list)
for r in records:
sess_records[r["session_id"]].append(r)
rates = []
for sid, turns in sess_records.items():
ndir = sum(
1 for t in turns if t.get("execution_mode") == "kvcache-direct-to-d-session"
)
rates.append(ndir / len(turns))
if rates:
buckets = [0, 0, 0, 0, 0]
for r in rates:
buckets[min(4, int(r * 5))] += 1
out["direct_to_d_rate_buckets"] = {
"0-20%": buckets[0],
"20-40%": buckets[1],
"40-60%": buckets[2],
"60-80%": buckets[3],
"80-100%": buckets[4],
}
# Backpressure events
if backpressure_events:
sleeps = [ev["sleep_s"] for ev in backpressure_events]
out["backpressure"] = {
"event_count": len(backpressure_events),
"total_sleep_s": round(sum(sleeps), 2),
"sleep_p50_s": round(statistics.median(sleeps), 4),
"sleep_p90_s": round(
sorted(sleeps)[int(len(sleeps) * 0.9)] if sleeps else 0, 4
),
"events_per_d": dict(
Counter(ev["server_url"] for ev in backpressure_events).most_common()
),
}
else:
out["backpressure"] = {"event_count": 0, "note": "no backpressure events"}
# Admission probe stats
if admission_events:
rtts = [ev["rtt_s"] for ev in admission_events]
depths = [ev.get("queue_depth", 0) for ev in admission_events]
pauses = [ev.get("recommended_pause_ms", 0) for ev in admission_events]
out["admission_probes"] = {
"count": len(admission_events),
"mean_rtt_s": round(sum(rtts) / len(rtts), 4),
"p99_rtt_s": round(sorted(rtts)[int(len(rtts) * 0.99)], 4),
"queue_depth_p50": int(statistics.median(depths)),
"queue_depth_p90": int(sorted(depths)[int(len(depths) * 0.9)]),
"queue_depth_max": max(depths),
"pause_ms_p50": int(statistics.median(pauses)),
"pause_ms_p90": int(sorted(pauses)[int(len(pauses) * 0.9)]),
"pause_ms_max": max(pauses),
"nonzero_pause_count": sum(1 for p in pauses if p > 0),
"by_reason": dict(
Counter(ev.get("reason") or "ok" for ev in admission_events).most_common()
),
}
return out
def main() -> None:
ap = argparse.ArgumentParser()
ap.add_argument("sweep_root", type=Path)
ap.add_argument("--json", action="store_true", help="emit JSON only")
args = ap.parse_args()
summaries = []
for run_dir in sorted(args.sweep_root.iterdir()):
if not run_dir.is_dir():
continue
summary = summarize_run(run_dir)
summaries.append(summary)
if args.json:
print(json.dumps(summaries, indent=2))
return
for s in summaries:
print(f"\n{'=' * 70}")
print(f" {s['run_dir']}")
print(f"{'=' * 70}")
if "error" in s:
print(f" ERROR: {s['error']}")
continue
print(f" reqs={s.get('request_count')} errors={s.get('error_count')}")
if s.get("latency"):
lt = s["latency"]
print(
f" latency: mean={lt.get('mean'):.3f} "
f"p50={lt.get('p50'):.3f} p90={lt.get('p90'):.3f} p99={lt.get('p99'):.3f}"
)
if s.get("ttft"):
tt = s["ttft"]
print(
f" ttft: mean={tt.get('mean'):.3f} "
f"p50={tt.get('p50'):.3f} p90={tt.get('p90'):.3f}"
)
print(f" direct_to_d_rate: {s.get('direct_to_d_rate', 0) * 100:.1f}%")
print(f" sessions: {s.get('session_count')} | "
f"avg distinct-D-per-session: {s.get('avg_distinct_d_per_session')}")
if s.get("direct_to_d_rate_buckets"):
print(f" direct-to-D distribution by session: {s['direct_to_d_rate_buckets']}")
if s.get("backpressure"):
print(f" backpressure: {s['backpressure']}")
if s.get("admission_probes"):
print(f" admission probes: {s['admission_probes']}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,83 @@
#!/usr/bin/env python3
"""Deep dive into v4 errors: which path, which D, which session, which turn."""
import json
import numpy as np
from pathlib import Path
from collections import Counter, defaultdict
BASE = Path(__file__).parent
def load_rows(jsonl_path):
rows = []
with open(jsonl_path) as f:
for line in f:
rows.append(json.loads(line))
return rows
# Compare v3 and v4 errors
for label, path in [
("v3 1P7D", BASE.parent / "qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_metrics.jsonl"),
("v4 1P7D", BASE / "exp1_1p7d_kvc_cap16_metrics.jsonl"),
("v3 2P6D", BASE.parent / "qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_metrics.jsonl"),
("v4 2P6D", BASE / "exp2_2p6d_kvc_cap16_metrics.jsonl"),
]:
if not path.exists():
print(f"\nSKIP {label}: {path} not found")
continue
rows = load_rows(path)
err = [r for r in rows if r.get("error") is not None]
print(f"\n========== {label} ({len(err)} errors / {len(rows)} total = {len(err)/len(rows)*100:.1f}%) ==========")
# Error finish_reason distribution
fr_counter = Counter()
for r in err:
fr = str(r.get("finish_reason") or r.get("error") or "?")
fr_counter[fr[:80]] += 1
print(f"finish_reason distribution:")
for fr, cnt in fr_counter.most_common():
print(f" {cnt:>4}x {fr}")
# Errors by execution mode (these are aborted before mode assignment usually)
mode_counter = Counter(r.get("execution_mode", "?") for r in err)
print(f"\nerror by execution_mode:")
for mode, cnt in mode_counter.most_common():
print(f" {cnt:>4}x {mode}")
# Errors per D worker
dw_counter = Counter(r.get("assigned_decode_node", "?") for r in err)
print(f"\nerror per assigned_decode_node:")
for dw, cnt in dw_counter.most_common():
print(f" {cnt:>4}x {dw}")
# Errors by turn distribution
turn_counter = Counter(r.get("turn_id", -1) for r in err)
early = sum(c for t, c in turn_counter.items() if t <= 5)
mid = sum(c for t, c in turn_counter.items() if 5 < t <= 30)
late = sum(c for t, c in turn_counter.items() if t > 30)
print(f"\nerror by turn: early(0-5)={early} mid(6-30)={mid} late(31+)={late}")
# Per-session error rate
per_sess_err = defaultdict(int)
per_sess_total = defaultdict(int)
for r in rows:
per_sess_total[r["session_id"]] += 1
if r.get("error") is not None:
per_sess_err[r["session_id"]] += 1
sess_with_err = [(sid, per_sess_err[sid], per_sess_total[sid]) for sid in per_sess_err]
sess_with_err.sort(key=lambda x: -x[1])
print(f"\ntop 5 sessions by error count:")
for sid, e, t in sess_with_err[:5]:
print(f" session {sid}: {e}/{t} errors ({e/t*100:.0f}%)")
# Errors timeline: are they bursty?
err_ts = sorted([r.get("trace_timestamp_s", 0) for r in err])
if err_ts:
first_ts = err_ts[0]
last_ts = err_ts[-1]
all_ts = sorted([r.get("trace_timestamp_s", 0) for r in rows])
first_all = all_ts[0]
last_all = all_ts[-1]
run_duration = last_all - first_all
err_first_pct = (err_ts[0] - first_all) / run_duration * 100 if run_duration > 0 else 0
err_last_pct = (err_ts[-1] - first_all) / run_duration * 100 if run_duration > 0 else 0
print(f"\nerror time range (% of run): {err_first_pct:.1f}% - {err_last_pct:.1f}%")

View File

@@ -0,0 +1,346 @@
#!/usr/bin/env python3
"""Analyze d-pool-timeseries.jsonl produced by --pool-poll-interval-s.
Answers v6's main question: where is D's KV pool actually spent?
For each decode worker, decomposes capacity over the run wall-clock into:
- resident_held_active = held - idle_evictable (sessions in active use)
- resident_held_idle = idle_evictable (sessions kept around but evictable)
- prefill_backup_or_other = capacity - held - available (everything else: backup blocks,
in-flight transfers, fragmentation)
- free_available = available
Also reports session residency churn (how many distinct sessions ever resided per D, and
how often a session bounced between workers — a strong starvation signal).
Usage:
python scripts/analysis/analyze_pool_timeseries.py <run_dir>
or
python scripts/analysis/analyze_pool_timeseries.py <pool_timeseries.jsonl>
Output: human-readable text. Add --json to also print a machine-readable summary.
"""
from __future__ import annotations
import argparse
import json
import statistics
from collections import Counter, defaultdict
from pathlib import Path
from typing import Any
def _load_jsonl(path: Path) -> list[dict[str, Any]]:
rows: list[dict[str, Any]] = []
with path.open() as fh:
for line in fh:
line = line.strip()
if not line:
continue
rows.append(json.loads(line))
return rows
def _resolve_input(path: Path) -> Path:
if path.is_file():
return path
if path.is_dir():
candidate = path / "d-pool-timeseries.jsonl"
if candidate.is_file():
return candidate
raise FileNotFoundError(
f"{candidate} not found; pass the file directly or a run dir containing it."
)
raise FileNotFoundError(path)
def _percentile(values: list[float], p: float) -> float:
if not values:
return 0.0
s = sorted(values)
idx = min(len(s) - 1, max(0, int(round((len(s) - 1) * p))))
return s[idx]
def _fmt_tokens(n: float) -> str:
if n >= 1_000_000:
return f"{n / 1_000_000:.2f}M"
if n >= 1_000:
return f"{n / 1_000:.1f}K"
return f"{int(n)}"
def _fmt_pct(n: float, total: float) -> str:
if total <= 0:
return " - "
return f"{100 * n / total:5.1f}%"
def analyze(timeseries_path: Path) -> dict[str, Any]:
rows = _load_jsonl(timeseries_path)
if not rows:
raise ValueError(f"empty timeseries: {timeseries_path}")
by_worker: dict[str, list[dict[str, Any]]] = defaultdict(list)
for row in rows:
if row.get("error") and "session_cache_enabled" not in row:
# poller failed at this tick — skip
continue
wid = row.get("worker_id") or "?"
by_worker[wid].append(row)
summary: dict[str, Any] = {
"timeseries_path": str(timeseries_path),
"total_rows": len(rows),
"tick_count": len(by_worker[next(iter(by_worker))]) if by_worker else 0,
"wall_s_span": (
max(r.get("wall_s", 0.0) for r in rows)
- min(r.get("wall_s", 0.0) for r in rows)
),
"workers": {},
}
print(f"\n=== Pool timeseries: {timeseries_path}")
print(
f" rows={summary['total_rows']} workers={len(by_worker)} "
f"span={summary['wall_s_span']:.1f}s"
)
# Print per-worker decomposition table
header = (
f"{'worker':<12} {'role':<8} {'cap':>8} | "
f"{'avg_active':>10} {'avg_idle':>10} {'avg_other':>10} {'avg_free':>10} | "
f"{'p90_held':>10} {'max_held':>10} {'p90_avail':>10}"
)
print(header)
print("-" * len(header))
for wid in sorted(by_worker.keys()):
ws = by_worker[wid]
role = ws[0].get("worker_role", "?")
cap_vals = [int(r.get("capacity_tokens") or 0) for r in ws]
held_vals = [int(r.get("held_tokens") or 0) for r in ws]
avail_vals = [int(r.get("available_tokens") or 0) for r in ws]
idle_vals = [int(r.get("idle_evictable_tokens") or 0) for r in ws]
# active = held - idle (sessions in active use)
active_vals = [max(0, h - i) for h, i in zip(held_vals, idle_vals)]
# other = capacity - held - available (prefill backup blocks, in-flight, fragmentation)
other_vals = [
max(0, c - h - a) for c, h, a in zip(cap_vals, held_vals, avail_vals)
]
cap = max(cap_vals) if cap_vals else 0
avg_active = statistics.fmean(active_vals) if active_vals else 0.0
avg_idle = statistics.fmean(idle_vals) if idle_vals else 0.0
avg_other = statistics.fmean(other_vals) if other_vals else 0.0
avg_avail = statistics.fmean(avail_vals) if avail_vals else 0.0
p90_held = _percentile([float(v) for v in held_vals], 0.90)
max_held = max(held_vals) if held_vals else 0
p90_avail = _percentile([float(v) for v in avail_vals], 0.90)
sess_counts = [int(r.get("session_count") or 0) for r in ws]
resident_counts = [int(r.get("resident_session_count") or 0) for r in ws]
print(
f"{wid:<12} {role:<8} {_fmt_tokens(cap):>8} | "
f"{_fmt_tokens(avg_active):>4} {_fmt_pct(avg_active, cap):>5} "
f"{_fmt_tokens(avg_idle):>4} {_fmt_pct(avg_idle, cap):>5} "
f"{_fmt_tokens(avg_other):>4} {_fmt_pct(avg_other, cap):>5} "
f"{_fmt_tokens(avg_avail):>4} {_fmt_pct(avg_avail, cap):>5} | "
f"{_fmt_tokens(p90_held):>10} {_fmt_tokens(max_held):>10} "
f"{_fmt_tokens(p90_avail):>10}"
)
summary["workers"][wid] = {
"role": role,
"capacity_tokens": cap,
"avg_active_held_tokens": avg_active,
"avg_idle_evictable_tokens": avg_idle,
"avg_other_tokens": avg_other,
"avg_available_tokens": avg_avail,
"p90_held_tokens": p90_held,
"max_held_tokens": max_held,
"p90_available_tokens": p90_avail,
"max_session_count": max(sess_counts) if sess_counts else 0,
"max_resident_session_count": (
max(resident_counts) if resident_counts else 0
),
"ticks": len(ws),
}
print(
"\nLegend: active=held-idle idle=idle_evictable "
"other=cap-held-avail (radix-protected + running-batch + in-flight + frag)"
)
# P1: decomposition of "other" using pool_breakdown fields (zeros if instrument absent)
has_breakdown = any(
any(r.get(k) for k in (
"radix_evictable_tokens",
"radix_protected_tokens",
"running_batch_kv_tokens",
"transfer_queue_tokens",
"prealloc_queue_tokens",
"retracted_queue_tokens",
))
for r in rows
)
if has_breakdown:
print("\n=== P1 'other' decomposition (per worker, mean over run) ===")
print(
f"{'worker':<12} {'role':<8} | "
f"{'r_evictable':>11} {'r_protected':>11} {'slot_private':>12} | "
f"{'run_batch':>10} {'transfer':>9} {'prealloc':>9} {'retracted':>10} | "
f"{'unaccounted':>11}"
)
for wid in sorted(by_worker.keys()):
ws = by_worker[wid]
role = ws[0].get("worker_role", "?")
cap = max(int(r.get("capacity_tokens") or 0) for r in ws)
def m(field: str) -> float:
vals = [int(r.get(field) or 0) for r in ws]
return statistics.fmean(vals) if vals else 0.0
r_ev = m("radix_evictable_tokens")
r_pr = m("radix_protected_tokens")
slot = m("slot_private_held_tokens")
rb = m("running_batch_kv_tokens")
tq = m("transfer_queue_tokens")
pq = m("prealloc_queue_tokens")
rq = m("retracted_queue_tokens")
avail = m("available_tokens")
# `running_batch_kv_tokens` overlaps with radix_protected for tree-tracked
# reqs — do NOT subtract it again. Decomposition assumes:
# capacity ≈ avail + r_evictable + r_protected + slot_private
# + transfer_queue + prealloc_queue + retracted_queue + unaccounted
unacc = max(
0,
cap - avail - r_ev - r_pr - slot - tq - pq - rq,
)
print(
f"{wid:<12} {role:<8} | "
f"{_fmt_tokens(r_ev):>11} {_fmt_tokens(r_pr):>11} {_fmt_tokens(slot):>12} | "
f"{_fmt_tokens(rb):>10} {_fmt_tokens(tq):>9} {_fmt_tokens(pq):>9} {_fmt_tokens(rq):>10} | "
f"{_fmt_tokens(unacc):>11}"
)
summary["workers"][wid]["pool_breakdown_avg"] = {
"radix_evictable": r_ev,
"radix_protected": r_pr,
"slot_private_held": slot,
"running_batch_kv": rb,
"transfer_queue": tq,
"prealloc_queue": pq,
"retracted_queue": rq,
"available": avail,
"unaccounted": unacc,
}
print(
"\nNote: running_batch_kv_tokens overlaps with radix_protected_tokens "
"(tree-tracked decode reqs are also in protected); not summed."
)
else:
print("\n(P1 instrument absent: pool_breakdown fields are all zero)")
# Session residency churn: how many distinct sessions ever sat on each worker,
# and how many sessions hopped across workers (= starvation indicator).
print("\n=== Session residency churn ===")
sessions_per_worker: dict[str, set[str]] = defaultdict(set)
workers_per_session: dict[str, set[str]] = defaultdict(set)
resident_ticks_per_session: Counter[str] = Counter()
resident_ticks_per_worker: Counter[str] = Counter()
for row in rows:
wid = row.get("worker_id")
if wid is None or row.get("worker_role") != "decode":
continue
sessions = row.get("sessions") or []
if not isinstance(sessions, list):
continue
for entry in sessions:
if not isinstance(entry, dict):
continue
sid = entry.get("session_id")
if sid is None:
continue
if entry.get("resident"):
sessions_per_worker[wid].add(sid)
workers_per_session[sid].add(wid)
resident_ticks_per_session[(wid, sid)] += 1
resident_ticks_per_worker[wid] += 1
# Per-decode worker: distinct session count
print(f" {'worker':<12} {'distinct_sess':>14} {'resident_ticks':>16}")
for wid in sorted(sessions_per_worker.keys()):
print(
f" {wid:<12} {len(sessions_per_worker[wid]):>14} "
f"{resident_ticks_per_worker[wid]:>16}"
)
# Per session: how many workers it hopped across
hops = Counter(len(ws) for ws in workers_per_session.values())
print(f"\n Sessions seen on N workers (decode side):")
for n, count in sorted(hops.items()):
print(f" on {n} worker(s): {count} sessions")
starvation = [sid for sid, ws in workers_per_session.items() if len(ws) == 0]
multi_hopper = sorted(
((sid, ws) for sid, ws in workers_per_session.items() if len(ws) >= 2),
key=lambda x: -len(x[1]),
)[:10]
if multi_hopper:
print(
"\n Top sessions seen resident on multiple workers (potential thrashing):"
)
for sid, ws in multi_hopper:
print(f" {sid}: {len(ws)} workers ({sorted(ws)})")
summary["session_residency"] = {
"distinct_sessions_per_worker": {
wid: len(s) for wid, s in sessions_per_worker.items()
},
"session_hop_count_distribution": dict(hops),
"starvation_session_count": len(starvation),
}
# If a request-metrics file is co-located, also bucket fallback reasons
# against contemporaneous pool state (rough — uses tick nearest to median tick).
metrics_path = timeseries_path.with_name("request-metrics.jsonl")
if metrics_path.exists():
print(f"\n=== Request-metrics summary ({metrics_path.name}) ===")
mrows = _load_jsonl(metrics_path)
modes = Counter(r.get("execution_mode") or "?" for r in mrows)
total = sum(modes.values())
for mode, count in modes.most_common():
print(f" {count:>6} ({100 * count / total:5.1f}%) {mode}")
summary["execution_modes"] = dict(modes)
return summary
def main() -> None:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"path",
type=Path,
help="Path to d-pool-timeseries.jsonl OR a run dir containing it",
)
parser.add_argument(
"--json",
action="store_true",
help="Also print a machine-readable JSON summary",
)
args = parser.parse_args()
resolved = _resolve_input(args.path)
summary = analyze(resolved)
if args.json:
print("\n=== JSON summary ===")
print(json.dumps(summary, indent=2, sort_keys=True, default=str))
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,316 @@
#!/usr/bin/env python3
"""TS=1 validation analysis: KVC 1P3D × N=3 + 4DP × 1.
Reads metrics from outputs/qwen3-30b-tp1-ts1-validation/{kvc_1p3d_run{1,2,3},dp4}_metrics.jsonl
and reports per the structural claims in docs/AGENTIC_FIT_ANALYSIS_ZH.md and TEAM_REPORT.
Sections:
1. Headline summary table (errors, latency p50/p90/p99, TTFT p50)
2. §1 (session pinning): distinct-D-per-session distribution + direct-to-D bimodal
3. §1 (cross-run consistency): sessions consistently starved across all 3 runs + size ratio
4. §2 (LRU): KVTransferError counts per D + peak token_usage from worker logs
5. §7 (ts=1 vs ts=10): direct-to-D rate, fallback rate, per-D load balance
6. KVC vs DP same-scale comparison
Usage: python scripts/analysis/analyze_ts1_validation.py [--root PATH]
"""
import argparse
import json
import re
from collections import Counter, defaultdict
from pathlib import Path
import numpy as np
def load_metrics(path):
rows = []
with open(path) as f:
for line in f:
line = line.strip()
if not line:
continue
rows.append(json.loads(line))
return rows
def load_summary(path):
with open(path) as f:
return json.load(f)
def pct(arr, p):
if not arr:
return float("nan")
return float(np.percentile(arr, p))
def summarize_run(label, rows, summary):
ok = [r for r in rows if r.get("error") is None]
err = [r for r in rows if r.get("error") is not None]
lats = [r["latency_s"] for r in ok if r.get("latency_s") is not None]
ttfts = [r["ttft_s"] for r in ok if r.get("ttft_s") is not None]
return {
"label": label,
"n": len(rows),
"ok": len(ok),
"err": len(err),
"lat_mean": float(np.mean(lats)) if lats else float("nan"),
"lat_p50": pct(lats, 50),
"lat_p90": pct(lats, 90),
"lat_p99": pct(lats, 99),
"ttft_mean": float(np.mean(ttfts)) if ttfts else float("nan"),
"ttft_p50": pct(ttfts, 50),
"summary": summary,
}
def headline_table(stats):
print("\n" + "=" * 110)
print("HEADLINE: same trace, same scale, same ts=1")
print("=" * 110)
cols = ["label", "ok/n", "err", "lat_mean", "lat_p50", "lat_p90", "lat_p99", "ttft_mean", "ttft_p50"]
print(f"{cols[0]:<22}{cols[1]:>12}{cols[2]:>6}{cols[3]:>10}{cols[4]:>10}{cols[5]:>10}{cols[6]:>10}{cols[7]:>10}{cols[8]:>10}")
for s in stats:
ok_n = f"{s['ok']}/{s['n']}"
print(f"{s['label']:<22}{ok_n:>12}{s['err']:>6}"
f"{s['lat_mean']:>9.3f}s{s['lat_p50']:>9.3f}s{s['lat_p90']:>9.3f}s{s['lat_p99']:>9.3f}s"
f"{s['ttft_mean']:>9.3f}s{s['ttft_p50']:>9.3f}s")
def session_pinning(rows, label):
"""§1: distinct D per session — should be ~1.0 if pin behavior persists."""
sess_d = defaultdict(set)
for r in rows:
sid = r.get("session_id")
d = r.get("assigned_decode_node") or r.get("decode_node")
if sid is not None and d is not None:
sess_d[sid].add(d)
if not sess_d:
return None
distinct = [len(s) for s in sess_d.values()]
return {
"label": label,
"n_sessions": len(sess_d),
"avg_distinct_D": float(np.mean(distinct)),
"max_distinct_D": max(distinct),
"sess_d": {sid: sorted(ds) for sid, ds in sess_d.items()},
}
def direct_to_d_distribution(rows, label):
"""§1: per-session direct-to-D rate; check for bimodal."""
sess_total = Counter()
sess_direct = Counter()
for r in rows:
sid = r.get("session_id")
if sid is None:
continue
sess_total[sid] += 1
mode = r.get("execution_mode", "")
if mode == "kvcache-direct-to-d-session":
sess_direct[sid] += 1
rates = []
for sid in sess_total:
rate = sess_direct[sid] / sess_total[sid]
rates.append((sid, rate, sess_total[sid]))
bins = [0, 0.2, 0.4, 0.6, 0.8, 1.01]
bin_labels = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]
counts = [0] * 5
for _, r, _ in rates:
for i in range(5):
if bins[i] <= r < bins[i + 1]:
counts[i] += 1
break
print(f"\n [{label}] direct-to-D rate distribution (n={len(rates)} sessions):")
for lbl, cnt in zip(bin_labels, counts):
bar = "" * cnt
print(f" {lbl:<10}: {cnt:>3} {bar}")
return rates
def starved_cross_run(per_run_rates, threshold=0.20):
"""§1: sessions starved (<threshold direct-to-D) in ALL runs."""
if len(per_run_rates) < 2:
return None
sess_starved = defaultdict(int)
sess_lucky = defaultdict(int)
for rates in per_run_rates:
for sid, rate, _ in rates:
if rate < threshold:
sess_starved[sid] += 1
elif rate > 0.80:
sess_lucky[sid] += 1
n_runs = len(per_run_rates)
consistently_starved = [sid for sid, c in sess_starved.items() if c == n_runs]
consistently_lucky = [sid for sid, c in sess_lucky.items() if c == n_runs]
return {
"n_runs": n_runs,
"consistently_starved": consistently_starved,
"consistently_lucky": consistently_lucky,
}
def session_size_comparison(rows, sids_a, sids_b, label_a="A", label_b="B"):
"""Compare peak input_length of two session groups."""
sess_max_input = defaultdict(int)
for r in rows:
sid = r.get("session_id")
ilen = r.get("input_length") or 0
if sid is not None and ilen > sess_max_input[sid]:
sess_max_input[sid] = ilen
a_inputs = [sess_max_input[s] for s in sids_a if s in sess_max_input]
b_inputs = [sess_max_input[s] for s in sids_b if s in sess_max_input]
if a_inputs and b_inputs:
ratio = np.mean(a_inputs) / np.mean(b_inputs)
print(f"\n Cross-run starvation correlates with session size?")
print(f" consistently {label_a} (n={len(a_inputs)}): peak_input mean = {np.mean(a_inputs):.0f}")
print(f" consistently {label_b} (n={len(b_inputs)}): peak_input mean = {np.mean(b_inputs):.0f}")
print(f" {label_a}/{label_b} ratio = {ratio:.2f}x (ts=10 baseline was 1.98x)")
def per_d_balance(rows, label):
"""§7: per-D load balance."""
per_d = Counter()
for r in rows:
d = r.get("assigned_decode_node") or r.get("decode_node")
if d:
per_d[d] += 1
if not per_d:
return
counts = list(per_d.values())
spread = (max(counts) - min(counts)) / max(np.mean(counts), 1)
print(f"\n [{label}] per-D load: {dict(sorted(per_d.items()))}")
print(f" spread (max-min)/mean = {spread*100:.1f}% "
f"(ts=10 KVC 2P6D = ±26%, 8DP CA = ±10%)")
def execution_modes_table(rows, label):
"""Show top execution modes."""
ok = [r for r in rows if r.get("error") is None]
if not ok:
return
modes = Counter(r["execution_mode"] for r in ok)
print(f"\n [{label}] execution modes (n_ok={len(ok)}):")
for mode, cnt in modes.most_common(8):
mode_rows = [r for r in ok if r["execution_mode"] == mode]
lats = [r["latency_s"] for r in mode_rows if r.get("latency_s") is not None]
ttfts = [r["ttft_s"] for r in mode_rows if r.get("ttft_s") is not None]
if lats:
print(f" {mode:<55} {cnt:>5} ({cnt/len(ok)*100:>4.1f}%) "
f"lat p50={pct(lats,50):.3f}s p90={pct(lats,90):.3f}s ttft p50={pct(ttfts,50):.3f}s")
def lru_vs_errors(run_dir, label):
"""§2: trim events vs KVTransferError per worker."""
log_dir = run_dir / "logs"
if not log_dir.exists():
return
print(f"\n [{label}] D-side LRU vs errors (from worker logs):")
print(f" {'worker':<14}{'trim':>8}{'KVTransferError':>20}{'peak_token_usage':>20}")
for log_file in sorted(log_dir.glob("decode-*.log")):
worker = log_file.stem
text = log_file.read_text(errors="ignore")
trim_count = len(re.findall(r"Trimmed decode session cache", text))
err_count = len(re.findall(r"KVTransferError", text))
usages = re.findall(r"token usage: ([\d.]+)", text)
peak = max((float(u) for u in usages), default=0.0)
print(f" {worker:<14}{trim_count:>8}{err_count:>20}{peak:>20.3f}")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--root", default="outputs/qwen3-30b-tp1-ts1-validation",
help="Sweep output root")
args = parser.parse_args()
root = Path(args.root)
if not root.is_absolute():
root = Path("/mnt/kzlin/workflow/pd-hybrid/agentic-pd-hybrid") / root
# Load all available runs
stats = []
rows_by_run = {}
for label in ("kvc_1p3d_run1", "kvc_1p3d_run2", "kvc_1p3d_run3", "dp4"):
m = root / f"{label}_metrics.jsonl"
s = root / f"{label}_summary.json"
if not m.exists() or not s.exists():
print(f" [{label}] not yet available ({m.name})")
continue
rows = load_metrics(m)
summary = load_summary(s)
rows_by_run[label] = rows
stats.append(summarize_run(label, rows, summary))
if not stats:
print("No runs available yet.")
return
# 1. Headline table
headline_table(stats)
# 2. §1 session pinning per KVC run + per-D balance + execution modes
print("\n" + "=" * 110)
print("§1 / §7: SESSION PINNING + LOAD BALANCE")
print("=" * 110)
per_run_rates = []
for label, rows in rows_by_run.items():
if not label.startswith("kvc_"):
continue
pin = session_pinning(rows, label)
if pin:
print(f"\n [{label}] sessions={pin['n_sessions']} "
f"avg_distinct_D={pin['avg_distinct_D']:.2f} "
f"max_distinct_D={pin['max_distinct_D']} "
f"(ts=10 baseline avg=1.00 → 100% pin)")
rates = direct_to_d_distribution(rows, label)
per_run_rates.append(rates)
per_d_balance(rows, label)
execution_modes_table(rows, label)
# 3. §1 cross-run starvation
if len(per_run_rates) >= 2:
print("\n" + "=" * 110)
print(f"§1 CROSS-RUN STARVATION (across {len(per_run_rates)} KVC runs)")
print("=" * 110)
cross = starved_cross_run(per_run_rates)
if cross:
n_starved = len(cross["consistently_starved"])
n_lucky = len(cross["consistently_lucky"])
print(f"\n Sessions starved (<20% direct-to-D) in all {cross['n_runs']} runs: {n_starved}")
print(f" Sessions lucky (>80% direct-to-D) in all {cross['n_runs']} runs: {n_lucky}")
print(f" (ts=10 baseline: 13/52 starved, 14/52 lucky — extreme bimodal)")
# session size comparison from run 1
if "kvc_1p3d_run1" in rows_by_run and n_starved and n_lucky:
session_size_comparison(rows_by_run["kvc_1p3d_run1"],
cross["consistently_starved"],
cross["consistently_lucky"],
"starved", "lucky")
# 4. §2 D-side LRU vs errors from raw logs
print("\n" + "=" * 110)
print("§2: D-SIDE LRU TRIM vs KVTransferError (from worker logs)")
print("=" * 110)
for label in rows_by_run:
if not label.startswith("kvc_"):
continue
# find the matching raw run dir
run_dirs = sorted(root.glob("kvcache-centric-*/"))
if not run_dirs:
continue
# naive: index matches run order; could be wrong if dirs got reordered
idx = int(label.split("run")[-1]) - 1
if idx < len(run_dirs):
lru_vs_errors(run_dirs[idx], label)
# 5. DP-only inspection
if "dp4" in rows_by_run:
print("\n" + "=" * 110)
print("4DP CA SANITY")
print("=" * 110)
per_d_balance(rows_by_run["dp4"], "dp4")
execution_modes_table(rows_by_run["dp4"], "dp4")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,89 @@
#!/usr/bin/env python3
"""Analyze v3 (kv-aware) results — find why fallback-large-append-session-cap dominates."""
import json
import numpy as np
from pathlib import Path
from collections import Counter, defaultdict
BASE = Path(__file__).parent
def load_rows(jsonl_path):
rows = []
with open(jsonl_path) as f:
for line in f:
rows.append(json.loads(line))
return rows
exp1 = load_rows(BASE / "exp1_1p7d_kvc_kvaware_metrics.jsonl")
exp2 = load_rows(BASE / "exp2_2p6d_kvc_kvaware_metrics.jsonl")
for name, rows in [("Exp1 1P7D", exp1), ("Exp2 2P6D", exp2)]:
print(f"\n========== {name} ==========")
ok = [r for r in rows if r.get("error") is None]
# Execution mode breakdown by latency
modes = Counter(r["execution_mode"] for r in ok)
print(f"\nExecution modes (n={len(ok)}):")
for mode, count in modes.most_common():
mode_rows = [r for r in ok if r["execution_mode"] == mode]
lats = [r["latency_s"] for r in mode_rows]
ttfts = [r["ttft_s"] for r in mode_rows]
print(f" {mode}: n={count} ({count/len(ok)*100:.1f}%) "
f"lat P50={np.percentile(lats,50):.3f}s P90={np.percentile(lats,90):.3f}s | "
f"ttft P50={np.percentile(ttfts,50):.3f}s P90={np.percentile(ttfts,90):.3f}s")
# Per-D session distribution
per_d_sessions = defaultdict(set)
for r in ok:
d = r.get("assigned_decode_node", "?")
per_d_sessions[d].add(r["session_id"])
print(f"\nSessions per D worker:")
for d in sorted(per_d_sessions.keys()):
print(f" {d}: {len(per_d_sessions[d])} unique sessions")
# session-cap fallback analysis
sc_rows = [r for r in ok if r["execution_mode"] == "pd-router-fallback-large-append-session-cap"]
if sc_rows:
print(f"\nSession-cap fallback details (n={len(sc_rows)}):")
# Which sessions hit this most?
sc_per_sess = Counter(r["session_id"] for r in sc_rows)
print(f" Sessions hitting session-cap (top 5):")
for sid, cnt in sc_per_sess.most_common(5):
print(f" session {sid}: {cnt} times")
# Per-D distribution
sc_per_d = Counter(r.get("assigned_decode_node", "?") for r in sc_rows)
print(f" Per-D distribution: {dict(sc_per_d.most_common())}")
# Input length distribution
inp = [r.get("input_length", 0) for r in sc_rows]
print(f" Input length: P50={np.percentile(inp,50):.0f} P90={np.percentile(inp,90):.0f}")
# Turn distribution
turns = Counter(r.get("turn_id", -1) for r in sc_rows)
print(f" Turn distribution (top 5): {dict(turns.most_common(5))}")
# Direct-to-D analysis (ideal path)
dd_rows = [r for r in ok if r["execution_mode"] == "kvcache-direct-to-d-session"]
if dd_rows:
lats = [r["latency_s"] for r in dd_rows]
ttfts = [r["ttft_s"] for r in dd_rows]
kv_blocks = [r.get("actual_kv_transfer_blocks", 0) for r in dd_rows]
cached = [r.get("cached_tokens", 0) for r in dd_rows]
print(f"\nDirect-to-D details (n={len(dd_rows)}):")
print(f" lat P50={np.percentile(lats,50):.3f}s P90={np.percentile(lats,90):.3f}s P99={np.percentile(lats,99):.3f}s")
print(f" ttft P50={np.percentile(ttfts,50):.3f}s P90={np.percentile(ttfts,90):.3f}s")
print(f" KV transfer: P50={np.percentile(kv_blocks,50):.0f} (should be 0 — no P involved)")
print(f" cached_tokens P50={np.percentile(cached,50):.0f}")
# Sessions: how many turns each, how many used direct-to-d
print(f"\nPer-session direct-to-D rate (top 10 by total turns):")
per_sess = defaultdict(list)
for r in ok:
per_sess[r["session_id"]].append(r)
sess_stats = []
for sid, sreqs in per_sess.items():
total = len(sreqs)
dd = sum(1 for r in sreqs if r["execution_mode"] == "kvcache-direct-to-d-session")
sc = sum(1 for r in sreqs if "session-cap" in r["execution_mode"])
sess_stats.append((sid, total, dd, sc))
sess_stats.sort(key=lambda x: -x[1])
for sid, total, dd, sc in sess_stats[:10]:
print(f" session {sid}: {total} turns, {dd} direct-to-D ({dd/total*100:.0f}%), {sc} session-cap fallback ({sc/total*100:.0f}%)")

View File

@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""V4 results analysis: errors, execution modes, latency by mode."""
import json
import numpy as np
from pathlib import Path
from collections import Counter
BASE = Path(__file__).parent
def load_rows(jsonl_path):
rows = []
with open(jsonl_path) as f:
for line in f:
rows.append(json.loads(line))
return rows
for name, path in [
("Exp1 1P7D cap=16", BASE / "exp1_1p7d_kvc_cap16_metrics.jsonl"),
("Exp2 2P6D cap=16", BASE / "exp2_2p6d_kvc_cap16_metrics.jsonl"),
]:
rows = load_rows(path)
print(f"\n========== {name} ==========")
ok = [r for r in rows if r.get("error") is None]
err = [r for r in rows if r.get("error") is not None]
print(f"Total: {len(rows)}, OK: {len(ok)}, Errors: {len(err)}")
# Errors finish_reason
if err:
finish_reasons = Counter()
for r in err:
fr = str(r.get("finish_reason") or r.get("error") or "?")
# Truncate long messages
short = fr[:120]
finish_reasons[short] += 1
print(f"\nError finish_reasons (top 5):")
for fr, cnt in finish_reasons.most_common(5):
print(f" {cnt}x: {fr}")
# Execution mode latency breakdown
modes = Counter(r["execution_mode"] for r in ok)
print(f"\nTop execution modes by latency:")
print(f"{'mode':<55}{'n':<8}{'%':<8}{'P50 lat':<10}{'P90 lat':<10}{'TTFT P50':<10}")
for mode, count in modes.most_common(8):
mode_rows = [r for r in ok if r["execution_mode"] == mode]
lats = [r["latency_s"] for r in mode_rows]
ttfts = [r["ttft_s"] for r in mode_rows]
print(f" {mode:<53}{count:<8}{count/len(ok)*100:>5.1f}% {np.percentile(lats,50):>7.3f}s {np.percentile(lats,90):>7.3f}s {np.percentile(ttfts,50):>7.3f}s")
# Per-D load
per_d = Counter(r.get("assigned_decode_node", "?") for r in ok)
print(f"\nPer-D load: max/min ratio = {max(per_d.values())/max(min(per_d.values()),1):.2f}x")
print(f" {dict(per_d.most_common())}")

View File

@@ -0,0 +1,136 @@
#!/usr/bin/env python3
"""Compare KVC variants vs baseline, EXCLUDING errors and truncated requests."""
import json
import numpy as np
from pathlib import Path
OUT = Path("/mnt/kzlin/workflow/pd-hybrid/agentic-pd-hybrid/outputs")
DATASETS = [
("baseline 8DP", OUT / "qwen3-30b-tp1-v2-fixed/exp1_8way_dp_cache_aware_metrics.jsonl"),
("v3 1P7D", OUT / "qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_metrics.jsonl"),
("v3 2P6D", OUT / "qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_metrics.jsonl"),
("v4 1P7D", OUT / "qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_metrics.jsonl"),
("v4 2P6D", OUT / "qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_metrics.jsonl"),
]
def load_rows(path):
rows = []
with open(path) as f:
for line in f:
rows.append(json.loads(line))
return rows
def is_truncated(row):
a = row.get("actual_output_tokens")
r = row.get("requested_output_tokens")
if a is not None and r is not None and r > 1:
return a < r * 0.5
return False
def stats(values):
if not values:
return {"n": 0}
a = np.array(values)
return {
"n": len(a),
"mean": float(np.mean(a)),
"p50": float(np.percentile(a, 50)),
"p90": float(np.percentile(a, 90)),
"p99": float(np.percentile(a, 99)),
}
def fmt(s, key):
if s["n"] == 0:
return "N/A"
v = s[key]
return f"{v:.3f}s" if v < 100 else f"{v:.1f}s"
results = []
for label, path in DATASETS:
if not path.exists():
print(f"SKIP {label}")
continue
rows = load_rows(path)
total = len(rows)
err_n = sum(1 for r in rows if r.get("error") is not None)
trunc_n = sum(1 for r in rows if r.get("error") is None and is_truncated(r))
# Filter: error=None AND not truncated AND latency present
clean = [r for r in rows
if r.get("error") is None
and not is_truncated(r)
and r.get("latency_s") is not None]
lats = [r["latency_s"] for r in clean]
ttfts = [r["ttft_s"] for r in clean if r.get("ttft_s") is not None]
results.append({
"label": label,
"total": total,
"err": err_n,
"trunc": trunc_n,
"clean_n": len(clean),
"lat": stats(lats),
"ttft": stats(ttfts),
})
# Print comparison table
print(f"\n{'='*100}")
print("LATENCY (excluding errors AND truncated)")
print(f"{'='*100}")
print(f"{'config':<16}{'total':>7}{'err':>6}{'trunc':>7}{'clean':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
for r in results:
print(f"{r['label']:<16}{r['total']:>7}{r['err']:>6}{r['trunc']:>7}{r['clean_n']:>7} "
f"{fmt(r['lat'],'mean'):>9}{fmt(r['lat'],'p50'):>9}{fmt(r['lat'],'p90'):>9}{fmt(r['lat'],'p99'):>9}")
print(f"\n{'='*100}")
print("TTFT (excluding errors AND truncated)")
print(f"{'='*100}")
print(f"{'config':<16}{'clean':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
for r in results:
print(f"{r['label']:<16}{r['clean_n']:>7} "
f"{fmt(r['ttft'],'mean'):>9}{fmt(r['ttft'],'p50'):>9}{fmt(r['ttft'],'p90'):>9}{fmt(r['ttft'],'p99'):>9}")
# Also: per-execution-mode breakdown for v4 only (the most interesting)
print(f"\n{'='*100}")
print("V4 2P6D: per-execution-mode (excluding errors and truncated)")
print(f"{'='*100}")
v4_2p6d = next((p for l, p in DATASETS if l == "v4 2P6D"), None)
if v4_2p6d:
rows = load_rows(v4_2p6d)
clean = [r for r in rows if r.get("error") is None and not is_truncated(r)]
from collections import Counter
modes = Counter(r["execution_mode"] for r in clean)
print(f"{'mode':<55}{'n':>7}{'%':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
for mode, count in modes.most_common(10):
m_rows = [r for r in clean if r["execution_mode"] == mode]
s = stats([r["latency_s"] for r in m_rows])
pct = count/len(clean)*100
print(f" {mode:<53}{count:>7}{pct:>6.1f}% {fmt(s,'mean'):>9}{fmt(s,'p50'):>9}{fmt(s,'p90'):>9}{fmt(s,'p99'):>9}")
# Also: WHAT IF we only count direct-to-D? (Pure KVC performance)
print(f"\n{'='*100}")
print("Pure KVC (kvcache-direct-to-d-session ONLY) vs Baseline")
print(f"{'='*100}")
for label, path in DATASETS:
if not path.exists() or "1P7D" not in label and "2P6D" not in label:
continue
rows = load_rows(path)
direct = [r for r in rows
if r.get("error") is None and not is_truncated(r)
and r.get("execution_mode") == "kvcache-direct-to-d-session"]
if not direct:
continue
s_lat = stats([r["latency_s"] for r in direct])
s_ttft = stats([r["ttft_s"] for r in direct if r.get("ttft_s") is not None])
print(f"{label:<16}n={s_lat['n']:>5} lat: P50={fmt(s_lat,'p50')} P90={fmt(s_lat,'p90')} ttft: P50={fmt(s_ttft,'p50')} P90={fmt(s_ttft,'p90')}")
# Baseline for reference (already non-fallback by definition)
print()
baseline_path = OUT / "qwen3-30b-tp1-v2-fixed/exp1_8way_dp_cache_aware_metrics.jsonl"
baseline_rows = load_rows(baseline_path)
clean = [r for r in baseline_rows if r.get("error") is None and not is_truncated(r)]
s_lat = stats([r["latency_s"] for r in clean])
s_ttft = stats([r["ttft_s"] for r in clean if r.get("ttft_s") is not None])
print(f"{'baseline 8DP':<16}n={s_lat['n']:>5} lat: P50={fmt(s_lat,'p50')} P90={fmt(s_lat,'p90')} ttft: P50={fmt(s_ttft,'p50')} P90={fmt(s_ttft,'p90')}")

View File

@@ -0,0 +1,209 @@
#!/usr/bin/env python3
"""Cache efficiency comparison: KVC 1P3D v2 vs 4-way DP CA.
Generates docs/figures/cache_efficiency.png — two-panel:
left: cache hit rate vs turn number (mechanism: affinity vs LRU)
right: ECDF of per-request uncached tokens (per-request impact)
Resolves the apparent paradox: KVC has 27% less total KV pool capacity
(3 × 92K = 276K vs DP 4 × 87K = 351K) yet achieves higher cache hit rate
(98.1% vs 96.8%) and lower mean uncached tokens per request (560 vs 952).
The left panel shows the mechanism: KVC's session affinity makes cache hit
rate grow with turn count (more cache accumulates on the pinned D), while
DP's hash + radix-LRU causes cache hit rate to decay through the middle
turns (other sessions' KV competes via LRU eviction).
The right panel quantifies the impact: KVC's uncached tokens are
concentrated near 0 (mean 560), DP's are spread (mean 952).
Aborted / errored requests are excluded.
"""
from __future__ import annotations
import json
from collections import defaultdict
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
ROOT = Path(__file__).resolve().parents[2]
KVC = ROOT / "outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl"
DP = ROOT / "outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl"
OUT = ROOT / "docs/figures/cache_efficiency.png"
def load(p: Path) -> list[dict]:
return [json.loads(line) for line in p.open()]
def is_failed(r: dict) -> bool:
if r.get("error"):
return True
fr = r.get("finish_reason")
if fr and ("abort" in str(fr).lower() or "badrequest" in str(fr).lower()):
return True
return False
def main() -> None:
kvc = [r for r in load(KVC) if not is_failed(r)]
dp = [r for r in load(DP) if not is_failed(r)]
KVC_COLOR = "#1F77B4"
DP_COLOR = "#D62728"
fig, axes = plt.subplots(1, 2, figsize=(15, 6.5))
# ------------------------------------------------------------------
# Left panel: cache hit rate per turn
# Bin requests by turn_id, plot mean hit rate per bin with shaded band
# ------------------------------------------------------------------
def bin_by_turn(rows: list[dict]) -> tuple[list[int], list[float], list[float], list[float]]:
per_turn: defaultdict[int, list[float]] = defaultdict(list)
for r in rows:
if r["input_length"] == 0:
continue
hit = r.get("cached_tokens", 0) / r["input_length"]
per_turn[r["turn_id"]].append(hit)
turns = sorted(per_turn.keys())
means, p25s, p75s = [], [], []
for t in turns:
arr = np.array(per_turn[t])
means.append(float(np.mean(arr)))
p25s.append(float(np.quantile(arr, 0.25)))
p75s.append(float(np.quantile(arr, 0.75)))
return turns, means, p25s, p75s
kvc_t, kvc_m, kvc_lo, kvc_hi = bin_by_turn(kvc)
dp_t, dp_m, dp_lo, dp_hi = bin_by_turn(dp)
# Cap x-axis: tails get noisy below ~5 samples per bin
max_turn = 100
ax = axes[0]
ax.plot(kvc_t, kvc_m, color=KVC_COLOR, lw=2.5,
label=f"KVC 1P3D v2 (overall hit 98.1%)")
ax.fill_between(kvc_t, kvc_lo, kvc_hi, color=KVC_COLOR, alpha=0.18,
label="KVC IQR (p25-p75)")
ax.plot(dp_t, dp_m, color=DP_COLOR, lw=2.5,
label=f"4-way DP CA (overall hit 96.8%)")
ax.fill_between(dp_t, dp_lo, dp_hi, color=DP_COLOR, alpha=0.18,
label="DP IQR (p25-p75)")
# Annotate the mid-turn drift gap
drift_turns = list(range(8, 25))
drift_kvc = np.mean([m for t, m in zip(kvc_t, kvc_m) if t in drift_turns])
drift_dp = np.mean([m for t, m in zip(dp_t, dp_m) if t in drift_turns])
ax.axvspan(8, 25, color="#999", alpha=0.08, label="_nolegend_")
ax.text(16, 0.65,
f"Mid-turn region\n(turns 8-25):\nKVC {drift_kvc*100:.1f}% | DP {drift_dp*100:.1f}%\nGap {(drift_kvc-drift_dp)*100:+.1f} pp",
ha="center", va="center", fontsize=9.5,
bbox=dict(facecolor="white", edgecolor="gray", alpha=0.92, pad=4))
ax.set_xlim(1, max_turn)
ax.set_ylim(0.4, 1.02)
ax.set_xlabel("Turn number within session", fontsize=11)
ax.set_ylabel("Per-request cache hit rate (cached / input_length)", fontsize=11)
ax.set_title("Cache hit rate vs turn number\n(mechanism: session affinity vs hash-LRU)",
fontsize=12, pad=10)
ax.legend(loc="lower right", fontsize=9.5, framealpha=0.95)
ax.grid(True, linestyle=":", alpha=0.4)
ax.set_axisbelow(True)
# ------------------------------------------------------------------
# Right panel: ECDF of per-request uncached tokens (log x)
# ------------------------------------------------------------------
def ecdf(rows: list[dict]) -> tuple[np.ndarray, np.ndarray]:
vals = np.array([
max(1, r["input_length"] - r.get("cached_tokens", 0))
for r in rows
])
vals = np.sort(vals)
return vals, np.arange(1, len(vals) + 1) / len(vals)
kvc_x, kvc_y = ecdf(kvc)
dp_x, dp_y = ecdf(dp)
ax = axes[1]
ax.plot(kvc_x, kvc_y, color=KVC_COLOR, lw=2.5,
label=f"KVC 1P3D v2 (mean {int(np.mean(kvc_x))} tokens)")
ax.plot(dp_x, dp_y, color=DP_COLOR, lw=2.5,
label=f"4-way DP CA (mean {int(np.mean(dp_x))} tokens)")
# Median markers
kvc_p50 = np.quantile(kvc_x, 0.50)
dp_p50 = np.quantile(dp_x, 0.50)
ax.axhline(0.5, color="gray", linestyle=":", alpha=0.5)
ax.text(1.2, 0.52, "median (50% of requests below this)",
fontsize=8.5, color="gray", style="italic")
ax.axvline(kvc_p50, color=KVC_COLOR, ls="--", alpha=0.5, lw=1.0)
ax.axvline(dp_p50, color=DP_COLOR, ls="--", alpha=0.5, lw=1.0)
ax.text(kvc_p50, 0.06, f"KVC\nmedian\n{int(kvc_p50)}",
color=KVC_COLOR, fontsize=9, ha="center", va="bottom",
bbox=dict(facecolor="white", edgecolor="none", alpha=0.75, pad=1))
ax.text(dp_p50, 0.06, f"DP\nmedian\n{int(dp_p50)}",
color=DP_COLOR, fontsize=9, ha="center", va="bottom",
bbox=dict(facecolor="white", edgecolor="none", alpha=0.75, pad=1))
# Annotate the separation: at uncached = 500 tokens, what fraction below?
sep_x = 500
kvc_at_sep = (kvc_x <= sep_x).mean()
dp_at_sep = (dp_x <= sep_x).mean()
ax.axvline(sep_x, color="#666", linestyle=":", alpha=0.6, lw=1.0)
ax.annotate(
f"At uncached = {sep_x} tokens:\n"
f"KVC {kvc_at_sep*100:.0f}% of requests below\n"
f"DP {dp_at_sep*100:.0f}% of requests below",
xy=(sep_x, dp_at_sep),
xytext=(2500, 0.35),
fontsize=9.5,
bbox=dict(facecolor="white", edgecolor="gray", alpha=0.92, pad=4),
arrowprops=dict(arrowstyle="->", color="#666", lw=0.8),
)
ax.set_xscale("log")
ax.set_xlim(1, 1e5)
ax.set_xticks([1, 10, 100, 1000, 10000, 100000])
ax.set_xticklabels(["1", "10", "100", "1K", "10K", "100K"])
ax.set_ylim(0, 1.02)
ax.set_xlabel("Uncached tokens per request (log scale)", fontsize=11)
ax.set_ylabel("Cumulative fraction of requests", fontsize=11)
ax.set_title("ECDF of uncached tokens per request\n(impact: KVC concentrates near zero)",
fontsize=12, pad=10)
ax.legend(loc="lower right", fontsize=10, framealpha=0.95)
ax.grid(True, which="both", linestyle=":", alpha=0.4)
ax.set_axisbelow(True)
fig.suptitle(
"Cache efficiency paradox: KVC has 27% LESS total KV pool (276K vs 351K tokens) yet caches MORE per request.\n"
"Left: session-affinity lets KVC's cache accumulate with turns; DP's hash-LRU loses cache to cross-session competition.\n"
"Right: net effect — KVC's uncached compute is concentrated near zero, DP's is spread over 100-10K tokens.",
fontsize=11.5, y=1.05,
)
plt.tight_layout()
plt.savefig(OUT, dpi=150, bbox_inches="tight")
print(f"wrote {OUT}")
plt.close(fig)
# ------------------------------------------------------------------
# Print summary for doc reference
# ------------------------------------------------------------------
print("\n=== Cache efficiency stats ===")
print(f"KVC v2: total_input={sum(r['input_length'] for r in kvc)/1e6:.1f}M tokens")
print(f" total_cached={sum(r.get('cached_tokens',0) for r in kvc)/1e6:.1f}M tokens")
print(f" hit rate {sum(r.get('cached_tokens',0) for r in kvc)/sum(r['input_length'] for r in kvc)*100:.2f}%")
print(f" mean uncached {np.mean(kvc_x):.0f} p50 {kvc_p50:.0f} p90 {np.quantile(kvc_x, 0.9):.0f}")
print(f"\nDP 4w: total_input={sum(r['input_length'] for r in dp)/1e6:.1f}M tokens")
print(f" total_cached={sum(r.get('cached_tokens',0) for r in dp)/1e6:.1f}M tokens")
print(f" hit rate {sum(r.get('cached_tokens',0) for r in dp)/sum(r['input_length'] for r in dp)*100:.2f}%")
print(f" mean uncached {np.mean(dp_x):.0f} p50 {dp_p50:.0f} p90 {np.quantile(dp_x, 0.9):.0f}")
print(f"\nMid-turn region (8-25): KVC {drift_kvc*100:.2f}% DP {drift_dp*100:.2f}% (gap {(drift_kvc-drift_dp)*100:+.2f}pp)")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,234 @@
#!/usr/bin/env python3
"""Per-GPU utilization breakdown: KVC 1P3D v2 vs 4-way DP CA.
Generates docs/figures/gpu_utilization.png — two-panel:
left: per-GPU request count
right: per-GPU compute work (uncached prefill tokens + decode tokens, stacked)
The point of the figure is to push back on the naïve reading
"KVC's prefill GPU is idle 90% of the time, so KVC is using fewer GPUs."
By request count, the prefill GPU is indeed touched by only ~8% of requests.
By compute work, the prefill GPU bears comparable per-GPU load to each
decode GPU — it is a low-frequency, high-cost safety net for cache misses,
not idle capacity.
Work attribution:
KVC direct-to-D path: prefill happens locally on the assigned D worker
(append-prefill of `uncached_tokens` tokens).
KVC seed/reseed/fallback path: prefill happens on prefill-0
(full uncached_tokens), decode on assigned D.
DP: all work on assigned direct-N worker.
Aborted / errored requests are excluded.
"""
from __future__ import annotations
import json
from collections import defaultdict
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
ROOT = Path(__file__).resolve().parents[2]
KVC = ROOT / "outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl"
DP = ROOT / "outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl"
OUT = ROOT / "docs/figures/gpu_utilization.png"
def load(p: Path) -> list[dict]:
return [json.loads(line) for line in p.open()]
def is_failed(r: dict) -> bool:
if r.get("error"):
return True
fr = r.get("finish_reason")
if fr and ("abort" in str(fr).lower() or "badrequest" in str(fr).lower()):
return True
return False
def uncached(r: dict) -> int:
return max(0, r["input_length"] - r.get("cached_tokens", 0))
def out_tokens(r: dict) -> int:
return r.get("actual_output_tokens") or r.get("output_length") or 0
def main() -> None:
kvc = [r for r in load(KVC) if not is_failed(r)]
dp = [r for r in load(DP) if not is_failed(r)]
# ------------------------------------------------------------------
# KVC per-GPU attribution
# ------------------------------------------------------------------
kvc_req_count = defaultdict(int)
kvc_prefill_tokens = defaultdict(int) # uncached prefill compute
kvc_decode_tokens = defaultdict(int)
for r in kvc:
d = r["assigned_decode_node"] # decode-0/1/2
p = r["assigned_prefill_node"] # prefill-0
mode = r.get("execution_mode", "")
if mode == "kvcache-direct-to-d-session":
# P is bypassed entirely; D does the append-prefill + decode
kvc_req_count[d] += 1
kvc_prefill_tokens[d] += uncached(r)
kvc_decode_tokens[d] += out_tokens(r)
else:
# P does the full prefill; D handles decode
kvc_req_count[p] += 1
kvc_req_count[d] += 1 # decode side still counts
kvc_prefill_tokens[p] += uncached(r)
kvc_decode_tokens[d] += out_tokens(r)
# ------------------------------------------------------------------
# DP per-GPU attribution (fused P+D on every worker)
# ------------------------------------------------------------------
dp_req_count = defaultdict(int)
dp_prefill_tokens = defaultdict(int)
dp_decode_tokens = defaultdict(int)
for r in dp:
w = r["assigned_decode_node"] # direct-0..3
dp_req_count[w] += 1
dp_prefill_tokens[w] += uncached(r)
dp_decode_tokens[w] += out_tokens(r)
# ------------------------------------------------------------------
# Build ordered GPU list, KVC then DP
# ------------------------------------------------------------------
kvc_gpus = ["prefill-0", "decode-0", "decode-1", "decode-2"]
dp_gpus = ["direct-0", "direct-1", "direct-2", "direct-3"]
all_gpus = kvc_gpus + dp_gpus
def get(d, k):
return d.get(k, 0)
counts = [get(kvc_req_count, g) for g in kvc_gpus] + \
[get(dp_req_count, g) for g in dp_gpus]
prefill_tk = [get(kvc_prefill_tokens, g) for g in kvc_gpus] + \
[get(dp_prefill_tokens, g) for g in dp_gpus]
decode_tk = [get(kvc_decode_tokens, g) for g in kvc_gpus] + \
[get(dp_decode_tokens, g) for g in dp_gpus]
# Display labels: P/D role + worker id
labels = [
"KVC P\nprefill-0",
"KVC D\ndecode-0",
"KVC D\ndecode-1",
"KVC D\ndecode-2",
"DP P+D\ndirect-0",
"DP P+D\ndirect-1",
"DP P+D\ndirect-2",
"DP P+D\ndirect-3",
]
kvc_mask = [True, True, True, True, False, False, False, False]
KVC_P_COLOR = "#E89D44" # orange — P GPU stands out
KVC_D_COLOR = "#1F77B4" # blue
DP_COLOR = "#D62728" # red
bar_colors = [KVC_P_COLOR, KVC_D_COLOR, KVC_D_COLOR, KVC_D_COLOR,
DP_COLOR, DP_COLOR, DP_COLOR, DP_COLOR]
fig, axes = plt.subplots(1, 2, figsize=(15, 6.5))
x = np.arange(len(all_gpus))
# -- Left: per-GPU request count ----------------------------------
ax = axes[0]
bars = ax.bar(x, counts, color=bar_colors, edgecolor="black", linewidth=0.6)
for xi, c in zip(x, counts):
ax.text(xi, c + max(counts) * 0.015, f"{c:,}",
ha="center", va="bottom", fontsize=9.5)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=9.5)
ax.set_ylabel("Number of requests touching this GPU", fontsize=11)
ax.set_title("Per-GPU request count\n(naïve view: P seems idle)", fontsize=12, pad=10)
ax.grid(axis="y", linestyle=":", alpha=0.4)
ax.set_axisbelow(True)
# Annotate: KVC P GPU is "low frequency"
p_idx = 0
p_pct = counts[p_idx] / sum(counts[:4]) * 100 # vs KVC total
ax.annotate(
f"P GPU only sees\n"
f"{counts[p_idx]:,} requests\n"
f"({counts[p_idx]/len(kvc)*100:.1f}% of total)",
xy=(p_idx, counts[p_idx]),
xytext=(p_idx + 0.6, max(counts) * 0.55),
fontsize=9, color=KVC_P_COLOR, fontweight="bold",
arrowprops=dict(arrowstyle="->", color=KVC_P_COLOR, lw=1.0),
)
# -- Right: per-GPU compute work (stacked prefill + decode) -------
ax = axes[1]
prefill_M = [t / 1e6 for t in prefill_tk]
decode_M = [t / 1e6 for t in decode_tk]
total_M = [p + d for p, d in zip(prefill_M, decode_M)]
bars_p = ax.bar(x, prefill_M, color=[c for c in bar_colors],
edgecolor="black", linewidth=0.6, label="Uncached prefill tokens",
alpha=0.95)
bars_d = ax.bar(x, decode_M, bottom=prefill_M, color=[c for c in bar_colors],
edgecolor="black", linewidth=0.6, hatch="///",
label="Decode tokens", alpha=0.55)
for xi, t in zip(x, total_M):
ax.text(xi, t + max(total_M) * 0.015, f"{t:.2f}M",
ha="center", va="bottom", fontsize=9.5)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=9.5)
ax.set_ylabel("Compute tokens (millions)", fontsize=11)
ax.set_title("Per-GPU compute work\n(work view: P is comparable to each D)",
fontsize=12, pad=10)
ax.grid(axis="y", linestyle=":", alpha=0.4)
ax.set_axisbelow(True)
ax.legend(loc="upper left", fontsize=10, framealpha=0.95)
# Annotate: KVC P GPU does similar work to each D
ax.annotate(
f"P GPU does {total_M[p_idx]:.2f}M tokens of\n"
f"prefill — comparable per-GPU\n"
f"load to each KVC D worker",
xy=(p_idx, total_M[p_idx]),
xytext=(p_idx + 0.6, max(total_M) * 0.62),
fontsize=9, color=KVC_P_COLOR, fontweight="bold",
arrowprops=dict(arrowstyle="->", color=KVC_P_COLOR, lw=1.0),
)
# Separator + group labels
for ax in axes:
ax.axvline(3.5, color="gray", linestyle="--", linewidth=1.0, alpha=0.5)
ymin, ymax = ax.get_ylim()
ax.text(1.5, ymax * 1.05, "KVC 1P3D", ha="center", fontsize=11,
fontweight="bold", color="#444")
ax.text(5.5, ymax * 1.05, "DP 4-way CA", ha="center", fontsize=11,
fontweight="bold", color="#444")
fig.suptitle(
"Per-GPU utilization: \"is KVC's prefill GPU wasted?\"\n"
"Left view says yes (only 8% of requests); right view says no (comparable work to each D).",
fontsize=13, y=1.02,
)
plt.tight_layout()
plt.savefig(OUT, dpi=150, bbox_inches="tight")
print(f"wrote {OUT}")
plt.close(fig)
# ------------------------------------------------------------------
# Print numbers for doc reference
# ------------------------------------------------------------------
print("\n=== Per-GPU numbers ===")
print(f"{'GPU':<22} {'requests':>10} {'prefill(M)':>12} {'decode(M)':>12} {'total(M)':>10}")
for lbl, n, pM, dM in zip(labels, counts, prefill_M, decode_M):
print(f" {lbl.replace(chr(10), ' '):<20} {n:>10} {pM:>12.3f} {dM:>12.3f} {pM+dM:>10.3f}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,199 @@
#!/usr/bin/env python3
"""Generate TTFT probability density curves: KVC 1P3D v2 vs 4-way DP CA.
Inputs:
outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl
outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl
Outputs:
docs/figures/ttft_pdf_comparison.png -- two-panel figure:
left panel: linear x in [0, 1.0]s zoomed on the body
right panel: log x covering full range (0.01 -- 10 s)
Each KDE curve uses scipy.stats.gaussian_kde with Scott's rule bandwidth.
Aborted requests are excluded (same filter as metrics.py:_is_failed_request).
"""
from __future__ import annotations
import json
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
ROOT = Path(__file__).resolve().parents[2]
KVC = ROOT / "outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl"
DP = ROOT / "outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl"
OUT = ROOT / "docs/figures/ttft_pdf_comparison.png"
def load(p: Path) -> list[dict]:
return [json.loads(line) for line in p.open()]
def is_failed(r: dict) -> bool:
if r.get("error"):
return True
fr = r.get("finish_reason")
if fr and ("abort" in str(fr).lower() or "badrequest" in str(fr).lower()):
return True
return False
def pct(vals: np.ndarray, q: float) -> float:
return float(np.quantile(vals, q))
def main() -> None:
kvc = [r for r in load(KVC) if not is_failed(r)]
dp = [r for r in load(DP) if not is_failed(r)]
kvc_ttft = np.array([r["ttft_s"] for r in kvc if r.get("ttft_s") is not None])
dp_ttft = np.array([r["ttft_s"] for r in dp if r.get("ttft_s") is not None])
# Trim absurdly small zeros (rare measurement artifacts) so log KDE behaves.
kvc_ttft = kvc_ttft[kvc_ttft > 1e-4]
dp_ttft = dp_ttft[dp_ttft > 1e-4]
KVC_COLOR = "#1F77B4" # blue
DP_COLOR = "#D62728" # red
fig, axes = plt.subplots(1, 2, figsize=(16, 6.5))
# ------------------------------------------------------------------
# Left panel: linear x ∈ [0, 0.6]s -- body of the distribution
# ------------------------------------------------------------------
ax = axes[0]
x_body = np.linspace(0.0, 0.6, 600)
# KDE on linear ttft values, clipped to body
kde_kvc_lin = gaussian_kde(kvc_ttft, bw_method=0.15)
kde_dp_lin = gaussian_kde(dp_ttft, bw_method=0.15)
ax.plot(x_body, kde_kvc_lin(x_body),
color=KVC_COLOR, lw=2.5, label=f"KVC 1P3D v2 (n={len(kvc_ttft)})")
ax.fill_between(x_body, kde_kvc_lin(x_body), alpha=0.20, color=KVC_COLOR)
ax.plot(x_body, kde_dp_lin(x_body),
color=DP_COLOR, lw=2.5, label=f"4-way DP CA (n={len(dp_ttft)})")
ax.fill_between(x_body, kde_dp_lin(x_body), alpha=0.20, color=DP_COLOR)
# Vertical lines for p50, p90
for q, ls in [(0.50, "-"), (0.90, "--")]:
ax.axvline(pct(kvc_ttft, q), color=KVC_COLOR, ls=ls, alpha=0.55, lw=1.1)
ax.axvline(pct(dp_ttft, q), color=DP_COLOR, ls=ls, alpha=0.55, lw=1.1)
ymax = ax.get_ylim()[1]
ax.text(pct(kvc_ttft, 0.50), ymax * 0.97,
f"KVC p50\n{pct(kvc_ttft, 0.50)*1000:.0f}ms",
color=KVC_COLOR, fontsize=9, va="top", ha="left",
bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))
ax.text(pct(dp_ttft, 0.50), ymax * 0.50,
f"DP p50\n{pct(dp_ttft, 0.50)*1000:.0f}ms",
color=DP_COLOR, fontsize=9, va="top", ha="left",
bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))
ax.text(pct(kvc_ttft, 0.90), ymax * 0.30,
f"KVC p90\n{pct(kvc_ttft, 0.90)*1000:.0f}ms",
color=KVC_COLOR, fontsize=9, va="top", ha="left",
bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))
ax.text(pct(dp_ttft, 0.90), ymax * 0.18,
f"DP p90\n{pct(dp_ttft, 0.90)*1000:.0f}ms",
color=DP_COLOR, fontsize=9, va="top", ha="left",
bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))
ax.set_xlim(0, 0.6)
ax.set_xlabel("TTFT (seconds, linear)", fontsize=11)
ax.set_ylabel("Probability density", fontsize=11)
ax.set_title("Body of distribution (TTFT ≤ 0.6 s)", fontsize=12, pad=10)
ax.legend(loc="upper right", fontsize=10, framealpha=0.95)
ax.grid(True, linestyle=":", alpha=0.4)
ax.set_axisbelow(True)
# ------------------------------------------------------------------
# Right panel: log x ∈ [0.01, 10]s -- full range incl. tail
# PDF on log-x: we plot density vs log10(t) so the curve integrates
# to 1 over log space (standard "log-density" presentation).
# ------------------------------------------------------------------
ax = axes[1]
# KDE on log10(ttft) so the resulting curve integrates to 1 over log10 t
kde_kvc_log = gaussian_kde(np.log10(kvc_ttft), bw_method="scott")
kde_dp_log = gaussian_kde(np.log10(dp_ttft), bw_method="scott")
log_x = np.linspace(np.log10(0.01), np.log10(10.0), 600)
x_full = 10 ** log_x
y_kvc = kde_kvc_log(log_x)
y_dp = kde_dp_log(log_x)
ax.plot(x_full, y_kvc, color=KVC_COLOR, lw=2.5, label=f"KVC 1P3D v2 (n={len(kvc_ttft)})")
ax.fill_between(x_full, y_kvc, alpha=0.20, color=KVC_COLOR)
ax.plot(x_full, y_dp, color=DP_COLOR, lw=2.5, label=f"4-way DP CA (n={len(dp_ttft)})")
ax.fill_between(x_full, y_dp, alpha=0.20, color=DP_COLOR)
ax.set_xscale("log")
ax.set_xlim(0.01, 10.0)
# Percentile markers
quartile_styles = [(0.50, "-", "p50"), (0.90, "--", "p90"), (0.99, ":", "p99")]
for q, ls, name in quartile_styles:
ax.axvline(pct(kvc_ttft, q), color=KVC_COLOR, ls=ls, alpha=0.55, lw=1.1)
ax.axvline(pct(dp_ttft, q), color=DP_COLOR, ls=ls, alpha=0.55, lw=1.1)
# Annotate p99 specifically since this is the key reviewer-targeted callout
ymax = max(y_kvc.max(), y_dp.max())
kvc_p99 = pct(kvc_ttft, 0.99)
dp_p99 = pct(dp_ttft, 0.99)
ax.annotate(f"KVC p99 = {kvc_p99:.2f}s\n(slow-path reseed tail)",
xy=(kvc_p99, kde_kvc_log(np.log10(kvc_p99))[0]),
xytext=(2.0, ymax * 0.65),
fontsize=10, color=KVC_COLOR, fontweight="bold",
arrowprops=dict(arrowstyle="->", color=KVC_COLOR, lw=1.0))
ax.annotate(f"DP p99 = {dp_p99*1000:.0f}ms",
xy=(dp_p99, kde_dp_log(np.log10(dp_p99))[0]),
xytext=(0.025, ymax * 0.80),
fontsize=10, color=DP_COLOR, fontweight="bold",
arrowprops=dict(arrowstyle="->", color=DP_COLOR, lw=1.0))
# Highlight the KVC bimodal structure
ax.annotate("KVC fast path\n(direct-to-D, 91.6%)",
xy=(0.05, y_kvc[np.argmin(np.abs(x_full - 0.05))]),
xytext=(0.012, ymax * 0.45),
fontsize=9, color=KVC_COLOR, style="italic",
arrowprops=dict(arrowstyle="->", color=KVC_COLOR, lw=0.7, alpha=0.6))
ax.annotate("KVC slow path\n(reseed, ~3.4%)",
xy=(2.5, y_kvc[np.argmin(np.abs(x_full - 2.5))]),
xytext=(3.0, ymax * 0.30),
fontsize=9, color=KVC_COLOR, style="italic",
arrowprops=dict(arrowstyle="->", color=KVC_COLOR, lw=0.7, alpha=0.6))
# Custom tick labels in seconds (instead of 10^-2, 10^-1, 10^0, 10^1)
ax.set_xticks([0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0])
ax.set_xticklabels(["10ms", "50ms", "100ms", "500ms", "1s", "5s", "10s"])
ax.set_xlabel("TTFT (log scale)", fontsize=11)
ax.set_ylabel("Density (per log₁₀ s)", fontsize=11)
ax.set_title("Full range (TTFT 10 ms 10 s, log x)", fontsize=12, pad=10)
ax.legend(loc="upper left", fontsize=10, framealpha=0.95)
ax.grid(True, which="both", linestyle=":", alpha=0.4)
ax.set_axisbelow(True)
fig.suptitle(
"TTFT probability density: KVC 1P3D v2 vs 4-way DP CA\n"
"SWE-Bench 50sess trace · ts=1 · 4× H100 80GB · aborted/error requests excluded",
fontsize=13, y=1.02,
)
plt.tight_layout()
plt.savefig(OUT, dpi=150, bbox_inches="tight")
print(f"wrote {OUT}")
plt.close(fig)
# ------------------------------------------------------------------
# Print summary stats for doc cross-reference
# ------------------------------------------------------------------
print(f"\n=== TTFT distribution summary ===")
for name, arr in [("KVC v2", kvc_ttft), ("DP 4w", dp_ttft)]:
print(f" {name} (n={len(arr)})")
print(f" min={arr.min()*1000:.1f}ms p10={pct(arr,0.10)*1000:.1f}ms "
f"p50={pct(arr,0.50)*1000:.1f}ms p90={pct(arr,0.90)*1000:.1f}ms "
f"p99={pct(arr,0.99)*1000:.1f}ms max={arr.max()*1000:.1f}ms")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,223 @@
#!/usr/bin/env python3
"""Generate the two figures referenced by docs/V2_DEEP_ANALYSIS_ZH.md §3.1 and §3.2.
Inputs:
outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl
outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl
Outputs:
docs/figures/v2_execution_mode_distribution.png (for §3.1)
docs/figures/v2_path_level_latency.png (for §3.2)
"""
from __future__ import annotations
import json
import statistics
from collections import Counter, defaultdict
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
ROOT = Path(__file__).resolve().parents[2]
KVC = ROOT / "outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl"
DP = ROOT / "outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl"
OUT = ROOT / "docs/figures"
OUT.mkdir(parents=True, exist_ok=True)
def load(p: Path) -> list[dict]:
return [json.loads(line) for line in p.open()]
def is_failed(r: dict) -> bool:
if r.get("error"):
return True
fr = r.get("finish_reason")
if fr and ("abort" in str(fr).lower() or "badrequest" in str(fr).lower()):
return True
return False
def pct(vals: list[float], q: float) -> float:
s = sorted(vals)
if not s:
return float("nan")
return s[max(0, min(len(s) - 1, int(len(s) * q)))]
def main() -> None:
kvc = load(KVC)
dp = load(DP)
kvc_ok = [r for r in kvc if not is_failed(r)]
dp_ok = [r for r in dp if not is_failed(r)]
# ------------------------------------------------------------------
# Figure 1: §3.1 execution_mode distribution (horizontal bar)
# Use ALL rows (incl. failures) so percentages match the doc's 91.6%
# ------------------------------------------------------------------
mode_counts = Counter(r["execution_mode"] for r in kvc)
total_kvc = len(kvc)
short_label = {
"kvcache-direct-to-d-session": "direct-to-D-session (fast path)",
"pd-router-d-session-reseed": "d-session-reseed (mooncake reseed)",
"pd-router-fallback-session-not-resident-session-cap":
"fallback: session-not-resident + session-cap",
"pd-router-fallback-session-not-resident-seed-filter-early-turn":
"fallback: session-not-resident + seed-filter",
"pd-router-turn1-seed": "turn1-seed (first turn of each session)",
"pd-router-fallback-no-d-capacity": "fallback: no-d-capacity",
"pd-router-fallback-real-large-append-session-cap":
"fallback: real-large-append",
"pd-router-fallback-policy-no-bypass-session-cap":
"fallback: policy-no-bypass",
"pd-router-d-session-reseed-after-eviction":
"d-session-reseed-after-eviction",
"kvcache-centric": "kvcache-centric (admit-but-then-error)",
}
sorted_modes = mode_counts.most_common()
labels = [short_label.get(m, m) for m, _ in sorted_modes]
counts = [c for _, c in sorted_modes]
pcts = [c / total_kvc * 100 for c in counts]
is_fast = ["direct-to-D" in lbl for lbl in labels]
colors = ["#2C8C2C" if f else "#D62728" for f in is_fast]
fig, ax = plt.subplots(figsize=(11, 5.5))
y = np.arange(len(labels))[::-1]
ax.barh(y, counts, color=colors, edgecolor="black", linewidth=0.5)
ax.set_yticks(y)
ax.set_yticklabels(labels, fontsize=10)
ax.set_xscale("log")
ax.set_xlabel("Request count (log scale)", fontsize=11)
ax.set_xlim(left=1)
# Annotate count + percentage at end of each bar
for yi, (c, p) in zip(y, zip(counts, pcts)):
ax.text(c * 1.05, yi, f"{c} ({p:.1f}%)",
va="center", fontsize=9.5)
ax.set_title(
f"KVC v2 execution_mode distribution (n = {total_kvc} total requests)\n"
"green = fast path (direct-to-D), red = slow / fallback / failure paths",
fontsize=12, pad=12,
)
ax.grid(axis="x", linestyle=":", alpha=0.4)
ax.set_axisbelow(True)
plt.tight_layout()
out1 = OUT / "v2_execution_mode_distribution.png"
plt.savefig(out1, dpi=150)
print(f"wrote {out1}")
plt.close(fig)
# ------------------------------------------------------------------
# Figure 2: §3.2 path-level latency (grouped bars, log y)
# ------------------------------------------------------------------
# Group KVC paths semantically
def kvc_group(mode: str) -> str:
if mode == "kvcache-direct-to-d-session":
return "KVC direct-to-D\n(fast path, 91.6%)"
if "reseed" in mode:
return "KVC reseed\n(slow path, 3.4%)"
if "no-d-capacity" in mode:
return "KVC no-d-capacity\n(fallback, 0.7%)"
if "session-not-resident" in mode:
return "KVC session-not-resident\n(misc, 2.3%)"
return "KVC other\n(<2%)"
groups = defaultdict(list)
for r in kvc_ok:
groups[kvc_group(r["execution_mode"])].append(r)
# Order paths by intuitive progression (fast → slow)
ordered_paths = [
"KVC direct-to-D\n(fast path, 91.6%)",
"KVC session-not-resident\n(misc, 2.3%)",
"KVC reseed\n(slow path, 3.4%)",
"KVC no-d-capacity\n(fallback, 0.7%)",
]
# Filter to only ones present
ordered_paths = [p for p in ordered_paths if p in groups]
ordered_paths.append("DP dp-colo-router\n(100%)")
def stats(rows: list[dict]) -> dict[str, float]:
ttfts = [r["ttft_s"] for r in rows if r.get("ttft_s") is not None]
lats = [r["latency_s"] for r in rows if r.get("latency_s") is not None]
return {
"n": len(rows),
"ttft_p50": pct(ttfts, 0.50),
"ttft_p99": pct(ttfts, 0.99),
"lat_p50": pct(lats, 0.50),
}
path_stats = {p: stats(groups[p]) for p in ordered_paths if "DP" not in p}
path_stats["DP dp-colo-router\n(100%)"] = stats(dp_ok)
metrics = [("TTFT p50", "ttft_p50"), ("TTFT p99", "ttft_p99"), ("Latency p50", "lat_p50")]
bar_w = 0.25
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(ordered_paths))
colors_metric = ["#1F77B4", "#FF7F0E", "#9467BD"]
for i, (label, key) in enumerate(metrics):
vals = [path_stats[p][key] for p in ordered_paths]
bars = ax.bar(x + (i - 1) * bar_w, vals, bar_w, label=label,
color=colors_metric[i], edgecolor="black", linewidth=0.4)
for xi, v in zip(x + (i - 1) * bar_w, vals):
if v > 0 and v == v: # not nan
fmt = f"{v*1000:.0f}ms" if v < 1 else f"{v:.2f}s"
ax.text(xi, v * 1.10, fmt,
ha="center", va="bottom", fontsize=8.5, rotation=0)
ax.set_yscale("log")
ax.set_xticks(x)
ax.set_xticklabels(ordered_paths, fontsize=9.5)
ax.set_ylabel("Latency (seconds, log scale)", fontsize=11)
ax.set_title(
"Path-level latency: KVC v2 paths vs DP single-path baseline\n"
"log y-axis · same SWE-Bench 50sess trace · ts=1 · 4× H100 80GB",
fontsize=12, pad=12,
)
ax.legend(loc="upper left", fontsize=10, framealpha=0.95)
ax.grid(axis="y", linestyle=":", alpha=0.4, which="both")
ax.set_axisbelow(True)
# Annotate sample counts under each path label
ymin = ax.get_ylim()[0]
for xi, p in zip(x, ordered_paths):
n = path_stats[p]["n"]
ax.text(xi, ymin * 0.5, f"n={n}", ha="center", va="top",
fontsize=8.5, color="#555")
plt.tight_layout()
out2 = OUT / "v2_path_level_latency.png"
plt.savefig(out2, dpi=150)
print(f"wrote {out2}")
plt.close(fig)
# ------------------------------------------------------------------
# Print numeric values used (for doc reference)
# ------------------------------------------------------------------
print("\n=== Numeric values plotted ===")
print("\nExecution mode counts (KVC v2):")
for label, c, p in zip(labels, counts, pcts):
print(f" {c:>5} ({p:>5.2f}%) {label}")
print("\nPath-level latency:")
for p in ordered_paths:
s = path_stats[p]
nl = " | ".join([
f"n={s['n']}",
f"TTFT p50={s['ttft_p50']*1000:.1f}ms",
f"TTFT p99={s['ttft_p99']*1000:.1f}ms",
f"Lat p50={s['lat_p50']:.3f}s",
])
print(f" {p.replace(chr(10), ' '):<55} {nl}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,105 @@
#!/usr/bin/env python3
"""Re-derive summary.json from existing metrics.jsonl using the fixed metrics.py.
Bug fixed: requests aborted by SGLang (e.g. input > max-input-len returns
a fast 400 with latency_s ~ 0.08s) were previously counted in latency_stats
as if successful, deflating mean/p50/p90. The fixed metrics.py excludes
all failed requests (errors or aborts) from latency/ttft/tpot stats and
exposes abort_count / failure_count.
Usage:
python3 scripts/analysis/recompute_summary.py path/to/metrics.jsonl ...
python3 scripts/analysis/recompute_summary.py --diff path/to/metrics.jsonl path/to/old_summary.json
"""
from __future__ import annotations
import argparse
import json
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[2] / "src"))
from agentic_pd_hybrid.metrics import RequestMetrics, write_summary_json
def load_rows(metrics_path: Path) -> list[RequestMetrics]:
rows = []
field_names = {f for f in RequestMetrics.__dataclass_fields__}
with metrics_path.open() as handle:
for line in handle:
line = line.strip()
if not line:
continue
raw = json.loads(line)
kwargs = {k: raw.get(k) for k in field_names}
rows.append(RequestMetrics(**kwargs))
return rows
def main() -> None:
parser = argparse.ArgumentParser()
parser.add_argument("metrics_paths", nargs="+", type=Path)
parser.add_argument(
"--out",
type=Path,
default=None,
help="output summary path (default: alongside metrics with .recomputed_summary.json)",
)
parser.add_argument(
"--diff",
action="store_true",
help="print before/after diff against the old <metrics>.summary.json",
)
args = parser.parse_args()
for metrics_path in args.metrics_paths:
rows = load_rows(metrics_path)
out_path = args.out or metrics_path.with_suffix(".recomputed_summary.json")
write_summary_json(
out_path,
rows,
trace_path=metrics_path,
router_url=None,
)
new = json.load(out_path.open())
print(f"\n=== {metrics_path} ===")
print(f" written: {out_path}")
print(f" total rows: {new['request_count']}")
print(f" error_count: {new['error_count']}")
print(f" abort_count: {new.get('abort_count', '?')}")
print(f" failure_count: {new.get('failure_count', '?')}")
ls = new.get("latency_stats_s", {}) or {}
ts = new.get("ttft_stats_s", {}) or {}
print(f" lat: n={ls.get('count')} mean={ls.get('mean'):.4f} p50={ls.get('p50'):.4f} p90={ls.get('p90'):.4f} p99={ls.get('p99'):.4f}")
print(f" ttft: n={ts.get('count')} mean={ts.get('mean'):.4f} p50={ts.get('p50'):.4f} p90={ts.get('p90'):.4f} p99={ts.get('p99'):.4f}")
if args.diff:
# find old summary (sibling file)
candidates = [
metrics_path.parent / f"{metrics_path.stem}.summary.json",
metrics_path.with_suffix(".summary.json"),
]
old_path = next((p for p in candidates if p.exists()), None)
if old_path:
old = json.load(old_path.open())
print(f" vs old {old_path}:")
old_ls = old.get("latency_stats_s", {}) or {}
old_ts = old.get("ttft_stats_s", {}) or {}
for k in ("count", "mean", "p50", "p90", "p99"):
o = old_ls.get(k)
n = ls.get(k)
if o is not None and n is not None:
delta = n - o
print(f" lat.{k}: {o:.4f} -> {n:.4f} ({delta:+.4f})")
for k in ("count", "mean", "p50", "p90", "p99"):
o = old_ts.get(k)
n = ts.get(k)
if o is not None and n is not None:
delta = n - o
print(f" ttft.{k}: {o:.4f} -> {n:.4f} ({delta:+.4f})")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,110 @@
#!/usr/bin/env python3
"""Convert sibench audit.jsonl to agentic-pd-hybrid trace format.
Source format (sibench audit.jsonl):
{"instance_id": "...", "ts": float, "messages": [...],
"audit": {"prompt_tokens": int, "completion_tokens": int, ...}}
Target format (agentic-pd-hybrid trace JSONL):
{"chat_id": int, "parent_chat_id": int, "timestamp": float,
"turn": int, "input_length": int, "output_length": int,
"type": str, "hash_ids": [int, ...]}
"""
import json
import sys
from collections import defaultdict
from pathlib import Path
BLOCK_TOKEN_BUDGET = 24 # tokens per block, matching trace.py default
def convert(src: Path, dst: Path) -> None:
# Group lines by instance_id, preserving order within each instance
instances: dict[str, list[dict]] = defaultdict(list)
with src.open() as f:
for line in f:
line = line.strip()
if not line:
continue
rec = json.loads(line)
instances[rec["instance_id"]].append(rec)
# Sort each instance's turns by timestamp
for iid in instances:
instances[iid].sort(key=lambda r: r["ts"])
# Assign stable chat_id bases: each instance gets a block of IDs
# Max turns across all instances determines the spacing
max_turns = max(len(turns) for turns in instances.values())
spacing = max_turns + 10 # extra headroom
total_written = 0
with dst.open("w") as out:
for inst_idx, (iid, turns) in enumerate(instances.items()):
base_chat_id = (inst_idx + 1) * spacing # start from spacing to avoid 0
# Track cumulative hash_ids for prefix cache simulation
cumulative_hash_ids: list[int] = []
global_block_counter = inst_idx * 100_000 # unique block namespace per instance
for turn_idx, rec in enumerate(turns):
audit = rec.get("audit", {})
input_length = audit.get("prompt_tokens", 0)
output_length = audit.get("completion_tokens", 0)
if input_length <= 0:
# Fallback: estimate from message content
total_chars = sum(len(m.get("content", "")) for m in rec.get("messages", []))
input_length = max(1, total_chars // 4)
if output_length <= 0:
output_length = 128 # reasonable default
chat_id = base_chat_id + turn_idx
if turn_idx == 0:
parent_chat_id = -1
else:
parent_chat_id = base_chat_id + turn_idx - 1
# Build hash_ids: for turn 0, generate blocks for full input
# For turn N>0, keep previous blocks and add new ones for the delta
if turn_idx == 0:
num_blocks = input_length // BLOCK_TOKEN_BUDGET
cumulative_hash_ids = list(
range(global_block_counter, global_block_counter + num_blocks)
)
global_block_counter += num_blocks
else:
# The new input is the full prompt (cumulative), so the delta
# is the new tokens beyond what was in the previous turn's prompt
prev_input = audit.get("prompt_tokens", 0)
prev_rec_audit = turns[turn_idx - 1].get("audit", {})
prev_input_length = prev_rec_audit.get("prompt_tokens", 0)
delta = max(0, prev_input - prev_input_length) if prev_input_length > 0 else 0
new_blocks = delta // BLOCK_TOKEN_BUDGET
new_ids = list(
range(global_block_counter, global_block_counter + new_blocks)
)
global_block_counter += new_blocks
cumulative_hash_ids = cumulative_hash_ids + new_ids
trace_line = {
"chat_id": chat_id,
"parent_chat_id": parent_chat_id,
"timestamp": rec["ts"],
"turn": turn_idx,
"input_length": input_length,
"output_length": output_length,
"type": "chat",
"hash_ids": cumulative_hash_ids,
}
out.write(json.dumps(trace_line, separators=(",", ":")) + "\n")
total_written += 1
print(f"Converted {total_written} lines from {len(instances)} instances -> {dst}")
if __name__ == "__main__":
if len(sys.argv) != 3:
print(f"Usage: {sys.argv[0]} <input_audit.jsonl> <output_trace.jsonl>")
sys.exit(1)
convert(Path(sys.argv[1]), Path(sys.argv[2]))

73
scripts/run_all_experiments.sh Executable file
View File

@@ -0,0 +1,73 @@
#!/bin/bash
# Run all 3 PD hybrid experiments sequentially
# Uses 52 sessions / 4,449 requests (10% sample of 497 sessions)
# Each experiment takes ~30-40 min
set -euo pipefail
cd "$(dirname "$0")/.."
TRACE="outputs/qwen35-swebench-50sess.jsonl"
MODEL="/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B"
OUTPUT="outputs/swebench-exps"
echo "=== Experiment A: pd-disaggregation ==="
uv run agentic-pd-hybrid benchmark-live \
--trace "$TRACE" \
--output-root "$OUTPUT" \
--mechanism pd-disaggregation \
--policy default \
--model-path "$MODEL" \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300
echo "=== Experiment B: pd-colo ==="
uv run agentic-pd-hybrid benchmark-live \
--trace "$TRACE" \
--output-root "$OUTPUT" \
--mechanism pd-colo \
--policy default \
--model-path "$MODEL" \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 2 --direct-tp-size 4 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300
echo "=== Experiment C: kvcache-centric ==="
uv run agentic-pd-hybrid benchmark-live \
--trace "$TRACE" \
--output-root "$OUTPUT" \
--mechanism kvcache-centric \
--policy default \
--model-path "$MODEL" \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 2 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
echo "=== All experiments complete ==="

24
scripts/run_exp_a_pd_disagg.sh Executable file
View File

@@ -0,0 +1,24 @@
#!/bin/bash
# Experiment A: pd-disaggregation baseline
# 1P(GPU 0-3) + 1D(GPU 4-7), TP4, mooncake TCP
# Full 39K trace from SWE-Bench 500 instances
set -euo pipefail
cd "$(dirname "$0")/.."
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-swebench-500.jsonl \
--output-root outputs/swebench-exps \
--mechanism pd-disaggregation \
--policy default \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 64 \
--timeout-s 900 \
--request-timeout-s 300

View File

@@ -0,0 +1,23 @@
#!/bin/bash
# Experiment B1: Naive DP colocation — round-robin policy
# 2 direct workers (GPU 0-3, 4-7), TP4, DP router with round-robin
# No disaggregation — each worker does prefill+decode locally
set -euo pipefail
cd "$(dirname "$0")/.."
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-swebench-50sess.jsonl \
--output-root outputs/swebench-exps \
--mechanism pd-colo \
--policy default \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 2 --direct-tp-size 4 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300

View File

@@ -0,0 +1,23 @@
#!/bin/bash
# Experiment B2: Naive DP colocation — cache-aware (kv-aware) policy
# 2 direct workers (GPU 0-3, 4-7), TP4, DP router with consistent-hashing
# Replay kv-aware policy picks the worker with most prefix overlap
set -euo pipefail
cd "$(dirname "$0")/.."
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-swebench-50sess.jsonl \
--output-root outputs/swebench-exps \
--mechanism pd-colo \
--policy kv-aware \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 2 --direct-tp-size 4 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300

24
scripts/run_exp_b_pd_colo.sh Executable file
View File

@@ -0,0 +1,24 @@
#!/bin/bash
# Experiment B: pd-colo (direct/colocation)
# 2 direct workers (GPU 0-3, 4-7), TP4, no router
# Full 39K trace from SWE-Bench 500 instances
set -euo pipefail
cd "$(dirname "$0")/.."
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-swebench-500.jsonl \
--output-root outputs/swebench-exps \
--mechanism pd-colo \
--policy default \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 2 --direct-tp-size 4 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 64 \
--timeout-s 900 \
--request-timeout-s 300

View File

@@ -0,0 +1,28 @@
#!/bin/bash
# Experiment C: kvcache-centric (session-aware PD)
# 1P(GPU 0-3) + 1D(GPU 4-7), TP4, mooncake TCP
# Full 39K trace from SWE-Bench 500 instances
set -euo pipefail
cd "$(dirname "$0")/.."
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-swebench-500.jsonl \
--output-root outputs/swebench-exps \
--mechanism kvcache-centric \
--policy default \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 64 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 2 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction

30
scripts/smoke_test.sh Executable file
View File

@@ -0,0 +1,30 @@
#!/bin/bash
# Smoke test: pd-disaggregation with mooncake TCP, 100 requests
set -euo pipefail
cd "$(dirname "$0")/.."
# Sample a small trace for smoke testing
uv run agentic-pd-hybrid sample-sessions \
--trace outputs/qwen35-swebench-500.jsonl \
--output outputs/qwen35-smoke-3sess.jsonl \
--session-sample-rate 0.02 \
--min-turns 5 \
--target-duration-s 300 \
--max-requests 100
# Run smoke test
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/qwen35-smoke-3sess.jsonl \
--output-root outputs/smoke \
--mechanism pd-disaggregation \
--policy default \
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300

View File

@@ -0,0 +1,114 @@
#!/usr/bin/env bash
# Smoke sweep: validate backpressure code change on top of v5 Option D config.
# Designed to fit in ~3-4h GPU budget (4 runs × ~30-60 min).
#
# Usage:
# bash scripts/sweep_backpressure_smoke.sh
#
# Prerequisites: GPUs available; trace at outputs/qwen35-swebench-50sess.jsonl;
# model at $MODEL_PATH (default Qwen3-30B-A3B-Instruct-2507).
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "$REPO_ROOT"
OUT_ROOT=${OUT_ROOT:-outputs/sweep_backpressure_smoke}
TRACE=${TRACE:-outputs/qwen35-swebench-50sess.jsonl}
MODEL=${MODEL:-/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507}
mkdir -p "$OUT_ROOT"
LOG="$OUT_ROOT/sweep.log"
echo "[$(date '+%F %T')] Starting backpressure smoke sweep" | tee -a "$LOG"
echo " Trace: $TRACE" | tee -a "$LOG"
echo " Model: $MODEL" | tee -a "$LOG"
echo " Output root: $OUT_ROOT" | tee -a "$LOG"
KVC_COMMON_ARGS=(
--trace "$TRACE"
--model "$MODEL"
--mechanism kvcache-centric
--policy kv-aware
--kvcache-admission-mode worker
--kvcache-seed-min-turn-id 1
--kvcache-seed-max-inflight-decode -1
--kvcache-prefill-backup-policy release-after-transfer
--kvcache-prefill-priority-eviction
--prefill-workers 2
--decode-workers 6
--prefill-gpu-ids 0,1
--decode-gpu-ids 2,3,4,5,6,7
--transfer-backend mooncake
--target-duration-s 2000
--session-sample-rate 1.0
--min-turns 2
--concurrency-limit 32
)
DP_COMMON_ARGS=(
--trace "$TRACE"
--model "$MODEL"
--mechanism pd-colo
--policy kv-aware
--direct-workers 8
--direct-gpu-ids 0,1,2,3,4,5,6,7
--transfer-backend mooncake
--target-duration-s 2000
--session-sample-rate 1.0
--min-turns 2
--concurrency-limit 32
)
run_kvc_baseline_ts10() {
local out="$OUT_ROOT/E1_kvc_baseline_ts10"
echo "[$(date '+%F %T')] === E1: KVC baseline (no backpressure) time-scale=10 ===" | tee -a "$LOG"
python -m agentic_pd_hybrid.cli benchmark-live \
"${KVC_COMMON_ARGS[@]}" \
--output-root "$out" \
--time-scale 10 \
2>&1 | tee -a "$LOG"
}
run_kvc_backpressure_ts10() {
local out="$OUT_ROOT/E2_kvc_backpressure_ts10"
echo "[$(date '+%F %T')] === E2: KVC + backpressure ON, time-scale=10 ===" | tee -a "$LOG"
python -m agentic_pd_hybrid.cli benchmark-live \
"${KVC_COMMON_ARGS[@]}" \
--output-root "$out" \
--time-scale 10 \
--enable-backpressure \
--backpressure-max-pause-s 2.0 \
2>&1 | tee -a "$LOG"
}
run_kvc_backpressure_ts1() {
local out="$OUT_ROOT/E3_kvc_backpressure_ts1_short"
echo "[$(date '+%F %T')] === E3: KVC + backpressure ON, time-scale=1, FIRST 1000 reqs ===" | tee -a "$LOG"
python -m agentic_pd_hybrid.cli benchmark-live \
"${KVC_COMMON_ARGS[@]}" \
--output-root "$out" \
--time-scale 1 \
--enable-backpressure \
--backpressure-max-pause-s 2.0 \
--target-duration-s 1800 \
2>&1 | tee -a "$LOG"
}
run_dp_baseline_ts1() {
local out="$OUT_ROOT/E4_dp_ts1_short"
echo "[$(date '+%F %T')] === E4: 8-way DP cache-aware, time-scale=1, FIRST 1000 reqs ===" | tee -a "$LOG"
python -m agentic_pd_hybrid.cli benchmark-live \
"${DP_COMMON_ARGS[@]}" \
--output-root "$out" \
--time-scale 1 \
--target-duration-s 1800 \
2>&1 | tee -a "$LOG"
}
# Sequence — add/remove as fits the budget.
run_kvc_baseline_ts10
run_kvc_backpressure_ts10
run_kvc_backpressure_ts1
run_dp_baseline_ts1
echo "[$(date '+%F %T')] === sweep DONE ===" | tee -a "$LOG"
echo "Run analysis with: python scripts/analysis/analyze_backpressure_smoke.py $OUT_ROOT" | tee -a "$LOG"

60
scripts/sweep_kvc_qwen3_30b.sh Executable file
View File

@@ -0,0 +1,60 @@
#!/bin/bash
# KVC admission control parameter sweep on Qwen3-30B
# 5 experiments, ~35 min each, ~3 hours total
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-exps
VENV_PYTHON=.venv/bin/python
run_kvc() {
local label=$1
local inflight=$2
local min_turn=$3
echo "=== [$label] inflight=$inflight min_turn=$min_turn === $(date)"
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy default \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 1 \
--prefill-tp-size 4 --decode-tp-size 4 \
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id $min_turn \
--kvcache-seed-max-inflight-decode $inflight \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
echo "=== [$label] DONE === $(date)"
echo ""
}
# C1: inflight=8, min-turn=2
run_kvc "C1" 8 2
# C2: inflight=16, min-turn=2
run_kvc "C2" 16 2
# C3: inflight=-1 (disabled), min-turn=2
run_kvc "C3" -1 2
# C4: inflight=8, min-turn=1
run_kvc "C4" 8 1
# C5: inflight=-1 (disabled), min-turn=1
run_kvc "C5" -1 1
echo "=== ALL SWEEP EXPERIMENTS DONE === $(date)"

133
scripts/sweep_tp1_configs.sh Executable file
View File

@@ -0,0 +1,133 @@
#!/bin/bash
# TP1 configuration sweep: 8-way DP, 1P7D KVC, 2P6D KVC
# Qwen3-30B-A3B TP=1, single GPU per worker
# Most aggressive KVC admission: inflight=-1 (off), seed-min-turn=1
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-exps
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
# Also copy summary to a named file for easy access
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
log "Saved to $OUTPUT/${label}_summary.json"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 configuration sweep"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
########################################
# Experiment 1: 8-way DP cache-aware
########################################
log ""
log "=== [EXP1] 8-way DP cache-aware (8 direct × TP1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism pd-colo \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 8 --direct-tp-size 1 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300
# Find latest run dir for this experiment
EXP1_DIR=$(ls -td $OUTPUT/pd-colo-kv-aware-*/ 2>/dev/null | head -1)
save_result "exp1_8way_dp_cache_aware" "$EXP1_DIR"
########################################
# Experiment 2: 1P + 7D KVC (most aggressive)
########################################
log ""
log "=== [EXP2] 1P7D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy default \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_1p7d_kvc_aggressive" "$EXP2_DIR"
########################################
# Experiment 3: 2P + 6D KVC (most aggressive)
########################################
log ""
log "=== [EXP3] 2P6D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy default \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP3_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp3_2p6d_kvc_aggressive" "$EXP3_DIR"
########################################
log ""
log "=== ALL TP1 SWEEP EXPERIMENTS DONE ==="

131
scripts/sweep_tp1_v2_fixed.sh Executable file
View File

@@ -0,0 +1,131 @@
#!/bin/bash
# TP1 configuration sweep v2 — after session_params fix + audit fields
# Qwen3-30B-A3B TP=1, single GPU per worker
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v2-fixed
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 v2 sweep (session_params fix + audit fields)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
########################################
# Experiment 1: 8-way DP cache-aware
########################################
log ""
log "=== [EXP1] 8-way DP cache-aware (8 direct × TP1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism pd-colo \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 8 --direct-tp-size 1 \
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300
EXP1_DIR=$(ls -td $OUTPUT/pd-colo-kv-aware-*/ 2>/dev/null | head -1)
save_result "exp1_8way_dp_cache_aware" "$EXP1_DIR"
########################################
# Experiment 2: 1P + 7D KVC (aggressive)
########################################
log ""
log "=== [EXP2] 1P7D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy default \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_1p7d_kvc_aggressive" "$EXP2_DIR"
########################################
# Experiment 3: 2P + 6D KVC (aggressive)
########################################
log ""
log "=== [EXP3] 2P6D KVC (inflight=off, min-turn=1) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy default \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP3_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp3_2p6d_kvc_aggressive" "$EXP3_DIR"
########################################
log ""
log "=== ALL TP1 V2 SWEEP EXPERIMENTS DONE ==="

108
scripts/sweep_tp1_v3_kvaware.sh Executable file
View File

@@ -0,0 +1,108 @@
#!/bin/bash
# TP1 v3 sweep — KVC with kv-aware policy (fix routing mismatch)
# v2 used --policy default for KVC experiments, causing session routing
# mismatch: replay round-robin ≠ router round-robin → "session not found".
# v3 uses --policy kv-aware for KVC to ensure session affinity.
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v3-kvaware
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 v3 sweep (KVC with kv-aware policy)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Key change: --policy kv-aware for KVC (was --policy default in v2)"
########################################
# Experiment 1: 1P + 7D KVC kv-aware
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp1_1p7d_kvc_kvaware" "$EXP1_DIR"
########################################
# Experiment 2: 2P + 6D KVC kv-aware
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_2p6d_kvc_kvaware" "$EXP2_DIR"
########################################
log ""
log "=== ALL TP1 V3 SWEEP EXPERIMENTS DONE ==="

108
scripts/sweep_tp1_v4_cap16.sh Executable file
View File

@@ -0,0 +1,108 @@
#!/bin/bash
# TP1 v4 sweep — KVC with kv-aware policy + soft_cap raised from 4 to 16
# v3 (kv-aware) fixed routing but session-cap fallback still dominated 52-65%
# of requests. Hardcoded min(4, ...) in _decode_session_soft_cap was the
# bottleneck — only 4*7=28 session slots for 52 trace sessions.
# v4 raises the cap to 16 (4*7=28 -> 16*7=112 slots).
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v4-cap16
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 v4 sweep (KVC kv-aware, session soft_cap raised 4->16)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Key change: _decode_session_soft_cap now min(16, ...) instead of min(4, ...)"
########################################
# Experiment 1: 1P + 7D KVC kv-aware (cap=16)
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware cap=16 ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp1_1p7d_kvc_cap16" "$EXP1_DIR"
########################################
# Experiment 2: 2P + 6D KVC kv-aware (cap=16)
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware cap=16 ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_2p6d_kvc_cap16" "$EXP2_DIR"
log ""
log "=== ALL TP1 V4 SWEEP EXPERIMENTS DONE ==="

View File

@@ -0,0 +1,89 @@
#!/bin/bash
# P0: Re-run v5 baseline EXP2 (2P6D) three times to establish whether
# errors=9 is a stable property of the v5 config or single-run variance.
# Critic of V5_PROFILE_INVESTIGATION_ZH.md flagged that the 415 errors in
# v5+profile EXP2 may have been polling-induced. We need 3 baseline runs
# (no polling, identical config to original v5) to test reproducibility.
#
# Output:
# outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
# ├── exp2_2p6d_run{1,2,3}_summary.json
# ├── exp2_2p6d_run{1,2,3}_metrics.jsonl
# └── kvcache-centric-...<ts>/ (one per run)
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v5-optD-baseline-rerun
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
run_exp2() {
local run_idx=$1
local label="exp2_2p6d_run${run_idx}"
log ""
log "=== [RUN ${run_idx}/3] EXP2 2P6D KVC kv-aware Option D (no polling) ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
local run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
log "=== [RUN ${run_idx}/3] $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
local errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
log " errors = $errs (baseline reference = 9)"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
else
log "WARNING: no summary file in $run_dir"
fi
}
log "=== P0: v5 baseline EXP2 reproducibility test (3 runs, no polling) ==="
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Goal: confirm whether errors=9 in v5 baseline EXP2 is reproducible"
log " (v5+profile saw 415 errors; we need to know if polling was causal)"
for i in 1 2 3; do
run_exp2 $i
done
log ""
log "=== P0 SUMMARY: errors per run ==="
for i in 1 2 3; do
if [ -f "$OUTPUT/exp2_2p6d_run${i}_summary.json" ]; then
e=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/exp2_2p6d_run${i}_summary.json')); print(d.get('error_count',0))")
log " run ${i}: errors = $e"
fi
done
log "=== P0 ALL DONE ==="

114
scripts/sweep_tp1_v5_optD.sh Executable file
View File

@@ -0,0 +1,114 @@
#!/bin/bash
# TP1 v5 sweep — Option D: D-side admission for seed/reseed.
#
# v4 (cap=16) still saw 35% session-cap fallback because the local soft_cap
# evaluates min(16, usable_capacity_tokens / target_tokens) and target_tokens
# (= input + output) is 50-100K in agentic workloads, giving cap = 1-2.
#
# v5 makes worker admission_mode authoritative for ALL admission decisions
# (direct_append AND seed/reseed). Replay calls D's
# /session_cache/admit_direct_append with mode={direct_append|seed} and
# defers to D's KV pool availability + LRU eviction. Replay's local
# _decode_session_soft_cap is bypassed entirely under worker mode.
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v5-optD
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 v5 sweep (Option D: D-side seed admission)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Key change: worker admission_mode now drives seed/reseed via D's admit endpoint"
########################################
# Experiment 1: 1P + 7D KVC kv-aware Option D
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware Option D ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp1_1p7d_kvc_optD" "$EXP1_DIR"
########################################
# Experiment 2: 2P + 6D KVC kv-aware Option D
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware Option D ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_2p6d_kvc_optD" "$EXP2_DIR"
log ""
log "=== ALL TP1 V5 SWEEP EXPERIMENTS DONE ==="

View File

@@ -0,0 +1,125 @@
#!/bin/bash
# TP1 v5 + profiling — re-run the v5 (Option D) config with the new
# d-pool-timeseries poller enabled, so we can attribute each session-cap
# fallback to actual D KV pool occupancy (held vs available vs idle-evictable
# vs prefill-backup) instead of guessing.
#
# Output:
# outputs/qwen3-30b-tp1-v5-optD-profile/
# ├── kvcache-centric-kv-aware-worker-admission-<ts>/
# │ ├── request-metrics.jsonl
# │ ├── request-metrics.jsonl.summary.json
# │ └── d-pool-timeseries.jsonl ← NEW (1Hz P/D /server_info snapshots)
# ├── exp1_1p7d_kvc_optD_profile_metrics.jsonl
# └── exp2_2p6d_kvc_optD_profile_metrics.jsonl
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v5-optD-profile
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
POLL_INTERVAL=1.0
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
if [ -f "$run_dir/d-pool-timeseries.jsonl" ]; then
cp "$run_dir/d-pool-timeseries.jsonl" "$OUTPUT/${label}_pool_timeseries.jsonl"
log "Pool timeseries: $(wc -l < $OUTPUT/${label}_pool_timeseries.jsonl) rows"
else
log "WARNING: no d-pool-timeseries.jsonl produced"
fi
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl + ${label}_pool_timeseries.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting TP1 v5 + profile sweep (Option D + ${POLL_INTERVAL}s pool polling)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Profiling: --pool-poll-interval-s $POLL_INTERVAL (writes d-pool-timeseries.jsonl)"
########################################
# Experiment 1: 1P + 7D KVC kv-aware Option D + profile
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware Option D + profile ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--pool-poll-interval-s $POLL_INTERVAL
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp1_1p7d_kvc_optD_profile" "$EXP1_DIR"
########################################
# Experiment 2: 2P + 6D KVC kv-aware Option D + profile
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware Option D + profile ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--pool-poll-interval-s $POLL_INTERVAL
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_2p6d_kvc_optD_profile" "$EXP2_DIR"
log ""
log "=== ALL TP1 V5+PROFILE EXPERIMENTS DONE ==="

View File

@@ -0,0 +1,129 @@
#!/bin/bash
# v6 P1: re-run the v5 (Option D) config with the pool_breakdown instrument
# (commit 4978c0d) so d-pool-timeseries.jsonl carries radix_protected /
# slot_private / running_batch / {transfer,prealloc,retracted}_queue tokens.
#
# This is the same config as scripts/sweep_tp1_v5_optD_profile.sh but writes
# to a separate output dir, leaving the pre-instrument v5+profile run intact
# for before/after comparison.
#
# Output:
# outputs/qwen3-30b-tp1-v6-p1-profile/
# ├── kvcache-centric-kv-aware-worker-admission-<ts>/
# │ ├── request-metrics.jsonl
# │ ├── request-metrics.jsonl.summary.json
# │ └── d-pool-timeseries.jsonl ← now with pool_breakdown fields
# ├── exp{1,2}_*_metrics.jsonl
# └── exp{1,2}_*_pool_timeseries.jsonl
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-v6-p1-profile
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
POLL_INTERVAL=1.0
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
save_result() {
local label=$1
local run_dir=$2
log "=== $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
log "Summary:"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
if [ -f "$run_dir/d-pool-timeseries.jsonl" ]; then
cp "$run_dir/d-pool-timeseries.jsonl" "$OUTPUT/${label}_pool_timeseries.jsonl"
log "Pool timeseries: $(wc -l < $OUTPUT/${label}_pool_timeseries.jsonl) rows"
else
log "WARNING: no d-pool-timeseries.jsonl produced"
fi
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl + ${label}_pool_timeseries.jsonl"
else
log "WARNING: No summary file found in $run_dir"
fi
}
log "Starting v6 P1 sweep (v5 Option D config + ${POLL_INTERVAL}s pool polling + pool_breakdown)"
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Goal: capture pool_breakdown fields (radix_protected / slot_private / running_batch / queues)"
log " to decompose 'other' on the v5 baseline workload"
########################################
# Experiment 1: 1P + 7D KVC kv-aware Option D + profile
########################################
log ""
log "=== [EXP1] 1P7D KVC kv-aware Option D + profile ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 7 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--pool-poll-interval-s $POLL_INTERVAL
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp1_1p7d_kvc_v6_p1" "$EXP1_DIR"
########################################
# Experiment 2: 2P + 6D KVC kv-aware Option D + profile
########################################
log ""
log "=== [EXP2] 2P6D KVC kv-aware Option D + profile ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 2 --decode-workers 6 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
--transfer-backend mooncake \
--gpu-budget 8 \
--time-scale 10 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--pool-poll-interval-s $POLL_INTERVAL
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
save_result "exp2_2p6d_kvc_v6_p1" "$EXP2_DIR"
log ""
log "=== ALL v6 P1 EXPERIMENTS DONE ==="

View File

@@ -0,0 +1,146 @@
#!/bin/bash
# Time-scale=1 validation sweep, downscaled to 4 GPUs:
# - KVC v5 1P3D × N=3 (new data, validates §1/§2 structural claims at real timing)
# - 4-way DP cache-aware × 1 (sanity baseline at same scale + ts=1)
#
# Goal: per docs/AGENTIC_FIT_ANALYSIS_ZH.md §7 / TEAM_REPORT §2.6 — all v3-v6 KVC
# data was at time-scale=10 (inter-turn gap p50 = 0.25s, vs real 2.5s). This run
# tests whether the gap structurally reverses any conclusion.
#
# CONFIG NOTE: Original experiments used 8 GPUs (2P6D / 8-way DP). This host has
# only 4 H100s available, so we downscale proportionally to 1P3D / 4-way DP.
# Cross-compare against existing 2P6D ts=10 data is confounded by *both*
# time-scale and capacity. Internal comparison (1P3D KVC vs 4DP) at ts=1 is the
# clean signal. §5 (P-side imbalance) is NOT testable here — only 1 P.
#
# Capacity ratio: 3D × ~92K tok = 276K KV pool vs 52 sessions × ~50K peak input
# working set ≈ 1.5M → ~5.4× overload (vs 2.7× in original 2P6D).
# Pressure is HIGHER than original; partly offset by ts=1 letting D drain between turns.
#
# Output:
# outputs/qwen3-30b-tp1-ts1-validation/
# ├── kvc_1p3d_run{1,2,3}_summary.json
# ├── kvc_1p3d_run{1,2,3}_metrics.jsonl
# ├── dp4_summary.json
# ├── dp4_metrics.jsonl
# └── kvcache-centric-... / pd-colo-kv-aware-... (raw run dirs)
#
# Estimated GPU time: KVC ts=1 ≈ 100-180 min/run × 3 = 5-9h
# DP ts=1 ≈ 100-120 min × 1 = ~2h
# Total = 7-11h
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-ts1-validation
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
}
run_kvc_1p3d() {
local run_idx=$1
local label="kvc_1p3d_run${run_idx}"
log ""
log "=== [KVC ${run_idx}/3] 1P3D KVC kv-aware Option D, time-scale=1 ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 3 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
--transfer-backend mooncake \
--gpu-budget 4 \
--time-scale 1 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction
local run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
log "=== [KVC ${run_idx}/3] $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
local errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
log " errors = $errs"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
else
log "WARNING: no summary file in $run_dir"
fi
}
run_dp4_sanity() {
local label="dp4"
log ""
log "=== [DP] 4-way DP cache-aware sanity, time-scale=1 ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism pd-colo \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 0 --decode-workers 0 \
--direct-workers 4 --direct-tp-size 1 \
--direct-gpu-ids 0,1,2,3 \
--gpu-budget 4 \
--time-scale 1 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300
local run_dir=$(ls -td $OUTPUT/pd-colo-kv-aware-*/ 2>/dev/null | head -1)
log "=== [DP] $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
local errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
log " errors = $errs"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
echo "" >> $RESULTS_FILE
else
log "WARNING: no summary file in $run_dir"
fi
}
log "=== TS=1 VALIDATION (4-GPU): KVC 1P3D × N=3 + 4DP × 1 ==="
log "Model: $MODEL"
log "Trace: $TRACE (4449 requests, 52 sessions)"
log "Goal: validate whether ts=10 was the main distortion in v3-v6 KVC vs DP"
# KVC × 3 first (the new data we need); DP last (cheaper sanity at end)
for i in 1 2 3; do
run_kvc_1p3d $i
done
run_dp4_sanity
log ""
log "=== TS=1 SUMMARY ==="
for label in kvc_1p3d_run1 kvc_1p3d_run2 kvc_1p3d_run3 dp4; do
if [ -f "$OUTPUT/${label}_summary.json" ]; then
e=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
p50=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('latency_stats_s',{}).get('p50','n/a'))")
log " ${label}: errors=$e lat_p50=${p50}s"
fi
done
log "=== TS=1 ALL DONE ==="

View File

@@ -0,0 +1,65 @@
#!/bin/bash
# Migration v1 validation: KVC 1P3D ts=1 with --kvcache-migration-reject-threshold=3
# Compare against baseline outputs/qwen3-30b-tp1-ts1-validation/kvc_1p3d_run{1,2,3}
# (all of which had no migration — runs were structurally identical).
#
# Goal: verify §1 fix changes the categorical outcome — direct-to-D % up,
# fallback-session-not-resident % down, lat mean down.
#
# ts=1 is deterministic at the categorical level, so N=1 is sufficient
# (TEAM_REPORT §2.8 revised).
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-ts1-migration-v1
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE; }
log "=== TS=1 MIGRATION v1: KVC 1P3D --kvcache-migration-reject-threshold=3 ==="
log "Baseline reference: outputs/qwen3-30b-tp1-ts1-validation/kvc_1p3d_run1 (errors=5, lat mean=1.574s, direct-to-D=42.8%)"
label=kvc_1p3d_migration_run1
log ""
log "=== [migration v1] starting ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 3 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
--transfer-backend mooncake \
--gpu-budget 4 \
--time-scale 1 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--kvcache-migration-reject-threshold 3
run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
log "=== [migration v1] $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
p50=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('latency_stats_s',{}).get('p50',0))")
log " errors=$errs lat_p50=${p50}s"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
fi
log "=== migration v1 DONE ==="

View File

@@ -0,0 +1,76 @@
#!/bin/bash
# Migration v2 validation: KVC 1P3D ts=1 with BOTH:
# (1) reset-on-success blacklist decay (replay.py code change)
# (2) --kvcache-direct-max-uncached-tokens 8192 (was 2048 default)
#
# v1 results (kvc_1p3d_migration_run1) showed:
# - lat mean WORSE +11.7%, TTFT mean WORSE +71.3% — thrashing tax
# - direct-to-D rate UP +10.5pp (42.8 → 53.3%)
# - Fallback breakdown surprise: 41.3% are 'real-large-append' (>2048 tok),
# NOT 'session-not-resident' as we hypothesized
#
# v2 design (REFACTOR_PLAN_V1 + MIGRATION_V1_FINDINGS):
# (1) reset-on-success: clear (sess,D) reject counter on successful direct-to-D
# — eliminates blacklist-permanence bug → kills thrashing
# (2) bump direct-append threshold 2048 → 8192: lets more large-append turns
# go direct-to-D instead of fall through to seed (which often rejects)
set -euo pipefail
cd "$(dirname "$0")/.."
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
TRACE=outputs/qwen35-swebench-50sess.jsonl
OUTPUT=outputs/qwen3-30b-tp1-ts1-migration-v2
VENV_PYTHON=.venv/bin/python
RESULTS_FILE=$OUTPUT/sweep_results.txt
mkdir -p $OUTPUT
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE; }
log "=== TS=1 MIGRATION v2: reset-on-success + threshold=8192 ==="
log "Baselines:"
log " baseline (no migration): kvc_1p3d_run1 errors=5 lat_p50=0.811s ttft_p50=0.124s direct=42.8%"
log " v1 (migration permanent): kvc_1p3d_migration_run1 errors=6 lat_p50=0.773s ttft_p50=0.057s direct=53.3% lat_mean=1.758s"
log " 4DP ts=1: errors=0 lat_p50=0.659s ttft_p50=0.090s lat_mean=1.443s"
log "Goal: kill thrashing tax (lat_mean ≤ 1.5s, p99 ≤ 9s) while preserving v1's direct-to-D gains."
label=kvc_1p3d_migration_v2_run1
log ""
log "=== [migration v2] starting ==="
PYTHONPATH=src:third_party/sglang/python \
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
--trace $TRACE \
--output-root $OUTPUT \
--mechanism kvcache-centric \
--policy kv-aware \
--model-path $MODEL \
--prefill-workers 1 --decode-workers 3 \
--prefill-tp-size 1 --decode-tp-size 1 \
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
--transfer-backend mooncake \
--gpu-budget 4 \
--time-scale 1 \
--session-sample-rate 1.0 \
--target-duration-s 100000 \
--concurrency-limit 32 \
--timeout-s 900 \
--request-timeout-s 300 \
--kvcache-admission-mode worker \
--kvcache-seed-min-turn-id 1 \
--kvcache-seed-max-inflight-decode -1 \
--kvcache-prefill-backup-policy release-after-transfer \
--kvcache-prefill-priority-eviction \
--kvcache-migration-reject-threshold 3 \
--kvcache-direct-max-uncached-tokens 8192
run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
log "=== [migration v2] $label COMPLETED ==="
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
p50=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('latency_stats_s',{}).get('p50',0))")
log " errors=$errs lat_p50=${p50}s"
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
fi
log "=== migration v2 DONE ==="

View File

@@ -43,6 +43,11 @@ class BenchmarkConfig:
kvcache_prefill_priority_eviction: bool = False
kvcache_prefill_direct_priority: int = -100
kvcache_prefill_normal_priority: int = 100
pool_poll_interval_s: float = 0.0
pool_poll_include_sessions: bool = True
enable_backpressure: bool = False
backpressure_max_pause_s: float = 2.0
kvcache_migration_reject_threshold: int = 3
sample_profile: str = "default"
min_initial_input_tokens: int | None = None
max_initial_input_tokens: int | None = None
@@ -119,6 +124,8 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
try:
signal.signal(signal.SIGINT, _handle_termination)
signal.signal(signal.SIGTERM, _handle_termination)
_mechanisms_with_router = {"pd-disaggregation", "kvcache-centric", "pd-colo"}
_naive_dp = config.mechanism_name == "pd-colo"
if config.launch_stack:
stack = launch_pd_stack(
topology=topology,
@@ -132,18 +139,19 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
else config.timeout_s
),
include_router=(
config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
config.mechanism_name in _mechanisms_with_router
),
naive_dp=_naive_dp,
)
router_url = (
stack.router_url
if config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
if config.mechanism_name in _mechanisms_with_router
else None
)
else:
router_url = (
topology.router_url
if config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
if config.mechanism_name in _mechanisms_with_router
else None
)
@@ -187,6 +195,11 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
),
kvcache_prefill_direct_priority=config.kvcache_prefill_direct_priority,
kvcache_prefill_normal_priority=config.kvcache_prefill_normal_priority,
pool_poll_interval_s=config.pool_poll_interval_s,
pool_poll_include_sessions=config.pool_poll_include_sessions,
enable_backpressure=config.enable_backpressure,
backpressure_max_pause_s=config.backpressure_max_pause_s,
kvcache_migration_reject_threshold=config.kvcache_migration_reject_threshold,
)
if config.request_timeout_s is not None:
replay_config = replace(
@@ -243,6 +256,11 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
"kvcache_prefill_normal_priority": (
config.kvcache_prefill_normal_priority
),
"pool_poll_interval_s": config.pool_poll_interval_s,
"pool_poll_include_sessions": config.pool_poll_include_sessions,
"enable_backpressure": config.enable_backpressure,
"backpressure_max_pause_s": config.backpressure_max_pause_s,
"kvcache_migration_reject_threshold": config.kvcache_migration_reject_threshold,
"sample_profile": config.sample_profile,
"min_initial_input_tokens": config.min_initial_input_tokens,
"max_initial_input_tokens": config.max_initial_input_tokens,

View File

@@ -228,6 +228,48 @@ def main() -> None:
)
replay.add_argument("--kvcache-prefill-direct-priority", type=int, default=-100)
replay.add_argument("--kvcache-prefill-normal-priority", type=int, default=100)
replay.add_argument(
"--pool-poll-interval-s",
type=float,
default=0.0,
help=(
"Poll each P/D worker's /server_info every N seconds and write a "
"time-series snapshot to <run_dir>/d-pool-timeseries.jsonl. "
"0 disables polling."
),
)
replay.add_argument(
"--pool-poll-no-sessions",
action="store_true",
help=(
"Disable per-session detail in the pool timeseries (smaller files)."
),
)
replay.add_argument(
"--enable-backpressure",
action="store_true",
help=(
"Honor recommended_pause_ms hints from D's admission endpoint. "
"When set, replay sleeps before issuing requests to a saturated D. "
"Default off — preserves baseline behavior."
),
)
replay.add_argument(
"--backpressure-max-pause-s",
type=float,
default=2.0,
help="Cap on per-request backpressure sleep, regardless of D hint.",
)
replay.add_argument(
"--kvcache-migration-reject-threshold",
type=int,
default=3,
help=(
"Per-(session, D) admission-reject count after which KvAwarePolicy "
"skips that D for the session (forces migration). 0 disables. "
"See REFACTOR_PLAN_V1 §6.2 / TEAM_REPORT §2.1."
),
)
sample = subparsers.add_parser(
"sample-sessions",
@@ -439,6 +481,46 @@ def main() -> None:
)
benchmark.add_argument("--kvcache-prefill-direct-priority", type=int, default=-100)
benchmark.add_argument("--kvcache-prefill-normal-priority", type=int, default=100)
benchmark.add_argument(
"--pool-poll-interval-s",
type=float,
default=0.0,
help=(
"Poll each P/D worker's /server_info every N seconds and write a "
"time-series snapshot to <run_dir>/d-pool-timeseries.jsonl. "
"0 disables polling."
),
)
benchmark.add_argument(
"--pool-poll-no-sessions",
action="store_true",
help=(
"Disable per-session detail in the pool timeseries (smaller files)."
),
)
benchmark.add_argument(
"--enable-backpressure",
action="store_true",
help=(
"Honor recommended_pause_ms hints from D's admission endpoint."
),
)
benchmark.add_argument(
"--backpressure-max-pause-s",
type=float,
default=2.0,
help="Cap on per-request backpressure sleep, regardless of D hint.",
)
benchmark.add_argument(
"--kvcache-migration-reject-threshold",
type=int,
default=3,
help=(
"Per-(session, D) admission-reject count after which KvAwarePolicy "
"skips that D for the session (forces migration). 0 disables. "
"See REFACTOR_PLAN_V1 §6.2 / TEAM_REPORT §2.1."
),
)
benchmark.add_argument(
"--sample-profile",
choices=["default", "small-append"],
@@ -455,11 +537,18 @@ def main() -> None:
if args.command == "print-launch":
topology = _topology_from_args(args)
has_pd = bool(topology.prefill_workers and topology.decode_workers)
has_direct_only = bool(
topology.direct_workers
and not topology.prefill_workers
and not topology.decode_workers
)
plan = build_launch_plan(
topology,
prefill_policy=args.prefill_policy,
decode_policy=args.decode_policy,
include_router=bool(topology.prefill_workers and topology.decode_workers),
include_router=has_pd or has_direct_only,
naive_dp=has_direct_only,
)
print(plan.render())
return
@@ -513,6 +602,11 @@ def main() -> None:
),
kvcache_prefill_direct_priority=args.kvcache_prefill_direct_priority,
kvcache_prefill_normal_priority=args.kvcache_prefill_normal_priority,
pool_poll_interval_s=args.pool_poll_interval_s,
pool_poll_include_sessions=not args.pool_poll_no_sessions,
enable_backpressure=args.enable_backpressure,
backpressure_max_pause_s=args.backpressure_max_pause_s,
kvcache_migration_reject_threshold=args.kvcache_migration_reject_threshold,
)
results = asyncio.run(replay_trace(config))
print(
@@ -655,6 +749,11 @@ def main() -> None:
kvcache_prefill_normal_priority=(
args.kvcache_prefill_normal_priority
),
pool_poll_interval_s=args.pool_poll_interval_s,
pool_poll_include_sessions=not args.pool_poll_no_sessions,
enable_backpressure=args.enable_backpressure,
backpressure_max_pause_s=args.backpressure_max_pause_s,
kvcache_migration_reject_threshold=args.kvcache_migration_reject_threshold,
sample_profile=args.sample_profile,
min_initial_input_tokens=args.min_initial_input_tokens,
max_initial_input_tokens=args.max_initial_input_tokens,

View File

@@ -34,7 +34,24 @@ def build_launch_plan(
decode_policy: str = "manual",
include_router: bool = True,
router_request_timeout_s: float | None = None,
naive_dp: bool = False,
) -> LaunchPlan:
router_command: tuple[str, ...] | None = None
if include_router:
if topology.prefill_workers and topology.decode_workers:
router_command = _build_router_command(
topology,
prefill_policy=prefill_policy,
decode_policy=decode_policy,
request_timeout_s=router_request_timeout_s,
)
elif naive_dp and topology.direct_workers:
router_command = _build_dp_router_command(
topology,
backend_policy=decode_policy,
request_timeout_s=router_request_timeout_s,
)
return LaunchPlan(
prefill_commands=tuple(
_build_server_command(topology, worker) for worker in topology.prefill_workers
@@ -43,24 +60,17 @@ def build_launch_plan(
_build_server_command(topology, worker) for worker in topology.decode_workers
),
direct_commands=tuple(
_build_server_command(topology, worker) for worker in topology.direct_workers
),
router_command=(
_build_router_command(
topology,
prefill_policy=prefill_policy,
decode_policy=decode_policy,
request_timeout_s=router_request_timeout_s,
)
if include_router and topology.prefill_workers and topology.decode_workers
else None
_build_server_command(topology, worker, naive_dp=naive_dp)
for worker in topology.direct_workers
),
router_command=router_command,
)
def _build_server_command(
topology: SingleNodeTopology,
worker: WorkerSpec,
naive_dp: bool = False,
) -> tuple[str, ...]:
command = [
sys.executable,
@@ -76,11 +86,15 @@ def _build_server_command(
str(worker.port),
"--base-gpu-id",
str(worker.gpu_id),
]
# Naive DP direct workers: no disaggregation flags at all
if not (naive_dp and worker.role == "direct"):
command.extend([
"--disaggregation-mode",
_disaggregation_mode_for(worker),
"--disaggregation-transfer-backend",
topology.transfer_backend,
]
])
if worker.tp_size > 1:
command.extend(["--tp-size", str(worker.tp_size)])
if topology.trust_remote_code:
@@ -135,6 +149,32 @@ def _build_router_command(
return tuple(command)
def _build_dp_router_command(
topology: SingleNodeTopology,
*,
backend_policy: str,
request_timeout_s: float | None,
) -> tuple[str, ...]:
command: list[str] = [
sys.executable,
"-B",
"-u",
"-m",
"agentic_pd_hybrid.pd_router",
"--host",
topology.router_host,
"--port",
str(topology.router_port),
"--backend-policy",
backend_policy,
]
if request_timeout_s is not None:
command.extend(["--request-timeout-s", str(request_timeout_s)])
for worker in topology.direct_workers:
command.extend(["--backend", worker.url])
return tuple(command)
def _render_named_command(name: str, command: tuple[str, ...]) -> str:
return f"# {name}\n" + " ".join(shlex.quote(part) for part in command)

View File

@@ -43,6 +43,9 @@ class RequestMetrics:
ttft_s: float | None
tpot_s: float | None
error: str | None = None
actual_output_tokens: int | None = None
requested_output_tokens: int | None = None
finish_reason: str | None = None
@classmethod
def from_decision(
@@ -63,6 +66,9 @@ class RequestMetrics:
prefill_request_priority: int | None = None,
decode_request_priority: int | None = None,
error: str | None = None,
actual_output_tokens: int | None = None,
requested_output_tokens: int | None = None,
finish_reason: str | None = None,
) -> "RequestMetrics":
return cls(
request_id=request.request_id,
@@ -95,6 +101,9 @@ class RequestMetrics:
ttft_s=ttft_s,
tpot_s=tpot_s,
error=error,
actual_output_tokens=actual_output_tokens,
requested_output_tokens=requested_output_tokens,
finish_reason=finish_reason,
)
@@ -105,6 +114,16 @@ def write_metrics_jsonl(path: Path, rows: list[RequestMetrics]) -> None:
handle.write(json.dumps(asdict(row), sort_keys=True) + "\n")
def _is_failed_request(row: RequestMetrics) -> bool:
if row.error is not None:
return True
if row.finish_reason is not None:
fr = str(row.finish_reason).lower()
if "abort" in fr or "badrequest" in fr:
return True
return False
def write_summary_json(
path: Path,
rows: list[RequestMetrics],
@@ -112,9 +131,10 @@ def write_summary_json(
trace_path: Path,
router_url: str | None,
) -> None:
latencies = [row.latency_s for row in rows if row.latency_s is not None]
ttfts = [row.ttft_s for row in rows if row.ttft_s is not None]
tpots = [row.tpot_s for row in rows if row.tpot_s is not None]
successful = [row for row in rows if not _is_failed_request(row)]
latencies = [row.latency_s for row in successful if row.latency_s is not None]
ttfts = [row.ttft_s for row in successful if row.ttft_s is not None]
tpots = [row.tpot_s for row in successful if row.tpot_s is not None]
per_decode_load = Counter(row.assigned_decode_node for row in rows)
per_prefill_load = Counter(row.assigned_prefill_node for row in rows)
prefill_priorities = Counter(
@@ -158,6 +178,28 @@ def write_summary_json(
str(key): value for key, value in sorted(decode_priorities.items())
},
"error_count": sum(1 for row in rows if row.error is not None),
"abort_count": sum(
1
for row in rows
if row.error is None
and row.finish_reason is not None
and (
"abort" in str(row.finish_reason).lower()
or "badrequest" in str(row.finish_reason).lower()
)
),
"failure_count": sum(1 for row in rows if _is_failed_request(row)),
"truncated_request_count": sum(
1
for row in rows
if row.actual_output_tokens is not None
and row.requested_output_tokens is not None
and row.requested_output_tokens > 1
and row.actual_output_tokens < row.requested_output_tokens * 0.5
),
"actual_output_tokens_stats": _stats(
[float(row.actual_output_tokens) for row in rows if row.actual_output_tokens is not None]
),
}
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8") as handle:

View File

@@ -74,8 +74,58 @@ class RouterState:
return idx
@dataclass
class DpRouterConfig:
host: str
port: int
backend_urls: list[str]
backend_policy: str = "round_robin"
request_timeout_s: float = 1800.0
class DpRouterState:
"""DP (data-parallel) router: forward each request to exactly one backend."""
def __init__(self, config: DpRouterConfig):
if not config.backend_urls:
raise ValueError("At least one backend worker is required")
self.config = config
self.cursor = 0
self.sticky_map: dict[str, int] = {}
def select_backend(self, headers: dict[str, str]) -> str:
idx = self._select_index(headers)
return self.config.backend_urls[idx]
def _select_index(self, headers: dict[str, str]) -> int:
target_worker = headers.get("x-smg-target-worker")
routing_key = headers.get("x-smg-routing-key")
if (
self.config.backend_policy == "consistent_hashing"
and target_worker is not None
):
idx = int(target_worker)
if 0 <= idx < len(self.config.backend_urls):
return idx
if self.config.backend_policy == "manual" and routing_key:
cached = self.sticky_map.get(routing_key)
if cached is not None:
return cached
idx = self.cursor % len(self.config.backend_urls)
self.cursor += 1
self.sticky_map[routing_key] = idx
return idx
idx = self.cursor % len(self.config.backend_urls)
self.cursor += 1
return idx
app = FastAPI()
router_state: RouterState | None = None
dp_state: DpRouterState | None = None
@app.get("/health")
@@ -85,6 +135,16 @@ async def health() -> Response:
@app.get("/health_generate")
async def health_generate() -> Response:
if dp_state is not None:
async with aiohttp.ClientSession() as session:
tasks = [
session.get(f"{url}/health_generate")
for url in dp_state.config.backend_urls
]
for response in asyncio.as_completed(tasks):
async with await response:
pass
return Response(status_code=200)
state = _require_state()
async with aiohttp.ClientSession() as session:
tasks = []
@@ -101,6 +161,11 @@ async def health_generate() -> Response:
@app.get("/v1/models")
async def models() -> ORJSONResponse:
if dp_state is not None:
async with aiohttp.ClientSession() as session:
async with session.get(f"{dp_state.config.backend_urls[0]}/v1/models") as resp:
payload = await resp.json()
return ORJSONResponse(payload, status_code=resp.status)
state = _require_state()
async with aiohttp.ClientSession() as session:
async with session.get(f"{state.config.prefill_urls[0][0]}/v1/models") as response:
@@ -147,6 +212,15 @@ async def _forward_to_backend(
headers: dict[str, str],
endpoint_name: str,
) -> Response:
# DP mode: forward to a single backend
if dp_state is not None:
return await _forward_to_dp_backend(
request_data=request_data,
headers=headers,
endpoint_name=endpoint_name,
)
# PD mode: coordinate prefill + decode
state = _require_state()
prefill_server, bootstrap_port, decode_server = state.select_pair(headers)
prefill_request, decode_request = _build_backend_requests(
@@ -186,6 +260,63 @@ async def _forward_to_backend(
)
async def _forward_to_dp_backend(
*,
request_data: dict,
headers: dict[str, str],
endpoint_name: str,
) -> Response:
assert dp_state is not None
backend_server = dp_state.select_backend(headers)
cleaned = _strip_internal_fields(request_data)
timeout_s = dp_state.config.request_timeout_s
if request_data.get("stream", False):
return StreamingResponse(
_stream_dp_generate(
request_data=cleaned,
backend_server=backend_server,
endpoint_name=endpoint_name,
timeout_s=timeout_s,
),
media_type="text/event-stream",
)
async with aiohttp.ClientSession(
timeout=aiohttp.ClientTimeout(total=timeout_s)
) as session:
async with session.post(
f"{backend_server}/{endpoint_name}", json=cleaned
) as response:
body = await response.read()
return Response(
content=body,
status_code=response.status,
media_type=response.content_type,
)
async def _stream_dp_generate(
*,
request_data: dict,
backend_server: str,
endpoint_name: str,
timeout_s: float,
) -> AsyncIterator[bytes]:
async with aiohttp.ClientSession(
timeout=aiohttp.ClientTimeout(total=timeout_s)
) as session:
async with session.post(
f"{backend_server}/{endpoint_name}", json=request_data
) as response:
if response.status != HTTPStatus.OK:
payload = await response.read()
yield payload
return
async for chunk in response.content.iter_chunked(_STREAM_CHUNK_SIZE):
yield chunk
async def _stream_generate(
*,
prefill_request: dict,
@@ -241,6 +372,12 @@ def _build_backend_requests(
prefill_request.update(bootstrap_payload)
decode_request.update(bootstrap_payload)
# session_params is only meaningful for the decode worker (streaming session
# KV reuse). Sending it to the prefill worker causes the D side to
# short-circuit with local-prefill on already-open sessions, returning
# truncated responses while P's KV transfer gets aborted.
prefill_request.pop("session_params", None)
if prefill_priority is not None:
prefill_request["priority"] = int(prefill_priority)
if decode_priority is not None:
@@ -262,7 +399,7 @@ def _require_state() -> RouterState:
def main() -> None:
parser = argparse.ArgumentParser(description="Minimal local PD router")
parser = argparse.ArgumentParser(description="Minimal local PD / DP router")
parser.add_argument("--host", default="127.0.0.1")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument(
@@ -270,19 +407,44 @@ def main() -> None:
nargs=2,
metavar=("URL", "BOOTSTRAP_PORT"),
action="append",
required=True,
default=None,
)
parser.add_argument(
"--decode",
action="append",
required=True,
default=None,
)
parser.add_argument("--prefill-policy", default="round_robin")
parser.add_argument("--decode-policy", default="manual")
parser.add_argument(
"--backend",
action="append",
default=None,
help="Backend URL for DP (data-parallel) mode. Repeat for each worker.",
)
parser.add_argument(
"--backend-policy",
default="round_robin",
help="Routing policy for DP mode: round_robin, manual, consistent_hashing.",
)
parser.add_argument("--request-timeout-s", type=float, default=1800.0)
args = parser.parse_args()
global router_state
global router_state, dp_state
if args.backend:
# DP mode: simple forward to one of N backends
dp_state = DpRouterState(
DpRouterConfig(
host=args.host,
port=args.port,
backend_urls=list(args.backend),
backend_policy=args.backend_policy,
request_timeout_s=args.request_timeout_s,
)
)
elif args.prefill and args.decode:
# PD mode: prefill/decode coordination
router_state = RouterState(
RouterConfig(
host=args.host,
@@ -294,6 +456,9 @@ def main() -> None:
request_timeout_s=args.request_timeout_s,
)
)
else:
parser.error("Either --backend (DP mode) or both --prefill and --decode (PD mode) are required")
uvicorn.run(app, host=args.host, port=args.port, log_level="info")

View File

@@ -44,6 +44,10 @@ class RoutingState:
inflight_decode: Counter[str] = field(default_factory=Counter)
decode_assignment_counts: Counter[str] = field(default_factory=Counter)
decode_resident_blocks: dict[str, set[int]] = field(default_factory=dict)
# Migration support: per-(session_id, decode_worker_id) admission reject counter.
# KvAwarePolicy uses this to skip D's that have repeatedly rejected this session
# (avoids the structural starvation observed in TEAM_REPORT §2.1).
session_d_rejects: Counter[tuple[str, str]] = field(default_factory=Counter)
@classmethod
def create(cls, topology: SingleNodeTopology) -> "RoutingState":
@@ -66,6 +70,12 @@ class RoutingState:
self.decode_cursor += 1
return worker.worker_id
def record_admission_reject(self, session_id: str, decode_worker_id: str) -> int:
"""Increment per-(session, D) rejection counter. Returns new count."""
key = (session_id, decode_worker_id)
self.session_d_rejects[key] += 1
return self.session_d_rejects[key]
def finish(self, request: TraceRequest, decision: RoutingDecision) -> None:
session = self.session_state.setdefault(request.session_id, SessionRouteState())
session.last_decode_worker = decision.decode_worker_id
@@ -146,6 +156,11 @@ class StickyDecodePolicy:
class KvAwarePolicy:
name: str = "kv-aware"
sticky_bonus: int = 1
# Session migration: when (session, D) has been rejected this many times,
# skip D entirely for this session (force migration to another D).
# 0 disables the mechanism. Default 3 picked empirically to allow brief
# transient saturation without panicking, but to reroute persistent starvation.
migration_reject_threshold: int = 3
def select(
self,
@@ -158,8 +173,19 @@ class KvAwarePolicy:
session = state.session_state.get(request.session_id)
best_decode_worker_id: str | None = None
best_score: tuple[int, int, int] | None = None
best_score: tuple[int, int, int, int] | None = None
candidates_considered = 0
for worker in topology.route_workers:
# Migration: skip workers that have rejected this session too many times.
# If all candidates get filtered (degenerate case), fall through to
# un-filtered selection below.
if self.migration_reject_threshold > 0:
rejects = state.session_d_rejects.get(
(request.session_id, worker.worker_id), 0
)
if rejects >= self.migration_reject_threshold:
continue
candidates_considered += 1
overlap = _overlap_blocks(request, state, worker.worker_id)
sticky = int(session is not None and session.last_decode_worker == worker.worker_id)
inflight_penalty = -state.inflight_decode.get(worker.worker_id, 0)
@@ -174,6 +200,16 @@ class KvAwarePolicy:
best_score = score
best_decode_worker_id = worker.worker_id
# Degenerate fallback: every D was filtered. Pick the least-rejected D.
if best_decode_worker_id is None:
best_decode_worker_id = min(
(w.worker_id for w in topology.route_workers),
key=lambda wid: state.session_d_rejects.get(
(request.session_id, wid), 0
),
)
best_score = (0, 0, 0, 0)
assert best_decode_worker_id is not None
reuse_expected = bool(best_score and best_score[0] > 0)
return _build_decision(
@@ -187,14 +223,14 @@ class KvAwarePolicy:
)
def create_policy(name: str) -> RoutingPolicy:
def create_policy(name: str, *, migration_reject_threshold: int = 3) -> RoutingPolicy:
normalized = name.strip().lower()
if normalized == "default":
return DefaultPolicy()
if normalized == "sticky":
return StickyDecodePolicy()
if normalized in {"kv-aware", "kv_aware", "kv"}:
return KvAwarePolicy()
return KvAwarePolicy(migration_reject_threshold=migration_reject_threshold)
raise ValueError(f"Unsupported policy: {name}")

View File

@@ -31,6 +31,44 @@ KvCachePrefillBackupPolicy = Literal["release-after-transfer", "capacity-backup"
_ADMISSION_PROBE_TIMEOUT_S = 2.0
# --- Structural event logging (admission probes, backpressure pauses, ---
# --- session-D bindings). Module-level state keeps call-site diff small. ---
_STRUCTURAL_LOG_DIR: Path | None = None
_STRUCTURAL_LOG_LOCK = asyncio.Lock()
_STRUCTURAL_LOG_FILES: dict[str, Any] = {}
_STRUCTURAL_RUN_START_S: float = 0.0
def _structural_init(log_dir: Path | None) -> None:
global _STRUCTURAL_LOG_DIR, _STRUCTURAL_RUN_START_S
_STRUCTURAL_LOG_DIR = log_dir
_STRUCTURAL_RUN_START_S = time.perf_counter()
if log_dir is not None:
log_dir.mkdir(parents=True, exist_ok=True)
def _structural_close() -> None:
for handle in _STRUCTURAL_LOG_FILES.values():
try:
handle.close()
except Exception:
pass
_STRUCTURAL_LOG_FILES.clear()
async def _structural_emit(filename: str, event: dict[str, Any]) -> None:
if _STRUCTURAL_LOG_DIR is None:
return
event = {"t": round(time.perf_counter() - _STRUCTURAL_RUN_START_S, 4), **event}
async with _STRUCTURAL_LOG_LOCK:
handle = _STRUCTURAL_LOG_FILES.get(filename)
if handle is None:
handle = (_STRUCTURAL_LOG_DIR / filename).open("a", encoding="utf-8")
_STRUCTURAL_LOG_FILES[filename] = handle
handle.write(json.dumps(event, sort_keys=True) + "\n")
handle.flush()
@dataclass(frozen=True)
class ReplayConfig:
trace_path: Path
@@ -64,6 +102,16 @@ class ReplayConfig:
kvcache_prefill_priority_eviction: bool = False
kvcache_prefill_direct_priority: int = -100
kvcache_prefill_normal_priority: int = 100
pool_poll_interval_s: float = 0.0
pool_poll_include_sessions: bool = True
enable_backpressure: bool = False
backpressure_max_pause_s: float = 2.0
# Session migration via per-(sess, D) admission reject memory.
# When a session has been admission-rejected this many times on a given D,
# KvAwarePolicy skips that D for the session (forcing migration). Default 3.
# Set 0 to disable. See REFACTOR_PLAN_V1 §6.2.
kvcache_migration_reject_threshold: int = 3
structural_log_dir: Path | None = None
@dataclass
@@ -95,6 +143,8 @@ class DecodeResidencyState:
prefill_reserved_tokens_by_server: dict[str, int] = field(default_factory=dict)
decode_evictions_prefill_backed: int = 0
decode_evictions_without_prefill_backup: int = 0
# Backpressure: per-D timestamp until which new requests should pause.
pause_until_s: dict[str, float] = field(default_factory=dict)
@dataclass(frozen=True)
@@ -124,9 +174,16 @@ class ExecutionResult:
prefill_request_priority: int | None = None
decode_request_priority: int | None = None
error: str | None = None
actual_output_tokens: int | None = None
requested_output_tokens: int | None = None
finish_reason: str | None = None
async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
structural_dir = config.structural_log_dir
if structural_dir is None and config.output_path is not None:
structural_dir = config.output_path.parent / "structural"
_structural_init(structural_dir)
requests = load_trace(config.trace_path, request_limit=config.request_limit)
if config.kvcache_seed_only_multiturn_sessions:
session_turns = Counter(request.session_id for request in requests)
@@ -138,7 +195,10 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
if turn_count > 1
),
)
policy = create_policy(config.policy_name)
policy = create_policy(
config.policy_name,
migration_reject_threshold=config.kvcache_migration_reject_threshold,
)
state = RoutingState.create(config.topology)
state_lock = asyncio.Lock()
semaphore = asyncio.Semaphore(config.concurrency_limit)
@@ -152,6 +212,25 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
client=client,
config=config,
)
poll_task: asyncio.Task[None] | None = None
if config.pool_poll_interval_s > 0:
poll_workers: list[tuple[str, str, str]] = []
for worker in config.topology.decode_workers:
poll_workers.append((worker.worker_id, "decode", worker.url))
for worker in config.topology.prefill_workers:
poll_workers.append((worker.worker_id, "prefill", worker.url))
if poll_workers:
poll_output = config.output_path.parent / "d-pool-timeseries.jsonl"
poll_task = asyncio.create_task(
_poll_pool_timeseries(
client=client,
workers=poll_workers,
interval_s=config.pool_poll_interval_s,
output_path=poll_output,
start_time=start_time,
include_sessions=config.pool_poll_include_sessions,
)
)
tasks = []
for request in requests:
if config.pace:
@@ -179,6 +258,12 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
session_tail_tasks[request.session_id] = tasks[-1]
results = await asyncio.gather(*tasks)
if poll_task is not None:
poll_task.cancel()
try:
await poll_task
except asyncio.CancelledError:
pass
for session in direct_sessions.values():
if session.opened:
try:
@@ -208,6 +293,7 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
trace_path=config.trace_path,
router_url=config.router_url,
)
_structural_close()
return results
@@ -231,6 +317,21 @@ async def _run_request(
async with state_lock:
decision = policy.select(request, topology=config.topology, state=state)
await _structural_emit(
"session-d-binding.jsonl",
{
"session_id": request.session_id,
"request_id": request.request_id,
"turn_id": request.turn_id,
"decode_worker_index": decision.decode_worker_index,
"decode_worker_id": decision.decode_worker_id,
"prefill_worker_id": decision.prefill_worker_id,
"observed_overlap_blocks": decision.observed_overlap_blocks,
"kv_transfer_blocks": decision.kv_transfer_blocks,
"inflight_decode_load": decision.inflight_decode_load,
},
)
try:
execution = await _execute_request(
client=client,
@@ -257,6 +358,22 @@ async def _run_request(
async with state_lock:
state.finish(request, decision)
# Migration feedback: if this request was forced into a fallback path
# because the chosen D rejected admission, record the (session, D)
# rejection so KvAwarePolicy can migrate this session next turn.
if _is_admission_rejection_mode(execution.execution_mode):
state.record_admission_reject(
request.session_id,
decision.decode_worker_id,
)
# Reset-on-success: a successful direct-to-D path proves D-X can
# currently serve this session — clear the cumulative reject counter
# so that brief past saturation doesn't permanently blacklist the D.
# (MIGRATION_V1_FINDINGS §4.1: blacklist-permanence bug fix.)
elif execution.execution_mode == "kvcache-direct-to-d-session":
state.session_d_rejects[
(request.session_id, decision.decode_worker_id)
] = 0
return RequestMetrics.from_decision(
request,
@@ -274,6 +391,9 @@ async def _run_request(
prefill_request_priority=execution.prefill_request_priority,
decode_request_priority=execution.decode_request_priority,
error=execution.error,
actual_output_tokens=execution.actual_output_tokens,
requested_output_tokens=execution.requested_output_tokens,
finish_reason=execution.finish_reason,
)
@@ -286,7 +406,17 @@ async def _invoke_router(
session_id: str | None = None,
prefill_request_priority: int | None = None,
decode_request_priority: int | None = None,
) -> tuple[float, float | None, float | None, int]:
decode_residency: "DecodeResidencyState | None" = None,
) -> GenerateResult:
if decode_residency is not None and config.enable_backpressure:
decode_url = config.topology.decode_workers[decode_worker_index].url
await _wait_for_decode_pause(
config=config,
residency=decode_residency,
server_url=decode_url,
request_id=request.request_id,
session_id=session_id,
)
headers = _build_headers(
request=request,
header_mode=config.header_mode,
@@ -414,6 +544,18 @@ async def _invoke_chat_completion(
return latency_s, ttft_s, tpot_s, cached_tokens
@dataclass(frozen=True)
class GenerateResult:
latency_s: float
ttft_s: float | None
tpot_s: float | None
cached_tokens: int
actual_output_tokens: int
requested_output_tokens: int
finish_reason: str | None
server_meta_info: dict | None
async def _invoke_generate(
*,
client: httpx.AsyncClient,
@@ -423,12 +565,16 @@ async def _invoke_generate(
timeout_s: float,
stream_idle_timeout_s: float | None,
stream: bool,
) -> tuple[float, float | None, float | None, int]:
) -> GenerateResult:
start = time.perf_counter()
ttft_s: float | None = None
cached_tokens = 0
sampling_params = payload.get("sampling_params", {})
generated_tokens = int(sampling_params.get("max_new_tokens", 1))
requested_output_tokens = int(sampling_params.get("max_new_tokens", 1))
actual_token_count = 0
finish_reason: str | None = None
last_meta_info: dict | None = None
if stream:
async with client.stream(
"POST",
@@ -452,8 +598,19 @@ async def _invoke_generate(
if isinstance(error, dict):
raise ValueError(error.get("message", json.dumps(error)))
cached_tokens = max(cached_tokens, _extract_generate_cached_tokens(parsed))
if _contains_generate_token(parsed) and ttft_s is None:
if _contains_generate_token(parsed):
actual_token_count += 1
if ttft_s is None:
ttft_s = time.perf_counter() - start
meta_info = parsed.get("meta_info")
if isinstance(meta_info, dict):
last_meta_info = meta_info
completion_tokens = int(meta_info.get("completion_tokens", 0))
if completion_tokens > actual_token_count:
actual_token_count = completion_tokens
fr = meta_info.get("finish_reason")
if fr is not None:
finish_reason = str(fr)
if _is_generate_terminal_chunk(parsed):
break
else:
@@ -469,15 +626,33 @@ async def _invoke_generate(
if isinstance(error, dict):
raise ValueError(error.get("message", json.dumps(error)))
cached_tokens = _extract_generate_cached_tokens(parsed)
meta_info = parsed.get("meta_info")
if isinstance(meta_info, dict):
last_meta_info = meta_info
actual_token_count = int(meta_info.get("completion_tokens", 0))
finish_reason = meta_info.get("finish_reason")
latency_s = time.perf_counter() - start
if stream and ttft_s is None and generated_tokens > 0:
if stream and ttft_s is None and requested_output_tokens > 0:
raise RuntimeError("generate stream ended before producing any token")
# Use actual token count for TPOT (not requested count)
effective_tokens = max(1, actual_token_count) if actual_token_count > 0 else max(1, requested_output_tokens)
if ttft_s is None:
tpot_s = None
else:
tpot_s = max(0.0, latency_s - ttft_s) / max(1, generated_tokens)
return latency_s, ttft_s, tpot_s, cached_tokens
tpot_s = max(0.0, latency_s - ttft_s) / effective_tokens
return GenerateResult(
latency_s=latency_s,
ttft_s=ttft_s,
tpot_s=tpot_s,
cached_tokens=cached_tokens,
actual_output_tokens=actual_token_count,
requested_output_tokens=requested_output_tokens,
finish_reason=finish_reason,
server_meta_info=last_meta_info,
)
async def _open_streaming_session(
@@ -593,6 +768,139 @@ async def _fetch_decode_server_state(
)
async def _query_pool_snapshot(
*,
client: httpx.AsyncClient,
server_url: str,
include_sessions: bool,
) -> dict[str, Any]:
try:
response = await client.get(
f"{server_url.rstrip('/')}/server_info",
timeout=_ADMISSION_PROBE_TIMEOUT_S,
)
response.raise_for_status()
payload = response.json()
except Exception as exc:
return {"error": type(exc).__name__}
internal = _extract_internal_state(payload)
session_cache = _extract_session_cache(payload)
sessions: list[dict[str, Any]] = []
if include_sessions and isinstance(session_cache.get("sessions"), list):
for entry in session_cache["sessions"]:
if not isinstance(entry, dict):
continue
sessions.append(
{
"session_id": entry.get("session_id"),
"resident": bool(entry.get("resident")),
"resident_tokens": int(entry.get("resident_tokens") or 0),
"idle_evictable": bool(entry.get("idle_evictable")),
"timed_out": bool(entry.get("timed_out")),
}
)
memory_usage = internal.get("memory_usage") if isinstance(internal, dict) else None
if not isinstance(memory_usage, dict):
memory_usage = {}
# P1 instrument: pool_breakdown decomposes "other" into named buckets
pool_breakdown = internal.get("pool_breakdown") if isinstance(internal, dict) else None
if not isinstance(pool_breakdown, dict):
pool_breakdown = {}
return {
"session_cache_enabled": bool(session_cache.get("enabled")),
"session_count": int(session_cache.get("session_count") or 0),
"resident_session_count": int(session_cache.get("resident_session_count") or 0),
"held_tokens": int(session_cache.get("held_tokens") or 0),
"available_tokens": int(session_cache.get("available_tokens") or 0),
"capacity_tokens": int(session_cache.get("capacity_tokens") or 0),
"idle_evictable_session_count": int(
session_cache.get("idle_evictable_session_count") or 0
),
"idle_evictable_tokens": int(session_cache.get("idle_evictable_tokens") or 0),
"kvcache_mem_gb": float(memory_usage.get("kvcache") or 0.0),
"token_capacity": int(memory_usage.get("token_capacity") or 0),
"max_total_num_tokens": int(internal.get("max_total_num_tokens") or 0)
if isinstance(internal, dict)
else 0,
"last_gen_throughput": float(internal.get("last_gen_throughput") or 0.0)
if isinstance(internal, dict)
else 0.0,
"radix_evictable_tokens": int(pool_breakdown.get("radix_evictable_tokens") or 0),
"radix_protected_tokens": int(pool_breakdown.get("radix_protected_tokens") or 0),
"slot_private_held_tokens": int(pool_breakdown.get("slot_private_held_tokens") or 0),
"session_slot_count": int(pool_breakdown.get("session_slot_count") or 0),
"running_batch_reqs": int(pool_breakdown.get("running_batch_reqs") or 0),
"running_batch_kv_tokens": int(pool_breakdown.get("running_batch_kv_tokens") or 0),
"transfer_queue_reqs": int(pool_breakdown.get("transfer_queue_reqs") or 0),
"transfer_queue_tokens": int(pool_breakdown.get("transfer_queue_tokens") or 0),
"prealloc_queue_reqs": int(pool_breakdown.get("prealloc_queue_reqs") or 0),
"prealloc_queue_tokens": int(pool_breakdown.get("prealloc_queue_tokens") or 0),
"retracted_queue_reqs": int(pool_breakdown.get("retracted_queue_reqs") or 0),
"retracted_queue_tokens": int(pool_breakdown.get("retracted_queue_tokens") or 0),
"sessions": sessions,
}
async def _poll_pool_timeseries(
*,
client: httpx.AsyncClient,
workers: list[tuple[str, str, str]],
interval_s: float,
output_path: Path,
start_time: float,
include_sessions: bool,
) -> None:
output_path.parent.mkdir(parents=True, exist_ok=True)
with output_path.open("w", encoding="utf-8") as handle:
try:
while True:
tick_started = time.perf_counter()
ts = time.time()
wall_s = tick_started - start_time
snapshots = await asyncio.gather(
*(
_query_pool_snapshot(
client=client,
server_url=url,
include_sessions=include_sessions,
)
for _, _, url in workers
),
return_exceptions=True,
)
for (worker_id, role, url), snap in zip(workers, snapshots):
if isinstance(snap, BaseException):
row: dict[str, Any] = {
"ts": ts,
"wall_s": wall_s,
"worker_id": worker_id,
"worker_role": role,
"worker_url": url,
"error": type(snap).__name__,
}
else:
row = {
"ts": ts,
"wall_s": wall_s,
"worker_id": worker_id,
"worker_role": role,
"worker_url": url,
**snap,
}
handle.write(json.dumps(row, sort_keys=True) + "\n")
handle.flush()
elapsed = time.perf_counter() - tick_started
sleep_s = interval_s - elapsed
if sleep_s > 0:
await asyncio.sleep(sleep_s)
except asyncio.CancelledError:
return
async def _query_decode_direct_admission(
*,
client: httpx.AsyncClient,
@@ -600,7 +908,13 @@ async def _query_decode_direct_admission(
session_id: str,
uncached_input_tokens: int,
output_tokens: int,
mode: str = "direct_append",
config: "ReplayConfig | None" = None,
residency: "DecodeResidencyState | None" = None,
request_id: str | None = None,
turn_id: int | None = None,
) -> dict[str, Any]:
started = time.perf_counter()
try:
response = await client.post(
f"{server_url.rstrip('/')}/session_cache/admit_direct_append",
@@ -608,16 +922,22 @@ async def _query_decode_direct_admission(
"session_id": session_id,
"uncached_input_tokens": max(0, uncached_input_tokens),
"output_tokens": max(0, output_tokens),
"mode": mode,
},
timeout=_ADMISSION_PROBE_TIMEOUT_S,
)
response.raise_for_status()
payload = response.json()
if isinstance(payload, dict):
return payload
except Exception:
pass
return {
if not isinstance(payload, dict):
payload = None
except Exception as exc:
payload = None
_last_exc_msg = type(exc).__name__
else:
_last_exc_msg = None
if payload is None:
payload = {
"can_admit": False,
"resident": False,
"reason": "admission-query-failed",
@@ -628,6 +948,78 @@ async def _query_decode_direct_admission(
"freed_tokens": 0,
}
rtt_s = time.perf_counter() - started
pause_ms = int(payload.get("recommended_pause_ms", 0) or 0)
# Update per-D pause window when backpressure is enabled.
if (
config is not None
and residency is not None
and config.enable_backpressure
and pause_ms > 0
):
max_pause_s = max(0.0, config.backpressure_max_pause_s)
applied_pause_s = min(pause_ms / 1000.0, max_pause_s)
new_until = time.perf_counter() + applied_pause_s
prev = residency.pause_until_s.get(server_url, 0.0)
if new_until > prev:
residency.pause_until_s[server_url] = new_until
# Always emit admission event for analysis (even if backpressure disabled).
await _structural_emit(
"admission-events.jsonl",
{
"server_url": server_url,
"session_id": session_id,
"request_id": request_id,
"turn_id": turn_id,
"mode": mode,
"rtt_s": round(rtt_s, 4),
"can_admit": bool(payload.get("can_admit")),
"resident": bool(payload.get("resident")),
"reason": payload.get("reason"),
"queue_depth": int(payload.get("decode_transfer_queue_reqs", 0) or 0),
"retracted_depth": int(payload.get("decode_retracted_queue_reqs", 0) or 0),
"available_tokens_after": int(payload.get("available_tokens_after", 0) or 0),
"token_usage": float(payload.get("token_usage", 0.0) or 0.0),
"evicted_session_count": int(payload.get("evicted_session_count", 0) or 0),
"recommended_pause_ms": pause_ms,
"uncached_input_tokens": int(uncached_input_tokens),
"output_tokens": int(output_tokens),
},
)
return payload
async def _wait_for_decode_pause(
*,
config: "ReplayConfig",
residency: "DecodeResidencyState",
server_url: str,
request_id: str | None = None,
session_id: str | None = None,
) -> None:
if not config.enable_backpressure:
return
until = residency.pause_until_s.get(server_url, 0.0)
if until <= 0:
return
now = time.perf_counter()
if now >= until:
return
sleep_s = min(until - now, config.backpressure_max_pause_s)
await _structural_emit(
"backpressure-events.jsonl",
{
"server_url": server_url,
"session_id": session_id,
"request_id": request_id,
"sleep_s": round(sleep_s, 4),
"until_offset_s": round(until - _STRUCTURAL_RUN_START_S, 4),
},
)
await asyncio.sleep(sleep_s)
async def _discover_decode_residency(
*,
@@ -850,8 +1242,8 @@ def _decode_session_soft_cap(
- residency.headroom_tokens.get(server_url, 0),
)
if usable_capacity_tokens <= 0:
return 4
return max(1, min(4, usable_capacity_tokens // target_tokens))
return 16
return max(1, min(16, usable_capacity_tokens // target_tokens))
def _should_admit_new_decode_session(
@@ -862,6 +1254,7 @@ def _should_admit_new_decode_session(
session: DirectSessionState,
direct_sessions: dict[str, DirectSessionState],
treat_as_fresh_session: bool,
admission_mode: KvCacheAdmissionMode = "router",
) -> bool:
if (
not treat_as_fresh_session
@@ -869,6 +1262,11 @@ def _should_admit_new_decode_session(
and session.server_url == server_url
):
return True
if admission_mode == "worker":
# Defer the capacity decision to D's admit_direct_append (mode=seed),
# which checks real KV pool availability and runs LRU eviction. The
# local soft cap is router-mode only.
return True
open_sessions = sum(
1
for candidate in direct_sessions.values()
@@ -975,6 +1373,49 @@ def _is_stale_decode_session_error(exc: Exception) -> bool:
)
# execution_mode substrings that signal D-side admission rejected this request.
# Used by _run_request to update state.session_d_rejects so KvAwarePolicy can
# migrate persistently-starved sessions to a different D next turn.
_ADMISSION_REJECTION_SUBSTRINGS = (
"session-cap",
"no-d-capacity",
"d-backpressure",
)
def _is_admission_rejection_mode(execution_mode: str) -> bool:
return any(token in execution_mode for token in _ADMISSION_REJECTION_SUBSTRINGS)
def _fallthrough_reason(
*,
request: TraceRequest,
config: ReplayConfig,
decision,
direct_append_length: int | None,
direct_session_reused: bool,
direct_session_reset: bool,
) -> str:
"""Classify why a turn-2+ KVC request fell through to the seed/large-append branch.
Returns a short label suffix used in execution_mode strings to replace the
misleading 'large-append' label (TEAM_REPORT §2.7). In particular,
'session-not-resident' is the §1 starvation signature — direct_session_reused
is False because the session was never opened on the policy-chosen D.
"""
if not direct_session_reused:
return "session-not-resident"
if direct_session_reset:
return "session-was-evicted"
if direct_append_length is None:
return "no-direct-info"
if direct_append_length > config.kvcache_direct_max_uncached_tokens:
return "real-large-append"
if not _should_bypass_prefill(request=request, config=config, decision=decision):
return "policy-no-bypass"
return "other-large-append"
def _dynamic_decode_headroom_tokens(
*,
residency: DecodeResidencyState,
@@ -1280,6 +1721,11 @@ async def _reserve_decode_session_capacity(
session_id=session.session_id,
uncached_input_tokens=max(0, request.input_length - current_tokens),
output_tokens=request.output_length,
mode="direct_append",
config=config,
residency=residency,
request_id=request.request_id,
turn_id=request.turn_id,
)
if not bool(admission.get("resident")):
return False, 0, 0, 0, str(admission.get("reason") or "d-session-not-resident")
@@ -1304,6 +1750,45 @@ async def _reserve_decode_session_capacity(
None,
)
# Seed / reseed path: ask D itself via the seed-mode admission endpoint
# instead of estimating capacity from a stale router-state snapshot. D
# will run LRU eviction internally to make room. Falls through to the
# legacy router-state logic below if the endpoint is unavailable.
seed_admission = await _query_decode_direct_admission(
client=client,
server_url=server_url,
session_id=session.session_id,
uncached_input_tokens=max(0, request.input_length - current_tokens),
output_tokens=request.output_length,
mode="seed",
config=config,
residency=residency,
request_id=request.request_id,
turn_id=request.turn_id,
)
seed_reason = seed_admission.get("reason")
if seed_reason != "admission-query-failed":
if not bool(seed_admission.get("can_admit")):
return (
False,
0,
int(seed_admission.get("evicted_session_count", 0) or 0),
0,
str(seed_reason or "d-no-space"),
)
reserved_tokens = int(
seed_admission.get("required_tokens", required_extra_tokens)
or required_extra_tokens
)
_add_reserved_tokens(residency, server_url, reserved_tokens)
return (
True,
reserved_tokens,
int(seed_admission.get("evicted_session_count", 0) or 0),
0,
None,
)
session_cache, max_total_num_tokens, reserved_decode_tokens = (
await _fetch_decode_server_state(
client=client,
@@ -1582,29 +2067,34 @@ async def _invoke_plain_router(
config: ReplayConfig,
decision,
execution_mode: str,
decode_residency: "DecodeResidencyState | None" = None,
) -> ExecutionResult:
prefill_priority = _prefill_priority_for_router_request(
config=config,
direct_to_d_predicted=False,
)
latency_s, ttft_s, tpot_s, cached_tokens = await _invoke_router(
gen = await _invoke_router(
client=client,
request=request,
config=config,
decode_worker_index=decision.decode_worker_index,
prefill_request_priority=prefill_priority,
decode_residency=decode_residency,
)
return ExecutionResult(
execution_mode=execution_mode,
actual_kv_transfer_blocks=decision.kv_transfer_blocks,
effective_input_length=request.input_length,
cached_tokens=cached_tokens,
cached_tokens=gen.cached_tokens,
prefill_request_priority=prefill_priority,
session_reused=False,
session_reset=False,
latency_s=latency_s,
ttft_s=ttft_s,
tpot_s=tpot_s,
latency_s=gen.latency_s,
ttft_s=gen.ttft_s,
tpot_s=gen.tpot_s,
actual_output_tokens=gen.actual_output_tokens,
requested_output_tokens=gen.requested_output_tokens,
finish_reason=gen.finish_reason,
)
@@ -1676,13 +2166,14 @@ async def _invoke_kvcache_seeded_router(
decode_session.opened = True
decode_session_newly_opened = True
decode_session.active_requests += 1
latency_s, ttft_s, tpot_s, cached_tokens = await _invoke_router(
gen = await _invoke_router(
client=client,
request=request,
config=config,
decode_worker_index=decision.decode_worker_index,
session_id=request.session_id,
prefill_request_priority=prefill_priority,
decode_residency=decode_residency,
)
except Exception:
async with direct_session_lock:
@@ -1742,13 +2233,16 @@ async def _invoke_kvcache_seeded_router(
execution_mode=execution_mode,
actual_kv_transfer_blocks=decision.kv_transfer_blocks,
effective_input_length=request.input_length,
cached_tokens=cached_tokens,
cached_tokens=gen.cached_tokens,
prefill_request_priority=prefill_priority,
session_reused=False,
session_reset=False,
latency_s=latency_s,
ttft_s=ttft_s,
tpot_s=tpot_s,
latency_s=gen.latency_s,
ttft_s=gen.ttft_s,
tpot_s=gen.tpot_s,
actual_output_tokens=gen.actual_output_tokens,
requested_output_tokens=gen.requested_output_tokens,
finish_reason=gen.finish_reason,
)
@@ -1771,17 +2265,21 @@ async def _execute_request(
config=config,
decision=decision,
execution_mode="pd-disaggregation-router",
decode_residency=decode_residency,
)
if config.mechanism_name == "pd-colo":
return await _invoke_direct(
if not config.router_url:
raise ValueError("router_url is required for pd-colo replay")
result = await _invoke_plain_router(
client=client,
request=request,
config=config,
decision=decision,
direct_sessions=direct_sessions,
direct_session_lock=direct_session_lock,
execution_mode="dp-colo-router",
decode_residency=decode_residency,
)
return replace(result, actual_kv_transfer_blocks=0)
if config.mechanism_name == "kvcache-centric":
if not config.router_url:
@@ -1838,6 +2336,7 @@ async def _execute_request(
config=config,
decision=decision,
execution_mode=f"pd-router-turn1-{seed_filter_reason}",
decode_residency=decode_residency,
)
async with direct_session_lock:
admit_new_decode_session = _should_admit_new_decode_session(
@@ -1847,6 +2346,7 @@ async def _execute_request(
session=decode_session,
direct_sessions=direct_sessions,
treat_as_fresh_session=True,
admission_mode=config.kvcache_admission_mode,
)
if not admit_new_decode_session:
can_seed = False
@@ -1874,6 +2374,7 @@ async def _execute_request(
config=config,
decision=decision,
execution_mode="pd-router-turn1-session-cap",
decode_residency=decode_residency,
)
if can_seed:
return await _invoke_kvcache_seeded_router(
@@ -1899,6 +2400,7 @@ async def _execute_request(
if seed_reason is not None and seed_reason != "d-no-space"
else "pd-router-turn1-no-d-capacity"
),
decode_residency=decode_residency,
)
if (
@@ -1970,6 +2472,7 @@ async def _execute_request(
config=config,
decision=decision,
execution_mode="pd-router-fallback-stale-d-session",
decode_residency=decode_residency,
)
if _is_decode_backpressure_reason(direct_reason):
return await _invoke_plain_router(
@@ -1978,6 +2481,7 @@ async def _execute_request(
config=config,
decision=decision,
execution_mode="pd-router-fallback-d-backpressure",
decode_residency=decode_residency,
)
seed_filter_reason = _seed_filter_reason(
@@ -1992,6 +2496,7 @@ async def _execute_request(
config=config,
decision=decision,
execution_mode=f"pd-router-fallback-{seed_filter_reason}",
decode_residency=decode_residency,
)
async with direct_session_lock:
admit_new_decode_session = _should_admit_new_decode_session(
@@ -2001,6 +2506,7 @@ async def _execute_request(
session=decode_session,
direct_sessions=direct_sessions,
treat_as_fresh_session=True,
admission_mode=config.kvcache_admission_mode,
)
if not admit_new_decode_session:
can_seed = False
@@ -2036,6 +2542,7 @@ async def _execute_request(
config=config,
decision=decision,
execution_mode="pd-router-fallback-session-cap",
decode_residency=decode_residency,
)
if can_seed:
return await _invoke_kvcache_seeded_router(
@@ -2067,8 +2574,20 @@ async def _execute_request(
if _is_decode_backpressure_reason(seed_reason)
else "pd-router-fallback-no-d-capacity"
),
decode_residency=decode_residency,
)
# TEAM_REPORT §2.7: 'large-append' is misleading — most fallthroughs are
# actually 'session-not-resident-on-pinned-D' (§1 starvation). Classify
# the real reason and embed it in the execution_mode label.
fallthrough = _fallthrough_reason(
request=request,
config=config,
decision=decision,
direct_append_length=direct_append_length,
direct_session_reused=direct_session_reused,
direct_session_reset=direct_session_reset,
)
seed_filter_reason = _seed_filter_reason(
request=request,
config=config,
@@ -2080,7 +2599,8 @@ async def _execute_request(
client=client,
config=config,
decision=decision,
execution_mode=f"pd-router-fallback-large-append-{seed_filter_reason}",
execution_mode=f"pd-router-fallback-{fallthrough}-{seed_filter_reason}",
decode_residency=decode_residency,
)
async with direct_session_lock:
admit_new_decode_session = _should_admit_new_decode_session(
@@ -2124,7 +2644,8 @@ async def _execute_request(
client=client,
config=config,
decision=decision,
execution_mode="pd-router-fallback-large-append-session-cap",
execution_mode=f"pd-router-fallback-{fallthrough}-session-cap",
decode_residency=decode_residency,
)
if can_seed:
return await _invoke_kvcache_seeded_router(
@@ -2139,23 +2660,28 @@ async def _execute_request(
decode_residency=decode_residency,
reserved_tokens=reserved_tokens,
execution_mode=(
"pd-router-large-append-reseed"
f"pd-router-{fallthrough}-reseed"
+ _eviction_suffix(
evicted_sessions,
prefill_backed_evictions,
)
),
)
# Preserve seed_reason in the label so migration feedback fires for
# 'd-no-space' / 'd-*-backpressure' (matched via _is_admission_rejection_mode).
if _is_decode_backpressure_reason(seed_reason):
mode_label = f"pd-router-fallback-{fallthrough}-d-backpressure"
elif seed_reason == "d-no-space":
mode_label = f"pd-router-fallback-{fallthrough}-no-d-capacity"
else:
mode_label = f"pd-router-fallback-{fallthrough}"
return await _invoke_plain_router(
request=request,
client=client,
config=config,
decision=decision,
execution_mode=(
"pd-router-fallback-d-backpressure"
if _is_decode_backpressure_reason(seed_reason)
else "pd-router-fallback-large-append"
),
execution_mode=mode_label,
decode_residency=decode_residency,
)
raise ValueError(f"Unsupported mechanism: {config.mechanism_name}")
@@ -2201,6 +2727,14 @@ async def _invoke_session_direct(
reserved_tokens: int = 0,
direct_session_lock: asyncio.Lock | None = None,
) -> ExecutionResult:
if decode_residency is not None and config.enable_backpressure:
await _wait_for_decode_pause(
config=config,
residency=decode_residency,
server_url=session.server_url,
request_id=request.request_id,
session_id=session.session_id,
)
_prompt, effective_input_length, session_reused, session_reset = _build_direct_prompt(
request=request,
session=session,
@@ -2238,7 +2772,7 @@ async def _invoke_session_direct(
session.active_requests += 1
try:
latency_s, ttft_s, tpot_s, cached_tokens = await _invoke_generate(
gen = await _invoke_generate(
client=client,
base_url=session.server_url,
headers={"x-request-id": request.request_id},
@@ -2277,12 +2811,15 @@ async def _invoke_session_direct(
execution_mode=execution_mode,
actual_kv_transfer_blocks=0,
effective_input_length=len(input_ids),
cached_tokens=cached_tokens,
cached_tokens=gen.cached_tokens,
session_reused=session_reused,
session_reset=session_reset,
latency_s=latency_s,
ttft_s=ttft_s,
tpot_s=tpot_s,
latency_s=gen.latency_s,
ttft_s=gen.ttft_s,
tpot_s=gen.tpot_s,
actual_output_tokens=gen.actual_output_tokens,
requested_output_tokens=gen.requested_output_tokens,
finish_reason=gen.finish_reason,
)

View File

@@ -66,6 +66,7 @@ def launch_pd_stack(
timeout_s: float = 1200.0,
router_request_timeout_s: float | None = None,
include_router: bool = True,
naive_dp: bool = False,
) -> ManagedPdStack:
run_dir.mkdir(parents=True, exist_ok=True)
logs_dir = run_dir / "logs"
@@ -77,6 +78,7 @@ def launch_pd_stack(
decode_policy=decode_policy,
include_router=include_router,
router_request_timeout_s=router_request_timeout_s,
naive_dp=naive_dp,
)
prefill_processes = [
@@ -195,6 +197,9 @@ def _build_process_env(topology: SingleNodeTopology) -> dict[str, str]:
env["MC_MS_AUTO_DISC"] = "0"
if topology.ib_device:
env["MOONCAKE_DEVICE"] = topology.ib_device
elif topology.transfer_backend == "mooncake":
# Default to TCP when RDMA is not forced (e.g. loopback on same node)
env.setdefault("MOONCAKE_PROTOCOL", "tcp")
repo_root = Path(__file__).resolve().parents[2]
python_paths = [

View File

@@ -189,10 +189,11 @@ class MooncakeTransferEngine:
device_name if device_name is not None else "",
)
else:
protocol = os.environ.get("MOONCAKE_PROTOCOL", "rdma")
ret_value = self.engine.initialize(
hostname,
"P2PHANDSHAKE",
"rdma",
protocol,
device_name if device_name is not None else "",
)
if ret_value != 0:

View File

@@ -1602,6 +1602,9 @@ class DirectAppendAdmissionReqInput(BaseReq):
session_id: str
uncached_input_tokens: int
output_tokens: int
# "direct_append": existing behavior — require session resident on this D
# "seed": new admission for session not yet resident; do capacity check + LRU eviction
mode: str = "direct_append"
@dataclass
@@ -1619,6 +1622,9 @@ class DirectAppendAdmissionReqOutput(BaseReq):
decode_prealloc_queue_reqs: int = 0
decode_transfer_queue_reqs: int = 0
decode_retracted_queue_reqs: int = 0
# Backpressure hint: if > 0, the caller should pause this many ms before
# sending more requests to this D. Computed from transfer-queue depth.
recommended_pause_ms: int = 0
@dataclass

View File

@@ -3181,6 +3181,89 @@ class Scheduler(
success = False
return success
def _compute_pool_breakdown_for_diagnostics(self) -> dict:
"""Read-only KV pool decomposition for the agentic-pd-hybrid profiler.
Decomposes capacity into:
- radix_evictable_tokens / radix_protected_tokens: tree-managed
- slot_private_held_tokens: SessionAwareCache out-of-tree slot holds
- running_batch_kv_tokens: kv_allocated_len of currently-decoding reqs
(overlaps with radix_protected; not additive)
- {transfer,prealloc,retracted}_queue_{reqs,tokens}: disagg queues
- available_tokens: free pool
Caller computes "unaccounted = capacity - sum_of_known" to find leakage.
Implementation is best-effort; missing components return omitted keys.
"""
breakdown: dict = {
"capacity_tokens": int(self.max_total_num_tokens or 0),
"available_tokens": int(self.token_to_kv_pool_allocator.available_size()),
}
# Radix tree (works for SessionAwareCache and most inner caches)
try:
ev = self.tree_cache.evictable_size()
pr = self.tree_cache.protected_size()
if isinstance(ev, tuple):
ev = ev[0]
if isinstance(pr, tuple):
pr = pr[0]
breakdown["radix_evictable_tokens"] = int(ev or 0)
breakdown["radix_protected_tokens"] = int(pr or 0)
except Exception:
pass
# SessionAwareCache slot-private holds (already in session_cache.held_tokens
# but mirrored here for one-stop decomposition)
try:
from sglang.srt.mem_cache.session_aware_cache import SessionAwareCache
if isinstance(self.tree_cache, SessionAwareCache):
breakdown["slot_private_held_tokens"] = int(
self.tree_cache.session_held_tokens()
)
breakdown["session_slot_count"] = int(
self.tree_cache.session_held_req_count()
)
except Exception:
pass
# Running batch KV (overlaps with radix_protected for tree-tracked reqs)
try:
running_reqs = self.running_batch.reqs
breakdown["running_batch_reqs"] = len(running_reqs)
breakdown["running_batch_kv_tokens"] = sum(
int(getattr(req, "kv_allocated_len", 0) or 0)
for req in running_reqs
)
except Exception:
pass
# Disagg decode queues
if self.disaggregation_mode == DisaggregationMode.DECODE:
try:
tq = self.disagg_decode_transfer_queue.queue
pq = self.disagg_decode_prealloc_queue.queue
rq = self.disagg_decode_prealloc_queue.retracted_queue
breakdown["transfer_queue_reqs"] = len(tq)
breakdown["transfer_queue_tokens"] = sum(
int(getattr(getattr(dr, "req", None), "kv_allocated_len", 0) or 0)
for dr in tq
)
breakdown["prealloc_queue_reqs"] = len(pq)
breakdown["prealloc_queue_tokens"] = sum(
int(getattr(getattr(dr, "req", None), "kv_allocated_len", 0) or 0)
for dr in pq
)
breakdown["retracted_queue_reqs"] = len(rq)
breakdown["retracted_queue_tokens"] = sum(
int(getattr(req, "kv_allocated_len", 0) or 0)
for req in rq
)
except Exception:
pass
return breakdown
def get_internal_state(self, recv_req: GetInternalStateReq):
ret = vars(get_global_server_args())
ret["last_gen_throughput"] = self.last_gen_throughput
@@ -3196,6 +3279,7 @@ class Scheduler(
ret["session_cache"] = (
self.session_controller.get_streaming_session_cache_status()
)
ret["pool_breakdown"] = self._compute_pool_breakdown_for_diagnostics()
if not self.spec_algorithm.is_none() and self.spec_total_num_forward_ct > 0:
ret["avg_spec_accept_length"] = (
@@ -3508,6 +3592,9 @@ class Scheduler(
reason="unsupported",
)
mode = getattr(recv_req, "mode", "direct_append") or "direct_append"
is_seed = mode == "seed"
session_cache_status = self.session_controller.get_streaming_session_cache_status(
recv_req.session_id
)
@@ -3515,27 +3602,28 @@ class Scheduler(
resident = bool(
isinstance(target_session, dict) and target_session.get("resident")
)
if not resident:
if not resident and not is_seed:
# direct_append requires the session already resident on this D.
# For seed we skip this check and let capacity decide.
transfer_queue_depth = len(self.disagg_decode_transfer_queue.queue)
retracted_queue_depth = len(self.disagg_decode_prealloc_queue.retracted_queue)
available_size = int(self.token_to_kv_pool_allocator.available_size())
token_usage = 1.0 - available_size / max(1, self.max_total_num_tokens)
return DirectAppendAdmissionReqOutput(
can_admit=False,
resident=False,
reason="session-not-resident",
available_tokens_before=int(
self.token_to_kv_pool_allocator.available_size()
),
available_tokens_after=int(
self.token_to_kv_pool_allocator.available_size()
),
token_usage=(
1.0
- self.token_to_kv_pool_allocator.available_size()
/ max(1, self.max_total_num_tokens)
),
available_tokens_before=available_size,
available_tokens_after=available_size,
token_usage=token_usage,
num_running_reqs=len(self.running_batch.reqs),
decode_prealloc_queue_reqs=len(self.disagg_decode_prealloc_queue.queue),
decode_transfer_queue_reqs=len(self.disagg_decode_transfer_queue.queue),
decode_retracted_queue_reqs=len(
self.disagg_decode_prealloc_queue.retracted_queue
decode_transfer_queue_reqs=transfer_queue_depth,
decode_retracted_queue_reqs=retracted_queue_depth,
recommended_pause_ms=self._compute_backpressure_pause_hint(
transfer_queue_depth=transfer_queue_depth,
retracted_queue_depth=retracted_queue_depth,
token_usage_after=token_usage,
),
)
@@ -3543,10 +3631,13 @@ class Scheduler(
0, recv_req.output_tokens
)
available_tokens_before = int(self.token_to_kv_pool_allocator.available_size())
# Don't evict the session itself when it's already resident; for seed
# of a fresh session there is nothing to exclude.
exclude_ids = {recv_req.session_id} if resident else set()
trim_result = self.maybe_trim_decode_session_cache(
required_tokens=required_tokens,
force=available_tokens_before < required_tokens,
exclude_session_ids={recv_req.session_id},
exclude_session_ids=exclude_ids,
)
available_tokens_after = int(self.token_to_kv_pool_allocator.available_size())
decode_retracted_queue_reqs = len(self.disagg_decode_prealloc_queue.retracted_queue)
@@ -3556,6 +3647,7 @@ class Scheduler(
)
reason = None if can_admit else "no-space"
transfer_queue_depth = len(self.disagg_decode_transfer_queue.queue)
return DirectAppendAdmissionReqOutput(
can_admit=can_admit,
resident=True,
@@ -3570,10 +3662,36 @@ class Scheduler(
),
num_running_reqs=len(self.running_batch.reqs),
decode_prealloc_queue_reqs=len(self.disagg_decode_prealloc_queue.queue),
decode_transfer_queue_reqs=len(self.disagg_decode_transfer_queue.queue),
decode_transfer_queue_reqs=transfer_queue_depth,
decode_retracted_queue_reqs=decode_retracted_queue_reqs,
recommended_pause_ms=self._compute_backpressure_pause_hint(
transfer_queue_depth=transfer_queue_depth,
retracted_queue_depth=decode_retracted_queue_reqs,
token_usage_after=(
1.0 - available_tokens_after / max(1, self.max_total_num_tokens)
),
),
)
def _compute_backpressure_pause_hint(
self,
*,
transfer_queue_depth: int,
retracted_queue_depth: int,
token_usage_after: float,
) -> int:
# If D is already retracting requests, pause aggressively.
if retracted_queue_depth > 0:
return 1500
# KV pool above 90%: pause proportional to overshoot.
if token_usage_after >= 0.90:
overshoot = int((token_usage_after - 0.90) * 10000)
return max(200, min(2000, overshoot * 5))
# Transfer queue heavy: pause linearly with depth.
if transfer_queue_depth >= 8:
return min(2000, transfer_queue_depth * 100)
return 0
def maybe_sleep_on_idle(self):
if self.idle_sleeper is not None:
self.idle_sleeper.maybe_sleep()