Commit Graph

209 Commits

Author SHA1 Message Date
41a0c1c48f Migration correctness smoke tests: direct-read, partial-transfer, NIXL
Standalone smoke tests validating KV-migration correctness paths before
trace replay: full migrate-cache, partial-prefill transfer, and a
NIXL-connector variant, each with a runner.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:13 +08:00
1262c9c22e Migration transfer-cost study: KV transfer is slow on busy GPUs
MIGRATION_TRANSFER_COST.md: under real load, migration KV transfer runs at
~3 GB/s vs ~10 GB/s idle. Decomposed (instruments + MB6 microbench) into
~55% RDMA-actual (HBM/PCIe contention with running kernels: 7.6->4.0 GB/s)
+ ~45% control-plane GIL starvation during long prefills. Reproduced on a
fresh upstream venv (byte-identical transfer path) -> upstream/hardware
inherent, not our patch. Layerwise is the wrong lever; the tax is structural
on a loaded agentic cluster. Includes mb6_transfer_under_load + run_mb6,
instrument_dst_migration/mooncake, and the dst/transfer decomposition analyzers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:01 +08:00
67fcec7933 Unified-routing A+B ablation: decode-aware LMetric + v3 anti-hotspot
cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the
LMetric fallback score) and a v3 anti-hotspot recent-migration penalty
(effective_load = num_req + recent-migration count over a sliding window),
preventing back-to-back migration clustering. UNIFIED_ABLATION.md documents
the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix
sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by
~20%. Runners/analyzer for the b3 trace replay included.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:52:44 +08:00
a2f2645fda PD_DISAGG_RESULTS §6.3: producer hot-pinning figure
Direct per-producer KV-pool evidence for the session-affinity backfire.
At the same 4P+4D ratio:
- round-robin: 4 producers within 1pp of each other (spread 0pp, CV 0.01)
- session-affinity: spread 49pp (one producer ~93%, another 45%; CV 0.25)

A 25x jump in producer load imbalance — heavy multi-turn sessions
concentrate onto single producers, the same hot-pinning pathology as
sticky routing in the colocated §3.3 study.

plot_producer_hotspot.py: reduce (numpy, per-producer KV timeline from
snapshots, runs on the serving host) + plot (matplotlib, 2-panel rr vs
session comparison) — same two-stage pattern as aggregate_mb5.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 00:38:20 +08:00
6243b78bba PD_DISAGG_RESULTS §6: session-affinity routing does not rescue PD
Swept session-affinity P routing (MB5_P_ROUTING=session) across all
four ratios on the metrics-fixed stack. Findings:

- Strictly worse than round-robin at every ratio. 4P+4D: round-robin
  100% vs session-affinity 36% completion.
- Success DECREASES monotonically as decode capacity grows
  (6P+2D 59% -> 4P+4D 36% -> 3P+5D 24% -> 2P+6D 19%) — refutes the
  "session prefill is faster so it needs more D" hypothesis.
- GPUs sit at ~0% utilization (2P+6D entirely idle) — the cluster
  stalls on KV-transfer/admission coordination, not compute. This is
  the deepest anti-PD argument: paid-for hardware does nothing while
  requests pile up; colocation keeps every GPU busy.
- Mechanism: session-affinity pins heavy multi-turn sessions onto
  single producers (producer hot-pinning, same pathology as sticky
  routing in the colocated §3.3 study); fewer producers -> worse
  concentration -> the monotonic decline. Failed transfers also pin
  producer KV (kv_load_failure_policy=fail), compounding to deadlock.

Verdict: neither ratio tuning nor routing policy rescues static
PD-disagg for this agentic workload — the failure is structural.

mb5_launch.sh: add 5P+3D / 3P+5D ratios for the sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 00:25:10 +08:00
4cd71b6631 Working-set figure: extend left panel to ~50 nodes
Include T=600s/1800s points so the diminishing-returns tail is visible:
14 -> 52 nodes buys only +6pp APC (74%->79.8%), still under the 80.4%
ceiling that oracle/LRU reaches at 14 nodes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 17:11:12 +08:00
2247d1de08 Working-set figure: right panel = W(t) time series
Replace the (redundant) nodes-vs-T cost curve with the working-set
W(t) over wall-clock time for T=2/30/300s. Shows footprint is steady
(peak ~ median) after a short warm-up, so peak-based sizing is sound;
the 300s curve hugs the 14-node ceiling throughout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 16:31:26 +08:00
c94b2e237a Working-set figure: linear node axes + benefit/cost split
Drop log node axis (decade ticks were unreadable). Left = APC vs #nodes
(linear), right = #nodes vs retention window T. Mark the 1-node budget
crossing (~7s reuse, ~8% APC) and the 14-node oracle ceiling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 16:24:15 +08:00
3b8be5bb61 Working-set figure: express footprint in node count, not GB
Both axes now in "# nodes" (footprint / per-node KV pool) so the
cluster-size implication is direct: 1-node budget line + 14-node oracle
ceiling, instead of raw GB.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 16:16:00 +08:00
dae98c6472 Working-set sizing tool + GLM-5.1-FP8/B300 result
Configurable KV working-set analyzer (GPU model x TP/PP/EP x model
config.json with MLA/GQA auto x KV/weight dtype). Computes Denning W(T),
oracle [first,last], and retain-forever footprints vs a per-replica KV
pool, plus the APC captured at each retention window.

GLM-5.1-FP8 (MLA, 43.9 KiB/token) on 1x B300 node (1528 GB KV pool):
live KV fits trivially (~533 GB), but the full 80.4% APC ceiling needs
~14 nodes (oracle) -> long-tail reuse motivates DRAM offload, not HBM.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 16:03:25 +08:00
2e6a369046 PD_DISAGG_RESULTS §5.1: D-pool pressure crashes consumers
Document the consumer EngineCore crash chain (D-pool 97% -> 112k-token
KV transfer fails -> negative prompt-token counter -> prometheus
ValueError -> engine dead -> cliff failure). Explains the round-robin
6P+2D rep variance (100/56/80%) as intermittent consumer death, and
notes the counter-clamp patch needed to compare routing arms fairly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 13:02:21 +08:00
3957c2df86 MB5 patch: clamp PD-consumer metrics counter underflow
Root cause of the 6P+2D run-to-run collapse (rep1 100%, rep2 56%,
rep3 80%, session-routing 6.6%): not load-shedding, but a consumer
EngineCore crash.

Failure chain observed in the consumer logs:
  1. D-pool fills to ~97% (decode-side capacity ceiling, the H1 story)
  2. a large request's KV transfer fails: "Mooncake transfer engine
     returned -1" (112k-token request, pool full)
  3. scheduler fails the request (kv_load_failure_policy=fail)
  4. PromptTokenStats.local_cache_hit = num_cached + recomputed -
     num_external_computed goes NEGATIVE (external transfer exceeded
     cached count)
  5. loggers.record() calls Counter.inc(negative) -> prometheus raises
     "Counters can only be incremented by non-negative amounts."
  6. EngineCore dies -> every subsequent request fails (the cliff:
     all successes in the first ~110s, zero after)

This turns ONE failed request into a total config collapse, and is
what made the round-robin 6P+2D reps look randomly variable.

Fix: clamp the three per-source prompt-token counts to >= 0 in
loggers.record() before they hit Counter.inc(). Pure insertion,
revertible via the existing sentinel mechanism. Lets a transfer
failure stay a single failed request instead of killing the engine,
so routing arms can be compared on equal footing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 13:01:23 +08:00
8596135680 MB5 analysis: per-role KV split proves static-partition mismatch
aggregate_mb5.py:
- Split the cluster KV timeline by role (P-pool vs D-pool) using a
  PID->role map parsed from vllm_logs filenames. The cluster average
  hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool
  is actually pegged at ~100% while prefill idles at ~30%.
- Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving
  host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced
  (matplotlib) renders locally. matplotlib import is now lazy.
- New plot_role_split figure + p/d peak/steady columns in the CSV.

PD_DISAGG_RESULTS.md: consolidated writeup with figures inline.
Verdict: no static P:D ratio beats 8C colocation. The binding
constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D,
P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays
elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner,
the MB1 phase-isolation benefit is real) but loses TTFT and sheds
load. Round-robin P routing also zeroes prefix-cache reuse; a
session-affinity re-run of 6P+2D is in flight to test the fix.

Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization,
mb5_latency_compare + mb5_summary.csv.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 12:05:17 +08:00
e8980ce957 MB5 proxy: session-affinity P routing (MB5_P_ROUTING=session)
The upstream mooncake_connector_proxy round-robins both P and D
selection. For agentic multi-turn sessions this destroys prefix-cache
reuse on the producer side — every turn of a session lands on a
different P, so the prefix-cache hit ratio collapses to 0 (observed in
the 6P+2D round-robin baseline) and every turn re-prefills from
scratch, piling extra load on the P pool.

Add an env-gated routing mode so the same proxy serves both arms of a
clean A/B:
  MB5_P_ROUTING=rr       round-robin (default, = upstream behavior)
  MB5_P_ROUTING=session  consistent md5 hash on X-Session-Id -> same
                         producer for all turns of a session

Decode side stays round-robin (load balance) in both modes — decode
KV is freshly transferred per turn, so D gains nothing from affinity
but everything from even load spreading.

mb5_launch.sh threads MB5_P_ROUTING through to the proxy and logs the
active mode. Default path is byte-for-byte the old behavior, so an
in-flight round-robin sweep is unaffected if this is redeployed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 11:05:25 +08:00
b13ca10d19 PD_DISAGG_INVESTIGATION: snapshot Phase 0 done + sweep in flight
Phase 0 infrastructure (vendored proxy, dual-file vLLM patcher,
per-instance + cross-config plotters) is fully assembled and
smoke-validated. Sweep RUN_TAG=20260527_164040 (4 configs × 3 reps
on w600) is running on dash1.

Also realigned the figure list with what `aggregate_mb5.py`
actually produces (mb5_kv_timeline, mb5_peak_utilization,
mb5_latency_compare, mb5_summary.csv).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:51:28 +08:00
a66f24d242 MB5 aggregate: cross-config KV-pool + latency comparison
Reads sweep root + tag, for each (config, rep):
- merges per-PID snapshots into cluster-wide KV timeline (carry-forward
  for PIDs without a sample in the bin)
- computes peak (max) and steady-state (10-90% median) pool utilization
- pulls latency p50/p90/p99 from replay_metrics.summary.json

Produces 4 outputs in --out-dir:
- mb5_kv_timeline.png    — N-panel cluster KV % over time, one panel per
                            config, faint per-rep lines + bold median
- mb5_peak_utilization.png — bar chart (peak vs steady) with ±std error bars
- mb5_latency_compare.png  — bar chart p50/p90/p99 e2e latency per config
- mb5_summary.csv          — flat per-(config, rep) table for the writeup

Validated on 4P+4D × 20-req smoke:
  4P+4D rep1: peak=12.8% steady=10.7% peak_wait=1
  p50=1.3s p90=10.5s p99=17.1s (vs. <1s for 8C — expected gap).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:49:21 +08:00
a9c7310f4a MB5 PD-disagg pipeline: working end-to-end
Three independent bugs were blocking PD-disagg smoke; each fix is
isolated so the next PD experiment doesn't re-hit them.

1. mb5_launch.sh
   - stop_all() also kills mb5_pd_proxy.py (our vendored copy),
     not just the upstream filename, and asserts ports 8000-8007 +
     PROXY_PORT are free before launching — stale proxies were
     silently passing the readiness check.
   - Proxy readiness uses a generic "any HTTP response" probe;
     mooncake_connector_proxy only exposes /v1/completions so
     /v1/models 404 is expected.

2. mb5_pd_proxy.py (vendored from third_party so deploy.sh ships it)
   - Force min_tokens=1 on the prefill leg. Clients that set
     min_tokens == max_tokens (our replayer does) collide with
     vLLM's min_tokens<=max_tokens check after the proxy caps
     max_tokens=1.

3. instrument_kv_snapshot.py
   - Adds a second patch target: initialize
     MooncakeConnectorWorker.bootstrap_server = None in __init__.
     vLLM 0.18.1 only sets it under the is_kv_producer branch, so
     kv_consumer hits AttributeError as soon as the first remote
     prefill request lands.
   - apply/revert refactored to iterate over (path, patches) pairs.

plot_kv_pool_timeline.py also handles snapshot files that never
captured a running request (would otherwise IndexError on an empty
stackplot input).

Smoke: 4P+4D × 20 reqs → 20/20 success, mean 3.9s, p99 17s, 8 PIDs
all writing snapshots (601 total), well above the 8C baseline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:14:22 +08:00
e0d3b5150a MB5 driver fixes: bash env-prefix + replayer flag names + python date math
Two bugs caught by 8C smoke:

mb5_launch.sh
  ${env_bp_arg} expanded as a literal command line prefix doesn't work
  when env_bp_arg is itself a variable — bash only treats VAR=val as
  an env assignment if it sees the literal in the parsed command, not
  after expansion. Fix: always export VLLM_MOONCAKE_BOOTSTRAP_PORT as
  a literal, defaulting to 9999 when caller passed no port (consumer
  mode ignores the var so the placeholder is harmless).

mb5_run.sh
  replayer's actual CLI flags are --trace / --output / --endpoint /
  --model, not the --*-path / --*-name variants I had. Plus dash1
  has no `bc`; compute wall_clock_s via python instead.

Both fixed; 8C smoke (CONFIG=8C REPS=1 REQUEST_LIMIT=20) now runs
end-to-end in ~30 s:
  - 8 vLLM kv_both instances on GPU 0-7 come up
  - replayer round-robins 20 reqs across them
  - MB5 instrumentation captures 8 snapshot files (one per EngineCore
    PID), ranging 7-139 snapshots each = ~10 Hz throttle works
  - plot_kv_pool_timeline.py renders the stacked-area + queue-depth
    chart cleanly (figs/mb5_smoke/*.png)

Pipeline validated. Ready for the real PD-ratio sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 23:23:23 +08:00
e9abd70c8d MB5 driver: launcher, orchestrator, KV-pool timeline plotter
Three new files to drive the PD ratio sweep + per-request KV occupancy
capture, plus a deploy.sh update so the patched replayer rides along
to the fresh-venv host.

mb5_launch.sh
  One script handles all four configs we plan to sweep:
    CONFIG=8C / 6P+2D / 4P+4D / 2P+6D
  - For 8C: 8 vLLM instances with kv_role=kv_both on GPU 0-7. Replayer
    talks to them via the existing comma-separated round-robin in
    replayer/replay.py — no proxy.
  - For PD configs: kv_role=kv_producer for the P pool (with
    VLLM_MOONCAKE_BOOTSTRAP_PORT) + kv_role=kv_consumer for the D pool,
    routed by the official vLLM example
    third_party/vllm/examples/online_serving/disaggregated_serving/
    mooncake_connector/mooncake_connector_proxy.py — no policy choice
    made by us, per user instruction to use the standard recipe.
  - Applies instrument_kv_snapshot.py before launching so every
    EngineCore writes its per-step KV snapshot to
    $RUN_ROOT/kv_snapshots/mb5_kv_snapshot_pid<pid>.jsonl
  - Reverts the patch on stop.
  - Emits ENDPOINTS= line on stdout for the orchestrator to read.

mb5_run.sh
  For each CONFIG × rep: launch, replay w600 trace via the existing
  replayer, capture wall-clock, tear down, cool down 10 s. Defaults:
    CONFIGS="8C 6P+2D 4P+4D 2P+6D"
    REPS=3
    TRACE=traces/w600_r0.0015_st30.jsonl
  All artefacts go under $FRESH_ROOT/mb5_runs/$RUN_TAG_${config}_rep${rep}/
  (vllm_logs/, kv_snapshots/, replay_metrics.jsonl, wall_clock_s.txt).

plot_kv_pool_timeline.py
  Reads one or more mb5_kv_snapshot_pid*.jsonl files and renders a
  stacked-area chart per file:
    x = wall-clock since first snapshot
    y = KV block count, stacked by per-request contribution
    overlay: pool-total ceiling, 90% line, waiting-queue depth subplot
  Bands are colored by a deterministic hash of request_id so individual
  requests are visually tractable across the run.
  This is the figure the user asked for — turns headline "PD-disagg is
  10× worse" into a system-level picture of *where* the KV pool is
  blocked, when, and by which requests.

deploy.sh
  Also tar-syncs the local replayer/ dir to
  /home/admin/cpfs/wjh/agentic-kv-fresh/replayer/ so mb5_run.sh can
  `python -m replayer` against the patched (trace_span_s/amplification)
  version, not the older copy under /home/admin/cpfs/wjh/agentic-kv/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 23:02:57 +08:00
a4f5dd56aa MB5 instrumentation: per-request KV-block snapshot from vLLM V1 scheduler
The §3.2 H1 (D-pool capacity wall) argument needs system-level evidence,
not just headline latency. This patch lets us record, every ~100 ms,
the exact composition of each vLLM instance's KV pool:

  - total / free / used block counts
  - for each RUNNING request: blocks held, computed tokens, prompt tokens
  - for each WAITING request: prompt tokens, status

Hook: inside Scheduler.schedule() right before the return. Per-request
blocks come from coordinator.single_type_managers[*].req_to_blocks
(vLLM 0.18.1's own per-request bookkeeping; no new tracking layer).
Throttled by MB5_PERIOD_MS env var (default 100 ms = 10 Hz) so a
13-min trace replay produces ~8 k snapshots per instance instead of
~80 k unthrottled.

Output: $MB5_LOG_DIR/mb5_kv_snapshot_pid<pid>.jsonl
(default MB5_LOG_DIR=/tmp). One file per EngineCore PID.

Apply/revert idempotent, same pattern as instrument_mooncake.py.
Markers: # MB5_INSTRUMENT_START / # MB5_INSTRUMENT_END.

Validated on dash1 venv: apply → py_compile ok → revert → py_compile ok.

With this in place we can build the stacked-area "KV pool composition
over time" figure the user asked for: x = wall-clock, y = block count,
colored bands = per-request portions. Comparing 8C colo vs 4P+4D
on the same trace will directly show whether (and when) the D pool
hits its ceiling — turning "PD-disagg is X× worse" into "PD-disagg
is X× worse BECAUSE these specific requests at this specific time
filled the pool and forced this queue depth".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:30:53 +08:00
4a93096c1e Add PD_DISAGG_INVESTIGATION.md — living TODO for proving H1–H4
We don't have paper-grade evidence yet that PD-disagg fails in agentic.
MB1+MB2 corrected accounting puts phase-isolation cost-benefit on
PD-disagg's side; the only direct support is colleague's one data point
on a patched dash0 build (TTFT p50 62×, success 52%) and the f4b
geometric capacity argument. To close §3.2 properly we need fresh-venv
empirical replication PLUS system-level instrumentation that tells the
reviewer *which* component is the bottleneck — not just headline
latency.

This document tracks the four candidate failure hypotheses (H1 D-pool
capacity, H2 static-partition mismatch, H3 cache reuse + P-pool hotspot,
H4 end-to-end throughput loss), their current evidence status, and the
phased experiment plan to address each.

Key findings already recorded:
- Phase 0 TODO 0.1 (find standard PD-disagg deployment) is done — vLLM
  ships an official example at
  examples/online_serving/disaggregated_serving/mooncake_connector/
  with a kv_producer+kv_consumer launcher and a Mooncake-aware proxy
  that supports arbitrary P:D ratios via env vars. Per user direction,
  we will NOT polish PD-disagg policy ourselves; we use the official
  recipe as the "PD-disagg" baseline in §3.2 / §5.2.
- Phase 1 (MB5+3 combined: PD ratio sweep with D-pool occupancy logging)
  is the critical path. Designed to either confirm H1 with system
  breakdown evidence (D-pool ≥ 90% for ≥ 30% of trace + queue depth
  spike) or falsify it (some ratio matches 8C colo, in which case §3.2
  needs rewriting).
- D-pool occupancy timeline is the single most important new
  instrumentation — turns "PD-disagg is 10× worse" into "PD-disagg is
  10× worse BECAUSE the D pool sits at >90% for X% of the trace".

Configurations to run on dash1 8-GPU first:
  8C (colo baseline), 6P+2D, 4P+4D, 2P+6D  ×  3 reps  ×  w600 trace.

Open question still in the doc: vLLM 0.18.1 had an AttributeError on
self.bootstrap_server in kv_consumer mode when we hit it during MB2
sanity; likely the issue was bad kv_transfer_params from our side
(missing transfer_id, wrong field names), which we have since fixed.
Official proxy uses the same handshake we now have, so it should
just work. If not, single-line patch to initialize self.bootstrap_server
= None for consumer mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:24:31 +08:00
f739f7d461 Proxy/runner support for Nixl connector + unified_v3 (offload-decode) policy
scripts/b3_isolated_policy.sh:
  Recognize unified_v3 as a kv_both-requiring policy; respect explicit
  KV_CONNECTOR=Nixl override (so unified_v2 / unified_v3 / unified_kv_both
  can run against either Mooncake or Nixl back-end). When Nixl is
  selected, skip the bootstrap-ports plumbing — Nixl uses its own UCX
  side-channel and the proxy forwards kv_transfer_params from the src
  response body instead of pre-baking engine_id/bootstrap_addr.

scripts/cache_aware_proxy.py:
  - New unified_v3 policy (~250 lines): prefill stays on session-affinity
    host (preserves intra-session prefix-cache reuse), decode is migrated
    to a lower-load target when the affinity host is busy with concurrent
    decodes. KV transfer flows prefill_host → decode_target, opposite of
    v2. Knobs: v3_min_new_tokens, v3_min_prefill_decode_busy,
    v3_target_load_ratio, v3_min_load_gap, v3_rotate_affinity,
    v3_prefer_cache_target. cache_miss_audit found rotation hurts cross-
    turn locality (9.5% hit with vs ~80% without) so default
    v3_rotate_affinity=False.
  - New connector_type setting ("mooncake" | "nixl") gating the PD-sep
    handshake form: mooncake uses pre-baked kv_transfer_params,
    nixl forwards them from the response body.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:05:19 +08:00
da39ab6804 Correct PD-disagg cost/benefit framing across repo
The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong:
  I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
  of the new request (~50–200 ms)" — implicitly treating the benefit as
  per-request and bounded by that request's own decode. The correct
  accounting is per-prefill-event across all stalled streams:

      benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                          ≈ D × T_prefill

  which follows from the chunked-prefill math (each of L/N chunks slows
  D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
   2k tok      | 0.14 s    |     8 ms   |   1.1 s     |    0.7 %
  33k tok      | 4.5  s    |  320 ms    |  36   s     |    0.9 %
 125k tok      | 57   s    |  1.9 s     | 456   s     |    0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.

Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
  function; keep mb1_interference.png and update its title to note
  per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
  no more "max benefit = decode duration" claim); §3.2 implications
  section replaced with the corrected per-prefill-event table; explicit
  ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
  rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
  capacity argument (the real failure mode), MB1/MB2 demoted from
  "kill-shot for PD-disagg" to "supporting context inputs to a
  cost-benefit table that actually favors PD-disagg on this axis";
  §6 paper-claims list reordered to remove the wrong "PD-disagg loses
  on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:04:49 +08:00
abde010b64 Add RESULTS_SUMMARY.md — concise Chinese summary of current findings
One-page distillation of what the paper can claim today, with figure /
data path next to each row. Sections:

  1. Workload 性质 — intra-session reuse, skew, KV footprint
  2. Dispatch Coupling — agentic vs chatbot inter-turn gap regime
  3. 现有调度三类失败 — load-balance / static PD-disagg / pure sticky
  4. PD-disagg cost vs benefit — MB2 (transfer 9.7 GB/s ceiling,
     topology-independent) + MB1 (decode halted during prefill 15-200x),
     joined into the §3.2 cost > benefit headline for any KV ≥ 80 MiB
  5. EAR 实证状态 — Pillar 1 (affinity) validated, Pillar 2 (migration)
     substrate validated + strategy-layer pending
  6. 已能写的 paper 主张(按 confidence 排序)
  7. 待做(MB3-5, migration e2e, wall-clock sweep, scale-out)

Designed to be the one doc to read when re-entering the project after
a break.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:38:28 +08:00
029821c1b6 MB1: prefill-decode interference under chunked-prefill default; §3.2 headline
Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.

Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
  = prefill_ttft / (num_tokens_during_prefill / D). This is the average
  rate at which each decode stream produces tokens during the burst.
  p50 of inter-token intervals is deceptive (chunked-prefill makes most
  intervals look normal); the burst-average gives the true cost.

Results (D=8 row, the most agentic-realistic case):
  P (tokens) | prefill_ttft | per-stream TPOT during | penalty
       2048  |    143 ms    |      32 ms             |    4×
       8192  |    583 ms    |     114 ms             |   15×
      32768  |  4520 ms     |     388 ms             |   52×
      65536  | 15615 ms     |     757 ms             |   99×
     131072  | 56991 ms     |    1419 ms             |  183×

Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.

§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).

The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.

Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
  kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
  microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
  per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
  pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)

Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
  interleave decode more aggressively. Chunk-size sensitivity is
  flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:25:09 +08:00
90127c3389 MB2 inter-node: dash1↔dash2 transfer cost is identical to intra-node
Sweep on dash1 GPU 0 → dash2 GPU 0 over 200 Gbps RoCE.
remote_bootstrap_addr=http://172.27.123.142:8998. Same 9-size × 5-rep
config as the 2026-05-27 intra-node run.

Per-size pure_transfer (p50) lines up within 1–3% of the intra-node
numbers across all sizes:

  size      intra p50   inter p50
   512 tok    5.3 ms      5.2 ms
  2048 tok   20.6         20.0
  8192 tok   83.7         80.9
 32k  tok  320.9        309.6
 64k  tok 1895          1734       (bimodal in both)
128k  tok 2835          2818       (bimodal in both)

=> Mooncake's batch_transfer_sync_write **does not use NVLink** for
intra-node peers; both paths go through the 200 Gbps RDMA NIC, with
the 200 Gbps NIC (not the GPU interconnect) being the bottleneck. The
~9.7 GB/s steady-state ceiling and the 6+ GiB variance regime are
identical across topologies.

Operational implication for §3.2: PD-disaggregation does not get
cheaper by co-locating P and D on the same node — every routed request
pays the same ~10 GB/s ceiling for KV transfer, no matter where it
lands. Halving the transfer cost cannot be bought back by topology.

Caveat: B's receive_kv events did not log on dash2 — `MB2_LOG_DIR`
env var did not propagate through vLLM's EngineCore subprocess on
the consumer host (cat /proc/$ENGINE_PID/environ is empty on dash2
for that var, but the producer host on dash1 worked). For this run
pure_transfer numbers are from A's send_blocks alone; full rx_total
breakdown is not available, but pure_transfer is the dominant term.

Adds:
- analyze_mb2_send_only.py — analyzer that works from A's send_blocks
  alone when B's receive_kv events are absent
- plot_mb2_compare.py — overlay intra vs inter on the same axes
- plot_mb2.py — tolerate the `rows`-less send-only schema
- figs/mb2_transfer_{time,bw}_inter.png — inter-node single-curve
- figs/mb2_transfer_{time,bw}_compare.png — intra vs inter overlay
- analysis/mb2/A_inter_kvboth.jsonl, inter_kvboth_client.json,
  inter_kvboth_breakdown.json
- analysis/mb2/README.md — Summary block updated to reference both
  paths, dated 2026-05-27 run-log entry appended with the full table
  and the topology-independence framing

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 20:56:08 +08:00
50f72d8875 MB2 inter-node scaffolding: per-host single-instance launcher + client host args
Adds the pieces needed to run the producer on dash1 and the consumer on
dash2 with the same shared cpfs venv:

start_vllm_single.sh
  INSTANCE / GPU / PORT / BP / MASTER / ROLE env vars; brings up ONE
  vLLM instance + applies the mooncake instrumentation patch (idempotent
  since the venv is cpfs-shared, so the first invocation applies and the
  second is a no-op). Per-instance MB2_LOG_DIR keeps producer/consumer
  events separate even though both directories live on the same cpfs
  path visible to both hosts.

mb2_kv_transfer.py
  New --src-host / --dst-host args. Defaults stay 127.0.0.1 for
  backward-compat with the intra-node sweep. /v1/completions URLs and
  /query URLs now use the supplied hosts. remote_bootstrap_addr is
  built as http://<src_host>:<src_bp> so the consumer's
  do_remote_prefill request carries a routable address.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 20:26:54 +08:00
3f791ee074 MB2 doc: analysis/mb2/README.md as persistent record
Lifts the MB2 intra-node results out of commit messages into a single
place the paper can cite. Structure:

  Summary  — one-line table + headline numbers for §3.2
  Setup    — exact hardware/software/config
  Method   — 3-step bench, instrumentation, pair-by-time-window
  Results  — full per-size table (latest run dated)
  Known limitations — kv_both vs strict, serial-only, intra-only,
                       sanity preamble in the logs
  §3.2 implications — transfer/decode ratio table at agentic sizes
  Open questions / next runs — inter-node, bandwidth-ceiling
                                investigation, concurrent transfers,
                                strict kv_producer/consumer
  Reproduction — exact commands
  Run log     — dated entries; new runs append here

The latest "intra-node" entry references `de164e5` for the raw
artifacts + figures.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 20:23:50 +08:00
de164e5a64 MB2: pure KV-transfer cost on dash1 intra-node — Mooncake ~9.7 GB/s steady
Full sweep result on dash1 GPU 0+1 with vanilla vLLM 0.18.1 +
mooncake-transfer-engine 0.3.11, kv_both connector. Per-stage decomposition
via the instrumentation patch (analyze_mb2.py pairs A's send_blocks with
B's receive_kv enter/finish by time window).

Steady-state (1k..32k tokens, 96 MiB..3 GiB KV):
   pure_transfer ≈ size / 9.7 GB/s
   rx_overhead   ≈ 2–3 ms (ZMQ handshake + P-side setup)
   bandwidth     ≈ 9.6–10.1 GB/s, very stable

Large-size regime (65k..131k tokens, 6..12 GiB):
   p50 bandwidth collapses to 3.4–4.5 GB/s
   max bandwidth still hits ~9.7 GB/s (some runs achieve it)
   p99 agentic request (11.5 GiB) lands here

Implication for §3.2 PD-disaggregation cost argument:
   median agentic decode = 50–200 ms (tool-call JSON output)
   median agentic-tail KV transfer (p99 11.5 GiB):
     best case (9.7 GB/s)  ≈ 1.19 s
     observed range         1.5 – 10 s
   ⇒ KV transfer is 8–100× larger than the decode it enables.

This is intra-node — the lower-bound transfer cost. Inter-node RDMA
will be slower; that's MB2 phase 2.

Adds:
- analyze_mb2.py: pair A.send_blocks ↔ B.receive_kv by time window;
  per-size aggregation (n, ms_p50, ms_min/max, GB/s_p50/max)
- plot_mb2.py: log-log transfer-time chart + bandwidth-vs-size chart
- analysis/mb2/A_intra_kvboth.jsonl, B_intra_kvboth.jsonl: raw events
  (51 + 102 events including the sanity preamble)
- analysis/mb2/intra_kvboth_breakdown.json: paired and aggregated
- figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 19:04:03 +08:00
91673f1fb8 MB2: working end-to-end intra-node KV transfer microbench
This commit closes the loop on the fresh-venv MB2 path. Three corrections
on top of the previous scaffold made the bench fire successfully on
dash1 GPU 0+1 with kv_both connector roles:

1. Re-target instrumentation patch to vLLM's shipped MooncakeConnector
   (vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py).
   The mooncake-package's own mooncake_connector_v1.py turned out not to
   be the implementation vLLM 0.18.1 loads — the
   '{"kv_connector": "MooncakeConnector"}' config picks up the vLLM-shipped
   one. Patches go at _send_blocks (P-side) and receive_kv_from_single_worker
   (D-side, async, both entry and FINISH branch).

2. /query lives on the mooncake bootstrap port, not the vLLM HTTP port.
   Add --src-bp / --dst-bp args; default 8998 / 8999.

3. kv_transfer_params schema for the vanilla connector:
     do_remote_decode  → {transfer_id}
     do_remote_prefill → {transfer_id, remote_engine_id, remote_bootstrap_addr}
   where remote_bootstrap_addr must include the http:// scheme. The dash0
   smoke_test_migrate_cache.py was written for the patched build, which
   used a different field-name set (remote_host, remote_port,
   remote_block_ids); those are rejected here.

Also discovered (and worked around): vLLM 0.18.1 with kv_role=kv_consumer
raises AttributeError on `self.bootstrap_server` because that attribute
is only assigned conditionally inside `if not self.is_kv_consumer`. We
sidestep by running kv_both for the microbench — transfer mechanics are
identical (same batch_transfer_sync_write call); the role gate only
affects which request types each instance accepts. For §5 strict PD-disagg
baseline we'll need either to fix this bug or front the pair with a
role-aware proxy.

Sanity smoke (3 sizes × 2 repeats, dash1 GPU 0+1, kv_both intra-node):
  input    KV-MiB  send_blocks_ms (P)  receive_kv_ms (D)  client_step2_ms
   512        48          5–23                  7–33               18–91
  2048       192            21                    23                  37
  8192       768            85                    88                 110
=> intra-node bandwidth ~9 GB/s on the actual transfer for 768 MiB,
   which is well below NVLink p2p; likely PCIe-staged. Worth verifying.

Next step (in flight): full sweep 512..128k tokens × 5 repeats with
the per-stage analyzer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 18:53:25 +08:00
622e0bc04c MB2: parameterize vLLM roles (kv_producer + kv_consumer default)
start_vllm_pair.sh
  ROLE_A / ROLE_B env vars (default kv_producer / kv_consumer for strict
  PD-disagg). Override to kv_both for the kv_both control. The role is
  injected into --kv-transfer-config so vLLM imposes the role restriction.

mb2_kv_transfer.py
  --skip-verify flag drops step 3 (the plain completion sanity-check on
  the destination), required when the dst is kv_consumer-only since a
  kv_consumer instance refuses to serve a request without
  do_remote_prefill. The transfer-time itself is still measured from
  step 2 (do_remote_prefill on the consumer).

Also: per-step client-side wall-clock timestamps (t_step1_client_unix,
t_step2_client_unix, t_step2_end_unix) are now captured so the
post-hoc breakdown analyzer can join with the per-instance JSONL logs
on absolute time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 18:17:42 +08:00
efdcf3c555 MB2: per-stage instrumentation patch + launcher integration
Per-stage breakdown of "step 2" (the B-side do_remote_prefill) requires
vLLM/mooncake-internal timing — we cannot infer it from black-box HTTP
E2E. This commit adds the four pieces to do that breakdown:

instrument_mooncake.py
  apply / revert / check patches on mooncake_connector_v1.py to emit
  structured JSONL transfer events at two key sites:

    send_blocks (P-side, on batch_transfer_sync_write):
      {event, remote_session, total_bytes, duration_s, t_start_unix,
       ret, tp_rank, t_log_unix}
    receive_kv (D-side, on the ZMQ-driven pull request):
      {event, path, local_req_ids, remote_req_ids, duration_s,
       t_start_unix, tp_rank, t_log_unix}

  All injected code is bracketed by `# MB2_INSTRUMENT_START/END` so the
  --revert pass is a single regex scan. Apply-revert round-trip
  validated on dash1 (PATCHED → py_compile ok → revert → CLEAN → ok).

start_vllm_pair.sh (updated)
  - Picks up instrument_mooncake.py via SCRIPT_DIR.
  - On `start`: applies patch before launching the two vLLM instances.
  - On `stop` (or trap exit): reverts patch.
  - Sets per-instance MB2_LOG_DIR = $FRESH_ROOT/mb2_transfer_logs/{A,B}/
    so send-side and receive-side events land in cleanly separated dirs.

deploy.sh
  tar-over-ssh sync of microbench/fresh_setup/ → cpfs
  /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/ so dash1 / dash2 see
  the same scripts (dash{1,2} don't have rsync; tar pipe works).

The mb2_kv_transfer.py client still uses black-box E2E timing — the
next commit will teach it to ingest the per-instance JSONL logs to
produce the 4-way breakdown (queueing / setup / transfer / decode).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 18:12:44 +08:00
7437422618 MB2 scaffolding: launch script for vLLM pair + KV-transfer-time client
Two new files prepare measurement of T_transfer(KV_size, network_path),
the gap §3.2's PD-disagg cost argument has had since day one.

microbench/fresh_setup/start_vllm_pair.sh
  start | status | stop two vLLM 0.18.1 instances on local GPUs (A, B)
  with --kv-transfer-config '{"kv_connector":"MooncakeConnector",
  "kv_role":"kv_both"}' running off the fresh venv (vanilla wheel +
  vanilla mooncake 0.3.11, NOT the dash0 patched build). GPU IDs and
  ports are env-overridable so the same script drives the intra-node
  pair (GPU_A=0 GPU_B=1 on one host) and the inter-node pair (GPU_A=0
  on dash1, GPU_B=0 on dash2 — launched per host separately).

microbench/fresh_setup/mb2_kv_transfer.py
  Three-step measurement borrowed from connector_tax/.../smoke_test_
  migrate_cache.py:
    1. do_remote_decode  on A   (compute & cache KV; max_tokens=1)
    2. do_remote_prefill on B   (pull KV from A — this is the timed step)
    3. plain completion on B    (sanity check: cached_tokens ≈ prompt len)
  Sweeps input_tokens ∈ {512, 1k, 2k, 4k, 8k, 16k, 32k, 64k} with 5
  repeats each; reports mean / p50 / p90 transfer time and a per-size
  raw log. Per-token KV is 98304 B (Qwen3-Coder-30B-A3B), so the upper
  end ≈ 6 GiB transfers — within the p99 11.5 GiB range from §2 but
  below it (the model's max_model_len 200000 caps the absolute upper).

What we will NOT learn from this design:
  - Bandwidth saturation when the system is loaded (single-request bench)
  - vLLM-internal scheduling overhead vs pure transfer (the timed step
    folds them together — but for the §3.2 argument that's the right
    "what does PD-disagg actually pay" number)

Intentionally not committed yet: an orchestrator that loops over
intra-/inter-node configs. We start manual on dash1 intra-node to
verify the measurement is sane before scaling out.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 17:47:04 +08:00
0a63de5bcf Phase 0: fresh vllm 0.18.1 + mooncake-transfer-engine on dash1/dash2
Install script lives in microbench/fresh_setup/install.sh. Single shared
venv at /home/admin/cpfs/wjh/agentic-kv-fresh/.venv (cpfs is mounted at
the same path on dash0/1/2 so one install serves all three).

  vllm                    : 0.18.1   (official wheel)
  mooncake-transfer-engine: 0.3.11.post1

Smoke-tested on dash1 + dash2: imports succeed, kv_transfer module
resolves. This venv is the vanilla reference for all subsequent
microbench / PD-disagg experiments — not the dash0 patched build that
carries the connector_tax fix.

The script defines proxyOn inline (ipads 127.0.0.1:11235) so it works
under non-interactive ssh (~/.bashrc proxyOn is interactive-only).
Sets -eo pipefail (not -u) because venv activation references unset
PS1-like vars under -u.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 17:42:36 +08:00
b11dc30945 §2.3 reframe: dispatch coupling is regime-dependent, not binary chatbot/agentic
The previous §2.3 narrative said "chatbot has T_human ≈ 30 s think-time,
agentic has T_external ≈ 0, so agentic is always closed-loop and chatbot
never is". The new T_external measurements on the production chatbot
trace (qwen3-max, n=42 k inter-turn gaps from formatted parent_chat_id
sessions) show the binary framing is wrong:

  agentic   p50 1.6 s,  39% gaps < 1 s,  p99 738 s
  chatbot   p50 7.2 s,   4% gaps < 1 s,  p99  43 s

Both have nonzero T_external. The right distinction is the *shape*:
chatbot is unimodal around 5–15 s (human cadence); agentic is bimodal
with a sub-second tool-call mass (39 % vs chatbot's 4 %) plus a long-
pause tail (13 % > 30 s). The agentic sub-second mass is what activates
dispatch coupling — for any W_turn > 1 s scheduler those turns satisfy
W_turn ≫ T_external by construction.

The empirical regime split:
                 unified  TTFT p90 = 7.3 s   →  agentic 73% closed-loop, chatbot 32%
                 lmetric  TTFT p90 = 15.7s   →  agentic 80%,             chatbot 88%

lmetric is bad enough that it drags the chatbot regime into closed-loop
too. This is a direct empirical explanation for lmetric underperforming
on both workloads.

Updates:
- PAPER_OUTLINE.md §2.3: lead with the regime threshold W_turn ≷
  T_external, replace the "T_human dominates" Little's Law with the
  general form L = Λ · N · (W_turn(L) + T_external), embed f3a CDF,
  add the empirical regime table; correct the small-perturbation
  formula to include the +T_external dampening term.
- MEETING.md §1: same reframe, condensed (CDF figure, two-row regime
  table, one-line conclusion).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 16:51:38 +08:00
876d09db83 Add chatbot T_external CDF; overlay on f3a vs agentic
User-requested comparison of inter-turn external gap distribution between
the production agentic trace (Qwen3-Coder) and a production chatbot trace
(qwen3-max chat). Both computed as
  T_external = next_turn.start_ms - prev_turn.end_ms
on the same kind of pipeline (raw input + raw output join on request_id,
session structure from the formatted trace's parent_chat_id chains).

The chatbot trace lives as two files on dash0:
  input  : bailian-trace/qwen-trace-260321-260327/qwen3-max-input-032309-032311.jsonl
  output : bailian-trace/qwen-trace-260321-260327/qwen3-max-output-032109-032711.jsonl
The raw input has no session_id (uuid is per-record, user_id has only 4
distinct tenant values for 346 k requests). We recover session structure
from the formatted file (qwen_chat_blksz_64_032309-032311.jsonl, which
groups requests by parent_chat_id), matching each formatted record to a
raw record by (timestamp, output_length) — prompt_token_num is anonymized
to 0 in this trace, so we use generate_token_num as the join key.
End time is derived from time_to_finish_token (ms duration) not the "time"
string field (which is the log-write time, not request completion).

Numbers (chatbot, 42 228 inter-turn gaps over 32 262 multi-turn sessions):
  p25  4.85 s   p50  7.18 s   p75  8.22 s   p90 15.0 s   p99  43 s
  4%  gaps < 1 s   29% < 5 s   78% < 10 s   98% < 30 s

Compare to agentic (same metric, scripts/compute_inter_turn_gap_remote.py):
  p25  0.69 s   p50  1.6  s   p75  8.6  s   p90  44  s   p99 738 s
  39% gaps < 1 s   67% < 5 s   77% < 10 s   87% < 30 s

Distributions differ in shape, not just location:
- Chatbot is tight, unimodal around 5–10 s (human interaction).
- Agentic is bimodal: a sub-second autonomous tool-call mode (39 % < 1 s)
  plus a long-pause tail (13 % > 30 s, p99 = 738 s) for sessions where
  the operator steps away.
- The sub-second tool-call mass is where dispatch coupling lives —
  those turns have W_turn ≫ T_external for any current scheduler.

The earlier "chatbot has T_human ≈ 30 s" hand-wave was wrong empirically.
The right framing for §2.3 is "agentic has a sub-second tool-call mode
that chatbot doesn't", not "chatbot has think-time and agentic doesn't".

Adds:
- scripts/compute_inter_turn_gap_chatbot.py: dash0-side aggregator
  (raw input/output join + formatted alignment by ts + output_length)
- analysis/characterization/data/chatbot_inter_turn_gap.json: CDF cache
- scripts/plot_inter_turn_gap.py: overlays both curves on log-x

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 14:49:44 +08:00
cef914ecd4 §3.1: add LMetric vs load_only design analysis (cache signal diluted by ×score)
Why the LMetric → load_only APC gap is only +3.3pp despite LMetric
explicitly being "cache-aware load routing":

  P = pending_prefill_tokens + (input_length - cache_hit)
  score = P × num_requests   <-- multiplicative

cache_hit appears only as a reduction inside P. Because score is
multiplicative in num_requests, a session-affinity instance whose
num_requests has climbed will lose argmin to a cold instance even
when cache_hit on the warm one is ~90%. Worked example:

  warm: P=2500, num_req=5 -> score 12500
  cold: P=10000, num_req=1 -> score 10000   <-- LMetric picks cold

  load_only 53.9% APC  (pure num_requests)
  LMetric   57.2%      +3.3pp (cache as additive cost term)
  sticky    77.7%     +23.8pp (cache as hard constraint)
  unified   78.7%     +24.8pp (cache as hard+soft hybrid)

Lesson worth stating explicitly in §3.1: cache awareness folded into
a multiplicative load cost-model is structurally insufficient. Affinity
must be a separate routing branch (sticky / unified hybrid), not a
correction term inside a load score.

PAPER_OUTLINE.md §3.1 gets the design analysis + the new APC table;
MEETING.md gets a one-paragraph version of the same point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 14:04:14 +08:00
c33c825256 figs/v2: drop unified_v2 (buggy variant); re-render 4-policy panels
User flagged unified_v2 as a still-buggy build. Regenerate the four
per-policy figures with only the four stable policies:
  lmetric, load_only, sticky, unified

Story is now directly comparable to v1: unified still dominates p90
TTFT (8.8s) and E2E p90 (20.0s) over the other three on the fresh run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 13:55:10 +08:00
03d8c5d0d1 Render 4 per-policy figures on b3_replay_20260527_0114 into figs/v2/
User-provided fresh run with five policies (lmetric, load_only, sticky,
unified, plus a new unified_v2 variant). Reproduces the v1 set under
figs/v2/ so we can A/B the same panels:
  f4a_apc_loss.png         — APC bars per policy
  f4c_per_worker_ttft.png  — per-worker TTFT p90 panel per policy
  f6_e2e_latency_bars.png  — TTFT/TPOT/E2E p90 bars per policy
  f6_e2e_latency_full_grid — mean/p50/p90/p99 × TTFT/TPOT/E2E grid

scripts/render_b3_figures_v2.py is a standalone driver that reads each
policy's metrics.summary.json and breakdown.json directly from the run
directory — the breakdown.json `routed_to` field is required to recover
per-worker assignment because the new setup routes every request
through a proxy (127.0.0.1:9300), so metrics.jsonl's endpoint_url no
longer identifies the backend.

Headline numbers, new vs v1:
  APC          v2: lmetric 57.2% / load_only 53.9% / sticky 77.7%
                   unified 78.7% / unified_v2 78.4%
              v1: lmetric 56.9% / load_only 54.1% / sticky 77.2% / unified 79.4%
  TTFT p90 (s) v2: lmetric 14.8 / load_only 20.1 / sticky 14.8 /
                   unified  8.8 / unified_v2 10.1
              v1: lmetric 15.7 / load_only 20.2 / sticky 18.0 / unified 7.3
  E2E p90 (s)  v2: lmetric 25.4 / load_only 33.9 / sticky 30.3 /
                   unified 20.0 / unified_v2 24.1
              v1: lmetric 24.8 / load_only 33.5 / sticky 34.6 / unified 18.0
  Worker p90 (s, median / max)
              v2: lmetric 13.3/30.4 · load_only 21.3/29.2 · sticky 13.5/33.0
                  unified 10.0/35.1 · unified_v2 8.6/34.2
              v1: lmetric 13.9/31.3 · load_only 19.4/25.1 · sticky 20.3/55.4
                  unified 10.3/37.7

Story is unchanged: unified dominates at p90 across TTFT/E2E and on
median-worker latency; unified_v2 is competitive at p50 but slightly
worse than unified at p90.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 13:52:17 +08:00
41232f49d3 Measure inter-turn T_external on the raw production trace; add f3a CDF
The earlier conversation suggested agentic might "have no human think-time"
and therefore live in a strict closed-loop regime. The user pushed back:
tool calls also take time and might restore a chatbot-like buffer between
turns. To resolve this, we go to the actual data.

The previously-published per-record formatted trace only carries arrival
timestamps, so an arrival-to-arrival diff conflates W_turn + T_external.
The raw trace (/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/
051315-051317-raw.jsonl on dash0) additionally carries request_end_time_ms,
which lets us compute the pure inter-turn external gap
T_external = next.request_ready_time_ms - prev.request_end_time_ms
for each session's consecutive turn pair.

Headline numbers (n = 783 k inter-turn gaps over 127 k multi-turn sessions):

  p25  = 0.69 s
  p50  = 1.6  s
  p75  = 8.6  s
  p90  = 44   s
  mean = 37   s   (heavy long-tail; paused/abandoned sessions)

  39 % of gaps < 1 s
  67 % of gaps < 5 s
  87 % of gaps < 30 s

The bulk of the distribution is dominated by sub-second to a-few-seconds
tool-call latencies. Under any current scheduler (e.g. unified TTFT p90 =
7.3 s, lmetric 15.7 s), W_turn is already at or above the 75th percentile
of T_external, so dispatch coupling is the dominant regime for the
majority of turns — not a corner case.

This corrects the earlier conflated arrival-to-arrival "median gap 11 s"
figure (which folded W_turn into T_external). The true T_external median
is 1.6 s.

Adds:
- scripts/compute_inter_turn_gap_remote.py: dash0-side aggregator
- analysis/characterization/data/agentic_inter_turn_gap.json: 500-point
  CDF cache + summary stats, scp'd back from dash0
- scripts/plot_inter_turn_gap.py: local figure renderer
- figs/f3a_inter_turn_gap.png: log-x CDF with p25/p50/p75/p90 anchors and
  unified/lmetric TTFT p90 reference lines

Next step (per user): pull a chatbot trace through the same pipeline and
compare distributions side by side; this will let §2.3 stop hand-waving
about "no think-time" and instead present the regime split empirically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 12:37:32 +08:00
555cabcf1f f2c: switch to per-instance decode-concurrency view; correct KV pool ceiling
Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong — a 30B-A3B bf16
deployment burns roughly:
  ~50% HBM for model params (~48 GiB on 96 GiB H20)
  ~10% for runtime activation buffers
  ~40% left for the KV cache pool (~38.4 GiB)
so 95 GiB was overstating the available pool by 2.5×.

New f2c reframes the same data into the answer that actually motivates
the paper: how many concurrent decodes does a single instance hold,
and how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:
  Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)

Key reads off the figure:
  p50 (1.8 GiB/req): 20 fit/inst → 160 / 80 / 40 system-wide
  p90 (8.0 GiB/req):  4 fit/inst →  32 / 16 /  8
  p95 (9.6 GiB/req):  4 fit/inst →  32 / 16 /  8
  p99 (11.5 GiB/req): 3 fit/inst →  24 / 12 /  6

PD-disagg 4P+4D literally halves the decode population at the same
per-request KV pressure — this is the concrete §3.2 "KV memory wall"
penalty stated in terms users care about (concurrency).

- analysis/characterization/render_window1_figures.py:
  fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json
  but computes floor(KV_pool / req_size) × N_D and annotates the
  per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the
  new ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
  framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:28:47 +08:00
922d79ac95 Add full latency grid (mean/p50/p90/p99 × TTFT/TPOT/E2E) as f6 companion
The headline f6_e2e_latency_bars only shows p90, hiding three regimes:
  - mean: unified dominates (3.3s TTFT, 7.0s E2E vs sticky 5.6s / 12.1s)
  - p50: sticky and unified are tied on first-turn TTFT (0.5s each) —
    sticky's first turn of each session is free, after which queues
    accumulate. Unified beats sticky everywhere else.
  - p99: tail amplification reveals unified's biggest gap —
    TTFT 42.3s vs sticky 74.1s; E2E 68.8s vs sticky 139.7s.

The 12-panel figure is the honest full picture; the 3-panel headline
stays for slide-friendly summary.

- analysis/characterization/window_1_results/raw_stats/{policy}.json:
  cached ttft/tpot/e2e {mean,p50,p90,p99} pulled from dash0
  /home/admin/cpfs/wjh/agentic-kv/outputs/b3_sweep_20260525_095043/
  (b3_policy_comparison.json doesn't record mean, only percentiles).
- analysis/characterization/render_window1_figures.py:
  new fig_b3_latency_full_grid renders the 4×3 grid from the cache.
- figs/f6_e2e_latency_full_grid.png: 12-panel companion.
- PAPER_OUTLINE.md §5.2: both figures embedded; main table column
  renamed from "Hotspot idx" to "Worker p90 (median / max)" to match
  the new metric convention.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:15:18 +08:00
5e6e98aee7 Replace max/median hotspot index with (median, max) absolute pair
The max/median ratio inverts the actual user-facing p90 ranking:
  sticky:  hotspot=2.73 but system e2e p90 = 34.6s  (worst)
  unified: hotspot=3.67 but system e2e p90 = 18.0s  (best)
because sticky's median is also high (everyone slow) while unified
concentrates the damage on one worker and keeps the other 7 fast.
Any "imbalance" metric structurally punishes the affinity-then-escape
schemes that we actually want to advocate for.

Changes:
- analysis/characterization/render_window1_figures.py:
  fig_b3_per_worker_ttft now annotates each subplot with
  "median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring
  documents why we drop the ratio.
- figs/f4c_per_worker_ttft.png: regenerated with new titles.
- figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis
  was the deprecated ratio; superseded by f4c per-worker bars + f6
  e2e bars which together carry the same information honestly.
- PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8
  conclusion — replace "hotspot index" mentions with
  "worst-worker p90" or "(median, max) worker p90"; promote the
  §3.3 methodology note to a top-level sub-finding ("hot pin
  failure must be measured with per-worker absolute latency,
  not normalized ratio").
- MEETING.md: §3.3 narrative reworded to lead with the (median, max)
  pair directly; explicit one-line note on why the ratio is dropped.

Conceptual uses of "hot session" / "hot instance" / "hot pin" remain
unchanged — only the *metric* called hotspot index is retired.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:07:12 +08:00
9ddabee6ae Remove 'capped' references from MEETING.md and PAPER_OUTLINE.md prose
Companion to the figure cleanup: prose in §3.1 was still quoting
"capped 31.6% APC" as one of the failure-mode datapoints. Same reason
as the figures — capped is a workload manipulation, not a policy, so
it doesn't belong in the §3.1 routing-policy narrative.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:02:29 +08:00
09ff1069c3 Drop 'capped' from per-policy figures (f4a, f4c×2, f6)
'capped' is not a routing policy — it's lmetric run on a separately
truncated trace (sessions capped to 8 turns via build_capped_trace.py).
Putting it alongside lmetric/load_only/sticky/unified in per-policy
comparison figures is misleading because the workload differs, not
the routing decision. Comparing apples to a different-trace orange
inflates/deflates apparent policy gaps for the wrong reasons.

Regenerated 4 figures with --exclude-policies capped on
analysis/characterization/render_window1_figures.py:
  - f4a_apc_loss.png                 (APC bars)
  - f4c_apc_vs_hotspot_tradeoff.png  (APC vs hotspot scatter)
  - f4c_per_worker_ttft.png          (per-worker TTFT panel)
  - f6_e2e_latency_bars.png          (TTFT/TPOT/E2E bars)

Added --exclude-policies CLI flag to the renderer so this is a
reversible choice, not a permanent script mutation. capped data remains
in b3_policy_comparison.json and can be brought back in workload-
sensitivity sections (where it actually belongs) by omitting the flag.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:57:43 +08:00
74e0c2157a Add solo production-trace CDF figure (f2b_session_skew_prod.png)
Single-curve variant of f2b — production trace only, no replay overlay
and no uniform reference. Cleaner for boss-meeting/talk slides where the
extra context is noise. The combined three-curve figure is unchanged.

scripts/plot_session_skew_cdf.py: split into plot_combined +
plot_production_solo helpers; one run emits both PNGs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:53:30 +08:00
1220da249c f2b: regenerate CDF from production trace (1.3M sessions on dash0)
Pulls 456 (rank%, cum%) sample points from the raw production trace at
dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl,
cached locally so the figure is reproducible without ssh access. Sampled
anchors match the precomputed summary exactly:
  top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
plus newly readable points:
  top 25% = 87.5%, top 50% = 96.0%

Workload characterization is now consistent with the production
distribution rather than the small replay subset. Replay window CDF kept
as an overlay to show the same hockey-stick shape on the data §5 actually
uses.

- analysis/characterization/data/production_session_skew_cdf.json: cached
  sample points (29 KB), so the figure rebuilds locally
- scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw
- MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace,
  add top-25%/50% data points

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:41:53 +08:00
22c4aa58e4 f2b: replace top-1/5/10% bars with full CDF; align all docs to replay-trace numbers
The previous f2b_session_skew.png was a 3-bar chart (top 1/5/10%) computed
from the production trace summary (which is not present locally, only its
precomputed JSON). The new figure is a continuous CDF of cumulative
input-token mass vs session rank percentile, generated directly from the
replay trace traces/w600_r0.0015_st30.jsonl so any percentile is readable.

Headline numbers update accordingly:
  replay trace (n=274 sessions): top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8%
  production trace (n=1.3M):     top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%

Both show extreme skew well above the y=x uniform reference; the replay
trace is less extreme at top-1% because n=274 makes that bucket only
~3 sessions. We standardize §2/§3 narrative on the replay-trace numbers
so motivation matches §5 evaluation; production numbers kept as a side
note for context.

- scripts/plot_session_skew_cdf.py: reproducible figure generator
- MEETING.md / PAPER_OUTLINE.md: update narrative + caption

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:37:22 +08:00
020a5c79a7 §3.3 reframe: hot pin failure is uniformly-slow workers, not max/median ratio
User pointed out the apparent paradox: in fig_b3_per_worker_ttft_p90, unified
has hotspot index 3.67 while sticky has 2.73, yet unified e2e p90 is roughly
half of sticky's. Resolution: hotspot index (max/median) is a *ratio* and
misleading on its own. Per-worker absolute TTFT p90:

  sticky : median 20.3s, max 55.4s -> system e2e p90 34.6s
  unified: median 10.3s, max 37.7s -> system e2e p90 18.0s

Mechanism: top 1% sessions own 46.5% input mass and there are more hot
sessions than instances (8), so sticky's hash binding gives *every* worker
its own hot session and the median worker is also slow. Unified's LMetric
fallback re-routes cold/new sessions away from hot affinity instances,
preserving 7/8 worker speed. System p90 is dominated by the majority of
requests landing on fast workers, hence the 2x e2e gap.

Changes:
- Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars)
  instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around
  absolute median + max + system e2e p90 instead of hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with
  per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for §5 multi-policy summary

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:10:23 +08:00
18f1bd4240 Update MEETING.md + PAPER_OUTLINE.md with connector_tax substrate validation
2026-05-27 trace-replay A/B/C (commit ef9e010) shows the kv_both substrate
is net positive on current codebase, not just neutral:
  - TTFT p90: 11.97s plain → 9.74s kv_both (−18.6%) → 7.58s with DR-fix (−36.6%)

This reverses the elastic_migration_v2 paper's +45% kv_both penalty claim
and removes the primary cause of the 4 prior migration reverts.

Reframes EAR Pillar 2 from "DEFERRED" to "PARTIAL" — substrate verified,
e2e strategy-layer validation (trigger thresholds + target selection in
the dispatch-coupling feedback loop) remains as the only open risk.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:17:31 +08:00