Commit Graph

174 Commits

Author SHA1 Message Date
876d09db83 Add chatbot T_external CDF; overlay on f3a vs agentic
User-requested comparison of inter-turn external gap distribution between
the production agentic trace (Qwen3-Coder) and a production chatbot trace
(qwen3-max chat). Both computed as
  T_external = next_turn.start_ms - prev_turn.end_ms
on the same kind of pipeline (raw input + raw output join on request_id,
session structure from the formatted trace's parent_chat_id chains).

The chatbot trace lives as two files on dash0:
  input  : bailian-trace/qwen-trace-260321-260327/qwen3-max-input-032309-032311.jsonl
  output : bailian-trace/qwen-trace-260321-260327/qwen3-max-output-032109-032711.jsonl
The raw input has no session_id (uuid is per-record, user_id has only 4
distinct tenant values for 346 k requests). We recover session structure
from the formatted file (qwen_chat_blksz_64_032309-032311.jsonl, which
groups requests by parent_chat_id), matching each formatted record to a
raw record by (timestamp, output_length) — prompt_token_num is anonymized
to 0 in this trace, so we use generate_token_num as the join key.
End time is derived from time_to_finish_token (ms duration) not the "time"
string field (which is the log-write time, not request completion).
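The alignment step above can be sketched as a keyed join. This is a minimal illustration, not the code in scripts/compute_inter_turn_gap_chatbot.py: the dict keys `timestamp`, `output_length`, `generate_token_num`, `chat_id` and `request_id` are assumed spellings, and skipping ambiguous keys is an assumed policy.

```python
from collections import defaultdict


def align_formatted_to_raw(formatted, raw):
    """Pair each formatted record with the raw record sharing its join key.

    The join key is (timestamp, output length). A key that maps to more than
    one raw record is ambiguous and is skipped rather than guessed.
    """
    raw_by_key = defaultdict(list)
    for rec in raw:
        raw_by_key[(rec["timestamp"], rec["generate_token_num"])].append(rec)

    matched = []
    for rec in formatted:
        candidates = raw_by_key.get((rec["timestamp"], rec["output_length"]), [])
        if len(candidates) == 1:
            matched.append((rec, candidates[0]))
    return matched


# Toy records (hypothetical values).
raw = [
    {"request_id": "a", "timestamp": 100, "generate_token_num": 7},
    {"request_id": "b", "timestamp": 100, "generate_token_num": 9},
    {"request_id": "c", "timestamp": 250, "generate_token_num": 7},
]
formatted = [
    {"chat_id": 1, "parent_chat_id": None, "timestamp": 100, "output_length": 9},
    {"chat_id": 2, "parent_chat_id": 1, "timestamp": 250, "output_length": 7},
    {"chat_id": 3, "parent_chat_id": 2, "timestamp": 999, "output_length": 1},
]
pairs = align_formatted_to_raw(formatted, raw)
matched_ids = [raw_rec["request_id"] for _, raw_rec in pairs]
print(matched_ids)
```

The third formatted record has no raw counterpart and is dropped, which is the behaviour one would want before computing gaps on the matched chain.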

Numbers (chatbot, 42 228 inter-turn gaps over 32 262 multi-turn sessions):
  p25  4.85 s   p50  7.18 s   p75  8.22 s   p90 15.0 s   p99  43 s
  4%  gaps < 1 s   29% < 5 s   78% < 10 s   98% < 30 s

Compare to agentic (same metric, scripts/compute_inter_turn_gap_remote.py):
  p25  0.69 s   p50  1.6  s   p75  8.6  s   p90  44  s   p99 738 s
  39% gaps < 1 s   67% < 5 s   77% < 10 s   87% < 30 s

Distributions differ in shape, not just location:
- Chatbot is tight, unimodal around 5–10 s (human interaction).
- Agentic is bimodal: a sub-second autonomous tool-call mode (39 % < 1 s)
  plus a long-pause tail (13 % > 30 s, p99 = 738 s) for sessions where
  the operator steps away.
- The sub-second tool-call mass is where dispatch coupling lives —
  those turns have W_turn ≫ T_external for any current scheduler.

The earlier "chatbot has T_human ≈ 30 s" hand-wave was wrong empirically.
The right framing for §2.3 is "agentic has a sub-second tool-call mode
that chatbot doesn't", not "chatbot has think-time and agentic doesn't".

Adds:
- scripts/compute_inter_turn_gap_chatbot.py: dash0-side aggregator
  (raw input/output join + formatted alignment by ts + output_length)
- analysis/characterization/data/chatbot_inter_turn_gap.json: CDF cache
- scripts/plot_inter_turn_gap.py: overlays both curves on log-x

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 14:49:44 +08:00
cef914ecd4 §3.1: add LMetric vs load_only design analysis (cache signal diluted by ×score)
Why the LMetric → load_only APC gap is only +3.3pp despite LMetric
explicitly being "cache-aware load routing":

  P = pending_prefill_tokens + (input_length - cache_hit)
  score = P × num_requests   <-- multiplicative

cache_hit appears only as a reduction inside P. Because score is
multiplicative in num_requests, a session-affinity instance whose
num_requests has climbed will lose argmin to a cold instance even
when cache_hit on the warm one is ~90%. Worked example:

  warm: P=2500, num_req=5 -> score 12500
  cold: P=10000, num_req=1 -> score 10000   <-- LMetric picks cold

  load_only 53.9% APC  (pure num_requests)
  LMetric   57.2%      +3.3pp (cache as additive cost term)
  sticky    77.7%     +23.8pp (cache as hard constraint)
  unified   78.7%     +24.8pp (cache as hard+soft hybrid)

Lesson worth stating explicitly in §3.1: cache awareness folded into
a multiplicative load cost-model is structurally insufficient. Affinity
must be a separate routing branch (sticky / unified hybrid), not a
correction term inside a load score.
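The worked example can be reproduced with a few lines of Python. This is a sketch of the formula as written above, not the router's code; the split of each P into pending and uncached tokens (10 000 input tokens, 90% cache hit on the warm instance) is an assumption chosen to match P=2500 and P=10000.

```python
def lmetric_score(pending_prefill_tokens, input_length, cache_hit, num_requests):
    """score = P * num_requests, with P = pending prefill + uncached input tokens."""
    p = pending_prefill_tokens + (input_length - cache_hit)
    return p * num_requests


# Hypothetical decomposition that yields the P values quoted in the example.
candidates = {
    "warm": dict(pending_prefill_tokens=1500, input_length=10000,
                 cache_hit=9000, num_requests=5),
    "cold": dict(pending_prefill_tokens=0, input_length=10000,
                 cache_hit=0, num_requests=1),
}
scores = {name: lmetric_score(**kw) for name, kw in candidates.items()}
chosen = min(scores, key=scores.get)  # argmin over instances
print(scores, chosen)
```

Even with a 90% cache hit, the warm instance's score is scaled by its request count, so the cold instance wins the argmin.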

PAPER_OUTLINE.md §3.1 gets the design analysis + the new APC table;
MEETING.md gets a one-paragraph version of the same point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 14:04:14 +08:00
c33c825256 figs/v2: drop unified_v2 (buggy variant); re-render 4-policy panels
User flagged unified_v2 as a still-buggy build. Regenerate the four
per-policy figures with only the four stable policies:
  lmetric, load_only, sticky, unified

Story is now directly comparable to v1: unified still dominates p90
TTFT (8.8s) and E2E p90 (20.0s) over the other three on the fresh run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 13:55:10 +08:00
03d8c5d0d1 Render 4 per-policy figures on b3_replay_20260527_0114 into figs/v2/
User-provided fresh run with five policies (lmetric, load_only, sticky,
unified, plus a new unified_v2 variant). Reproduces the v1 set under
figs/v2/ so we can A/B the same panels:
  f4a_apc_loss.png         — APC bars per policy
  f4c_per_worker_ttft.png  — per-worker TTFT p90 panel per policy
  f6_e2e_latency_bars.png  — TTFT/TPOT/E2E p90 bars per policy
  f6_e2e_latency_full_grid — mean/p50/p90/p99 × TTFT/TPOT/E2E grid

scripts/render_b3_figures_v2.py is a standalone driver that reads each
policy's metrics.summary.json and breakdown.json directly from the run
directory — the breakdown.json `routed_to` field is required to recover
per-worker assignment because the new setup routes every request
through a proxy (127.0.0.1:9300), so metrics.jsonl's endpoint_url no
longer identifies the backend.

Headline numbers, new vs v1:
  APC          v2: lmetric 57.2% / load_only 53.9% / sticky 77.7% /
                   unified 78.7% / unified_v2 78.4%
               v1: lmetric 56.9% / load_only 54.1% / sticky 77.2% / unified 79.4%
  TTFT p90 (s) v2: lmetric 14.8 / load_only 20.1 / sticky 14.8 /
                   unified  8.8 / unified_v2 10.1
               v1: lmetric 15.7 / load_only 20.2 / sticky 18.0 / unified 7.3
  E2E p90 (s)  v2: lmetric 25.4 / load_only 33.9 / sticky 30.3 /
                   unified 20.0 / unified_v2 24.1
               v1: lmetric 24.8 / load_only 33.5 / sticky 34.6 / unified 18.0
  Worker p90 (s, median / max)
               v2: lmetric 13.3/30.4 · load_only 21.3/29.2 · sticky 13.5/33.0 ·
                   unified 10.0/35.1 · unified_v2 8.6/34.2
               v1: lmetric 13.9/31.3 · load_only 19.4/25.1 · sticky 20.3/55.4 ·
                   unified 10.3/37.7

Story is unchanged: unified dominates at p90 across TTFT/E2E and on
median-worker latency; unified_v2 is competitive at p50 but slightly
worse than unified at p90.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 13:52:17 +08:00
41232f49d3 Measure inter-turn T_external on the raw production trace; add f3a CDF
The earlier conversation suggested agentic might "have no human think-time"
and therefore live in a strict closed-loop regime. The user pushed back:
tool calls also take time and might restore a chatbot-like buffer between
turns. To resolve this, we go to the actual data.

The previously-published per-record formatted trace only carries arrival
timestamps, so an arrival-to-arrival diff conflates W_turn + T_external.
The raw trace (/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/
051315-051317-raw.jsonl on dash0) additionally carries request_end_time_ms,
which lets us compute the pure inter-turn external gap
T_external = next.request_ready_time_ms - prev.request_end_time_ms
for each session's consecutive turn pair.
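A minimal sketch of that computation, assuming each raw record is a dict. `request_ready_time_ms` and `request_end_time_ms` are the fields named above; the `session_id` key and the toy values are assumptions for illustration, and this is not the code in scripts/compute_inter_turn_gap_remote.py.

```python
from collections import defaultdict


def inter_turn_gaps_ms(records):
    """T_external for every consecutive turn pair within each session."""
    by_session = defaultdict(list)
    for rec in records:
        by_session[rec["session_id"]].append(rec)

    gaps = []
    for turns in by_session.values():
        turns.sort(key=lambda r: r["request_ready_time_ms"])
        for prev, nxt in zip(turns, turns[1:]):
            gaps.append(nxt["request_ready_time_ms"] - prev["request_end_time_ms"])
    return gaps


records = [
    {"session_id": "s1", "request_ready_time_ms": 0, "request_end_time_ms": 500},
    {"session_id": "s1", "request_ready_time_ms": 1200, "request_end_time_ms": 2000},
    {"session_id": "s1", "request_ready_time_ms": 2100, "request_end_time_ms": 2500},
    {"session_id": "s2", "request_ready_time_ms": 50, "request_end_time_ms": 900},
]
gaps = inter_turn_gaps_ms(records)
print(gaps)
```

For session s1 the arrival-to-arrival differences would be 1200 ms and 900 ms, while the external gaps are 700 ms and 100 ms; the difference is the W_turn that the formatted trace cannot separate out. Single-turn sessions such as s2 contribute no gap.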

Headline numbers (n = 783 k inter-turn gaps over 127 k multi-turn sessions):

  p25  = 0.69 s
  p50  = 1.6  s
  p75  = 8.6  s
  p90  = 44   s
  mean = 37   s   (heavy long-tail; paused/abandoned sessions)

  39 % of gaps < 1 s
  67 % of gaps < 5 s
  87 % of gaps < 30 s

The bulk of the distribution is dominated by sub-second to a-few-seconds
tool-call latencies. Under any current scheduler (e.g. unified TTFT p90 =
7.3 s, lmetric 15.7 s), W_turn is already near or above the 75th percentile
of T_external (8.6 s), so dispatch coupling is the dominant regime for the
majority of turns — not a corner case.

This corrects the earlier conflated arrival-to-arrival "median gap 11 s"
figure (which folded W_turn into T_external). The true T_external median
is 1.6 s.

Adds:
- scripts/compute_inter_turn_gap_remote.py: dash0-side aggregator
- analysis/characterization/data/agentic_inter_turn_gap.json: 500-point
  CDF cache + summary stats, scp'd back from dash0
- scripts/plot_inter_turn_gap.py: local figure renderer
- figs/f3a_inter_turn_gap.png: log-x CDF with p25/p50/p75/p90 anchors and
  unified/lmetric TTFT p90 reference lines

Next step (per user): pull a chatbot trace through the same pipeline and
compare distributions side by side; this will let §2.3 stop hand-waving
about "no think-time" and instead present the regime split empirically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 12:37:32 +08:00
555cabcf1f f2c: switch to per-instance decode-concurrency view; correct KV pool ceiling
Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong — a 30B-A3B bf16
deployment burns roughly:
  ~50% HBM for model params (~48 GiB on 96 GiB H20)
  ~10% for runtime activation buffers
  ~40% left for the KV cache pool (~38.4 GiB)
so 95 GiB was overstating the available pool by 2.5×.

New f2c reframes the same data into the answer that actually motivates
the paper: how many concurrent decodes does a single instance hold,
and how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:
  Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)

Key reads off the figure:
  p50 (1.8 GiB/req): 20 fit/inst → 160 / 80 / 40 system-wide
  p90 (8.0 GiB/req):  4 fit/inst →  32 / 16 /  8
  p95 (9.6 GiB/req):  4 fit/inst →  32 / 16 /  8
  p99 (11.5 GiB/req): 3 fit/inst →  24 / 12 /  6

PD-disagg 4P+4D literally halves the decode population at the same
per-request KV pressure — this is the concrete §3.2 "KV memory wall"
penalty stated in terms users care about (concurrency).
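The capacity arithmetic can be checked with a short sketch of floor(KV_pool / req_size) × N_D. The 38.4 GiB pool and the per-request sizes are the figures quoted above; the function and variable names are illustrative. Only the p90/p95/p99 rows are reproduced here, since the p50 row is sensitive to rounding of the 1.8 GiB figure.

```python
import math

KV_POOL_GIB = 38.4  # ~40% of a 96 GiB H20, as estimated above

# Decode-capable instances per 8-GPU deployment.
DEPLOYMENTS = {"Combined 8C": 8, "PD-disagg 4P+4D": 4, "PD-disagg 6P+2D": 2}

# Per-request KV footprint at each percentile, in GiB.
REQ_GIB = {"p90": 8.0, "p95": 9.6, "p99": 11.5}


def decode_capacity(kv_pool_gib, req_gib, n_decode_instances):
    """Return (requests that fit per instance, system-wide concurrent decodes)."""
    per_instance = math.floor(kv_pool_gib / req_gib)
    return per_instance, per_instance * n_decode_instances


table = {
    pct: {name: decode_capacity(KV_POOL_GIB, size, n_d)
          for name, n_d in DEPLOYMENTS.items()}
    for pct, size in REQ_GIB.items()
}
for pct, row in table.items():
    print(pct, row)
```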

- analysis/characterization/render_window1_figures.py:
  fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json
  but computes floor(KV_pool / req_size) × N_D and annotates the
  per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the
  new ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
  framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:28:47 +08:00
922d79ac95 Add full latency grid (mean/p50/p90/p99 × TTFT/TPOT/E2E) as f6 companion
The headline f6_e2e_latency_bars only shows p90, hiding three regimes:
  - mean: unified dominates (3.3s TTFT, 7.0s E2E vs sticky 5.6s / 12.1s)
  - p50: sticky and unified are tied on first-turn TTFT (0.5s each) —
    sticky's first turn of each session is free, after which queues
    accumulate. Unified beats sticky everywhere else.
  - p99: tail amplification reveals unified's biggest gap —
    TTFT 42.3s vs sticky 74.1s; E2E 68.8s vs sticky 139.7s.

The 12-panel figure is the honest full picture; the 3-panel headline
stays for slide-friendly summary.

- analysis/characterization/window_1_results/raw_stats/{policy}.json:
  cached ttft/tpot/e2e {mean,p50,p90,p99} pulled from dash0
  /home/admin/cpfs/wjh/agentic-kv/outputs/b3_sweep_20260525_095043/
  (b3_policy_comparison.json doesn't record mean, only percentiles).
- analysis/characterization/render_window1_figures.py:
  new fig_b3_latency_full_grid renders the 4×3 grid from the cache.
- figs/f6_e2e_latency_full_grid.png: 12-panel companion.
- PAPER_OUTLINE.md §5.2: both figures embedded; main table column
  renamed from "Hotspot idx" to "Worker p90 (median / max)" to match
  the new metric convention.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:15:18 +08:00
5e6e98aee7 Replace max/median hotspot index with (median, max) absolute pair
The max/median ratio inverts the actual user-facing p90 ranking:
  sticky:  hotspot=2.73 but system e2e p90 = 34.6s  (worst)
  unified: hotspot=3.67 but system e2e p90 = 18.0s  (best)
because sticky's median is also high (everyone slow) while unified
concentrates the damage on one worker and keeps the other 7 fast.
Any "imbalance" metric structurally punishes the affinity-then-escape
schemes that we actually want to advocate for.
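The inversion is easy to see by ranking the two policies both ways. This sketch uses the per-worker TTFT p90 and system e2e p90 values quoted in this log; the dict layout and helper name are illustrative.

```python
# Per-worker TTFT p90 (s) and system e2e p90 (s), as quoted in this log.
worker_p90 = {
    "sticky": {"median": 20.3, "max": 55.4, "system_e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "system_e2e_p90": 18.0},
}


def hotspot_ratio(stats):
    """The retired metric: worst worker divided by median worker."""
    return stats["max"] / stats["median"]


# Lower is "better" under each ranking.
by_ratio = sorted(worker_p90, key=lambda p: hotspot_ratio(worker_p90[p]))
by_e2e = sorted(worker_p90, key=lambda p: worker_p90[p]["system_e2e_p90"])
print("ranked by ratio:", by_ratio)
print("ranked by system e2e p90:", by_e2e)
```

The ratio ranks sticky ahead of unified, while the user-facing e2e p90 ranks them the other way round, which is why the (median, max) pair is reported directly.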

Changes:
- analysis/characterization/render_window1_figures.py:
  fig_b3_per_worker_ttft now annotates each subplot with
  "median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring
  documents why we drop the ratio.
- figs/f4c_per_worker_ttft.png: regenerated with new titles.
- figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis
  was the deprecated ratio; superseded by f4c per-worker bars + f6
  e2e bars which together carry the same information honestly.
- PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8
  conclusion — replace "hotspot index" mentions with
  "worst-worker p90" or "(median, max) worker p90"; promote the
  §3.3 methodology note to a top-level sub-finding ("hot pin
  failure must be measured with per-worker absolute latency,
  not normalized ratio").
- MEETING.md: §3.3 narrative reworded to lead with the (median, max)
  pair directly; explicit one-line note on why the ratio is dropped.

Conceptual uses of "hot session" / "hot instance" / "hot pin" remain
unchanged — only the *metric* called hotspot index is retired.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:07:12 +08:00
9ddabee6ae Remove 'capped' references from MEETING.md and PAPER_OUTLINE.md prose
Companion to the figure cleanup: prose in §3.1 was still quoting
"capped 31.6% APC" as one of the failure-mode datapoints. Same reason
as the figures — capped is a workload manipulation, not a policy, so
it doesn't belong in the §3.1 routing-policy narrative.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:02:29 +08:00
09ff1069c3 Drop 'capped' from per-policy figures (f4a, f4c×2, f6)
'capped' is not a routing policy — it's lmetric run on a separately
truncated trace (sessions capped to 8 turns via build_capped_trace.py).
Putting it alongside lmetric/load_only/sticky/unified in per-policy
comparison figures is misleading because the workload differs, not
the routing decision. Comparing apples to a different-trace orange
inflates/deflates apparent policy gaps for the wrong reasons.

Regenerated 4 figures with --exclude-policies capped on
analysis/characterization/render_window1_figures.py:
  - f4a_apc_loss.png                 (APC bars)
  - f4c_apc_vs_hotspot_tradeoff.png  (APC vs hotspot scatter)
  - f4c_per_worker_ttft.png          (per-worker TTFT panel)
  - f6_e2e_latency_bars.png          (TTFT/TPOT/E2E bars)

Added --exclude-policies CLI flag to the renderer so this is a
reversible choice, not a permanent script mutation. capped data remains
in b3_policy_comparison.json and can be brought back in workload-
sensitivity sections (where it actually belongs) by omitting the flag.
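A flag of this kind can be sketched with argparse as below. Only the flag name `--exclude-policies` comes from this commit; the policy list, the space-separated argument form and the helper names are assumptions, not the renderer's actual code.

```python
import argparse

# Hypothetical full policy list as stored in the comparison JSON.
ALL_POLICIES = ["lmetric", "load_only", "sticky", "unified", "capped"]


def build_parser():
    parser = argparse.ArgumentParser(description="Render per-policy figures.")
    parser.add_argument(
        "--exclude-policies",
        nargs="*",
        default=[],
        metavar="POLICY",
        help="policies to leave out of the per-policy figures",
    )
    return parser


def select_policies(argv):
    """Return the policies to plot, preserving the original order."""
    args = build_parser().parse_args(argv)
    excluded = set(args.exclude_policies)
    return [p for p in ALL_POLICIES if p not in excluded]


print(select_policies(["--exclude-policies", "capped"]))
print(select_policies([]))
```

Because the default is an empty list, omitting the flag restores the full set without touching the script.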

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:57:43 +08:00
74e0c2157a Add solo production-trace CDF figure (f2b_session_skew_prod.png)
Single-curve variant of f2b — production trace only, no replay overlay
and no uniform reference. Cleaner for boss-meeting/talk slides where the
extra context is noise. The combined three-curve figure is unchanged.

scripts/plot_session_skew_cdf.py: split into plot_combined +
plot_production_solo helpers; one run emits both PNGs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:53:30 +08:00
1220da249c f2b: regenerate CDF from production trace (1.3M sessions on dash0)
Pulls 456 (rank%, cum%) sample points from the raw production trace at
dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl,
cached locally so the figure is reproducible without ssh access. Sampled
anchors match the precomputed summary exactly:
  top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
plus newly readable points:
  top 25% = 87.5%, top 50% = 96.0%

Workload characterization is now consistent with the production
distribution rather than the small replay subset. Replay window CDF kept
as an overlay to show the same hockey-stick shape on the data §5 actually
uses.

- analysis/characterization/data/production_session_skew_cdf.json: cached
  sample points (29 KB), so the figure rebuilds locally
- scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw
- MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace,
  add top-25%/50% data points

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:41:53 +08:00
22c4aa58e4 f2b: replace top-1/5/10% bars with full CDF; align all docs to replay-trace numbers
The previous f2b_session_skew.png was a 3-bar chart (top 1/5/10%) computed
from the production trace summary (which is not present locally, only its
precomputed JSON). The new figure is a continuous CDF of cumulative
input-token mass vs session rank percentile, generated directly from the
replay trace traces/w600_r0.0015_st30.jsonl so any percentile is readable.
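A minimal sketch of such a CDF, assuming one total input-token count per session. The function names and toy token counts are illustrative and are not taken from scripts/plot_session_skew_cdf.py.

```python
def skew_cdf(session_tokens):
    """Return (session rank %, cumulative input-token %) points, heaviest first."""
    ranked = sorted(session_tokens, reverse=True)
    total = sum(ranked)
    points = []
    running = 0
    for i, tokens in enumerate(ranked, start=1):
        running += tokens
        points.append((100.0 * i / len(ranked), 100.0 * running / total))
    return points


def top_share(session_tokens, top_fraction):
    """Fraction of all input tokens owned by the heaviest top_fraction of sessions."""
    ranked = sorted(session_tokens, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Toy workload: ten sessions, one of which owns half of the tokens.
tokens = [500, 200, 100, 50, 50, 40, 30, 20, 5, 5]
points = skew_cdf(tokens)
print(points[0], points[-1], top_share(tokens, 0.10))
```

With a small number of sessions the top-1% bucket collapses to a handful of sessions, so the `max(1, ...)` guard matters and the top-1% figure becomes noisy.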

Headline numbers update accordingly:
  replay trace (n=274 sessions): top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8%
  production trace (n=1.3M):     top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%

Both show extreme skew well above the y=x uniform reference; the replay
trace is less extreme at top-1% because n=274 makes that bucket only
~3 sessions. We standardize §2/§3 narrative on the replay-trace numbers
so motivation matches §5 evaluation; production numbers kept as a side
note for context.

- scripts/plot_session_skew_cdf.py: reproducible figure generator
- MEETING.md / PAPER_OUTLINE.md: update narrative + caption

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:37:22 +08:00
020a5c79a7 §3.3 reframe: hot pin failure is uniformly-slow workers, not max/median ratio
User pointed out the apparent paradox: in fig_b3_per_worker_ttft_p90, unified
has hotspot index 3.67 while sticky has 2.73, yet unified e2e p90 is roughly
half of sticky's. Resolution: hotspot index (max/median) is a *ratio* and
misleading on its own. Per-worker absolute TTFT p90:

  sticky : median 20.3s, max 55.4s -> system e2e p90 34.6s
  unified: median 10.3s, max 37.7s -> system e2e p90 18.0s

Mechanism: top 1% sessions own 46.5% input mass and there are more hot
sessions than instances (8), so sticky's hash binding gives *every* worker
its own hot session and the median worker is also slow. Unified's LMetric
fallback re-routes cold/new sessions away from hot affinity instances,
preserving 7/8 worker speed. System p90 is dominated by the majority of
requests landing on fast workers, hence the 2x e2e gap.

Changes:
- Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars)
  instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around
  absolute median + max + system e2e p90 instead of hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with
  per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for §5 multi-policy summary

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:10:23 +08:00
18f1bd4240 Update MEETING.md + PAPER_OUTLINE.md with connector_tax substrate validation
2026-05-27 trace-replay A/B/C (commit ef9e010) shows the kv_both substrate
is net positive on current codebase, not just neutral:
  - TTFT p90: 11.97s plain → 9.74s kv_both (−18.6%) → 7.58s with DR-fix (−36.6%)

This reverses the elastic_migration_v2 paper's +45% kv_both penalty claim
and removes the primary cause of the 4 prior migration reverts.

Reframes EAR Pillar 2 from "DEFERRED" to "PARTIAL" — substrate verified,
e2e strategy-layer validation (trigger thresholds + target selection in
the dispatch-coupling feedback loop) remains as the only open risk.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:17:31 +08:00
ef9e0102ec Connector tax: trace-replay confirms +45% kv_both penalty is gone; DR-fix adds 22% more
Re-runs the elastic_migration_v2 trace (w600 r0.0015 st30, 1214 reqs,
274 sessions, 8×TP1 vLLM + cache_aware_proxy) with three configs:
- plain unified
- unified + Mooncake kv_both
- unified + Mooncake kv_both + DR-fix (env-gated O(|cache|) hash sync removal)

TTFT p90: 11.97 s → 9.74 s (−18.6%) → 7.58 s (−36.6% vs plain)
E2E p90:  23.48 s → 21.25 s (−9.5%) → 17.93 s (−23.6% vs plain)

Two findings:
1. The "+45% kv_both penalty" claim from elastic_migration_v2 is OBSOLETE
   on current codebase — kv_both is now *faster* than plain at p90.
   Likely fixed by e3a1d70 (RDMA-READ → bootstrap PUSH refactor) and
   the connector-mode delay_free_blocks extending cross-turn prefix
   cache hits on a 93%-intra-session-reuse trace.
2. DR-fix removes another 22% from TTFT p90 by skipping the
   O(|cache|) hash sync in build_connector_meta. Cache-sweep with
   DR-fix shows slope drops from +94.5 to +2.3 μs/1k blocks.

Adds:
- run_trace_replay_drfix.sh: A/B/C harness (env CT_DR_FIX gates patch)
- analyze_trace_replay.py: TTFT/TPOT/E2E delta analysis
- REPORT_TRACE_REPLAY.md: summary + reproduction
- results/20260526_1627_drfix/: cache-sweep with DR-fix
- results/trace_replay_20260526_1652/: full trace-replay A/B/C

Implication for EAR paper: the kv_both substrate is no longer the
bottleneck blocking session migration. The prior 4 migration reverts
were dominated by transfer overhead that has now been characterized
and (partially) removed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:13:50 +08:00
df0ee5a02b Use PNG for KV memory wall figure; switch outline to inline image embeds
- Convert figs/f4b_pdsep_kv_wall.pdf to PNG via pdftoppm @ 150 DPI so
  MEETING.md and PAPER_OUTLINE.md render the figure inline on GitHub /
  any standard markdown viewer (PDF ![]() embeds don't render).
- PAPER_OUTLINE.md F2, F4, F6: switch from backtick code references to
  proper ![]() image embeds so the doc is actually viewable as a deck.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:13:26 +08:00
0bb97c9dca Add EAR meeting pitch doc
Minimal one-page sell doc for advisor meeting. Leads with dispatch
coupling insight + 8x amplification number, then workload chars,
three baseline failure modes, EAR two-pillar design, progress/TODO/risk.

Uses the 8 figs already in figs/. Migration Pillar 2 explicitly marked
as design-complete-validation-pending (the 4 prior reverts + DR-fix
context).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 01:48:53 +08:00
52cdb80367 EAR outline: copy reusable figures, mark migration sections deferred
- replayer/replay.py: emit trace_span_s and amplification in summary
  (Phase 1 of the wall-clock amplification measurement plan; needed for
  §2.3 dispatch coupling empirical closure)
- figs/: 8 reusable figures copied from analysis/ with paper-spec names
  (f2a/b/c workload, f4a/b/c/d failure modes, f6 e2e partial)
- PAPER_OUTLINE.md: real figure paths, explicit TBD markers for
  custom drawings and pending data; new "Validation Status" table at top
  and reorganized "Work Plan" splitting can-do-now vs migration-deferred

Migration validation deferred per user: 4 prior attempts (6b255fa,
e991960/5772149, cc6e562, 4c583f2) were reverted due to transfer
overhead; pending re-test on top of connector_tax DR-fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 01:44:13 +08:00
e2f94495a1 EAR paper outline: anchor + dispatch coupling motivation
Initial 8-section outline for "Elastic Affinity Router" — agentic LLM
scheduler with session-affinity routing + hot-triggered session migration.

Centerpiece is §2.3's dispatch coupling argument: agentic workloads close
Little's Law on themselves (no human think-time), so per-turn W enters Λ,
amplifying small latency differences into throughput differences. This is
the intellectual hook the design hangs on.
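One way to read the coupling argument is through the closed-loop form of Little's Law: with N sessions that each keep at most one turn in flight, the turn rate is Λ = N / (W_turn + T_external). The sketch below is an illustration under that assumption; the session count, latencies and gap values are made up and are not measurements.

```python
def closed_loop_turn_rate(n_sessions, w_turn_s, t_external_s):
    """Turns per second when each session issues its next turn only after the
    previous one finishes and an external gap has elapsed."""
    return n_sessions / (w_turn_s + t_external_s)


N = 100  # hypothetical number of concurrent sessions


def speedup_from_halving_w(t_external_s, w_slow_s=10.0, w_fast_s=5.0):
    """Throughput gain from halving per-turn latency at a given external gap."""
    slow = closed_loop_turn_rate(N, w_slow_s, t_external_s)
    fast = closed_loop_turn_rate(N, w_fast_s, t_external_s)
    return fast / slow


short_gap_gain = speedup_from_halving_w(t_external_s=1.0)   # tool-call-like gap
long_gap_gain = speedup_from_halving_w(t_external_s=30.0)   # think-time-like gap
print(round(short_gap_gain, 3), round(long_gap_gain, 3))
```

When the external gap is short, W_turn dominates the denominator, so a latency difference between schedulers shows up almost one-for-one as a throughput difference; a long gap dilutes it.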

§3 attacks three baselines on three orthogonal failure modes (load-balance
loses locality, static PD-disagg hits D-side KV wall, pure sticky creates
hot pin). §4 frames EAR as the single scheduler that addresses all three.

All figures and several numbers (T_hot, T_cool, EAR wall-clock factor) are
TBD — see Open Items at bottom.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 01:24:02 +08:00
31cf8c9b11 DR-fix A/B: env-gate hash sync drops slope from +81 to -0.7 μs/1k blocks
Adds an env-gated skip for the per-step `set(cache.keys())` walk in
MooncakeConnectorScheduler.build_connector_meta() that was introduced
in our own commit a7df84b (Direct RDMA read). Re-runs the cache_sweep
A/B with three configs: plain (control), mooncake_both (baseline), and
mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).
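The shape of such an env gate can be sketched as follows. Only the variable name VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC comes from this commit; the function, its arguments and the returned delta are hypothetical stand-ins and do not reproduce the vLLM or Mooncake connector code.

```python
import os

ENV_FLAG = "VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC"


def sync_cached_hashes(cache, last_synced, env=None):
    """Return (added, removed) block hashes since the previous call.

    When the env flag is set to "1" the full walk is skipped and None is
    returned, which is the behaviour the A/B run toggles.
    """
    env = os.environ if env is None else env
    if env.get(ENV_FLAG) == "1":
        return None

    current = set(cache.keys())  # O(|cache|) walk on every call
    added = current - last_synced
    removed = last_synced - current
    last_synced.clear()
    last_synced.update(current)
    return added, removed


cache = {"h1": object(), "h2": object()}
seen = {"h0", "h1"}
delta = sync_cached_hashes(cache, seen, env={})
gated = sync_cached_hashes(cache, set(), env={ENV_FLAG: "1"})
print(delta, gated)
```

Passing `env` explicitly keeps the sketch testable without mutating the process environment.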

Files:
  apply_direct_read_fix.py  one-line env-gate patch (markered revert)
  run_drfix.sh              orchestrator for plain + mooncake_both + drfix
  analyze.py                extended to compare mooncake_both_drfix vs plain
                            and mooncake_both vs mooncake_both_drfix
  REPORT_DRFIX.md           findings
  results/20260526_1543_drfix/ run artifacts

Headline:

  config                | slope (μs/1k blocks) | step_dur p50 @ 16.6k
  ----------------------|----------------------|---------------------
  mooncake_both         | +81.0                | 1 550 μs
  mooncake_both_drfix   | -0.7  (≈ 0)          |    95 μs
  plain (control)       | -1.8  (≈ 0)          |    72 μs

  build_meta p50 @ 16.6k blocks:
    mooncake_both        = 1 459 μs
    mooncake_both_drfix  =     6 μs    (residual loop bookkeeping)

  worker get_finished p50:
    mooncake_both        = 178 μs    (unchanged; this fix doesn't touch it)
    mooncake_both_drfix  = 183 μs

The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at
|cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within
±50 μs across the full cache range — that's noise-level. The slope
goes from +81 to essentially zero.

Worker-side get_finished (180 μs constant) is unchanged because the
DR-fix touches scheduler.build_connector_meta only. That's the next
target if we want to bring kv_both fully back to plain-level.

Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
  before: build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
  after DR-fix: build_meta 6 μs + get_finished 180 μs = ~0.19 ms/step
  → 85% reduction in per-step connector cost
  → TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step
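The extrapolation is plain arithmetic and can be checked with the quoted inputs (build_meta and get_finished in μs, a 7 ms decode step). The function name is illustrative.

```python
def connector_cost(build_meta_us, get_finished_us, decode_step_ms=7.0):
    """Return (connector cost per step in ms, cost as a fraction of a decode step)."""
    per_step_ms = (build_meta_us + get_finished_us) / 1000.0
    return per_step_ms, per_step_ms / decode_step_ms


before_ms, before_frac = connector_cost(1060, 180)
after_ms, after_frac = connector_cost(6, 180)
reduction = 1.0 - after_ms / before_ms

print(f"before: {before_ms:.2f} ms/step, +{before_frac:.0%} on a 7 ms step")
print(f"after:  {after_ms:.2f} ms/step, +{after_frac:.0%} on a 7 ms step")
print(f"reduction in per-step connector cost: {reduction:.0%}")
```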

Confirms: the entire O(|cache|) slope was introduced by our own
direct-RDMA-read implementation (commit a7df84b), not upstream
Mooncake. Production fix: gate the sync on the presence of any
direct_read consumer, or replace per-step diff with an incremental
delta listener fed by block_pool add/remove callbacks.
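The incremental alternative can be sketched as a small tracker that accumulates changes between scheduler steps. The class and method names are hypothetical; how block_pool would actually expose add/remove callbacks is not specified here, so this only illustrates the bookkeeping.

```python
class BlockDeltaTracker:
    """Accumulates cache changes between steps so that each step costs
    O(number of changes) instead of O(|cache|)."""

    def __init__(self):
        self._added = set()
        self._removed = set()

    def on_block_added(self, block_hash):
        # A block removed and re-added within one step is a net no-op.
        if block_hash in self._removed:
            self._removed.discard(block_hash)
        else:
            self._added.add(block_hash)

    def on_block_removed(self, block_hash):
        # A block added and removed within one step is a net no-op.
        if block_hash in self._added:
            self._added.discard(block_hash)
        else:
            self._removed.add(block_hash)

    def drain(self):
        """Return the pending (added, removed) sets and reset the tracker."""
        added, removed = self._added, self._removed
        self._added, self._removed = set(), set()
        return added, removed


tracker = BlockDeltaTracker()
tracker.on_block_added("a")
tracker.on_block_added("b")
tracker.on_block_removed("a")  # cancels the earlier add of "a"
tracker.on_block_removed("c")  # "c" was cached before this step
first = tracker.drain()
second = tracker.drain()
print(first, second)
```

A step with no cache changes drains two empty sets, which is the flat per-step cost the sweep shows for plain and noop.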

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 00:03:23 +08:00
8829928fc5 Cache-size sweep: build_meta is O(|cache|), +85.6 μs / 1k blocks
Follow-up to Microbench 3 that finally tests H5 (cache-size
dependence) and instruments worker-side connector callbacks the
original patch missed.

Patch v2 (apply_step_timing_v2.py) adds:
  scheduler: `cache_size` field in engine_step.jsonl
  worker:    `get_finished_us` + `start_load_kv_us` in worker_step.r0.jsonl
  uses BLOCK_BEGIN/END sentinels for safe multi-line revert
  (the original v1 patch survives this v2's apply/revert cycle)

Driver: continuous open-loop (1.5 req/s, 4096x256 random per req)
that lets APC fill from 0 → ceiling within one vLLM lifetime so a
single run produces the full cache_size sweep. Decode-only steps
are filtered post-hoc to remove prefill-mix variance.

Findings (H20 96GB, ceiling reached ~17.5k blocks; n=15-18k decode
steps per config):

  config         | slope (μs / 1k blocks) | step_dur p50 @ |cache|=16.6k
  ---------------|------------------------|-----------------------------
  mooncake_both  | +85.6                  | 1528 μs (build_meta=1442, 94%)
  noop_connector | -0.8 (≈0)              |  79 μs
  plain          | +1.0 (≈0)              |  84 μs

  Worker-side get_finished p50/p90/p99 (μs/step):
    mooncake_both:  180 / 257 / 333
    noop_connector:   0 /   0 /   2

H5 PASSES. mooncake_both step_duration scales linearly with |cache|
because build_connector_meta walks set(cache.keys()) every step
(`mooncake_connector.py:434-450`). plain and noop are flat.
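One way to obtain a slope in μs per 1k blocks is an ordinary least-squares fit of step duration against cache size. Whether analyze.py computes it exactly this way is not stated, so treat the sketch as an assumption; the samples are synthetic.

```python
def slope_us_per_1k_blocks(samples):
    """Least-squares slope of step duration (us) vs cache size (blocks),
    scaled to microseconds per 1000 blocks."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return 1000.0 * sxy / sxx


# Synthetic decode-only steps: 100 us base cost plus 0.0856 us per cached block.
linear = [(blocks, 100.0 + 0.0856 * blocks) for blocks in range(0, 18000, 2000)]
flat = [(blocks, 84.0) for blocks in range(0, 18000, 2000)]

linear_slope = slope_us_per_1k_blocks(linear)
flat_slope = slope_us_per_1k_blocks(flat)
print(round(linear_slope, 1), round(flat_slope, 1))
```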

The previously-uninstrumented get_finished() adds a constant
180 μs/step on top — two `run_coroutine_threadsafe(...).result()`
blocking waits in kv_both mode (`mooncake_connector.py:1107-1137`)
fire every step even when no transfer is pending.

Trace-replay reconciliation (APC ≈ 79% → |cache| ≈ 13k blocks):
  build_meta @ 13k ≈ 1060 μs + get_finished ≈ 180 μs = 1.24 ms/step
  On ~7 ms decode forward → +15-20% TPOT per step.
  This explains most of the trace-replay +25% TPOT p90 gap from
  single-instance per-step cost alone, leaving a smaller residual
  for multi-instance coupling than originally assumed.

Two clear fixes pointed out in REPORT.md:
  1. replace O(|cache|) per-step walk with incremental delta
     listener using block_pool's add/remove callbacks
  2. short-circuit get_finished() when both producer/consumer
     queues are empty in kv_both

Heavy raw artifacts (engine_step.jsonl, vllm_stdout/stderr,
.vllm.pid) are .gitignored — they re-derive from `bash run_all.sh`
and SUMMARY.md / per_config.json fully capture the conclusions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 23:34:21 +08:00
54de78eb11 Connector tax RESULTS.md: errata + run-to-run variance disclosure
The prior write-up presented one specific reading of the data as
the headline without flagging methodology gaps. Three corrections:

1. The "0% low-concurrency tax" comes from a single back-to-back
   mooncake_both_v2/plain_v2 rerun. The original Phase A pair
   showed TTFT p90 +29%, TPOT p90 +54%, E2E p90 +55% at rate=2,
   a swing of roughly 30 to 55 percentage points between two consecutive runs
   that the original write-up did not call out. The run-to-run
   noise floor is too high to claim "0%" at low concurrency.

2. get_finished() was never instrumented. The patch only times
   step_duration_us and build_meta_us. "100% of per-step cost is
   build_meta" is an upper bound on what was timed, not a true
   decomposition.

3. H5 (cache-size dependence) was the central hypothesis but
   was never tested in the prior run; random content kept APC
   near empty.

The +7-9% high-concurrency (single instance, 512x64, rate=8-16)
and +17% 8-instance-saturated numbers are kept; they were
measured with adequate sample sizes and are reproducible.

The follow-up sweep in cache_sweep/ tests H5 directly and
revises the decomposition.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 23:33:01 +08:00
e3480f7d28 8-instance connector tax: +2% at non-saturated, +17% only at saturation
8×TP1 + load_only proxy, shape 512×64, rates 32/64/128 req/s total:

  Rate=32 (non-saturated, thr=0.95-0.97):
    plain TTFT p90=64ms,  mooncake_both=65ms  → +2% (noise)
  Rate=64 (non-saturated, thr=0.96):
    plain TTFT p90=114ms, mooncake_both=107ms → -6% (noise)
  Rate=128 (saturated, thr=0.70-0.71):
    plain TTFT p90=702ms, mooncake_both=822ms → +17%
    plain TTFT p50=339ms, mooncake_both=470ms → +39%

Conclusion: The elastic_migration_v2 +45% is a saturation artifact.
Under SLO-compliant load (TTFT<10s, thr_ratio>0.9), mooncake_both's
1.4ms/step build_connector_meta overhead is completely masked by the
scheduler-model async pipeline. The tax only manifests when the system
is already saturated and queueing amplifies per-step differences.

For practical deployment: enabling kv_role=kv_both has effectively zero
cost as long as the serving system stays within SLO capacity bounds.
2026-05-26 21:32:46 +08:00
c8ec73c548 Connector tax: high-concurrency confirms +7-9% tax, resolves trace-replay gap
High-concurrency test (512 input, 64 output, rates 4-32 req/s):
  Rate=8:  plain TTFT p90=94ms, mooncake_both=102ms → +9% tax
  Rate=16: plain TTFT p90=144ms, mooncake_both=156ms → +8% tax
  Rate=32: both saturated at ~6.1s → no distinguishable difference

Low-concurrency back-to-back retest (4096 input, 256 output):
  mooncake_both_v2 vs plain_v2: tax is ≈0% (within noise)
  because scheduler's 1.4ms/step is hidden behind model forward.

Decomposition of trace-replay's +45%:
  +7-9% from build_connector_meta per-step cost (this microbench)
  +20-30% from multi-instance coupling amplification (not measurable here)
  remainder from large-cache O(|cache|) scaling (Phase B follow-up)

Also: bench_loop.py now emits mean/p50/p90/p99 for all three metrics.
2026-05-26 21:00:25 +08:00
a473c71cac Connector tax Phase A: build_connector_meta is 1.4ms/step (the tax source)
Per-step timing from engine_step.jsonl definitively resolves H3:
  plain:            53 μs/step (p50)
  noop_connector:   69 μs/step (+16 μs = negligible framework cost)
  mooncake_producer: 1461 μs/step (build_connector_meta = 1386 μs)
  mooncake_both:    1452 μs/step (same as producer)

The substrate tax is NOT in the v1 framework — it's specifically in
Mooncake's build_connector_meta() which walks set(cache.keys()) every
scheduler step (O(|cache|) per step, E2 audit §6.5).

Accumulated per-request tax: 256 decode steps × 1.4ms = 358ms.
Observed TTFT tax at rate=1.0: plain 378ms vs mooncake_both 422ms (+12%).
At rate=2.0 (near saturation): +29%, approaching trace-replay's +45%.
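
The arithmetic above can be re-derived with a short sketch. The inputs are
the figures quoted in this message; the two function names are hypothetical
and are not part of bench_loop.py.

```python
def accumulated_tax_ms(decode_steps: int, per_step_ms: float) -> float:
    """Total scheduler-side overhead if every decode step pays per_step_ms."""
    return decode_steps * per_step_ms


def relative_tax(plain_ms: float, taxed_ms: float) -> float:
    """Relative increase of taxed over plain, as a fraction."""
    return (taxed_ms - plain_ms) / plain_ms


print(round(accumulated_tax_ms(256, 1.4), 1))  # 358.4
print(f"{relative_tax(378, 422):+.0%}")         # +12%
```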

Also fixes kill_vllm() to properly kill EngineCore subprocesses.
2026-05-26 19:33:15 +08:00
297fed6e73 Microbench 3 (connector_tax): infrastructure for KV connector substrate tax
Validates the elastic_migration_v2 finding that kv_role=kv_both adds
TTFT p90 +45% even when PD-sep never fires. Replicates under
single-instance, synthetic, open-loop workload to disambiguate
mechanism cost from 8-instance feedback amplification.

Configurations (8):
  plain, noop_connector, mooncake_{producer,consumer,both},
  nixl_both, lmcache_only, multi_mooncake_lmcache.

Pre-flight verification gates risky configs (kv_consumer needs dummy
bootstrap, multi-connector composition, NoOp custom class loading).

Workload: two-phase sweep
  Phase A: rate {0.5..32} req/s × shape (4096, 256), saturation criteria
  Phase B: ref_safe rate × cartesian (input ∈ {512,4k,32k}, output ∈ {64,256,1024})

Step-timing patch enriches vLLM's existing AGENTIC_STEP_LOG_PATH emit
with step_duration_us and build_meta_us — directly measures per-step
substrate cost, not just user-visible TTFT/TPOT.

run_all.sh runs as 5-stage barrier:
  0 pre-flight + apply patch
  1 Phase A all configs
  2 pick ref_safe / ref_load
  3 Phase B all configs
  4 revert patch + analyze + plot

Outputs aggregate.{json,csv}, MANIFEST.tsv, and 5 figures.
Estimated runtime: 4-5.5 hours on idle dash0 H20.
2026-05-26 17:27:41 +08:00
3fdcec9c0f Fix review P2s: lockfile, model path convention, trap robustness
- Regenerate uv.lock after adding fastapi/uvicorn deps so uv sync
  --locked no longer fails
- B3 scripts: default MODEL to $HOME/models/... matching documented
  convention and other launch scripts (repo has no models/ directory)
- launch_elastic_p2p: append || true to each trap command so set -e
  doesn't abort cleanup when jobs -p is empty and EngineCore orphans
  remain
2026-05-26 16:05:43 +08:00
dc6d24d1ca Add NIXL substrate isolation control + attribution decomposition
Adds unified_nixl_both to elastic_migration_v2: same picker as
unified_kv_both (never triggers PD-sep), but launches vLLM with
NixlConnector instead of MooncakeConnector. Compared against plain
unified and unified_kv_both (Mooncake) we can now attribute the
substrate overhead between "v1 connector framework irreducible
cost" (proxied by the leaner NIXL) and "Mooncake implementation
extra" (Mooncake - NIXL).

Result (vs plain unified, both substrates never PD-sep):

   metric          plain    NIXL          Mooncake
   TTFT p90        7.35s    +37.9%        +45.3%      (NIXL: +7pp better)
   TPOT p90        17.1ms   +15.5%        +24.5%      (NIXL: +9pp better)
   E2E p90         18.03s   +17.4%        +27.0%      (NIXL: +10pp better)
   hotspot         3.667    +0.2%         +19.0%      (NIXL: keeps it flat)
   APC             79.4%    -0.3pp        -1.1pp
   interference    -        5.58          8.57         (NIXL: ~35% lower)

The cleanest signal is hotspot: NIXL preserves plain-unified's
distribution (3.674 vs 3.667), while Mooncake's per-scheduler-step
O(|cache|) `set(self._block_pool.cache.keys())` diff against
_known_hash_keys (mooncake_connector.py:432-456) inflates routing
imbalance by 19%. The hash sync runs unconditionally even when no
direct_read consumer is present.
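
A minimal sketch of why the per-step cost scales with cache size. FullScanSync
mirrors the pattern described above (rebuild the key set, diff against the
known keys). EventSync is a hypothetical alternative, not something in
Mooncake, and it assumes the cache can report inserts and evictions as they
happen. All class and method names here are illustrative.

```python
class FullScanSync:
    """Rebuilds the key set every step: O(|cache|) work even if nothing changed."""

    def __init__(self):
        self.known = set()

    def step(self, cache: dict):
        current = set(cache.keys())  # full walk of the cache
        added, removed = current - self.known, self.known - current
        self.known = current
        return added, removed


class EventSync:
    """Hypothetical alternative: track net changes, so a step costs O(|delta|)."""

    def __init__(self):
        self.added, self.removed = set(), set()

    def on_insert(self, key):
        # An insert cancels a pending removal of the same key.
        if key in self.removed:
            self.removed.discard(key)
        else:
            self.added.add(key)

    def on_evict(self, key):
        # An eviction cancels a pending insert of the same key.
        if key in self.added:
            self.added.discard(key)
        else:
            self.removed.add(key)

    def step(self):
        delta = (self.added, self.removed)
        self.added, self.removed = set(), set()
        return delta
```

Both variants report the same (added, removed) pair per step as long as every
cache mutation reaches EventSync; the difference is only in how much work an
idle step performs.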

Attribution: NIXL-plain ~= v1 framework irreducible cost (kv_buffer
GPU memory, per-step SchedulerOutput.kv_connector_metadata
round-trip, altered kv_cache_manager block-lifecycle). Mooncake-NIXL
~= Mooncake-specific overhead (the hash-sync loop and stricter
delay_free semantics).

Practical implication: NIXL is meaningfully better than Mooncake on
this stack, but even NIXL imposes 16-38% across metrics — too
expensive for selective-PD-sep on agentic workloads where the
trigger rate is < 0.5%.

Launch fixes required for NIXL multi-instance:
- VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default
  5600; we use 5600+i). Without this, 7 of 8 instances silently hang
  with `zmq.error.ZMQError: Address already in use`, and the launcher
  trap kills all of them at the health-check timeout.
- Health-check timeout raised from 180s to 360s; NIXL initialization
  (UCX agent + memory registration) is ~100-150s per instance under
  8-way concurrent load, vs Mooncake's ~30-60s.

New figure: fig_connector_substrate_attribution.png stacks plain /
framework / Mooncake-extra / v2-branch overhead per metric.
Existing figures (fig_kv_both_overhead, fig_three_way_hotspot)
updated to include NIXL as a fourth bar.

README updated with 4-way table, Result 1 reframed as "the cost is
mostly framework, not Mooncake — but Mooncake adds the hotspot
penalty", and the substrate-vs-PD-sep tradeoff math.

Refs: nixl_connector.py:700 handshake listener bind, factory.py
register_connector for the NixlConnector entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 16:02:12 +08:00
645b067dd4 Fix review bugs: PD-sep counter leaks, hardcoded paths, missing deps
Critical:
- cache_aware_proxy: _handle_pd_sep leaked p_inst.num_requests (never
  decremented) and never managed d_inst.num_requests; fix media_type
  from application/json to text/event-stream for SSE stream

High:
- b3_sweep/b3_isolated_policy/b3_analyze: replace hardcoded
  /home/admin/cpfs/wjh/ ROOT with script-relative $(dirname "$0")/..
- b3_analyze: replace hardcoded 8-port WORKER_MAP with dynamic
  generation from BASE_PORT and N_INSTANCES

Medium:
- analyze_breakdown: warn on stderr when records are skipped (was silent)
- deploy_vllm_patches: fail-fast on SSH/SCP errors instead of
  continuing with empty VENV_SITE
- pyproject.toml: declare fastapi and uvicorn as runtime dependencies
- launch_elastic_p2p: kill EngineCore and proxy in trap handler to
  prevent GPU memory leaks on exit
2026-05-26 15:54:55 +08:00
0eb49dcc34 Fix NIXL multi-instance port conflict: per-instance SIDE_CHANNEL_PORT
NIXL's _nixl_handshake_listener (vllm/distributed/kv_transfer/
kv_connector/v1/nixl_connector.py:700) binds a ZMQ ROUTER socket on
the side_channel_port, which defaults to 5600. When 8 NIXL vLLMs
launch concurrently on the same host, all 8 race for tcp://localhost:5600;
exactly one succeeds and the others silently hang in the listener
thread with:

    zmq.error.ZMQError: Address already in use (addr='tcp://localhost:5600')

The engines themselves never reach "Application startup complete"
and the b3_isolated_policy.sh health-check times out. First observed
when 7 of 8 inst_X.log files contained the ZMQ error and the 8th
(by random ordering) was the one healthy instance.

Fix: set VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600 + i)) per instance in
the NIXL launch branch. Each engine now gets a distinct handshake
port (5600..5607 by default). Verified: all 8 instances now reach
"Application startup complete" within the 360 s health budget.

This is NIXL-specific; Mooncake uses VLLM_MOONCAKE_BOOTSTRAP_PORT
which we were already varying per instance.
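
The same port assignment, rendered in Python for illustration (the actual
fix lives in the shell launcher). The environment variable name is the one
given above; the helper name and the in-process env dict are assumptions.

```python
import os


def nixl_instance_env(i: int, base_port: int = 5600) -> dict:
    """Copy the current environment and give instance i its own handshake port."""
    env = dict(os.environ)
    env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(base_port + i)
    return env


# Eight instances get ports 5600..5607, so no two listeners share an address.
ports = [nixl_instance_env(i)["VLLM_NIXL_SIDE_CHANNEL_PORT"] for i in range(8)]
print(ports)
```

Each dict would be passed as the env of the corresponding vLLM subprocess.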

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 15:09:16 +08:00
151bf33541 Add unified_nixl_both policy: NIXL connector isolation control
Adds a NIXL-backed counterpart to unified_kv_both so we can attribute
the kv_both substrate overhead measured in the elastic_migration_v2
section to either Mooncake-specific code or a generic v1-connector
cost shared by all connectors.

- scripts/cache_aware_proxy.py: register --policy unified_nixl_both.
  Picker is identical to unified (and unified_kv_both); routing
  decisions never go through the PD-sep branch. Differs only at the
  vLLM launch layer.
- scripts/b3_isolated_policy.sh: new KV_CONNECTOR env var
  (Mooncake|Nixl), auto-set based on POLICY. NIXL launch path uses
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
  with no VLLM_MOONCAKE_BOOTSTRAP_PORT (NIXL uses UCX side-channels).
- Health-check timeout: 90 iterations * 2s -> 180 iterations * 2s
  (180s -> 360s). Empirically NIXL needs ~100-150s per instance to
  initialize the UCX agent and register KV cache memory; 8
  concurrent NIXL launches frequently overshoot the previous 180s
  budget. Mooncake is unaffected (still finishes well inside the new
  budget). The first 8-vLLM unified_nixl_both launch tripped the
  old timeout despite 7/8 instances reaching startup-complete.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 14:57:54 +08:00
06dd175441 Microbench 1 plots: prefill-decode interference heatmap + lines
plot_interference.py reads the interference sweep summary (4 D × 4 P × 3 reps,
cold prefill prompts) and produces:

  fig_interference_heatmap.png
    TPOT p90 interference index over (D, P): 14x at D=8 P=2k → 214x at D=1 P=32k.

  fig_interference_lines.png
    (a) TPOT p90 during prefill vs P, log-y, one line per D + baseline dashed
    (b) Cold prefill TTFT vs P (interference window length)

Confirms B2 finding: cold prefill on the same worker stalls overlapping
decodes for 14-214x baseline TPOT. The interference window grows linearly
with P (from ~140ms at 2k to ~4.6s at 32k) and is essentially independent
of decode batch size — prefill compute time dominates.
2026-05-26 14:21:30 +08:00
72790ae6c1 PD-sep server-side profiling: vLLM patches + per-request breakdown
Instrumentation patches (microbench/patches/):
  - pd_profile.py: shared event emitter (VLLM_PD_PROFILE_LOG env var)
  - apply_patches.py: idempotent patch installer for mooncake_connector.py
    and scheduler.py, marks insertions with # PD_PROFILE_PATCH
  - analyze_events.py: joins per-process JSONL event logs by transfer_id
    into per-request phase durations

Seven events captured per request:
  D_get_num_matched → P_zmq_received → P_prefill_done →
  P_rdma_start → P_rdma_end → D_recv_complete → D_request_promoted
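
A minimal sketch of the join, assuming each event record carries
transfer_id, event, and a timestamp t on a clock shared by the P and D
processes. The field names and the in-memory input are assumptions; this is
not the code in analyze_events.py.

```python
from collections import defaultdict

EVENT_ORDER = [
    "D_get_num_matched", "P_zmq_received", "P_prefill_done",
    "P_rdma_start", "P_rdma_end", "D_recv_complete", "D_request_promoted",
]


def phase_durations(records):
    """Group events by transfer_id and return gaps between consecutive events."""
    by_id = defaultdict(dict)
    for rec in records:
        by_id[rec["transfer_id"]][rec["event"]] = rec["t"]

    out = {}
    for tid, events in by_id.items():
        if any(name not in events for name in EVENT_ORDER):
            continue  # incomplete request: skip rather than guess
        out[tid] = {
            f"{a}->{b}": events[b] - events[a]
            for a, b in zip(EVENT_ORDER, EVENT_ORDER[1:])
        }
    return out
```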

Driver fix (microbench/lifecycle/driver.py):
  seed_prefix_cache now sends via the proxy URL so P and D both cache
  the seeded prefix with matching block hashes. Previously seeding D
  directly produced different block hashes than the proxy-routed
  measurement requests, making incremental transfer impossible.

Real breakdown (fig_breakdown_real.png, server_breakdown.csv, n=93):
  prefill_compute  620 ms median (95% of overhead)
  rdma_transfer     42 ms median (~71 Gbps effective)
  other overhead    10 ms median (dispatch + params + signal + promote)

Mooncake transfer is NOT the bottleneck. Even with bulk RDMA the
transfer cost is <10% of prefill cost for Qwen3-30B-A3B on H20.
2026-05-26 13:59:09 +08:00
d76eb02637 Elastic migration v2 section: PD-sep on agentic workload is net negative
New analysis/characterization/elastic_migration_v2/ packages the
unified_v2 + unified_kv_both experiments into a self-contained
results section that the paper can cite as the "we tried selective
PD-sep migration" case study. The section finds three independent
reasons PD-sep doesn't help on agentic w600:

1. Mooncake kv_both substrate alone (no PD-sep ever firing) imposes
   TTFT p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain
   unified. Per-step KVConnectorMetadata maintenance and block
   reservation semantics dominate even when no transfer is pending.
2. The PD-sep gate fires on only 0.16-0.41% of requests across two
   gate-tightness configurations; 88% and 76%, respectively, are killed by
   new_local < threshold because 93% intra-session reuse on agentic
   traces leaves a small uncached tail; 19% are killed by
   chosen_no_active_decode (snapshot-time gate). Even relaxed
   thresholds can't grow trigger rate past 0.5%.
3. When PD-sep fires, the calibrated cost model
   (0.3s + bytes / 2.7 GB/s) is wrong by 10-20x. 5 triggered
   requests in v2.1 saw realized TTFT 12-45s vs model-predicted
   migrate cost 0.7-2.2s, consistent with the E2 audit's finding
   that D-side block pre-reservation and missing layerwise
   pipelining dominate the decode_sent -> first_token clock.

Three-way comparison (unified vs unified_kv_both vs unified_v2):
v2 vs the kv_both control is roughly net-zero (-10% hotspot,
-14% TPOT p90, +3% TTFT p90, +9% TTFT p99). v2 vs plain unified is
strictly worse by 27-49% across latency percentiles because the
kv_both substrate tax is unavoidable when the policy is enabled.

Contents:
- README.md: the four results sections, the three-way comparison
  table, an explicit "what this claims for the paper" list, and a
  cross-reference index to the earlier characterization documents.
- data/: b3_policy_comparison.json + per-policy breakdown.json
  + per-policy hotspot_index.json for the four policies in scope.
- figures/: 4 PNGs rendered by render_figures.py:
  * fig_kv_both_overhead.png   — 4-metric bar chart with delta
    annotations showing kv_both alone costs +45% TTFT p90.
  * fig_v2_trigger_funnel.png  — per-reason request count for the
    two gate configurations on log scale.
  * fig_v2_predicted_vs_actual.png  — scatter of model-predicted
    migrate cost vs realized TTFT for the 5 triggered requests,
    with y=x, 10x, and 20x reference lines.
  * fig_three_way_hotspot.png  — per-worker TTFT p90 grouped bars
    across the three policies.

The section is intentionally self-contained: it lists what the
experiment validates (cost model picks correct candidates;
shadow-drift fix is necessary; same-worker interference is real)
alongside what it disproves (per-request PD-sep on agentic via
Mooncake is not a net win in current implementation).

Refs: E1/E2 subagent audits, B2 microbench, unified_v2 commits
19f69a9 / 4b833d3 / 95c8ef8.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 13:28:37 +08:00
95c8ef853c Fix proxy shadow drift: actively reconcile against vLLM /metrics
The proxy maintains shadow counters (num_requests, ongoing_tokens,
pending_prefill_tokens, ongoing_decode_tokens) used by every routing
picker. They are incremented in _handle_local_request and decremented
in the generator's finally block. When the StreamingResponse generator
never enters (client disconnect between proxy returning the response
and Starlette starting iteration, or Starlette failing before
iteration), the decrement never fires and the counter stays elevated
forever. Over a multi-hour run the shadow accumulates "phantom" load
on the affected instances and biases the router away from them.

Concrete observation that prompted the fix: during the unified_kv_both
B3 run, engine_0 sat at proxy num_requests=1 / ongoing_decode_tokens=80406
while vLLM's own /metrics reported num_running=0 num_waiting=0 and the
GPU sat at 0% utilization. Every routing decision after that point
believed engine_0 was busy with an 80k-token decode that did not exist.

Fix: extend _reconcile_loop to actively poll each instance's
/metrics every 30 s. If the proxy's num_requests has been higher than
vLLM's (running + waiting) for two consecutive cycles (~60 s of stable
drift), reduce the shadow to vLLM's truth. When vLLM is fully idle
(running=0, waiting=0), zero ongoing_tokens, ongoing_decode_tokens,
and pending_prefill_tokens as well.

Two-cycle persistence avoids correcting transient mismatches where
the proxy has just incremented for a new request that vLLM has not
scheduled yet. A single ~30 s blip is not large enough to corrupt
routing decisions; only persistent drift gets corrected.
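
The correction rule can be sketched as a pure function over one instance's
shadow state. The four counter names come from the description above; the
Shadow class, the drift_cycles field, and the reconcile signature are
hypothetical, and the /metrics polling itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Shadow:
    num_requests: int = 0
    ongoing_tokens: int = 0
    ongoing_decode_tokens: int = 0
    pending_prefill_tokens: int = 0
    drift_cycles: int = 0  # consecutive polls where shadow > vLLM truth


def reconcile(shadow: Shadow, running: int, waiting: int, persist: int = 2) -> bool:
    """Apply one poll result; return True if the shadow was corrected."""
    truth = running + waiting
    if shadow.num_requests <= truth:
        shadow.drift_cycles = 0  # transient mismatch cleared itself
        return False
    shadow.drift_cycles += 1
    if shadow.drift_cycles < persist:
        return False  # wait for the drift to persist before acting
    shadow.num_requests = truth
    if truth == 0:  # fully idle: token counters must be phantom too
        shadow.ongoing_tokens = 0
        shadow.ongoing_decode_tokens = 0
        shadow.pending_prefill_tokens = 0
    shadow.drift_cycles = 0
    return True
```

With the engine_0 observation as input (num_requests=1,
ongoing_decode_tokens=80406, vLLM idle), the first poll only records the
drift and the second poll zeroes the counters.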

The previous _reconcile_loop only clamped negatives. Phantom positives
are now caught and logged ("[reconcile] {url}: phantom drift ...").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 11:29:02 +08:00
4b833d33b7 unified_v2.1: relax gates + add unified_kv_both isolation control
v2.0 ran on B3 and triggered PD-sep only 2 / 1214 times (0.2%). The
gates were too conservative; the v2-vs-v1 latency gap (TTFT p90
7.35 -> 8.96 s) is therefore probably attributable to kv_both
always-on overhead, not to the PD-sep mechanism itself. v2.1 has two
fixes plus an isolation control.

Bug fix:
- The "chosen has live decodes worth protecting" gate combined
  num_requests and ongoing_decode_tokens with AND, falling through
  when EITHER was small. Under agentic workloads each worker rarely
  stacks more than 1-2 concurrent requests, so the gate killed 84%
  of v2.0 candidates that reached it. Replace with a pure
  ongoing_decode_tokens == 0 check ("chosen_no_active_decode") —
  same semantics, much higher recall.
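
A rough sketch of the difference between the two gates. The threshold
defaults and both function names are assumptions made for illustration; only
the AND combination and the ongoing_decode_tokens == 0 rejection come from
the description above.

```python
def old_gate_passes(num_requests, ongoing_decode_tokens,
                    min_requests=2, min_decode_tokens=1):
    # Both conditions must hold, so one small value rejects the candidate.
    return (num_requests >= min_requests
            and ongoing_decode_tokens >= min_decode_tokens)


def new_gate_passes(ongoing_decode_tokens):
    # Reject only when there is no decode work at all
    # (the "chosen_no_active_decode" reason).
    return ongoing_decode_tokens > 0


# One in-flight request with a long decode: the old gate rejects it.
print(old_gate_passes(1, 5000))  # False
print(new_gate_passes(5000))     # True
```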

Threshold relaxation (B2 microbench is the calibration source):
- pd_sep_min_new_tokens: 16000 -> 8000 (B2 TPOT idx 1.9x already
  at 8k, TTFT idx 12x — strictly worth migrating)
- pd_sep_min_decodes_protected: 2 -> 1
- pd_sep_min_src_cache_tokens: 8000 -> 4000
- pd_sep_min_extra_cache_tokens: 4000 -> 2000

Isolation control:
- New --policy unified_kv_both option. Uses the exact same picker as
  --policy unified but the vLLMs are launched in kv_role=kv_both
  (the same launch mode unified_v2 requires). PD-sep never fires.
  Compares against unified_v2 to attribute any v2 effect to the
  PD-sep branch alone, not the kv_both always-on overhead.
- Both unified_kv_both and unified_v2 auto-enable kv_both launch in
  b3_isolated_policy.sh.

Tests:
- Updated the existing "chosen has no decodes" test for the new
  gate name and semantics.
- All 24 proxy tests pass.

Refs: window_1_results/v2_breakdown analysis (88.7% of candidates
caught by old new_local_below_threshold; 84% of the remainder
caught by the old few_decodes gate).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 10:40:57 +08:00
19f69a9d2e unified_v2: selective per-request PD-sep via Mooncake (E3+E4)
Adds a sixth routing policy --policy unified_v2 that wraps the
existing unified hybrid picker with a selective PD-sep branch.
When all of the following hold, a request is split prefill-on-src,
decode-on-chosen via Mooncake kv_role=kv_both transfer:

  1. new_local = input_length - chosen.cache_hit > 16k
     (B2 microbench shows same-worker TTFT idx >= 3x from this size up)
  2. chosen has live decodes worth protecting (>= 2 in-flight)
  3. some other instance holds materially more cache for this prefix
     (>= 8k tokens, and >= 4k more than chosen)
  4. cost(src_interference + RDMA xfer) + 0.2s margin < cost(chosen_interference)

The cost model is the audit-blessed shape from E1's post-mortem:
- gate on new_tokens (post-cache), NOT input_length (the old PUSH gate)
- bind to a single transfer mechanism (kv_both peer-to-peer pull)
- realistic RDMA cost as a function of bytes: 0.3s base +
  bytes / 2.7 GB/s (calibrated against contention_16s_elastic p50)
- both source and target decode counts considered
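
A minimal sketch of the transfer-cost shape and the condition-4 comparison
above. It assumes GB means 1e9 bytes; pd_sep_wins and the example inputs are
hypothetical, and the interference terms are passed in rather than read off
the B2 curve.

```python
def estimate_transfer_cost(num_bytes, rdma_base_overhead_s=0.3,
                           rdma_effective_gb_per_s=2.7):
    """Fixed setup cost plus bytes over effective bandwidth (GB = 1e9 bytes)."""
    return rdma_base_overhead_s + num_bytes / (rdma_effective_gb_per_s * 1e9)


def pd_sep_wins(src_interference_s, xfer_s, chosen_interference_s, margin_s=0.2):
    """Migrate only if doing so beats staying put by at least the margin."""
    return src_interference_s + xfer_s + margin_s < chosen_interference_s


xfer = estimate_transfer_cost(2.7e9)  # 2.7 GB of KV, a made-up size
print(round(xfer, 3))                 # 1.3
print(pd_sep_wins(0.5, xfer, 3.0))    # True
print(pd_sep_wins(0.5, xfer, 1.9))    # False
```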

E2 mechanism-level patches not yet applied (this commit is policy-only).
Patches 6.2 / 6.3 / 6.5 remain on the table. Patch 6.6 (per-request
xfer timeout, 60s default) is implemented on the proxy side as an
httpx per-chunk read timeout on the dst streaming call, so a stuck
KV transfer fails the request instead of hanging for 600s.

cache_aware_proxy.py:
- Settings: kv_bytes_per_token, prefill_throughput_kv_both,
  rdma_base_overhead_s, rdma_effective_gb_per_s, pd_sep_* gating knobs
- estimate_transfer_cost(bytes) replaces the constant rdma_overhead_s
- estimate_same_worker_interference_s(new_tokens, num_decodes) reads off
  the B2 penalty curve in 4 bins
- pick_instance_unified_v2: inherits unified, returns extra
  (src_inst, src_idx) tuple when PD-sep wins the cost compare
- _handle_combined_pd_sep_v2: prefill on src (do_remote_decode=True,
  max_tokens=1), Mooncake xfer, decode-stream on dst with httpx
  Timeout(read=pd_sep_xfer_timeout_s)
- --policy unified_v2 added to argparse choices
- lifespan auto-runs init_prefill_bootstrap when policy is unified_v2

b3_isolated_policy.sh:
- ENABLE_KV_BOTH env var, auto-set when POLICY=unified_v2, threads
  kv_role=kv_both + VLLM_MOONCAKE_BOOTSTRAP_PORT to vllm and
  --bootstrap-ports to the proxy

Tests: 8 new unit tests cover the gating predicates and the cost
estimators; all 32 proxy tests still pass.

Refs: E1 (PUSH post-mortem) + E2 (Mooncake audit) reports.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 09:25:45 +08:00
c63dc151a0 Agentic PD / Unified routing story plan draft
User's 2026-05-25 draft aligning three threads (agentic-kv vLLM
experiments, dash0 artifacts, agentic-pd-hybrid SGLang work) into
a single story for the paper. Tracked so future iterations and
review history are in version control.

Co-Authored-By: Gahow Wang <chiahaco@gmail.com>
2026-05-26 01:12:42 +08:00
0881942cf3 Window 1 results: recompute with fixed metrics + reframe limitations
After the B3 audit bug fixes (joined_analysis hotspot median +
b3_analyze percentile interp), regenerate b3_policy_comparison.json
and the per-policy hotspot_index.json from the same raw run on
dash0 and re-render the three affected figures (apc-vs-hotspot,
latency-bars, per-worker TTFT).

Key number changes in window_1_results.md:
- hotspot_index magnitudes corrected (all five policies; lmetric
  smallest delta at +0.7%, sticky largest at +16.1%)
- "capped reduces hotspot 13%" -> "~10% (2.253 -> 2.020)"
- TTFT/E2E/TPOT percentiles shift by <1% from floor->interp
  (unified TTFT p90 7.24 -> 7.35 s)

Restructured "Caveats" into "Limitations (read this before quoting
B3 numbers)":
1. Agentic dispatch coupling is by design — promoted from caveat
   to top-level methodology framing, tied to
   agentic_dispatch_coupling.md
2. B3 interference_index is binary (not size-graded) — added
3. Hot-sweep cache contamination (<1%) — kept
4. Unified interference unrecoverable — kept with explicit warning
   not to read unified's failure attribution as causal
5. w600 is a sample, not full trace — kept
6. Reuse decomposition is per-token in expectation — added

current_results/characterization_claim_matrix.md updates:
- The "heavy-tail not sole cause" claim now cites the corrected
  ~10% drop with the median bug noted
- New supported claim: "B3 saturated-replay latency gaps include an
  agentic dispatch-coupling feedback term, which is intentional and
  matches production"; cited against agentic_dispatch_coupling.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 01:08:55 +08:00
0e82612100 Fix B3 analysis bugs from subagent audit (median + percentile + sweep)
Three fixes from the B3 audit:

1) joined_analysis.hotspot_index used sorted[n//2] as median, which
   returns the ~60th percentile for n=8 (even-length). Systematically
   under-states the hotspot index. Recomputed values:
       lmetric   2.238 -> 2.253  (+0.7%)
       load_only 1.140 -> 1.294  (+13.5%)
       sticky    2.349 -> 2.728  (+16.1%)
       unified   3.350 -> 3.667  (+9.5%)
       capped    1.937 -> 2.020  (+4.3%)
   Qualitative ranking preserved; "capped only modestly reduces hotspot"
   story holds with ~10% drop instead of the previously reported 13%.
   Added test_hotspot_index_uses_true_median_for_even_n to lock in the
   fix.

2) b3_analyze.sh's pct() helper used floor-indexed percentile
   sorted[int(p*(n-1))], inconsistent with metrics._percentile and
   joined_analysis._percentile which both use linear interpolation.
   Now matches.

3) b3_sweep.sh's capped step called run_policy "capped", but the
   proxy's argparse has no "capped" choice, so the hot-sweep variant
   would have crashed on this step. The actual capped data was
   produced via b3_isolated_policy.sh with --policy lmetric. Replace
   the broken inline call with an explicit launch_proxy lmetric +
   inline replayer block so the sweep script matches the data path
   it documents.
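
Fixes 1 and 2 can be illustrated on toy data. The helpers below are
stand-ins written for this note, not the repo's functions, and the exact
interpolation formula used by metrics._percentile is assumed to be the usual
linear one.

```python
import math
import statistics


def upper_middle(values):
    """Old behaviour: sorted[n//2], the upper of the two middles for even n."""
    s = sorted(values)
    return s[len(s) // 2]


def pct_floor(values, p):
    """Floor-indexed percentile, as in the old pct() helper."""
    s = sorted(values)
    return s[int(p * (len(s) - 1))]


def pct_interp(values, p):
    """Linear interpolation between the two neighbouring order statistics."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


workers = [1, 2, 3, 4, 5, 6, 7, 8]    # hypothetical per-worker values, n=8
print(upper_middle(workers))           # 5
print(statistics.median(workers))      # 4.5
print(pct_floor([1, 2, 3, 4], 0.5))    # 2
print(pct_interp([1, 2, 3, 4], 0.5))   # 2.5
```

If the index divides by the median, the larger "median" from sorted[n//2]
makes the ratio smaller, which matches the direction of every correction
listed in fix 1.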

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 01:08:37 +08:00
8ac41a8684 Agentic dispatch coupling: trace-replay session-sequentiality is realistic
The B3 audit flagged the trace replayer's "fire turn N+1 immediately
if turn N is behind schedule" semantics as a potential benchmark
crime, because under saturation the effective arrival process becomes
policy-dependent (slow policy -> longer session lifetimes -> more
concurrent in-flight -> harder system -> still slower). The audit
called this dispatch slip.

But in agentic workloads, turn N+1 is generated by a tool-call
response or an autonomous-loop step, not by a human reading the
previous reply. There is no inter-turn think-time. So the replayer's
"no think-time, sequential within session, fire-immediately-when-
ready" behavior is the correct model of agentic production, and the
feedback amplification is a real property of production systems
under saturation rather than an artifact of the replayer.

The note (analysis/characterization/agentic_dispatch_coupling.md)
lays out:
- The dispatch rule and the apparent feedback loop
- Why agentic workloads do not have user think-time
- Application of Little's Law: slower policy carries higher concurrent
  in-flight load, so the policy x feedback gap is real, not artifact
- Reframes B3 as the "production-replay" experiment and B4 as the
  orthogonal "controlled-load" experiment, complementary not
  hierarchical
- Calls the feedback amplification itself out as a finding worth
  reporting (e.g. unified's ~2x latency-p90 gap over lmetric in B3
  reflects both the routing improvement and the in-flight reduction)
- Contrasts with chat workloads (human think-time partially breaks
  the feedback loop, agentic removes that floor)
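
The Little's Law step can be made concrete with a toy calculation. The rate
and latencies below are invented for illustration and are not taken from B3.

```python
def in_flight(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's Law: L = lambda * W (mean number of requests in the system)."""
    return arrival_rate_per_s * mean_latency_s


# Same offered rate, two policies with different mean latency.
print(in_flight(2.0, 5.0))   # 10.0
print(in_flight(2.0, 10.0))  # 20.0
```

At a fixed arrival rate, doubling latency doubles the in-flight load the
system has to carry, which is the amplification term the note describes.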

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 01:00:25 +08:00
f784e49c07 Microbench: prefill-decode interference + PD transfer lifecycle
Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.
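
A minimal sketch of the uuid+time_ns fix from item 1, assuming the nonce is
placed at the very start of the prompt so that no leading block can match an
earlier repetition. The function name and prompt layout are hypothetical.

```python
import time
import uuid


def unique_prompt(body: str) -> str:
    """Prefix the prompt with a fresh nonce so prefix caching cannot hit."""
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    return f"[{nonce}] {body}"


a = unique_prompt("summarize the following text")
b = unique_prompt("summarize the following text")
print(a != b)  # True
```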

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
2026-05-26 00:57:06 +08:00
559faa1e26 B2 finding: TPOT idx peaks at 32k, not 65k — cost migrates to TTFT
The B2 same-worker TPOT p90 idx is non-monotone: 7.89x at 32k drops
to 2.26x at 65k. The naive reading is "interference gets weaker for
huge prefills"; the actual mechanism is a regime shift, and reading
TPOT p90 alone is misleading.

Three superimposed effects:

1. Cost migration TPOT -> TTFT. A 32k prefill is short enough that
   chunked-prefill keeps interleaving decode steps, so overlapping
   decodes trickle tokens out at painful per-token rates. A 65k
   prefill is long enough that overlapping decodes are *fully*
   blocked for ~10s; once they break through, the injection is
   winding down and subsequent iterations run unobstructed. The
   cost lands on the TTFT clock (14s) instead of inflating TPOT.

2. Bimodal TPOT distribution. At 65k overlap, decodes split into
   "blocked entire prefill then normal rate" and "trickled slowly
   through prefill chunks". p99 sits on the second population and
   grows 59 -> 169.5 ms; p90 sits on the first and shrinks.

3. "Clean" stops being clean. With 4x ~10s injections in 60s, the
   110 "clean" decodes at 65k are squeezed into 2-3s recovery
   pockets. TPOT p90 clean rises 6.9 -> 9.6 ms (40%), shrinking
   the denominator of the ratio.

window_1_results.md adds a new B2 subsection laying out the
mechanism with the per-cell data table and the explicit reading
rule: headline interference metric is TTFT idx (monotone); TPOT
p99 is the right tail indicator; TPOT p90 alone is unsafe across
regime shifts. Direct implication: TTFT and TPOT need separate
SLO thresholds under PD-colo, because they measure costs from
different points in the request lifecycle and the cost migration
between them is workload-dependent.

current_results/characterization_claim_matrix.md adds a new
supported claim for the cost migration, listed against the existing
B2 evidence. current_results/reviewer_risk_register.md adds a
low-severity entry warning future readers off TPOT p90 alone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 00:35:45 +08:00
4722883903 Audit package refresh: Window 1 supported claims + risk register
Refresh the standing audit package now that B1' / B2 / B3 are complete.

current_results/characterization_claim_matrix.md
  Flips seven entries from "not_yet_supported" / "partially_supported"
  to "supported" with pointers into window_1_results/. New entries
  cover per-session sequentiality, KV per request, real reuse
  decomposition, theoretical APC ceiling, the LMetric locality gap,
  Unified breaking the locality-vs-latency tradeoff, B2 causal
  interference proof, sticky's interference inflation, and the
  partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay
  "not_yet_supported" (Window 2 work).

current_results/main_claim_allowed_runs.md
  New "Allowed For Routing-Policy Comparison" section pins the five
  B3 policy directories. New "Allowed For PD-colo Interference"
  section pins the B2 sweep. Legacy section retained for the
  pre-instrumentation 200/500/1000-req runs.

current_results/reviewer_risk_register.md
  Marks the two old "high"-severity risks (sequentiality / reuse
  decomposition) as resolved; adds new entries for the APC
  contamination empirics, the b3_analyze.sh truncate-write bug that
  cost unified's interference index, the GPU-0 EngineCore ghost
  cleanup, the saturated-replay caveat for trace-timestamp dispatch,
  and the synthetic B2 decode workload.

current_results/all_figures_index.md
  Adds the 8 new Window 1 figures alongside the existing 6 from the
  legacy summarize_runs run.

current_results/reproduction_commands.sh
  Records the full B3 + B2 + figure pipeline.

analysis/characterization_todo_for_interns.md
  Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE;
  only B4 and B5 remain (Window 2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 23:25:27 +08:00
0c3220cbb8 Window 1 results: combined B1' + B2 + B3 report and artifacts
analysis/characterization/window_1_results.md is the headline write-up
for Window 1: workload characterization (KV per request, real reuse
decomposition, APC theoretical ceilings), B3 5-policy sweep with
per-policy interpretation, B2 same-vs-different-worker interference
microbench with causal reading, and an explicit list of what Window 1
does *not* answer (deferred to B4 SRR sweep + B5 attribution).

Under window_1_results/:
- 5 raw result JSONs from the B3 sweep, the B2 microbench, the APC
  upper bound, and the KV footprint
- per-policy hotspot_index.json snapshots so render_window1_figures.py
  can plot per-worker TTFT p90 distributions
- 8 PNG figures (figures/) covering the headline claims

Three takeaways the figures pin down:
1) intra-session reuse dominates (93.2%), so session-affinity routing
   is the right primary lever
2) unified hybrid affinity hits 79.4% APC (97% of the 79.6% intra-
   session ceiling) AND cuts TTFT p90 from lmetric's 15.6s to 7.24s
3) B2 different-worker control sits at idx ≈ 1.0 across 32× prefill-
   size variation; same-worker TTFT idx scales 2.15× -> 218×, which
   is the cleanest causal evidence for same-worker prefill-decode
   interference

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 23:25:09 +08:00
b7902061d1 Window 1 analysis: APC upper bound, B2 window-overlap, figure renderer
Three CPU-only analysis pieces that turn raw Window 1 artifacts into
publishable numbers and figures.

scripts/compute_apc_upper_bound.py
  Block-level trie walk over hash_ids to compute the theoretical APC
  ceiling on a trace, decomposed into intra-session / any-session /
  shared-prefix-only. Gives a fixed reference for what each routing
  policy could *possibly* achieve. w600 result: 79.6% intra-session,
  80.3% any-session, 0.1% shared-prefix.
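
A minimal sketch of how such a block-level trie walk could be written, not the actual contents of compute_apc_upper_bound.py. It assumes each request is a (session_id, hash_ids) pair in arrival order and that a block counts as a hit only if the whole prefix up to and including it was seen earlier. The names apc_ceiling and _match_and_insert are hypothetical, and the shared-prefix-only component is left out because its definition is not given here.

```python
from collections import defaultdict


def _match_and_insert(trie, hash_ids):
    """Count leading blocks already present in the trie, then insert the rest."""
    node = trie
    hits = 0
    for h in hash_ids:
        child = node.get(h)
        if child is None:
            # First miss: every later block lands in a fresh, empty subtree,
            # so it cannot be counted as a hit either.
            child = node[h] = {}
        else:
            hits += 1
        node = child
    return hits


def apc_ceiling(requests):
    """Return the best-case block hit fraction, intra-session and any-session.

    requests: iterable of (session_id, hash_ids) in arrival order (assumed shape).
    """
    global_trie = {}
    session_tries = defaultdict(dict)
    total = intra_hits = any_hits = 0
    for session_id, hash_ids in requests:
        total += len(hash_ids)
        intra_hits += _match_and_insert(session_tries[session_id], hash_ids)
        any_hits += _match_and_insert(global_trie, hash_ids)
    if total == 0:
        return {"intra_session": 0.0, "any_session": 0.0}
    return {"intra_session": intra_hits / total, "any_session": any_hits / total}
```

Keeping one trie per session next to a single global trie lets both ceilings come out of one pass over the trace.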

analysis/characterization/b2_sweep_analysis.py (rewrite)
  The previous version used joined_analysis.interference_index(),
  which labeled a decode as overlapping if any other request had a
  prefill at any point during that decode. With short-prompt decode
  load this is always true (everyone's prefill overlaps everyone
  else's decode); n_overlap was 239/240 even in the different-worker
  control.

  The new version labels a decode as overlapping iff its
  [t_first_token, t_finish] interval intersects an actual large
  *injection* window, computed from the cell's "prefill"-tagged metric
  rows. The different-worker control now sits cleanly at idx ≈ 1.0,
  and same-worker scales monotonically.
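
The window-overlap rule can be sketched as follows. This is an illustration under assumptions, not the code in b2_sweep_analysis.py: intervals are treated as closed, injection windows are given as (start, end) pairs already extracted from the "prefill"-tagged rows, and the function names and the dict keys on the decode rows are hypothetical.

```python
def overlaps_injection(t_first_token, t_finish, injection_windows):
    """True if the decode interval intersects any injection window.

    Two closed intervals [a, b] and [c, d] intersect iff a <= d and c <= b.
    """
    return any(
        t_first_token <= w_end and w_start <= t_finish
        for w_start, w_end in injection_windows
    )


def split_by_overlap(decodes, injection_windows):
    """Partition decode rows into (overlap, clean) lists.

    decodes: iterable of dicts with 't_first_token' and 't_finish' keys
    (assumed row shape).
    """
    overlap, clean = [], []
    for row in decodes:
        hit = overlaps_injection(
            row["t_first_token"], row["t_finish"], injection_windows
        )
        (overlap if hit else clean).append(row)
    return overlap, clean
```

Because the test is against the injection windows only, ordinary short-prompt prefills from the background decode load no longer mark every decode as overlapping.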

analysis/characterization/render_window1_figures.py
  Renders 8 PNGs from the result JSONs: B3 latency / APC vs ceiling
  / APC vs hotspot scatter / per-worker TTFT / failure breakdown,
  B2 TPOT and TTFT curves (overlap vs clean and idx), reuse
  decomposition, KV footprint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 23:24:54 +08:00
b9f324f2e6 B2 interference driver: request return_token_ids + text fallback
The first B2 run produced metrics with ttft_s=null/tpot_s=null for
every decode request because the OpenAI-style payload did not set
return_token_ids: true, and the parser only inspected
choices[0].token_ids. With token_ids missing, the loop skipped every
chunk, so no per-token timestamps were captured and the aggregator
returned interference_index=null on all 10 cells.

Fix:
- send return_token_ids: true in the payload (matches replayer.replay)
- also accept text-delta chunks as token signals (fallback for
  servers that drop token_ids despite the flag)
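
A minimal sketch of the fallback logic, assuming each streamed chunk has already been parsed into a dict and timestamped on arrival. The function names, the use of choices[0].text or choices[0].delta.content as the text signal, and the TPOT formula (mean gap between the first and last token) are illustrative assumptions, not the driver's actual code.

```python
def count_token_signals(chunk):
    """Number of tokens a chunk signals: token_ids if present, else text."""
    choices = chunk.get("choices") or []
    if not choices:
        return 0
    choice = choices[0]
    token_ids = choice.get("token_ids")
    if token_ids:
        return len(token_ids)
    # Fallback: a non-empty text delta counts as one token signal.
    text = choice.get("text") or (choice.get("delta") or {}).get("content")
    return 1 if text else 0


def latency_metrics(t_send, stamped_chunks):
    """Compute ttft_s / tpot_s from (arrival_time, chunk) pairs."""
    token_times = []
    for t, chunk in stamped_chunks:
        token_times.extend([t] * count_token_signals(chunk))
    if not token_times:
        # The failure mode described above: no chunk produced a token signal.
        return {"ttft_s": None, "tpot_s": None}
    tpot = None
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return {"ttft_s": token_times[0] - t_send, "tpot_s": tpot}
```

With the text fallback in place, a server that drops token_ids still yields per-chunk timestamps, at the cost of counting a multi-token text delta as a single token.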

vLLM engine_state was fine; only the load-gen metric capture was
broken.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 22:39:54 +08:00
df3249925b B3 analyze: prefer per-policy engine_state over slicing shared dir
The hot-sweep variant of B3 writes one shared engine_state across
all policies; the isolated variant writes per-policy. Previously
slice_engine_state.py was called unconditionally and would
overwrite an isolated policy's real data with an empty slice (the
isolated policy's run-window doesn't overlap with the shared dir's
contents).

Now we check the policy directory's engine_state for any non-empty
engine_*.jsonl first; if present, use it directly; else slice from
the shared one as before.
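
The check can be sketched in Python as below. The real logic lives in the B3 analyze script, so this is only an illustration: the function names and the returned labels are hypothetical, and the layout (an engine_state directory directly under the policy directory) is assumed from the description above.

```python
from pathlib import Path


def has_own_engine_state(policy_dir):
    """True if <policy_dir>/engine_state holds a non-empty engine_*.jsonl."""
    state_dir = Path(policy_dir) / "engine_state"
    if not state_dir.is_dir():
        return False
    return any(
        p.is_file() and p.stat().st_size > 0
        for p in state_dir.glob("engine_*.jsonl")
    )


def engine_state_source(policy_dir):
    """Decide whether to use the per-policy data or slice the shared dir."""
    return "per_policy" if has_own_engine_state(policy_dir) else "slice_shared"
```

Testing for a non-empty file, not just an existing one, matters here because an earlier empty slice would otherwise be mistaken for real per-policy data.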

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 22:19:43 +08:00
1d87082ca1 B3: cold-start isolated policy runner (clean APC per cell)
scripts/b3_isolated_policy.sh wraps one policy run in a fresh
8-instance vLLM lifecycle: hard reset -> launch -> health -> proxy
-> replayer -> snapshot artifacts -> cleanup. Used when cross-
policy APC contamination matters more than the ~25-min vLLM
warmup overhead per policy.
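
The lifecycle ordering can be sketched as a small runner that guarantees cleanup even when a middle step fails. This is a Python illustration only: the actual wrapper is a shell script, and run_isolated plus the step callables are hypothetical stand-ins for the reset, launch, health, proxy, replayer, and snapshot stages.

```python
def run_isolated(steps, cleanup):
    """Run named steps in order; always run cleanup afterwards.

    steps: list of (name, callable) pairs, e.g. hard reset, launch, health
    check, proxy, replayer, snapshot (all supplied by the caller).
    Returns the names of the steps that completed. If a step raises, the
    exception propagates after cleanup has run.
    """
    completed = []
    try:
        for name, step in steps:
            step()
            completed.append(name)
    finally:
        # Tear down the vLLM instances whether or not the run succeeded,
        # so the next policy starts from a cold state.
        cleanup()
    return completed
```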

Counterpart to the existing b3_sweep.sh, which keeps vLLM warm
across all policies (faster, but the prefix cache stays warm between
policies). The sticky pre-flight showed that contamination is < 1%
on this trace, so b3_sweep.sh stays the default.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 20:33:44 +08:00