Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong — a 30B-A3B bf16
deployment burns roughly:
~50% HBM for model params (~48 GiB on 96 GiB H20)
~10% for runtime activation buffers
~40% left for the KV cache pool (~38.4 GiB)
so 95 GiB was overstating the available pool by 2.5×.
New f2c reframes the same data into the answer that actually motivates
the paper: how many concurrent decodes does a single instance hold,
and how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:
Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)
Key reads off the figure:
p50 (1.8 GiB/req): 20 fit/inst → 160 / 80 / 40 system-wide
p90 (8.0 GiB/req): 4 fit/inst → 32 / 16 / 8
p95 (9.6 GiB/req): 4 fit/inst → 32 / 16 / 8
p99 (11.5 GiB/req): 3 fit/inst → 24 / 12 / 6
PD-disagg 4P+4D literally halves the decode population at the same
per-request KV pressure — this is the concrete §3.2 "KV memory wall"
penalty stated in terms users care about (concurrency).
- analysis/characterization/render_window1_figures.py:
fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json
but computes floor(KV_pool / req_size) × N_D and annotates the
per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the
new ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
framing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The headline f6_e2e_latency_bars only shows p90, hiding three regimes:
- mean: unified dominates (3.3s TTFT, 7.0s E2E vs sticky 5.6s / 12.1s)
- p50: sticky and unified are tied on first-turn TTFT (0.5s each) —
sticky's first turn of each session is free, after which queues
accumulate. Unified beats sticky everywhere else.
- p99: tail amplification reveals unified's biggest gap —
TTFT 42.3s vs sticky 74.1s; E2E 68.8s vs sticky 139.7s.
The 12-panel figure is the honest full picture; the 3-panel headline
stays for slide-friendly summary.
- analysis/characterization/window_1_results/raw_stats/{policy}.json:
cached ttft/tpot/e2e {mean,p50,p90,p99} pulled from dash0
/home/admin/cpfs/wjh/agentic-kv/outputs/b3_sweep_20260525_095043/
(b3_policy_comparison.json doesn't record mean, only percentiles).
- analysis/characterization/render_window1_figures.py:
new fig_b3_latency_full_grid renders the 4×3 grid from the cache.
- figs/f6_e2e_latency_full_grid.png: 12-panel companion.
- PAPER_OUTLINE.md §5.2: both figures embedded; main table column
renamed from "Hotspot idx" to "Worker p90 (median / max)" to match
the new metric convention.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The max/median ratio inverts the actual user-facing p90 ranking:
sticky: hotspot=2.73 but system e2e p90 = 34.6s (worst)
unified: hotspot=3.67 but system e2e p90 = 18.0s (best)
because sticky's median is also high (everyone slow) while unified
concentrates the damage on one worker and keeps the other 7 fast.
Any "imbalance" metric structurally punishes the affinity-then-escape
schemes that we actually want to advocate for.
Changes:
- analysis/characterization/render_window1_figures.py:
fig_b3_per_worker_ttft now annotates each subplot with
"median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring
documents why we drop the ratio.
- figs/f4c_per_worker_ttft.png: regenerated with new titles.
- figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis
was the deprecated ratio; superseded by f4c per-worker bars + f6
e2e bars which together carry the same information honestly.
- PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8
conclusion — replace "hotspot index" mentions with
"worst-worker p90" or "(median, max) worker p90"; promote the
§3.3 methodology note to a top-level sub-finding ("hot pin
failure must be measured with per-worker absolute latency,
not normalized ratio").
- MEETING.md: §3.3 narrative reworded to lead with the (median, max)
pair directly; explicit one-line note on why the ratio is dropped.
Conceptual uses of "hot session" / "hot instance" / "hot pin" remain
unchanged — only the *metric* called hotspot index is retired.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Companion to the figure cleanup: prose in §3.1 was still quoting
"capped 31.6% APC" as one of the failure-mode datapoints. Same reason
as the figures — capped is a workload manipulation, not a policy, so
it doesn't belong in the §3.1 routing-policy narrative.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pulls 456 (rank%, cum%) sample points from the raw production trace at
dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl,
cached locally so the figure is reproducible without ssh access. Sampled
anchors match the precomputed summary exactly:
top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
plus newly readable points:
top 25% = 87.5%, top 50% = 96.0%
Workload characterization is now consistent with the production
distribution rather than the small replay subset. Replay window CDF kept
as an overlay to show the same hockey-stick shape on the data §5 actually
uses.
- analysis/characterization/data/production_session_skew_cdf.json: cached
sample points (29 KB), so the figure rebuilds locally
- scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw
- MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace,
add top-25%/50% data points
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous f2b_session_skew.png was a 3-bar chart (top 1/5/10%) computed
from the production trace summary (which is not present locally, only its
precomputed JSON). The new figure is a continuous CDF of cumulative
input-token mass vs session rank percentile, generated directly from the
replay trace traces/w600_r0.0015_st30.jsonl so any percentile is readable.
Headline numbers update accordingly:
replay trace (n=274 sessions): top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8%
production trace (n=1.3M): top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
Both show extreme skew well above the y=x uniform reference; the replay
trace is less extreme at top-1% because n=274 makes that bucket only
~3 sessions. We standardize §2/§3 narrative on the replay-trace numbers
so motivation matches §5 evaluation; production numbers kept as a side
note for context.
- scripts/plot_session_skew_cdf.py: reproducible figure generator
- MEETING.md / PAPER_OUTLINE.md: update narrative + caption
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User pointed out the apparent paradox: in fig_b3_per_worker_ttft_p90, unified
has hotspot index 3.67 while sticky has 2.73, yet unified e2e p90 is roughly
half of sticky's. Resolution: hotspot index (max/median) is a *ratio* and
misleading on its own. Per-worker absolute TTFT p90:
sticky : median 20.3s, max 55.4s -> system e2e p90 34.6s
unified: median 10.3s, max 37.7s -> system e2e p90 18.0s
Mechanism: top 1% sessions own 46.5% input mass and there are more hot
sessions than instances (8), so sticky's hash binding gives *every* worker
its own hot session and the median worker is also slow. Unified's LMetric
fallback re-routes cold/new sessions away from hot affinity instances,
preserving 7/8 worker speed. System p90 is dominated by the majority of
requests landing on fast workers, hence the 2x e2e gap.
Changes:
- Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars)
instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around
absolute median + max + system e2e p90 instead of hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with
per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for §5 multi-policy summary
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 trace-replay A/B/C (commit ef9e010) shows the kv_both substrate
is net positive on current codebase, not just neutral:
- TTFT p90: 11.97s plain → 9.74s kv_both (−18.6%) → 7.58s with DR-fix (−36.6%)
This reverses the elastic_migration_v2 paper's +45% kv_both penalty claim
and removes the primary cause of the 4 prior migration reverts.
Reframes EAR Pillar 2 from "DEFERRED" to "PARTIAL" — substrate verified,
e2e strategy-layer validation (trigger thresholds + target selection in
the dispatch-coupling feedback loop) remains as the only open risk.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Convert figs/f4b_pdsep_kv_wall.pdf to PNG via pdftoppm @ 150 DPI so
MEETING.md and PAPER_OUTLINE.md render the figure inline on GitHub /
any standard markdown viewer (PDF !() embeds don't render).
- PAPER_OUTLINE.md F2, F4, F6: switch from backtick code references to
proper ![]() image embeds so the doc is actually viewable as a deck.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- replayer/replay.py: emit trace_span_s and amplification in summary
(Phase 1 of the wall-clock amplification measurement plan; needed for
§2.3 dispatch coupling empirical closure)
- figs/: 8 reusable figures copied from analysis/ with paper-spec names
(f2a/b/c workload, f4a/b/c/d failure modes, f6 e2e partial)
- PAPER_OUTLINE.md: real figure paths, explicit TBD markers for
custom drawings and pending data; new "Validation Status" table at top
and reorganized "Work Plan" splitting can-do-now vs migration-deferred
Migration validation deferred per user: 4 prior attempts (6b255fa,
e991960/5772149, cc6e562, 4c583f2) were reverted due to transfer
overhead; pending re-test on top of connector_tax DR-fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Initial 8-section outline for "Elastic Affinity Router" — agentic LLM
scheduler with session-affinity routing + hot-triggered session migration.
Centerpiece is §2.3's dispatch coupling argument: agentic workloads close
Little's Law on themselves (no human think-time), so per-turn W enters Λ,
amplifying small latency differences into throughput differences. This is
the intellectual hook the design hangs on.
§3 attacks three baselines on three orthogonal failure modes (load-balance
loses locality, static PD-disagg hits D-side KV wall, pure sticky creates
hot pin). §4 frames EAR as the single scheduler that addresses all three.
All figures and several numbers (T_hot, T_cool, EAR wall-clock factor) are
TBD — see Open Items at bottom.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>