Commit Graph

12 Commits

Author SHA1 Message Date
03d8c5d0d1 Render 4 per-policy figures on b3_replay_20260527_0114 into figs/v2/
User-provided fresh run with five policies (lmetric, load_only, sticky,
unified, plus a new unified_v2 variant). Reproduces the v1 set under
figs/v2/ so we can A/B the same panels:
  f4a_apc_loss.png         — APC bars per policy
  f4c_per_worker_ttft.png  — per-worker TTFT p90 panel per policy
  f6_e2e_latency_bars.png  — TTFT/TPOT/E2E p90 bars per policy
  f6_e2e_latency_full_grid — mean/p50/p90/p99 × TTFT/TPOT/E2E grid

scripts/render_b3_figures_v2.py is a standalone driver that reads each
policy's metrics.summary.json and breakdown.json directly from the run
directory — the breakdown.json `routed_to` field is required to recover
per-worker assignment because the new setup routes every request
through a proxy (127.0.0.1:9300), so metrics.jsonl's endpoint_url no
longer identifies the backend.

Headline numbers, new vs v1:
  APC          v2: lmetric 57.2% / load_only 53.9% / sticky 77.7%
                   unified 78.7% / unified_v2 78.4%
              v1: lmetric 56.9% / load_only 54.1% / sticky 77.2% / unified 79.4%
  TTFT p90 (s) v2: lmetric 14.8 / load_only 20.1 / sticky 14.8 /
                   unified  8.8 / unified_v2 10.1
              v1: lmetric 15.7 / load_only 20.2 / sticky 18.0 / unified 7.3
  E2E p90 (s)  v2: lmetric 25.4 / load_only 33.9 / sticky 30.3 /
                   unified 20.0 / unified_v2 24.1
              v1: lmetric 24.8 / load_only 33.5 / sticky 34.6 / unified 18.0
  Worker p90 (s, median / max)
              v2: lmetric 13.3/30.4 · load_only 21.3/29.2 · sticky 13.5/33.0
                  unified 10.0/35.1 · unified_v2 8.6/34.2
              v1: lmetric 13.9/31.3 · load_only 19.4/25.1 · sticky 20.3/55.4
                  unified 10.3/37.7

Story is unchanged: unified dominates at p90 across TTFT/E2E and on
median-worker latency; unified_v2 is competitive at p50 but slightly
worse than unified at p90.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 13:52:17 +08:00
41232f49d3 Measure inter-turn T_external on the raw production trace; add f3a CDF
The earlier conversation suggested agentic might "have no human think-time"
and therefore live in a strict closed-loop regime. The user pushed back:
tool calls also take time and might restore a chatbot-like buffer between
turns. To resolve this, we go to the actual data.

The previously-published per-record formatted trace only carries arrival
timestamps, so an arrival-to-arrival diff conflates W_turn + T_external.
The raw trace (/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/
051315-051317-raw.jsonl on dash0) additionally carries request_end_time_ms,
which lets us compute the pure inter-turn external gap
T_external = next.request_ready_time_ms - prev.request_end_time_ms
for each session's consecutive turn pair.

Headline numbers (n = 783 k inter-turn gaps over 127 k multi-turn sessions):

  p25  = 0.69 s
  p50  = 1.6  s
  p75  = 8.6  s
  p90  = 44   s
  mean = 37   s   (heavy long-tail; paused/abandoned sessions)

  39 % of gaps < 1 s
  67 % of gaps < 5 s
  87 % of gaps < 30 s

The bulk of the distribution is dominated by sub-second to a-few-seconds
tool-call latencies. Under any current scheduler (e.g. unified TTFT p90 =
7.3 s, lmetric 15.7 s), W_turn is already at or above the 75th percentile
of T_external, so dispatch coupling is the dominant regime for the
majority of turns — not a corner case.

This corrects the earlier conflated arrival-to-arrival "median gap 11 s"
figure (which folded W_turn into T_external). The true T_external median
is 1.6 s.

Adds:
- scripts/compute_inter_turn_gap_remote.py: dash0-side aggregator
- analysis/characterization/data/agentic_inter_turn_gap.json: 500-point
  CDF cache + summary stats, scp'd back from dash0
- scripts/plot_inter_turn_gap.py: local figure renderer
- figs/f3a_inter_turn_gap.png: log-x CDF with p25/p50/p75/p90 anchors and
  unified/lmetric TTFT p90 reference lines

Next step (per user): pull a chatbot trace through the same pipeline and
compare distributions side by side; this will let §2.3 stop hand-waving
about "no think-time" and instead present the regime split empirically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 12:37:32 +08:00
555cabcf1f f2c: switch to per-instance decode-concurrency view; correct KV pool ceiling
Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong — a 30B-A3B bf16
deployment burns roughly:
  ~50% HBM for model params (~48 GiB on 96 GiB H20)
  ~10% for runtime activation buffers
  ~40% left for the KV cache pool (~38.4 GiB)
so 95 GiB was overstating the available pool by 2.5×.

New f2c reframes the same data into the answer that actually motivates
the paper: how many concurrent decodes does a single instance hold,
and how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:
  Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)

Key reads off the figure:
  p50 (1.8 GiB/req): 20 fit/inst → 160 / 80 / 40 system-wide
  p90 (8.0 GiB/req):  4 fit/inst →  32 / 16 /  8
  p95 (9.6 GiB/req):  4 fit/inst →  32 / 16 /  8
  p99 (11.5 GiB/req): 3 fit/inst →  24 / 12 /  6

PD-disagg 4P+4D literally halves the decode population at the same
per-request KV pressure — this is the concrete §3.2 "KV memory wall"
penalty stated in terms users care about (concurrency).

- analysis/characterization/render_window1_figures.py:
  fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json
  but computes floor(KV_pool / req_size) × N_D and annotates the
  per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the
  new ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
  framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:28:47 +08:00
922d79ac95 Add full latency grid (mean/p50/p90/p99 × TTFT/TPOT/E2E) as f6 companion
The headline f6_e2e_latency_bars only shows p90, hiding three regimes:
  - mean: unified dominates (3.3s TTFT, 7.0s E2E vs sticky 5.6s / 12.1s)
  - p50: sticky and unified are tied on first-turn TTFT (0.5s each) —
    sticky's first turn of each session is free, after which queues
    accumulate. Unified beats sticky everywhere else.
  - p99: tail amplification reveals unified's biggest gap —
    TTFT 42.3s vs sticky 74.1s; E2E 68.8s vs sticky 139.7s.

The 12-panel figure is the honest full picture; the 3-panel headline
stays for slide-friendly summary.

- analysis/characterization/window_1_results/raw_stats/{policy}.json:
  cached ttft/tpot/e2e {mean,p50,p90,p99} pulled from dash0
  /home/admin/cpfs/wjh/agentic-kv/outputs/b3_sweep_20260525_095043/
  (b3_policy_comparison.json doesn't record mean, only percentiles).
- analysis/characterization/render_window1_figures.py:
  new fig_b3_latency_full_grid renders the 4×3 grid from the cache.
- figs/f6_e2e_latency_full_grid.png: 12-panel companion.
- PAPER_OUTLINE.md §5.2: both figures embedded; main table column
  renamed from "Hotspot idx" to "Worker p90 (median / max)" to match
  the new metric convention.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:15:18 +08:00
5e6e98aee7 Replace max/median hotspot index with (median, max) absolute pair
The max/median ratio inverts the actual user-facing p90 ranking:
  sticky:  hotspot=2.73 but system e2e p90 = 34.6s  (worst)
  unified: hotspot=3.67 but system e2e p90 = 18.0s  (best)
because sticky's median is also high (everyone slow) while unified
concentrates the damage on one worker and keeps the other 7 fast.
Any "imbalance" metric structurally punishes the affinity-then-escape
schemes that we actually want to advocate for.

Changes:
- analysis/characterization/render_window1_figures.py:
  fig_b3_per_worker_ttft now annotates each subplot with
  "median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring
  documents why we drop the ratio.
- figs/f4c_per_worker_ttft.png: regenerated with new titles.
- figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis
  was the deprecated ratio; superseded by f4c per-worker bars + f6
  e2e bars which together carry the same information honestly.
- PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8
  conclusion — replace "hotspot index" mentions with
  "worst-worker p90" or "(median, max) worker p90"; promote the
  §3.3 methodology note to a top-level sub-finding ("hot pin
  failure must be measured with per-worker absolute latency,
  not normalized ratio").
- MEETING.md: §3.3 narrative reworded to lead with the (median, max)
  pair directly; explicit one-line note on why the ratio is dropped.

Conceptual uses of "hot session" / "hot instance" / "hot pin" remain
unchanged — only the *metric* called hotspot index is retired.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:07:12 +08:00
09ff1069c3 Drop 'capped' from per-policy figures (f4a, f4c×2, f6)
'capped' is not a routing policy — it's lmetric run on a separately
truncated trace (sessions capped to 8 turns via build_capped_trace.py).
Putting it alongside lmetric/load_only/sticky/unified in per-policy
comparison figures is misleading because the workload differs, not
the routing decision. Comparing apples to a different-trace orange
inflates/deflates apparent policy gaps for the wrong reasons.

Regenerated 4 figures with --exclude-policies capped on
analysis/characterization/render_window1_figures.py:
  - f4a_apc_loss.png                 (APC bars)
  - f4c_apc_vs_hotspot_tradeoff.png  (APC vs hotspot scatter)
  - f4c_per_worker_ttft.png          (per-worker TTFT panel)
  - f6_e2e_latency_bars.png          (TTFT/TPOT/E2E bars)

Added --exclude-policies CLI flag to the renderer so this is a
reversible choice, not a permanent script mutation. capped data remains
in b3_policy_comparison.json and can be brought back in workload-
sensitivity sections (where it actually belongs) by omitting the flag.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:57:43 +08:00
74e0c2157a Add solo production-trace CDF figure (f2b_session_skew_prod.png)
Single-curve variant of f2b — production trace only, no replay overlay
and no uniform reference. Cleaner for boss-meeting/talk slides where the
extra context is noise. The combined three-curve figure is unchanged.

scripts/plot_session_skew_cdf.py: split into plot_combined +
plot_production_solo helpers; one run emits both PNGs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:53:30 +08:00
1220da249c f2b: regenerate CDF from production trace (1.3M sessions on dash0)
Pulls 456 (rank%, cum%) sample points from the raw production trace at
dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl,
cached locally so the figure is reproducible without ssh access. Sampled
anchors match the precomputed summary exactly:
  top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%
plus newly readable points:
  top 25% = 87.5%, top 50% = 96.0%

Workload characterization is now consistent with the production
distribution rather than the small replay subset. Replay window CDF kept
as an overlay to show the same hockey-stick shape on the data §5 actually
uses.

- analysis/characterization/data/production_session_skew_cdf.json: cached
  sample points (29 KB), so the figure rebuilds locally
- scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw
- MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace,
  add top-25%/50% data points

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:41:53 +08:00
22c4aa58e4 f2b: replace top-1/5/10% bars with full CDF; align all docs to replay-trace numbers
The previous f2b_session_skew.png was a 3-bar chart (top 1/5/10%) computed
from the production trace summary (which is not present locally, only its
precomputed JSON). The new figure is a continuous CDF of cumulative
input-token mass vs session rank percentile, generated directly from the
replay trace traces/w600_r0.0015_st30.jsonl so any percentile is readable.

Headline numbers update accordingly:
  replay trace (n=274 sessions): top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8%
  production trace (n=1.3M):     top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6%

Both show extreme skew well above the y=x uniform reference; the replay
trace is less extreme at top-1% because n=274 makes that bucket only
~3 sessions. We standardize §2/§3 narrative on the replay-trace numbers
so motivation matches §5 evaluation; production numbers kept as a side
note for context.

- scripts/plot_session_skew_cdf.py: reproducible figure generator
- MEETING.md / PAPER_OUTLINE.md: update narrative + caption

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:37:22 +08:00
020a5c79a7 §3.3 reframe: hot pin failure is uniformly-slow workers, not max/median ratio
User pointed out the apparent paradox: in fig_b3_per_worker_ttft_p90, unified
has hotspot index 3.67 while sticky has 2.73, yet unified e2e p90 is roughly
half of sticky's. Resolution: hotspot index (max/median) is a *ratio* and
misleading on its own. Per-worker absolute TTFT p90:

  sticky : median 20.3s, max 55.4s -> system e2e p90 34.6s
  unified: median 10.3s, max 37.7s -> system e2e p90 18.0s

Mechanism: top 1% sessions own 46.5% input mass and there are more hot
sessions than instances (8), so sticky's hash binding gives *every* worker
its own hot session and the median worker is also slow. Unified's LMetric
fallback re-routes cold/new sessions away from hot affinity instances,
preserving 7/8 worker speed. System p90 is dominated by the majority of
requests landing on fast workers, hence the 2x e2e gap.

Changes:
- Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars)
  instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around
  absolute median + max + system e2e p90 instead of hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with
  per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for §5 multi-policy summary

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:10:23 +08:00
df0ee5a02b Use PNG for KV memory wall figure; switch outline to inline image embeds
- Convert figs/f4b_pdsep_kv_wall.pdf to PNG via pdftoppm @ 150 DPI so
  MEETING.md and PAPER_OUTLINE.md render the figure inline on GitHub /
  any standard markdown viewer (PDF !() embeds don't render).
- PAPER_OUTLINE.md F2, F4, F6: switch from backtick code references to
  proper ![]() image embeds so the doc is actually viewable as a deck.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:13:26 +08:00
52cdb80367 EAR outline: copy reusable figures, mark migration sections deferred
- replayer/replay.py: emit trace_span_s and amplification in summary
  (Phase 1 of the wall-clock amplification measurement plan; needed for
  §2.3 dispatch coupling empirical closure)
- figs/: 8 reusable figures copied from analysis/ with paper-spec names
  (f2a/b/c workload, f4a/b/c/d failure modes, f6 e2e partial)
- PAPER_OUTLINE.md: real figure paths, explicit TBD markers for
  custom drawings and pending data; new "Validation Status" table at top
  and reorganized "Work Plan" splitting can-do-now vs migration-deferred

Migration validation deferred per user: 4 prior attempts (6b255fa,
e991960/5772149, cc6e562, 4c583f2) were reverted due to transfer
overhead; pending re-test on top of connector_tax DR-fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 01:44:13 +08:00