agentic-kvc

Author SHA1 Message Date


Gahow Wang

54e1f5266a

MB5 PD ablation v2 results: concurrency axis + reuse 3-way writeup

- fig3_conc32k.json + fig3_concurrency_axis.png: agentic-corner concurrency
  sweep (in=32768, reuse=0.984, out=128), N 8->128, PD capped 600s / colo
  uncapped. colo completes 100% at every N (graceful, E2E 2.4s->81s); every
  static PD split collapses, earlier as N rises (viable only N<=16; <27% by N=32).
- analysis/mb5_pd_ablation/README.md: self-contained v2 writeup. Reuse axis
  3-way (A=d2048/o256, C=d2048/o128, B=d1024/o128) decomposes shape: output
  ~negligible, delta (real prefill/turn) dominant; crossover to colo at reuse
  ~90-95% robust. Run on dash2 (dash0 NICs faulted for Mooncake).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-01 09:35:25 +08:00

Gahow Wang

9c105cf05a

MB5 PD ablation: controlled-variable reuse/conc redo + campaign tooling

Reuse and concurrency axes redone with proper controlled variables, plus
the orchestration used to run them on dash0:

- run_reuse_fixed.sh: hold REAL prefill work (delta) constant, vary only
  cached prefix -> reuse = C/(C+U). Supersedes old fig1 (which held
  input=8192 and sliced prefix out, confounding "more reuse" with "less
  prefill").
- run_conc.sh: agentic-corner config (in=32768, delta=512, reuse=0.984,
  out=128) that exposes PD's structural KV-transfer tax. Supersedes old fig3.
- run_campaign{,2,3}.sh, backfill_d2048o128.sh: serial campaign drivers
  (strictly one driver at a time), out=128 sweeps, PD wall-cap for
  collapse-draining high-reuse arms, and flaked-arm backfill.
- mb5_run_gpu.sh: per-config bring-up / replay / teardown orchestrator.
- plot_pd_crossover.py: render the reuse_compare figures from fig_agg dumps.
- fig_agg.py: tolerate null stats from fully-collapsed arms (0 successes
  write the stat keys as null; `dict.get(k, {})` returns null, not {}).
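
The `dict.get` pitfall named in the last item can be reproduced in a few lines. This is a minimal sketch, not the code in fig_agg.py: the row shape and the key names `e2e` and `p99` are hypothetical, chosen only to show why a present-but-null key bypasses the default.

```python
import json

# Hypothetical row from a fully-collapsed arm: the stat key exists but is null.
row = json.loads('{"arm": "pd2", "e2e": null}')

# The default is used only when the key is MISSING, so this yields None, not {}.
naive = row.get("e2e", {})

# `or {}` also covers the present-but-null case.
safe = row.get("e2e") or {}


def stat(row, key, field):
    """Return row[key][field], or None when the key is missing or null."""
    return (row.get(key) or {}).get(field)


print(naive)                    # None
print(safe)                     # {}
print(stat(row, "e2e", "p99"))  # None
```

Chaining `.get` on `naive` would raise `AttributeError`, which is the failure mode a null-tolerant aggregator has to avoid.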

Data: fig1_reuse_fixed.json, fig1_reuse_d{1024,2048}_o128.json
Figs: reuse_compare_AB.png, reuse_compare_ABC.png
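
The reuse definition used by run_reuse_fixed.sh and run_conc.sh, reuse = C/(C+U), can be checked with a short calculation. The sketch below assumes U is the per-turn delta and that total input is C + U; the function names are illustrative and do not come from the scripts.

```python
def reuse_ratio(cached_tokens: int, delta_tokens: int) -> float:
    """reuse = C / (C + U): cached prefix over total input for one turn."""
    total = cached_tokens + delta_tokens
    if total <= 0:
        raise ValueError("total input must be positive")
    return cached_tokens / total


def cached_for_reuse(delta_tokens: int, reuse: float) -> int:
    """Cached-prefix length C that hits a target reuse at a fixed delta U."""
    if not 0.0 <= reuse < 1.0:
        raise ValueError("reuse must be in [0, 1)")
    return round(delta_tokens * reuse / (1.0 - reuse))


# Agentic-corner config from run_conc.sh: in=32768, delta=512.
print(round(reuse_ratio(32768 - 512, 512), 3))  # 0.984
```

Holding `delta_tokens` fixed and sweeping `reuse` changes only the cached prefix, which is the controlled-variable design described for run_reuse_fixed.sh.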

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-01 01:03:27 +08:00

Gahow Wang

fafc44da79

MB5 PD reuse-centric ablation: tooling, data, Fig 1-3

Three-axis controlled ablation of PD-colo vs PD-disagg on synthetic regular
traces (closed-loop, controlled reuse via REPLAY_NO_REALIZED_PREFIX) on the
clean stack (e13391e gated off).

  Axis 1 (Fig 1) -- reuse 6%->94% at N=8, in8192/out256
  Axis 2 (Fig 2) -- shape in2048/out2048 -> in32768/out64 at N=8, reuse~70%
  Axis 3 (Fig 3) -- concurrency N=8/16/32/64 at reuse~71%, in8192/out256

Findings:
  * APC parity colo=PD at every reuse (5.5/22/44/66/77/82%) -- contamination
    fix validated.
  * PD edge erodes 1.57x->1.10x with reuse; prefill GPUs strand 26%->9%.
  * Shape: PD-best peaks mid-sweep (1.34x at in8192/out512); wrong PD ratio
    catastrophic at prefill extreme (in32768/out64 pd2 = 378/400, p99 432s).
  * Concurrency: PD wins at N<=32 (1.23-1.29x) but tips at N=64 -- pd2/pd4
    crater (APC 71%->1.4%, TPS -30%) while colo scales cleanly.

Infrastructure:
  * replayer: --max-inflight-sessions, --inter-turn-think, --no-realized-prefix
    (env-defaulted via REPLAY_MAX_INFLIGHT, REPLAY_INTER_TURN_THINK_S,
    REPLAY_NO_REALIZED_PREFIX).
  * mb5_run.sh: writes bench_config.json + gpu_util.csv + run_window.json +
    instance_apc.txt + metrics.jsonl for bench_report/fig_agg ingest.
  * fig_agg.py: per-arm GPU role split + producer-side APC; --json mode.
  * gpu_util_report.py: companion per-GPU util report from gpu_util.csv.
  * partial_summary.py: stats from in-flight replay_metrics.jsonl
    (works before metrics.summary.json exists).
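
One common way to wire env-defaulted flags like the replayer's is to read the environment when the parser is built, so an explicit flag still overrides it. The flag and variable names below are from the list above; the types, the placeholder defaults, and the truthy-string convention are assumptions, not the replayer's actual behaviour.

```python
import argparse
import os


def _truthy(value: str) -> bool:
    """Assumed convention: 1/true/yes/on (case-insensitive) enable a flag."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def build_parser(env=None) -> argparse.ArgumentParser:
    env = os.environ if env is None else env
    p = argparse.ArgumentParser(description="replayer flags (sketch)")
    # Defaults of 0 are placeholders; the real defaults are not in the source.
    p.add_argument("--max-inflight-sessions", type=int,
                   default=int(env.get("REPLAY_MAX_INFLIGHT", "0")))
    p.add_argument("--inter-turn-think", type=float,
                   default=float(env.get("REPLAY_INTER_TURN_THINK_S", "0")))
    p.add_argument("--no-realized-prefix", action="store_true",
                   default=_truthy(env.get("REPLAY_NO_REALIZED_PREFIX", "")))
    return p


# The env supplies one default; the command line supplies another value.
args = build_parser({"REPLAY_MAX_INFLIGHT": "8"}).parse_args(
    ["--inter-turn-think", "1.5"])
print(args.max_inflight_sessions, args.inter_turn_think,
      args.no_realized_prefix)  # 8 1.5 False
```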

Data: analysis/mb5_pd_ablation/fig{1,2,3}.json (24 + 20 + 16 rows).
Figures: figs/mb5_pd_ablation/fig{1_reuse,2_shape,3_concurrency}_axis.png.
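
Computing stats from an in-flight replay_metrics.jsonl, as partial_summary.py is described as doing, mostly comes down to tolerating a half-written last line. The sketch below shows that idea only; the record fields `ok` and `e2e_s` are hypothetical and are not taken from the actual file schema.

```python
import json
import statistics


def partial_summary(lines):
    """Summarise whatever complete JSONL records exist so far."""
    ok, failed, e2e = 0, 0, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # likely a record the writer has not finished yet
        if not isinstance(rec, dict):
            continue
        if rec.get("ok"):
            ok += 1
            e2e.append(float(rec["e2e_s"]))
        else:
            failed += 1
    return {
        "ok": ok,
        "failed": failed,
        "e2e_median_s": statistics.median(e2e) if e2e else None,
    }


sample = [
    '{"ok": true, "e2e_s": 2.0}',
    '{"ok": true, "e2e_s": 4.0}',
    '{"ok": false}',
    '{"ok": true, "e2',  # truncated tail of a file still being written
]
print(partial_summary(sample))
```

Passing an open file handle instead of `sample` works the same way, since iterating a text file yields lines.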

2026-05-31 20:14:46 +08:00

3 Commits