agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	fafc44da79	MB5 PD reuse-centric ablation: tooling, data, Fig 1-3 Three-axis controlled ablation of PD-colo vs PD-disagg on synthetic regular traces (closed-loop, controlled reuse via REPLAY_NO_REALIZED_PREFIX) on the clean stack (`e13391e` gated off). Axis 1 (Fig 1) -- reuse 6%->94% at N=8, in8192/out256 Axis 2 (Fig 2) -- shape in2048/out2048 -> in32768/out64 at N=8, reuse~70% Axis 3 (Fig 3) -- concurrency N=8/16/32/64 at reuse~71%, in8192/out256 Findings: * APC parity colo=PD at every reuse (5.5/22/44/66/77/82%) -- contamination fix validated. * PD edge erodes 1.57x->1.10x with reuse; prefill GPUs strand 26%->9%. * Shape: PD-best peaks mid-sweep (1.34x at in8192/out512); wrong PD ratio catastrophic at prefill extreme (in32768/out64 pd2 = 378/400, p99 432s). * Concurrency: PD wins N<=32 (1.23-1.29x), TIPS at N=64 -- pd2/pd4 crater (APC 71%->1.4%, TPS -30%) while colo scales cleanly. Infrastructure: * replayer: --max-inflight-sessions, --inter-turn-think, --no-realized-prefix (env-defaulted via REPLAY_MAX_INFLIGHT, REPLAY_INTER_TURN_THINK_S, REPLAY_NO_REALIZED_PREFIX). * mb5_run.sh: writes bench_config.json + gpu_util.csv + run_window.json + instance_apc.txt + metrics.jsonl for bench_report/fig_agg ingest. * fig_agg.py: per-arm GPU role split + producer-side APC; --json mode. * gpu_util_report.py: companion per-GPU util report from gpu_util.csv. * partial_summary.py: stats from in-flight replay_metrics.jsonl (works before metrics.summary.json exists). Data: analysis/mb5_pd_ablation/fig{1,2,3}.json (24 + 20 + 16 rows). Figures: figs/mb5_pd_ablation/fig{1_reuse,2_shape,3_concurrency}_axis.png.	2026-05-31 20:14:46 +08:00
Gahow Wang	e532e83d3e	mb5_run: scrape per-instance prefix-cache counters before teardown Per-port vllm:prefix_cache_{queries,hits}_total -> instance_apc.txt. For PD this is the only honest reuse signal: producer ports show cross-turn prefix hits, while the consumer's per-request cached_tokens just counts transferred KV. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:56:43 +08:00
Gahow Wang	ee5db0b321	MB5 driver updates: PD-proxy + snapshot instrument + launcher tweaks Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:53:27 +08:00
Gahow Wang	e0d3b5150a	MB5 driver fixes: bash env-prefix + replayer flag names + python date math Two bugs caught by 8C smoke: mb5_launch.sh: using ${env_bp_arg} as a command-line env prefix does not work, because bash only treats VAR=val as an env assignment when it appears literally in the parsed command, not when it is produced by variable expansion. Fix: always export VLLM_MOONCAKE_BOOTSTRAP_PORT as a literal, defaulting to 9999 when the caller passed no port (consumer mode ignores the var, so the placeholder is harmless). mb5_run.sh: the replayer's actual CLI flags are --trace / --output / --endpoint / --model, not the ---path / ---name variants I had. Also, dash1 has no `bc`; compute wall_clock_s via python instead. Both fixed; 8C smoke (CONFIG=8C REPS=1 REQUEST_LIMIT=20) now runs end-to-end in ~30 s: - 8 vLLM kv_both instances on GPU 0-7 come up - replayer round-robins 20 reqs across them - MB5 instrumentation captures 8 snapshot files (one per EngineCore PID), with 7-139 snapshots each, so the ~10 Hz throttle works - plot_kv_pool_timeline.py renders the stacked-area + queue-depth chart cleanly (figs/mb5_smoke/*.png) Pipeline validated. Ready for the real PD-ratio sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:23:23 +08:00
Gahow Wang	e9abd70c8d	MB5 driver: launcher, orchestrator, KV-pool timeline plotter Three new files to drive the PD ratio sweep + per-request KV occupancy capture, plus a deploy.sh update so the patched replayer rides along to the fresh-venv host. mb5_launch.sh One script handles all four configs we plan to sweep: CONFIG=8C / 6P+2D / 4P+4D / 2P+6D - For 8C: 8 vLLM instances with kv_role=kv_both on GPU 0-7. Replayer talks to them via the existing comma-separated round-robin in replayer/replay.py — no proxy. - For PD configs: kv_role=kv_producer for the P pool (with VLLM_MOONCAKE_BOOTSTRAP_PORT) + kv_role=kv_consumer for the D pool, routed by the official vLLM example third_party/vllm/examples/online_serving/disaggregated_serving/ mooncake_connector/mooncake_connector_proxy.py — no policy choice made by us, per user instruction to use the standard recipe. - Applies instrument_kv_snapshot.py before launching so every EngineCore writes its per-step KV snapshot to $RUN_ROOT/kv_snapshots/mb5_kv_snapshot_pid<pid>.jsonl - Reverts the patch on stop. - Emits ENDPOINTS= line on stdout for the orchestrator to read. mb5_run.sh For each CONFIG × rep: launch, replay w600 trace via the existing replayer, capture wall-clock, tear down, cool down 10 s. Defaults: CONFIGS="8C 6P+2D 4P+4D 2P+6D" REPS=3 TRACE=traces/w600_r0.0015_st30.jsonl All artefacts go under $FRESH_ROOT/mb5_runs/$RUN_TAG_${config}_rep${rep}/ (vllm_logs/, kv_snapshots/, replay_metrics.jsonl, wall_clock_s.txt). plot_kv_pool_timeline.py Reads one or more mb5_kv_snapshot_pid.jsonl files and renders a stacked-area chart per file: x = wall-clock since first snapshot y = KV block count, stacked by per-request contribution overlay: pool-total ceiling, 90% line, waiting-queue depth subplot Bands are colored by a deterministic hash of request_id so individual requests are visually tractable across the run. 
This is the figure the user asked for: it turns the headline "PD-disagg is 10× worse" into a system-level picture of where the KV pool is blocked, when, and by which requests. deploy.sh Also tar-syncs the local replayer/ dir to /home/admin/cpfs/wjh/agentic-kv-fresh/replayer/ so mb5_run.sh can `python -m replayer` against the patched (trace_span_s/amplification) version, not the older copy under /home/admin/cpfs/wjh/agentic-kv/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:02:57 +08:00
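
The plot_kv_pool_timeline.py notes in e9abd70c8d say bands are colored by a deterministic hash of request_id. The script itself is not shown in this log, so the following is only a minimal sketch of one way to get that property. It assumes a SHA-256 digest mapped to a hue; the name `request_color` and the lightness/saturation constants are hypothetical and are not taken from the repository.

```python
import colorsys
import hashlib


def request_color(request_id: str) -> tuple[float, float, float]:
    """Map a request id to a stable RGB triple with components in [0, 1].

    Hypothetical helper for illustration. hashlib is used instead of the
    built-in hash(), because str hashes are salted per process and would
    give a different color for the same request on every run.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as a hue in [0, 1).
    hue = int.from_bytes(digest[:4], "big") / 2**32
    # Fixed lightness and saturation (assumed values) keep bands readable.
    return colorsys.hls_to_rgb(hue, 0.55, 0.65)


if __name__ == "__main__":
    for rid in ("req-1", "req-2", "req-1"):
        print(rid, request_color(rid))
```

The returned triple can be passed directly as a matplotlib `color=` argument, so the same request keeps the same band color across separate runs and separate snapshot files.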

5 Commits