agentic-kvc/fig_breakdown.png at 72790ae6c15cdac500ab03a7b4374d89fba39c5c


Gahow Wang 72790ae6c1 PD-sep server-side profiling: vLLM patches + per-request breakdown

Instrumentation patches (microbench/patches/):
  - pd_profile.py: shared event emitter (VLLM_PD_PROFILE_LOG env var)
  - apply_patches.py: idempotent patch installer for mooncake_connector.py
    and scheduler.py, marks insertions with # PD_PROFILE_PATCH
  - analyze_events.py: joins per-process JSONL event logs by transfer_id
    into per-request phase durations
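
As an illustration of the idempotent, marker-based installation described above, here is a minimal sketch. It is not the contents of apply_patches.py: the function name `apply_patch`, the anchor-matching strategy, and the same-indent insertion are assumptions made for the example. Only the `# PD_PROFILE_PATCH` marker comes from the commit message.

```python
MARKER = "# PD_PROFILE_PATCH"  # marker named in the commit message


def apply_patch(source: str, anchor: str, snippet: str) -> str:
    """Insert `snippet` after the first line containing `anchor`.

    Hypothetical helper: if the marker is already present, the source is
    returned unchanged, so running the installer twice is a no-op.
    """
    if MARKER in source:
        return source  # already patched

    lines = source.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if anchor in line:
            # Make sure the anchor line ends with a newline before inserting.
            if not line.endswith("\n"):
                lines[i] = line + "\n"
            # Reuse the anchor line's indentation for the inserted statement.
            indent = line[: len(line) - len(line.lstrip())]
            lines.insert(i + 1, f"{indent}{snippet}  {MARKER}\n")
            return "".join(lines)
    raise ValueError(f"anchor not found: {anchor!r}")
```

Checking for the marker before editing is what makes a second run safe. A real installer would probably track one marker per insertion point rather than one per file.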

Seven events captured per request:
  D_get_num_matched → P_zmq_received → P_prefill_done →
  P_rdma_start → P_rdma_end → D_recv_complete → D_request_promoted
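
The join over these seven events can be sketched as follows. This is a hedged example, not the code in analyze_events.py: the JSONL field names `transfer_id`, `event`, and `ts` (seconds) are assumed, as are the function names. It also assumes the P and D processes log timestamps from comparable clocks.

```python
import json
from collections import defaultdict

# Event order as listed in the commit message.
EVENT_ORDER = [
    "D_get_num_matched",
    "P_zmq_received",
    "P_prefill_done",
    "P_rdma_start",
    "P_rdma_end",
    "D_recv_complete",
    "D_request_promoted",
]


def join_events(lines):
    """Group JSONL records from all processes by transfer_id.

    Assumed record shape: {"transfer_id": str, "event": str, "ts": float}.
    """
    by_id = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        by_id[rec["transfer_id"]][rec["event"]] = rec["ts"]
    return by_id


def phase_durations(events):
    """Return milliseconds between consecutive events, or None if any is missing."""
    if any(name not in events for name in EVENT_ORDER):
        return None
    return {
        f"{a}->{b}": (events[b] - events[a]) * 1000.0
        for a, b in zip(EVENT_ORDER, EVENT_ORDER[1:])
    }
```

Requests with missing events yield None rather than a partial row, which keeps incomplete transfers out of the medians.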

Driver fix (microbench/lifecycle/driver.py):
  seed_prefix_cache now sends seed requests through the proxy URL, so
  P and D both cache the seeded prefix with matching block hashes.
  Previously, seeding D directly produced block hashes that differed
  from those of the proxy-routed measurement requests, which made
  incremental transfer impossible.

Real breakdown (fig_breakdown_real.png, server_breakdown.csv, n=93):
  prefill_compute  620 ms median (95% of overhead)
  rdma_transfer     42 ms median (~71 Gbps effective)
  other overhead    10 ms median (dispatch + params + signal + promote)

Mooncake transfer is NOT the bottleneck. Even with bulk RDMA, the
median transfer cost (42 ms) is under 10% of the median prefill cost
(620 ms) for Qwen3-30B-A3B on H20.
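
The arithmetic behind these figures can be checked with a short sketch. The helper names are hypothetical, and because the ~71 Gbps figure is rounded, the implied payload size is only a rough estimate, not a measured value.

```python
def effective_gbps(num_bytes: float, seconds: float) -> float:
    """Effective throughput in gigabits per second (1 Gbps = 1e9 bits/s)."""
    return num_bytes * 8 / seconds / 1e9


def implied_bytes(gbps: float, seconds: float) -> float:
    """Payload size implied by a throughput and a duration."""
    return gbps * 1e9 * seconds / 8


prefill_ms = 620.0  # median prefill_compute from the breakdown
rdma_ms = 42.0      # median rdma_transfer from the breakdown

# Transfer cost relative to prefill cost: about 0.068, i.e. under 10%.
ratio = rdma_ms / prefill_ms

# Payload implied by ~71 Gbps over 42 ms: roughly 3.7e8 bytes.
payload = implied_bytes(71.0, rdma_ms / 1000.0)
```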

2026-05-26 13:59:09 +08:00

144 KiB

2062x965px


/gahow/agentic-kvc/raw/commit/72790ae6c15cdac500ab03a7b4374d89fba39c5c/microbench/lifecycle/results/fig_breakdown.png
