Files
agentic-kvc/microbench/patches/pd_profile.py
Gahow Wang 72790ae6c1 PD-sep server-side profiling: vLLM patches + per-request breakdown
Instrumentation patches (microbench/patches/):
  - pd_profile.py: shared event emitter (VLLM_PD_PROFILE_LOG env var)
  - apply_patches.py: idempotent patch installer for mooncake_connector.py
    and scheduler.py, marks insertions with # PD_PROFILE_PATCH
  - analyze_events.py: joins per-process JSONL event logs by transfer_id
    into per-request phase durations

Seven events captured per request:
  D_get_num_matched → P_zmq_received → P_prefill_done →
  P_rdma_start → P_rdma_end → D_recv_complete → D_request_promoted

Driver fix (microbench/lifecycle/driver.py):
  seed_prefix_cache now sends via the proxy URL so P and D both cache
  the seeded prefix with matching block hashes. Previously seeding D
  directly produced different block hashes than the proxy-routed
  measurement requests, making incremental transfer impossible.

Real breakdown (fig_breakdown_real.png, server_breakdown.csv, n=93):
  prefill_compute  620 ms median (95% of overhead)
  rdma_transfer     42 ms median (~71 Gbps effective)
  other overhead    10 ms median (dispatch + params + signal + promote)

Mooncake transfer is NOT the bottleneck. Even with bulk RDMA the
transfer cost is <10% of prefill cost for Qwen3-30B-A3B on H20.
2026-05-26 13:59:09 +08:00

58 lines
1.5 KiB
Python

"""
PD-Sep Lifecycle Profiler — patch for mooncake_connector.py + scheduler.py
Activated by env var: VLLM_PD_PROFILE_LOG=/path/to/events.jsonl
Each line is one event:
{"t_ns": <perf_counter_ns>, "event": "<name>", "req_id": "<id>",
"transfer_id": "<id>", "role": "<kv_producer|kv_consumer>",
"data": {...event-specific fields...}}
This module is imported by the patched mooncake_connector.py and scheduler.py
to centralize event emission.
"""
import os
import json
import time
import threading
from typing import Any
_LOG_FILE = os.environ.get("VLLM_PD_PROFILE_LOG", "")
_LOCK = threading.Lock()
_HANDLE = None
_ENABLED = bool(_LOG_FILE)
def _get_handle():
global _HANDLE
if _HANDLE is None and _ENABLED:
_HANDLE = open(_LOG_FILE, "a", buffering=1) # line-buffered
return _HANDLE
def emit(event: str, req_id: str = "", transfer_id: str = "",
role: str = "", **data: Any) -> None:
"""Record a profile event. No-op if VLLM_PD_PROFILE_LOG is not set."""
if not _ENABLED:
return
record = {
"t_ns": time.perf_counter_ns(),
"event": event,
"req_id": str(req_id),
"transfer_id": str(transfer_id),
"role": role,
"data": data,
}
try:
h = _get_handle()
if h:
with _LOCK:
h.write(json.dumps(record, default=str) + "\n")
except Exception:
pass # never let profiling break vLLM
def enabled() -> bool:
return _ENABLED