Instrumentation patches (microbench/patches/):
- pd_profile.py: shared event emitter (VLLM_PD_PROFILE_LOG env var)
- apply_patches.py: idempotent patch installer for mooncake_connector.py
and scheduler.py, marks insertions with # PD_PROFILE_PATCH
- analyze_events.py: joins per-process JSONL event logs by transfer_id
into per-request phase durations
Seven events captured per request:
D_get_num_matched → P_zmq_received → P_prefill_done →
P_rdma_start → P_rdma_end → D_recv_complete → D_request_promoted
Driver fix (microbench/lifecycle/driver.py):
seed_prefix_cache now sends via the proxy URL so P and D both cache
the seeded prefix with matching block hashes. Previously seeding D
directly produced different block hashes than the proxy-routed
measurement requests, making incremental transfer impossible.
Real breakdown (fig_breakdown_real.png, server_breakdown.csv, n=93):
prefill_compute 620 ms median (95% of overhead)
rdma_transfer 42 ms median (~71 Gbps effective)
other overhead 10 ms median (dispatch + params + signal + promote)
Mooncake transfer is NOT the bottleneck. Even with bulk RDMA the
transfer cost is <10% of prefill cost for Qwen3-30B-A3B on H20.
58 lines
1.5 KiB
Python
58 lines
1.5 KiB
Python
"""
|
|
PD-Sep Lifecycle Profiler — patch for mooncake_connector.py + scheduler.py
|
|
|
|
Activated by env var: VLLM_PD_PROFILE_LOG=/path/to/events.jsonl
|
|
|
|
Each line is one event:
|
|
{"t_ns": <perf_counter_ns>, "event": "<name>", "req_id": "<id>",
|
|
"transfer_id": "<id>", "role": "<kv_producer|kv_consumer>",
|
|
"data": {...event-specific fields...}}
|
|
|
|
This module is imported by the patched mooncake_connector.py and scheduler.py
|
|
to centralize event emission.
|
|
"""
|
|
|
|
import os
|
|
import json
|
|
import time
|
|
import threading
|
|
from typing import Any
|
|
|
|
_LOG_FILE = os.environ.get("VLLM_PD_PROFILE_LOG", "")
|
|
_LOCK = threading.Lock()
|
|
_HANDLE = None
|
|
_ENABLED = bool(_LOG_FILE)
|
|
|
|
|
|
def _get_handle():
|
|
global _HANDLE
|
|
if _HANDLE is None and _ENABLED:
|
|
_HANDLE = open(_LOG_FILE, "a", buffering=1) # line-buffered
|
|
return _HANDLE
|
|
|
|
|
|
def emit(event: str, req_id: str = "", transfer_id: str = "",
|
|
role: str = "", **data: Any) -> None:
|
|
"""Record a profile event. No-op if VLLM_PD_PROFILE_LOG is not set."""
|
|
if not _ENABLED:
|
|
return
|
|
record = {
|
|
"t_ns": time.perf_counter_ns(),
|
|
"event": event,
|
|
"req_id": str(req_id),
|
|
"transfer_id": str(transfer_id),
|
|
"role": role,
|
|
"data": data,
|
|
}
|
|
try:
|
|
h = _get_handle()
|
|
if h:
|
|
with _LOCK:
|
|
h.write(json.dumps(record, default=str) + "\n")
|
|
except Exception:
|
|
pass # never let profiling break vLLM
|
|
|
|
|
|
def enabled() -> bool:
|
|
return _ENABLED
|