Validates the elastic_migration_v2 finding that kv_role=kv_both raises
TTFT p90 by 45% even when PD-sep never fires. Replicates it under a
single-instance, synthetic, open-loop workload to separate
mechanism cost from 8-instance feedback amplification.
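For reference, a minimal sketch of how a p90 delta like the one above can be computed from two sets of TTFT samples. The nearest-rank percentile definition and the sample values are assumptions made for illustration; they are not measurements from this experiment, and the analysis script may use a different percentile method.

```python
def p90(samples):
    """Nearest-rank 90th percentile of a non-empty sequence."""
    ordered = sorted(samples)
    # 1-based rank = ceil(0.9 * n), computed with integers to avoid float rounding.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]


def relative_delta_pct(baseline, treatment):
    """Percentage change of treatment relative to baseline."""
    return 100.0 * (treatment - baseline) / baseline


# Synthetic TTFT samples in milliseconds (invented for this example).
plain_ttft_ms = [100, 105, 110, 115, 120, 125, 130, 140, 150, 200]
connector_ttft_ms = [110, 118, 121, 130, 135, 142, 150, 160, 180, 260]

delta_pct = relative_delta_pct(p90(plain_ttft_ms), p90(connector_ttft_ms))
print(f"TTFT p90 delta: {delta_pct:+.1f}%")
```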
Configurations (8):
plain, noop_connector, mooncake_{producer,consumer,both},
nixl_both, lmcache_only, multi_mooncake_lmcache.
Pre-flight verification gates the risky configs: kv_consumer (needs a
dummy bootstrap), multi-connector composition, and NoOp custom class loading.
Workload: two-phase sweep
Phase A: rate {0.5..32} req/s × shape (4096, 256), saturation criteria
Phase B: ref_safe rate × cartesian (input ∈ {512,4k,32k}, output ∈ {64,256,1024})
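The Phase B grid can be sketched as a cartesian product. This is an illustration only: it reads "4k" and "32k" as 4096 and 32768 tokens (an assumption, consistent with the 4096-token Phase A shape), and the rate passed in is a placeholder, since ref_safe is only chosen in stage 2.

```python
from itertools import product

INPUT_LENS = [512, 4096, 32768]  # "4k"/"32k" interpreted as powers of two (assumption)
OUTPUT_LENS = [64, 256, 1024]


def phase_b_grid(ref_safe_rate):
    """Return one workload spec per (input, output) shape at a fixed request rate."""
    return [
        {"rate": ref_safe_rate, "input_len": i, "output_len": o}
        for i, o in product(INPUT_LENS, OUTPUT_LENS)
    ]


# 2.0 req/s is a hypothetical stand-in for the measured ref_safe rate.
grid = phase_b_grid(ref_safe_rate=2.0)
print(len(grid), "shapes per config")
```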
Step-timing patch enriches vLLM's existing AGENTIC_STEP_LOG_PATH emit
with step_duration_us and build_meta_us, which measure per-step
substrate cost directly rather than only user-visible TTFT/TPOT.
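A rough sketch of how the two fields could be aggregated. The log layout is an assumption: this example treats the step log as JSON lines with one record per step, and treats build_meta_us as time contained within step_duration_us. Neither detail is confirmed by the patch description above.

```python
import json


def build_meta_share(lines):
    """Fraction of total step time spent building connector metadata."""
    total_step_us = 0
    total_meta_us = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        total_step_us += record["step_duration_us"]
        total_meta_us += record["build_meta_us"]
    if total_step_us == 0:
        return 0.0
    return total_meta_us / total_step_us


# Synthetic records, invented for this example.
sample_log = [
    '{"step_duration_us": 1000, "build_meta_us": 50}',
    '{"step_duration_us": 3000, "build_meta_us": 150}',
]
share = build_meta_share(sample_log)
print(f"build_meta share of step time: {share:.1%}")
```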
run_all.sh runs as a 5-stage barrier:
  0  pre-flight + apply patch
  1  Phase A, all configs
  2  pick ref_safe / ref_load
  3  Phase B, all configs
  4  revert patch + analyze + plot
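The barrier ordering above can be sketched as follows. This is a Python illustration of the control flow only, not a transcription of run_all.sh: the stage bodies are stubs that record events, and the stop-on-first-failure behaviour is an assumption.

```python
CONFIGS = [
    "plain", "noop_connector",
    "mooncake_producer", "mooncake_consumer", "mooncake_both",
    "nixl_both", "lmcache_only", "multi_mooncake_lmcache",
]

events = []  # records the order in which work happens


def per_config(label):
    """Stage that visits every config before it returns (stub)."""
    def stage():
        for cfg in CONFIGS:
            events.append((label, cfg))
        return True
    return stage


def single(label):
    """Stage that does one global action (stub)."""
    def stage():
        events.append((label, None))
        return True
    return stage


def run_stages(stages):
    """Run stages strictly in order; stop at the first stage that fails."""
    completed = []
    for name, stage in stages:
        if not stage():
            break
        completed.append(name)
    return completed


completed = run_stages([
    ("0 pre-flight + apply patch", single("preflight")),
    ("1 Phase A all configs", per_config("A")),
    ("2 pick ref_safe / ref_load", single("pick_ref")),
    ("3 Phase B all configs", per_config("B")),
    ("4 revert patch + analyze + plot", single("finish")),
])
```

The point of the barrier is visible in `events`: every Phase A run is recorded before the reference rate is picked, so Phase B always uses a rate derived from the complete Phase A sweep.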
Outputs aggregate.{json,csv}, MANIFEST.tsv, and 5 figures.
Estimated runtime: 4-5.5 hours on idle dash0 H20.
91 lines
2.5 KiB
Python
"""Pure no-op KV connector for measuring vLLM v1 framework overhead.

This connector implements every abstract hook of KVConnectorBase_V1 with
the cheapest possible no-op return. Loaded via:

    --kv-transfer-config '{
        "kv_connector_module_path":
            "microbench.connector_tax.tools.noop_connector:NoOpConnector",
        "kv_role": "kv_both"
    }'

It does:
  - no I/O
  - no per-step cache key walk
  - no per-layer save/load
  - no metadata serialization beyond an empty dataclass

So `tax(NoOpConnector) ≈ pure vLLM v1 framework overhead`.
"""

from typing import TYPE_CHECKING, Any

from vllm.distributed.kv_transfer.kv_connector.v1.base import (
    KVConnectorBase_V1,
    KVConnectorMetadata,
)

if TYPE_CHECKING:
    import torch
    from vllm.attention.backends.abstract import AttentionMetadata
    from vllm.forward_context import ForwardContext
    from vllm.v1.core.kv_cache_manager import KVCacheBlocks
    from vllm.v1.core.sched.output import SchedulerOutput
    from vllm.v1.request import Request


class NoOpConnector(KVConnectorBase_V1):
    """Empty connector: every hook is a no-op.

    Used as a control to isolate vLLM v1 framework dispatch cost
    (build_connector_meta walking SchedulerOutput, mixin hooks, etc.)
    from any specific connector implementation work (RDMA setup,
    per-layer save, hash table walks).
    """

    # ---- scheduler-side abstract methods ------------------------------
    def get_num_new_matched_tokens(
        self,
        request: "Request",
        num_computed_tokens: int,
    ) -> tuple[int | None, bool]:
        # Never advertises any external cache hits.
        return 0, False

    def update_state_after_alloc(
        self,
        request: "Request",
        blocks: "KVCacheBlocks",
        num_external_tokens: int,
    ) -> None:
        return None

    def build_connector_meta(
        self,
        scheduler_output: "SchedulerOutput",
    ) -> KVConnectorMetadata:
        return KVConnectorMetadata()

    # ---- worker-side abstract methods ---------------------------------
    def start_load_kv(
        self,
        forward_context: "ForwardContext",
        **kwargs: Any,
    ) -> None:
        return None

    def wait_for_layer_load(self, layer_name: str) -> None:
        return None

    def save_kv_layer(
        self,
        layer_name: str,
        kv_layer: "torch.Tensor",
        attn_metadata: "AttentionMetadata",
        **kwargs: Any,
    ) -> None:
        return None

    def wait_for_save(self) -> None:
        return None
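As a usage aid, the flag value from the module docstring can be generated rather than hand-quoted. This sketch reproduces the docstring's two keys verbatim and only handles JSON encoding and shell quoting. The launcher command is not shown in the source, and any further keys a particular vLLM version may require are not covered here.

```python
import json
import shlex

# Keys and values copied from the NoOpConnector module docstring.
kv_transfer_config = {
    "kv_connector_module_path":
        "microbench.connector_tax.tools.noop_connector:NoOpConnector",
    "kv_role": "kv_both",
}

# Serialize once, then quote so a POSIX shell passes the JSON as a single argument.
flag_value = json.dumps(kv_transfer_config)
cli_fragment = "--kv-transfer-config " + shlex.quote(flag_value)
print(cli_fragment)
```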