Validates the elastic_migration_v2 finding that kv_role=kv_both raises
TTFT p90 by 45% even when PD-sep never fires. Replicates it under a
single-instance, synthetic, open-loop workload to separate the
mechanism's own cost from 8-instance feedback amplification.
Configurations (8):
plain, noop_connector, mooncake_{producer,consumer,both},
nixl_both, lmcache_only, multi_mooncake_lmcache.
Pre-flight verification gates risky configs (kv_consumer needs dummy
bootstrap, multi-connector composition, NoOp custom class loading).
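The gating idea can be sketched as follows: a config is only benchmarked if its check command succeeds. This is a minimal illustration, not the repo's pre-flight code. `gate_config`, `PASSED_CONFIGS`, and `check_noop_import` are hypothetical names; only the module path is taken from the NoOp config.

```bash
#!/bin/bash
# Sketch of a pre-flight gate (hypothetical helper names throughout).
# A config is added to PASSED_CONFIGS only if its check command exits 0.
PASSED_CONFIGS=()

gate_config() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    PASSED_CONFIGS+=("$name")
  else
    echo "pre-flight failed, skipping config: $name" >&2
  fi
}

# Hypothetical check for the NoOp config: the custom connector module
# must be importable by the same interpreter that will launch vLLM.
check_noop_import() {
  "${PYTHON:-python3}" -c 'import importlib, sys; importlib.import_module(sys.argv[1])' \
    "microbench.connector_tax.tools.noop_connector"
}

# Example use:
#   gate_config plain true
#   gate_config noop_connector check_noop_import
```

Later stages would then iterate over `PASSED_CONFIGS` rather than the full list, so a failed check drops one config without aborting the whole run.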
Workload: two-phase sweep
Phase A: rate {0.5..32} req/s × shape (4096, 256), saturation criteria
Phase B: ref_safe rate × cartesian (input ∈ {512,4k,32k}, output ∈ {64,256,1024})
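The Phase B grid is a plain cartesian product of 3 input lengths and 3 output lengths, giving 9 shapes per config. A minimal sketch of enumerating it is below. It assumes 4k means 4096 (matching the Phase A shape) and 32k means 32768; the function and array names are illustrative.

```bash
#!/bin/bash
# Sketch: enumerate the Phase B shape grid as "input_len output_len" pairs.
# Assumption: 4k = 4096 and 32k = 32768 tokens.
INPUT_LENS=(512 4096 32768)
OUTPUT_LENS=(64 256 1024)

phase_b_shapes() {
  local in_len out_len
  for in_len in "${INPUT_LENS[@]}"; do
    for out_len in "${OUTPUT_LENS[@]}"; do
      echo "$in_len $out_len"
    done
  done
}

# Example use:
#   phase_b_shapes | while read -r in_len out_len; do
#     echo "would run shape ($in_len, $out_len) at the ref_safe rate"
#   done
```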
Step-timing patch enriches vLLM's existing AGENTIC_STEP_LOG_PATH emit
with step_duration_us and build_meta_us — directly measures per-step
substrate cost, not just user-visible TTFT/TPOT.
run_all.sh runs as a 5-stage barrier:
  0 pre-flight + apply patch
  1 Phase A all configs
  2 pick ref_safe / ref_load
  3 Phase B all configs
  4 revert patch + analyze + plot
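The barrier structure amounts to running the stages strictly in order and stopping at the first failure, so that Phase B never starts on partial Phase A results. The sketch below shows that control flow only. The stage function names and bodies are placeholders, not the contents of run_all.sh.

```bash
#!/bin/bash
# Sketch of a staged barrier: each stage must complete before the next starts.
# Stage bodies are placeholders standing in for the real work.
stage0_preflight()  { echo "pre-flight + apply patch"; }
stage1_phase_a()    { echo "Phase A, all configs"; }
stage2_pick_rates() { echo "pick ref_safe / ref_load"; }
stage3_phase_b()    { echo "Phase B, all configs"; }
stage4_analyze()    { echo "revert patch + analyze + plot"; }

STAGES=(stage0_preflight stage1_phase_a stage2_pick_rates stage3_phase_b stage4_analyze)

run_all() {
  local stage
  for stage in "${STAGES[@]}"; do
    echo "== $stage"
    # Abort the whole run if a stage fails; later stages depend on its output.
    "$stage" || { echo "$stage failed, aborting" >&2; return 1; }
  done
}

# Example use:
#   run_all
```

One caveat with this shape: if a stage after stage 0 fails, the patch is left applied, so a real runner would likely want a `trap` that reverts it on exit.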
Outputs aggregate.{json,csv}, MANIFEST.tsv, and 5 figures.
Estimated runtime: 4-5.5 hours on idle dash0 H20.
17 lines
585 B
Bash
Executable File
#!/bin/bash
# vLLM + custom NoOpConnector loaded by dotted path.
set -euo pipefail
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"

KV_CFG='{"kv_connector":"NoOpConnector","kv_connector_module_path":"microbench.connector_tax.tools.noop_connector","kv_role":"kv_both"}'

CUDA_VISIBLE_DEVICES="$GPU_ID" \
  "$PYTHON" -m vllm.entrypoints.openai.api_server \
    "${COMMON_VLLM_ARGS[@]}" \
    --kv-transfer-config "$KV_CFG" \
    > "$LOG_DIR/vllm_stdout.log" 2> "$LOG_DIR/vllm_stderr.log" &
VLLM_PID=$!
echo "$VLLM_PID" > "$LOG_DIR/.vllm.pid"
echo "vLLM PID=$VLLM_PID"
wait_for_ready 240
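The script relies on `wait_for_ready`, which is defined in common.sh and not shown here. A readiness wait of this kind could look roughly like the sketch below. It is a hypothetical stand-in, not the repo's implementation: it assumes the server listens on 127.0.0.1 at `PORT` (defaulting to 8000) and polls vLLM's `/health` endpoint once per second. The names `probe_ready` and `wait_for_ready_sketch` are illustrative.

```bash
#!/bin/bash
# Hypothetical stand-in for wait_for_ready from common.sh.
# Assumption: the server listens on 127.0.0.1:$PORT (default 8000).
probe_ready() {
  curl -sf "http://127.0.0.1:${PORT:-8000}/health" >/dev/null
}

# Usage: wait_for_ready_sketch TIMEOUT_SECONDS [SERVER_PID]
wait_for_ready_sketch() {
  local timeout_s="$1" pid="${2:-}" waited=0
  until probe_ready; do
    # Fail fast if the server process has already died.
    if [[ -n "$pid" ]] && ! kill -0 "$pid" 2>/dev/null; then
      echo "server exited before becoming ready" >&2
      return 1
    fi
    if (( waited >= timeout_s )); then
      echo "timed out after ${timeout_s}s waiting for readiness" >&2
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
}

# Example use, mirroring the launcher above:
#   wait_for_ready_sketch 240 "$VLLM_PID"
```

Checking the PID inside the loop avoids waiting out the full timeout when the server crashes during startup, which is plausible for the riskier connector configs.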