agentic-kvc

gahow/agentic-kvc

Commit Graph

Author

SHA1

Message

Date

Gahow Wang

06dd175441

Microbench 1 plots: prefill-decode interference heatmap + lines

plot_interference.py reads the interference sweep summary (4 D × 4 P × 3 reps,
cold prefill prompts) and produces:

  fig_interference_heatmap.png
    TPOT p90 interference index over (D, P): 14x at D=8 P=2k → 214x at D=1 P=32k.

  fig_interference_lines.png
    (a) TPOT p90 during prefill vs P, log-y, one line per D + baseline dashed
    (b) Cold prefill TTFT vs P (interference window length)

Confirms B2 finding: cold prefill on the same worker stalls overlapping
decodes, raising TPOT p90 to 14-214x baseline. The interference window
grows with P, from ~140 ms at 2k to ~4.6 s at 32k (about 33x for a 16x
increase in P, so somewhat faster than linear), and is essentially
independent of decode batch size; prefill compute time dominates.
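
For reference, the interference index reported above can be read as a ratio of two TPOT p90 values. The sketch below is a minimal illustration, not the code in plot_interference.py: the function names and the nearest-rank percentile definition are assumptions.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty sequence (q in 0..100)."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Nearest-rank: smallest value with at least q% of samples at or below it.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def interference_index(tpot_during_prefill_ms, tpot_baseline_ms, q=90):
    """Ratio of TPOT p90 while a prefill overlaps to TPOT p90 without one."""
    return percentile(tpot_during_prefill_ms, q) / percentile(tpot_baseline_ms, q)
```

An index of 1.0 would mean the overlapping prefill has no measurable effect on decode latency at that percentile.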

2026-05-26 14:21:30 +08:00

Gahow Wang

72790ae6c1

PD-sep server-side profiling: vLLM patches + per-request breakdown

Instrumentation patches (microbench/patches/):
  - pd_profile.py: shared event emitter (VLLM_PD_PROFILE_LOG env var)
  - apply_patches.py: idempotent patch installer for mooncake_connector.py
    and scheduler.py, marks insertions with # PD_PROFILE_PATCH
  - analyze_events.py: joins per-process JSONL event logs by transfer_id
    into per-request phase durations

Seven events captured per request:
  D_get_num_matched → P_zmq_received → P_prefill_done →
  P_rdma_start → P_rdma_end → D_recv_complete → D_request_promoted
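
The join performed by analyze_events.py can be pictured roughly as follows. This is a sketch under stated assumptions, not the script itself: the event names come from the list above, but the JSON field names (`transfer_id`, `event`, `ts_ms`) and the definition of the "other" phase are hypothetical. Every duration is a difference of two timestamps from the same side (P or D), so the sketch does not depend on clocks being synchronized across processes.

```python
import json
from collections import defaultdict

# Event names are from the commit message; the record layout is assumed.
EVENTS = (
    "D_get_num_matched", "P_zmq_received", "P_prefill_done",
    "P_rdma_start", "P_rdma_end", "D_recv_complete", "D_request_promoted",
)


def join_events(lines):
    """Group JSONL records from all processes by transfer_id."""
    by_transfer = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        by_transfer[rec["transfer_id"]][rec["event"]] = rec["ts_ms"]
    return dict(by_transfer)


def phase_durations(ev):
    """Return per-phase durations in ms, or None if any event is missing."""
    if any(name not in ev for name in EVENTS):
        return None
    prefill = ev["P_prefill_done"] - ev["P_zmq_received"]      # P-side clock
    rdma = ev["P_rdma_end"] - ev["P_rdma_start"]               # P-side clock
    total = ev["D_request_promoted"] - ev["D_get_num_matched"]  # D-side clock
    # Assumed definition: whatever is left after prefill and RDMA.
    return {
        "prefill_compute": prefill,
        "rdma_transfer": rdma,
        "other": total - prefill - rdma,
    }
```

Requests with an incomplete event set return None so they can be dropped before computing medians.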

Driver fix (microbench/lifecycle/driver.py):
  seed_prefix_cache now sends via the proxy URL so P and D both cache
  the seeded prefix with matching block hashes. Previously seeding D
  directly produced different block hashes than the proxy-routed
  measurement requests, making incremental transfer impossible.

Real breakdown (fig_breakdown_real.png, server_breakdown.csv, n=93):
  prefill_compute  620 ms median (95% of overhead)
  rdma_transfer     42 ms median (~71 Gbps effective)
  other overhead    10 ms median (dispatch + params + signal + promote)

Mooncake transfer is NOT the bottleneck. Even with bulk RDMA the
transfer cost is <10% of prefill cost for Qwen3-30B-A3B on H20.
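
As a sanity check on the figures above, the arithmetic can be written out. The helper names are hypothetical and the units are decimal (1 Gbps = 1e9 bit/s). The commit does not report a payload size, so the value computed here is only what the two medians imply.

```python
def payload_bytes(effective_gbps: float, seconds: float) -> float:
    """Bytes moved in `seconds` at an effective rate of `effective_gbps`."""
    return effective_gbps * 1e9 / 8 * seconds


def effective_gbps(num_bytes: float, seconds: float) -> float:
    """Effective rate in Gbps for `num_bytes` moved in `seconds`."""
    return num_bytes * 8 / seconds / 1e9


# 42 ms at ~71 Gbps implies roughly 373 MB of KV cache per request.
implied_payload = payload_bytes(71, 0.042)

# Median transfer time relative to median prefill time: about 6.8%.
transfer_to_prefill = 42 / 620
```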

2026-05-26 13:59:09 +08:00

Gahow Wang

f784e49c07

Microbench: prefill-decode interference + PD transfer lifecycle

Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.
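
The uuid+time_ns fix in item 1 can be sketched as below. This is a minimal illustration rather than the driver's code: the function name and prompt layout are assumptions. The nonce goes at the start of the prompt on the assumption that the prefix cache matches from the first token block onward, so a unique leading block prevents any hit.

```python
import time
import uuid


def make_cold_prompt(body: str) -> str:
    """Prefix `body` with a nonce that is unique per call.

    uuid4 gives uniqueness across processes; time_ns adds a per-call
    timestamp component, mirroring the "uuid+time_ns" fix.
    """
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    return f"[{nonce}] {body}"
```

Each repetition then calls `make_cold_prompt` again instead of reusing the string from rep 0.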

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.

2026-05-26 00:57:06 +08:00

3 Commits