agentic-kvc

Files

Gahow Wang e3480f7d28 8-instance connector tax: +2% at non-saturated, +17% only at saturation

8×TP1 + load_only proxy, shape 512×64, rates 32/64/128 req/s total:

  Rate=32 (non-saturated, thr=0.95-0.97):
    plain TTFT p90=64ms,  mooncake_both=65ms  → +2% (noise)
  Rate=64 (non-saturated, thr=0.96):
    plain TTFT p90=114ms, mooncake_both=107ms → -6% (noise)
  Rate=128 (saturated, thr=0.70-0.71):
    plain TTFT p90=702ms, mooncake_both=822ms → +17%
    plain TTFT p50=339ms, mooncake_both=470ms → +39%

Conclusion: The elastic_migration_v2 +45% is a saturation artifact.
Under SLO-compliant load (TTFT<10s, thr_ratio>0.9), mooncake_both's
1.4ms/step build_connector_meta overhead is completely masked by the
scheduler-model async pipeline. The tax only manifests when the system
is already saturated and queueing amplifies per-step differences.

For practical deployment: enabling kv_role=kv_both has effectively zero
cost as long as the serving system stays within SLO capacity bounds.

2026-05-26 21:32:46 +08:00

connector_tax

8-instance connector tax: +2% at non-saturated, +17% only at saturation

2026-05-26 21:32:46 +08:00

interference

Microbench 1 plots: prefill-decode interference heatmap + lines

2026-05-26 14:21:30 +08:00

lifecycle

PD-sep server-side profiling: vLLM patches + per-request breakdown

2026-05-26 13:59:09 +08:00

patches

PD-sep server-side profiling: vLLM patches + per-request breakdown