agentic-kvc/scripts at ce616f46d12f7c99ffe77d1ba292d048a498a24f - agentic-kvc - Local Gitea

gahow/agentic-kvc

Gahow Wang ce616f46d1 Add per-request breakdown profiling, identify KV cache memory bottleneck

Breakdown profiling at proxy level captures:
  t_proxy_recv → t_prefill_sent → t_prefill_done → t_decode_sent → t_first_token

Key finding: 87.7% of TTFT is spent in the kv+decode phase, not in prefill.
Root cause: decode instance KV cache memory saturation (97.1% usage).
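
As a rough illustration of how the five proxy timestamps above could be turned into a per-phase share of TTFT, here is a minimal sketch. The phase names (`proxy_queue`, `prefill`, `handoff`, `kv_decode`), the `RequestTimestamps` class, and the sample values are assumptions made for this example; they are not taken from `analyze_breakdown.py` or from measured data.

```python
from dataclasses import dataclass


@dataclass
class RequestTimestamps:
    """Proxy-level timestamps for one request, in seconds (hypothetical container)."""
    t_proxy_recv: float
    t_prefill_sent: float
    t_prefill_done: float
    t_decode_sent: float
    t_first_token: float


def ttft_breakdown(ts: RequestTimestamps):
    """Split TTFT into consecutive phases and return each phase's fraction."""
    ttft = ts.t_first_token - ts.t_proxy_recv
    phases = {
        # Phase names are illustrative labels, not the repository's own.
        "proxy_queue": ts.t_prefill_sent - ts.t_proxy_recv,
        "prefill": ts.t_prefill_done - ts.t_prefill_sent,
        "handoff": ts.t_decode_sent - ts.t_prefill_done,
        "kv_decode": ts.t_first_token - ts.t_decode_sent,
    }
    if ttft <= 0:
        # Guard against clock problems or incomplete records.
        return ttft, phases, {name: 0.0 for name in phases}
    fractions = {name: dur / ttft for name, dur in phases.items()}
    return ttft, phases, fractions


# Made-up timestamps, chosen only to show the arithmetic.
sample = RequestTimestamps(0.0, 0.5, 2.0, 2.5, 10.0)
ttft, phases, fractions = ttft_breakdown(sample)
print(f"TTFT={ttft:.1f}s kv_decode share={fractions['kv_decode']:.0%}")
```

Averaging the `kv_decode` fraction over all requests in a run would give a figure comparable to the 87.7% quoted above, assuming the phase boundaries are defined the same way.
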

With the 6P+2D config, the 2 decode GPUs have only ~56GB of KV cache in total.
Large agentic requests (avg 33.6k tokens) fill this quickly.
Small requests (49 tokens, prefill=0.044s) wait 114s for KV cache
to be freed as large requests finish decoding.
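
A back-of-the-envelope sketch of why the cache fills so quickly. The commit does not state the model, so the layer count, KV head count, head dimension, and fp16 dtype below are hypothetical placeholders, and 56GB is read as 56 * 10^9 bytes. The formula itself (keys plus values, per layer, per KV head) is the standard estimate for a transformer KV cache; it ignores paged-block rounding.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_resident_requests(kv_budget_bytes: int, tokens_per_request: int,
                          bytes_per_token: int) -> int:
    """How many requests of the given length fit in the KV cache at once."""
    return kv_budget_bytes // (tokens_per_request * bytes_per_token)


# Hypothetical model shape (NOT from the repository): 32 layers, 8 KV heads,
# head_dim 128, fp16 values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Figures from the commit message: ~56GB total decode KV cache, 33.6k-token requests.
budget = 56 * 10**9
fit = max_resident_requests(budget, tokens_per_request=33_600, bytes_per_token=per_token)

print(per_token)  # 131072 bytes (128 KiB) per token under these assumptions
print(fit)        # 12 average-sized requests across both decode GPUs
```

Under these assumed dimensions only about a dozen average-sized requests can be resident across the decode GPUs, so any further request, however small, has to wait for one of them to finish.
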

The vLLM log confirms this: Running=0, Waiting=6, KV cache=97.1%.
The GPU is idle, but requests queue for KV cache memory, not compute.

This is the fundamental bottleneck of single-machine PD separation
for long-context agentic workloads: concentrating decode onto fewer
GPUs creates a KV cache memory wall.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 00:13:50 +08:00

ab_gpu_test.sh

Add GPU utilization A/B test and fix cache-aware proxy bugs

2026-05-21 22:13:38 +08:00

analyze_3way.py

Add P/D ratio ablation: 6P+2D vs 4P+4D vs Combined

2026-05-21 22:42:20 +08:00

analyze_ablations.py

Ablation 2: fire-and-forget vs await-prefill scheduling

2026-05-21 23:02:42 +08:00

analyze_breakdown.py

Add per-request breakdown profiling, identify KV cache memory bottleneck

2026-05-22 00:13:50 +08:00

analyze_cache_hit.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

analyze_gpu_ab.py

Add GPU utilization A/B test and fix cache-aware proxy bugs

2026-05-21 22:13:38 +08:00

analyze_trace.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

cache_aware_proxy.py

Add per-request breakdown profiling, identify KV cache memory bottleneck

2026-05-22 00:13:50 +08:00

compare_results.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

compute_roofline.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

final_comparison.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

gpu_monitor.sh

Add GPU utilization A/B test and fix cache-aware proxy bugs

2026-05-21 22:13:38 +08:00

launch_pd_mooncake.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

launch_pd_separated.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

launch_vllm.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

profile_fnf.py

Add per-request breakdown profiling, identify KV cache memory bottleneck

2026-05-22 00:13:50 +08:00

run_benchmark.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

run_experiments.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

sample_trace.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00