agentic-kvc/scripts at d11d9f5cb9a5ee9d94765f381fe50294a053b8b0 - agentic-kvc - Local Gitea

Gahow Wang d11d9f5cb9 Adaptive prefill offload v1: implementation + experiment

Added --heavy-threshold to cache_aware_proxy.py. HEAVY requests (new
tokens >= threshold) route to the instance with the least decode load;
WARM/MEDIUM requests route by cache hit plus token-level load balancing
(LB), as before.
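
The routing rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in cache_aware_proxy.py: the `Instance` fields, the `route` function, the `cached` mapping, and the definition of "new tokens" as prompt tokens minus the best available cache hit are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Instance:
    # All fields are hypothetical stand-ins for whatever the proxy tracks.
    name: str
    decode_load: int      # e.g. tokens currently being decoded on this instance
    pending_tokens: int   # e.g. queued prefill tokens, used for token-level LB


def route(prompt_tokens: int,
          cached: dict[str, int],
          instances: list[Instance],
          heavy_threshold: int) -> Instance:
    """Pick a target instance for one request.

    cached maps instance name -> number of prompt tokens already in that
    instance's KV cache (assumed to be known to the proxy).
    """
    best_hit = max(cached.get(i.name, 0) for i in instances)
    new_tokens = prompt_tokens - best_hit  # assumed definition of "new tokens"

    if new_tokens >= heavy_threshold:
        # HEAVY: ignore cache affinity, go where decode load is lowest.
        return min(instances, key=lambda i: i.decode_load)

    # WARM/MEDIUM: prefer the largest cache hit, then the fewest pending tokens.
    return min(instances,
               key=lambda i: (-cached.get(i.name, 0), i.pending_tokens))


# Toy usage with made-up numbers.
pool = [Instance("a", decode_load=10, pending_tokens=100),
        Instance("b", decode_load=2, pending_tokens=500)]
heavy_target = route(8000, {"a": 1000}, pool, heavy_threshold=4096)
warm_target = route(3000, {"a": 2500}, pool, heavy_threshold=4096)
```

In this toy setup the first request has 7000 new tokens and goes to the instance with the least decode load, while the second has only 500 new tokens and follows the cache hit.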

Result: no significant difference vs baseline on single-machine combined mode.
  TTFT: +1.2%, TPOT: -1.5%, E2E: -0.3% (all within noise)

Per-class TTFT breakdown shows the optimization target:
  WARM (75 req):   p50=0.198s  (cache hit, nearly free)
  MEDIUM (72 req): p50=1.356s
  HEAVY (54 req):  p50=7.124s  (36x slower than WARM)
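
A breakdown like the one above could be computed along these lines. This is a sketch under assumptions: the record layout `(request_class, ttft_seconds)` and the helper names `p50_by_class` and `slowdown` are hypothetical, and the sample records are synthetic rather than taken from the experiment.

```python
from collections import defaultdict
from statistics import median


def p50_by_class(records):
    """records: iterable of (request_class, ttft_seconds) pairs."""
    groups = defaultdict(list)
    for request_class, ttft in records:
        groups[request_class].append(ttft)
    # The median is the p50 of each class's TTFT samples.
    return {cls: median(values) for cls, values in groups.items()}


def slowdown(p50, slow="HEAVY", fast="WARM"):
    """Ratio of the slow class's p50 TTFT to the fast class's."""
    return p50[slow] / p50[fast]


# Synthetic records, for illustration only.
sample = [("WARM", 1.0), ("WARM", 3.0), ("HEAVY", 20.0)]
sample_p50 = p50_by_class(sample)

# Ratio of the two p50 values reported in the commit message.
reported_ratio = slowdown({"WARM": 0.198, "HEAVY": 7.124})
```

Applied to the reported p50 values, the ratio rounds to the 36x figure quoted for HEAVY versus WARM.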

Conclusion: single-machine combined mode already distributes load well
enough that adaptive routing adds no benefit. True isolation of HEAVY
prefills requires cross-machine offload (v2 with Mooncake or multi-node).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 01:00:10 +08:00

ab_gpu_test.sh

Add GPU utilization A/B test and fix cache-aware proxy bugs

2026-05-21 22:13:38 +08:00

analyze_3way.py

Add P/D ratio ablation: 6P+2D vs 4P+4D vs Combined

2026-05-21 22:42:20 +08:00

analyze_ablations.py

Ablation 2: fire-and-forget vs await-prefill scheduling

2026-05-21 23:02:42 +08:00

analyze_breakdown.py

Add per-request breakdown profiling, identify KV cache memory bottleneck

2026-05-22 00:13:50 +08:00

analyze_cache_hit.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

analyze_gpu_ab.py

Add GPU utilization A/B test and fix cache-aware proxy bugs

2026-05-21 22:13:38 +08:00

analyze_trace.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

cache_aware_proxy.py

Adaptive prefill offload v1: implementation + experiment

2026-05-22 01:00:10 +08:00

compare_adaptive.py

Adaptive prefill offload v1: implementation + experiment

2026-05-22 01:00:10 +08:00

compare_results.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

compute_roofline.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

final_comparison.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

gpu_monitor.sh

Add GPU utilization A/B test and fix cache-aware proxy bugs

2026-05-21 22:13:38 +08:00

launch_pd_mooncake.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

launch_pd_separated.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

launch_vllm.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

profile_fnf.py

Add per-request breakdown profiling, identify KV cache memory bottleneck

2026-05-22 00:13:50 +08:00

run_benchmark.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

run_experiments.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

sample_trace.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00