agentic-kvc/scripts at e13391eeab4c85ec9360abf18a9b1b6acd8fa9fa

gahow/agentic-kvc

Gahow Wang 4b50c5a08d Fix unified cost model: include decode load in queue + hard overload gate

Two bugs caused the elastic policy to concentrate load on cached instances
(10x token imbalance vs 2.7x for the baseline):

1. _instance_cost queue only counted pending_prefill_tokens, missing
   ongoing_decode_tokens entirely — instances with 50 decoding requests
   appeared idle to the cost model.

2. Cache hits made overloaded instances look "cheap", creating a positive
   feedback loop: more sessions → more cache → lower cost → more routing.
   Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks
   affinity before the cost model runs, matching linear policy behavior.

Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-24 16:25:02 +08:00
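
The two fixes described in this commit message can be illustrated with a small sketch. This is not the code in cache_aware_proxy.py. The field names pending_prefill_tokens and ongoing_decode_tokens and the parameter overload_factor come from the commit message, while InstanceLoad, cached_tokens, queue_tokens, is_overloaded, instance_cost, pick_instance, the additive cost formula, and the default factor of 1.5 are hypothetical choices made for illustration. It also assumes that "ongoing tokens" means prefill plus decode tokens.

```python
from dataclasses import dataclass


@dataclass
class InstanceLoad:
    """Hypothetical per-instance load snapshot (not the proxy's real type)."""
    pending_prefill_tokens: int
    ongoing_decode_tokens: int
    cached_tokens: int = 0  # assumed: prompt tokens already cached on this instance


def queue_tokens(inst: InstanceLoad) -> int:
    # Fix 1: count decode load as well as pending prefill, so an instance
    # that is busy decoding no longer looks idle.
    return inst.pending_prefill_tokens + inst.ongoing_decode_tokens


def is_overloaded(inst, instances, overload_factor=1.5) -> bool:
    # Fix 2: hard gate relative to the fleet average (factor value is assumed).
    avg = sum(queue_tokens(i) for i in instances) / len(instances)
    return queue_tokens(inst) > avg * overload_factor


def instance_cost(inst: InstanceLoad, prompt_tokens: int) -> int:
    # Assumed cost: queued work plus the part of the prompt that is not cached.
    uncached = max(prompt_tokens - inst.cached_tokens, 0)
    return queue_tokens(inst) + uncached


def pick_instance(instances, prompt_tokens, overload_factor=1.5) -> int:
    # Apply the gate before the cost model so cache hits cannot keep
    # attracting traffic to an already overloaded instance.
    eligible = [i for i, inst in enumerate(instances)
                if not is_overloaded(inst, instances, overload_factor)]
    if not eligible:  # defensive fallback: consider every instance
        eligible = list(range(len(instances)))
    return min(eligible, key=lambda i: instance_cost(instances[i], prompt_tokens))


instances = [
    InstanceLoad(0, 5000, cached_tokens=900),  # warm cache, heavy decode load
    InstanceLoad(100, 200),
    InstanceLoad(0, 400),
]
choice = pick_instance(instances, prompt_tokens=1000)
```

In this example the first instance holds most of the prompt in cache, but its decode load puts it above the gate, so the request is routed to the second instance. Under a prefill-only queue the first instance would have appeared to be the cheapest choice.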

scripts: archive obsolete one-off shell/python scripts to legacy/ (D2, D3)

2026-05-23 20:57:32 +08:00

analyze_agentic_patterns.py

Balanced session-sticky routing + agentic workload pattern analysis

2026-05-22 01:50:27 +08:00

analyze_breakdown.py

Add per-request breakdown profiling, identify KV cache memory bottleneck

2026-05-22 00:13:50 +08:00

analyze_cache_hit.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

analyze_eviction.py

KV cache lifecycle design + eviction loss analysis

2026-05-22 01:27:22 +08:00

analyze_trace.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

bench.sh

Fix multi-turn replay fidelity: track realized output tokens across all components

2026-05-24 14:47:51 +08:00

cache_aware_proxy.py

Fix unified cost model: include decode load in queue + hard overload gate

2026-05-24 16:25:02 +08:00

compare_results.py

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

compute_roofline.py

compute_roofline: argparse --trace, fix stale default path (D4)

2026-05-23 20:58:09 +08:00

deploy_vllm_patches.sh

Add deploy_vllm_patches.sh: sync third_party/vllm patches to site-packages

2026-05-24 11:59:52 +08:00

gpu_monitor.sh

Add GPU utilization A/B test and fix cache-aware proxy bugs

2026-05-21 22:13:38 +08:00

launch_elastic_p2p.sh

Fix multi-turn replay fidelity: track realized output tokens across all components

2026-05-24 14:47:51 +08:00

launch_pd_mooncake.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

launch_pd_separated.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

launch_phase1_ps.sh

launch_phase1_ps: parameterise project + model paths (B6 followup)

2026-05-23 21:14:15 +08:00

launch_vllm.sh

Agentic workload PD separation analysis with trace-driven benchmarks

2026-05-21 21:21:57 +08:00

sample_trace.py

Production-realistic baseline: APC 67.5%, TPOT +139% from interference

2026-05-23 15:44:34 +08:00

simulate_cache_policies.py

Cache policy simulation: routing quality dominates, not eviction policy

2026-05-22 01:28:53 +08:00

test_direct_read.py

Fix hash mismatch: token-based lookup instead of cross-instance hash matching

2026-05-24 01:14:33 +08:00