agentic-kvc

gahow/agentic-kvc

Fork 0

Commit Graph

Author	SHA1	Message	Date
Gahow Wang	d6e47d3742	Design doc: Adaptive Prefill Offload All 8 GPUs stay PD-combined. Global scheduler classifies requests as WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache. Only HEAVY requests (20%, cold start >20k new tokens) get offloaded; 80% of requests are co-located with zero KV transfer. This avoids the KV cache memory wall (no decode concentration) while isolating heavy prefills from decode when needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:44:22 +08:00

Author

SHA1

Message

Date

Gahow Wang

d6e47d3742

Design doc: Adaptive Prefill Offload

All 8 GPUs stay PD-combined. Global scheduler classifies requests as
WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache.
Only HEAVY requests (20%, cold start >20k new tokens) get offloaded;
80% of requests are co-located with zero KV transfer.

This avoids the KV cache memory wall (no decode concentration) while
isolating heavy prefills from decode when needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 00:44:22 +08:00

1 Commits