- [LAPS: A Length-Aware-Prefill LLM Serving System](https://arxiv.org/abs/2601.11589)
- [Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading](https://arxiv.org/abs/2601.19910)
- [DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference](https://arxiv.org/abs/2602.21548)
- [DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving](https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf)
- [Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving](https://arxiv.org/pdf/2603.13358)
- [Efficient Multi-round LLM Inference over Disaggregated Serving](https://arxiv.org/pdf/2602.14516v1)
- [Roadmap: SGLang Distributed KVCache System For Agentic Workload](https://github.com/sgl-project/sglang/issues/21846)
- [RFC: P/D Prefill compute optimizations with bi-directional KV cache transfers between P and D nodes](https://github.com/vllm-project/vllm/issues/32733)
- [SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents](https://arxiv.org/html/2602.09447v1)
- [Prefill/Decode Disaggregation | llm-d](https://llm-d.ai/docs/guide/Installation/pd-disaggregation)
- [CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving | USENIX](https://www.usenix.org/conference/fast26/presentation/liu-yang)