Files

Gahow Wang 445e491123 Add vLLM v0.18.1 source tree with KV transfer abort fix

third_party/vllm/ now tracked in git for direct patch management.
Based on vLLM v0.18.1 release with one patch applied:

  vllm/v1/core/sched/scheduler.py:
    Replace fatal assert with graceful skip when KV transfer callback
    arrives for an already-aborted request during PD disaggregated serving.

Future vLLM modifications should be made directly in third_party/vllm/
and committed normally. The patches/ directory is kept as documentation
of what changed from upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 00:30:38 +08:00

api

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

assets

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

benchmarking

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

cli

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

community

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

configuration

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

contributing

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

deployment

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

design

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

examples

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

features

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

getting_started

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

governance

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

mkdocs

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

models

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

serving

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

training

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

usage

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

.nav.yml

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

maybe_skip_pr_build.sh

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

README.md

Add vLLM v0.18.1 source tree with KV transfer abort fix

2026-05-22 00:30:38 +08:00

README.md

hide

navigation

toc

Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM Light" class="logo-light" width="60%" } ![](./assets/logos/vllm-logo-text-dark.png){ align="center" alt="vLLM Dark" class="logo-dark" width="60%" }

Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Where to get started with vLLM depends on the type of user. If you are looking to:

Run open-source models on vLLM, we recommend starting with the Quickstart Guide
Build applications with vLLM, we recommend starting with the User Guide
Build vLLM, we recommend starting with Developer Guide

For information about the development of vLLM, see:

vLLM is fast with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantization: GPTQ, AWQ, INT4, INT8, and FP8
Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
Speculative decoding
Chunked prefill

vLLM is flexible and easy to use with:

Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor, pipeline, data and expert parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server
Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
Prefix caching support
Multi-LoRA support

For more information, check out the following:

vLLM announcing blog post (intro to PagedAttention)
vLLM paper (SOSP 2023)
How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
vLLM Meetups