Add vLLM v0.18.1 source tree with KV transfer abort fix
third_party/vllm/ now tracked in git for direct patch management.
Based on vLLM v0.18.1 release with one patch applied:
vllm/v1/core/sched/scheduler.py:
Replace fatal assert with graceful skip when KV transfer callback
arrives for an already-aborted request during PD disaggregated serving.
Future vLLM modifications should be made directly in third_party/vllm/
and committed normally. The patches/ directory is kept as documentation
of what changed from upstream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
65
third_party/vllm/docs/.nav.yml
vendored
Normal file
@@ -0,0 +1,65 @@
|
||||
nav:
|
||||
- Home: README.md
|
||||
- User Guide:
|
||||
- usage/README.md
|
||||
- Getting Started:
|
||||
- getting_started/quickstart.md
|
||||
- getting_started/installation
|
||||
- Examples: examples
|
||||
- General:
|
||||
- usage/v1_guide.md
|
||||
- usage/*
|
||||
- Inference and Serving:
|
||||
- serving/offline_inference.md
|
||||
- serving/openai_compatible_server.md
|
||||
- serving/*
|
||||
- serving/integrations
|
||||
- Deployment:
|
||||
- deployment/*
|
||||
- deployment/frameworks
|
||||
- deployment/integrations
|
||||
- Training: training
|
||||
- Configuration:
|
||||
- configuration/*
|
||||
- TPU: https://docs.vllm.ai/projects/tpu/en/latest/
|
||||
- Models:
|
||||
- models/supported_models.md
|
||||
- models/generative_models.md
|
||||
- models/pooling_models.md
|
||||
- models/extensions
|
||||
- Hardware Supported Models:
|
||||
- models/hardware_supported_models/*
|
||||
- TPU: https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/
|
||||
- Features: features
|
||||
- Developer Guide:
|
||||
- contributing/README.md
|
||||
- General:
|
||||
- glob: contributing/*
|
||||
flatten_single_child_sections: true
|
||||
- Model Implementation:
|
||||
- contributing/model/README.md
|
||||
- contributing/model/basic.md
|
||||
- contributing/model/registration.md
|
||||
- contributing/model/tests.md
|
||||
- contributing/model/multimodal.md
|
||||
- contributing/model/transcription.md
|
||||
- CI: contributing/ci
|
||||
- Design Documents:
|
||||
- Plugins:
|
||||
- design/*plugin*.md
|
||||
- design/*
|
||||
- Benchmarking:
|
||||
- benchmarking/README.md
|
||||
- benchmarking/cli.md
|
||||
- benchmarking/sweeps.md
|
||||
- benchmarking/dashboard.md
|
||||
- API Reference:
|
||||
- api/README.md
|
||||
- api/vllm
|
||||
- CLI Reference: cli
|
||||
- Community:
|
||||
- community/*
|
||||
- Governance: governance
|
||||
- Blog: https://blog.vllm.ai
|
||||
- Forum: https://discuss.vllm.ai
|
||||
- Slack: https://slack.vllm.ai
|
||||
68
third_party/vllm/docs/README.md
vendored
Normal file
@@ -0,0 +1,68 @@
|
||||
---
|
||||
hide:
|
||||
- navigation
|
||||
- toc
|
||||
---
|
||||
|
||||
# Welcome to vLLM
|
||||
|
||||
<figure markdown="span">
|
||||
{ align="center" alt="vLLM Light" class="logo-light" width="60%" }
|
||||
{ align="center" alt="vLLM Dark" class="logo-dark" width="60%" }
|
||||
</figure>
|
||||
|
||||
<p style="text-align:center">
|
||||
<strong>Easy, fast, and cheap LLM serving for everyone
|
||||
</strong>
|
||||
</p>
|
||||
|
||||
<p style="text-align:center">
|
||||
<script async defer src="https://buttons.github.io/buttons.js"></script>
|
||||
<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
|
||||
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-show-count="true" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
|
||||
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-show-count="true" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
|
||||
</p>
|
||||
|
||||
vLLM is a fast and easy-to-use library for LLM inference and serving.
|
||||
|
||||
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
|
||||
|
||||
Where to get started with vLLM depends on the type of user. If you are looking to:
|
||||
|
||||
- Run open-source models on vLLM, we recommend starting with the [Quickstart Guide](./getting_started/quickstart.md)
|
||||
- Build applications with vLLM, we recommend starting with the [User Guide](./usage/README.md)
|
||||
- Build vLLM, we recommend starting with [Developer Guide](./contributing/README.md)
|
||||
|
||||
For information about the development of vLLM, see:
|
||||
|
||||
- [Roadmap](https://roadmap.vllm.ai)
|
||||
- [Releases](https://github.com/vllm-project/vllm/releases)
|
||||
|
||||
vLLM is fast with:
|
||||
|
||||
- State-of-the-art serving throughput
|
||||
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
|
||||
- Continuous batching of incoming requests
|
||||
- Fast model execution with CUDA/HIP graph
|
||||
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
|
||||
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
|
||||
- Speculative decoding
|
||||
- Chunked prefill
|
||||
|
||||
vLLM is flexible and easy to use with:
|
||||
|
||||
- Seamless integration with popular HuggingFace models
|
||||
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
|
||||
- Tensor, pipeline, data and expert parallelism support for distributed inference
|
||||
- Streaming outputs
|
||||
- OpenAI-compatible API server
|
||||
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
|
||||
- Prefix caching support
|
||||
- Multi-LoRA support
|
||||
|
||||
For more information, check out the following:
|
||||
|
||||
- [vLLM announcing blog post](https://blog.vllm.ai/2023/06/20/vllm.html) (intro to PagedAttention)
|
||||
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
|
||||
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
|
||||
- [vLLM Meetups](community/meetups.md)
|
||||
93
third_party/vllm/docs/api/README.md
vendored
Normal file
@@ -0,0 +1,93 @@
|
||||
# Summary
|
||||
|
||||
## Configuration
|
||||
|
||||
API documentation for vLLM's configuration classes.
|
||||
|
||||
- [vllm.config.ModelConfig][]
|
||||
- [vllm.config.CacheConfig][]
|
||||
- [vllm.config.LoadConfig][]
|
||||
- [vllm.config.ParallelConfig][]
|
||||
- [vllm.config.SchedulerConfig][]
|
||||
- [vllm.config.DeviceConfig][]
|
||||
- [vllm.config.SpeculativeConfig][]
|
||||
- [vllm.config.LoRAConfig][]
|
||||
- [vllm.config.MultiModalConfig][]
|
||||
- [vllm.config.PoolerConfig][]
|
||||
- [vllm.config.StructuredOutputsConfig][]
|
||||
- [vllm.config.ProfilerConfig][]
|
||||
- [vllm.config.ObservabilityConfig][]
|
||||
- [vllm.config.KVTransferConfig][]
|
||||
- [vllm.config.CompilationConfig][]
|
||||
- [vllm.config.VllmConfig][]
|
||||
|
||||
## Offline Inference
|
||||
|
||||
LLM Class.
|
||||
|
||||
- [vllm.LLM][]
|
||||
|
||||
LLM Inputs.
|
||||
|
||||
- [vllm.inputs.PromptType][]
|
||||
- [vllm.inputs.TextPrompt][]
|
||||
- [vllm.inputs.TokensPrompt][]
|
||||
|
||||
## vLLM Engines
|
||||
|
||||
Engine classes for offline and online inference.
|
||||
|
||||
- [vllm.LLMEngine][]
|
||||
- [vllm.AsyncLLMEngine][]
|
||||
|
||||
## Inference Parameters
|
||||
|
||||
Inference parameters for vLLM APIs.
|
||||
|
||||
- [vllm.SamplingParams][]
|
||||
- [vllm.PoolingParams][]
|
||||
|
||||
## Multi-Modality
|
||||
|
||||
vLLM provides experimental support for multi-modal models through the [vllm.multimodal][] package.
|
||||
|
||||
Multi-modal inputs can be passed alongside text and token prompts to [supported models](../models/supported_models.md#list-of-multimodal-language-models)
|
||||
via the `multi_modal_data` field in [vllm.inputs.PromptType][].
|
||||
|
||||
Looking to add your own multi-modal model? Please follow the instructions listed [here](../contributing/model/multimodal.md).
|
||||
|
||||
- [vllm.multimodal.MULTIMODAL_REGISTRY][]
|
||||
|
||||
### Inputs
|
||||
|
||||
User-facing inputs.
|
||||
|
||||
- [vllm.multimodal.inputs.MultiModalDataDict][]
|
||||
|
||||
Internal data structures.
|
||||
|
||||
- [vllm.multimodal.inputs.PlaceholderRange][]
|
||||
- [vllm.multimodal.inputs.NestedTensors][]
|
||||
- [vllm.multimodal.inputs.MultiModalFieldElem][]
|
||||
- [vllm.multimodal.inputs.MultiModalFieldConfig][]
|
||||
- [vllm.multimodal.inputs.MultiModalKwargsItem][]
|
||||
- [vllm.multimodal.inputs.MultiModalKwargsItems][]
|
||||
- [vllm.multimodal.inputs.MultiModalInputs][]
|
||||
|
||||
### Data Parsing
|
||||
|
||||
- [vllm.multimodal.parse][]
|
||||
|
||||
### Data Processing
|
||||
|
||||
- [vllm.multimodal.processing][]
|
||||
|
||||
### Registry
|
||||
|
||||
- [vllm.multimodal.registry][]
|
||||
|
||||
## Model Development
|
||||
|
||||
- [vllm.model_executor.models.interfaces_base][]
|
||||
- [vllm.model_executor.models.interfaces][]
|
||||
- [vllm.model_executor.models.adapters][]
|
||||
2
third_party/vllm/docs/api/vllm/.meta.yml
vendored
Normal file
@@ -0,0 +1,2 @@
|
||||
search:
|
||||
exclude: true
|
||||
BIN
third_party/vllm/docs/assets/contributing/dockerfile-stages-dependency.png
vendored
Normal file
|
After Width: | Height: | Size: 325 KiB |
BIN
third_party/vllm/docs/assets/contributing/load-pattern-examples.png
vendored
Normal file
|
After Width: | Height: | Size: 577 KiB |
BIN
third_party/vllm/docs/assets/deployment/anything-llm-chat-with-doc.png
vendored
Normal file
|
After Width: | Height: | Size: 118 KiB |
BIN
third_party/vllm/docs/assets/deployment/anything-llm-chat-without-doc.png
vendored
Normal file
|
After Width: | Height: | Size: 136 KiB |
BIN
third_party/vllm/docs/assets/deployment/anything-llm-provider.png
vendored
Normal file
|
After Width: | Height: | Size: 110 KiB |
BIN
third_party/vllm/docs/assets/deployment/anything-llm-upload-doc.png
vendored
Normal file
|
After Width: | Height: | Size: 111 KiB |
BIN
third_party/vllm/docs/assets/deployment/architecture_helm_deployment.png
vendored
Normal file
|
After Width: | Height: | Size: 968 KiB |
BIN
third_party/vllm/docs/assets/deployment/chatbox-chat.png
vendored
Normal file
|
After Width: | Height: | Size: 107 KiB |
BIN
third_party/vllm/docs/assets/deployment/chatbox-settings.png
vendored
Normal file
|
After Width: | Height: | Size: 95 KiB |
BIN
third_party/vllm/docs/assets/deployment/claude-code-example.png
vendored
Normal file
|
After Width: | Height: | Size: 1.1 MiB |
BIN
third_party/vllm/docs/assets/deployment/dify-chat.png
vendored
Normal file
|
After Width: | Height: | Size: 143 KiB |
BIN
third_party/vllm/docs/assets/deployment/dify-create-chatbot.png
vendored
Normal file
|
After Width: | Height: | Size: 265 KiB |
BIN
third_party/vllm/docs/assets/deployment/dify-settings.png
vendored
Normal file
|
After Width: | Height: | Size: 52 KiB |
BIN
third_party/vllm/docs/assets/deployment/dp_external_lb.png
vendored
Normal file
|
After Width: | Height: | Size: 84 KiB |
BIN
third_party/vllm/docs/assets/deployment/dp_internal_lb.png
vendored
Normal file
|
After Width: | Height: | Size: 68 KiB |
BIN
third_party/vllm/docs/assets/deployment/hf-inference-endpoints-catalog.png
vendored
Normal file
|
After Width: | Height: | Size: 627 KiB |
BIN
third_party/vllm/docs/assets/deployment/hf-inference-endpoints-choose-infra.png
vendored
Normal file
|
After Width: | Height: | Size: 350 KiB |
BIN
third_party/vllm/docs/assets/deployment/hf-inference-endpoints-click-deploy-button.png
vendored
Normal file
|
After Width: | Height: | Size: 814 KiB |
BIN
third_party/vllm/docs/assets/deployment/hf-inference-endpoints-configure-container.png
vendored
Normal file
|
After Width: | Height: | Size: 267 KiB |
BIN
third_party/vllm/docs/assets/deployment/hf-inference-endpoints-create-endpoint.png
vendored
Normal file
|
After Width: | Height: | Size: 354 KiB |
BIN
third_party/vllm/docs/assets/deployment/hf-inference-endpoints-locate-deploy-button.png
vendored
Normal file
|
After Width: | Height: | Size: 781 KiB |
BIN
third_party/vllm/docs/assets/deployment/hf-inference-endpoints-new-endpoint.png
vendored
Normal file
|
After Width: | Height: | Size: 51 KiB |
BIN
third_party/vllm/docs/assets/deployment/hf-inference-endpoints-select-hardware.png
vendored
Normal file
|
After Width: | Height: | Size: 359 KiB |
BIN
third_party/vllm/docs/assets/deployment/hf-inference-endpoints-select-model.png
vendored
Normal file
|
After Width: | Height: | Size: 82 KiB |
BIN
third_party/vllm/docs/assets/deployment/open_webui.png
vendored
Normal file
|
After Width: | Height: | Size: 57 KiB |
BIN
third_party/vllm/docs/assets/deployment/streamlit-chat.png
vendored
Normal file
|
After Width: | Height: | Size: 106 KiB |
BIN
third_party/vllm/docs/assets/design/arch_overview/entrypoints.excalidraw.png
vendored
Normal file
|
After Width: | Height: | Size: 120 KiB |
BIN
third_party/vllm/docs/assets/design/arch_overview/llm_engine.excalidraw.png
vendored
Normal file
|
After Width: | Height: | Size: 174 KiB |
BIN
third_party/vllm/docs/assets/design/arch_overview/v1_process_architecture_tp2_dp4.png
vendored
Normal file
|
After Width: | Height: | Size: 4.7 MiB |
BIN
third_party/vllm/docs/assets/design/arch_overview/v1_process_architecture_tp4.png
vendored
Normal file
|
After Width: | Height: | Size: 3.8 MiB |
BIN
third_party/vllm/docs/assets/design/cuda_graphs/current_design.png
vendored
Normal file
|
After Width: | Height: | Size: 70 KiB |
BIN
third_party/vllm/docs/assets/design/cuda_graphs/executor_runtime.png
vendored
Normal file
|
After Width: | Height: | Size: 60 KiB |
BIN
third_party/vllm/docs/assets/design/cuda_graphs/previous_design.png
vendored
Normal file
|
After Width: | Height: | Size: 44 KiB |
BIN
third_party/vllm/docs/assets/design/cuda_graphs/wrapper_flow.png
vendored
Normal file
|
After Width: | Height: | Size: 87 KiB |
BIN
third_party/vllm/docs/assets/design/debug_vllm_compile/design_diagram.png
vendored
Normal file
|
After Width: | Height: | Size: 314 KiB |
BIN
third_party/vllm/docs/assets/design/debug_vllm_compile/dynamic_shapes.png
vendored
Normal file
|
After Width: | Height: | Size: 359 KiB |
BIN
third_party/vllm/docs/assets/design/debug_vllm_compile/tlparse_inductor.png
vendored
Normal file
|
After Width: | Height: | Size: 257 KiB |
BIN
third_party/vllm/docs/assets/design/fused_moe_modular_kernel/fused_experts_blocks.png
vendored
Normal file
|
After Width: | Height: | Size: 187 KiB |
BIN
third_party/vllm/docs/assets/design/fused_moe_modular_kernel/fused_moe_batched.png
vendored
Normal file
|
After Width: | Height: | Size: 189 KiB |
BIN
third_party/vllm/docs/assets/design/fused_moe_modular_kernel/fused_moe_non_batched.png
vendored
Normal file
|
After Width: | Height: | Size: 227 KiB |
BIN
third_party/vllm/docs/assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png
vendored
Normal file
|
After Width: | Height: | Size: 128 KiB |
BIN
third_party/vllm/docs/assets/design/hierarchy.png
vendored
Normal file
|
After Width: | Height: | Size: 170 KiB |
BIN
third_party/vllm/docs/assets/design/hybrid_kv_cache_manager/basic_grouping_example.png
vendored
Normal file
|
After Width: | Height: | Size: 24 KiB |
BIN
third_party/vllm/docs/assets/design/hybrid_kv_cache_manager/full_attn.png
vendored
Normal file
|
After Width: | Height: | Size: 4.0 KiB |
BIN
third_party/vllm/docs/assets/design/hybrid_kv_cache_manager/memory_layout.png
vendored
Normal file
|
After Width: | Height: | Size: 62 KiB |
BIN
third_party/vllm/docs/assets/design/hybrid_kv_cache_manager/overview.png
vendored
Normal file
|
After Width: | Height: | Size: 39 KiB |
BIN
third_party/vllm/docs/assets/design/hybrid_kv_cache_manager/sw_attn.png
vendored
Normal file
|
After Width: | Height: | Size: 4.5 KiB |
BIN
third_party/vllm/docs/assets/design/metrics/intervals-1.png
vendored
Normal file
|
After Width: | Height: | Size: 185 KiB |
BIN
third_party/vllm/docs/assets/design/metrics/intervals-2.png
vendored
Normal file
|
After Width: | Height: | Size: 162 KiB |
BIN
third_party/vllm/docs/assets/design/metrics/intervals-3.png
vendored
Normal file
|
After Width: | Height: | Size: 161 KiB |
BIN
third_party/vllm/docs/assets/design/model_runner_v2/async_no_race_condition.png
vendored
Normal file
|
After Width: | Height: | Size: 130 KiB |
BIN
third_party/vllm/docs/assets/design/model_runner_v2/async_race_condition.png
vendored
Normal file
|
After Width: | Height: | Size: 128 KiB |
BIN
third_party/vllm/docs/assets/design/model_runner_v2/async_sched.png
vendored
Normal file
|
After Width: | Height: | Size: 254 KiB |
BIN
third_party/vllm/docs/assets/design/model_runner_v2/persistent_batch_mrv2.png
vendored
Normal file
|
After Width: | Height: | Size: 73 KiB |
BIN
third_party/vllm/docs/assets/design/model_runner_v2/persistent_batch_v1.png
vendored
Normal file
|
After Width: | Height: | Size: 65 KiB |
BIN
third_party/vllm/docs/assets/design/paged_attention/k_vecs.png
vendored
Normal file
|
After Width: | Height: | Size: 27 KiB |
BIN
third_party/vllm/docs/assets/design/paged_attention/key.png
vendored
Normal file
|
After Width: | Height: | Size: 109 KiB |
BIN
third_party/vllm/docs/assets/design/paged_attention/logits_vec.png
vendored
Normal file
|
After Width: | Height: | Size: 17 KiB |
BIN
third_party/vllm/docs/assets/design/paged_attention/q_vecs.png
vendored
Normal file
|
After Width: | Height: | Size: 41 KiB |
BIN
third_party/vllm/docs/assets/design/paged_attention/query.png
vendored
Normal file
|
After Width: | Height: | Size: 32 KiB |
BIN
third_party/vllm/docs/assets/design/paged_attention/v_vec.png
vendored
Normal file
|
After Width: | Height: | Size: 42 KiB |
BIN
third_party/vllm/docs/assets/design/paged_attention/value.png
vendored
Normal file
|
After Width: | Height: | Size: 167 KiB |
BIN
third_party/vllm/docs/assets/design/prefix_caching/example-time-1.png
vendored
Normal file
|
After Width: | Height: | Size: 47 KiB |
BIN
third_party/vllm/docs/assets/design/prefix_caching/example-time-3.png
vendored
Normal file
|
After Width: | Height: | Size: 50 KiB |
BIN
third_party/vllm/docs/assets/design/prefix_caching/example-time-4.png
vendored
Normal file
|
After Width: | Height: | Size: 59 KiB |
BIN
third_party/vllm/docs/assets/design/prefix_caching/example-time-5.png
vendored
Normal file
|
After Width: | Height: | Size: 54 KiB |
BIN
third_party/vllm/docs/assets/design/prefix_caching/example-time-6.png
vendored
Normal file
|
After Width: | Height: | Size: 54 KiB |
BIN
third_party/vllm/docs/assets/design/prefix_caching/example-time-7.png
vendored
Normal file
|
After Width: | Height: | Size: 55 KiB |
BIN
third_party/vllm/docs/assets/design/prefix_caching/free.png
vendored
Normal file
|
After Width: | Height: | Size: 18 KiB |
BIN
third_party/vllm/docs/assets/design/prefix_caching/overview.png
vendored
Normal file
|
After Width: | Height: | Size: 32 KiB |
BIN
third_party/vllm/docs/assets/design/tpu/most_model_len.png
vendored
Normal file
|
After Width: | Height: | Size: 12 KiB |
BIN
third_party/vllm/docs/assets/features/disagg_encoder/disagg_encoder_flow.png
vendored
Normal file
|
After Width: | Height: | Size: 84 KiB |
BIN
third_party/vllm/docs/assets/features/disagg_prefill/abstraction.jpg
vendored
Normal file
|
After Width: | Height: | Size: 102 KiB |
BIN
third_party/vllm/docs/assets/features/disagg_prefill/high_level_design.png
vendored
Normal file
|
After Width: | Height: | Size: 91 KiB |
BIN
third_party/vllm/docs/assets/features/disagg_prefill/overview.jpg
vendored
Normal file
|
After Width: | Height: | Size: 173 KiB |
BIN
third_party/vllm/docs/assets/features/disagg_prefill/workflow.png
vendored
Normal file
|
After Width: | Height: | Size: 88 KiB |
321
third_party/vllm/docs/assets/features/speculative_decoding/speculators-user-flow-dark.svg
vendored
Normal file
|
After Width: | Height: | Size: 339 KiB |
275
third_party/vllm/docs/assets/features/speculative_decoding/speculators-user-flow-light.svg
vendored
Normal file
|
After Width: | Height: | Size: 374 KiB |
BIN
third_party/vllm/docs/assets/logos/vllm-logo-only-light.ico
vendored
Normal file
|
After Width: | Height: | Size: 17 KiB |
BIN
third_party/vllm/docs/assets/logos/vllm-logo-only-light.png
vendored
Normal file
|
After Width: | Height: | Size: 53 KiB |
BIN
third_party/vllm/docs/assets/logos/vllm-logo-text-dark.png
vendored
Normal file
|
After Width: | Height: | Size: 86 KiB |
BIN
third_party/vllm/docs/assets/logos/vllm-logo-text-light.png
vendored
Normal file
|
After Width: | Height: | Size: 88 KiB |
7
third_party/vllm/docs/benchmarking/README.md
vendored
Normal file
@@ -0,0 +1,7 @@
|
||||
# Benchmark Suites
|
||||
|
||||
vLLM provides comprehensive benchmarking tools for performance testing and evaluation:
|
||||
|
||||
- **[Benchmark CLI](./cli.md)**: `vllm bench` CLI tools and specialized benchmark scripts for interactive performance testing.
|
||||
- **[Parameter Sweeps](./sweeps.md)**: Automate `vllm bench` runs for multiple configurations, useful for [optimization and tuning](../configuration/optimization.md).
|
||||
- **[Performance Dashboard](./dashboard.md)**: Automated CI that publishes benchmarks on each commit.
|
||||
1143
third_party/vllm/docs/benchmarking/cli.md
vendored
Normal file
122
third_party/vllm/docs/benchmarking/dashboard.md
vendored
Normal file
@@ -0,0 +1,122 @@
|
||||
# Performance Dashboard
|
||||
|
||||
The performance dashboard is used to confirm whether new changes improve/degrade performance under various workloads.
|
||||
It is updated by triggering benchmark runs on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.
|
||||
|
||||
The results are automatically published to the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
|
||||
|
||||
## Manually Trigger the benchmark
|
||||
|
||||
Use [vllm-ci-test-repo images](https://gallery.ecr.aws/q9t5s3a7/vllm-ci-test-repo) with vLLM benchmark suite.
|
||||
For x86 CPU environment, please use the image with "-cpu" postfix. For AArch64 CPU environment, please use the image with "-arm64-cpu" postfix.
|
||||
|
||||
Here is an example for docker run command for CPU. For GPUs skip setting the `ON_CPU` env var.
|
||||
|
||||
```bash
|
||||
export VLLM_COMMIT=7f42dc20bb2800d09faa72b26f25d54e26f1b694 # use full commit hash from the main branch
|
||||
export HF_TOKEN=<valid Hugging Face token>
|
||||
if [[ "$(uname -m)" == aarch64 || "$(uname -m)" == arm64 ]]; then
|
||||
IMG_SUFFIX="arm64-cpu"
|
||||
else
|
||||
IMG_SUFFIX="cpu"
|
||||
fi
|
||||
docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -e HF_TOKEN=$HF_TOKEN -e ON_CPU=1 --shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:${VLLM_COMMIT}-${IMG_SUFFIX}
|
||||
```
|
||||
|
||||
Then, run below command inside the docker instance.
|
||||
|
||||
```bash
|
||||
bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
|
||||
```
|
||||
|
||||
When run, benchmark script generates results under **benchmark/results** folder, along with the benchmark_results.md and benchmark_results.json.
|
||||
|
||||
### Runtime environment variables
|
||||
|
||||
- `ON_CPU`: set the value to '1' on Intel® Xeon® and Arm® Neoverse™ Processors. Default value is 0.
|
||||
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
|
||||
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
|
||||
- `THROUGHPUT_JSON`: JSON file to use for the throughout tests. Default value is empty string (use default file).
|
||||
- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
|
||||
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
|
||||
- `PROMPTS_PER_CONCURRENCY`: Multiplier to compute `num_prompts` for serving tests (`num_prompts = max_concurrency × value`). Overrides JSON `num_prompts`. Default is NULL.
|
||||
- `ENABLE_ADAPTIVE_CONCURRENCY`: set the value to '1' to enable adaptive SLA-based concurrency search after the static serving max_concurrency sweep. Default value is 0.
|
||||
- `SLA_TTFT_MS`: default TTFT SLA threshold in milliseconds for adaptive concurrency search. Default value is 3000.
|
||||
- `SLA_TPOT_MS`: default TPOT SLA threshold in milliseconds for adaptive concurrency search. Default value is 100.
|
||||
- `ADAPTIVE_MAX_PROBES`: maximum number of extra adaptive search probes. Default value is 8.
|
||||
- `ADAPTIVE_MAX_CONCURRENCY`: maximum allowed concurrency during adaptive search. Default value is 1024.
|
||||
|
||||
### Visualization
|
||||
|
||||
The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table with real benchmarking results.
|
||||
You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
|
||||
If you do not see the table, please wait till the benchmark finish running.
|
||||
The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
|
||||
The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.
|
||||
|
||||
#### Performance Results Comparison
|
||||
|
||||
The `compare-json-results.py` helps to compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
|
||||
When run, benchmark script generates results under `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`.
|
||||
`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
|
||||
If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead.
|
||||
|
||||
Here is an example using the script to compare result_a and result_b with max concurrency and qps for same Model, Dataset name, input/output length.
|
||||
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
|
||||
|
||||
***Output Tput (tok/s) — Model : [ meta-llama/Llama-3.1-8B-Instruct ] , Dataset Name : [ random ] , Input Len : [ 2048.0 ] , Output Len : [ 2048.0 ]***
|
||||
|
||||
| | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|
||||
| | -------------------- | --- | -------------------------------- | -------------------------------- | ---------- |
|
||||
| 0 | 12 | inf | 24.98 | 186.03 | 7.45 |
|
||||
| 1 | 16 | inf | 25.49 | 246.92 | 9.69 |
|
||||
| 2 | 24 | inf | 27.74 | 293.34 | 10.57 |
|
||||
| 3 | 32 | inf | 28.61 |306.69 | 10.72 |
|
||||
|
||||
***compare-json-results.py – Command-Line Parameters***
|
||||
|
||||
compare-json-results.py provides configurable parameters to compare one or more benchmark_results.json files and generate summary tables and plots.
|
||||
In most cases, users only need to specify --file to parse the desired benchmark results.
|
||||
|
||||
| Parameter | Type | Default Value | Description |
|
||||
| ---------------------- | ------------------ | ----------------------- | ----------------------------------------------------------------------------------------------------- |
|
||||
| `--file` | `str` (appendable) | *None* | Input JSON result file(s). Can be specified multiple times to compare multiple benchmark outputs. |
|
||||
| `--debug` | `bool` | `False` | Enables debug mode. When set, prints all available information to aid troubleshooting and validation. |
|
||||
| `--plot` / `--no-plot` | `bool` | `True` | Controls whether performance plots are generated. Use `--no-plot` to disable graph generation. |
|
||||
| `--xaxis` | `str` | `# of max concurrency.` | Column name used as the X-axis in comparison plots (for example, concurrency or batch size). |
|
||||
| `--latency` | `str` | `p99` | Latency aggregation method used for TTFT/TPOT. Supported values: `median` or `p99`. |
|
||||
| `--ttft-max-ms` | `float` | `3000.0` | Reference upper bound (milliseconds) for TTFT plots, typically used to visualize SLA thresholds. |
|
||||
| `--tpot-max-ms` | `float` | `100.0` | Reference upper bound (milliseconds) for TPOT plots, typically used to visualize SLA thresholds. |
|
||||
|
||||
***Valid Max Concurrency Summary***
|
||||
|
||||
Based on the configured TTFT and TPOT SLA thresholds, compare-json-results.py computes the maximum valid concurrency for each benchmark result.
|
||||
The “Max # of max concurrency. (Both)” column represents the highest concurrency level that satisfies both TTFT and TPOT constraints simultaneously.
|
||||
This value is typically used in capacity planning and sizing guides.
|
||||
|
||||
| # | Configuration | Max # of max concurrency. (TTFT ≤ 10000 ms) | Max # of max concurrency. (TPOT ≤ 100 ms) | Max # of max concurrency. (Both) | Output Tput @ Both (tok/s) | TTFT @ Both (ms) | TPOT @ Both (ms) |
|
||||
| - | -------------- | ------------------------------------------- | ----------------------------------------- | -------------------------------- | -------------------------- | ---------------- | ---------------- |
|
||||
| 0 | results-a | 128.00 | 12.00 | 12.00 | 127.76 | 3000.82 | 93.24 |
|
||||
| 1 | results-b | 128.00 | 32.00 | 32.00 | 371.42 | 2261.53 | 81.74 |
|
||||
|
||||
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/performance-benchmarks/performance-benchmarks-descriptions.md).
|
||||
|
||||
## Continuous Benchmarking
|
||||
|
||||
The continuous benchmarking provides automated performance monitoring for vLLM across different models and GPU devices. This helps track vLLM's performance characteristics over time and identify any performance regressions or improvements.
|
||||
|
||||
### How It Works
|
||||
|
||||
The continuous benchmarking is triggered via a [GitHub workflow CI](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) in the PyTorch infrastructure repository, which runs automatically every 4 hours. The workflow executes three types of performance tests:
|
||||
|
||||
- **Serving tests**: Measure request handling and API performance
|
||||
- **Throughput tests**: Evaluate token generation rates
|
||||
- **Latency tests**: Assess response time characteristics
|
||||
|
||||
### Benchmark Configuration
|
||||
|
||||
The benchmarking currently runs on a predefined set of models configured in the [vllm-benchmarks directory](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks). To add new models for benchmarking:
|
||||
|
||||
1. Navigate to the appropriate GPU directory in the benchmarks configuration
|
||||
2. Add your model specifications to the corresponding configuration files
|
||||
3. The new models will be included in the next scheduled benchmark run
|
||||
261
third_party/vllm/docs/benchmarking/sweeps.md
vendored
Normal file
@@ -0,0 +1,261 @@
|
||||
# Parameter Sweeps
|
||||
|
||||
`vllm bench sweep` is a suite of commands designed to run benchmarks across multiple configurations and compare them by visualizing the results.
|
||||
|
||||
## Online Benchmark
|
||||
|
||||
### Basic
|
||||
|
||||
`vllm bench sweep serve` starts `vllm serve` and iteratively runs `vllm bench serve` for each server configuration.
|
||||
|
||||
!!! tip
|
||||
If you only need to run benchmarks for a single server configuration, consider using [GuideLLM](https://github.com/vllm-project/guidellm), an established performance benchmarking framework with live progress updates and automatic report generation. It is also more flexible than `vllm bench serve` in terms of dataset loading, request formatting, and workload patterns.
|
||||
|
||||
Follow these steps to run the script:
|
||||
|
||||
1. Construct the base command to `vllm serve`, and pass it to the `--serve-cmd` option.
|
||||
2. Construct the base command to `vllm bench serve`, and pass it to the `--bench-cmd` option.
|
||||
3. (Optional) If you would like to vary the settings of `vllm serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--serve-params`.
|
||||
|
||||
- Example: Tuning `--max-num-seqs` and `--max-num-batched-tokens`:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"max_num_seqs": 32,
|
||||
"max_num_batched_tokens": 1024
|
||||
},
|
||||
{
|
||||
"max_num_seqs": 64,
|
||||
"max_num_batched_tokens": 1024
|
||||
},
|
||||
{
|
||||
"max_num_seqs": 64,
|
||||
"max_num_batched_tokens": 2048
|
||||
},
|
||||
{
|
||||
"max_num_seqs": 128,
|
||||
"max_num_batched_tokens": 2048
|
||||
},
|
||||
{
|
||||
"max_num_seqs": 128,
|
||||
"max_num_batched_tokens": 4096
|
||||
},
|
||||
{
|
||||
"max_num_seqs": 256,
|
||||
"max_num_batched_tokens": 4096
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
4. (Optional) If you would like to vary the settings of `vllm bench serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--bench-params`.
|
||||
|
||||
- Example: Using different input/output lengths for random dataset:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"_benchmark_name": "scenario_A",
|
||||
"random_input_len": 128,
|
||||
"random_output_len": 32
|
||||
},
|
||||
{
|
||||
"_benchmark_name": "scenario_B",
|
||||
"random_input_len": 256,
|
||||
"random_output_len": 64
|
||||
},
|
||||
{
|
||||
"_benchmark_name": "scenario_C",
|
||||
"random_input_len": 512,
|
||||
"random_output_len": 128
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
5. Set `--output-dir` and optionally `--experiment-name` to control where to save the results.
|
||||
|
||||
Example command:
|
||||
|
||||
```bash
|
||||
vllm bench sweep serve \
|
||||
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
|
||||
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
|
||||
--serve-params benchmarks/serve_hparams.json \
|
||||
--bench-params benchmarks/bench_hparams.json \
|
||||
--output-dir benchmarks/results \
|
||||
--experiment-name demo
|
||||
```
|
||||
|
||||
By default, each parameter combination is benchmarked 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.
|
||||
|
||||
!!! important
|
||||
If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product between them.
|
||||
You can use `--dry-run` to preview the commands to be run.
|
||||
|
||||
We only start the server once for each `--serve-params`, and keep it running for multiple `--bench-params`.
|
||||
Between each benchmark run, we call all `/reset_*_cache` endpoints to get a clean slate for the next run.
|
||||
In case you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
|
||||
|
||||
!!! note
|
||||
You should set `_benchmark_name` to provide a human-readable name for parameter combinations involving many variables.
|
||||
This becomes mandatory if the file name would otherwise exceed the maximum path length allowed by the filesystem.
|
||||
|
||||
!!! tip
|
||||
You can use the `--resume` option to continue the parameter sweep if an unexpected error occurs, e.g., timeout when connecting to HF Hub.
|
||||
|
||||
### Workload Explorer
|
||||
|
||||
`vllm bench sweep serve_workload` is a variant of `vllm bench sweep serve` that explores different workload levels in order to find the tradeoff between latency and throughput. The results can also be [visualized](#visualization) to determine the feasible SLAs.
|
||||
|
||||
The workload can be expressed in terms of request rate or concurrency (choose using `--workload-var`).
|
||||
|
||||
Example command:
|
||||
|
||||
```bash
|
||||
vllm bench sweep serve_workload \
|
||||
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
|
||||
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100' \
|
||||
--workload-var max_concurrency \
|
||||
--serve-params benchmarks/serve_hparams.json \
|
||||
--bench-params benchmarks/bench_hparams.json \
|
||||
--num-runs 1 \
|
||||
--output-dir benchmarks/results \
|
||||
--experiment-name demo
|
||||
```
|
||||
|
||||
The algorithm for exploring different workload levels can be summarized as follows:
|
||||
|
||||
1. Run the benchmark by sending requests one at a time (serial inference, lowest workload). This results in the lowest possible latency and throughput.
|
||||
2. Run the benchmark by sending all requests at once (batch inference, highest workload). This results in the highest possible latency and throughput.
|
||||
3. Estimate the value of `workload_var` corresponding to Step 2.
|
||||
4. Run the benchmark over intermediate values of `workload_var` uniformly using the remaining iterations.
|
||||
|
||||
You can override the number of iterations in the algorithm by setting `--workload-iters`.
|
||||
|
||||
!!! tip
|
||||
This is our equivalent of [GuideLLM's `--profile sweep`](https://github.com/vllm-project/guidellm/blob/v0.5.3/src/guidellm/benchmark/profiles.py#L575).
|
||||
|
||||
In general, `--workload-var max_concurrency` produces more reliable results because it directly controls the workload imposed on the vLLM engine.
|
||||
Nevertheless, we default to `--workload-var request_rate` to maintain similar behavior as GuideLLM.
|
||||
|
||||
## Startup Benchmark
|
||||
|
||||
`vllm bench sweep startup` runs `vllm bench startup` across parameter combinations to compare cold/warm startup time for different engine settings.
|
||||
|
||||
Follow these steps to run the script:
|
||||
|
||||
1. (Optional) Construct the base command to `vllm bench startup`, and pass it to `--startup-cmd` (default: `vllm bench startup`).
|
||||
2. (Optional) Reuse a `--serve-params` JSON from `vllm bench sweep serve` to vary engine settings. Only parameters supported by `vllm bench startup` are applied.
|
||||
3. (Optional) Create a `--startup-params` JSON to vary startup-specific options like iteration counts.
|
||||
4. Determine where you want to save the results, and pass that to `--output-dir`.
|
||||
|
||||
Example `--serve-params`:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"_benchmark_name": "tp1",
|
||||
"model": "Qwen/Qwen3-0.6B",
|
||||
"tensor_parallel_size": 1,
|
||||
"gpu_memory_utilization": 0.9
|
||||
},
|
||||
{
|
||||
"_benchmark_name": "tp2",
|
||||
"model": "Qwen/Qwen3-0.6B",
|
||||
"tensor_parallel_size": 2,
|
||||
"gpu_memory_utilization": 0.9
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
Example `--startup-params`:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"_benchmark_name": "qwen3-0.6",
|
||||
"num_iters_cold": 2,
|
||||
"num_iters_warmup": 1,
|
||||
"num_iters_warm": 2
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
Example command:
|
||||
|
||||
```bash
|
||||
vllm bench sweep startup \
|
||||
--startup-cmd 'vllm bench startup --model Qwen/Qwen3-0.6B' \
|
||||
--serve-params benchmarks/serve_hparams.json \
|
||||
--startup-params benchmarks/startup_hparams.json \
|
||||
--output-dir benchmarks/results \
|
||||
--experiment-name demo
|
||||
```
|
||||
|
||||
!!! important
|
||||
By default, unsupported parameters in `--serve-params` or `--startup-params` are ignored with a warning.
|
||||
Use `--strict-params` to fail fast on unknown keys.
|
||||
|
||||
## Visualization
|
||||
|
||||
### Basic
|
||||
|
||||
`vllm bench sweep plot` can be used to plot performance curves from parameter sweep results.
|
||||
|
||||
Control the variables to plot via `--var-x` and `--var-y`, optionally applying `--filter-by` and `--bin-by` to the values. The plot is organized according to `--fig-by`, `--row-by`, `--col-by`, and `--curve-by`.
|
||||
|
||||
Example commands for visualizing [Workload Explorer](#workload-explorer) results:
|
||||
|
||||
```bash
|
||||
EXPERIMENT_DIR=${1:-"benchmarks/results/demo"}
|
||||
|
||||
# Latency increases as the workload increases
|
||||
vllm bench sweep plot $EXPERIMENT_DIR \
|
||||
--var-x max_concurrency \
|
||||
--var-y median_ttft_ms \
|
||||
--col-by _benchmark_name \
|
||||
--curve-by max_num_seqs,max_num_batched_tokens \
|
||||
--fig-name latency_curve
|
||||
|
||||
# Throughput saturates as workload increases
|
||||
vllm bench sweep plot $EXPERIMENT_DIR \
|
||||
--var-x max_concurrency \
|
||||
--var-y total_token_throughput \
|
||||
--col-by _benchmark_name \
|
||||
--curve-by max_num_seqs,max_num_batched_tokens \
|
||||
--fig-name throughput_curve
|
||||
|
||||
# Tradeoff between latency and throughput
|
||||
vllm bench sweep plot $EXPERIMENT_DIR \
|
||||
--var-x total_token_throughput \
|
||||
--var-y median_ttft_ms \
|
||||
--col-by _benchmark_name \
|
||||
--curve-by max_num_seqs,max_num_batched_tokens \
|
||||
--fig-name latency_throughput
|
||||
```
|
||||
|
||||
!!! tip
|
||||
You can use `--dry-run` to preview the figures to be plotted.
|
||||
|
||||
### Pareto chart
|
||||
|
||||
`vllm bench sweep plot_pareto` helps pick configurations that balance per-user and per-GPU throughput.
|
||||
|
||||
Higher concurrency or batch size can raise GPU efficiency (per-GPU), but can add per user latency; lower concurrency improves per-user rate but underutilizes GPUs; The Pareto frontier shows the best achievable pairs across your runs.
|
||||
|
||||
- x-axis: tokens/s/user = `output_throughput` ÷ concurrency (`--user-count-var`, default `max_concurrency`, fallback `max_concurrent_requests`).
|
||||
- y-axis: tokens/s/GPU = `output_throughput` ÷ GPU count (`--gpu-count-var` if set; else gpu_count is TP×PP*DP).
|
||||
- Output: a single figure at `OUTPUT_DIR/pareto/PARETO.png`.
|
||||
- Show the configuration used in each data point `--label-by` (default: `max_concurrency,gpu_count`).
|
||||
|
||||
Example:
|
||||
|
||||
```bash
|
||||
EXPERIMENT_DIR=${1:-"benchmarks/results/demo"}
|
||||
|
||||
vllm bench sweep plot_pareto $EXPERIMENT_DIR \
|
||||
--label-by max_concurrency,tensor_parallel_size,pipeline_parallel_size
|
||||
```
|
||||
|
||||
!!! tip
|
||||
You can use `--dry-run` to preview the figures to be plotted.
|
||||
1
third_party/vllm/docs/cli/.meta.yml
vendored
Normal file
@@ -0,0 +1 @@
|
||||
toc_depth: 3
|
||||
8
third_party/vllm/docs/cli/.nav.yml
vendored
Normal file
@@ -0,0 +1,8 @@
|
||||
nav:
|
||||
- README.md
|
||||
- serve.md
|
||||
- chat.md
|
||||
- complete.md
|
||||
- run-batch.md
|
||||
- vllm bench:
|
||||
- bench/**/*.md
|
||||
188
third_party/vllm/docs/cli/README.md
vendored
Normal file
@@ -0,0 +1,188 @@
|
||||
# vLLM CLI Guide
|
||||
|
||||
The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:
|
||||
|
||||
```bash
|
||||
vllm --help
|
||||
```
|
||||
|
||||
Available Commands:
|
||||
|
||||
```bash
|
||||
vllm {chat,complete,serve,bench,collect-env,run-batch}
|
||||
```
|
||||
|
||||
## serve
|
||||
|
||||
Starts the vLLM OpenAI Compatible API server.
|
||||
|
||||
Start with a model:
|
||||
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-2-7b-hf
|
||||
```
|
||||
|
||||
Specify the port:
|
||||
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-2-7b-hf --port 8100
|
||||
```
|
||||
|
||||
Serve over a Unix domain socket:
|
||||
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock
|
||||
```
|
||||
|
||||
Check with --help for more options:
|
||||
|
||||
```bash
|
||||
# To list all groups
|
||||
vllm serve --help=listgroup
|
||||
|
||||
# To view a argument group
|
||||
vllm serve --help=ModelConfig
|
||||
|
||||
# To view a single argument
|
||||
vllm serve --help=max-num-seqs
|
||||
|
||||
# To search by keyword
|
||||
vllm serve --help=max
|
||||
|
||||
# To view full help with pager (less/more)
|
||||
vllm serve --help=page
|
||||
```
|
||||
|
||||
See [vllm serve](./serve.md) for the full reference of all available arguments.
|
||||
|
||||
## chat
|
||||
|
||||
Generate chat completions via the running API server.
|
||||
|
||||
```bash
|
||||
# Directly connect to localhost API without arguments
|
||||
vllm chat
|
||||
|
||||
# Specify API url
|
||||
vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1
|
||||
|
||||
# Quick chat with a single prompt
|
||||
vllm chat --quick "hi"
|
||||
```
|
||||
|
||||
See [vllm chat](./chat.md) for the full reference of all available arguments.
|
||||
|
||||
## complete
|
||||
|
||||
Generate text completions based on the given prompt via the running API server.
|
||||
|
||||
```bash
|
||||
# Directly connect to localhost API without arguments
|
||||
vllm complete
|
||||
|
||||
# Specify API url
|
||||
vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
|
||||
|
||||
# Quick complete with a single prompt
|
||||
vllm complete --quick "The future of AI is"
|
||||
```
|
||||
|
||||
See [vllm complete](./complete.md) for the full reference of all available arguments.
|
||||
|
||||
## bench
|
||||
|
||||
Run benchmark tests for latency online serving throughput and offline inference throughput.
|
||||
|
||||
To use benchmark commands, please install with extra dependencies using `pip install vllm[bench]`.
|
||||
|
||||
Available Commands:
|
||||
|
||||
```bash
|
||||
vllm bench {latency, serve, throughput}
|
||||
```
|
||||
|
||||
### latency
|
||||
|
||||
Benchmark the latency of a single batch of requests.
|
||||
|
||||
```bash
|
||||
vllm bench latency \
|
||||
--model meta-llama/Llama-3.2-1B-Instruct \
|
||||
--input-len 32 \
|
||||
--output-len 1 \
|
||||
--enforce-eager \
|
||||
--load-format dummy
|
||||
```
|
||||
|
||||
See [vllm bench latency](./bench/latency.md) for the full reference of all available arguments.
|
||||
|
||||
### serve
|
||||
|
||||
Benchmark the online serving throughput.
|
||||
|
||||
```bash
|
||||
vllm bench serve \
|
||||
--model meta-llama/Llama-3.2-1B-Instruct \
|
||||
--host server-host \
|
||||
--port server-port \
|
||||
--random-input-len 32 \
|
||||
--random-output-len 4 \
|
||||
--num-prompts 5
|
||||
```
|
||||
|
||||
See [vllm bench serve](./bench/serve.md) for the full reference of all available arguments.
|
||||
|
||||
### throughput
|
||||
|
||||
Benchmark offline inference throughput.
|
||||
|
||||
```bash
|
||||
vllm bench throughput \
|
||||
--model meta-llama/Llama-3.2-1B-Instruct \
|
||||
--input-len 32 \
|
||||
--output-len 1 \
|
||||
--enforce-eager \
|
||||
--load-format dummy
|
||||
```
|
||||
|
||||
See [vllm bench throughput](./bench/throughput.md) for the full reference of all available arguments.
|
||||
|
||||
## collect-env
|
||||
|
||||
Start collecting environment information.
|
||||
|
||||
```bash
|
||||
vllm collect-env
|
||||
```
|
||||
|
||||
## run-batch
|
||||
|
||||
Run batch prompts and write results to file.
|
||||
|
||||
Running with a local file:
|
||||
|
||||
```bash
|
||||
vllm run-batch \
|
||||
-i offline_inference/openai_batch/openai_example_batch.jsonl \
|
||||
-o results.jsonl \
|
||||
--model meta-llama/Meta-Llama-3-8B-Instruct
|
||||
```
|
||||
|
||||
Using remote file:
|
||||
|
||||
```bash
|
||||
vllm run-batch \
|
||||
-i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
|
||||
-o results.jsonl \
|
||||
--model meta-llama/Meta-Llama-3-8B-Instruct
|
||||
```
|
||||
|
||||
See [vllm run-batch](./run-batch.md) for the full reference of all available arguments.
|
||||
|
||||
## More Help
|
||||
|
||||
For detailed options of any subcommand, use:
|
||||
|
||||
```bash
|
||||
vllm <subcommand> --help
|
||||
```
|
||||
9
third_party/vllm/docs/cli/bench/latency.md
vendored
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench latency
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/generated/argparse/bench_latency.inc.md"
|
||||
55
third_party/vllm/docs/cli/bench/mm_processor.md
vendored
Normal file
@@ -0,0 +1,55 @@
|
||||
# vllm bench mm-processor
|
||||
|
||||
## Overview
|
||||
|
||||
`vllm bench mm-processor` profiles the multimodal input processor pipeline of
|
||||
vision-language models. It measures per-stage latency from the HuggingFace
|
||||
processor through to the encoder forward pass, helping you identify
|
||||
preprocessing bottlenecks and understand how different image resolutions or
|
||||
item counts affect end-to-end request time.
|
||||
|
||||
The benchmark supports two data sources: synthetic random multimodal inputs
|
||||
(`random-mm`) and HuggingFace datasets (`hf`). Warmup requests are run before
|
||||
measurement to ensure stable results.
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
vllm bench mm-processor \
|
||||
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||
--dataset-name random-mm \
|
||||
--num-prompts 50 \
|
||||
--random-input-len 300 \
|
||||
--random-output-len 40 \
|
||||
--random-mm-base-items-per-request 2 \
|
||||
--random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
|
||||
--random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}'
|
||||
```
|
||||
|
||||
## Measured Stages
|
||||
|
||||
| Stage | Description |
|
||||
| ----- | ----------- |
|
||||
| `get_mm_hashes_secs` | Time spent hashing multimodal inputs |
|
||||
| `get_cache_missing_items_secs` | Time spent looking up the processor cache |
|
||||
| `apply_hf_processor_secs` | Time spent in the HuggingFace processor |
|
||||
| `merge_mm_kwargs_secs` | Time spent merging multimodal kwargs |
|
||||
| `apply_prompt_updates_secs` | Time spent updating prompt tokens |
|
||||
| `preprocessor_total_secs` | Total preprocessing time |
|
||||
| `encoder_forward_secs` | Time spent in the encoder model forward pass |
|
||||
| `num_encoder_calls` | Number of encoder invocations per request |
|
||||
|
||||
The benchmark also reports end-to-end latency (TTFT + decode time) per
|
||||
request. Use `--metric-percentiles` to select which percentiles to report
|
||||
(default: p99) and `--output-json` to save results.
|
||||
|
||||
For more examples (HF datasets, warmup, JSON output), see
|
||||
[Benchmarking CLI — Multimodal Processor Benchmark](../../benchmarking/cli.md#multimodal-processor-benchmark).
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/generated/argparse/bench_mm_processor.inc.md"
|
||||
9
third_party/vllm/docs/cli/bench/serve.md
vendored
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench serve
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/generated/argparse/bench_serve.inc.md"
|
||||
9
third_party/vllm/docs/cli/bench/sweep/plot.md
vendored
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench sweep plot
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/generated/argparse/bench_sweep_plot.inc.md"
|
||||
9
third_party/vllm/docs/cli/bench/sweep/plot_pareto.md
vendored
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench sweep plot_pareto
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/generated/argparse/bench_sweep_plot_pareto.inc.md"
|
||||
9
third_party/vllm/docs/cli/bench/sweep/serve.md
vendored
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench sweep serve
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/generated/argparse/bench_sweep_serve.inc.md"
|
||||
9
third_party/vllm/docs/cli/bench/sweep/serve_workload.md
vendored
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench sweep serve_workload
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/generated/argparse/bench_sweep_serve_workload.inc.md"
|
||||