Gahow Wang 445e491123 Add vLLM v0.18.1 source tree with KV transfer abort fix

third_party/vllm/ now tracked in git for direct patch management.
Based on vLLM v0.18.1 release with one patch applied:

  vllm/v1/core/sched/scheduler.py:
    Replace fatal assert with graceful skip when KV transfer callback
    arrives for an already-aborted request during PD disaggregated serving.

Future vLLM modifications should be made directly in third_party/vllm/
and committed normally. The patches/ directory is kept as documentation
of what changed from upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 00:30:38 +08:00


Optimization Levels

Overview

vLLM provides four optimization levels (-O0, -O1, -O2, -O3) that let users trade startup time for performance:

-O0: No optimization. Fastest startup time, but lowest performance.
-O1: Fast optimization. Simple compilation, fast fusions, and PIECEWISE cudagraphs.
-O2: Full optimization (default). Additional compilation ranges, additional fusions, and FULL_AND_PIECEWISE cudagraphs.
-O3: Aggressive optimization. Currently equal to -O2, but may include additional time-consuming or experimental optimizations in the future.

All optimization level defaults can be achieved by manually setting the underlying flags. User-set flags take precedence over optimization level defaults.
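The precedence rule can be illustrated with a minimal sketch. This is not vLLM's implementation: `resolve_flags` and the use of `None` to mean "not set by the user" are hypothetical and chosen only for illustration. The two default values are the -O2 settings listed on this page.

```python
def resolve_flags(level_defaults, user_flags):
    """Overlay explicitly set user flags on top of a level's defaults.

    Hypothetical helper: None stands for "the user did not set this flag".
    """
    resolved = dict(level_defaults)  # copy so the defaults stay untouched
    for name, value in user_flags.items():
        if value is not None:
            resolved[name] = value  # a user-set flag wins over the default
    return resolved


# Two -O2 defaults taken from this page.
o2_defaults = {
    "cudagraph_mode": "FULL_AND_PIECEWISE",
    "fuse_allreduce_rms": True,
}

# The user overrides only the cudagraph mode.
user_flags = {"cudagraph_mode": "PIECEWISE", "fuse_allreduce_rms": None}

resolved = resolve_flags(o2_defaults, user_flags)
print(resolved)
```

In this sketch the user's `cudagraph_mode` replaces the level default, while `fuse_allreduce_rms` keeps its -O2 value because the user left it unset.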

Level Summaries and Usage Examples

# CLI usage
python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O1

# Python API usage
from vllm.entrypoints.llm import LLM

llm = LLM(
    model="RedHatAI/Llama-3.2-1B-FP8",
    optimization_level=2,  # equivalent to -O2
)

`-O0`: No Optimization

Start up as fast as possible: no autotuning, no compilation, and no cudagraphs. This level suits the initial phases of development and debugging.

Settings:

-cc.cudagraph_mode=NONE
-cc.mode=NONE (also resulting in -cc.custom_ops=["none"])
-cc.pass_config.fuse_...=False (all fusions disabled)
--kernel-config.enable_flashinfer_autotune=False

`-O1`: Fast Optimization

Prioritize fast startup, but still enable basic optimizations like compilation and cudagraphs. This level is a good balance for most development scenarios, where you want faster startup but still want to make sure your code does not break cudagraphs or compilation.

Settings:

-cc.cudagraph_mode=PIECEWISE
-cc.mode=VLLM_COMPILE
--kernel-config.enable_flashinfer_autotune=True

Fusions:

-cc.pass_config.fuse_norm_quant=True*
-cc.pass_config.fuse_act_quant=True*
-cc.pass_config.fuse_act_padding=True†
-cc.pass_config.fuse_rope_kvcache=True† (will be moved to O2)

* These fusions are only enabled when either op is using a custom kernel, otherwise Inductor fusion is better.
† These fusions are ROCm-only and require AITER.
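The two footnotes amount to conditions on when each -O1 fusion is actually turned on. The sketch below is one reading of them, not vLLM's code: `o1_fusion_defaults` and all of its parameters are hypothetical names, and it assumes that "either op" refers to the two ops named in each fusion flag.

```python
def o1_fusion_defaults(norm_custom, act_custom, quant_custom, is_rocm, has_aiter):
    """Return which -O1 fusions would be enabled under the footnoted conditions.

    Hypothetical helper; every parameter is a bool describing the platform
    or whether the given op uses a custom kernel.
    """
    rocm_aiter = is_rocm and has_aiter  # footnote †: ROCm-only, requires AITER
    return {
        # footnote *: enabled only when either op uses a custom kernel
        "fuse_norm_quant": norm_custom or quant_custom,
        "fuse_act_quant": act_custom or quant_custom,
        "fuse_act_padding": rocm_aiter,
        "fuse_rope_kvcache": rocm_aiter,
    }


# Custom quant kernel on a non-ROCm platform.
cuda_like = o1_fusion_defaults(
    norm_custom=False, act_custom=False, quant_custom=True,
    is_rocm=False, has_aiter=False,
)
print(cuda_like)
```

With no custom kernels at all, both starred fusions stay off in this sketch, which matches the footnote's point that Inductor fusion is the better choice in that case.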

`-O2`: Full Optimization (Default)

Prioritize performance at the expense of additional startup time. This level is recommended for production workloads and is therefore the default. Fusions at this level may take longer to compile because of the additional compile ranges.

Settings (on top of -O1):

-cc.cudagraph_mode=FULL_AND_PIECEWISE
-cc.pass_config.fuse_allreduce_rms=True

`-O3`: Aggressive Optimization

This level is currently the same as -O2, but may include additional optimizations in the future that are more time-consuming or experimental.
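The way the levels build on one another can be written down as plain data. The sketch below is an illustration only: the `LEVELS` table and `explicit_flags` helper are hypothetical, the flag names and values are copied from the settings lists above, and the conditional -O1 fusions and the elided `fuse_...` flags of -O0 are deliberately left out.

```python
# Unconditional settings per level, keyed by the flag names used on this page.
O0 = {
    "-cc.cudagraph_mode": "NONE",
    "-cc.mode": "NONE",
    "--kernel-config.enable_flashinfer_autotune": "False",
}
O1 = {
    "-cc.cudagraph_mode": "PIECEWISE",
    "-cc.mode": "VLLM_COMPILE",
    "--kernel-config.enable_flashinfer_autotune": "True",
}
# -O2 is defined on top of -O1: one setting changes, one is added.
O2 = {
    **O1,
    "-cc.cudagraph_mode": "FULL_AND_PIECEWISE",
    "-cc.pass_config.fuse_allreduce_rms": "True",
}
# -O3 is currently the same as -O2.
O3 = dict(O2)

LEVELS = {0: O0, 1: O1, 2: O2, 3: O3}


def explicit_flags(level):
    """List a level's settings as flag=value strings (hypothetical helper)."""
    return [f"{flag}={value}" for flag, value in LEVELS[level].items()]


for flag in explicit_flags(2):
    print(flag)
```

Keeping -O3 as a copy of -O2 mirrors the current state described above; if -O3 gains its own optimizations, only that one entry in the table would change.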

Troubleshooting

Common Issues

Startup Time Too Long: Use -O0 or -O1 for faster startup.
Compilation Errors: Set debug_dump_path to get additional debugging information.
Performance Issues: Make sure you are using -O2 for production.
