chore: vendor sglang v0.5.10 snapshot

2026-04-24 12:29:36 +00:00
parent 78f0d15221
commit bded08301f
4308 changed files with 1200894 additions and 2 deletions

.gitignore

@@ -12,6 +12,7 @@ src/*.egg-info
 .deps/
 outputs/
-# Local heavyweight checkouts and generated experiment artifacts
-third_party/
+# Vendored dependencies. Track only the maintained SGLang fork/snapshot.
+third_party/*
+!third_party/sglang/
 *.log

@@ -0,0 +1,607 @@
---
name: add-jit-kernel
description: Step-by-step tutorial for adding a new lightweight JIT CUDA kernel to sglang's jit_kernel module
---
# Tutorial: Adding a New JIT Kernel to SGLang
This tutorial walks through adding a simple element-wise scale operation as a JIT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
## Goal
Add a new operation that scales each element of a tensor by a scalar factor:
- Input: tensor `x` (CUDA) and scalar `factor` (float, passed at runtime)
- Output: `x * factor` (element-wise), written to a newly allocated tensor unless a pre-allocated `out` is passed
- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
## When to use JIT vs AOT (`sgl-kernel`)
- **JIT (`jit_kernel`)**: prefer this first for kernels that do **not** depend on CUTLASS or another large C++ project. It is the default choice for lightweight kernels that benefit from rapid iteration and first-use compilation.
- **AOT (`sgl-kernel`)**: prefer this when the kernel **does** depend on CUTLASS or another large C++ project, or when it should live in `sgl-kernel/` and participate in the wheel build / torch op registration flow.
- **Exception**: kernels that depend on `flashinfer`, or on CUTLASS that is already provided through `flashinfer`, can still be implemented as `jit_kernel`.
---
## Common Abstractions in `python/sglang/jit_kernel/include/sgl_kernel/`
**Always prefer these abstractions over raw CUDA primitives.** They provide safety, readability, and consistency with the rest of the codebase.
**Important include rule:** for every `#include <sgl_kernel/...>` line, add a short trailing comment explaining why that header is included (for example `// For TensorMatcher, SymbolicSize, SymbolicDevice`). This matches the current JIT kernel style and keeps include usage self-documenting.
### `utils.h` — Host-side utilities
```cpp
#include <sgl_kernel/utils.h>
```
- **`host::RuntimeCheck(cond, args...)`** — Assert a condition at runtime; throws `PanicError` with file/line info on failure. Prefer this over bare `assert`.
- **`host::Panic(args...)`** — Unconditionally throw a `PanicError` with a descriptive message.
- **`host::div_ceil(a, b)`** — Integer ceiling division `(a + b - 1) / b`.
- **`host::irange(n)`** / **`host::irange(start, end)`** — Range views for cleaner loops.
- **`host::pointer::offset(ptr, offsets...)`** — Byte-safe pointer arithmetic on `void*`. Use this instead of raw casts.
### `utils.cuh` — Device-side utilities + `LaunchKernel`
```cpp
#include <sgl_kernel/utils.cuh>
```
- **Type aliases**: `fp16_t`, `bf16_t`, `fp32_t`, `fp8_e4m3_t`, `fp8_e5m2_t` and their packed variants `fp16x2_t`, `bf16x2_t`, `fp32x2_t`, etc.
- **`SGL_DEVICE`** — Expands to `__forceinline__ __device__`. Use on all device functions.
- **`device::kWarpThreads`** — Constant `32`.
- **`device::load_as<T>(ptr, offset)`** / **`device::store_as<T>(ptr, val, offset)`** — Type-safe loads/stores from `void*`.
- **`device::pointer::offset(ptr, offsets...)`** — Pointer arithmetic on device.
- **`host::LaunchKernel(grid, block, device_or_stream [, smem])`** — RAII kernel launcher that:
- Resolves the CUDA stream from a `DLDevice` via TVM-FFI automatically.
- Checks the CUDA error with file/line info after launch via `operator()(kernel, args...)`.
- Supports `.enable_pdl(bool)` for PDL (Programmatic Dependent Launch, SM90+).
- **`host::RuntimeDeviceCheck(cudaError_t)`** — Check a CUDA error; throw on failure.
### `tensor.h` — Tensor validation (`TensorMatcher`, Symbolic types)
```cpp
#include <sgl_kernel/tensor.h>
```
This is the **primary validation API** for all kernel launchers. Use it to validate every `tvm::ffi::TensorView` argument.
- **`host::SymbolicSize{"name"}`** — A named symbolic dimension. Call `.set_value(n)` to pin it, `.unwrap()` to extract after verification.
- **`host::SymbolicDType`** — Symbolic dtype. Use `.set_options<Ts...>()` to restrict allowed types.
- **`host::SymbolicDevice`** — Symbolic device. Use `.set_options<kDLCUDA>()` to restrict to CUDA.
- **`host::TensorMatcher({dims...})`** — Fluent builder for tensor validation:
- `.with_dtype<T>()` — require a specific C++ type (e.g. `fp16_t`)
- `.with_dtype<T1, T2, ...>()` — allow a set of types
- `.with_device<kDLCUDA>(device_sym)` — require CUDA and bind the checked device to a `SymbolicDevice`
- `.with_strides({strides...})` — validate strides (omit to require contiguous)
- `.verify(tensor_view)` — execute the check; throws `PanicError` with full context on failure; **chainable** (`verify(a).verify(b)` to check multiple tensors with the same shape)
**Typical pattern:**
```cpp
auto N = SymbolicSize{"num_elements"};
auto device = SymbolicDevice{};
device.set_options<kDLCUDA>();
TensorMatcher({N})  //
    .with_dtype<fp16_t>()
    .with_device<kDLCUDA>(device)
    .verify(dst)
    .verify(src);  // same shape, dtype, device as dst
const size_t n = N.unwrap();
const DLDevice dev = device.unwrap();
```
### `type.cuh` — `dtype_trait<T>` and `packed_t<T>`
```cpp
#include <sgl_kernel/type.cuh>
```
- **`dtype_trait<T>`** — Static trait struct for each scalar type. Provides:
- `dtype_trait<T>::from(value)`: convert from another type (e.g. `fp32_t` → `fp16_t`)
- `dtype_trait<T>::abs/sqrt/rsqrt/exp/sin/cos(x)` — type-dispatched unary math (primarily for `fp32_t`)
- `dtype_trait<T>::max/min(x, y)` — type-dispatched binary math (primarily for `fp32_t`)
- **`packed_t<T>`** — Two-element packed alias: `packed_t<fp16_t>` = `fp16x2_t`, `packed_t<bf16_t>` = `bf16x2_t`, `packed_t<fp32_t>` = `fp32x2_t`. Use for vectorized loads/stores.
- **`device::cast<To, From>(value)`** — Type-safe cast using `dtype_trait`, e.g. `cast<fp32x2_t, fp16x2_t>(v)`.
### `vec.cuh` — Vectorized memory access (`AlignedVector`)
```cpp
#include <sgl_kernel/vec.cuh>
```
- **`device::AlignedVector<T, N>`** — Aligned storage for N elements of type T. N must be a power of two, `sizeof(T)*N <= 32`. Enables vectorized loads/stores for bandwidth efficiency. In terms of API/codegen constraints, the upper bound is 256-bit; in practice, 128-bit is the portable default, while 256-bit vectorization is typically only viable on `SM100+` and should be gated by an architecture check when needed.
- `.load(ptr, offset)` — vectorized load from `ptr[offset]`
- `.store(ptr, offset)` — vectorized store to `ptr[offset]`
- `.fill(value)` — fill all lanes
- `operator[](i)` — element access
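The width constraints above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of SGLang: `vec_width` is a hypothetical helper that derives the number of elements per vector from the element size and enforces the two constraints quoted above (power of two, at most 32 bytes).

```python
def vec_width(itemsize: int, vector_bytes: int = 16) -> int:
    """Elements per vector for an element of `itemsize` bytes.

    Hypothetical helper: 16 bytes models a 128-bit access, 32 bytes a
    256-bit access. The checks mirror the AlignedVector constraints
    quoted in the text, not the actual header.
    """
    if itemsize <= 0 or vector_bytes % itemsize != 0:
        raise ValueError("vector size must be a multiple of the element size")
    n = vector_bytes // itemsize
    if n & (n - 1) != 0:
        raise ValueError("N must be a power of two")
    if itemsize * n > 32:
        raise ValueError("sizeof(T) * N must not exceed 32 bytes")
    return n


if __name__ == "__main__":
    for name, itemsize in (("fp16_t", 2), ("bf16_t", 2), ("fp32_t", 4)):
        print(name, vec_width(itemsize))
```

With the default 128-bit width this yields 8 elements for the 2-byte types and 4 for `fp32_t`; passing `vector_bytes=32` models the 256-bit case.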
### `tile.cuh` — `tile::Memory` (strided memory access pattern)
```cpp
#include <sgl_kernel/tile.cuh>
```
- `tile::Memory<T>` is fundamentally a **1D cooperative accessor** over a contiguous region.
- **`device::tile::Memory<T>::cta(blockDim.x)`** — Creates a tile accessor where each thread handles `tid = threadIdx.x` with stride `tsize` (for `cta(blockDim.x)`, this is `blockDim.x`). Common for loops over a 1D array.
- **`.load(ptr, offset)`** — loads `ptr[tid + offset * tsize]`
- **`.store(ptr, val, offset)`** — stores to `ptr[tid + offset * tsize]`
- **`.in_bound(n, offset)`** — boundary check
For a **2D tile**, either flatten `(row, col)` into a linear tile index first, or compute the address manually with `ptr[row * stride + col]` using your thread/block coordinates.
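To make the 1D index mapping concrete, here is a small Python model of the documented formula `tid + offset * tsize`. It is a sketch only: `tile_indices` is a hypothetical name, and it assumes that `.in_bound(n, offset)` is equivalent to checking `tid + offset * tsize < n`.

```python
def tile_indices(tid: int, tsize: int, n: int) -> list[int]:
    """Indices visited by thread `tid` while looping offset = 0, 1, 2, ...

    Assumption: the bounds check is `tid + offset * tsize < n`.
    """
    indices = []
    offset = 0
    while tid + offset * tsize < n:
        indices.append(tid + offset * tsize)
        offset += 1
    return indices


if __name__ == "__main__":
    tsize, n = 4, 10  # 4 cooperating threads, 10 elements
    visited = sorted(i for tid in range(tsize) for i in tile_indices(tid, tsize, n))
    print(visited)  # every index appears exactly once
```

Because consecutive threads touch consecutive addresses at each step, the accesses of a block stay coalesced.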
### `math.cuh` — Device math (`device::math::`)
```cpp
#include <sgl_kernel/math.cuh>
```
- `device::math::max/min<T>(a, b)` — type-dispatched binary math via `dtype_trait`
- `device::math::abs/sqrt/rsqrt/exp/sin/cos<T>(x)` — type-dispatched unary math via `dtype_trait`
### `warp.cuh` — Warp-level primitives
```cpp
#include <sgl_kernel/warp.cuh>
```
- `device::warp::reduce_sum<T>(value)` — warp-level sum reduction via `__shfl_xor_sync`
- `device::warp::reduce_max<T>(value)` — warp-level max reduction
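The XOR-shuffle reduction can be simulated on the host to see why every lane ends up with the full result. This Python sketch is illustrative only; it assumes the usual butterfly schedule with masks 16, 8, 4, 2, 1, which may differ from the order used in `warp.cuh`.

```python
WARP_THREADS = 32  # same value as device::kWarpThreads


def warp_reduce_sum(lanes: list[float]) -> list[float]:
    """Simulate an XOR-butterfly sum over one warp.

    At each step, lane i adds the value held by lane (i ^ mask), which is
    the exchange that __shfl_xor_sync performs on the GPU.
    """
    if len(lanes) != WARP_THREADS:
        raise ValueError("expected exactly one value per lane")
    values = list(lanes)
    mask = WARP_THREADS // 2
    while mask > 0:
        values = [values[i] + values[i ^ mask] for i in range(WARP_THREADS)]
        mask //= 2
    return values


if __name__ == "__main__":
    result = warp_reduce_sum([float(i) for i in range(WARP_THREADS)])
    print(result[0], result[31])
```

After five exchange steps all 32 lanes hold the same total, so no broadcast is needed.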
### `cta.cuh` — CTA-level primitives
```cpp
#include <sgl_kernel/cta.cuh>
```
- `device::cta::reduce_max<T>(value, smem, min_value)` — CTA-wide max using shared memory + warp reduction. Caller is responsible for a `__syncthreads()` after if the result in `smem[0]` is needed.
### `atomic.cuh` — Atomic operations
```cpp
#include <sgl_kernel/atomic.cuh>
```
- `device::atomic::max(float* addr, float value)` — float atomic max (handles negative values correctly via bit tricks).
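The source only says "bit tricks" and does not spell them out. One well-known approach, sketched below in Python, maps each float32 bit pattern to an unsigned key whose integer order matches the float order. This is an illustration of the general idea under that assumption, not a description of what `atomic.cuh` actually does.

```python
import struct


def float_to_ordered_u32(x: float) -> int:
    """Map a float32 to an unsigned key that sorts like the float itself."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    if bits & 0x80000000:
        return ~bits & 0xFFFFFFFF  # negative: flip all bits to reverse the order
    return bits | 0x80000000  # non-negative: set the sign bit


if __name__ == "__main__":
    values = [-3.5, -0.25, 0.0, 1.0, 2.5]
    keys = [float_to_ordered_u32(v) for v in values]
    print(keys == sorted(keys))
```

The point of the mapping is that a plain integer max on the raw bits of negative floats would pick the wrong value, since their bit patterns grow as the value decreases.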
### `runtime.cuh` — Occupancy and device info
```cpp
#include <sgl_kernel/runtime.cuh>
```
- `host::runtime::get_blocks_per_sm(kernel, block_dim)` — max active blocks per SM (occupancy)
- `host::runtime::get_sm_count(device_id)` — number of SMs on the device
- `host::runtime::get_cc_major(device_id)` — compute capability major version
**Persistent kernel pattern** (cap blocks to SM count × occupancy):
```cpp
static const uint32_t max_occ = runtime::get_blocks_per_sm(kernel, kBlockSize);
static const uint32_t num_sm = runtime::get_sm_count(device.unwrap().device_id);
const auto num_blocks = std::min(num_sm * max_occ, div_ceil(n, kBlockSize));
LaunchKernel(num_blocks, kBlockSize, device.unwrap())(kernel, params);
```
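The grid-size arithmetic in the pattern above can be sketched in plain Python. The SM count and occupancy used here are made-up example values, not measurements of any particular GPU.

```python
def div_ceil(a: int, b: int) -> int:
    """Integer ceiling division, same formula as host::div_ceil."""
    return (a + b - 1) // b


def persistent_grid(n: int, block_size: int, num_sm: int, max_occ: int) -> int:
    """Blocks to launch: enough to cover n, capped at num_sm * max_occ."""
    return min(num_sm * max_occ, div_ceil(n, block_size))


if __name__ == "__main__":
    # Hypothetical device: 132 SMs, 8 resident blocks per SM.
    print(persistent_grid(1_000_000, 256, 132, 8))  # large input: capped
    print(persistent_grid(1_000, 256, 132, 8))      # small input: not capped
```

When the cap applies, each block must loop over several chunks of work, which is why persistent kernels pair this sizing with a grid-stride loop.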
---
## Step 0 (optional): Generate a `.clangd` config for better IDE support
```bash
python -m sglang.jit_kernel
```
---
## Step 1: Implement the CUDA kernel in `jit_kernel/csrc/`
Create `python/sglang/jit_kernel/csrc/elementwise/scale.cuh`.
The implementation fully uses the project abstractions described above:
```cpp
#include <sgl_kernel/tensor.h> // For TensorMatcher, SymbolicSize, SymbolicDevice
#include <sgl_kernel/type.cuh> // For dtype_trait, fp16_t, bf16_t, fp32_t
#include <sgl_kernel/utils.h> // For RuntimeCheck, div_ceil
#include <sgl_kernel/utils.cuh> // For LaunchKernel, SGL_DEVICE
#include <sgl_kernel/vec.cuh> // For AlignedVector
#include <dlpack/dlpack.h>
#include <tvm/ffi/container/tensor.h>
namespace {

// ----------------------------------------------------------------
// Kernel: element-wise scale using vectorized 128-bit loads/stores
// T = fp16_t | bf16_t | fp32_t
// kVecN = number of elements per vector load (e.g. 8 for fp16)
// factor = runtime scale factor
// ----------------------------------------------------------------
template <typename T, int kVecN>
__global__ void scale_kernel(T* __restrict__ dst,
                             const T* __restrict__ src,
                             float factor,
                             uint32_t n_total) {
  using vec_t = device::AlignedVector<T, kVecN>;
  const uint32_t n_vecs = n_total / kVecN;

  // --- vectorised body ---
  const uint32_t vec_stride = blockDim.x * gridDim.x;
  for (uint32_t vi = blockIdx.x * blockDim.x + threadIdx.x;
       vi < n_vecs;
       vi += vec_stride) {
    vec_t v;
    v.load(src, vi);
#pragma unroll
    for (int i = 0; i < kVecN; ++i) {
      v[i] = static_cast<T>(static_cast<float>(v[i]) * factor);
    }
    v.store(dst, vi);
  }

  // --- scalar tail ---
  const uint32_t base = n_vecs * kVecN;
  const uint32_t scalar_stride = blockDim.x * gridDim.x;
  for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
       base + i < n_total;
       i += scalar_stride) {
    dst[base + i] = static_cast<T>(static_cast<float>(src[base + i]) * factor);
  }
}

// ----------------------------------------------------------------
// Launcher: validates tensors, selects vector width, launches kernel
// ----------------------------------------------------------------
template <typename T>
void scale(tvm::ffi::TensorView dst, tvm::ffi::TensorView src, float factor) {
  using namespace host;

  // 1. Validate input tensors with TensorMatcher
  SymbolicSize N = {"num_elements"};
  SymbolicDevice device_;
  device_.set_options<kDLCUDA>();
  TensorMatcher({N})  //
      .with_dtype<T>()
      .with_device<kDLCUDA>(device_)
      .verify(dst)
      .verify(src);  // same shape / dtype / device as dst
  const uint32_t n = static_cast<uint32_t>(N.unwrap());
  const DLDevice device = device_.unwrap();
  RuntimeCheck(n > 0, "scale: num_elements must be > 0, got ", n);

  // 2. Choose vector width for 128-bit loads (16 bytes)
  //    fp16/bf16: 8 elements × 2 bytes = 16 bytes
  //    fp32:      4 elements × 4 bytes = 16 bytes
  constexpr int kVecN = 16 / sizeof(T);
  const uint32_t n_work_items = div_ceil(n, static_cast<uint32_t>(kVecN));

  // 3. Launch
  constexpr uint32_t kBlockSize = 256;
  const uint32_t grid = div_ceil(n_work_items, kBlockSize);
  LaunchKernel(grid, kBlockSize, device)(
      scale_kernel<T, kVecN>,
      static_cast<T*>(dst.data_ptr()),
      static_cast<const T*>(src.data_ptr()),
      factor,
      n);
}

}  // namespace
```
**Key points:**
- Include headers from `sgl_kernel/`, **not** raw CUDA headers, for anything already covered
- Add a short trailing `// For ...` explanation to every `#include <sgl_kernel/...>` line
- Use `TensorMatcher` for all tensor validation; never manually check shape/dtype/device
- Use `AlignedVector` for vectorised 128-bit loads/stores — significant bandwidth win
- Use `LaunchKernel` — it resolves the stream and checks errors automatically
- Use `RuntimeCheck` for runtime assertions with useful error messages
- Prefer passing runtime scalars like `factor` directly unless compile-time specialisation is genuinely required
- `fp16_t` / `bf16_t` / `fp32_t` are the project's type aliases (from `utils.cuh`)
- `device::cast<To, From>` or `dtype_trait<T>::from(val)` for cross-type conversions
- `device::math::` functions for device math instead of bare `__` intrinsics
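The split into a vectorised body and a scalar tail is easy to get wrong, so it helps to check the indexing on the host. The Python model below replays the kernel's loops and counts how often each element is written. It is a sketch that assumes `AlignedVector::load(ptr, vi)` covers elements `vi * kVecN` through `vi * kVecN + kVecN - 1`; the function names are hypothetical.

```python
def div_ceil(a: int, b: int) -> int:
    return (a + b - 1) // b


def launch_grid(n: int, vec_n: int, block: int) -> int:
    """Grid size as computed by the launcher: one work item per vector."""
    return div_ceil(div_ceil(n, vec_n), block)


def write_counts(n_total: int, vec_n: int, grid: int, block: int) -> list[int]:
    """Replay the kernel loops and count writes per element."""
    writes = [0] * n_total
    n_vecs = n_total // vec_n
    base = n_vecs * vec_n
    stride = grid * block
    for gtid in range(grid * block):  # blockIdx.x * blockDim.x + threadIdx.x
        vi = gtid
        while vi < n_vecs:  # vectorised body
            for i in range(vec_n):
                writes[vi * vec_n + i] += 1
            vi += stride
        i = gtid
        while base + i < n_total:  # scalar tail
            writes[base + i] += 1
            i += stride
    return writes


if __name__ == "__main__":
    n, vec_n, block = 4097, 8, 256
    counts = write_counts(n, vec_n, launch_grid(n, vec_n, block), block)
    print(all(c == 1 for c in counts))
```

Sizes such as 1, 127 and 4097 exercise the tail path, which is why they are worth including in the test grid.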
---
## Step 2: Add the Python wrapper in `jit_kernel/`
Create `python/sglang/jit_kernel/scale.py`:
```python
from __future__ import annotations

from typing import TYPE_CHECKING

import torch

from sglang.jit_kernel.utils import cache_once, load_jit, make_cpp_args

if TYPE_CHECKING:
    from tvm_ffi.module import Module


@cache_once
def _jit_scale_module(dtype: torch.dtype) -> Module:
    """Compile and cache the JIT scale module for a given dtype."""
    args = make_cpp_args(dtype)
    return load_jit(
        "scale",
        *args,
        cuda_files=["elementwise/scale.cuh"],
        cuda_wrappers=[("scale", f"scale<{args}>")],
    )


def scale(src: torch.Tensor, factor: float, out: torch.Tensor | None = None) -> torch.Tensor:
    """
    Element-wise scale: dst = src * factor.

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    src : CUDA tensor (FP16 / BF16 / FP32)
    factor : scale factor
    out : optional pre-allocated output tensor (same shape/dtype as src)

    Returns
    -------
    Scaled tensor (dst = src * factor).
    """
    # Keep the Python wrapper thin, but still enforce the basic preconditions
    # that the current JIT/FFI path does not reject safely on its own.
    if not src.is_cuda:
        raise RuntimeError("src must be a CUDA tensor")
    if src.dtype not in (torch.float16, torch.bfloat16, torch.float32):
        raise RuntimeError(
            f"Unsupported dtype {src.dtype}. Supported: float16, bfloat16, float32"
        )
    if out is None:
        out = torch.empty_like(src)
    else:
        if out.shape != src.shape:
            raise RuntimeError("out shape must match src")
        if out.dtype != src.dtype:
            raise RuntimeError("out dtype must match src")
        if out.device != src.device:
            raise RuntimeError("out device must match src")

    module = _jit_scale_module(src.dtype)
    module.scale(out, src, factor)
    return out
```
**Key points:**
- Use `cache_once`, **not** `functools.lru_cache` (incompatible with `torch.compile`)
- `load_jit` first arg(s) form the unique build marker; same marker = same cached binary
- Only include compile-time specialisation knobs in the build marker; runtime values like `factor` should stay runtime unless the kernel truly needs templating
- `cuda_wrappers`: list of `(export_name, kernel_symbol)` pairs; `export_name` is the name called from Python
- `make_cpp_args(dtype, ...)` converts `torch.dtype` to the corresponding C++ type alias:

| `torch.dtype`    | C++ type |
|------------------|----------|
| `torch.float16`  | `fp16_t` |
| `torch.bfloat16` | `bf16_t` |
| `torch.float32`  | `fp32_t` |

- Keep Python launchers thin, but still validate the basic invariants (`is_cuda`, supported dtype, `out` metadata). In the current JIT/FFI path, invalid tensors are not always rejected safely before launch.
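As a rough illustration of how the dtype mapping in the table feeds the `kernel_symbol` string, the sketch below builds the template instantiation name from a dtype name. `CPP_ALIAS` and `wrapper_symbol` are hypothetical stand-ins written for this example; real code should call `make_cpp_args`, whose return type and formatting are not reproduced here.

```python
# Hypothetical stand-in for the dtype mapping shown in the table.
CPP_ALIAS = {
    "float16": "fp16_t",
    "bfloat16": "bf16_t",
    "float32": "fp32_t",
}


def wrapper_symbol(kernel: str, dtype_name: str) -> str:
    """Return the C++ symbol to export, e.g. 'scale<fp16_t>'."""
    try:
        alias = CPP_ALIAS[dtype_name]
    except KeyError:
        raise ValueError(f"Unsupported dtype {dtype_name}") from None
    return f"{kernel}<{alias}>"


if __name__ == "__main__":
    for name in CPP_ALIAS:
        print(name, "->", wrapper_symbol("scale", name))
```

Each distinct symbol corresponds to a separate template instantiation, which is why only compile-time knobs belong in the build marker.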
---
## Step 3 (optional): Tune JIT build flags
```python
# Inside _jit_scale_module(...):
return load_jit(
    "scale",
    *args,
    cuda_files=["elementwise/scale.cuh"],
    cuda_wrappers=[("scale", f"scale<{args}>")],
    extra_cuda_cflags=["-O3", "--use_fast_math"],
)
```
If your kernel requires SM90+, raise a clear Python error before calling `load_jit`:
```python
import torch

# get_device_capability() returns (major, minor); SM90 means major >= 9.
if torch.cuda.get_device_capability()[0] < 9:
    raise RuntimeError("This kernel requires SM90 (Hopper) or later")
```
---
## Step 4: Write tests (required)
JIT kernel tests live under `python/sglang/jit_kernel/tests/`. **CI does not run `pytest` in that directory directly.** The unified runner `test/run_suite.py` discovers every `test_*.py` there (and every `bench_*.py` under `benchmark/`), collects `register_*_ci(...)` calls by **statically parsing each file's AST**, and executes the selected suite. Every test file must register at least one CUDA entry or the collector fails its sanity check.
- **PR / per-commit CUDA suites** (see `PER_COMMIT_SUITES` in `test/run_suite.py`): JIT unit tests use `stage-b-kernel-unit-1-gpu-large` (see `.github/workflows/pr-test-jit-kernel.yml`: `python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-1-gpu-large`).
- **Nightly kernel suite**: `nightly-kernel-1-gpu` with `--nightly`, typically used with `SGLANG_JIT_KERNEL_RUN_FULL_TESTS=1` in CI for expanded parameter grids (see `should_run_full_tests` / `get_ci_test_range` in `python/sglang/jit_kernel/utils.py`). Wired in `.github/workflows/nightly-test-nvidia.yml` (e.g. `python3 run_suite.py --hw cuda --suite nightly-kernel-1-gpu --nightly --continue-on-error`).
Registration pattern (module level, **literal** `est_time` and `suite` strings — required for AST parsing):
```python
from sglang.test.ci.ci_register import register_cuda_ci
register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large")
# Optional second registration: same file also listed under the nightly kernel suite
# register_cuda_ci(est_time=120, suite="nightly-kernel-1-gpu", nightly=True)
```
Keep `est_time` and `suite` as literal values. `run_suite.py` collects them from the file AST, so computed values and helper wrappers can break CI discovery.
Use `register_cuda_ci(..., disabled="reason")` if the file must stay in-tree but should be skipped in CI (e.g. multi-GPU only).
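To see why literal arguments matter, consider how a static collector might read registrations without importing the file. The following Python sketch uses the standard `ast` module; `collect_registrations` is a hypothetical function written for illustration and is not the implementation in `run_suite.py` or `ci_register.py`.

```python
import ast


def collect_registrations(source: str) -> list[dict]:
    """Collect keyword arguments of module-level register_cuda_ci(...) calls."""
    found = []
    for node in ast.parse(source).body:  # module-level statements only
        if not (isinstance(node, ast.Expr) and isinstance(node.value, ast.Call)):
            continue
        call = node.value
        if not (isinstance(call.func, ast.Name) and call.func.id == "register_cuda_ci"):
            continue
        entry = {}
        for kw in call.keywords:
            if kw.arg is None:
                raise ValueError("starred keyword arguments cannot be read statically")
            try:
                # Only literals can be evaluated without running the file.
                entry[kw.arg] = ast.literal_eval(kw.value)
            except ValueError:
                raise ValueError(f"{kw.arg} must be a literal") from None
        found.append(entry)
    return found


if __name__ == "__main__":
    ok = 'register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large")\n'
    print(collect_registrations(ok))
```

A call such as `register_cuda_ci(est_time=T, suite="...")` with `T` defined elsewhere fails in this sketch, because the value of a name is only known at run time.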
**Run like CI** (from repo root):
```bash
cd test && python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-1-gpu-large
```
For fast iteration you can still run `pytest` on a single file locally; CI coverage is via `run_suite.py`.
Create `python/sglang/jit_kernel/tests/test_scale.py`:
```python
import pytest
import torch

from sglang.jit_kernel.scale import scale
from sglang.test.ci.ci_register import register_cuda_ci

register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large")


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [1, 127, 128, 1024, 4097])  # cover tail remainder
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0, 3.0])
def test_scale_correctness(dtype, size, factor):
    src = torch.randn(size, dtype=dtype, device="cuda")
    out = scale(src, factor)
    expected = src * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
def test_scale_out_param(dtype):
    src = torch.randn(1024, dtype=dtype, device="cuda")
    out = torch.empty_like(src)
    result = scale(src, 2.0, out=out)
    assert result is out
    torch.testing.assert_close(out, src * 2.0, rtol=1e-2, atol=1e-2)


def test_scale_cpu_error():
    src = torch.randn(128, dtype=torch.float16)  # CPU tensor
    with pytest.raises(RuntimeError, match="CUDA"):
        scale(src, 2.0)


def test_scale_unsupported_dtype():
    src = torch.randint(0, 10, (128,), dtype=torch.int32, device="cuda")
    with pytest.raises(RuntimeError, match="dtype"):
        scale(src, 2.0)


if __name__ == "__main__":
    import sys

    sys.exit(pytest.main([__file__, "-v", "-s"]))
```
---
## Step 5: Add a benchmark (required)
Benchmarks are `bench_*.py` files under `python/sglang/jit_kernel/benchmark/`. They are picked up by the same `run_suite.py` machinery as unit tests. Register them for **`stage-b-kernel-benchmark-1-gpu-large`** (PR JIT benchmark job: `python3 run_suite.py --hw cuda --suite stage-b-kernel-benchmark-1-gpu-large`).
Create `python/sglang/jit_kernel/benchmark/bench_scale.py`:
```python
import itertools

import torch
import triton
import triton.testing

from sglang.jit_kernel.benchmark.utils import (
    DEFAULT_DEVICE,
    DEFAULT_DTYPE,
    get_benchmark_range,
    run_benchmark,
)
from sglang.jit_kernel.scale import scale as jit_scale
from sglang.test.ci.ci_register import register_cuda_ci

register_cuda_ci(est_time=6, suite="stage-b-kernel-benchmark-1-gpu-large")

SIZE_LIST = get_benchmark_range(
    full_range=[2**n for n in range(10, 20)],  # 1K … 512K elements
    ci_range=[4096, 65536],
)
configs = list(itertools.product(SIZE_LIST))


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["jit", "torch"],
        line_names=["SGL JIT Kernel", "PyTorch"],
        styles=[("blue", "-"), ("red", "--")],
        ylabel="us",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(size: int, provider: str):
    src = torch.randn(size, dtype=DEFAULT_DTYPE, device=DEFAULT_DEVICE)
    factor = 2.0
    if provider == "jit":
        fn = lambda: jit_scale(src, factor)
    else:
        fn = lambda: src * factor
    return run_benchmark(fn)


if __name__ == "__main__":
    benchmark.run(print_data=True)
```
Run locally:
```bash
python python/sglang/jit_kernel/benchmark/bench_scale.py
```
Run the benchmark suite the way CI does:
```bash
cd test && python3 run_suite.py --hw cuda --suite stage-b-kernel-benchmark-1-gpu-large
```
---
## Troubleshooting
- **`No CI registry found in ...` from `run_suite.py`**: add a module-level `register_cuda_ci(...)` with literal `est_time` and `suite` (and optional `nightly=True`); starred args and non-literal values break AST collection
- **JIT compilation fails**: ensure the `.cuh` file is under `python/sglang/jit_kernel/csrc/`; reduce template argument combinations
- **CUDA crash / illegal memory access**: `CUDA_LAUNCH_BLOCKING=1`; `compute-sanitizer --tool memcheck python ...`
- **Unstable benchmark results**: `run_benchmark` uses CUDA-graph-based timing by default
---
## References
- `docs/developer_guide/development_jit_kernel_guide.md`
- `test/run_suite.py`: suite names, discovery of `jit_kernel/tests/` and `jit_kernel/benchmark/`, execution entrypoint for CI
- `python/sglang/test/ci/ci_register.py`: `register_cuda_ci` and AST registration rules
- `python/sglang/jit_kernel/utils.py`: `cache_once`, `load_jit`, `make_cpp_args`, `should_run_full_tests`, `get_ci_test_range`
- `python/sglang/jit_kernel/include/sgl_kernel/tensor.h`: `TensorMatcher`, `SymbolicSize/DType/Device`
- `python/sglang/jit_kernel/include/sgl_kernel/utils.cuh`: type aliases, `LaunchKernel`, `SGL_DEVICE`
- `python/sglang/jit_kernel/include/sgl_kernel/vec.cuh`: `AlignedVector`
- `python/sglang/jit_kernel/include/sgl_kernel/tile.cuh`: `tile::Memory`
- `python/sglang/jit_kernel/include/sgl_kernel/type.cuh`: `dtype_trait`, `packed_t`, `device::cast`
- `python/sglang/jit_kernel/include/sgl_kernel/math.cuh`: `device::math::`
- `python/sglang/jit_kernel/include/sgl_kernel/warp.cuh`: `warp::reduce_sum/max`
- `python/sglang/jit_kernel/include/sgl_kernel/cta.cuh`: `cta::reduce_max`
- `python/sglang/jit_kernel/include/sgl_kernel/atomic.cuh`: `atomic::max`
- `python/sglang/jit_kernel/include/sgl_kernel/runtime.cuh`: occupancy / SM count helpers
- `python/sglang/jit_kernel/csrc/add_constant.cuh`: minimal runnable reference
- `python/sglang/jit_kernel/csrc/elementwise/rmsnorm.cuh`: real example using `TensorMatcher` + `LaunchKernel` + `tile::Memory`
- `python/sglang/jit_kernel/csrc/elementwise/qknorm.cuh`: real example using `runtime::get_blocks_per_sm` + persistent kernel pattern
- `python/sglang/jit_kernel/benchmark/utils.py`: benchmark helpers
## Summary of Files Created
```
python/sglang/jit_kernel/csrc/elementwise/scale.cuh # NEW: CUDA kernel
python/sglang/jit_kernel/scale.py # NEW: Python wrapper
python/sglang/jit_kernel/tests/test_scale.py # NEW: Tests
python/sglang/jit_kernel/benchmark/bench_scale.py # NEW: Benchmark
```

@@ -0,0 +1,363 @@
---
name: add-sgl-kernel
description: Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)
---
# Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)
This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
## Goal
Add a new operation that scales each element of a tensor by a scalar factor:
- Input: tensor `x` (CUDA) and scalar `factor` (float)
- Output: `x * factor` (element-wise, in-place or into pre-allocated `out`)
- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
- Dispatched via `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro (defined in `sgl-kernel/include/utils.h`)
## Rules of thumb (must follow)
1. **Prefer `python/sglang/jit_kernel` first** when the kernel does **not** depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.
2. **Prefer `sgl-kernel`** when the kernel **does** depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow.
3. **Exception**: if the dependency is `flashinfer`, or CUTLASS that is already provided through `flashinfer`, the kernel can still be implemented as `jit_kernel`.
In addition, every new kernel must ship with:
- **Tests** (pytest)
- **A benchmark script** (triton.testing)
---
## Repository integration map
You will typically touch these files/areas:
- Implementation: `sgl-kernel/csrc/elementwise/scale.cu` (pick the right subdirectory)
- Public declarations: `sgl-kernel/include/sgl_kernel_ops.h`
- Torch extension registration: `sgl-kernel/csrc/common_extension.cc`
- Build: `sgl-kernel/CMakeLists.txt` (`set(SOURCES ...)`)
- Python API: `sgl-kernel/python/sgl_kernel/` and `sgl-kernel/python/sgl_kernel/__init__.py`
- Tests: `sgl-kernel/tests/test_scale.py`
- Benchmarks: `sgl-kernel/benchmark/bench_scale.py`
---
## Step 1: Implement the kernel in `csrc/`
Pick the right subdirectory:
- `csrc/elementwise/` — for element-wise ops (our example)
- `csrc/gemm/`, `csrc/attention/`, `csrc/moe/` — for other categories
Create `sgl-kernel/csrc/elementwise/scale.cu`:
```cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>
#include "utils.h" // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16
// scale_kernel: out[i] = input[i] * factor
// Supports float, half (__half), __nv_bfloat16 via template T
template <typename T>
__global__ void scale_kernel(T* __restrict__ out,
                             const T* __restrict__ input,
                             float factor,
                             int64_t n) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
}

void scale(at::Tensor& out, const at::Tensor& input, double factor) {
  TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
  TORCH_CHECK(out.is_cuda(), "out must be a CUDA tensor");
  TORCH_CHECK(out.is_contiguous(), "out must be contiguous");
  TORCH_CHECK(out.sizes() == input.sizes(), "out and input must have the same shape");
  TORCH_CHECK(out.scalar_type() == input.scalar_type(),
              "out and input must have the same dtype");

  const int64_t n = input.numel();
  if (n == 0) {
    return;  // nothing to do; a zero-block launch is an invalid configuration
  }
  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);

  // Switch to the input's device first, then query that device's current stream.
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  // Dispatches over float, float16, bfloat16
  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
        static_cast<c_type*>(out.data_ptr()),
        static_cast<const c_type*>(input.data_ptr()),
        static_cast<float>(factor),
        n);
    cudaError_t status = cudaGetLastError();
    TORCH_CHECK(status == cudaSuccess,
                "scale_kernel launch failed: ", cudaGetErrorString(status));
    return true;
  });
}
```
**Key points:**
- Use `at::Tensor` (PyTorch tensors), `TORCH_CHECK` for validation, `at::cuda::getCurrentCUDAStream()` for stream
- Keep Python wrappers thin; do shape/dtype/device validation in C++ right around the launch path
- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` covers `float`, `half` (FP16), `__nv_bfloat16` (BF16)
- Add device error checking after every kernel launch
- If a kernel only works on certain architectures, enforce that with `TORCH_CHECK` and skip logic in tests
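The kernel body computes in FP32 and casts back to the storage type. For FP16 that rounding behaviour can be reproduced on the host with the standard `struct` module, which is handy for reasoning about test tolerances. This is an illustrative Python sketch with hypothetical helper names, not part of `sgl-kernel`; BF16 is not covered because `struct` has no format code for it.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round a Python float to a struct format's precision ('<e' fp16, '<f' fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def scale_fp16(x: float, factor: float) -> float:
    """Emulate static_cast<half>(static_cast<float>(x) * factor) for one element."""
    x16 = round_to("<e", x)  # value as stored in the fp16 tensor
    prod = round_to("<f", x16 * round_to("<f", factor))  # multiply in fp32
    return round_to("<e", prod)  # cast back to fp16


if __name__ == "__main__":
    print(scale_fp16(1.5, 2.0))
    print(scale_fp16(0.1, 2.0))  # shows the fp16 rounding of 0.1
```

Note that `struct.pack` raises `OverflowError` for values outside the FP16 range, whereas a device-side cast would typically produce infinity, so the sketch is only meaningful for in-range values.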
---
## Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`
Edit `sgl-kernel/include/sgl_kernel_ops.h`, add to the elementwise section:
```cpp
void scale(at::Tensor& out, const at::Tensor& input, double factor);
```
---
## Step 3: Register the op in `csrc/common_extension.cc`
Edit `sgl-kernel/csrc/common_extension.cc`, inside `TORCH_LIBRARY_FRAGMENT(sgl_kernel, m)`:
```cpp
// From csrc/elementwise
m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
m.impl("scale", torch::kCUDA, &scale);
```
**Key points:**
- `Tensor!` means in-place / mutable output argument
- The schema is important for `torch.compile` and for consistent call signatures
- Keep the torch schema in PyTorch scalar types (`float` here), but note that the C++ launcher signature still needs `double` for scalar arguments accepted by `torch::Library`
---
## Step 4: Add the new source file to `CMakeLists.txt`
Edit `sgl-kernel/CMakeLists.txt`, add to `set(SOURCES ...)`:
```cmake
csrc/elementwise/scale.cu
```
**Key points:**
- Keep the list **alphabetically sorted** (the file explicitly requires this)
- If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic
---
## Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`
Prefer following the existing module organization first. For elementwise kernels, the usual pattern is:
- implement the Python wrapper in `sgl-kernel/python/sgl_kernel/elementwise.py`
- then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py`
For example, in `sgl-kernel/python/sgl_kernel/elementwise.py`, add:
```python
import torch
def scale(
    input: torch.Tensor,
    factor: float,
    out: torch.Tensor | None = None,
) -> torch.Tensor:
    """
    Element-wise scale: out = input * factor.

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    input : CUDA input tensor
    factor : scale factor (float)
    out : optional pre-allocated CUDA output tensor (same shape/dtype as input)
    """
    if out is None:
        out = torch.empty_like(input)
    torch.ops.sgl_kernel.scale.default(out, input, factor)
    return out
```
Then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py` following the existing import style used by other kernels.
---
## Step 6: Write tests (required)
Create `sgl-kernel/tests/test_scale.py`:
```python
import pytest
import torch

import sgl_kernel


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
def test_scale_correctness(dtype, size, factor):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    result = sgl_kernel.scale(input, factor, out=out)
    assert result is out
    expected = input * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


def test_scale_shape_mismatch():
    input = torch.randn(128, dtype=torch.float16, device="cuda")
    out = torch.empty(256, dtype=torch.float16, device="cuda")
    with pytest.raises(RuntimeError, match="same shape"):
        sgl_kernel.scale(input, 2.0, out=out)


def test_scale_cpu_input():
    input = torch.randn(128, dtype=torch.float16)  # CPU
    out = torch.empty_like(input)
    with pytest.raises(RuntimeError, match="CUDA"):
        sgl_kernel.scale(input, 2.0, out=out)


if __name__ == "__main__":
    import sys

    sys.exit(pytest.main([__file__, "-q"]))
```
---
## Step 7: Add a benchmark (required)
Create `sgl-kernel/benchmark/bench_scale.py`:
```python
import itertools

import torch
import triton
import triton.testing

import sgl_kernel
from sglang.utils import is_in_ci

IS_CI = is_in_ci()

dtypes = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
sizes = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K … 512K
factors = [2.0]

configs = list(itertools.product(dtypes, sizes))


def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
    return input * factor


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["dtype", "size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["sglang", "torch"],
        line_names=["SGL Kernel", "PyTorch"],
        styles=[("green", "-"), ("red", "--")],
        ylabel="µs (median)",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(dtype, size, provider):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    factor = 2.0

    if provider == "sglang":
        fn = lambda: sgl_kernel.scale(input, factor, out=out)
    else:
        fn = lambda: torch_scale(input, factor)

    # quantiles=[0.5, 0.2, 0.8] yields (median, 20th percentile, 80th percentile).
    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
        fn, quantiles=[0.5, 0.2, 0.8]
    )
    # perf_report unpacks the result as (value, min, max); convert ms to µs.
    return 1000 * ms, 1000 * min_ms, 1000 * max_ms


if __name__ == "__main__":
    benchmark.run(print_data=True)
```
---
## Step 8: Build
Build:
```bash
cd sgl-kernel
make build -j16
```
If you need to limit host resource usage:
```bash
cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
```
---
## Step 9: Validate
After building successfully, run the test and benchmark from the repository root (the build step above runs inside `sgl-kernel/`):
```bash
pytest sgl-kernel/tests/test_scale.py -q
python sgl-kernel/benchmark/bench_scale.py
```
---
## Troubleshooting
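- **Async CUDA errors**: rerun with `CUDA_LAUNCH_BLOCKING=1` so the error is reported at the failing kernel launch instead of at a later synchronization point
- **Memory errors**: run under `compute-sanitizer --tool memcheck python ...`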
- **Build is too slow / OOM**: reduce `MAX_JOBS` and `SGL_KERNEL_COMPILE_THREADS`
- **Binary bloat**: use `sgl-kernel/analyze_whl_kernel_sizes.py`
- **CMake sources list**: if your `.cu` file is missing from `SOURCES`, the symbol will be undefined at link time
---
## References
- `sgl-kernel/README.md`
- `sgl-kernel/include/sgl_kernel_ops.h`
- `sgl-kernel/csrc/common_extension.cc`
- `sgl-kernel/CMakeLists.txt`
- `sgl-kernel/include/utils.h`: `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro and friends
- `sgl-kernel/csrc/elementwise/activation.cu` — reference for the FP16/BF16/FP32 dispatch pattern
## Summary of Files Created/Modified
```
sgl-kernel/csrc/elementwise/scale.cu # NEW: CUDA kernel + launcher
sgl-kernel/include/sgl_kernel_ops.h # MODIFIED: C++ declaration
sgl-kernel/csrc/common_extension.cc # MODIFIED: schema + dispatch registration
sgl-kernel/CMakeLists.txt # MODIFIED: add source file (alphabetical)
sgl-kernel/python/sgl_kernel/elementwise.py # MODIFIED: Python wrapper
sgl-kernel/python/sgl_kernel/__init__.py # MODIFIED: re-export Python API
sgl-kernel/tests/test_scale.py # NEW: tests
sgl-kernel/benchmark/bench_scale.py # NEW: benchmark
```


@@ -0,0 +1,386 @@
---
name: ci-workflow-guide
description: Guide to SGLang CI workflow orchestration — stage ordering, fast-fail, gating, partitioning, execution modes, and debugging CI failures. Use when modifying CI workflows, adding stages, debugging CI pipeline issues, or understanding how tests are dispatched and gated across stages.
---
# SGLang CI Workflow Orchestration Guide
This skill covers the CI **infrastructure** layer — how tests are dispatched, gated, and fast-failed across stages. For test authoring (templates, fixtures, registration, model selection), see the [write-sglang-test skill](../write-sglang-test/SKILL.md).
---
## Naming Conventions
- **Suite**: `stage-{a,b,c}-test-{gpu_count}-gpu-{hardware}` (e.g., `stage-b-test-1-gpu-small`)
- **CI runner**: `{gpu_count}-gpu-{hardware}` (e.g., `1-gpu-5090`, `4-gpu-h100`, `8-gpu-h200`)
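
As a quick check on the suite naming scheme, a small parser can split a name into its parts. This is only a sketch: the regular expression and the `parse_suite_name` helper are hypothetical and not part of SGLang's CI code, and names with extra segments (for example the kernel or DeepEP suites) or the CPU suite are intentionally rejected.

```python
import re
from typing import Optional

# Hypothetical pattern mirroring stage-{a,b,c}-test-{gpu_count}-gpu-{hardware}.
SUITE_RE = re.compile(
    r"^stage-(?P<stage>[abc])-test-(?P<gpu_count>\d+)-gpu-(?P<hardware>[a-z0-9]+)$"
)


def parse_suite_name(name: str) -> Optional[dict]:
    """Split a suite name into stage, GPU count, and hardware; None if it does not fit."""
    m = SUITE_RE.match(name)
    if m is None:
        return None
    return {
        "stage": m["stage"],
        "gpu_count": int(m["gpu_count"]),
        "hardware": m["hardware"],
    }


print(parse_suite_name("stage-b-test-1-gpu-small"))
```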
---
## Key Files
| File | Role |
|------|------|
| `.github/workflows/pr-test.yml` | Main workflow — all stages, jobs, conditions, matrix definitions |
| `.github/workflows/pr-gate.yml` | PR gating: draft check, `run-ci` label, per-user rate limiting |
| `.github/actions/check-stage-health/action.yml` | Cross-job fast-fail: queries API for any failed job |
| `.github/actions/wait-for-jobs/action.yml` | Stage gating: polls API until stage jobs complete |
| `.github/actions/check-maintenance/action.yml` | Maintenance mode check |
| `test/run_suite.py` | Suite runner: collects, filters, partitions, executes tests |
| `python/sglang/test/ci/ci_register.py` | Test registration (AST-parsed markers), LPT auto-partition |
| `python/sglang/test/ci/ci_utils.py` | `run_unittest_files()`: execution, retry, continue-on-error |
| `scripts/ci/utils/slash_command_handler.py` | Handles slash commands from PR comments |
---
## Architecture Overview
```
┌──────────────┐
│ build kernel │
└──────┬───────┘
       │
       ├─ check-changes ──── detects which packages changed
       │                     (main_package, sgl_kernel, jit_kernel, multimodal_gen)
       │
       ├─ call-gate ──────── pr-gate.yml (draft? label? rate limit?)
       │
       ├─────────────────────────────────────────────────────┐
       │                                                     │
       ▼                                                     │
┌─────────────────────────────────────┐                      │
│  Stage A (~3 min)                   │                      │
│  pre-flight check                   │                      │
│                                     │                      │
│   ┌─────────────────────────────┐   │                      │
│   │ stage-a-test-1-gpu-small    │   │                      │
│   │ (small GPUs)                │   │                      │
│   └─────────────────────────────┘   │                      │
│   ┌─────────────────────────────┐   │                      │
│   │ stage-a-test-cpu            │   │                      │
│   │ (CPU)                       │   │                      │
│   └─────────────────────────────┘   │                      │
└──────┬──────────────────────────────┘                      │
       │                                                     │
       ▼                                                     ▼
┌─────────────────────────────────────┐   ┌──────────────────────────┐
│  Stage B (~30 min)                  │   │ kernel test              │
│  basic tests                        │   └──────────────────────────┘
│                                     │   ┌──────────────────────────┐
│   ┌─────────────────────────────┐   │   │ multimodal gen test      │
│   │ stage-b-test-1-gpu-small    │   │   └──────────────────────────┘
│   │ (small GPUs, e.g. 5090)     │   │
│   └─────────────────────────────┘   │
│   ┌─────────────────────────────┐   │
│   │ stage-b-test-1-gpu-large    │   │
│   │ (large GPUs, e.g. H100)     │   │
│   └─────────────────────────────┘   │
│   ┌─────────────────────────────┐   │
│   │ stage-b-test-2-gpu-large    │   │
│   │ (large GPUs, e.g. H100)     │   │
│   └─────────────────────────────┘   │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Stage C (~30 min)                  │
│  advanced tests                     │
│                                     │
│   ┌─────────────────────────────┐   │
│   │ stage-c-test-4-gpu-h100     │   │
│   │ (H100 GPUs)                 │   │
│   └─────────────────────────────┘   │
│   ┌─────────────────────────────┐   │
│   │ stage-c-test-8-gpu-h200     │   │
│   │ (8 x H200 GPUs)             │   │
│   └─────────────────────────────┘   │
│   ┌─────────────────────────────┐   │
│   │ stage-c-test-4-gpu-b200     │   │
│   │ (4 x B200 GPUs)             │   │
│   └─────────────────────────────┘   │
│   ┌─────────────────────────────┐   │
│   │ Other advanced tests        │   │
│   │ (DeepEP, PD Disagg, GB300)  │   │
│   └─────────────────────────────┘   │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  pr-test-finish                     │
│  aggregates all results, fails if   │
│  any job failed/cancelled           │
└─────────────────────────────────────┘
```
**Every stage test job** includes a `check-stage-health` step after checkout — if any job in the run has already failed, the job fast-fails (red X) with a root cause annotation.
**Scheduled runs** skip `wait-for-stage-*` jobs, running all stages in parallel. Fast-fail is also disabled.
---
## Fast-Fail Layers
Fast-fail operates at four layers, from fine to coarse:
| Layer | Mechanism | Granularity | Disabled on schedule? |
|-------|-----------|-------------|----------------------|
| **1. Test method → file** | `unittest -f` (failfast) | One test method fails → entire test file stops immediately | Yes |
| **2. File → suite** | `run_unittest_files()` default | One test file fails → entire suite stops (`--continue-on-error` off) | Yes |
| **3. Job → job (same stage)** | `check-stage-health` action | One job fails → other waiting jobs in same stage fast-fail (red X) | Yes |
| **4. Stage → stage (cross-stage)** | `wait-for-stage` + `needs` | Stage A fails → stage B/C jobs skip entirely (never get a runner) | Yes (wait jobs skipped) |
- **Layer 1**: `-f` flag appended to all `python3 -m pytest` / `unittest` invocations in `ci_utils.py`
- **Layer 2**: `--continue-on-error` flag in `run_suite.py` — off for PRs, on for scheduled runs
- **Layer 3**: `check-stage-health` auto-detects `schedule` event and skips; filters out cascade failures to show only root cause jobs
- **Layer 4**: `wait-for-stage-*` jobs are conditioned on `github.event_name == 'pull_request'` — skipped for scheduled runs
---
## Execution Modes
| Aspect | PR (`pull_request`) | Scheduled (`cron`, every 6h) | `/rerun-stage` (`workflow_dispatch`) |
|--------|---------------------|------------------------------|--------------------------------------|
| **Stage ordering** | Sequential: A → B → C via `wait-for-stage-*` | Parallel (all at once) | Single target stage only |
| **Cross-job fast-fail** | Yes (`check-stage-health`) | No (`check-stage-health` detects `schedule` and skips) | Yes |
| **continue-on-error** | No (stop at first failure within suite) | Yes (run all tests) | No |
| **Retry** | Enabled | Enabled | Enabled |
| **max_parallel** | 3 (default), 14 if `high priority` label | 14 | 3 (default), 14 if `high priority` |
| **PR gate** | Yes (draft, label, rate limit) | Skipped | Skipped |
| **Concurrency** | `cancel-in-progress: true` per branch | Queue (no cancel) | Isolated per stage+SHA |
---
## Stage Gating (`wait-for-jobs` action)
`wait-for-stage-a` and `wait-for-stage-b` are lightweight `ubuntu-latest` jobs that poll the GitHub Actions API.
**How it works:**
1. Calls `listJobsForWorkflowRun` to list all jobs in the current run
2. Matches jobs by exact name or prefix (for matrix jobs, e.g., `stage-b-test-1-gpu-small (3)`)
3. If any matched job has `conclusion === 'failure'` → fail immediately (fast-fail)
4. If all matched jobs are completed and count matches `expected_count` → success
5. Otherwise → sleep `poll-interval-seconds` (default: 60s) and retry
6. Timeout after `max-wait-minutes` (240 min for stage-a, 480 min for stage-b)
**Job specs example** (stage-b):
```json
[
{"prefix": "stage-b-test-1-gpu-small", "expected_count": 8},
{"prefix": "stage-b-test-1-gpu-large", "expected_count": 14},
{"prefix": "stage-b-test-2-gpu-large", "expected_count": 4},
{"prefix": "stage-b-test-4-gpu-b200", "expected_count": 1}
]
```
> **Critical**: `expected_count` must match the matrix size. If you add/remove matrix entries, update the wait job's spec accordingly.
**PR only**: Condition `github.event_name == 'pull_request' && !inputs.target_stage` — scheduled runs and `/rerun-stage` skip these entirely, allowing parallel execution.
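
The polling decision described above can be sketched in Python for one polling round. This is an illustration, not the action's code (the real action runs against the GitHub API): `evaluate_stage` is a hypothetical name, the job dicts only borrow the `name`, `status`, and `conclusion` fields of the Actions jobs API, and treating `"<prefix> (N)"` as the matrix-job form is an assumption based on the example names in this guide.

```python
def evaluate_stage(jobs, specs):
    """Return 'failure', 'success', or 'pending' for one polling round.

    jobs:  list of dicts with 'name', 'status', and 'conclusion'.
    specs: list of dicts with 'prefix' and 'expected_count'.
    """
    all_done = True
    for spec in specs:
        prefix = spec["prefix"]
        # Exact name, or matrix jobs such as "stage-b-test-1-gpu-small (3)".
        matched = [
            j for j in jobs
            if j["name"] == prefix or j["name"].startswith(prefix + " (")
        ]
        # Any failed job ends the wait immediately (fast-fail).
        if any(j.get("conclusion") == "failure" for j in matched):
            return "failure"
        completed = [j for j in matched if j.get("status") == "completed"]
        # Keep polling until every expected job exists and has completed.
        if len(completed) != len(matched) or len(matched) != spec["expected_count"]:
            all_done = False
    return "success" if all_done else "pending"
```

A caller would loop on `"pending"`, sleeping for the poll interval between rounds and giving up once the maximum wait is exceeded. The sketch also shows why a stale `expected_count` leads to a timeout: the count never matches, so the result stays `"pending"`.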
---
## Cross-Job Fast-Fail (`check-stage-health` action)
Composite action called after checkout in every stage test job (21 jobs total across `pr-test.yml`, `pr-test-multimodal-gen.yml`, `pr-test-sgl-kernel.yml`, `pr-test-jit-kernel.yml`).
**How it works:**
1. Queries `listJobsForWorkflowRun` for the current workflow run
2. Filters for **root cause failures only** — jobs with `conclusion === 'failure'` whose failing step is NOT `check-stage-health` (excludes cascade failures)
3. If root cause failures found → calls `core.setFailed()` with the list of root cause job names
4. If none → does nothing (step succeeds)
**Cascade filtering**: When job A fast-fails due to health check, it also has `conclusion: failure`. Without filtering, job B would list both the original failure AND job A's fast-fail. The filter checks each failed job's `steps` array — if the failing step name contains `check-stage-health` or `Check stage health`, it's excluded from the root cause list.
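
The cascade filter can be sketched as a pure function over the job list. This is a Python illustration of the logic only; the real action is a composite GitHub Action, and `root_cause_failures` and `HEALTH_STEP_MARKERS` are hypothetical names. The step dicts assume the `name` and `conclusion` fields that the Actions jobs API reports per step.

```python
# Step-name fragments that identify a health-check (cascade) failure.
HEALTH_STEP_MARKERS = ("check-stage-health", "Check stage health")


def root_cause_failures(jobs):
    """Return names of failed jobs whose failing step is not the health check."""
    roots = []
    for job in jobs:
        if job.get("conclusion") != "failure":
            continue
        failed_steps = [
            s for s in job.get("steps", []) if s.get("conclusion") == "failure"
        ]
        # A job that failed in the health-check step is a cascade, not a root cause.
        is_cascade = any(
            marker in step["name"]
            for step in failed_steps
            for marker in HEALTH_STEP_MARKERS
        )
        if not is_cascade:
            roots.append(job["name"])
    return roots
```

If the returned list is non-empty, the calling job would fail itself and report those names; an empty list lets the job proceed.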
**Usage pattern:**
```yaml
steps:
  - name: Checkout code
    uses: actions/checkout@v4
    ...
  - uses: ./.github/actions/check-stage-health
    id: stage-health
  - name: Install dependencies   # skipped automatically if health check failed
    ...                          # (default if: success() is false)
  - name: Run test               # also skipped
    ...
```
**Visual effect**: Job shows **red X** (failure) with error annotation showing root cause job names. Subsequent steps are naturally skipped (default `if: success()` is false after a failed step). No per-step `if` guards needed.
**No stage filtering**: Checks ALL jobs in the run, not just the current stage. Any failure anywhere triggers fast-fail.
**Error message example:**
```
Fast-fail: skipping — root cause job(s): stage-b-test-1-gpu-small (0), stage-b-test-1-gpu-small (1)
```
---
## Within-Suite Failure Handling
Controlled by `run_unittest_files()` in `python/sglang/test/ci/ci_utils.py`.
### Flags
| Flag | PR default | Scheduled default | Effect |
|------|------------|-------------------|--------|
| `--continue-on-error` | Off | On | Off: stop at first failure. On: run all files, report all failures at end |
| `--enable-retry` | On | On | Retry retriable failures (accuracy/perf assertions) |
| `--max-attempts` | 2 | 2 | Max attempts per file including initial run |
### Retry Classification
When a test fails and retry is enabled, the output is classified:
**Non-retriable** (checked first — real code errors):
`SyntaxError`, `ImportError`, `ModuleNotFoundError`, `NameError`, `TypeError`, `AttributeError`, `RuntimeError`, `CUDA out of memory`, `OOM`, `Segmentation fault`, `core dumped`, `ConnectionRefusedError`, `FileNotFoundError`
**Retriable** (accuracy/performance):
`AssertionError` with comparison patterns (`not greater than`, `not less than`, `not equal to`), `accuracy`, `score`, `latency`, `throughput`, `timeout`
**Default**: Unknown `AssertionError` → retriable. Other unknown failures → not retriable.
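
The classification order can be sketched as follows. This is a simplified illustration rather than the code in `ci_utils.py`: the list names and `is_retriable` are hypothetical, plain substring matching stands in for whatever matching the real implementation uses, and lower-casing the output before the retriable check is an assumption.

```python
# Patterns copied from the lists above; the constant names are hypothetical.
NON_RETRIABLE = [
    "SyntaxError", "ImportError", "ModuleNotFoundError", "NameError",
    "TypeError", "AttributeError", "RuntimeError", "CUDA out of memory",
    "OOM", "Segmentation fault", "core dumped", "ConnectionRefusedError",
    "FileNotFoundError",
]
RETRIABLE = [
    "not greater than", "not less than", "not equal to",
    "accuracy", "score", "latency", "throughput", "timeout",
]


def is_retriable(output: str) -> bool:
    """Decide whether a failed test file should be retried."""
    # Non-retriable patterns are checked first: real code errors never retry.
    if any(pattern in output for pattern in NON_RETRIABLE):
        return False
    lowered = output.lower()
    if any(pattern in lowered for pattern in RETRIABLE):
        return True
    # Default: an unknown AssertionError retries, any other unknown failure does not.
    return "AssertionError" in output
```

Because the non-retriable check runs first, output that contains both a `RuntimeError` and an accuracy assertion is not retried.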
### How `continue_on_error` is set
In `pr-test.yml`'s `check-changes` job:
- `schedule` runs or `run_all_tests` flag → `continue_on_error = 'true'`
- PR runs → `continue_on_error = 'false'`
Each test job propagates via:
```yaml
env:
  CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }}
run: |
  python3 run_suite.py --hw cuda --suite <name> $CONTINUE_ON_ERROR_FLAG
```
---
## Test Partitioning
Large suites are split across matrix jobs using the **LPT (Longest Processing Time) heuristic** in `ci_register.py:auto_partition()`:
1. Sort tests by `est_time` descending, filename as tie-breaker (deterministic)
2. Greedily assign each test to the partition with smallest cumulative time
3. Result: roughly equal total time per partition
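
The three steps above can be sketched as a small greedy function. This is a minimal illustration of the LPT heuristic, not the code in `ci_register.py`; `lpt_partition` and the `(filename, est_time)` tuple layout are assumptions, as is breaking ties between equally loaded partitions by lowest index.

```python
def lpt_partition(tests, num_partitions):
    """Greedy LPT: tests is a list of (filename, est_time) tuples."""
    # Step 1: longest first, filename as a deterministic tie-breaker.
    ordered = sorted(tests, key=lambda t: (-t[1], t[0]))
    partitions = [[] for _ in range(num_partitions)]
    totals = [0.0] * num_partitions
    for name, est_time in ordered:
        # Step 2: assign to the partition with the smallest cumulative time
        # (min() returns the lowest index on ties).
        idx = min(range(num_partitions), key=lambda i: totals[i])
        partitions[idx].append(name)
        totals[idx] += est_time
    return partitions


tests = [("a.py", 100), ("b.py", 60), ("c.py", 50), ("d.py", 40)]
print(lpt_partition(tests, 2))  # [['a.py', 'd.py'], ['b.py', 'c.py']]
```

In this example the partition totals are 140 and 110, which shows that the result is balanced only roughly; an inaccurate `est_time` skews it further.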
**Partition table** (CUDA per-commit suites):
| Suite | Partitions | Runner | max_parallel |
|-------|-----------|--------|-------------|
| `stage-a-test-1-gpu-small` | 1 (no matrix) | `1-gpu-5090` | — |
| `stage-a-test-cpu` | 1 (no matrix) | `ubuntu-latest` | — |
| `stage-b-test-1-gpu-small` | 8 | `1-gpu-5090` | 8 |
| `stage-b-test-1-gpu-large` | 14 | `1-gpu-h100` | dynamic (3 or 14) |
| `stage-b-test-2-gpu-large` | 4 | `2-gpu-h100` | — |
| `stage-b-test-4-gpu-b200` | 1 (no matrix) | `4-gpu-b200` | — |
| `stage-b-kernel-unit-1-gpu-large` | 1 (no matrix) | `1-gpu-h100` | — |
| `stage-b-kernel-unit-8-gpu-h200` | 1 (no matrix) | `8-gpu-h200` | — |
| `stage-b-kernel-benchmark-1-gpu-large` | 1 (no matrix) | `1-gpu-h100` | — |
| `stage-c-test-4-gpu-h100` | 3 | `4-gpu-h100` | — |
| `stage-c-test-8-gpu-h200` | 4 | `8-gpu-h200` | — |
| `stage-c-test-8-gpu-h20` | 2 | `8-gpu-h20` | — |
| `stage-c-test-deepep-4-gpu-h100` | 1 (no matrix) | `4-gpu-h100` | — |
| `stage-c-test-deepep-8-gpu-h200` | 1 (no matrix) | `8-gpu-h200` | — |
| `stage-c-test-4-gpu-b200` | 4 | `4-gpu-b200` | — |
| `stage-c-test-4-gpu-gb200` | 1 (no matrix) | `4-gpu-gb200` | — |
> **Note**: Kernel suites (`stage-b-kernel-*`) run via `pr-test-jit-kernel.yml` and `pr-test-sgl-kernel.yml`, not the main `pr-test.yml`. Multimodal diffusion uses `python/sglang/multimodal_gen/test/run_suite.py`, not `test/run_suite.py`.
**Workflow usage:**
```yaml
strategy:
  matrix:
    partition: [0, 1, 2, 3, 4, 5, 6, 7]
steps:
  # A block scalar (|) keeps the shell line continuation intact.
  - run: |
      python3 run_suite.py --hw cuda --suite stage-b-test-1-gpu-small \
        --auto-partition-id ${{ matrix.partition }} --auto-partition-size 8
```
---
## check-changes Job
Determines which test suites to run based on file changes.
### Detection Methods
| Trigger | Method | Details |
|---------|--------|---------|
| `pull_request` | `dorny/paths-filter` | Detects changes via GitHub diff |
| `workflow_dispatch` (with `pr_head_sha`) | GitHub API | `repos/{repo}/compare/main...{sha}` |
| `schedule` / `run_all_tests` | Force all true | Runs everything |
### Output Flags
| Output | Triggers |
|--------|----------|
| `main_package` | Stage A/B/C test suites |
| `sgl_kernel` | Kernel wheel builds + kernel test suites |
| `jit_kernel` | JIT kernel test workflow |
| `multimodal_gen` | Multimodal-gen test workflow |
> **Note**: `sgl_kernel` is forced to `false` when `target_stage` is set, because `sgl-kernel-build-wheels` won't run and wheel artifacts won't be available.
---
## Concurrency Control
```
group: pr-test-{event_name}-{branch}-{pr_sha}-{stage}
```
| Segment | Source | Purpose |
|---------|--------|---------|
| `event_name` | `github.event_name` | Prevents scheduled runs colliding with fork PRs named `main` |
| `branch` | `github.head_ref \|\| github.ref_name` | Per-branch isolation |
| `pr_sha` | `inputs.pr_head_sha \|\| 'current'` | Isolates `/rerun-stage` from main runs |
| `stage` | `inputs.target_stage \|\| 'all'` | Allows parallel stage dispatches |
`cancel-in-progress: true` for `pull_request` events (new push cancels old run), `false` for `workflow_call`.
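
To see how the four segments keep runs apart, the group key can be rebuilt in Python. This is only a sketch of the string construction described in the table; `concurrency_group` and its parameter names are hypothetical, and empty strings stand in for unset GitHub context values.

```python
def concurrency_group(event_name, head_ref, ref_name, pr_head_sha="", target_stage=""):
    """Rebuild the group key: pr-test-{event_name}-{branch}-{pr_sha}-{stage}."""
    branch = head_ref or ref_name          # github.head_ref || github.ref_name
    pr_sha = pr_head_sha or "current"      # inputs.pr_head_sha || 'current'
    stage = target_stage or "all"          # inputs.target_stage || 'all'
    return f"pr-test-{event_name}-{branch}-{pr_sha}-{stage}"


# A normal PR run and a /rerun-stage dispatch land in different groups.
print(concurrency_group("pull_request", "my-feature", "42/merge"))
print(concurrency_group("workflow_dispatch", "", "main", "abc123", "stage-b"))
```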
---
## How To: Add a New Stage Job
1. Define the job in `pr-test.yml` with `needs: [check-changes, call-gate, wait-for-stage-X, ...]`
2. Copy the `if:` condition pattern from an existing same-stage job (handles `target_stage`, `schedule`, `main_package`)
3. Add `checkout` step
4. Add `check-stage-health` step (after checkout) — if any prior job failed, `core.setFailed()` fires and all subsequent steps auto-skip via default `if: success()`
5. Add `check-maintenance` step
6. Add `download-artifact` step if `sgl_kernel` changed
7. Add `install dependencies` step
8. Add `run test` step with `$CONTINUE_ON_ERROR_FLAG`
9. Add `upload-cuda-coredumps` step with `if: always()`
10. Register the suite name in `PER_COMMIT_SUITES` in `test/run_suite.py`
11. If using matrix, add `--auto-partition-id` and `--auto-partition-size` to the run command
12. **Update `wait-for-stage-X`** job spec with the new job name and `expected_count` (if matrix)
13. **Add the job to `pr-test-finish.needs`** list
---
## How To: Debug CI Failures
| Symptom | Likely cause | What to check |
|---------|-------------|---------------|
| Stage-B/C jobs show a red X with a `Fast-fail` annotation and all later steps skipped | Earlier job failed, `check-stage-health` triggered | Find the root cause job named in the annotation |
| `wait-for-stage-b` timeout | `expected_count` doesn't match matrix size | Verify job spec counts match `matrix:` array length |
| `pr-test-finish` fails but all jobs green | A job was `cancelled` (counts as failure in finish) | Check concurrency cancellation |
| Tests pass locally but fail in CI | Partition assignment, runner GPU type, or `est_time` inaccuracy | Check which partition the test lands in; verify runner label |
| Flaky test retried and passed | Retriable failure (accuracy/perf) | Check `[CI Retry]` markers in job logs |
| Flaky test NOT retried | Matched non-retriable pattern | Check if error matches `NON_RETRIABLE_PATTERNS` in `ci_utils.py` |
---
## Slash Commands
| Command | Effect |
|---------|--------|
| `/tag-run-ci-label` | Adds `run-ci` label to PR |
| `/rerun-failed-ci` | Reruns failed jobs in the latest workflow run |
| `/tag-and-rerun-ci` | Adds label + reruns |
| `/rerun-stage <stage>` | Dispatches `pr-test.yml` with `target_stage=<stage>` |
| `/rerun-test <test-file>` | Reruns a specific test file via `rerun-test.yml` |
Handled by `scripts/ci/utils/slash_command_handler.py` and `.github/workflows/slash-command-handler.yml`.

View File

@@ -0,0 +1,657 @@
---
name: debug-cuda-crash
description: Call this skill when you need to debug CUDA crashes in SGLang using kernel API logging
---
# Tutorial: Debugging CUDA Crashes with Kernel API Logging
This tutorial shows you how to debug CUDA crashes and errors in SGLang using the `@debug_kernel_api` logging decorator.
## Goal
When your code crashes with CUDA errors such as illegal memory access, device-side assert, out-of-bounds, or NaN/Inf, use kernel API logging to:
- Capture input tensors BEFORE the crash occurs
- Understand what data caused the problem
- Track tensor shapes, dtypes, and values through the call boundary that triggered the crash
- Detect numerical issues such as NaN, Inf, or obviously wrong shapes
## Why Use Kernel API Logging?
**Problem**: CUDA errors often crash the program before normal debugging output is flushed.
**Solution**: SGLang's `@debug_kernel_api` decorator logs inputs before execution, so you can still see what caused the crash even after the program aborts.
## What Is Covered?
The current logging coverage focuses on the highest-value kernel boundaries in SGLang:
- Custom ops registered through `register_custom_op(...)`
- External custom ops registered through `register_custom_op_from_extern(...)`
- LLM attention, linear, quantization, and multi-platform wrapper entry points
- Diffusion attention impl, linear, rotary, and custom-op wrapper entry points
- Selected direct `torch.ops.sglang.*` hotspots and model-specific bypasses
This means the logging is useful for both LLM and diffusion kernel debugging, but it does not automatically cover every pure PyTorch call in the repository.
## Step 1: Enable Kernel API Logging
### Basic Logging (Function Names Only)
```bash
export SGLANG_KERNEL_API_LOGLEVEL=1
export SGLANG_KERNEL_API_LOGDEST=stdout
python my_script.py
```
Output:
```
================================================================================
[2026-03-19 00:47:06] SGLang Kernel API Call: RMSNorm.forward
================================================================================
[2026-03-19 00:47:06] SGLang Kernel API Call: sglang.quant_method.UnquantizedLinearMethod.apply
================================================================================
[2026-03-19 00:47:06] SGLang Kernel API Call: sglang.custom_op.fused_inplace_qknorm
```
This is a real level-1 excerpt captured from `Qwen/Qwen3-0.6B`.
### Detailed Logging (Inputs with Metadata)
```bash
export SGLANG_KERNEL_API_LOGLEVEL=3
export SGLANG_KERNEL_API_LOGDEST=debug.log
python my_script.py
```
Output in `debug.log`:
```
================================================================================
[2026-03-19 00:47:30] SGLang Kernel API Call: sglang.quant_method.UnquantizedLinearMethod.apply
Positional input arguments:
arg[0]=QKVParallelLinear(
repr=QKVParallelLinear(in_features=1024, output_features=4096, bias=False, tp_size=1, gather_output=False)
)
arg[1]=Tensor(
shape=(1, 1024)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
)
arg[2]=None
Output:
return=Tensor(
shape=(1, 4096)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
)
```
This is a real level-3 excerpt captured from `Qwen/Qwen3-0.6B`.
### Full Logging (With Tensor Statistics)
```bash
export SGLANG_KERNEL_API_LOGLEVEL=5
export SGLANG_KERNEL_API_LOGDEST=debug.log
python my_script.py
```
Additional output:
```
================================================================================
[2026-03-19 01:00:42] SGLang Kernel API Call: diffusion.quant_method.UnquantizedLinearMethod.apply
Positional input arguments:
arg[1]=Tensor(
shape=(1, 77, 768)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
min=-27.250000
max=28.500000
mean=0.011723
nan_count=0
inf_count=0
)
Output:
return=Tensor(
shape=(1, 77, 2304)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
min=-8.937500
max=9.375000
mean=0.009460
nan_count=0
inf_count=0
)
```
This is a real level-5 excerpt captured from `black-forest-labs/FLUX.1-dev`.
### Crash-Safe Dumps (Inputs Saved Before Execution)
```bash
export SGLANG_KERNEL_API_LOGLEVEL=10
export SGLANG_KERNEL_API_LOGDEST=debug.log
export SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_kernel_api_dumps
python my_script.py
```
At level 10, SGLang saves the inputs before execution. If the kernel crashes, the dump directory still contains the inputs and exception metadata.
If CUDA graph capture is active, tensor dumps are skipped automatically to avoid capture-time CUDA errors. In that case, you still get the kernel API call log, but not `inputs.pt` / `outputs.pt`.
Level-10 dumps are best understood as crash-safe call snapshots. They always preserve the observed call boundary. They do not guarantee one-click replay for every method, because some methods depend on module state that is not serialized into the dump.
Real level-10 dump layout from `Qwen/Qwen3-0.6B`:
```text
/tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps
/tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001
/tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001/inputs.pt
/tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001/metadata.json
/tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001/outputs.pt
```
Real `metadata.json` excerpt:
```json
{
"function_name": "RotaryEmbedding.forward",
"timestamp": "20260319_004821_182",
"process_id": 919286,
"execution_status": "completed",
"input_tensor_keys": ["arg_0", "arg_1", "arg_2"],
"output_tensor_keys": ["result_0", "result_1"]
}
```
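
With many dump directories, it helps to list only the calls that did not complete. The sketch below relies solely on the layout and the `execution_status` field shown above; `find_failed_calls` is a hypothetical helper, not part of SGLang.

```python
import json
from pathlib import Path


def find_failed_calls(dump_root):
    """Return (dump_dir, metadata) pairs for calls that did not complete."""
    failed = []
    # Each call has its own sub-directory containing metadata.json.
    for meta_path in sorted(Path(dump_root).glob("*/metadata.json")):
        meta = json.loads(meta_path.read_text())
        if meta.get("execution_status") != "completed":
            failed.append((meta_path.parent, meta))
    return failed


if __name__ == "__main__":
    for dump_dir, meta in find_failed_calls("/tmp/sglang_kernel_api_dumps"):
        print(dump_dir.name, meta.get("function_name"), meta.get("execution_status"))
```

Assuming the `.pt` files are ordinary `torch.save` outputs, the saved inputs of a failed call can then be inspected with `torch.load(dump_dir / "inputs.pt", map_location="cpu")`.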
## Step 2: Reproduce an LLM CUDA Crash
Create a temporary reproducer:
```bash
python3 - <<'PY'
from pathlib import Path

# The heredoc delimiter is quoted, so "\n" reaches Python unchanged
# and becomes a real newline in the generated file.
Path("/tmp/sglang_llm_crash.py").write_text(
    "import torch\n"
    "import torch.nn.functional as F\n"
    "from sglang.srt.utils.custom_op import register_custom_op\n\n"
    "def _fake_embedding(indices, table):\n"
    "    return torch.empty((*indices.shape, table.shape[-1]), device=table.device, dtype=table.dtype)\n\n"
    "@register_custom_op(op_name='mock_llm_cuda_crash', fake_impl=_fake_embedding)\n"
    "def mock_llm_cuda_crash(indices, table):\n"
    "    out = F.embedding(indices, table)\n"
    "    torch.cuda.synchronize()\n"
    "    return out\n\n"
    "table = torch.randn(4, 8, device='cuda', dtype=torch.float16)\n"
    # Index 7 is out of range for a 4-row table, which triggers the crash.
    "indices = torch.tensor([0, 7], device='cuda', dtype=torch.long)\n"
    "mock_llm_cuda_crash(indices, table)\n"
)
PY

SGLANG_KERNEL_API_LOGLEVEL=1 \
SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_llm_level1.log \
python3 /tmp/sglang_llm_crash.py
```
What to expect:
- The script exits with a CUDA `device-side assert`
- The log still contains the last API boundary before the crash
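
To pull the last API boundary out of a long log, a short script is enough. This sketch assumes only the `SGLang Kernel API Call: <name>` line format shown in the excerpts above; `last_api_call` is a hypothetical helper.

```python
import re
from pathlib import Path

# Matches the call header lines shown in the log excerpts.
CALL_RE = re.compile(r"SGLang Kernel API Call: (\S+)")


def last_api_call(log_text):
    """Return the name of the last logged API call, or None if there is none."""
    names = CALL_RE.findall(log_text)
    return names[-1] if names else None


if __name__ == "__main__":
    log_path = Path("/tmp/sglang_llm_level1.log")
    if log_path.exists():
        print(last_api_call(log_path.read_text()))
```

For the reproducer above, the last entry should be the `mock_llm_cuda_crash` custom op.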
Try the same example at level 3:
```bash
SGLANG_KERNEL_API_LOGLEVEL=3 \
SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_llm_level3.log \
python3 /tmp/sglang_llm_crash.py
```
Now the log shows tensor metadata before the crash.
Try level 10:
```bash
SGLANG_KERNEL_API_LOGLEVEL=10 \
SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_llm_level10.log \
SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_llm_level10_dumps \
python3 /tmp/sglang_llm_crash.py
```
Now you should see:
- A log entry for `sglang.custom_op.mock_llm_cuda_crash`
- A dump directory with `inputs.pt`
- `metadata.json` showing `execution_status: "exception"`
- No `outputs.pt`, because the kernel crashed before producing output
For real-model success-path level-10 dumps, it is often easier to temporarily disable CUDA graph and piecewise CUDA graph for the debug run.
## Step 3: Reproduce a Diffusion CUDA Crash
Create a temporary diffusion-side reproducer:
```bash
python3 - <<'PY'
from pathlib import Path

# The heredoc delimiter is quoted, so "\n" reaches Python unchanged
# and becomes a real newline in the generated file.
Path("/tmp/sglang_diffusion_crash.py").write_text(
    "import torch\n"
    "import torch.nn.functional as F\n"
    "from sglang.multimodal_gen.runtime.layers.utils import register_custom_op\n\n"
    "def _fake_embedding(positions, cache):\n"
    "    return torch.empty((*positions.shape, cache.shape[-1]), device=cache.device, dtype=cache.dtype)\n\n"
    "@register_custom_op(op_name='mock_diffusion_cuda_crash', fake_impl=_fake_embedding)\n"
    "def mock_diffusion_cuda_crash(positions, cache):\n"
    "    out = F.embedding(positions, cache)\n"
    "    torch.cuda.synchronize()\n"
    "    return out\n\n"
    "cache = torch.randn(4, 64, device='cuda', dtype=torch.float16)\n"
    # Position 9 is out of range for a 4-row cache, which triggers the crash.
    "positions = torch.tensor([0, 9], device='cuda', dtype=torch.long)\n"
    "mock_diffusion_cuda_crash(positions, cache)\n"
)
PY

SGLANG_KERNEL_API_LOGLEVEL=1 \
SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_diffusion_level1.log \
python3 /tmp/sglang_diffusion_crash.py
```
Try level 3:
```bash
SGLANG_KERNEL_API_LOGLEVEL=3 \
SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_diffusion_level3.log \
python3 /tmp/sglang_diffusion_crash.py
```
Try level 10:
```bash
SGLANG_KERNEL_API_LOGLEVEL=10 \
SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_diffusion_level10.log \
SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_diffusion_level10_dumps \
python3 /tmp/sglang_diffusion_crash.py
```
If your local environment has unrelated FlashInfer import issues, resolve them in the shell before running the example. The example itself does not set any `FLASHINFER_*` environment variable.
## Step 4: Multi-Process Debugging
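When running with multiple GPUs or worker processes, put `%i` in the log path so that each process writes its own file (in the examples below the placeholder expands to the process ID):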
```bash
export SGLANG_KERNEL_API_LOGLEVEL=3
export SGLANG_KERNEL_API_LOGDEST=debug_rank_%i.log
torchrun --nproc_per_node=4 my_script.py
```
This creates separate logs such as:
- `debug_rank_12345.log`
- `debug_rank_12346.log`
- `debug_rank_12347.log`
- `debug_rank_12348.log`
Real multi-process example from a 2-GPU `Qwen/Qwen2.5-0.5B-Instruct` run:
```text
/tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950201.log
/tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950349.log
/tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950350.log
/tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950351.log
```
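
Judging from the file names above, the `%i` placeholder is replaced with the process ID. A minimal sketch of that substitution, with `expand_log_path` as a hypothetical name and the process-ID expansion as an assumption drawn from the examples rather than from SGLang's source:

```python
import os
from typing import Optional


def expand_log_path(template: str, pid: Optional[int] = None) -> str:
    """Replace %i in a log path template with a process ID."""
    if pid is None:
        pid = os.getpid()  # default to the current process
    return template.replace("%i", str(pid))


print(expand_log_path("debug_rank_%i.log", pid=12345))  # debug_rank_12345.log
```

Note that under this assumption the number in the file name is a process ID, not a rank, so mapping a log back to a GPU rank requires looking inside the log.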
You should usually do the same for level-10 dump directories:
```bash
export SGLANG_KERNEL_API_LOGLEVEL=10
export SGLANG_KERNEL_API_LOGDEST=debug_rank_%i.log
export SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_kernel_api_dumps_%i
```
This avoids multiple ranks writing into the same dump directory tree.
## Step 5: Filter Level-10 Dumps
If level 10 is too noisy, restrict dumps to specific APIs:
```bash
export SGLANG_KERNEL_API_LOGLEVEL=10
export SGLANG_KERNEL_API_LOGDEST=debug.log
export SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_kernel_api_dumps
export SGLANG_KERNEL_API_DUMP_INCLUDE='sglang.custom_op.*'
export SGLANG_KERNEL_API_DUMP_EXCLUDE='*.fake_impl'
```
`SGLANG_KERNEL_API_DUMP_INCLUDE` and `SGLANG_KERNEL_API_DUMP_EXCLUDE` use shell-style wildcard matching.
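
Shell-style matching of this kind behaves like Python's `fnmatch` module, so a pattern can be tried out before a long debug run. The sketch below is an illustration under stated assumptions: `should_dump` is a hypothetical helper, each variable is treated as a single pattern, and an exclude match is assumed to win over an include match.

```python
from fnmatch import fnmatchcase


def should_dump(api_name, include=None, exclude=None):
    """Decide whether an API name passes the include/exclude patterns."""
    # With an include pattern set, only matching names are kept.
    if include and not fnmatchcase(api_name, include):
        return False
    # Assumption: an exclude match overrides an include match.
    if exclude and fnmatchcase(api_name, exclude):
        return False
    return True


print(should_dump("sglang.custom_op.mock_llm_cuda_crash",
                  include="sglang.custom_op.*", exclude="*.fake_impl"))
```

Note that `*` in shell-style patterns also matches dots, so `sglang.custom_op.*` covers every name under that prefix regardless of depth.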
## Step 6: Common CUDA Errors and What to Check
### Illegal Memory Access or Device-Side Assert
**Typical errors**:
```
RuntimeError: CUDA error: an illegal memory access was encountered
torch.AcceleratorError: CUDA error: device-side assert triggered
```
Use:
```bash
export SGLANG_KERNEL_API_LOGLEVEL=3
```
Check in the logs:
- ✅ Tensor shapes
- ✅ Tensor dtypes
- ✅ CUDA vs CPU device placement
- ✅ Tensor stride / contiguity
- ✅ Whether the failing call has inputs logged but no outputs logged
Typical shape-mismatch pattern:
```text
SGLang Kernel API Call: ...
arg[0]=Tensor(shape=(..., 128), ...) # ✅ expected dimension
arg[1]=Tensor(shape=(..., 64), ...) # ❌ mismatch
```
This often points to head-dim, hidden-dim, or cache-layout mismatch rather than a random CUDA failure.
### NaN or Inf
Use:
```bash
export SGLANG_KERNEL_API_LOGLEVEL=5
```
Check:
- `min`
- `max`
- `mean`
- `nan_count`
- `inf_count`
Typical bad pattern:
```text
Tensor(
...
min=-1234567.000000 # ❌ suspiciously large
max=9876543.000000 # ❌ suspiciously large
mean=nan # ❌ bad
nan_count=128 # ❌ found NaNs
inf_count=0 # ✅ no Infs here
)
```
This usually means the bad values were already present before the crashing kernel.
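To find where the bad values first appear, scan the level-5 statistics in call order and stop at the first call that reports NaN or Inf. The sketch below assumes the log has already been parsed into a list of dictionaries; that representation, the `first_bad_call` helper, and the API names are all hypothetical.

```python
def first_bad_call(calls):
    """Return the name of the earliest call whose tensors contain NaN/Inf, or None."""
    for call in calls:
        if call["nan_count"] > 0 or call["inf_count"] > 0:
            return call["api"]
    return None


# Hypothetical parsed statistics, in the order the calls were logged.
calls = [
    {"api": "op_a", "nan_count": 0, "inf_count": 0},
    {"api": "op_b", "nan_count": 128, "inf_count": 0},
    {"api": "op_c", "nan_count": 128, "inf_count": 0},
]
print(first_bad_call(calls))  # prints op_b
```

If the crashing kernel is `op_c` in this example, the investigation should start at `op_b`, where the NaNs were first observed.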
### Out of Memory
Use:
```bash
export SGLANG_KERNEL_API_LOGLEVEL=3
```
Check:
- Unexpectedly large tensor shapes
- Batch size
- Sequence length
- Frame count or image resolution in diffusion workloads
Also check whether a supposedly per-token or per-frame tensor accidentally became full-sequence or full-image sized.
Typical bad pattern:
```text
Tensor(
shape=(1024, 8192, 128, 128) # ❌ way too large
...
)
```
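A quick size estimate shows why a shape like this cannot fit on a GPU. The sketch below assumes 2 bytes per element (for example `bfloat16`); the log excerpt above does not state the dtype, so treat that as an assumption.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element):
    """Bytes needed for a dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


size = tensor_bytes((1024, 8192, 128, 128), 2)  # assumed 2 bytes per element
print(size / 2**30)  # size in GiB; prints 256.0
```

At 256 GiB for a single tensor, the problem is the shape itself rather than allocator fragmentation or batch size tuning.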
### Example: Spot a Shape Bug from the Log
Suppose the failing API log looks like this:
```text
[2026-03-19 00:47:30] SGLang Kernel API Call: RotaryEmbedding.forward
Positional input arguments:
arg[0]=Tensor(shape=(1, 8), dtype=torch.int64, ...)
arg[1]=Tensor(shape=(1, 8, 8, 256), dtype=torch.bfloat16, ...) # ✅ query
arg[2]=Tensor(shape=(1, 8, 4, 64), dtype=torch.bfloat16, ...) # ❌ key head_dim mismatch
```
What this tells you:
- ✅ positions look reasonable
- ✅ query looks plausible
- ❌ key last dimension is inconsistent with the expected rotary/head dimension
That usually means the bug is in projection layout, head packing, or cache format rather than in the rotary kernel itself.
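The same comparison can be done mechanically once the shapes are copied out of the log. This is a minimal sketch; `last_dim_mismatch` is a hypothetical helper, and it assumes query and key are expected to share the same trailing dimension.

```python
def last_dim_mismatch(query_shape, key_shape):
    """Return (q_dim, k_dim) if the trailing dimensions differ, else None."""
    q_dim, k_dim = query_shape[-1], key_shape[-1]
    return (q_dim, k_dim) if q_dim != k_dim else None


# Shapes taken from the log excerpt above.
print(last_dim_mismatch((1, 8, 8, 256), (1, 8, 4, 64)))  # prints (256, 64)
```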
## Step 7: Combine with compute-sanitizer
For harder bugs, combine kernel API logging with CUDA memory checking:
```bash
export SGLANG_KERNEL_API_LOGLEVEL=3
export SGLANG_KERNEL_API_LOGDEST=debug.log
compute-sanitizer --tool memcheck python3 /tmp/sglang_llm_crash.py
```
Use `debug.log` to see the exact inputs that reached the crashing API boundary.
Typical `compute-sanitizer` output:
```text
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
========= at 0x1234 in SomeKernel
========= by thread (256,0,0) in block (10,0,0)
========= Address 0x... is out of bounds
```
Use the sanitizer output to identify the failing kernel and use `debug.log` to identify the exact tensors that reached the API boundary right before it.
If you need more synchronous host-side error reporting, you can try `CUDA_LAUNCH_BLOCKING=1` as a separate follow-up experiment. It is not part of the default workflow because it changes execution timing and can hide concurrency-related behavior.
## Step 8: Combine with cuda-gdb
For crashes that need a stack trace instead of only memory diagnostics:
```bash
export SGLANG_KERNEL_API_LOGLEVEL=3
export SGLANG_KERNEL_API_LOGDEST=debug.log
cuda-gdb --args python3 /tmp/sglang_llm_crash.py
```
Inside `cuda-gdb`:
```text
(cuda-gdb) run
(cuda-gdb) where
```
Then correlate the backtrace with `debug.log`.
## Step 9: Kernel-Level Debugging with printf()
When you own the CUDA kernel, `printf()` is still useful for narrowing down bad indices, bad launch geometry, or broken state propagation.
Basic pattern:
```cpp
__global__ void MyKernel(const float* input, float* output, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Print once per launch: only thread 0 of block 0.
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    printf("n=%d input0=%f\n", n, input[0]);
  }
  if (idx < n) {
    output[idx] = input[idx] * 2.0f;
  }
}
```
After launch, force the output to flush:
```python
my_kernel(...)
torch.cuda.synchronize()
```
For warp-specialized kernels, do not blindly print only on `threadIdx.x == 0`. Pick one representative thread per warp or per specialization group instead.
### Warp-Specialized Kernels: Choosing the Right Print Thread
Problem:
- `threadIdx.x == 0` only prints from the first warp in the block
- for warp-specialized kernels, that often misses the warp or group that is actually wrong
Better pattern:
```cpp
__global__ void WarpSpecializedKernel(...) {
  // Example: first lane of each warp
  if ((threadIdx.x % 32) == 0) {
    printf("warp=%d\n", threadIdx.x / 32);
  }
}
```
Or, if the kernel is organized in larger specialization groups, print once per group instead of once per block.
Common mistake:
```cpp
// Only warp 0 prints
if (threadIdx.x == 0) {
  printf("warp=%d\n", threadIdx.x / 32);
}
```
### Quick Reference
| Kernel Type | Print Condition | Notes |
|----------|----------|-------------|
| Simple kernel | `threadIdx.x == 0` | One thread per block is usually enough |
| Warp-specialized kernel | one representative lane per warp | e.g. `threadIdx.x % 32 == 0` |
| Group-specialized kernel | one representative lane per group | choose based on the kernel's scheduling layout |
### Other Kernel Debugging Tools
```cpp
assert(value >= 0.0f && "value must be non-negative");
static_assert(BLOCK_SIZE % 32 == 0, "BLOCK_SIZE must be warp aligned");
```
## Environment Variables Reference
| Variable | Values | Description |
|----------|--------|-------------|
| `SGLANG_KERNEL_API_LOGLEVEL` | `0` | No logging (default) |
| | `1` | Function names only |
| | `3` | Inputs and outputs with metadata |
| | `5` | Level 3 plus tensor statistics |
| | `10` | Level 5 plus crash-safe tensor dumps |
| `SGLANG_KERNEL_API_LOGDEST` | `stdout` | Log to stdout |
| | `stderr` | Log to stderr |
| | `<path>` | Log to file |
| | `log_%i.txt` | `%i` expands to process ID |
| `SGLANG_KERNEL_API_DUMP_DIR` | `<path>` | Directory for level-10 dumps |
| `SGLANG_KERNEL_API_DUMP_INCLUDE` | wildcard list | Only dump matching API names |
| `SGLANG_KERNEL_API_DUMP_EXCLUDE` | wildcard list | Skip matching API names |
## Best Practices
### 1. Start with Level 3
```bash
export SGLANG_KERNEL_API_LOGLEVEL=3
```
Level 3 is usually enough to catch wrong shapes, wrong dtypes, and wrong devices.
### 2. Use Level 5 for Numerical Issues
```bash
export SGLANG_KERNEL_API_LOGLEVEL=5
```
Use it when you suspect NaN or Inf values.
### 3. Use Level 10 for Crash Reproduction
```bash
export SGLANG_KERNEL_API_LOGLEVEL=10
```
This is the most useful mode when the process crashes before you can inspect live tensors.
If you need successful input/output dumps from a real model run, temporarily disable CUDA graph for that debug session.
When level 10 is too noisy, pair it with `SGLANG_KERNEL_API_DUMP_INCLUDE` / `SGLANG_KERNEL_API_DUMP_EXCLUDE` instead of dumping every covered API.
### 4. Log to File for Crashes
```bash
export SGLANG_KERNEL_API_LOGDEST=crash.log
```
File logs are safer than stdout when the process aborts.
### 5. Disable Logging in Production
```bash
unset SGLANG_KERNEL_API_LOGLEVEL
```
When disabled, the decorator returns the original callable and adds no runtime logging overhead.
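The zero-overhead claim follows from a common decorator pattern: decide once, at decoration time, whether to wrap at all. The sketch below illustrates that pattern under the assumption that the level is read from `SGLANG_KERNEL_API_LOGLEVEL` when the decorator is applied; `log_api_call` is a hypothetical name, not the actual SGLang decorator.

```python
import functools
import os


def log_api_call(func):
    """Wrap func with call logging only when logging is enabled (hypothetical)."""
    level = int(os.environ.get("SGLANG_KERNEL_API_LOGLEVEL", "0"))
    if level == 0:
        # Logging disabled: hand back the original callable untouched.
        return func

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"SGLang Kernel API Call: {func.__qualname__}")
        return func(*args, **kwargs)

    return wrapper
```

Because the disabled path returns `func` itself, there is no extra frame, branch, or environment lookup on each call.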
## Troubleshooting
### No Logs Appear
Check:
1. `echo $SGLANG_KERNEL_API_LOGLEVEL`
2. `echo $SGLANG_KERNEL_API_LOGDEST`
3. Whether the failing path goes through a covered API boundary
### Too Much Output
Reduce the level:
```bash
export SGLANG_KERNEL_API_LOGLEVEL=3
```
### Statistics Are Skipped During CUDA Graph Capture
If you see:
```text
statistics=[skipped: CUDA graph capture in progress]
```
That is expected. Level-5 statistics are intentionally skipped during CUDA graph capture to avoid synchronization side effects.
### Tensor Dumps Are Skipped During CUDA Graph Capture
If you see:
```text
Tensor dump skipped: CUDA graph capture in progress
```
That is also expected. Level-10 dumps require copying tensors to CPU, which is not allowed during CUDA graph capture.


@@ -0,0 +1,141 @@
---
name: generate-profile
description: Generate an e2e profiling trace of an SGLang server run. Launches a server, validates accuracy, captures a Chrome-compatible trace, and returns the profile path.
---
# Generate an E2E Profile of an SGLang Server Run
This skill launches an SGLang server, validates it with a quick accuracy test, generates a profiling trace, and returns the profile file path.
## Prerequisites
- A working SGLang installation (`pip install -e .` or equivalent)
- At least one available CUDA GPU
## Step-by-step Workflow
### Step 1: Launch the server
```bash
CUDA_VISIBLE_DEVICES=<gpu_id> sglang serve --model-path <model> --port <port> &
```
- Default model: `Qwen/Qwen3-8B` (good balance of speed and quality)
- Default port: `30000`
- The server runs in the background. Save the PID for cleanup.
- Use the GPU specified by the user's preferences (check memory files for GPU preferences).
### Step 2: Wait for server readiness
Poll the health endpoint until the server is ready:
```bash
for i in $(seq 1 120); do
  if curl -s http://127.0.0.1:<port>/health 2>/dev/null | grep -q "ok\|healthy"; then
    echo "Server ready"
    break
  fi
  sleep 5
done
```
The server prints **"The server is fired up and ready to roll!"** to stdout when ready. The health endpoint returns 200 once the server can accept requests.
Typical startup time: 30-90 seconds depending on model size and whether CUDA graphs are being compiled.
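When the workflow is driven from Python rather than a shell, the same polling loop can be written with the standard library. This is a sketch only: `http_health_probe` and `wait_until_ready` are hypothetical helpers, and the probe relies on the HTTP 200 status described above rather than on the response body.

```python
import time
import urllib.error
import urllib.request


def http_health_probe(url, timeout=2.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts=120, delay=5.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False
```

A typical call would be `wait_until_ready(lambda: http_health_probe("http://127.0.0.1:30000/health"))`, which mirrors the 120 attempts at 5-second intervals used in the shell loop.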
### Step 3: Validate accuracy (sanity check)
```bash
python3 -m sglang.test.few_shot_gsm8k --num-q 20
```
- Expected accuracy: **> 0.8** for capable models (Qwen3-8B, Llama-3.1-8B-Instruct, etc.)
- This is a quick sanity check, not a rigorous benchmark.
- If accuracy is unexpectedly low, something is wrong — do not proceed to profiling.
### Step 4: Generate the profile
```bash
python3 -m sglang.test.send_one --profile
```
This command:
1. Sends a request to the server
2. Triggers the profiler for 5 steps (default)
3. Writes a trace directory under `/tmp/<timestamp>/`

The trace directory contains:
- `<timestamp>-TP-0.trace.json.gz` — Chrome trace format (open in `chrome://tracing` or Perfetto)
- `server_args.json` — the server configuration used
**Output format:**
```
Dump profiling traces to /tmp/<timestamp>
```
The profile path is printed to stdout. Parse it from the output.
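Parsing the path can be as simple as a regular expression over the captured stdout. The sketch below assumes the output line has exactly the form shown above; `extract_profile_dir` is a hypothetical helper.

```python
import re

PROFILE_LINE = re.compile(r"Dump profiling traces to (\S+)")


def extract_profile_dir(stdout_text):
    """Return the trace directory printed by the profiler, or None if absent."""
    match = PROFILE_LINE.search(stdout_text)
    return match.group(1) if match else None


sample = "some earlier output\nDump profiling traces to /tmp/1773999986.4769795\n"
print(extract_profile_dir(sample))  # prints /tmp/1773999986.4769795
```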
**Optional flags:**
- `--profile-steps N` — number of profiling steps (default: 5)
- `--profile-by-stage` — profile by stage (prefill/decode separately)
- `--profile-prefix <path>` — custom output prefix
### Step 5: Kill the server
```bash
pkill -9 -f "sglang.launch_server|sglang serve|sglang.srt"
```
Wait a moment and verify no sglang processes remain:
```bash
sleep 2 && pgrep -af "sglang serve" || echo "Server killed"
```
### Step 6: Report the profile path
Return the profile directory path (e.g., `/tmp/1773999986.4769795`) and list its contents so the user knows what files were generated.
## Example Full Run
```bash
# 1. Launch server
source cleanup/bin/activate
CUDA_VISIBLE_DEVICES=1 sglang serve --model-path Qwen/Qwen3-8B --port 30000 &
# 2. Wait for ready
for i in $(seq 1 120); do
  curl -s http://127.0.0.1:30000/health | grep -q "ok" && break
  sleep 5
done
# 3. Accuracy check
python3 -m sglang.test.few_shot_gsm8k --num-q 20
# Expected: Accuracy > 0.8
# 4. Profile
python3 -m sglang.test.send_one --profile
# Output: "Dump profiling traces to /tmp/1773999986.4769795"
# 5. Cleanup
pkill -9 -f "sglang.launch_server|sglang serve|sglang.srt"
sleep 2
# 6. Check output
ls -la /tmp/1773999986.4769795/
# 1773999986.4851577-TP-0.trace.json.gz (Chrome trace)
# server_args.json (server config)
```
## Customization
- **Different port**: Pass `--port <port>` and use `--host 127.0.0.1 --port <port>` for test commands
- **Multi-GPU**: Use `--tp <N>` for tensor parallelism; trace files will be generated per TP rank
- **Longer profile**: Use `--profile-steps 10` for more steps in the trace
- **Stage profiling**: Use `--profile-by-stage` to separate prefill and decode phases
## Viewing the Profile
Open the `.trace.json.gz` file in:
- **Perfetto UI**: https://ui.perfetto.dev/ (drag and drop the file)
- **Chrome tracing**: `chrome://tracing` (load the file)
Both support the gzipped Chrome trace format natively.


@@ -0,0 +1,219 @@
# SGLang Bisect CI Regression
Investigate a consistently failing CI test to find the root cause - whether it's a code regression from a specific PR, a hardware/runner-specific issue, or an environment change. Optionally reproduce the failure on a remote GPU server.
## Slash Command
`/sglang-bisect-ci-regression <test_name_or_ci_url> [ssh_target] [docker_container]`
## When to Use This Skill
- A CI test is failing consistently on main (scheduled runs)
- You need to find which PR introduced a regression
- You suspect a runner-specific or GPU-specific issue
- You want to reproduce a CI failure on a remote server
## Arguments
- **First argument (required)**: Test file name (e.g. `test_lora_tp.py`) or a GitHub Actions job URL
- **Second argument (optional)**: SSH target for remote reproduction (e.g. `user@host`)
- **Third argument (optional)**: Docker container name on the SSH target (e.g. `sglang_dev`)
If SSH target and docker container are not provided, the skill will only perform the CI log analysis and bisection, without remote reproduction. **Ask the user** for these if reproduction is needed and they weren't provided.
## Background: Scheduled CI Runs
SGLang uses the `pr-test.yml` workflow with **scheduled runs** (cron-triggered) to periodically test the `main` branch. These runs are the primary data source for detecting regressions:
- **Workflow**: `pr-test.yml` with `event: schedule`
- **Branch**: `main`
- **Dashboard**: https://github.com/sgl-project/sglang/actions/workflows/pr-test.yml?query=event%3Aschedule
- **Frequency**: Runs multiple times daily, each pinned to the HEAD of `main` at trigger time
- **Purpose**: Catches regressions that slip through PR-level CI (e.g., interaction bugs between merged PRs, hardware-specific issues)
Always use these scheduled runs (not PR-triggered runs) when bisecting regressions on `main`. The `--event schedule` filter in `gh run list` ensures you only see these periodic main-branch runs.
## Workflow
### Phase 1: Extract the Failure Signature
1. **Get the failing test details from CI logs.** If given a URL, fetch logs directly. If given a test name, find recent scheduled runs of `pr-test.yml` on `main` that failed:
```bash
# List recent scheduled runs targeting main (the primary source of truth for regressions)
# These are cron-triggered runs visible at:
# https://github.com/sgl-project/sglang/actions/workflows/pr-test.yml?query=event%3Aschedule
gh run list --repo sgl-project/sglang --workflow="pr-test.yml" --event schedule --branch main --limit 20 --json databaseId,conclusion,createdAt,headSha
# Find the job containing the test
gh run view {RUN_ID} --repo sgl-project/sglang --json jobs --jq '.jobs[] | select(.conclusion == "failure") | {name, conclusion, databaseId}'
# Get the failure details
gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E -B 5 -A 30 "AssertionError|FAIL|Error|{TEST_NAME}"
```
2. **Record the failure signature:**
- Exact error message and assertion
- Affected test method name
- Model/config involved
- Numeric values (e.g., tolerance diffs, scores)
- Whether the failure is deterministic (same values across runs)
### Phase 2: Temporal Bisection
3. **Find the boundary between passing and failing runs.** Walk through the scheduled run history (from the `pr-test.yml` schedule runs on `main`) to identify:
- Last known PASSING run (sha + date)
- First known FAILING run (sha + date)
```bash
# For each scheduled run, check the specific partition/job status
gh run view {RUN_ID} --repo sgl-project/sglang --json jobs --jq '.jobs[] | select(.name == "{JOB_NAME}") | {conclusion, databaseId}'
# Verify a specific test passed or failed in a run
gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "{TEST_NAME}|PASSED|FAILED|logprobs mismatch" | head -10
```
4. **List commits between the boundary:**
```bash
git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA}
```
5. **Filter for relevant commits** that touch files related to the failing test (model layers, kernels, test utilities, etc.):
```bash
git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA} -- {relevant_paths}
```
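The boundary search in this phase can be sketched as a scan over the run history. The example assumes the runs were exported with `gh run list --json headSha,conclusion` and sorted oldest to newest; `find_boundary` is a hypothetical helper, and runs with other conclusions (such as `cancelled`) are simply skipped.

```python
def find_boundary(runs):
    """Return (last_pass_sha, first_fail_sha) for the current failure streak.

    runs: list of (sha, conclusion) tuples ordered oldest to newest.
    """
    first_fail = None
    for sha, conclusion in reversed(runs):
        if conclusion == "failure":
            first_fail = sha  # keep moving the boundary back in time
        elif conclusion == "success":
            return sha, first_fail
    return None, first_fail


runs = [
    ("a1", "success"),
    ("b2", "success"),
    ("c3", "failure"),
    ("d4", "cancelled"),
    ("e5", "failure"),
]
print(find_boundary(runs))  # prints ('b2', 'c3')
```

The two SHAs returned are the values to plug into `git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA}`.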
### Phase 3: Runner/Hardware Analysis
6. **Check if the failure is runner-specific.** Extract the runner identity from each failing and passing run:
```bash
# Get runner name and machine
gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "Runner name|Machine name" | head -5
# Get GPU/driver info
gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -i -E "NVIDIA-SMI|Driver Version|CUDA Version" | head -5
# Get package versions
gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "sgl.kernel.*==|flashinfer.*==" | head -5
```
7. **Correlate runners with pass/fail outcomes.** Build a table:
| Run ID | Date | Runner | GPU Type | Driver | Result |
|--------|------|--------|----------|--------|--------|
If all failures map to a specific runner type/GPU and all passes map to another, the issue is **hardware-specific**, not a code regression.
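Once the table is filled in, the correlation rule above can be checked mechanically. This sketch assumes each row has been reduced to a `(gpu_type, result)` pair with results `"PASS"` or `"FAIL"`; the helper names and GPU labels are illustrative only.

```python
from collections import defaultdict


def outcomes_by_gpu(rows):
    """Map each GPU type to the set of results observed on it."""
    table = defaultdict(set)
    for gpu_type, result in rows:
        table[gpu_type].add(result)
    return dict(table)


def looks_hardware_specific(rows):
    """True when some GPU types only fail, others only pass, and none are mixed."""
    table = outcomes_by_gpu(rows)
    only_fail = [g for g, res in table.items() if res == {"FAIL"}]
    only_pass = [g for g, res in table.items() if res == {"PASS"}]
    mixed = [g for g, res in table.items() if len(res) > 1]
    return bool(only_fail) and bool(only_pass) and not mixed


rows = [("H200", "FAIL"), ("H200", "FAIL"), ("H100", "PASS"), ("H100", "PASS")]
print(looks_hardware_specific(rows))  # prints True
```

A `False` result does not prove a code regression; it only means the runner split does not explain the failures on its own.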
### Phase 4: Code Analysis
8. **If a code regression is suspected** (failures not runner-specific), examine the candidate commits:
- Read the changed files
- Understand how the changes could affect the failing test
- Look for prefill-vs-decode differences, TP-specific paths, kernel changes
9. **If a hardware issue is suspected**, analyze:
- Kernel compatibility (CUDA compute capability)
- Driver version differences
- All-reduce / NCCL behavior differences
- CUDA graph capture differences across GPU architectures
### Phase 5: Remote Reproduction (Optional)
Only if SSH target and docker container were provided.
10. **Verify the remote environment:**
```bash
ssh {SSH_TARGET} "docker exec {CONTAINER} nvidia-smi --query-gpu=name,driver_version --format=csv"
ssh {SSH_TARGET} "docker exec {CONTAINER} pip show sgl-kernel sglang flashinfer-python 2>&1 | grep -E 'Name:|Version:'"
```
11. **Ensure latest code is installed.** If the container is stale, update:
```bash
# Try fetching latest main
ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && git fetch origin main && git checkout origin/main'"
# Or download and install from tarball if git auth fails
ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /tmp && curl -L https://github.com/sgl-project/sglang/archive/refs/heads/main.tar.gz | tar xz && cd sglang-main && pip install -e \"python[all]\"'"
# Reinstall (after git fetch)
ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && pip install -e \"python[all]\"'"
# Install test dependencies if needed
ssh {SSH_TARGET} "docker exec {CONTAINER} pip install peft rouge-score"
```
12. **Create a minimal reproduction script** that:
- Uses `if __name__ == '__main__'` with `mp.set_start_method("spawn")`
- Runs the specific failing test configuration
- Prints key metrics (diffs, scores, outputs)
- Exits with code 1 on failure
13. **Copy and run the reproduction script:**
```bash
scp /tmp/repro_script.py {SSH_TARGET}:/tmp/
ssh {SSH_TARGET} "docker cp /tmp/repro_script.py {CONTAINER}:/tmp/"
ssh {SSH_TARGET} "docker exec -e CUDA_VISIBLE_DEVICES=0,1 {CONTAINER} python3 /tmp/repro_script.py"
```
14. **Run control experiments** to isolate the variable:
- If suspecting TP issue: run with TP=1 as control
- If suspecting GPU issue: compare same code on different GPU
- If suspecting a specific commit: test before/after that commit
### Phase 6: Report
15. **Produce a structured report:**
```markdown
## CI Regression Bisection Report
### Failure Signature
- **Test**: {test_file}::{test_method}
- **Error**: {exact error message}
- **Key metrics**: {numeric values}
- **Deterministic**: Yes/No
### Root Cause Classification
One of:
- **Code Regression**: PR #{number} introduced the bug
- **Hardware-Specific**: Fails on {GPU_TYPE}, passes on others
- **Environment Change**: New runner/driver/package version
- **Pre-existing Flakiness**: Intermittent, not a new regression
### Evidence
| Condition | Result |
|-----------|--------|
| {condition1} | PASS/FAIL |
| {condition2} | PASS/FAIL |
### Timeline
- {date}: Last known pass ({sha}, {runner})
- {date}: First known fail ({sha}, {runner})
- {date}: Confirmed reproduction on {server}
### Recommended Fix
- **Short-term**: {workaround}
- **Long-term**: {proper fix}
```
## Key Patterns to Recognize
| Pattern | Diagnosis |
|---------|-----------|
| Same SHA passes on runner A, fails on runner B | Hardware/runner-specific |
| All runners fail after commit X | Code regression from commit X |
| Intermittent - same runner sometimes passes/fails | Flaky test or race condition |
| Prefill OK but decode fails | TP/all-reduce issue in decode path |
| Works with TP=1, fails with TP>1 | Tensor parallelism bug |
| Exact same numeric diff every time | Deterministic bug, not flakiness |
## Important Notes
- **Always check runner identity** before concluding it's a code regression. Many "consistent" failures are actually runner-specific.
- **Test partition assignments change over time** as tests are added/removed. A test may move between partitions, landing on different runner types.
- **H200 runners** use `/root/actions-runner/` path and machine names like `gpu-h200-worker-*`. Non-H200 runners use `/public_sglang_ci/runner-*` paths.
- When running remote reproduction, use `run_in_background` for long-running tests and check output with `TaskOutput`.
- Container environments may be stale - always verify package versions match CI before drawing conclusions.


@@ -0,0 +1,444 @@
---
name: write-sglang-test
description: Guide for writing SGLang CI/UT tests. Covers CustomTestCase, CI registration, server fixtures, model selection, mock testing, and test placement. Always read test/README.md for the full CI layout, how to run tests, and extra tips. Use when creating new tests, adding CI test cases, writing unit tests, or when the user asks to add tests for SGLang features.
---
# Writing SGLang CI / UT Tests
This skill covers **how to write and register tests**. For CI pipeline internals (stage ordering, fast-fail, gating, partitioning, debugging CI failures), see the [CI workflow guide](../ci-workflow-guide/SKILL.md).
## Core Rules
1. **Always use `CustomTestCase`** — never raw `unittest.TestCase`. It ensures `tearDownClass` runs even when `setUpClass` fails, preventing resource leaks in CI.
2. **`tearDownClass` must be defensive** — use `hasattr`/null checks before accessing resources (e.g. `cls.process`) that `setUpClass` may not have finished allocating.
3. **Place tests in `test/registered/<category>/`** — except JIT kernel tests and benchmarks, which live in `python/sglang/jit_kernel/tests/` and `python/sglang/jit_kernel/benchmark/` (nested subfolders are allowed)
4. **Reuse server fixtures** — inherit from `DefaultServerBase` or write `setUpClass`/`tearDownClass` with `popen_launch_server`
5. **Prefer mock over real server** — when testing logic that doesn't need a server / engine launch (middleware, request routing, config validation, argument parsing), use `unittest.mock.patch` / `MagicMock` and place tests in `test/registered/unit/`. Only launch a real server when the test genuinely needs inference results or server lifecycle behavior.
JIT kernel exception:
- If the task is adding or updating code under `python/sglang/jit_kernel/`, prefer the `add-jit-kernel` skill first.
- JIT kernel correctness tests use `python/sglang/jit_kernel/tests/**/test_*.py`.
- JIT kernel benchmarks use `python/sglang/jit_kernel/benchmark/**/bench_*.py`.
- Those files are still executed by `test/run_suite.py`, but through dedicated kernel suites rather than `test/registered/`.
---
## Model & Backend Selection
| Scenario | Model | CI Registration | Suite |
|----------|-------|-----------------|-------|
| **Unit tests** (no server / engine launch) | None | `register_cpu_ci` (preferred) or `register_cuda_ci` | `stage-a-test-cpu` or `stage-b-test-1-gpu-small` |
| **Common / backend-independent** (middleware, abort, routing, config, arg parsing) | `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` (1B) | `register_cuda_ci` only | `stage-b-test-1-gpu-small` |
| **Model-agnostic functionality** (sampling, session, OpenAI API features) | `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` (1B) | `register_cuda_ci` (+ AMD if relevant) | `stage-b-test-1-gpu-small` |
| **General performance** (single node, no spec/DP/parallelism) | `DEFAULT_MODEL_NAME_FOR_TEST` (8B) | `register_cuda_ci` | `stage-b-test-1-gpu-large` |
| **Bigger features** (spec, DP, TP, disaggregation) | Case by case | Case by case | See suite table below |
**Key principle for E2E tests**: Do NOT add `register_amd_ci` unless the test specifically exercises AMD/ROCm code paths. Common E2E tests just need any GPU to run — duplicating across backends wastes CI time with no extra coverage.
### All model constants
Defined in `python/sglang/test/test_utils.py`:
| Constant | Model | When to use |
|----------|-------|-------------|
| `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` | Llama-3.2-1B-Instruct | Common features, model-agnostic tests |
| `DEFAULT_SMALL_MODEL_NAME_FOR_TEST_BASE` | Llama-3.2-1B | Base (non-instruct) model tests |
| `DEFAULT_MODEL_NAME_FOR_TEST` | Llama-3.1-8B-Instruct | General performance (single node) |
| `DEFAULT_MOE_MODEL_NAME_FOR_TEST` | Mixtral-8x7B-Instruct | MoE-specific tests |
| `DEFAULT_SMALL_EMBEDDING_MODEL_NAME_FOR_TEST` | — | Embedding tests |
| `DEFAULT_SMALL_VLM_MODEL_NAME_FOR_TEST` | — | Vision-language tests |
### Naming Conventions
- **Suite**: `stage-{a,b,c}-test-{gpu_count}-gpu-{hardware}` (e.g., `stage-b-test-1-gpu-small`)
- **CI runner**: `{gpu_count}-gpu-{hardware}` (e.g., `1-gpu-5090`, `4-gpu-h100`, `8-gpu-h200`)
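The suite naming pattern is regular enough to parse when scripting over suite names. The sketch below is an assumption-laden illustration: `parse_suite` is a hypothetical helper, and names that do not follow the `stage-…-test-…-gpu-…` pattern (for example `stage-a-test-cpu` or the kernel suites) return `None`.

```python
import re

SUITE_RE = re.compile(r"^stage-(?P<stage>[abc])-test-(?P<gpus>\d+)-gpu-(?P<hardware>.+)$")


def parse_suite(name):
    """Split a suite name into (stage, gpu_count, hardware), or None if it does not match."""
    match = SUITE_RE.match(name)
    if match is None:
        return None
    return match.group("stage"), int(match.group("gpus")), match.group("hardware")


print(parse_suite("stage-b-test-1-gpu-small"))  # prints ('b', 1, 'small')
print(parse_suite("stage-c-test-8-gpu-h200"))   # prints ('c', 8, 'h200')
print(parse_suite("stage-a-test-cpu"))          # prints None
```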
### All CI Suites
#### Per-commit (CUDA)
| Suite | Runner (label) | Description |
|-------|----------------|-------------|
| `stage-a-test-1-gpu-small` | `1-gpu-5090` | Quick checks on a small NVIDIA GPU before heavier stages |
| `stage-a-test-cpu` | `ubuntu-latest` | CPU-only unit tests |
| `stage-b-test-1-gpu-small` | `1-gpu-5090` | Core engine tests that fit a 5090-class card |
| `stage-b-test-1-gpu-large` | `1-gpu-h100` | Tests that need H100-class memory or kernels (e.g. FA3) |
| `stage-b-test-2-gpu-large` | `2-gpu-h100` | Two-GPU correctness and parallelism (TP/PP) on H100 |
| `stage-b-test-4-gpu-b200` | `4-gpu-b200` | Early Blackwell coverage (SM100+ paths) on four GPUs |
| `stage-b-kernel-unit-1-gpu-large` | `1-gpu-h100` | JIT kernel correctness tests under `python/sglang/jit_kernel/tests/` |
| `stage-b-kernel-unit-8-gpu-h200` | `8-gpu-h200` | Multi-GPU JIT kernel correctness tests under `python/sglang/jit_kernel/tests/` |
| `stage-b-kernel-benchmark-1-gpu-large` | `1-gpu-h100` | JIT kernel benchmark files under `python/sglang/jit_kernel/benchmark/` |
| `stage-c-test-4-gpu-h100` | `4-gpu-h100` | Large 4-GPU H100 integration and scaling tests |
| `stage-c-test-8-gpu-h200` | `8-gpu-h200` | Large 8-GPU H200 runs for big models and parallelism |
| `stage-c-test-8-gpu-h20` | `8-gpu-h20` | Large 8-GPU H20 runs for big models |
| `stage-c-test-deepep-4-gpu-h100` | `4-gpu-h100` | DeepEP expert-parallel and networking on four H100s |
| `stage-c-test-deepep-8-gpu-h200` | `8-gpu-h200` | DeepEP at 8-GPU H200 scale |
| `stage-c-test-8-gpu-b200` | `8-gpu-b200` | 8-GPU B200 suite (registered but not yet wired to a workflow) |
| `stage-c-test-4-gpu-b200` | `4-gpu-b200` | 4-GPU B200 suite for large models on Blackwell |
| `stage-c-test-4-gpu-gb200` | `4-gpu-gb200` | 4-GPU GB200 suite for large models on Grace Blackwell |
#### Per-commit (AMD)
| Suite | Runner (label) | Description |
|-------|----------------|-------------|
| `stage-a-test-1-gpu-small-amd` | `linux-mi325-1gpu-sglang` | Quick checks on one MI325-class GPU |
| `stage-b-test-1-gpu-small-amd` | `linux-mi325-1gpu-sglang` | Core 1-GPU AMD tests (14 partitions) |
| `stage-b-test-1-gpu-small-amd-nondeterministic` | `linux-mi325-1gpu-sglang` | Non-deterministic 1-GPU AMD tests |
| `stage-b-test-1-gpu-small-amd-mi35x` | `linux-mi35x-gpu-1` | 1-GPU tests on MI35x hardware |
| `stage-b-test-1-gpu-large-amd` | `linux-mi325-1gpu-sglang` | Large 1-GPU AMD tests (2 partitions) |
| `stage-b-test-2-gpu-large-amd` | `linux-mi325-2gpu-sglang` | 2-GPU ROCm correctness and parallel setups |
| `stage-b-test-large-8-gpu-35x-disaggregation-amd` | `linux-mi35x-gpu-8.fabric` | PD disaggregation and RDMA on 8×MI35x fabric |
| `stage-c-test-4-gpu-amd` | `linux-mi325-4gpu-sglang` | 4-GPU AMD integration (2 partitions) |
| `stage-c-test-large-8-gpu-amd` | `linux-mi325-8gpu-sglang` | 8-GPU MI325 scaling and integration |
| `stage-c-test-large-8-gpu-amd-mi35x` | `linux-mi35x-gpu-8` | 8-GPU MI35x scaling (2 partitions) |
#### Per-commit (Ascend NPU)
| Suite | Runner (label) | Description |
| --- | --- | --- |
| `per-commit-1-npu-a2` | `linux-aarch64-a2-1` | 1-NPU LLM CI machine |
| `per-commit-2-npu-a2` | `linux-aarch64-a2-2` | 2-NPU LLM CI machine |
| `per-commit-4-npu-a3` | `linux-aarch64-a3-4` | 4-NPU LLM CI machine |
| `per-commit-16-npu-a3` | `linux-aarch64-a3-16` | 16-NPU LLM CI machine |
| `multimodal-gen-test-1-npu-a3` | `linux-aarch64-a3-2` | 1-NPU multimodal CI machine |
| `multimodal-gen-test-2-npu-a3` | `linux-aarch64-a3-16` | 2-NPU multimodal CI machine |
| `multimodal-gen-test-8-npu-a3` | `linux-aarch64-a3-16` | 8-NPU multimodal CI machine |
#### Nightly
Nightly suites are listed in `NIGHTLY_SUITES` in [`test/run_suite.py`](../../../test/run_suite.py). They run via `nightly-test-nvidia.yml`, `nightly-test-amd.yml`, and `nightly-test-npu.yml`, not `pr-test.yml`. Examples:
- `nightly-1-gpu` (CUDA)
- `nightly-kernel-1-gpu` (CUDA, JIT kernel full grids)
- `nightly-kernel-8-gpu-h200` (CUDA, multi-GPU JIT kernel nightly)
- `nightly-8-gpu-h200` (CUDA)
- `nightly-eval-vlm-2-gpu` (CUDA)
- `nightly-amd` (AMD)
- `nightly-amd-8-gpu-mi35x` (AMD)
- `nightly-1-npu-a3` (NPU)
- `nightly-2-npu-a3` (NPU)
- `nightly-4-npu-a3` (NPU)
- `nightly-8-npu-a3` (NPU)
- `nightly-16-npu-a3` (NPU)
> **Note**: Multimodal diffusion uses `python/sglang/multimodal_gen/test/run_suite.py`, not `test/run_suite.py`.
### Choosing a Suite
Use the lightest suite that meets your test's needs:
- **No GPU required** → `stage-a-test-cpu`
- **Most small GPU tests** → `stage-b-test-1-gpu-small` (default choice)
- **Need H100 memory or Hopper features** → `stage-b-test-1-gpu-large`
- **JIT kernel correctness** → `stage-b-kernel-unit-1-gpu-large`
- **JIT kernel benchmarks** → `stage-b-kernel-benchmark-1-gpu-large`
- **Multi-GPU** → only when the test actually needs multiple GPUs
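The single-GPU part of this decision list can be written down as a small function, which is sometimes handy in review tooling. This is a sketch only: `choose_suite` and its parameters are hypothetical, and multi-GPU suites are deliberately left out because they are decided case by case.

```python
def choose_suite(needs_gpu, needs_h100=False, jit_kernel=None):
    """Pick the lightest single-GPU (or CPU) suite for a test.

    jit_kernel: None, "correctness", or "benchmark" (hypothetical parameter).
    """
    if jit_kernel == "correctness":
        return "stage-b-kernel-unit-1-gpu-large"
    if jit_kernel == "benchmark":
        return "stage-b-kernel-benchmark-1-gpu-large"
    if not needs_gpu:
        return "stage-a-test-cpu"
    if needs_h100:
        return "stage-b-test-1-gpu-large"
    return "stage-b-test-1-gpu-small"  # default choice for small GPU tests


print(choose_suite(needs_gpu=False))                  # prints stage-a-test-cpu
print(choose_suite(needs_gpu=True))                   # prints stage-b-test-1-gpu-small
print(choose_suite(needs_gpu=True, needs_h100=True))  # prints stage-b-test-1-gpu-large
```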
---
## Test File Templates
### Unit Tests (no server / engine launch)
See `test/registered/unit/README.md` for quick-start and rules. Unit tests live in `test/registered/unit/`, mirroring `python/sglang/srt/`:
```python
"""Unit tests for srt/<module>"""

import unittest
from unittest.mock import MagicMock, patch

from sglang.srt.<module> import TargetClass
from sglang.test.ci.ci_register import register_cpu_ci
from sglang.test.test_utils import CustomTestCase

register_cpu_ci(est_time=5, suite="stage-a-test-cpu")
# Prefer CPU. Only use register_cuda_ci when the test truly needs a GPU.


class TestTargetClass(CustomTestCase):
    def test_basic_behavior(self):
        obj = TargetClass(...)
        self.assertEqual(obj.method(), expected)

    @patch("sglang.srt.<module>.some_dependency")
    def test_with_mock(self, mock_dep):
        mock_dep.return_value = MagicMock()
        # test logic with dependency mocked
        ...


if __name__ == "__main__":
    unittest.main()
```
Use `unittest.mock.patch` / `MagicMock` to mock dependencies and isolate the logic under test. If the module transitively imports GPU-only packages (e.g. `sgl_kernel`), they can be stubbed so the test runs on CPU CI. See `test/registered/unit/README.md` for details and examples.
**Quality bar** — test real logic (validation boundaries, state transitions, error paths, branching, etc.). Skip tests that just verify Python itself works (e.g., "does calling an abstract method raise `NotImplementedError`?", "does a dataclass store the field I assigned?"). Consolidate repetitive patterns into parameterized tests. No production code changes in test PRs.
### E2E test (small model, server needed)
```python
import unittest

import requests

from sglang.srt.utils import kill_process_tree
from sglang.test.ci.ci_register import register_cuda_ci
from sglang.test.test_utils import (
    DEFAULT_SMALL_MODEL_NAME_FOR_TEST,
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
    DEFAULT_URL_FOR_TEST,
    CustomTestCase,
    popen_launch_server,
)

register_cuda_ci(est_time=60, suite="stage-b-test-1-gpu-small")


class TestMyFeature(CustomTestCase):
    @classmethod
    def setUpClass(cls):
        cls.model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
            other_args=["--arg1", "value1"],  # feature-specific args
        )

    @classmethod
    def tearDownClass(cls):
        if hasattr(cls, "process") and cls.process:
            kill_process_tree(cls.process.pid)

    def test_basic_functionality(self):
        response = requests.post(
            self.base_url + "/generate",
            json={"text": "Hello", "sampling_params": {"max_new_tokens": 32}},
        )
        self.assertEqual(response.status_code, 200)


if __name__ == "__main__":
    unittest.main(verbosity=3)
```
### E2E test (8B model, server needed, performance)
```python
import time
import unittest

import requests

from sglang.srt.utils import kill_process_tree
from sglang.test.ci.ci_register import register_cuda_ci
from sglang.test.test_utils import (
    DEFAULT_MODEL_NAME_FOR_TEST,
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
    DEFAULT_URL_FOR_TEST,
    CustomTestCase,
    popen_launch_server,
)

register_cuda_ci(est_time=300, suite="stage-b-test-1-gpu-large")


class TestMyFeaturePerf(CustomTestCase):
    @classmethod
    def setUpClass(cls):
        cls.model = DEFAULT_MODEL_NAME_FOR_TEST
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
        )

    @classmethod
    def tearDownClass(cls):
        if hasattr(cls, "process") and cls.process:
            kill_process_tree(cls.process.pid)

    def test_latency(self):
        start = time.perf_counter()
        response = requests.post(
            self.base_url + "/generate",
            json={"text": "Hello", "sampling_params": {"max_new_tokens": 128}},
        )
        elapsed = time.perf_counter() - start
        self.assertEqual(response.status_code, 200)
        self.assertLess(elapsed, 5.0, "Latency exceeded threshold")


if __name__ == "__main__":
    unittest.main(verbosity=3)
```
---
## Server Fixture Reuse
For tests that only need a standard server, inherit from `DefaultServerBase` and override class attributes:
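```python
from sglang.test.server_fixtures.default_fixture import DefaultServerBase
from sglang.test.test_utils import DEFAULT_SMALL_MODEL_NAME_FOR_TEST


class TestMyFeature(DefaultServerBase):
    # Override class attributes; the fixture handles server launch and teardown.
    model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST
    other_args = ["--enable-my-feature"]

    def test_something(self):
        ...
```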
Available fixtures in `python/sglang/test/server_fixtures/`:
| Fixture | Use case |
|---------|----------|
| `DefaultServerBase` | Standard single-server tests |
| `EagleServerBase` | EAGLE speculative decoding |
| `PDDisaggregationServerBase` | Disaggregated prefill/decode |
| `MMMUServerBase` | Multimodal VLM tests |
---
## CI Registration
Every CI-discovered test file must call a registration function at module level:
```python
from sglang.test.ci.ci_register import (
    register_cuda_ci,
    register_amd_ci,
    register_cpu_ci,
    register_npu_ci,
)

# Per-commit test (small 1-gpu, runs on 5090)
register_cuda_ci(est_time=80, suite="stage-b-test-1-gpu-small")

# Per-commit test (large 1-gpu, runs on H100)
register_cuda_ci(est_time=120, suite="stage-b-test-1-gpu-large")

# Nightly-only test
register_cuda_ci(est_time=200, suite="nightly-1-gpu", nightly=True)

# Multi-backend test (only when testing backend-specific code paths)
register_cuda_ci(est_time=80, suite="stage-a-test-1-gpu-small")
register_amd_ci(est_time=120, suite="stage-a-test-1-gpu-small-amd")
register_npu_ci(est_time=400, suite="nightly-8-npu-a3", nightly=True)

# Temporarily disabled test
register_cuda_ci(est_time=80, suite="stage-b-test-1-gpu-small", disabled="flaky - see #12345")
```
Parameters:
- `est_time`: estimated runtime in seconds (used for CI partitioning)
- `suite`: which CI suite to run in (see suite tables above)
- `nightly=True`: for nightly-only tests (default `False` = per-commit)
- `disabled="reason"`: temporarily disable with explanation
**Key principle**: Only add `register_amd_ci` / `register_npu_ci` when the test exercises backend-specific code paths. Common E2E tests just need `register_cuda_ci` — duplicating across backends wastes CI time.
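The text says only that `est_time` is used for CI partitioning; the exact algorithm is not shown here. As an illustration of why a realistic estimate matters, here is a sketch of one common approach (greedy longest-first assignment). The function name `partition_by_est_time` and the file names are assumptions made for this example, not the CI runner's actual code.

```python
import heapq
from typing import Dict, List


def partition_by_est_time(est_times: Dict[str, int], num_partitions: int) -> List[List[str]]:
    """Greedy longest-first split: give each file to the currently lightest partition."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Heap entries are (total_seconds, partition_index); the lightest partition pops first.
    heap = [(0, i) for i in range(num_partitions)]
    heapq.heapify(heap)
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    # Longest tests first; ties are broken by name so the output is deterministic.
    for name, seconds in sorted(est_times.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        partitions[idx].append(name)
        heapq.heappush(heap, (total + seconds, idx))
    return partitions


# Hypothetical test files with their declared est_time values (seconds).
example = {"test_a.py": 300, "test_b.py": 120, "test_c.py": 80, "test_d.py": 60}
parts = partition_by_est_time(example, 2)
# parts == [["test_a.py"], ["test_b.py", "test_c.py", "test_d.py"]]  (300 s vs 260 s)
```

Under a scheme like this, a test that declares 60 seconds but actually takes 600 would make its partition far slower than the others, which is why measuring locally is worthwhile.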
### JIT Kernel Registration
JIT kernel files live outside `test/registered/` but still use registration:
```python
from sglang.test.ci.ci_register import register_cuda_ci
# Correctness tests in python/sglang/jit_kernel/tests/
register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large")
register_cuda_ci(est_time=120, suite="stage-b-kernel-unit-8-gpu-h200")
# Benchmarks in python/sglang/jit_kernel/benchmark/
register_cuda_ci(est_time=6, suite="stage-b-kernel-benchmark-1-gpu-large")
# Optional nightly registration
register_cuda_ci(est_time=120, suite="nightly-kernel-1-gpu", nightly=True)
register_cuda_ci(est_time=120, suite="nightly-kernel-8-gpu-h200", nightly=True)
```
Keep `est_time` and `suite` as **literal values**: `run_suite.py` collects them by AST parsing, so the file is never executed during collection.
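To see why literals are required, consider how static collection with Python's `ast` module behaves. This is a simplified sketch, not the actual `run_suite.py` implementation; `collect_registrations` is a hypothetical name, and it assumes registrations are top-level calls with keyword arguments.

```python
import ast
from typing import Any, Dict, List

REGISTER_FUNCS = {"register_cuda_ci", "register_amd_ci", "register_cpu_ci", "register_npu_ci"}


def collect_registrations(source: str) -> List[Dict[str, Any]]:
    """Find top-level register_*_ci(...) calls and read their keyword arguments statically."""
    found = []
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Expr) and isinstance(node.value, ast.Call)):
            continue
        call = node.value
        if not (isinstance(call.func, ast.Name) and call.func.id in REGISTER_FUNCS):
            continue
        entry: Dict[str, Any] = {"func": call.func.id}
        for kw in call.keywords:
            if kw.arg is None:  # skip **kwargs, which cannot be read statically
                continue
            # literal_eval accepts only literals; a variable name raises ValueError.
            entry[kw.arg] = ast.literal_eval(kw.value)
        found.append(entry)
    return found


GOOD = 'register_cuda_ci(est_time=80, suite="stage-b-test-1-gpu-small")\n'
BAD = 'T = 80\nregister_cuda_ci(est_time=T, suite="stage-b-test-1-gpu-small")\n'
```

Parsing `GOOD` yields the registration as plain data, while `BAD` fails because `T` is a name whose value is only known at runtime.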
---
## Test Placement
```
test/
├── registered/ # CI tests (auto-discovered by run_suite.py)
│ ├── unit/ # No server / engine launch (see test/registered/unit/README.md)
│ ├── kernels/ # CUDA kernel correctness (no server, GPU required)
│ ├── sampling/ # test_penalty.py, test_sampling_params.py ...
│ ├── sessions/ # test_session_control.py ...
│ ├── openai_server/ # basic/, features/, validation/ ...
│ ├── spec/ # eagle/, utils/ ...
│ ├── models/ # model-specific accuracy tests
│ ├── perf/ # performance benchmarks
│ └── <category>/ # create new category if needed
├── manual/ # Non-CI: debugging, one-off, manual verification
└── run_suite.py # CI runner (scans registered/ plus jit_kernel test/benchmark files)
python/sglang/jit_kernel/
├── tests/ # JIT kernel correctness tests (CI-discovered by test/run_suite.py)
└── benchmark/ # JIT kernel benchmarks (CI-discovered by test/run_suite.py)
```
**Decision rule** (see also `test/registered/README.md`):
- Component logic, no server → `registered/unit/`
- JIT kernel correctness / benchmarks → `python/sglang/jit_kernel/tests/` or `python/sglang/jit_kernel/benchmark/`
- Other kernel correctness → `registered/kernels/`
- Server needed → `registered/<category>/`
- Local debugging → `manual/`
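The decision rule above can be written down as a small function. This is only a sketch for illustration: `choose_test_dir` and its parameters are hypothetical, and the order in which the conditions are checked is an assumption rather than something the rule states.

```python
def choose_test_dir(
    *,
    local_debugging: bool = False,
    jit_kernel: bool = False,
    benchmark: bool = False,
    other_kernel: bool = False,
    needs_server: bool = False,
    category: str = "<category>",
) -> str:
    """Map the properties of a test to the directory suggested by the decision rule."""
    if local_debugging:
        return "test/manual/"
    if jit_kernel:
        # JIT kernel files live next to the kernels, not under test/registered/.
        sub = "benchmark" if benchmark else "tests"
        return f"python/sglang/jit_kernel/{sub}/"
    if other_kernel:
        return "test/registered/kernels/"
    if needs_server:
        return f"test/registered/{category}/"
    # Component logic with no server or engine launch.
    return "test/registered/unit/"
```

For example, `choose_test_dir(needs_server=True, category="sampling")` returns `test/registered/sampling/`, and calling it with no arguments falls through to the unit-test directory.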
---
## Eval Accuracy Mixins
**Design philosophy**: Most test files don't care about eval logic — they only need a "does this feature break model output quality?" sanity check. The mixin pattern separates **what to test** (threshold) from **how to test** (run_eval, assertions, CI summary). Test classes declare thresholds as class attributes; the mixin provides the `test_*` method. Override when you need extra assertions (e.g. EAGLE accept length).
Available mixins in `python/sglang/test/kits/eval_accuracy_kit.py`: `MMLUMixin`, `HumanEvalMixin`, `MGSMEnMixin`, `GSM8KMixin`. Can be combined freely. Read the source for attrs and defaults.
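```python
from sglang.test.kits.eval_accuracy_kit import MMLUMixin
from sglang.test.test_utils import CustomTestCase


class TestMyFeature(CustomTestCase, MMLUMixin):
    mmlu_score_threshold = 0.65
    mmlu_num_examples = 64
    mmlu_num_threads = 32
    # test_mmlu is inherited; no code needed
```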
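The split between "what to test" and "how to test" can be shown with a self-contained sketch. `ScoreThresholdMixin`, its attributes, and the fake `run_eval` below are invented for illustration and are not the real `MMLUMixin` API; read `eval_accuracy_kit.py` for the actual attributes.

```python
import unittest


class ScoreThresholdMixin:
    """Illustrative mixin (not the real MMLUMixin): owns the test method and the assertion."""

    score_threshold = 0.5  # default; test classes override it as a class attribute
    num_examples = 8

    def run_eval(self, num_examples: int) -> float:
        # Stand-in for a real eval run, which would query a launched server.
        raise NotImplementedError

    def test_score(self):
        score = self.run_eval(self.num_examples)
        self.assertGreaterEqual(score, self.score_threshold)


class TestMyFeatureQuality(unittest.TestCase, ScoreThresholdMixin):
    # "What to test": only the threshold is declared here.
    score_threshold = 0.65

    def run_eval(self, num_examples: int) -> float:
        # Fake score so the sketch runs without a model.
        return 0.7


def run_mixin_example() -> unittest.TestResult:
    """Run the example class programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMyFeatureQuality)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the mixin itself does not inherit from `unittest.TestCase`, the test loader never collects it on its own; `test_score` only runs through classes that combine it with a test case base.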
---
## Key Utilities
```python
from sglang.test.test_utils import (
    CustomTestCase,                     # base class with retry logic
    popen_launch_server,                # launch server subprocess
    DEFAULT_URL_FOR_TEST,               # auto-configured base URL
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,  # 600s default
    run_bench_serving,                  # benchmark helper (launch + bench)
)
from sglang.srt.utils import kill_process_tree  # cleanup server
```
---
## Checklist
Before submitting a test:
- [ ] Inherits from `CustomTestCase` (not `unittest.TestCase`)
- [ ] Has `register_*_ci(...)` call at module level
- [ ] Placed in `test/registered/<category>/`, unless this is a JIT kernel test/benchmark
- [ ] JIT kernel work: files live in `python/sglang/jit_kernel/tests/` or `python/sglang/jit_kernel/benchmark/`
- [ ] Backend-independent tests: `register_cuda_ci` only + smallest model
- [ ] Logic that doesn't need a server / engine launch → unit test in `registered/unit/` (see Unit Tests section)
- [ ] `setUpClass` launches server, `tearDownClass` kills it (if server-based)
- [ ] `tearDownClass` is defensive — uses `hasattr`/null checks before accessing resources that may not have been allocated
- [ ] Has `if __name__ == "__main__": unittest.main()`
- [ ] `est_time` is reasonable (measure locally)

3
third_party/sglang/.codespellrc vendored Normal file
View File

@@ -0,0 +1,3 @@
[codespell]
ignore-words-list = ans, als, hel, boostrap, childs, te, vas, hsa, ment, cann, thi, makro, wil, rouge, PRIS
skip = *.json,*.jsonl,*.patch,*.txt

16
third_party/sglang/.coveragerc vendored Normal file
View File

@@ -0,0 +1,16 @@
[run]
source = python/sglang/srt
omit =
    */test/*
    */__pycache__/*

[report]
show_missing = true
exclude_lines =
    pragma: no cover
    if __name__ == .__main__.:
    raise NotImplementedError
    if TYPE_CHECKING

[html]
directory = htmlcov

View File

@@ -0,0 +1,35 @@
FROM lmsysorg/sglang:dev
# Create non-root user with specified UID and GID
# NOTE: Replace with your own UID and GID. This is a workaround from https://github.com/microsoft/vscode-remote-release/issues/49#issuecomment-489060908.
ARG HOST_UID=1003
ARG HOST_GID=1003
RUN groupadd -g $HOST_GID devuser && \
useradd -m -u $HOST_UID -g $HOST_GID -s /bin/zsh devuser
# Give devuser sudo access
RUN apt-get update && apt-get install -y sudo && \
echo "devuser ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/devuser && \
rm -rf /var/lib/apt/lists/* && \
apt-get clean
# Set up oh-my-zsh for devuser
RUN cp -r /root/.oh-my-zsh /home/devuser/.oh-my-zsh && \
cp /root/.zshrc /home/devuser/.zshrc && \
cp /root/.vimrc /home/devuser/.vimrc && \
cp /root/.tmux.conf /home/devuser/.tmux.conf && \
sed -i 's|/root/.oh-my-zsh|/home/devuser/.oh-my-zsh|g' /home/devuser/.zshrc && \
chown -R devuser:devuser /home/devuser/
# Set workspace directory and ownership
WORKDIR /sgl-workspace/sglang
RUN chown -R devuser:devuser /sgl-workspace
# Switch to devuser
USER devuser
# Install uv
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
# Install rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

View File

@@ -0,0 +1,30 @@
{
"name": "sglang",
"build": {
"dockerfile": "Dockerfile"
},
"remoteUser": "devuser",
"customizations": {
"vscode": {
"extensions": [
// Python development
"ms-python.python",
"charliermarsh.ruff",
// Rust development
"rust-lang.rust-analyzer",
"tamasfe.even-better-toml"
]
}
},
"forwardPorts": [],
"runArgs": [
"--gpus",
"all"
],
// The two lines below ensure that your local changes in the sglang
// repo are automatically synced to the sglang pip package installed
// in the dev docker container. You can remove or comment out these
// two lines if you prefer to sync code changes manually.
"workspaceMount": "source=${localWorkspaceFolder},target=/sgl-workspace/sglang,type=bind",
"workspaceFolder": "/sgl-workspace/sglang"
}

1
third_party/sglang/.dockerignore vendored Symbolic link
View File

@@ -0,0 +1 @@
.gitignore


74
third_party/sglang/.github/CODEOWNERS vendored Normal file
View File

@@ -0,0 +1,74 @@
.github @merrymercy @Fridge003 @ispobock @Kangyan-Zhou @bingxche
/docker @Fridge003 @ispobock @HaiShaw @ishandhanani @yctseng0211
/docker/npu.Dockerfile @ping1jing2 @iforgetmyname
/python/pyproject.toml @merrymercy @Fridge003 @ispobock
/python/sglang/jit_kernel @DarkSharpness @BBuf @celve @HydraQYH @yuan-luo
/python/sglang/jit_kernel/diffusion @yingluosanqian @BBuf @mickqian
/python/sglang/multimodal_gen @mickqian @yhyang201 @ping1jing2
/python/sglang/multimodal_gen/runtime/cache @DefTruth
/python/sglang/multimodal_gen/runtime/layers @mickqian @yhyang201 @BBuf @yingluosanqian @ping1jing2
/python/sglang/multimodal_gen/runtime/models/dits @mickqian @yhyang201 @BBuf @yingluosanqian @ping1jing2
/python/sglang/srt/batch_invariant_ops @Fridge003 @hebiao064
/python/sglang/srt/compilation @hebiao064 @Oasis-Git
/python/sglang/srt/constrained @hnyls2002 @DarkSharpness
/python/sglang/srt/disaggregation @ByronHsu @hnyls2002 @ShangmingCai
/python/sglang/srt/disaggregation/ascend @ping1jing2 @iforgetmyname
/python/sglang/srt/distributed @yizhang2077 @merrymercy @ch-wan
/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py @ShangmingCai @stmatengss
/python/sglang/srt/dllm @ClawSeven @btw616
/python/sglang/srt/entrypoints @ispobock @CatherineSue @slin1237 @merrymercy @JustinTong0323
/python/sglang/srt/entrypoints/engine_score_mixin.py @sundar24295s @chanh @fortunecookiee
/python/sglang/srt/entrypoints/grpc_server.py @CatherineSue @slin1237
/python/sglang/srt/entrypoints/openai/serving_score.py @sundar24295s @chanh @fortunecookiee
/python/sglang/srt/eplb @fzyzcjy @ch-wan
/python/sglang/srt/function_call @CatherineSue @JustinTong0323
/python/sglang/srt/grpc @CatherineSue @slin1237
/python/sglang/srt/hardware_backend/npu @ping1jing2 @iforgetmyname
/python/sglang/srt/hardware_backend/npu/quantization @OrangeRedeng @TamirBaydasov @iforgetmyname
/python/sglang/srt/layers @merrymercy @Ying1123 @Fridge003 @ispobock @HaiShaw @ch-wan @BBuf @Edwardf0t1
/python/sglang/srt/layers/attention @merrymercy @Fridge003 @ispobock @Qiaolin-Yu @hebiao064 @HaiShaw
/python/sglang/srt/layers/attention/fla @yizhang2077 @hebiao064 @yuan-luo
/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py @yizhang2077 @hebiao064 @hanming-lu @yuan-luo
/python/sglang/srt/layers/attention/mamba @yizhang2077 @hebiao064
/python/sglang/srt/layers/attention/nsa @1am9trash @hubertlu-tw @kkHuang-amd @HaiShaw @Fridge003 @hlu1 @rainj-me
/python/sglang/srt/layers/attention/vision.py @mickqian @yuan-luo @yhyang201
/python/sglang/srt/layers/quantization @ch-wan @BBuf @Edwardf0t1 @FlamingoPg @AniZpZ @HaiShaw @b8zhong
/python/sglang/srt/layers/quantization/quark @kkHuang-amd @yichiche @hubertlu-tw @1am9trash @BowenBao
/python/sglang/srt/lora @Ying1123 @Fridge003 @lifuhuang @yushengsu-thu
/python/sglang/srt/managers @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
/python/sglang/srt/managers/scheduler_pp_mixin.py @ShangmingCai @XucSh
/python/sglang/srt/managers/tokenizer_manager_score_mixin.py @sundar24295s @chanh @fortunecookiee
/python/sglang/srt/mem_cache @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann @hanming-lu @yizhang2077 @hzh0425 @ispobock
/python/sglang/srt/model_executor @merrymercy @Ying1123 @hnyls2002 @Fridge003 @ispobock
/python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py @hebiao064
/python/sglang/srt/models/deepseek_common @Fridge003 @ispobock @fzyzcjy @ch-wan
/python/sglang/srt/models/deepseek_v2.py @fzyzcjy @zhyncs @ispobock @ch-wan @merrymercy @Fridge003
/python/sglang/srt/models/transformers.py @adarshxs
/python/sglang/srt/multimodal @mickqian @JustinTong0323 @yhyang201 @yuan-luo
/python/sglang/srt/observability @merrymercy @fzyzcjy @sufeng-buaa
/python/sglang/srt/ray @Qiaolin-Yu @xyuzh
/python/sglang/srt/speculative @Ying1123 @merrymercy @hnyls2002
/sgl-kernel @ispobock @BBuf @yizhang2077 @merrymercy @FlamingoPg @HaiShaw
/sgl-model-gateway @slin1237 @CatherineSue
/sgl-model-gateway/benches @slin1237
/sgl-model-gateway/bindings/python @CatherineSue @key4ng @slin1237
/sgl-model-gateway/e2e_test @CatherineSue @key4ng
/sgl-model-gateway/examples/wasm @slin1237
/sgl-model-gateway/src/config @slin1237
/sgl-model-gateway/src/core @slin1237
/sgl-model-gateway/src/data_connector @key4ng
/sgl-model-gateway/src/grpc_client @CatherineSue @slin1237
/sgl-model-gateway/src/mcp @key4ng @slin1237
/sgl-model-gateway/src/policies @slin1237 @ByronHsu
/sgl-model-gateway/src/proto @CatherineSue @slin1237
/sgl-model-gateway/src/protocols @CatherineSue @key4ng
/sgl-model-gateway/src/reasoning_parser @CatherineSue
/sgl-model-gateway/src/routers @CatherineSue @key4ng @slin1237
/sgl-model-gateway/src/tokenizer @slin1237 @CatherineSue
/sgl-model-gateway/src/tool_parser @slin1237 @CatherineSue
/sgl-model-gateway/src/wasm @slin1237
/sgl-model-gateway/examples/wasm @slin1237
/test/registered/core/test_score_api.py @sundar24295s @chanh @fortunecookiee
/benchmark/prefill_only/bench_score.py @sundar24295s @chanh @fortunecookiee
/test/srt/ascend @ping1jing2 @iforgetmyname
/test/srt/test_modelopt* @Edwardf0t1

View File

@@ -0,0 +1,12 @@
# Maintenance Tools
This folder contains tools and workflows for automating maintenance tasks.
## CI Permissions
`CI_PERMISSIONS.json` defines the CI permissions granted to each user.
Maintainers can directly edit the file to add entries with `"reason": "custom override"`.
Maintainers can also run `update_ci_permission.py` to update it using automatic rules (e.g., top contributors in the last 90 days get full permissions).
## Others
- `MAINTAINER.md` defines the code maintenance model.

View File

@@ -0,0 +1,35 @@
name: 🐞 Bug report
description: Report a bug to help us reproduce and fix it.
title: "[Bug] "
labels: ['Bug']

body:
- type: checkboxes
  attributes:
    label: Checklist
    options:
    - label: I searched related issues but found no solution.
    - label: The bug persists in the latest version.
    - label: Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
    - label: If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
    - label: Please use English. Otherwise, it will be closed.
- type: textarea
  attributes:
    label: Describe the bug
    description: A clear, concise description of the bug.
  validations:
    required: true
- type: textarea
  attributes:
    label: Reproduction
    description: Command/script run and model used.
    placeholder: Paste the command here.
  validations:
    required: true
- type: textarea
  attributes:
    label: Environment
    description: Run `python3 -m sglang.check_env` and paste output here. Issues without this will be closed.
    placeholder: Paste environment output here.
  validations:
    required: true

View File

@@ -0,0 +1,23 @@
name: 🚀 Feature request
description: Suggest an idea for this project
title: "[Feature] "

body:
- type: checkboxes
  attributes:
    label: Checklist
    options:
    - label: If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
    - label: Please use English. Otherwise, it will be closed.
- type: textarea
  attributes:
    label: Motivation
    description: |
      Clearly and concisely describe the feature's motivation.
  validations:
    required: true
- type: textarea
  attributes:
    label: Related resources
    description: |
      Provide official releases or third-party implementations if available.

154
third_party/sglang/.github/MAINTAINER.md vendored Normal file
View File

@@ -0,0 +1,154 @@
# SGLang Code Maintenance Model
This document describes the code maintenance model for the SGLang project.
Since SGLang is a large project involving multiple organizations and hardware platforms, we designed this model with the following goals:
- Ensure a responsive and smooth review process.
- Allow for fast iteration, so maintainers can sometimes bypass flaky CI tests for important PRs.
## Role Descriptions
There are four roles in this maintenance model. Some are custom roles, while others are predefined by GitHub.
- **Merge Oncall**: The person who drives the PR merge process. They have strong area-specific expertise and uphold a high bar for code quality.
  - Permission: Merge PRs. Bypass branch protection rules if needed.
  - Responsibility: Shepherd the merge of PRs assigned to their area. Revert or hotfix any issues related to their merge (especially if they bypass).
- **Codeowner**: The person who protects critical code. Without a bypass, each PR needs at least one Codeowner approval for each modified file protected by [CODEOWNERS](./CODEOWNERS). Please note that this role is not an honor but a significant responsibility because PRs cannot be merged without your approval (except when bypassed by a Merge Oncall).
  - Permission: Approve PRs, allowing them to be merged without a bypass.
  - Responsibility: Review PRs in a timely manner.
- **Write**: A person with write permission to the SGLang repo.
  - Permission: Merge PRs if they have passed required tests and been approved by Codeowners. This role cannot bypass branch protection rules.
  - Responsibility: Review and merge PRs in a timely manner.
- **CI Oncall**: A person who manages CI runners for specific hardware platforms.
  - Permission: Add CI runners.
  - Responsibility: Keep the CI runners up and running.
__Note__: Difference between Merge Oncall and Codeowner
- The Merge Oncall is an active role held by someone who actively tries to help merge PRs and can bypass CI if needed.
- The Codeowner is a passive protection role provided by GitHub; it prevents accidental changes to critical code.
- The list of Merge Oncalls is attached below. The list of Codeowners is in the [CODEOWNERS](./CODEOWNERS) file.
__Note__: The permissions to trigger CI tests are defined separately according to these [rules](https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests).
## Pull Request Merge Process
1. The author submits a pull request (PR) and fills out the PR checklist.
2. A bot assigns this PR to a Merge Oncall and @-mentions them. At the same time, GitHub will automatically request reviews from Codeowners.
3. Someone tags the PR with a `run-ci` label ([help](https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests)). Then the author can trigger CI by pushing new commits.
4. The Merge Oncall coordinates the review (e.g., asking people to review) and approves the PR; the Codeowners also approve the PR. If the assigned Merge Oncall is not responsive, the author can ping other related Merge Oncalls and Reviewers in the list below.
5. The code can now be merged:
   - **Ideal case:** For each modified file, one Codeowner has approved the PR. The PR has also passed the required CI tests. Then, anyone with write permission can merge the PR.
   - **Exception:** In cases where it is difficult to meet all requirements (due to flaky CI or slow responses), a Merge Oncall can bypass branch protection to merge the PR.
If you run into any issues during the merge, you can discuss them in the [Slack channels](https://slack.sglang.io/): #pull-request, #ci-cd-build-release, #dev.
## The List of Merge Oncalls and Reviewers
This section lists the oncalls for each module or feature.
The format is @github-username (Slack username).
### Scheduler
[@merrymercy](https://github.com/merrymercy) (Lianmin Zheng), [@hnyls2002](https://github.com/hnyls2002) (Liangsheng Yin), [@cctry](https://github.com/cctry) (Shiyang Chen)
related files
- python/sglang/srt/managers
- python/sglang/srt/model_executor
### Diffusion
[@mickqian](https://github.com/mickqian) (Mick), [@BBuf](https://github.com/BBuf) (BBuf)
related files
- python/sglang/multimodal_gen
### PD disaggregation
[@ByronHsu](https://github.com/ByronHsu) (Byron Hsu), [@cctry](https://github.com/cctry) (Shiyang Chen), [@ShangmingCai](https://github.com/ShangmingCai) (Shangming Cai)
related files
- python/sglang/srt/disaggregation
### KV Cache
[@ispobock](https://github.com/ispobock) (Ke Bao), [@xiezhq-hermann](https://github.com/xiezhq-hermann) (Zhiqiang Xie)
related files
- python/sglang/srt/mem_cache
### Parallelism
[@ch-wan](https://github.com/ch-wan) (Cheng Wan), [@fzyzcjy](https://github.com/fzyzcjy) (Tom)
related files
- python/sglang/srt/eplb
- python/sglang/srt/distributed
- python/sglang/srt/layers/dp_attention.py
### Kernel
[@BBuf](https://github.com/BBuf) (BBuf)
related files
- python/sglang/jit_kernel
- sgl-kernel
### Speculative decoding
[@hnyls2002](https://github.com/hnyls2002) (Liangsheng Yin), [@Qiaolin-Yu](https://github.com/Qiaolin-Yu) (Qiaolin Yu)
related files
- python/sglang/srt/speculative
### NV and model-specific optimizations
[@Fridge003](https://github.com/Fridge003) (Baizhou Zhang), [@ishandhanani](https://github.com/ishandhanani) (Ishan Dhanani), [@Qiaolin-Yu](https://github.com/Qiaolin-Yu) (Qiaolin Yu)
related files
- python/sglang/srt/models
- python/sglang/srt/layers/attention
### AMD optimizations
[@HaiShaw](https://github.com/HaiShaw) (Henry HAI)
### NPU optimizations
[@iforgetmyname](https://github.com/iforgetmyname) (Even Zhou)
related files
- python/sglang/srt/hardware_backend/npu
### CI, Release, Package
[@Kangyan-Zhou](https://github.com/Kangyan-Zhou) (Kangyan Zhou), [@Fridge003](https://github.com/Fridge003) (Baizhou Zhang)
related files
- .github/workflows
### Router, API
[@slin1237](https://github.com/slin1237) (Simo Lin)
related files
- sgl-model-gateway
- python/sglang/srt/grpc
- python/sglang/srt/entrypoints
### Other Notes
We currently have many Merge Oncalls, mainly because the CI is flaky and CODEOWNERS is too coarse-grained.
In the future, we hope the CI will improve so that bypasses are rarely needed. After that, most Merge Oncalls can be converted back to the Write and Codeowner roles.
This list is based on the current situation. If you or someone you know would like to take on more responsibility and are qualified, please ping [Lianmin Zheng](https://github.com/merrymercy) and [Ying Sheng](https://github.com/Ying1123) in the Slack channel. They will start a nomination and internal review process.
## The List of CI Oncalls
This section lists the oncalls for each hardware platform. The format is @github-username (Slack username).
### NVIDIA GPUs
[@Kangyan-Zhou](https://github.com/Kangyan-Zhou) (Kangyan Zhou), [@ch-wan](https://github.com/ch-wan) (Cheng Wan), [@HanHan009527](https://github.com/HanHan009527) (hanhan), [@ishandhanani](https://github.com/ishandhanani) (Ishan Dhanani), [@ShangmingCai](https://github.com/ShangmingCai) (Shangming Cai), [@alisonshao](https://github.com/alisonshao) (Alison Shao).
### AMD GPUs
[@saienduri](https://github.com/saienduri) (Sai Enduri), [@HaiShaw](https://github.com/HaiShaw) (Henry HAI)
### Intel CPU and XPU
[@mingfeima](https://github.com/mingfeima) (Mingfei Ma), [@DiweiSun](https://github.com/DiweiSun) (Diwei Sun)
### Ascend NPUs
[@iforgetmyname](https://github.com/iforgetmyname) (Even Zhou)
This list is based on the current situation. If you or someone you know would like to donate machines for CI, they can serve as the CI oncalls for their machines. Please ping [Lianmin Zheng](https://github.com/merrymercy) and [Ying Sheng](https://github.com/Ying1123) in the Slack channel. They will start a nomination and internal review process.
## CI Maintenance Mode
When the CI is unhealthy (e.g., the scheduled pr-test on `main` is broken for consecutive runs), the project enters **CI Maintenance Mode** by opening [issue #21065](https://github.com/sgl-project/sglang/issues/21065). While active:
- All PR CI runs are paused. Resources are allocated to PRs that fix the CI.
- **Merging non-CI-fix PRs is prohibited.** Only PRs that fix the CI may be merged. In severe cases, merge permissions may be revoked.
Maintenance mode ends when `pr-test.yml` is all green on `main` and the issue is closed.
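The gate that enforces this in CI is the `Check Maintenance Mode` composite action, which is a Bash script. Its decision order can be paraphrased in Python as follows; this is an illustrative sketch, and the function and parameter names are made up here rather than taken from the action.

```python
from typing import Iterable, Optional


def ci_allowed(
    issue_state: Optional[str],
    pr_labels: Iterable[str] = (),
    bypass_on_main: bool = False,
) -> bool:
    """Return True if CI may proceed, mirroring the order of checks in the action."""
    # 1. PR Test runs on main (scheduled / dispatch) bypass the gate entirely.
    if bypass_on_main:
        return True
    # 2. Fail-open: anything other than an OPEN issue (including an API error,
    #    represented here as None or "UNKNOWN") means maintenance mode is off.
    if issue_state != "OPEN":
        return True
    # 3. During maintenance, only PRs labeled bypass-maintenance may run CI.
    return "bypass-maintenance" in set(pr_labels)
```

The fail-open choice in step 2 means a GitHub API outage cannot block every PR; the cost is that the gate may briefly be ineffective while the API is unreachable.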
## Suspending Permissions
If a Merge Oncall bypasses checks to merge a PR that breaks the `main` branch, merges a non-CI-fix PR during CI Maintenance Mode, or repeatedly breaks the CI due to various reasons, their privileges will be suspended for at least two days, depending on the severity of the incident.

View File

@@ -0,0 +1,63 @@
name: Check Maintenance Mode
description: Blocks CI when maintenance mode is active (issue #21065 is open), unless the PR has the bypass-maintenance label, or env SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN=true (PR Test workflow on main only). Merging non-CI-fix PRs is prohibited during maintenance mode; in severe cases, merge permissions may be revoked.

inputs:
  github-token:
    description: GitHub token for API access
    required: false
    default: ${{ github.token }}

runs:
  using: composite
  steps:
    - name: Check maintenance mode
      shell: bash
      env:
        GH_TOKEN: ${{ inputs.github-token }}
      run: |
        MAINTENANCE_ISSUE=21065
        REPO="${{ github.repository }}"
        PR_NUMBER="${{ github.event.pull_request.number }}"

        # PR Test workflow only: scheduled runs and runs on main (dispatch / workflow_call) set this env
        if [[ "${SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN:-}" == "true" ]]; then
          echo "✅ PR Test on main branch; bypassing maintenance gate."
          exit 0
        fi

        # Check if maintenance issue is open (fail-open: if API errors, allow CI to proceed)
        ISSUE_STATE=$(gh issue view "$MAINTENANCE_ISSUE" --repo "$REPO" --json state --jq '.state' 2>/dev/null || echo "UNKNOWN")
        if [[ "$ISSUE_STATE" != "OPEN" ]]; then
          echo "✅ Maintenance mode is OFF. Proceeding with CI."
          exit 0
        fi

        # For PRs, check if bypass-maintenance label is present
        if [[ -n "$PR_NUMBER" ]]; then
          HAS_BYPASS=$(gh pr view "$PR_NUMBER" --repo "$REPO" --json labels --jq '[.labels[].name] | map(select(. == "bypass-maintenance")) | length' 2>/dev/null || echo "0")
          if [[ "$HAS_BYPASS" -gt 0 ]]; then
            echo "✅ PR #$PR_NUMBER has 'bypass-maintenance' label. Bypassing maintenance mode."
            exit 0
          fi
        fi

        MSG=$(printf "%s\n" \
          "## ⚠️ CI Maintenance Mode is Active" \
          "The CI infrastructure is currently under maintenance." \
          "All PR CI runs are paused until maintenance is complete." \
          "**Merging non-CI-fix PRs is prohibited during maintenance mode.** In severe cases, merge permissions may be revoked." \
          "You might also experience unexpected failures during this period." \
          "The team is working on the issue and will update the status as soon as possible." \
          "" \
          "What should you do?" \
          "- **Do NOT merge non-CI-fix PRs** until maintenance mode is lifted" \
          "- Check back later (~12 hours)" \
          "- Follow CI Maintenance Mode issue: https://github.com/$REPO/issues/$MAINTENANCE_ISSUE for status updates")

        echo "$MSG" >> "$GITHUB_STEP_SUMMARY"
        while IFS= read -r line; do
          echo "::error::$line"
        done <<< "$MSG"
        exit 1

View File

@@ -0,0 +1,50 @@
name: Check Stage Health
description: Fail fast if any job in the current workflow run has already failed. Auto-skips for scheduled runs.

inputs:
  github-token:
    description: 'GitHub token for API calls'
    required: false
    default: ${{ github.token }}

runs:
  using: composite
  steps:
    - name: Check stage health
      uses: actions/github-script@v7
      env:
        SKIP_STAGE_HEALTH_CHECK: ${{ env.SKIP_STAGE_HEALTH_CHECK }}
      with:
        github-token: ${{ inputs.github-token }}
        script: |
          // Skip when explicitly requested via env var (e.g. release branch cut)
          if (process.env.SKIP_STAGE_HEALTH_CHECK === 'true') {
            core.info('Skipping health check (SKIP_STAGE_HEALTH_CHECK=true)');
            return;
          }

          // Skip for scheduled runs: they should collect all failures, not fast-fail
          if (context.eventName === 'schedule') {
            core.info('Skipping health check for scheduled run');
            return;
          }

          const jobs = await github.paginate(github.rest.actions.listJobsForWorkflowRun, {
            owner: context.repo.owner,
            repo: context.repo.repo,
            run_id: context.runId,
            per_page: 100,
          });

          // Find jobs that failed from a real error, not from fast-fail cascade
          const rootCauseFailures = jobs.filter(j => {
            if (j.status !== 'completed' || j.conclusion !== 'failure') return false;
            // If the failing step is the health check, it's a cascade, so skip it
            const failedStep = (j.steps || []).find(s => s.conclusion === 'failure');
            if (failedStep && (failedStep.name.includes('check-stage-health') || failedStep.name.includes('Check stage health'))) {
              return false;
            }
            return true;
          });

          if (rootCauseFailures.length > 0) {
            core.setFailed(`Fast-fail: skipping — root cause job(s): ${rootCauseFailures.map(j => j.name).join(', ')}`);
          }

View File

@@ -0,0 +1,27 @@
name: Upload CUDA Coredumps
description: Upload CUDA coredump files as artifacts and clean up the directory.

inputs:
  artifact-suffix:
    description: Suffix appended to the artifact name (e.g. matrix partition id)
    required: false
    default: ""
  retention-days:
    description: Number of days to retain the artifact
    required: false
    default: "7"

runs:
  using: composite
  steps:
    - name: Upload CUDA coredumps
      uses: actions/upload-artifact@v4
      with:
        name: cuda-coredumps-${{ github.job }}${{ inputs.artifact-suffix && format('-{0}', inputs.artifact-suffix) }}
        path: ${{ env.SGLANG_CUDA_COREDUMP_DIR || '/tmp/sglang_cuda_coredumps' }}/
        retention-days: ${{ inputs.retention-days }}
        if-no-files-found: ignore

    - name: Cleanup CUDA coredumps
      shell: bash
      run: rm -rf "${{ env.SGLANG_CUDA_COREDUMP_DIR || '/tmp/sglang_cuda_coredumps' }}"


@@ -0,0 +1,177 @@
name: Wait for Jobs
description: Poll and wait for specified jobs in the current workflow run to complete
inputs:
stage-name:
description: 'Human-readable stage name for log messages (e.g. "stage-a")'
required: true
jobs:
description: |
JSON array of job specs to wait for. Each element is either:
- a string: exact job name (e.g. "stage-a-test-1-gpu-small")
- an object { "prefix": "...", "expected_count": N }: for matrix jobs
required: true
max-wait-minutes:
description: 'Maximum time to wait before timing out'
required: false
default: '240'
poll-interval-seconds:
description: 'Seconds between polling attempts'
required: false
default: '60'
github-token:
description: 'GitHub token for API calls'
required: false
default: ${{ github.token }}
outputs:
result:
description: 'Overall result: success, failure, or timeout'
value: ${{ steps.wait.outputs.result }}
runs:
using: composite
steps:
- name: Wait for jobs to complete
id: wait
uses: actions/github-script@v7
env:
INPUT_STAGE_NAME: ${{ inputs.stage-name }}
INPUT_JOBS: ${{ inputs.jobs }}
INPUT_MAX_WAIT_MINUTES: ${{ inputs.max-wait-minutes }}
INPUT_POLL_INTERVAL_SECONDS: ${{ inputs.poll-interval-seconds }}
with:
github-token: ${{ inputs.github-token }}
script: |
const stageName = process.env.INPUT_STAGE_NAME;
const jobSpecs = JSON.parse(process.env.INPUT_JOBS);
const maxWaitMinutes = parseInt(process.env.INPUT_MAX_WAIT_MINUTES, 10);
const pollIntervalSeconds = parseInt(process.env.INPUT_POLL_INTERVAL_SECONDS, 10);
// Round up so the attempt count logged below is always a whole number
const maxAttempts = Math.ceil((maxWaitMinutes * 60) / pollIntervalSeconds);
// Normalize job specs into a uniform format
const normalizedSpecs = jobSpecs.map(spec => {
if (typeof spec === 'string') {
return { prefix: spec, expected_count: 1, exact: true };
}
return { ...spec, exact: false };
});
const totalExpectedJobs = normalizedSpecs.reduce((sum, s) => sum + s.expected_count, 0);
const matchesSpec = (jobName, spec) => {
if (spec.exact) {
return jobName === spec.prefix;
}
return jobName === spec.prefix || jobName.startsWith(spec.prefix + ' (');
};
// Use ETag conditional requests to avoid consuming rate limit when nothing changed.
// GitHub returns 304 Not Modified for unchanged data, which is FREE (no rate limit cost).
let lastEtag = '';
let lastJobs = null;
let apiCalls = 0;
let cachedCalls = 0;
async function fetchJobs() {
const url = `GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs`;
const params = {
owner: context.repo.owner,
repo: context.repo.repo,
run_id: context.runId,
per_page: 100,
headers: {},
};
if (lastEtag) {
params.headers['if-none-match'] = lastEtag;
}
try {
const response = await github.request(url, params);
apiCalls++;
const rateRemaining = response.headers['x-ratelimit-remaining'] || '?';
const rateLimit = response.headers['x-ratelimit-limit'] || '?';
console.log(`[rate-limit] ${rateRemaining}/${rateLimit} remaining (ETag: ${lastEtag ? 'sent' : 'none'}) | this session: ${apiCalls} paid, ${cachedCalls} free`);
lastEtag = response.headers.etag || '';
const jobs = response.data.jobs;
// Handle pagination if >100 jobs
// ETag only covers page 1, so invalidate it to avoid stale cache
// when later pages change but page 1 doesn't.
if (response.data.total_count > 100) {
lastEtag = '';
for (let page = 2; page <= Math.ceil(response.data.total_count / 100); page++) {
const { data: pageData } = await github.request(url, {
...params,
page,
headers: {},
});
jobs.push(...pageData.jobs);
}
}
lastJobs = jobs;
return { jobs, cached: false };
} catch (err) {
if (err.status === 304 && lastJobs) {
cachedCalls++;
console.log(`[rate-limit] 304 Not Modified | this session: ${apiCalls} paid, ${cachedCalls} free`);
return { jobs: lastJobs, cached: true };
}
throw err;
}
}
for (let attempt = 0; attempt < maxAttempts; attempt++) {
const { jobs, cached } = await fetchJobs();
let allCompleted = true;
let failedJobs = [];
let completedCount = 0;
let totalCount = 0;
for (const spec of normalizedSpecs) {
const matchingJobs = jobs.filter(job => matchesSpec(job.name, spec));
for (const job of matchingJobs) {
totalCount++;
if (!cached) {
console.log(`${job.name}: status=${job.status}, conclusion=${job.conclusion}`);
}
if (job.status === 'completed') {
completedCount++;
if (job.conclusion !== 'success' && job.conclusion !== 'skipped') {
failedJobs.push(job.name);
}
} else {
allCompleted = false;
}
}
if (matchingJobs.length < spec.expected_count) {
console.log(`${spec.prefix}: found ${matchingJobs.length}/${spec.expected_count} jobs (waiting for more)`);
allCompleted = false;
}
}
console.log(`[${stageName}] Progress: ${completedCount}/${totalCount} jobs completed (expected ${totalExpectedJobs})${cached ? ' (cached, no rate limit cost)' : ''}`);
// Fail fast if any jobs failed
if (failedJobs.length > 0) {
core.setOutput('result', 'failure');
core.setFailed(`${stageName} jobs failed: ${failedJobs.join(', ')}`);
return;
}
if (allCompleted && totalCount >= totalExpectedJobs) {
core.setOutput('result', 'success');
return;
}
console.log(`Waiting ${pollIntervalSeconds}s... (attempt ${attempt + 1}/${maxAttempts})`);
await new Promise(resolve => setTimeout(resolve, pollIntervalSeconds * 1000));
}
core.setFailed(`Timeout waiting for ${stageName} jobs`);
core.setOutput('result', 'timeout');
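
The job-matching rule used by this action can be checked locally with a small Python port. This is only an illustrative sketch, not part of the action: the function names `normalize_specs` and `matches_spec` are hypothetical, and the `stage-b-test` prefix is a made-up example.

```python
import json


def normalize_specs(raw_json):
    """Turn the `jobs` input (a JSON array) into uniform dicts.

    A bare string means one job with that exact name; an object carries
    a prefix plus the number of matrix jobs expected under it.
    """
    specs = []
    for spec in json.loads(raw_json):
        if isinstance(spec, str):
            specs.append({"prefix": spec, "expected_count": 1, "exact": True})
        else:
            specs.append({**spec, "exact": False})
    return specs


def matches_spec(job_name, spec):
    """Exact specs need the full name; prefix specs also accept 'prefix (...)'."""
    if spec["exact"]:
        return job_name == spec["prefix"]
    return job_name == spec["prefix"] or job_name.startswith(spec["prefix"] + " (")


# Hypothetical input: one exact job and one matrix job with two partitions.
specs = normalize_specs(
    '["stage-a-test-1-gpu-small", {"prefix": "stage-b-test", "expected_count": 2}]'
)
total_expected = sum(s["expected_count"] for s in specs)
```

Requiring the `" ("` suffix keeps a prefix such as `stage-b-test` from also matching a longer job name like `stage-b-test-large (0)`.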


@@ -0,0 +1,411 @@
"""
Audit GitHub repository collaborators with elevated access.
This script will:
1. Fetch all collaborators with triage, write, maintain, or admin permission on this repo.
2. Record each collaborator's GitHub username, display name ("nickname"), and role
(e.g., admin, maintain, custom org role, write, triage).
3. Record their last activity in this repo (last commit, last issue, last pull request)
as YYYY-MM-DD dates, plus a "last_activity_date" column (the most recent of the three) placed before the three breakdown columns.
4. Record activity on other repos: repos touched via public events in the last 90 days (Push, PR, Issues, etc.), sorted by number of events.
5. Write the results to a CSV sorted by role (admin, maintain, custom org role, write, triage) and then by last activity date (most recent first).
Usage:
export GH_TOKEN="your_github_token"
python3 audit_permission.py [--output path] [--repo owner/name]
Requires: requests, and a token with permission to list collaborators (push+
access to the repo).
"""
from __future__ import annotations
import argparse
import csv
import os
import sys
import time
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Any
try:
import requests
except ImportError:
requests = None # type: ignore
DEFAULT_OWNER = "sgl-project"
DEFAULT_NAME = "sglang"
HEADERS: dict[str, str] = {}
def _request(
method: str,
url: str,
*,
params: dict[str, Any] | None = None,
max_retries: int = 3,
) -> requests.Response:
if requests is None:
raise RuntimeError("Install the requests package: pip install requests")
for attempt in range(max_retries):
r = requests.request(method, url, headers=HEADERS, params=params, timeout=60)
if r.status_code == 403 and "rate limit" in (r.text or "").lower():
reset = r.headers.get("X-RateLimit-Reset")
wait = 60
if reset:
try:
wait = max(1, int(reset) - int(time.time()) + 2)
except ValueError:
pass
print(f"Rate limited; sleeping {wait}s...", file=sys.stderr)
time.sleep(min(wait, 3600))
continue
return r
return r
def paginate_list(url: str, params: dict[str, Any] | None = None) -> list[Any]:
out: list[Any] = []
next_url: str | None = url
next_params = params
while next_url:
r = _request("GET", next_url, params=next_params)
next_params = None
if r.status_code != 200:
print(
f"Error {r.status_code} GET {next_url}: {r.text[:500]}",
file=sys.stderr,
)
break
data = r.json()
if isinstance(data, list):
out.extend(data)
else:
break
next_url = None
link = r.headers.get("Link", "")
for part in link.split(", "):
if 'rel="next"' in part:
start = part.find("<") + 1
end = part.find(">")
if start > 0 and end > start:
next_url = part[start:end]
break
return out
def collaborator_role(collab: dict[str, Any]) -> str:
role_name = collab.get("role_name")
if isinstance(role_name, str) and role_name.strip():
return role_name.strip()
perms = collab.get("permissions") or {}
if perms.get("admin"):
return "admin"
if perms.get("maintain"):
return "maintain"
if perms.get("push"):
return "write"
if perms.get("triage"):
return "triage"
return "read"
def has_write_plus(collab: dict[str, Any]) -> bool:
perms = collab.get("permissions") or {}
return bool(
perms.get("admin")
or perms.get("maintain")
or perms.get("push")
or perms.get("triage")
)
def role_sort_tier(collab: dict[str, Any]) -> int:
"""Sort order: admin (0), maintain (1), custom org role (2), write (3), triage (4)."""
rn = collab.get("role_name")
if isinstance(rn, str) and rn.strip():
k = rn.strip().lower()
if k == "admin":
return 0
if k == "maintain":
return 1
if k == "write":
return 3
if k == "triage":
return 4
if k == "read":
return 5
return 2
perms = collab.get("permissions") or {}
if perms.get("admin"):
return 0
if perms.get("maintain"):
return 1
if perms.get("push"):
return 3
if perms.get("triage"):
return 4
return 5
def fetch_display_name(login: str) -> str:
url = f"https://api.github.com/users/{login}"
r = _request("GET", url)
if r.status_code != 200:
return ""
data = r.json()
if not isinstance(data, dict):
return ""
n = data.get("name")
return n.strip() if isinstance(n, str) else ""
def parse_github_ts(s: str) -> datetime | None:
if not s:
return None
s = s.replace("Z", "+00:00")
try:
return datetime.fromisoformat(s)
except ValueError:
return None
def iso_timestamp_to_ymd(iso: str | None) -> str:
if not iso:
return ""
p = parse_github_ts(iso)
if not p:
return ""
return p.date().isoformat()
def max_date_ymd(*iso_dates: str | None) -> str:
best: datetime | None = None
for d in iso_dates:
p = parse_github_ts(d or "")
if p and (best is None or p > best):
best = p
return best.date().isoformat() if best else ""
def parse_ymd(s: str) -> datetime | None:
if not s:
return None
try:
return datetime.strptime(s, "%Y-%m-%d").replace(tzinfo=timezone.utc)
except ValueError:
return None
def last_commit_date(owner: str, repo: str, login: str) -> str | None:
url = f"https://api.github.com/repos/{owner}/{repo}/commits"
r = _request("GET", url, params={"author": login, "per_page": 1})
if r.status_code != 200:
return None
data = r.json()
if not isinstance(data, list) or not data:
return None
commit = data[0].get("commit") or {}
c = commit.get("committer") or commit.get("author") or {}
d = c.get("date")
return d if isinstance(d, str) else None
def search_repo_item(
owner: str, repo: str, login: str, kind: str
) -> dict[str, Any] | None:
q = f"repo:{owner}/{repo} is:{kind} author:{login}"
url = "https://api.github.com/search/issues"
r = _request(
"GET",
url,
params={"q": q, "sort": "updated", "order": "desc", "per_page": 1},
)
if r.status_code != 200:
return None
payload = r.json()
items = payload.get("items")
if not items:
return None
return items[0] if isinstance(items[0], dict) else None
def last_issue_pr_dates(
owner: str, repo: str, login: str
) -> tuple[str | None, str | None]:
issue = search_repo_item(owner, repo, login, "issue")
pr = search_repo_item(owner, repo, login, "pr")
issue_dt = None
pr_dt = None
if issue:
issue_dt = issue.get("updated_at") or issue.get("created_at")
if not isinstance(issue_dt, str):
issue_dt = None
if pr:
pr_dt = pr.get("updated_at") or pr.get("created_at")
if not isinstance(pr_dt, str):
pr_dt = None
return issue_dt, pr_dt
def other_repos_activity_column(
login: str, owner: str, repo: str, days: int = 90
) -> str:
"""Repos other than this one touched in the window, sorted by event count (desc)."""
cutoff = datetime.now(timezone.utc) - timedelta(days=days)
full = f"{owner}/{repo}"
counts: Counter[str] = Counter()
url: str | None = f"https://api.github.com/users/{login}/events/public"
params: dict[str, Any] = {"per_page": 100}
while url:
r = _request("GET", url, params=params)
params = {}
if r.status_code != 200:
break
events = r.json()
if not isinstance(events, list):
break
oldest_in_page: datetime | None = None
for ev in events:
if not isinstance(ev, dict):
continue
created = parse_github_ts(ev.get("created_at") or "")
if created:
if oldest_in_page is None or created < oldest_in_page:
oldest_in_page = created
if created and created < cutoff:
continue
rinfo = ev.get("repo")
name = None
if isinstance(rinfo, dict):
name = rinfo.get("name")
if isinstance(name, str) and name and name != full:
counts[name] += 1
next_url = None
link = r.headers.get("Link", "")
for part in link.split(", "):
if 'rel="next"' in part:
s, e = part.find("<") + 1, part.find(">")
if s > 0 and e > s:
next_url = part[s:e]
break
if oldest_in_page and oldest_in_page < cutoff:
break
url = next_url
if not events:
break
ordered = sorted(counts.items(), key=lambda x: (-x[1], x[0]))
return ";".join(f"{n}:{c}" for n, c in ordered)
def main() -> None:
parser = argparse.ArgumentParser(description="Audit repo collaborator permissions.")
parser.add_argument(
"--repo",
default=f"{DEFAULT_OWNER}/{DEFAULT_NAME}",
help=f"owner/name (default: {DEFAULT_OWNER}/{DEFAULT_NAME})",
)
parser.add_argument(
"--output",
"-o",
default=os.path.join(os.path.dirname(__file__), "permission_audit.csv"),
help="Output CSV path",
)
parser.add_argument(
"--events-days",
type=int,
default=90,
help="Window for other-repo activity via public events",
)
args = parser.parse_args()
if "/" not in args.repo:
print("Error: --repo must be owner/name", file=sys.stderr)
sys.exit(1)
owner, name = args.repo.split("/", 1)
gh_token = os.getenv("GH_TOKEN")
if not gh_token:
print("Error: GH_TOKEN environment variable is not set.", file=sys.stderr)
sys.exit(1)
global HEADERS
HEADERS = {
"Authorization": f"Bearer {gh_token}",
"Accept": "application/vnd.github+json",
"X-GitHub-Api-Version": "2022-11-28",
}
collab_url = f"https://api.github.com/repos/{owner}/{name}/collaborators"
print(f"Fetching collaborators for {owner}/{name}...", file=sys.stderr)
collaborators = paginate_list(
collab_url, params={"per_page": 100, "affiliation": "all"}
)
rows: list[dict[str, Any]] = []
elevated = [c for c in collaborators if isinstance(c, dict) and has_write_plus(c)]
print(
f"Found {len(elevated)} collaborators with admin/maintain/write/triage.",
file=sys.stderr,
)
for i, col in enumerate(elevated, start=1):
login = col.get("login")
if not isinstance(login, str):
continue
print(f" [{i}/{len(elevated)}] {login}", file=sys.stderr)
role = collaborator_role(col)
nickname = fetch_display_name(login)
cd = last_commit_date(owner, name, login)
issue_dt, pr_dt = last_issue_pr_dates(owner, name, login)
last_act_ymd = max_date_ymd(cd, issue_dt, pr_dt)
others = other_repos_activity_column(login, owner, name, days=args.events_days)
rows.append(
{
"_role_tier": role_sort_tier(col),
"github_username": login,
"nickname": nickname,
"role": role,
"last_activity_date": last_act_ymd,
"last_commit_date": iso_timestamp_to_ymd(cd),
"last_issue_date": iso_timestamp_to_ymd(issue_dt),
"last_pr_date": iso_timestamp_to_ymd(pr_dt),
"other_repos_90d": others,
}
)
def sort_key(r: dict[str, Any]) -> tuple[int, float]:
tier = r["_role_tier"]
act = parse_ymd(r.get("last_activity_date") or "")
ts = act.timestamp() if act else 0.0
return (tier, -ts)
rows.sort(key=sort_key)
fieldnames = [
"github_username",
"nickname",
"role",
"last_activity_date",
"last_commit_date",
"last_issue_date",
"last_pr_date",
"other_repos_90d",
]
for r in rows:
del r["_role_tier"]
with open(args.output, "w", newline="", encoding="utf-8") as f:
w = csv.DictWriter(f, fieldnames=fieldnames)
w.writeheader()
w.writerows(rows)
print(f"Wrote {len(rows)} rows to {args.output}", file=sys.stderr)
if __name__ == "__main__":
main()
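
Both pagination loops in this script follow the `rel="next"` entry of the `Link` response header. The parsing step on its own can be sketched as below. The helper name `next_link` and the sample URLs are hypothetical and chosen only for illustration.

```python
def next_link(link_header):
    """Return the URL tagged rel="next" in a Link header, or None."""
    for part in link_header.split(", "):
        if 'rel="next"' in part:
            start = part.find("<") + 1
            end = part.find(">")
            if start > 0 and end > start:
                return part[start:end]
    return None


# Made-up header in the shape GitHub uses for paginated list endpoints.
header = (
    '<https://api.github.com/repositories/1/collaborators?page=2>; rel="next", '
    '<https://api.github.com/repositories/1/collaborators?page=5>; rel="last"'
)
print(next_link(header))
```

Because the next URL already carries the query string, the script clears its `params` after the first request so they are not sent twice.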

122
third_party/sglang/.github/labeler.yml vendored Normal file

@@ -0,0 +1,122 @@
# Configuration for the GitHub Labeler action
# Automatically adds labels to PRs based on the files changed

# Router specific (Rust code in sgl-model-gateway)
model-gateway:
  - changed-files:
      - any-glob-to-any-file: 'sgl-model-gateway/**/*'

# Kernel specific
sgl-kernel:
  - changed-files:
      - any-glob-to-any-file: 'sgl-kernel/**/*'

# JIT kernel specific
jit-kernel:
  - changed-files:
      - any-glob-to-any-file: 'python/sglang/jit_kernel/**/*'

# Documentation
documentation:
  - changed-files:
      - any-glob-to-any-file:
          - '**/*.md'
          - 'docs/**/*'
          - 'README*'

# Dependencies
dependencies:
  - changed-files:
      - any-glob-to-any-file:
          - '**/requirements*.txt'
          - '**/Cargo.toml'
          - '**/Cargo.lock'
          - '**/pyproject*.toml'
          - '**/setup.py'
          - '**/poetry.lock'
          - '**/package.json'
          - '**/package-lock.json'

# Multi-modal
Multi-modal:
  - changed-files:
      - any-glob-to-any-file:
          - '**/*multimodal*'
          - '**/*vision*'
          - '**/*vlm*'

# Diffusion
diffusion:
  - changed-files:
      - any-glob-to-any-file: 'python/sglang/multimodal_gen/**/*'

# LoRA
lora:
  - changed-files:
      - any-glob-to-any-file:
          - '**/*lora*'

# Quantization
quant:
  - changed-files:
      - any-glob-to-any-file:
          - '**/*quant*'
          - '**/*quantization*'

# Speculative decoding
speculative-decoding:
  - changed-files:
      - any-glob-to-any-file:
          - '**/*speculative*'

# AMD specific
amd:
  - changed-files:
      - any-glob-to-any-file:
          - '**/*amd*'
          - '**/*rocm*'

# NPU specific
npu:
  - changed-files:
      - any-glob-to-any-file:
          - '**/*npu*'
          - '**/*ascend*'

# Blackwell
blackwell:
  - changed-files:
      - any-glob-to-any-file:
          - '**/*nvfp4*'
          - 'sgl-kernel/csrc/attention/cutlass_sm100_mla/**/*'
          - 'python/sglang/srt/layers/attention/trtllm_mla_backend.py'
          - 'python/sglang/srt/layers/attention/trtllm_mha_backend.py'

# DeepSeek specific
deepseek:
  - changed-files:
      - any-glob-to-any-file:
          - '**/*deepseek*'

# HiCache
hicache:
  - changed-files:
      - any-glob-to-any-file:
          - '**/*hicache*'

# Deterministic
deterministic:
  - changed-files:
      - any-glob-to-any-file: 'python/sglang/srt/batch_invariant_ops/**/*'

# Piecewise CUDA Graph
piecewise-cuda-graph:
  - changed-files:
      - any-glob-to-any-file: 'python/sglang/srt/compilation/**/*'

# Moore Threads specific
mthreads:
  - changed-files:
      - any-glob-to-any-file:
          - '**/*mthreads*'
          - '**/*musa*'


@@ -0,0 +1,42 @@
no_progress = true
verbose = "warn"
timeout = 20
max_concurrency = 8
retry_wait_time = 2
max_retries = 2
# CI should validate external links over the network.
offline = false
scheme = ["http", "https"]
exclude_path = [
# Exclude generated Sphinx build artifacts.
# - "(\\./)?" allows both "docs/..." and "./docs/..."
# - "[/\\\\]" supports both slash styles in CI environments
"^(\\./)?docs[/\\\\]_build[/\\\\]",
]
exclude = [
# Local-only endpoints referenced in docs/examples.
# These are expected to be unreachable in GitHub-hosted CI.
"^https?://localhost(:[0-9]+)?(/|$)",
"^http://127\\.0\\.0\\.1(:[0-9]+)?(/|$)",
# Vendor pages that frequently block/deny CI user-agents (transient 403/anti-bot).
"^https://www\\.intel\\.com/content/www/us/en/ark/products/series/240391/intel-arc-b-series-graphics\\.html$",
"^https://www\\.intel\\.com/content/www/us/en/ark/products/series/242616/intel-arc-pro-b-series-graphics\\.html$",
"^https://www\\.intel\\.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications\\.html$",
# Non-routable bind address used in examples, never externally reachable.
"^http://0\\.0\\.0\\.0(/|$)",
# Large doc portals with anti-bot/rate-limit behavior in CI.
# We keep API docs references in content but do not fail CI on access policy.
"^https://platform\\.openai\\.com/docs/",
"^https://gamma\\.app/docs/Optimizing-RL-with-SGLang-y0kqgj877k34779$",
"^https://aflah02\\.substack\\.com/p/multi-node-llm-inference-with-sglang/?$",
# Known noisy image URLs used in notebook-rendered examples.
"^https://github\\.com/sgl-project/sglang/blob/main/examples/assets/example_image\\.png\\?raw=true$",
"^https://raw\\.githubusercontent\\.com/sgl-project/sglang/main/examples/assets/example_image\\.png/?$",
"^https://raw\\.githubusercontent\\.com/sgl-project/sglang/main/assets/logo\\.png/?$",
]
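
A quick way to sanity-check the `exclude` patterns is to run a few of them through Python's `re` module. This is a rough sketch under one assumption: lychee uses a different regex engine, and for patterns this simple the behaviour is expected, but not guaranteed, to be the same. The helper `is_excluded` and the sample URLs are hypothetical.

```python
import re

# A few patterns copied from the `exclude` list above, with TOML's doubled
# backslashes unescaped to the regex the link checker would see.
EXCLUDE = [
    r"^https?://localhost(:[0-9]+)?(/|$)",
    r"^http://127\.0\.0\.1(:[0-9]+)?(/|$)",
    r"^http://0\.0\.0\.0(/|$)",
    r"^https://platform\.openai\.com/docs/",
]


def is_excluded(url):
    """True when any exclude pattern matches the URL."""
    return any(re.search(pattern, url) for pattern in EXCLUDE)


for url in (
    "http://localhost:30000/v1",
    "http://localhost.example.com/",
    "http://0.0.0.0:30000/",
):
    print(url, is_excluded(url))
```

Under Python's `re`, the `0.0.0.0` pattern has no port group, so `http://0.0.0.0:30000/` is not matched while `http://0.0.0.0/` is.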


@@ -0,0 +1,18 @@
# .github/linters/lychee.toml
no_progress = true
verbose = "warn"
timeout = 20
max_concurrency = 8
offline = true
# Ignore generated docs output; check source docs only.
exclude_path = [
"^(\\./)?docs[/\\\\]_build[/\\\\]",
]
exclude = [
"^https?://localhost(:[0-9]+)?(/|$)",
"^http://127\\.0\\.0\\.1(:[0-9]+)?(/|$)",
"^http://0\\.0\\.0\\.0(/|$)",
]


@@ -0,0 +1,33 @@
<!-- Thank you for your contribution! Please follow these guidelines to enhance your pull request. If anything is unclear, submit your PR and reach out to maintainers for assistance. Join our Slack community at https://slack.sglang.io to discuss further. -->
## Motivation
<!-- Describe the purpose and goals of this pull request. -->
## Modifications
<!-- Detail the changes made in this pull request. -->
## Accuracy Tests
<!-- If this pull request affects model outputs (e.g., changes to the kernel or model forward code), provide accuracy test results. -->
## Speed Tests and Profiling
<!-- If this pull request impacts inference speed, provide benchmarking and profiling results. -->
## Checklist
- [ ] Format your code as described in [Format code with pre-commit](https://docs.sglang.io/developer_guide/contribution_guide.html#format-code-with-pre-commit).
- [ ] Add unit tests as described in [Run and add unit tests](https://docs.sglang.io/developer_guide/contribution_guide.html#run-and-add-unit-tests).
- [ ] Update the documentation as described in [Write documentations](https://docs.sglang.io/developer_guide/contribution_guide.html#write-documentations).
- [ ] Provide accuracy and speed benchmark results as described in [Test the accuracy](https://docs.sglang.io/developer_guide/contribution_guide.html#test-the-accuracy) and [Benchmark the speed](https://docs.sglang.io/developer_guide/contribution_guide.html#benchmark-the-speed).
- [ ] Follow the SGLang [code style guidance](https://docs.sglang.io/developer_guide/contribution_guide.html#code-style-guidance).
## Review and Merge Process
1. Ping Merge Oncalls to start the process. See the [PR Merge Process](https://github.com/sgl-project/sglang/blob/main/.github/MAINTAINER.md#pull-request-merge-process).
2. Get approvals from [CODEOWNERS](https://github.com/sgl-project/sglang/blob/main/.github/CODEOWNERS) and other reviewers.
3. Trigger CI tests with [comments](https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests) or contact authorized users to do so.
- Common commands include `/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`
4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


@@ -0,0 +1,244 @@
"""
Update the CI permissions configuration file.
This script updates the `CI_PERMISSIONS.json` file, which defines the CI permissions granted to each user.
The format of `CI_PERMISSIONS.json` is as follows:
{
    "username1": {
        "can_tag_run_ci_label": true,
        "can_rerun_failed_ci": true,
        "can_rerun_stage": true,
        "cooldown_interval_minutes": 0,
        "reason": "top contributor"
    },
    "username2": {
        "can_tag_run_ci_label": true,
        "can_rerun_failed_ci": true,
        "can_rerun_stage": true,
        "cooldown_interval_minutes": 60,
        "reason": "custom override"
    }
}
Permissions are assigned according to the following rules:
1. Add the top 50 contributors from the last 120 days with full permissions, no cooldown, and the reason "top contributor".
2. Load all users from the existing `CI_PERMISSIONS.json` file and update their entries as follows:
- If a user is already covered by rule 1, skip that user.
- If the old reason of a user is "top contributor" but they are not in the current top contributors list, change their configuration to:
{
    "can_tag_run_ci_label": true,
    "can_rerun_failed_ci": true,
    "can_rerun_stage": true,
    "cooldown_interval_minutes": 60,
    "reason": "custom override"
}
- For all other cases, preserve the original configuration unchanged.
3. All other users receive no permissions and a 120-minute cooldown (they are omitted from the file).
Usage:
export GH_TOKEN="your_github_token"
python3 update_ci_permission.py
# Sort-only mode (no network calls, no GH_TOKEN required)
python3 update_ci_permission.py --sort-only
"""
import argparse
import json
import os
from collections import Counter
from datetime import datetime, timedelta, timezone
try:
import requests
except ImportError:
requests = None # Only needed for non-sort-only runs
# Configuration
REPO_OWNER = "sgl-project"
REPO_NAME = "sglang"
FILE_NAME = os.path.join(os.path.dirname(__file__), "CI_PERMISSIONS.json")
HEADERS = {}
def github_api_get(endpoint, params=None):
"""Helper to make paginated GitHub API requests."""
if requests is None:
raise RuntimeError(
"The requests package is required. Install it or use --sort-only."
)
if not HEADERS:
raise RuntimeError(
"GitHub headers not initialized. Set GH_TOKEN or use --sort-only."
)
results = []
url = f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/{endpoint}"
while url:
response = requests.get(url, headers=HEADERS, params=params, timeout=60)
if response.status_code != 200:
print(f"Error fetching {url}: {response.status_code} {response.text}")
# On a failed fetch, stop and return what was collected so far instead of raising
break
data = response.json()
if isinstance(data, list):
results.extend(data)
else:
return data # Non-list response (not paginated usually)
# Handle pagination
url = None
if "link" in response.headers:
links = response.headers["link"].split(", ")
for link in links:
if 'rel="next"' in link:
url = link[link.find("<") + 1 : link.find(">")]
params = None # Params are included in the next link
break
return results
def get_write_access_users():
"""Fetches users with push (write) or admin access."""
print("Fetching collaborators with write access...")
# Note: This endpoint usually requires admin rights on the token.
collaborators = github_api_get("collaborators", params={"per_page": 100})
writers = set()
for col in collaborators:
perms = col.get("permissions", {})
# Check for admin, maintain, or push rights
if perms.get("admin") or perms.get("maintain") or perms.get("push"):
writers.add(col["login"])
print(f"Found {len(writers)} users with write access.")
return writers
def get_top_contributors(days, limit):
"""Fetches top contributors based on commit count in the last N days."""
print(f"Fetching commits from the last {days} days...")
since_date = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
# Fetch commits
commits = github_api_get("commits", params={"since": since_date, "per_page": 100})
author_counts = Counter()
for commit in commits:
# commit['author'] contains the GitHub user object (can be None if not linked)
if commit.get("author") and "login" in commit["author"]:
author_counts[commit["author"]["login"]] += 1
top_users = [user for user, _ in author_counts.most_common(limit)]
print(f"Found {len(top_users)} top contributors in the last {days} days.")
return set(top_users)
def load_existing_permissions():
if os.path.exists(FILE_NAME):
try:
with open(FILE_NAME, "r") as f:
return json.load(f)
except json.JSONDecodeError:
print(f"Warning: {FILE_NAME} is invalid JSON. Starting fresh.")
return {}
def sort_permissions_file():
"""Sort the existing CI permissions file alphabetically and exit."""
if not os.path.exists(FILE_NAME):
print(f"{FILE_NAME} not found. Nothing to sort.")
return
old_permissions = load_existing_permissions()
sorted_permissions = dict(sorted(old_permissions.items()))
with open(FILE_NAME, "w") as f:
json.dump(sorted_permissions, f, indent=4)
f.write("\n")
print(f"Sorted {FILE_NAME}. Total users: {len(sorted_permissions)}")
def main():
parser = argparse.ArgumentParser(description="Update or sort CI permissions.")
parser.add_argument(
"--sort-only",
action="store_true",
help="Only sort CI_PERMISSIONS.json alphabetically without fetching data.",
)
args = parser.parse_args()
if args.sort_only:
sort_permissions_file()
return
gh_token = os.getenv("GH_TOKEN")
if not gh_token:
raise ValueError("Error: GH_TOKEN environment variable is not set.")
global HEADERS
HEADERS = {
"Authorization": f"Bearer {gh_token}",
"Accept": "application/vnd.github+json",
"X-GitHub-Api-Version": "2022-11-28",
}
# Gather Data
try:
write_access_users = get_write_access_users()
except Exception as e:
print(f"Warning: Could not fetch collaborators (check token scope). Error: {e}")
write_access_users = set()
top_contributors = get_top_contributors(days=120, limit=50)
old_permissions = load_existing_permissions()
new_permissions = {}
# Rule 1: Add Top 50 Contributors
for user in top_contributors:
new_permissions[user] = {
"can_tag_run_ci_label": True,
"can_rerun_failed_ci": True,
"can_rerun_stage": True,
"cooldown_interval_minutes": 0,
"reason": "top contributor",
}
# Rule 2: Process Existing Users (Merge Logic)
for user, config in old_permissions.items():
if user in new_permissions:
# Already handled by Rule 1
continue
old_reason = config.get("reason", "")
# If they fell off the top contributor list
if old_reason in ["top contributor"]:
new_permissions[user] = {
"can_tag_run_ci_label": True,
"can_rerun_failed_ci": True,
"can_rerun_stage": True,
"cooldown_interval_minutes": 60,
"reason": "custom override",
}
else:
# Preserve custom overrides
new_permissions[user] = config
# Save and Sort
# Sorting keys for cleaner diffs
sorted_permissions = dict(sorted(new_permissions.items()))
with open(FILE_NAME, "w") as f:
json.dump(sorted_permissions, f, indent=4)
f.write("\n") # Add trailing newline
print(f"Successfully updated {FILE_NAME}. Total users: {len(sorted_permissions)}")
if __name__ == "__main__":
main()
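
The merge rules applied in `main()` can be exercised without any network access by pulling them into a pure function. The sketch below is illustrative only: `merge_permissions`, the `FULL` constant, and the sample usernames and reasons are hypothetical and are not part of the script.

```python
# Flags granted to top contributors and to former top contributors.
FULL = {
    "can_tag_run_ci_label": True,
    "can_rerun_failed_ci": True,
    "can_rerun_stage": True,
}


def merge_permissions(top_contributors, old_permissions):
    """Apply rules 1 and 2 and return the new mapping, sorted by username."""
    new = {
        user: {**FULL, "cooldown_interval_minutes": 0, "reason": "top contributor"}
        for user in top_contributors
    }
    for user, config in old_permissions.items():
        if user in new:
            continue  # rule 1 already covers this user
        if config.get("reason", "") == "top contributor":
            # Dropped out of the top list: keep access, add a 60-minute cooldown.
            new[user] = {**FULL, "cooldown_interval_minutes": 60, "reason": "custom override"}
        else:
            new[user] = config  # any other entry is preserved unchanged
    return dict(sorted(new.items()))


# Made-up example: alice was a top contributor last time, bob is one now.
old = {
    "alice": {**FULL, "cooldown_interval_minutes": 0, "reason": "top contributor"},
    "carol": {"can_tag_run_ci_label": True, "cooldown_interval_minutes": 30, "reason": "release manager"},
}
merged = merge_permissions({"bob"}, old)
```

In this example `alice` is downgraded to a "custom override" entry with a 60-minute cooldown, `bob` gets full permissions with no cooldown, and `carol` keeps her existing entry.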


@@ -0,0 +1,161 @@
name: AMD AITER Scout
on:
schedule:
- cron: '0 20 * * 1' # Monday 20:00 UTC
- cron: '0 20 * * 4' # Thursday 20:00 UTC
workflow_dispatch:
inputs:
aiter_ref:
description: 'AITER git ref (branch, tag, or SHA). Default: main (latest commit)'
required: false
type: string
default: 'main'
job_filter:
description: 'Comma-separated workflows to run: nightly-amd, nightly-amd-rocm720, pr-test-amd, pr-test-amd-rocm720. Default: all'
required: false
type: string
default: 'all'
continue_on_error:
description: 'Continue running other workflows even if one fails'
required: false
type: boolean
default: true
concurrency:
group: amd-aiter-scout-${{ github.run_id }}
cancel-in-progress: true
jobs:
resolve-aiter:
runs-on: ubuntu-latest
outputs:
aiter_sha: ${{ steps.resolve.outputs.sha }}
run_nightly_amd: ${{ steps.parse.outputs.run_nightly_amd }}
run_nightly_amd_rocm720: ${{ steps.parse.outputs.run_nightly_amd_rocm720 }}
run_pr_test_amd: ${{ steps.parse.outputs.run_pr_test_amd }}
run_pr_test_amd_rocm720: ${{ steps.parse.outputs.run_pr_test_amd_rocm720 }}
steps:
- name: Resolve AITER commit
id: resolve
run: |
REF="${{ inputs.aiter_ref || 'main' }}"
echo "Resolving AITER ref: ${REF}"
SHA=$(git ls-remote https://github.com/ROCm/aiter.git "refs/heads/${REF}" | head -1 | cut -f1)
if [ -z "$SHA" ]; then
SHA=$(git ls-remote https://github.com/ROCm/aiter.git "refs/tags/${REF}" | head -1 | cut -f1)
fi
if [ -z "$SHA" ]; then
SHA=$(git ls-remote https://github.com/ROCm/aiter.git "${REF}" | head -1 | cut -f1)
fi
if [ -z "$SHA" ]; then
SHA="${REF}"
fi
echo "sha=${SHA}" >> $GITHUB_OUTPUT
echo "### AITER Ref Resolution" >> $GITHUB_STEP_SUMMARY
echo "- **Requested ref:** \`${REF}\`" >> $GITHUB_STEP_SUMMARY
echo "- **Resolved SHA:** \`${SHA}\`" >> $GITHUB_STEP_SUMMARY
echo "- **AITER commit:** https://github.com/ROCm/aiter/commit/${SHA}" >> $GITHUB_STEP_SUMMARY
- name: Parse job filter
id: parse
run: |
FILTER="${{ inputs.job_filter || 'all' }}"
echo "Job filter: ${FILTER}"
if [[ "$FILTER" == "all" ]]; then
echo "run_nightly_amd=true" >> $GITHUB_OUTPUT
echo "run_nightly_amd_rocm720=true" >> $GITHUB_OUTPUT
echo "run_pr_test_amd=true" >> $GITHUB_OUTPUT
echo "run_pr_test_amd_rocm720=true" >> $GITHUB_OUTPUT
else
# Wrap with commas for exact substring matching (avoids "nightly-amd" matching "nightly-amd-rocm720")
PADDED=",${FILTER// /},"
echo "run_nightly_amd=$(echo "$PADDED" | grep -q ',nightly-amd,' && echo true || echo false)" >> $GITHUB_OUTPUT
echo "run_nightly_amd_rocm720=$(echo "$PADDED" | grep -q ',nightly-amd-rocm720,' && echo true || echo false)" >> $GITHUB_OUTPUT
echo "run_pr_test_amd=$(echo "$PADDED" | grep -q ',pr-test-amd,' && echo true || echo false)" >> $GITHUB_OUTPUT
echo "run_pr_test_amd_rocm720=$(echo "$PADDED" | grep -q ',pr-test-amd-rocm720,' && echo true || echo false)" >> $GITHUB_OUTPUT
fi
echo "### Job Filter" >> $GITHUB_STEP_SUMMARY
echo "- **Filter:** \`${FILTER}\`" >> $GITHUB_STEP_SUMMARY
call-nightly-amd:
if: needs.resolve-aiter.outputs.run_nightly_amd == 'true'
needs: resolve-aiter
uses: ./.github/workflows/nightly-test-amd.yml
secrets: inherit
with:
ref: ${{ github.sha }}
aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
job_filter: 'all'
continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
call-nightly-amd-rocm720:
if: needs.resolve-aiter.outputs.run_nightly_amd_rocm720 == 'true'
needs: resolve-aiter
uses: ./.github/workflows/nightly-test-amd-rocm720.yml
secrets: inherit
with:
ref: ${{ github.sha }}
aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
job_filter: 'all'
continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
call-pr-test-amd:
if: needs.resolve-aiter.outputs.run_pr_test_amd == 'true'
needs: resolve-aiter
uses: ./.github/workflows/pr-test-amd.yml
secrets: inherit
with:
run_all_tests: true
aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
call-pr-test-amd-rocm720:
if: needs.resolve-aiter.outputs.run_pr_test_amd_rocm720 == 'true'
needs: resolve-aiter
uses: ./.github/workflows/pr-test-amd-rocm720.yml
secrets: inherit
with:
run_all_tests: true
aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
check-all-jobs:
if: always()
needs:
- resolve-aiter
- call-nightly-amd
- call-nightly-amd-rocm720
- call-pr-test-amd
- call-pr-test-amd-rocm720
runs-on: ubuntu-latest
steps:
- name: Summary
run: |
echo "## AMD AITER Scout Results" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "- **AITER SHA:** \`${{ needs.resolve-aiter.outputs.aiter_sha }}\`" >> $GITHUB_STEP_SUMMARY
echo "- **AITER commit:** https://github.com/ROCm/aiter/commit/${{ needs.resolve-aiter.outputs.aiter_sha }}" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "| Workflow | Result |" >> $GITHUB_STEP_SUMMARY
echo "|----------|--------|" >> $GITHUB_STEP_SUMMARY
echo "| Nightly AMD (AITER Latest) | \`${{ needs.call-nightly-amd.result }}\` |" >> $GITHUB_STEP_SUMMARY
echo "| Nightly AMD ROCm 7.2 | \`${{ needs.call-nightly-amd-rocm720.result }}\` |" >> $GITHUB_STEP_SUMMARY
echo "| PR Test AMD (AITER Latest) | \`${{ needs.call-pr-test-amd.result }}\` |" >> $GITHUB_STEP_SUMMARY
echo "| PR Test AMD ROCm 7.2 | \`${{ needs.call-pr-test-amd-rocm720.result }}\` |" >> $GITHUB_STEP_SUMMARY
- name: Check if any job failed
run: |
if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then
echo "One or more workflows failed"
exit 1
fi
if [[ "${{ contains(needs.*.result, 'cancelled') }}" == "true" ]]; then
echo "One or more workflows were cancelled"
exit 1
fi
echo "All workflows passed"
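
The `Parse job filter` step above wraps the comma-separated filter in commas so that each job name matches only as a whole entry. A minimal sketch of that idea as a standalone function follows. `filter_has` is a hypothetical helper name that does not exist in the workflow, and it uses `tr` in place of bash's `${FILTER// /}` so that it also runs under plain `sh`.

```shell
# Hypothetical helper (not part of the workflow): prints "true" if the job
# name in $2 is an exact entry of the comma-separated filter in $1.
filter_has() {
  # "all" enables every job, mirroring the first branch of the workflow step.
  if [ "$1" = "all" ]; then
    echo true
    return 0
  fi
  # Strip spaces, then pad with commas so ",nightly-amd," cannot match
  # inside ",nightly-amd-rocm720,".
  padded=",$(printf '%s' "$1" | tr -d ' '),"
  if printf '%s\n' "$padded" | grep -qF -- ",$2,"; then
    echo true
  else
    echo false
  fi
}

filter_has "nightly-amd-rocm720, pr-test-amd" "nightly-amd"   # prints false
filter_has "nightly-amd-rocm720, pr-test-amd" "pr-test-amd"   # prints true
```

Without the padding, a plain substring search for `nightly-amd` would also succeed on `nightly-amd-rocm720`, which is the case the workflow comment warns about.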


@@ -0,0 +1,338 @@
name: AMD CI Job Monitor
on:
schedule:
- cron: '0 0 * * *' # Daily at midnight UTC
pull_request:
paths:
- '.github/workflows/amd-ci-job-monitor.yml'
- 'scripts/ci/utils/query_job_status.py'
workflow_dispatch:
inputs:
hours:
description: 'Time window in hours'
required: false
default: '24'
type: string
job_filter:
description: 'Job name filter (leave empty for all AMD jobs)'
required: false
type: string
jobs:
fetch-actions-data:
name: Fetch Actions Snapshot
runs-on: ubuntu-latest
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install dependencies
run: pip install tabulate
- name: Select workflows for snapshot
id: select-workflows
run: |
if [[ -n "${{ inputs.job_filter }}" ]]; then
echo "workflows=pr-test-amd.yml" >> "$GITHUB_OUTPUT"
else
echo "workflows=pr-test-amd.yml,nightly-test-amd.yml,pr-test-amd-rocm720.yml,nightly-test-amd-rocm720.yml" >> "$GITHUB_OUTPUT"
fi
- name: Fetch Actions data snapshot
timeout-minutes: 30
run: |
python scripts/ci/utils/query_job_status.py \
--repo ${{ github.repository }} \
--workflow "${{ steps.select-workflows.outputs.workflows }}" \
--hours ${{ inputs.hours || '24' }} \
--dump-data-file actions-job-snapshot.json
- name: Upload Actions data snapshot
uses: actions/upload-artifact@v4
with:
name: actions-job-snapshot
path: actions-job-snapshot.json
if-no-files-found: error
# Single job filter mode
custom-report:
name: Custom Job Report
if: ${{ inputs.job_filter }}
needs: fetch-actions-data
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install dependencies
run: pip install tabulate
- name: Download Actions data snapshot
uses: actions/download-artifact@v4
with:
name: actions-job-snapshot
path: ci-data
- name: Generate Custom Job Report
timeout-minutes: 30
run: |
python scripts/ci/utils/query_job_status.py \
--repo ${{ github.repository }} \
--job "${{ inputs.job_filter }}" \
--workflow "pr-test-amd.yml" \
--hours ${{ inputs.hours || '24' }} \
--input-data-file ci-data/actions-job-snapshot.json \
--summary
# Parse workflow files to get job names dynamically
parse-workflows:
name: Parse Workflow Jobs
if: ${{ !inputs.job_filter }}
runs-on: ubuntu-latest
outputs:
pr_jobs: ${{ steps.parse.outputs.pr_jobs }}
nightly_jobs: ${{ steps.parse.outputs.nightly_jobs }}
pr_rocm720_jobs: ${{ steps.parse.outputs.pr_rocm720_jobs }}
nightly_rocm720_jobs: ${{ steps.parse.outputs.nightly_rocm720_jobs }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Parse workflow files
id: parse
run: |
# Parse pr-test-amd.yml and extract job names (exclude utility jobs)
# Excluded: call-gate, check-changes, pr-test-amd-finish, cancel, check-all-jobs
pr_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/pr-test-amd.yml | \
grep -v -E '^(call-gate|check-changes|pr-test-amd-finish|cancel|check-all-jobs)$' | \
jq -R -s -c 'split("\n") | map(select(length > 0))')
echo "pr_jobs=$pr_jobs" >> $GITHUB_OUTPUT
echo "PR jobs: $pr_jobs"
# Parse nightly-test-amd.yml and extract job names (exclude utility jobs)
# Excluded: check-all-jobs
nightly_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/nightly-test-amd.yml | \
grep -v -E '^(check-all-jobs)$' | \
jq -R -s -c 'split("\n") | map(select(length > 0))')
echo "nightly_jobs=$nightly_jobs" >> $GITHUB_OUTPUT
echo "Nightly jobs: $nightly_jobs"
# Parse pr-test-amd-rocm720.yml (exclude utility jobs)
# Excluded: call-gate, check-changes, pr-test-amd-finish, cancel, check-all-jobs
pr_rocm720_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/pr-test-amd-rocm720.yml | \
grep -v -E '^(call-gate|check-changes|pr-test-amd-finish|cancel|check-all-jobs)$' | \
jq -R -s -c 'split("\n") | map(select(length > 0))')
echo "pr_rocm720_jobs=$pr_rocm720_jobs" >> $GITHUB_OUTPUT
echo "PR ROCm 7.2 jobs: $pr_rocm720_jobs"
# Parse nightly-test-amd-rocm720.yml (exclude utility jobs)
# Excluded: check-all-jobs
nightly_rocm720_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/nightly-test-amd-rocm720.yml | \
grep -v -E '^(check-all-jobs)$' | \
jq -R -s -c 'split("\n") | map(select(length > 0))')
echo "nightly_rocm720_jobs=$nightly_rocm720_jobs" >> $GITHUB_OUTPUT
echo "Nightly ROCm 7.2 jobs: $nightly_rocm720_jobs"
# PR CI reports using dynamic matrix
pr-ci-reports:
name: PR - ${{ matrix.job_name }}
needs: [parse-workflows, fetch-actions-data]
if: ${{ !inputs.job_filter }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
job_name: ${{ fromJson(needs.parse-workflows.outputs.pr_jobs) }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install dependencies
run: pip install tabulate
- name: Download Actions data snapshot
uses: actions/download-artifact@v4
with:
name: actions-job-snapshot
path: ci-data
- name: Generate Report
timeout-minutes: 15
run: |
python scripts/ci/utils/query_job_status.py \
--repo ${{ github.repository }} \
--job "${{ matrix.job_name }}" \
--workflow "pr-test-amd.yml" \
--hours ${{ inputs.hours || '24' }} \
--input-data-file ci-data/actions-job-snapshot.json \
--summary
# Nightly AMD test reports using dynamic matrix
nightly-reports:
name: Nightly - ${{ matrix.job_name }}
needs: [parse-workflows, fetch-actions-data]
if: ${{ !inputs.job_filter }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
job_name: ${{ fromJson(needs.parse-workflows.outputs.nightly_jobs) }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install dependencies
run: pip install tabulate
- name: Download Actions data snapshot
uses: actions/download-artifact@v4
with:
name: actions-job-snapshot
path: ci-data
- name: Generate Nightly Report
timeout-minutes: 15
run: |
python scripts/ci/utils/query_job_status.py \
--repo ${{ github.repository }} \
--job "${{ matrix.job_name }}" \
--workflow "nightly-test-amd.yml" \
--hours ${{ inputs.hours || '24' }} \
--input-data-file ci-data/actions-job-snapshot.json \
--summary
# PR ROCm 7.2 CI reports using dynamic matrix
pr-rocm720-ci-reports:
name: PR ROCm720 - ${{ matrix.job_name }}
needs: [parse-workflows, fetch-actions-data]
if: ${{ !inputs.job_filter }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
job_name: ${{ fromJson(needs.parse-workflows.outputs.pr_rocm720_jobs) }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install dependencies
run: pip install tabulate
- name: Download Actions data snapshot
uses: actions/download-artifact@v4
with:
name: actions-job-snapshot
path: ci-data
- name: Generate PR ROCm 7.2 Report
timeout-minutes: 15
run: |
python scripts/ci/utils/query_job_status.py \
--repo ${{ github.repository }} \
--job "${{ matrix.job_name }}" \
--workflow "pr-test-amd-rocm720.yml" \
--hours ${{ inputs.hours || '24' }} \
--input-data-file ci-data/actions-job-snapshot.json \
--summary
# Nightly ROCm 7.2 reports using dynamic matrix
nightly-rocm720-reports:
name: Nightly ROCm720 - ${{ matrix.job_name }}
needs: [parse-workflows, fetch-actions-data]
if: ${{ !inputs.job_filter }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
job_name: ${{ fromJson(needs.parse-workflows.outputs.nightly_rocm720_jobs) }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install dependencies
run: pip install tabulate
- name: Download Actions data snapshot
uses: actions/download-artifact@v4
with:
name: actions-job-snapshot
path: ci-data
- name: Generate Nightly ROCm 7.2 Report
timeout-minutes: 15
run: |
python scripts/ci/utils/query_job_status.py \
--repo ${{ github.repository }} \
--job "${{ matrix.job_name }}" \
--workflow "nightly-test-amd-rocm720.yml" \
--hours ${{ inputs.hours || '24' }} \
--input-data-file ci-data/actions-job-snapshot.json \
--summary
# Runner fleet report - cross-workflow runner analytics in a single pass
runner-fleet-report:
name: Runner Fleet Report
if: ${{ !inputs.job_filter }}
needs: fetch-actions-data
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install dependencies
run: pip install tabulate
- name: Download Actions data snapshot
uses: actions/download-artifact@v4
with:
name: actions-job-snapshot
path: ci-data
- name: Generate Runner Fleet Report
timeout-minutes: 30
run: |
python scripts/ci/utils/query_job_status.py \
--repo ${{ github.repository }} \
--runner-report \
--workflow "pr-test-amd.yml,nightly-test-amd.yml,pr-test-amd-rocm720.yml,nightly-test-amd-rocm720.yml" \
--hours ${{ inputs.hours || '24' }} \
--input-data-file ci-data/actions-job-snapshot.json \
--summary


@@ -0,0 +1,10 @@
name: Auto tune
on:
workflow_dispatch:
jobs:
auto-tune-lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4


@@ -0,0 +1,50 @@
name: Bot Bump Flashinfer Version
on:
workflow_dispatch:
inputs:
new_version:
description: 'New flashinfer version (e.g., 0.6.4)'
required: true
type: string
permissions:
contents: write
pull-requests: write
jobs:
bump-flashinfer-version:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
token: ${{ secrets.GITHUB_TOKEN }}
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install Python dependencies
run: |
pip install tomli
- name: Configure Git and branch
run: |
git config user.name "sglang-bot"
git config user.email "sglang-bot@users.noreply.github.com"
RANDOM_SUFFIX=$(echo $RANDOM | md5sum | head -c 4)
BRANCH_NAME="bot/bump-flashinfer-version-${{ github.event.inputs.new_version }}-${RANDOM_SUFFIX}"
git checkout -b "$BRANCH_NAME"
echo "BRANCH_NAME=$BRANCH_NAME" >> $GITHUB_ENV
- name: Run flashinfer version bump script
run: |
python scripts/release/bump_flashinfer_version.py "${{ github.event.inputs.new_version }}"
- name: Commit and create PR
env:
GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
run: |
bash scripts/release/commit_and_pr.sh "flashinfer" "${{ github.event.inputs.new_version }}" "$BRANCH_NAME"


@@ -0,0 +1,60 @@
name: Bot Bump Kernel Version to SGLang
on:
workflow_dispatch:
permissions:
contents: write
pull-requests: write
jobs:
bump-kernel-version-to-sglang:
runs-on: ubuntu-latest
outputs:
branch_name: ${{ steps.set_output.outputs.branch_name }}
needs_sync: ${{ steps.check_sync.outputs.needs_sync }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
token: ${{ secrets.GITHUB_TOKEN }}
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install Python dependencies
run: |
pip install tomli
- name: Check if sync is needed
id: check_sync
run: |
python scripts/release/check_kernel_version_to_sglang.py
- name: Configure Git and branch
if: steps.check_sync.outputs.needs_sync == 'true'
id: set_output
run: |
git config user.name "sglang-bot"
git config user.email "sglang-bot@users.noreply.github.com"
RANDOM_SUFFIX=$(echo $RANDOM | md5sum | head -c 4)
KERNEL_VERSION="${{ steps.check_sync.outputs.kernel_version }}"
BRANCH_NAME="bot/bump-kernel-version-to-sglang-${KERNEL_VERSION}-${RANDOM_SUFFIX}"
git checkout -b "$BRANCH_NAME"
echo "BRANCH_NAME=$BRANCH_NAME" >> $GITHUB_ENV
echo "KERNEL_VERSION=$KERNEL_VERSION" >> $GITHUB_ENV
echo "branch_name=$BRANCH_NAME" >> $GITHUB_OUTPUT
- name: Run kernel version bump script
if: steps.check_sync.outputs.needs_sync == 'true'
run: |
python scripts/release/bump_kernel_version_to_sglang.py
- name: Commit and create PR
if: steps.check_sync.outputs.needs_sync == 'true'
env:
GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
run: |
bash scripts/release/commit_and_pr_kernel_to_sglang.sh "$KERNEL_VERSION" "$BRANCH_NAME"


@@ -0,0 +1,50 @@
name: Bot Bump Kernel Version
on:
workflow_dispatch:
inputs:
new_version:
description: 'New sgl-kernel version (e.g., 0.3.12)'
required: true
type: string
permissions:
contents: write
pull-requests: write
jobs:
bump-kernel-version:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
token: ${{ secrets.GITHUB_TOKEN }}
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install Python dependencies
run: |
pip install tomli
- name: Configure Git and branch
run: |
git config user.name "sglang-bot"
git config user.email "sglang-bot@users.noreply.github.com"
RANDOM_SUFFIX=$(echo $RANDOM | md5sum | head -c 4)
BRANCH_NAME="bot/bump-kernel-version-${{ github.event.inputs.new_version }}-${RANDOM_SUFFIX}"
git checkout -b "$BRANCH_NAME"
echo "BRANCH_NAME=$BRANCH_NAME" >> $GITHUB_ENV
- name: Run kernel version bump script
run: |
python scripts/release/bump_kernel_version.py "${{ github.event.inputs.new_version }}"
- name: Commit and create PR
env:
GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
run: |
bash scripts/release/commit_and_pr.sh "sgl-kernel" "${{ github.event.inputs.new_version }}" "$BRANCH_NAME"


@@ -0,0 +1,89 @@
name: Bot Bump SGLang Version
on:
workflow_dispatch:
inputs:
new_version:
description: 'New SGLang version (e.g., 0.5.3 or 0.5.3rc0)'
required: true
type: string
permissions:
contents: write
pull-requests: write
jobs:
bump-sglang-version:
runs-on: ubuntu-latest
outputs:
branch_name: ${{ steps.set_output.outputs.branch_name }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
token: ${{ secrets.GITHUB_TOKEN }}
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install Python dependencies
run: |
pip install tomli
- name: Configure Git and branch
id: set_output
run: |
git config user.name "sglang-bot"
git config user.email "sglang-bot@users.noreply.github.com"
RANDOM_SUFFIX=$(echo $RANDOM | md5sum | head -c 4)
BRANCH_NAME="bot/bump-sglang-version-${{ github.event.inputs.new_version }}-${RANDOM_SUFFIX}"
git checkout -b "$BRANCH_NAME"
echo "BRANCH_NAME=$BRANCH_NAME" >> $GITHUB_ENV
echo "branch_name=$BRANCH_NAME" >> $GITHUB_OUTPUT
- name: Run SGLang version bump script
run: |
python scripts/release/bump_sglang_version.py "${{ github.event.inputs.new_version }}"
- name: Commit and create PR
env:
GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
run: |
bash scripts/release/commit_and_pr.sh "SGLang" "${{ github.event.inputs.new_version }}" "$BRANCH_NAME"
run-nightly-tests-nvidia:
needs: bump-sglang-version
uses: ./.github/workflows/nightly-test-nvidia.yml
with:
ref: ${{ needs.bump-sglang-version.outputs.branch_name }}
secrets: inherit
run-nightly-tests-amd:
needs: bump-sglang-version
uses: ./.github/workflows/nightly-test-amd.yml
with:
ref: ${{ needs.bump-sglang-version.outputs.branch_name }}
secrets: inherit
run-nightly-tests-npu:
needs: bump-sglang-version
uses: ./.github/workflows/nightly-test-npu.yml
with:
ref: ${{ needs.bump-sglang-version.outputs.branch_name }}
secrets: inherit
run-pr-tests-xeon:
needs: bump-sglang-version
uses: ./.github/workflows/pr-test-xeon.yml
with:
ref: ${{ needs.bump-sglang-version.outputs.branch_name }}
secrets: inherit
run-pr-tests-xpu:
needs: bump-sglang-version
uses: ./.github/workflows/pr-test-xpu.yml
with:
ref: ${{ needs.bump-sglang-version.outputs.branch_name }}
secrets: inherit


@@ -0,0 +1,182 @@
name: Bot Cherry Pick to Release Branch
on:
workflow_dispatch:
inputs:
commit_sha:
description: 'Commit SHA to cherry-pick (full or short hash)'
required: true
type: string
target_branch:
description: 'Target release branch (e.g., release/v0.5.7)'
required: true
type: string
create_pr:
description: 'Create a PR instead of pushing directly'
required: false
type: boolean
default: true
permissions:
contents: write
pull-requests: write
concurrency:
group: cherry-pick-${{ github.event.inputs.target_branch }}
cancel-in-progress: false
jobs:
cherry-pick:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-latest
environment: 'prod'
steps:
- name: Validate inputs
env:
TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
run: |
if [[ ! "$TARGET_BRANCH" =~ ^release/v[0-9]+\.[0-9]+(\.[0-9]+)?$ ]]; then
echo "::error::Target branch must match pattern 'release/vX.Y' or 'release/vX.Y.Z' (e.g., release/v0.5.7)"
exit 1
fi
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
token: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
- name: Configure Git
run: |
git config user.name "sglang-bot"
git config user.email "sglang-bot@users.noreply.github.com"
- name: Validate target branch exists
env:
TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
run: |
git fetch origin
if ! git ls-remote --exit-code --heads origin "$TARGET_BRANCH" > /dev/null 2>&1; then
echo "::error::Target branch '$TARGET_BRANCH' does not exist on remote"
exit 1
fi
- name: Get commit info
id: commit_info
env:
COMMIT_SHA_INPUT: ${{ github.event.inputs.commit_sha }}
run: |
# Verify commit exists
if ! git cat-file -t "$COMMIT_SHA_INPUT" > /dev/null 2>&1; then
echo "::error::Commit SHA '$COMMIT_SHA_INPUT' does not exist"
exit 1
fi
# Get full SHA if short hash provided
FULL_SHA=$(git rev-parse "$COMMIT_SHA_INPUT")
COMMIT_TITLE=$(git log -1 --format="%s" "$FULL_SHA")
SHORT_SHA=$(git rev-parse --short "$FULL_SHA")
echo "full_sha=$FULL_SHA" >> $GITHUB_OUTPUT
echo "short_sha=$SHORT_SHA" >> $GITHUB_OUTPUT
# Use delimiter for multiline-safe output
{
echo "commit_title<<EOF"
echo "$COMMIT_TITLE"
echo "EOF"
} >> $GITHUB_OUTPUT
echo "Cherry-picking commit: $SHORT_SHA - $COMMIT_TITLE"
- name: Cherry-pick commit
id: cherry_pick
env:
TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
FULL_SHA: ${{ steps.commit_info.outputs.full_sha }}
SHORT_SHA: ${{ steps.commit_info.outputs.short_sha }}
CREATE_PR: ${{ github.event.inputs.create_pr }}
run: |
if [[ "$CREATE_PR" == "true" ]]; then
# Create a new branch for the PR
RANDOM_SUFFIX=$(head -c 4 /dev/urandom | xxd -p)
NEW_BRANCH="cherry-pick/${SHORT_SHA}-to-${TARGET_BRANCH#release/}-${RANDOM_SUFFIX}"
git checkout -b "$NEW_BRANCH" "origin/$TARGET_BRANCH"
echo "new_branch=$NEW_BRANCH" >> $GITHUB_OUTPUT
else
# Checkout target branch directly
git checkout "$TARGET_BRANCH"
fi
# Attempt cherry-pick
if git cherry-pick "$FULL_SHA"; then
echo "cherry_pick_success=true" >> $GITHUB_OUTPUT
else
echo "::error::Cherry-pick failed due to conflicts. Please resolve manually."
git cherry-pick --abort || true
echo "cherry_pick_success=false" >> $GITHUB_OUTPUT
exit 1
fi
- name: Push changes
if: steps.cherry_pick.outputs.cherry_pick_success == 'true'
env:
CREATE_PR: ${{ github.event.inputs.create_pr }}
TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
NEW_BRANCH: ${{ steps.cherry_pick.outputs.new_branch }}
run: |
if [[ "$CREATE_PR" == "true" ]]; then
git push origin "$NEW_BRANCH"
else
git push origin "$TARGET_BRANCH"
fi
- name: Create Pull Request
if: steps.cherry_pick.outputs.cherry_pick_success == 'true' && github.event.inputs.create_pr == 'true'
env:
GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
SHORT_SHA: ${{ steps.commit_info.outputs.short_sha }}
COMMIT_TITLE: ${{ steps.commit_info.outputs.commit_title }}
FULL_SHA: ${{ steps.commit_info.outputs.full_sha }}
NEW_BRANCH: ${{ steps.cherry_pick.outputs.new_branch }}
run: |
PR_TITLE="[Cherry-pick] ${COMMIT_TITLE} to ${TARGET_BRANCH}"
gh pr create \
--title "$PR_TITLE" \
--base "$TARGET_BRANCH" \
--head "$NEW_BRANCH" \
--label "cherry-pick" \
--body-file - <<EOF
Cherry-pick of commit ${FULL_SHA} to \`${TARGET_BRANCH}\`
**Original commit:** ${FULL_SHA}
**Original title:** ${COMMIT_TITLE}
---
*This PR was automatically created by the cherry-pick workflow.*
EOF
- name: Summary
if: always()
env:
FULL_SHA: ${{ steps.commit_info.outputs.full_sha }}
COMMIT_TITLE: ${{ steps.commit_info.outputs.commit_title }}
TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
CHERRY_PICK_SUCCESS: ${{ steps.cherry_pick.outputs.cherry_pick_success }}
CREATE_PR: ${{ github.event.inputs.create_pr }}
NEW_BRANCH: ${{ steps.cherry_pick.outputs.new_branch }}
ACTOR: ${{ github.actor }}
run: |
echo "## Cherry-Pick Summary" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "- **Triggered by:** @${ACTOR}" >> $GITHUB_STEP_SUMMARY
echo "- **Commit:** ${FULL_SHA}" >> $GITHUB_STEP_SUMMARY
echo "- **Title:** ${COMMIT_TITLE}" >> $GITHUB_STEP_SUMMARY
echo "- **Target Branch:** ${TARGET_BRANCH}" >> $GITHUB_STEP_SUMMARY
if [[ "$CHERRY_PICK_SUCCESS" == "true" ]]; then
echo "- **Status:** ✅ Success" >> $GITHUB_STEP_SUMMARY
else
echo "- **Status:** ❌ Failed" >> $GITHUB_STEP_SUMMARY
fi
if [[ "$CREATE_PR" == "true" && "$CHERRY_PICK_SUCCESS" == "true" ]]; then
echo "- **PR Branch:** ${NEW_BRANCH}" >> $GITHUB_STEP_SUMMARY
fi
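
The cherry-pick workflow above accepts only `release/vX.Y` or `release/vX.Y.Z` targets and derives the PR branch name from the short SHA and the target branch. A minimal sketch of both pieces follows. `is_release_branch` and `cherry_pick_branch` are hypothetical helper names, and the check uses `grep -E` rather than bash's `[[ ... =~ ... ]]` so that it runs under plain `sh`.

```shell
# Hypothetical helper: succeeds only for release/vX.Y or release/vX.Y.Z.
is_release_branch() {
  printf '%s\n' "$1" | grep -Eq '^release/v[0-9]+\.[0-9]+(\.[0-9]+)?$'
}

# Hypothetical helper: builds the PR branch name the way the workflow does.
# $1 = short SHA, $2 = target branch, $3 = random suffix
cherry_pick_branch() {
  # ${2#release/} drops the leading "release/" so the name stays short.
  printf 'cherry-pick/%s-to-%s-%s\n' "$1" "${2#release/}" "$3"
}

if is_release_branch "release/v0.5.7"; then
  cherry_pick_branch "abc1234" "release/v0.5.7" "9f3e"
  # prints cherry-pick/abc1234-to-v0.5.7-9f3e
fi
```

Validating the pattern before checkout means a mistyped target such as `main` is rejected before any Git state changes.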


@@ -0,0 +1,22 @@
name: Cancel PR Workflows on Merge
on:
pull_request_target:
types:
- closed
permissions:
actions: write
jobs:
cancel:
if: github.event.pull_request.merged == true
runs-on: ubuntu-latest
steps:
- name: Cancel Previous Runs
uses: styfle/cancel-workflow-action@0.12.1
with:
workflow_id: all
access_token: ${{ secrets.GITHUB_TOKEN }}
ignore_sha: true
pr_number: ${{ github.event.pull_request.number }}


@@ -0,0 +1,155 @@
name: Cancel Unfinished PR Runs
on:
workflow_dispatch:
inputs:
workflows:
description: 'Space-separated list of workflow filenames to cancel'
required: true
type: string
default: 'pr-test.yml'
include_high_priority:
description: 'Also cancel runs from high-priority PRs'
required: false
type: boolean
default: false
permissions:
actions: write # Needed to cancel runs
contents: read # Needed to read repo info
pull-requests: read # Needed for gh pr view (labels)
jobs:
cancel-unfinished-pr-runs:
runs-on: ubuntu-latest
steps:
- name: Install GitHub CLI
run: sudo apt-get install -y gh jq
- name: Cancel unfinished PR-associated runs (skip high-priority PRs)
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
WORKFLOWS: ${{ github.event.inputs.workflows || 'pr-test.yml' }}
INCLUDE_HIGH_PRIORITY: ${{ github.event.inputs.include_high_priority || 'false' }}
shell: bash
run: |
set -euo pipefail
# Read the space-separated string from the input into a bash array
read -r -a WORKFLOW_FILES <<< "${WORKFLOWS}"
echo "Targeting ${#WORKFLOW_FILES[@]} workflow(s): ${WORKFLOWS}"
echo ""
for workflow_file in "${WORKFLOW_FILES[@]}"; do
echo "========================================="
echo "Workflow: $workflow_file"
echo "========================================="
# Get all unfinished runs
all_runs=$(gh run list \
--repo "$REPO" \
--workflow "$workflow_file" \
--json databaseId,status,event,url,createdAt \
--limit 1000 \
| jq -c '.[] | select(.status=="queued" or .status=="waiting" or .status=="in_progress")')
if [ -z "$all_runs" ]; then
echo "✅ No unfinished runs found"
echo ""
continue
fi
# Count runs by event type
total_runs=$(echo "$all_runs" | wc -l)
pr_runs=$(echo "$all_runs" | jq -s '[.[] | select(.event=="pull_request")] | length')
other_runs=$(echo "$all_runs" | jq -s '[.[] | select(.event!="pull_request")] | length')
echo "📊 Summary: $total_runs unfinished runs ($pr_runs PR-related, $other_runs other)"
echo ""
# Process non-PR runs first
if [ "$other_runs" -gt 0 ]; then
echo "--- Non-PR Runs ---"
echo "$all_runs" | jq -c 'select(.event!="pull_request")' | while read -r run; do
run_url=$(echo "$run" | jq -r '.url')
run_event=$(echo "$run" | jq -r '.event')
run_status=$(echo "$run" | jq -r '.status')
echo " • $run_event ($run_status): $run_url"
done
echo ""
fi
# Process PR runs
if [ "$pr_runs" -gt 0 ]; then
echo "--- PR Runs (checking for cancellation) ---"
echo "$all_runs" | jq -c 'select(.event=="pull_request")' | while read -r run; do
run_id=$(echo "$run" | jq -r '.databaseId')
run_url=$(echo "$run" | jq -r '.url')
run_status=$(echo "$run" | jq -r '.status')
echo ""
echo "Run ($run_status): $run_url"
# Fetch full run details to get head repository and branch info
run_details=$(gh api -H "Accept: application/vnd.github+json" \
"repos/$REPO/actions/runs/$run_id" 2>/dev/null || true)
if [ -z "$run_details" ]; then
echo " ⚠️ Could not fetch run details, skipping"
continue
fi
# Get head owner and branch (works for both fork and non-fork PRs)
head_owner=$(echo "$run_details" | jq -r '.head_repository.owner.login // empty')
head_branch=$(echo "$run_details" | jq -r '.head_branch // empty')
if [ -z "$head_owner" ] || [ -z "$head_branch" ]; then
echo " ⚠️ Missing head info, skipping"
continue
fi
echo " Branch: ${head_owner}:${head_branch}"
# Find PR by searching with head=owner:branch
pr_number=$(gh api -H "Accept: application/vnd.github+json" \
"repos/$REPO/pulls?state=open&head=${head_owner}:${head_branch}" \
--jq '.[0].number // empty' 2>/dev/null || true)
if [ -z "$pr_number" ]; then
echo " ⚠️ No open PR found, skipping"
continue
fi
pr_url="https://github.com/$REPO/pull/$pr_number"
echo " PR: $pr_url"
# Check for bypass-maintenance and high priority labels
labels=$(gh pr view "$pr_number" --repo "$REPO" --json labels \
| jq -r '.labels[].name' 2>/dev/null || true)
if echo "$labels" | grep -Fxq "bypass-maintenance"; then
echo " 🛑 Skipping (bypass-maintenance label, never cancelled)"
continue
fi
if echo "$labels" | grep -Fxq "high priority"; then
if [ "$INCLUDE_HIGH_PRIORITY" != "true" ]; then
echo " 🛑 Skipping (high priority label)"
continue
fi
echo " ⚠️ High priority PR, but include_high_priority is enabled"
fi
echo " 🚫 Cancelling..."
gh run cancel "$run_id" --repo "$REPO" || echo " ⚠️ Cancellation failed"
done
fi
echo ""
done
echo "========================================="
echo "✅ Processing complete"
echo "========================================="
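
The label handling in the step above depends on `grep -Fxq`, which matches a fixed string against a whole line, so a label such as `high priority review` does not count as `high priority`. A minimal sketch of the decision logic alone follows. `cancel_decision` is a hypothetical helper name, and the labels are passed as a newline-separated string rather than fetched with `gh`.

```shell
# Hypothetical helper: prints "skip" or "cancel" for one PR run.
# $1 = newline-separated label names, $2 = include_high_priority ("true"/"false")
cancel_decision() {
  # bypass-maintenance always wins: such runs are never cancelled.
  if printf '%s\n' "$1" | grep -Fxq "bypass-maintenance"; then
    echo skip
  # high priority runs are skipped unless the caller opted in.
  elif printf '%s\n' "$1" | grep -Fxq "high priority" && [ "$2" != "true" ]; then
    echo skip
  else
    echo cancel
  fi
}

labels="$(printf 'bug\nhigh priority')"
cancel_decision "$labels" false   # prints skip
cancel_decision "$labels" true    # prints cancel
```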


@@ -0,0 +1,154 @@
name: CI Coverage Overview
on:
schedule:
- cron: '0 6 * * *' # Daily at 6 AM UTC
pull_request:
paths:
- '.github/workflows/ci-coverage-overview.yml'
- 'scripts/ci/utils/ci_coverage_report.py'
- 'test/registered/**'
workflow_dispatch:
inputs:
output_format:
description: 'Output format'
required: false
default: 'markdown'
type: choice
options:
- markdown
- json
jobs:
summary:
name: Summary
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Generate Summary Report
run: |
python scripts/ci/utils/ci_coverage_report.py --section summary
by-folder:
name: Tests by Folder
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Generate Tests by Folder Report
run: |
python scripts/ci/utils/ci_coverage_report.py --section by-folder
by-suite:
name: Tests by Suite
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Generate Tests by Suite Report
run: |
python scripts/ci/utils/ci_coverage_report.py --section by-suite
unit-test-coverage:
name: Unit Test Code Coverage
if: github.event_name != 'pull_request'
runs-on: 1-gpu-h100
timeout-minutes: 30
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install dependencies
timeout-minutes: 10
run: |
pip install -e "python/[test]"
- name: Run unit tests with coverage
timeout-minutes: 10
run: |
pytest test/registered/unit/ \
--cov --cov-config=.coveragerc \
--cov-report=term-missing:skip-covered \
--continue-on-collection-errors \
-v | tee coverage_output.txt
- name: Write coverage to summary
if: always()
run: |
echo "## Unit Test Code Coverage" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "**Commit:** \`${GITHUB_SHA::8}\` | **Branch:** \`${GITHUB_REF_NAME}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
# Test result line (e.g., "== 42 passed, 1 failed in 23.5s ==")
echo '```' >> $GITHUB_STEP_SUMMARY
grep -E '^=+.*passed' coverage_output.txt >> $GITHUB_STEP_SUMMARY || true
echo "" >> $GITHUB_STEP_SUMMARY
# Coverage total
grep -E '^TOTAL ' coverage_output.txt >> $GITHUB_STEP_SUMMARY || true
echo '```' >> $GITHUB_STEP_SUMMARY
# Partially covered core modules (1-49%) — most actionable for contributors
# Only show modules with testable logic; skip configs, models, layers, etc.
LOW_COV=$(awk '/^python\/.*%/ {
for (i=1; i<=NF; i++) {
if ($i ~ /^[0-9]+%$/) {
pct = $i + 0
if (pct >= 1 && pct < 50) printf "%-80s %5s %s\n", $1, $(i-2), $i
break
}
}
}' coverage_output.txt \
| grep -E '/(mem_cache|managers|sampling|parser|observability|function_call|entrypoints|speculative|multimodal|utils)/' \
| head -40 || true)
if [ -n "$LOW_COV" ]; then
echo "" >> $GITHUB_STEP_SUMMARY
echo "<details><summary>Core modules with coverage below 50% — good candidates for more unit tests</summary>" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo '```' >> $GITHUB_STEP_SUMMARY
echo "$LOW_COV" >> $GITHUB_STEP_SUMMARY
echo '```' >> $GITHUB_STEP_SUMMARY
echo "</details>" >> $GITHUB_STEP_SUMMARY
fi
json-export:
name: JSON Export
runs-on: ubuntu-latest
if: inputs.output_format == 'json'
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Generate JSON Report
run: |
python scripts/ci/utils/ci_coverage_report.py --output-format json > ci_coverage.json
- name: Upload JSON artifact
uses: actions/upload-artifact@v4
with:
name: ci-coverage-report
path: ci_coverage.json
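
The low-coverage filter in the `unit-test-coverage` job above is written as an `awk | grep | head` pipeline. As a reading aid, the same selection can be sketched in Python. This is a minimal sketch, not the project's code: the function name `low_coverage_rows` and the sample report layout are assumptions, and rows whose percentage sits in the first two columns are skipped here rather than reproduced exactly.

```python
import re

# Directory names copied from the grep filter in the workflow step.
CORE_DIRS = (
    "mem_cache", "managers", "sampling", "parser", "observability",
    "function_call", "entrypoints", "speculative", "multimodal", "utils",
)
CORE_RE = re.compile(r"/(" + "|".join(CORE_DIRS) + r")/")
PCT_RE = re.compile(r"^[0-9]+%$")


def low_coverage_rows(report_lines, limit=40):
    """Return (path, field two columns before the percentage, percentage)
    for core modules whose coverage is between 1% and 49%."""
    rows = []
    for line in report_lines:
        fields = line.split()
        if not fields or not fields[0].startswith("python/"):
            continue
        # Use the first field shaped like "NN%", as the awk loop does.
        pct_idx = next((i for i, f in enumerate(fields) if PCT_RE.match(f)), None)
        if pct_idx is None or pct_idx < 2:
            continue  # hypothetical guard: no column two places to the left
        pct = int(fields[pct_idx][:-1])
        if 1 <= pct < 50 and CORE_RE.search(fields[0]):
            rows.append((fields[0], fields[pct_idx - 2], fields[pct_idx]))
    return rows[:limit]
```

Modules at 0% are excluded on purpose by the `pct >= 1` bound, which matches the workflow comment about showing only partially covered modules.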

@@ -0,0 +1,72 @@
name: CI Failure Monitor
on:
schedule:
- cron: '0 */12 * * *' # Every 12 hours
workflow_dispatch:
concurrency:
group: ci-failure-monitor-${{ github.ref }}
cancel-in-progress: true
permissions:
contents: read
actions: read
jobs:
failure-analysis:
if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.14'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install requests slack_sdk
- name: Run Failure Analysis
env:
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
GH_PAT_FOR_RUNNER_ADMIN: ${{ secrets.GH_PAT_FOR_RUNNER_ADMIN }}
PYTHONUNBUFFERED: 1
PYTHONIOENCODING: utf-8
run: |
cd scripts/ci_monitor
python ci_failures_analysis.py \
--token "$GITHUB_TOKEN" \
--limit 100 \
--output ci_failure_analysis_$(date +%Y%m%d_%H%M%S).json
- name: Upload Analysis Results
uses: actions/upload-artifact@v4
with:
name: ci-failure-analysis-${{ github.run_number }}
path: |
scripts/ci_monitor/ci_failure_analysis_*.json
retention-days: 7
- name: Send Slack Notification
if: always()
env:
SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }}
run: |
cd scripts/ci_monitor
LATEST_REPORT=$(ls -t ci_failure_analysis_*.json | head -1)
if [ ! -f "$LATEST_REPORT" ]; then
echo "No report found, so skipping Slack notification"
exit 0
fi
if [ -n "$SGLANG_DIFFUSION_SLACK_TOKEN" ]; then
python3 post_ci_failures_to_slack.py --report-file "$LATEST_REPORT"
else
echo "SGLANG_DIFFUSION_SLACK_TOKEN not configured, skipping notification"
fi

@@ -0,0 +1,96 @@
name: Close Inactive Issues
on:
schedule:
- cron: '0 0 * * *'
workflow_dispatch:
permissions:
issues: write
contents: read
jobs:
close-inactive-issues:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-latest
steps:
- name: Check and close inactive issues
uses: actions/github-script@v6
with:
github-token: ${{secrets.GITHUB_TOKEN}}
script: |
const sixtyDaysAgo = new Date(Date.now() - 60 * 24 * 60 * 60 * 1000);
const [owner, repo] = process.env.GITHUB_REPOSITORY.split('/');
console.log(`Owner: ${owner}, Repo: ${repo}`);
async function fetchIssues(page = 1) {
console.log(`Fetching issues for ${owner}/${repo}, page ${page}`);
return await github.rest.issues.listForRepo({
owner,
repo,
state: 'open',
sort: 'updated',
direction: 'asc',
per_page: 100,
page: page
});
}
async function processIssues() {
console.log('Starting to process issues');
console.log(`Repository: ${owner}/${repo}`);
let page = 1;
let hasMoreIssues = true;
while (hasMoreIssues) {
try {
const issues = await fetchIssues(page);
console.log(`Fetched ${issues.data.length} issues on page ${page}`);
if (issues.data.length === 0) {
hasMoreIssues = false;
break;
}
for (const issue of issues.data) {
// Skip if the issue has 'good first issue' label
if (issue.labels.some(label => label.name === 'good first issue')) {
console.log(`Skipping issue #${issue.number} as it's marked as 'good first issue'`);
continue;
}
if (new Date(issue.updated_at) < sixtyDaysAgo) {
try {
await github.rest.issues.update({
owner,
repo,
issue_number: issue.number,
state: 'closed',
labels: [...issue.labels.map(l => l.name), 'inactive']
});
await github.rest.issues.createComment({
owner,
repo,
issue_number: issue.number,
body: 'This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.'
});
console.log(`Closed issue #${issue.number} due to inactivity.`);
} catch (error) {
console.error(`Failed to close issue #${issue.number}: ${error.message}`);
}
} else {
console.log(`Issue #${issue.number} is still active. Stopping processing.`);
hasMoreIssues = false;
break;
}
}
page += 1;
} catch (error) {
console.error(`Error fetching issues on page ${page}: ${error.message}`);
hasMoreIssues = false;
}
}
console.log('Finished processing issues');
}
await processIssues();
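
The selection logic in the script above relies on the issues being listed by `updated_at` in ascending order, so it can stop at the first issue that is still active. A minimal Python sketch of that logic follows. The function name `select_issues_to_close` and the dictionary shape of an issue are assumptions made for illustration; the real script works on GitHub API objects and also posts a comment and adds the `inactive` label.

```python
from datetime import datetime, timedelta, timezone


def select_issues_to_close(issues, now, max_idle_days=60,
                           exempt_label="good first issue"):
    """Pick issue numbers to close from a list sorted by updated_at ascending."""
    cutoff = now - timedelta(days=max_idle_days)
    to_close = []
    for issue in issues:
        # Exempt issues are skipped before the age check, as in the script.
        if exempt_label in issue["labels"]:
            continue
        if issue["updated_at"] < cutoff:
            to_close.append(issue["number"])
        else:
            # Everything after this point was updated more recently.
            break
    return to_close
```

The early `break` is what keeps the job cheap on a large repository: once one recently updated issue is seen, no later page needs to be fetched.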

@@ -0,0 +1,115 @@
name: Diffusion CI Ground Truth Generation
on:
workflow_dispatch:
inputs:
ref:
description: 'Git ref to checkout'
required: false
default: ''
type: string
case_ids:
description: 'Specific case IDs to run (space-separated, optional)'
required: false
default: ''
type: string
concurrency:
group: diffusion-ci-gt-gen-${{ github.ref }}
cancel-in-progress: true
permissions:
contents: write
actions: read
jobs:
multimodal-diffusion-gen-1gpu:
if: github.repository == 'sgl-project/sglang'
runs-on: 1-gpu-h100
strategy:
matrix:
part: [0, 1]
timeout-minutes: 150
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Install dependencies
run: bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Generate outputs
run: |
cd python
python -m sglang.multimodal_gen.test.scripts.gen_diffusion_ci_outputs \
--suite 1-gpu \
--partition-id ${{ matrix.part }} \
--total-partitions 2 \
--out-dir ./diffusion-ci-outputs \
${{ inputs.case_ids != '' && format('--case-ids {0}', inputs.case_ids) || '' }}
- name: Upload artifact
uses: actions/upload-artifact@v4
with:
name: diffusion-gen-1gpu-part${{ matrix.part }}
path: python/diffusion-ci-outputs
retention-days: 7
multimodal-diffusion-gen-2gpu:
if: github.repository == 'sgl-project/sglang'
runs-on: 2-gpu-h100
strategy:
matrix:
part: [0, 1]
timeout-minutes: 150
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Install dependencies
run: bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Generate outputs
run: |
cd python
python -m sglang.multimodal_gen.test.scripts.gen_diffusion_ci_outputs \
--suite 2-gpu \
--partition-id ${{ matrix.part }} \
--total-partitions 2 \
--out-dir ./diffusion-ci-outputs \
${{ inputs.case_ids != '' && format('--case-ids {0}', inputs.case_ids) || '' }}
- name: Upload artifact
uses: actions/upload-artifact@v4
with:
name: diffusion-gen-2gpu-part${{ matrix.part }}
path: python/diffusion-ci-outputs
retention-days: 7
diffusion-ci-push:
needs: [multimodal-diffusion-gen-1gpu, multimodal-diffusion-gen-2gpu]
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Download artifacts
uses: actions/download-artifact@v4
with:
pattern: diffusion-gen-*
path: combined
merge-multiple: true
- name: Collect image files
run: |
mkdir -p gt_images
find combined \( -name "*.png" -o -name "*.jpg" -o -name "*.jpeg" -o -name "*.webp" \) -type f -exec cp -f {} gt_images/ \;
- name: Publish GT images to sglang-bot/sglang-ci-data
env:
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
run: python scripts/ci/utils/diffusion/publish_diffusion_gt.py --source-dir gt_images
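
The "Collect image files" step above flattens every image found under `combined` into a single `gt_images` directory. A rough Python equivalent of that `find ... -exec cp -f` command is sketched below. The helper name `collect_images` is hypothetical and is not part of `publish_diffusion_gt.py`.

```python
import shutil
from pathlib import Path

# Same extensions as the find expression; matching is case-sensitive there too.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def collect_images(source_dir, dest_dir):
    """Copy every image under source_dir into dest_dir, dropping subdirectories."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(source_dir).rglob("*")):
        if path.is_file() and path.suffix in IMAGE_SUFFIXES:
            # Like cp -f, a later file with the same name overwrites an earlier one.
            shutil.copy(path, dest / path.name)
            copied.append(path.name)
    return copied
```

Because the copy is flat, two artifacts that contain a file with the same name would collide. Whether that can happen depends on how the generation script names its outputs, which is not shown here.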

@@ -0,0 +1,74 @@
name: Execute Notebooks
on:
pull_request:
branches: [ main ]
types: [opened, synchronize, reopened, labeled]
paths:
- "python/sglang/**"
- "docs/**"
- "!python/sglang/**/*.md"
- "!docs/**/*.md"
workflow_dispatch:
concurrency:
group: execute-notebook-${{ github.ref }}
cancel-in-progress: true
env:
SGLANG_IS_IN_CI: true
jobs:
call-gate:
# Align with PR Test: fail fast if PR doesn't have run-ci label.
# This makes /tag-and-rerun-ci work by rerunning this failed workflow.
uses: ./.github/workflows/pr-gate.yml
secrets: inherit
run-all-notebooks:
needs: [call-gate]
runs-on: 1-gpu-h100
if: github.event_name != 'pull_request' || needs.call-gate.result == 'success'
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
pip install -r docs/requirements.txt
apt-get update && apt-get install -y pandoc parallel retry
ln -sf "$(which python3)" /usr/bin/python
- name: Setup Jupyter Kernel
run: |
python -m ipykernel install --user --name python3 --display-name "Python 3"
- name: Execute notebooks
timeout-minutes: 40
run: |
cd docs
make clean
make compile
notebook-finish:
needs: [
call-gate,
run-all-notebooks
]
runs-on: ubuntu-latest
if: always() && needs.run-all-notebooks.result != 'skipped'
steps:
- name: Check all dependent job statuses
run: |
results=(${{ join(needs.*.result, ' ') }})
for result in "${results[@]}"; do
if [ "$result" = "failure" ] || [ "$result" = "cancelled" ]; then
echo "Job failed with result: $result"
exit 1
fi
done
echo "All jobs completed successfully"
exit 0
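
The `notebook-finish` job above treats only `failure` and `cancelled` as blocking, so a `skipped` or `success` result from an upstream job lets the check pass. That rule can be sketched in Python as follows; `gate_passed` is a hypothetical name used only for this illustration.

```python
def gate_passed(results):
    """Return False if any upstream job result is 'failure' or 'cancelled'."""
    blocking = {"failure", "cancelled"}
    return not any(result in blocking for result in results)
```

For example, `gate_passed(["success", "skipped"])` is true, while `gate_passed(["success", "cancelled"])` is false.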

@@ -0,0 +1,355 @@
name: Full Test (NPU)
on:
# pull_request:
# branches:
# - main
# paths:
# - ".github/workflows/full-test-npu.yml"
workflow_dispatch:
inputs:
ref:
description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.'
required: false
type: string
default: ''
job_filter:
description: 'Select which job to run (leave empty or "all" to run all jobs)'
required: false
type: string
default: 'all'
image_a3:
description: 'Docker image used to run the A3 test jobs.'
required: false
type: string
default: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11'
skip_install_flag:
description: 'Indicates whether to skip the installation of sglang, defaulting to false.'
required: false
type: string
default: 'false'
concurrency:
group: full-test-npu-${{ inputs.ref || github.ref }}
cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
jobs:
set-image-config:
runs-on: ubuntu-latest
outputs:
ref: ${{ steps.set-vars.outputs.ref }}
job_filter: ${{ steps.set-vars.outputs.job_filter }}
image_a3: ${{ steps.set-vars.outputs.image_a3 }}
skip_install_flag: ${{ steps.set-vars.outputs.skip_install_flag }}
steps:
# When triggered by a PR, no input parameters are set, so the latest community code is tested by default.
- name: Set image config
id: set-vars
run: |
if [ -z "${{ inputs.ref }}" ]; then
echo "ref=" >> $GITHUB_OUTPUT
else
echo "ref=${{ inputs.ref }}" >> $GITHUB_OUTPUT
fi
if [ -z "${{ inputs.job_filter }}" ]; then
echo "job_filter=all" >> $GITHUB_OUTPUT
else
echo "job_filter=${{ inputs.job_filter }}" >> $GITHUB_OUTPUT
fi
if [ -z "${{ inputs.image_a3 }}" ]; then
echo "image_a3=swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11" >> $GITHUB_OUTPUT
else
echo "image_a3=${{ inputs.image_a3 }}" >> $GITHUB_OUTPUT
fi
if [ -z "${{ inputs.skip_install_flag }}" ]; then
echo "skip_install_flag=false" >> $GITHUB_OUTPUT
else
echo "skip_install_flag=${{ inputs.skip_install_flag }}" >> $GITHUB_OUTPUT
fi
nighly-test-npu:
needs: [set-image-config]
name: nightly-test-npu
if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
uses: ./.github/workflows/nightly-test-npu.yml
with:
ref: ${{ needs.set-image-config.outputs.ref }}
job_filter: ${{ needs.set-image-config.outputs.job_filter }}
image_a3: ${{ needs.set-image-config.outputs.image_a3 }}
skip_install_flag: ${{ needs.set-image-config.outputs.skip_install_flag }}
secrets: inherit
full-1-npu-a3:
needs: [set-image-config]
if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
runs-on: linux-aarch64-a3-2
container:
image: ${{ needs.set-image-config.outputs.image_a3 }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
bash scripts/ci/npu/npu_ci_install_dependency.sh a3
fi
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Print Log Information
run: |
bash scripts/ci/npu/npu_log_print.sh
- name: Run test
timeout-minutes: 240
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
pip install sglang_router
hf download lmms-lab/MMMU --repo-type dataset
pip install sentence_transformers torchaudio==2.8.0
# Quote ">=" specifiers so the shell does not parse ">" as an output redirect.
pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv
git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd ./lmms-eval
nohup pip install . > lmmslog.txt 2>&1 &
sleep 120
export PYTHONPATH=$PYTHONPATH:$(pwd)
cd ../
cd test
python3 run_suite.py --hw npu --suite full-1-npu-a3 --nightly --continue-on-error --timeout-per-file 3600
full-2-npu-a3:
needs: [set-image-config]
if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
runs-on: linux-aarch64-a3-2
container:
image: ${{ needs.set-image-config.outputs.image_a3 }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
bash scripts/ci/npu/npu_ci_install_dependency.sh a3
fi
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Print Log Information
run: |
bash scripts/ci/npu/npu_log_print.sh
- name: Run test
timeout-minutes: 240
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
pip install sglang_router
hf download lmms-lab/MMMU --repo-type dataset
pip install sentence_transformers torchaudio==2.8.0
# Quote ">=" specifiers so the shell does not parse ">" as an output redirect.
pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv
git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd ./lmms-eval
nohup pip install . > lmmslog.txt 2>&1 &
sleep 120
export PYTHONPATH=$PYTHONPATH:$(pwd)
cd ../
cd test
python3 run_suite.py --hw npu --suite full-2-npu-a3 --nightly --continue-on-error --timeout-per-file 3600
full-4-npu-a3:
needs: [set-image-config]
if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
runs-on: linux-aarch64-a3-4
container:
image: ${{ needs.set-image-config.outputs.image_a3 }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
bash scripts/ci/npu/npu_ci_install_dependency.sh a3
fi
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Print Log Information
run: |
bash scripts/ci/npu/npu_log_print.sh
- name: Run test
timeout-minutes: 240
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
pip install sglang_router
hf download lmms-lab/MMMU --repo-type dataset
pip install sentence_transformers torchaudio==2.8.0
# Quote ">=" specifiers so the shell does not parse ">" as an output redirect.
pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv
git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd ./lmms-eval
nohup pip install . > lmmslog.txt 2>&1 &
sleep 120
export PYTHONPATH=$PYTHONPATH:$(pwd)
cd ../
cd test
python3 run_suite.py --hw npu --suite full-4-npu-a3 --nightly --continue-on-error --timeout-per-file 3600
full-16-npu-a3:
needs: [set-image-config]
if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
runs-on: linux-aarch64-a3-16
container:
image: ${{ needs.set-image-config.outputs.image_a3 }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
bash scripts/ci/npu/npu_ci_install_dependency.sh a3
fi
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Print Log Information
run: |
bash scripts/ci/npu/npu_log_print.sh
- name: Run test
timeout-minutes: 240
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
pip install sglang_router
hf download lmms-lab/MMMU --repo-type dataset
pip install sentence_transformers torchaudio==2.8.0
# Quote ">=" specifiers so the shell does not parse ">" as an output redirect.
pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv
git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd ./lmms-eval
nohup pip install . > lmmslog.txt 2>&1 &
sleep 120
export PYTHONPATH=$PYTHONPATH:$(pwd)
cd ../
cd test
python3 run_suite.py --hw npu --suite full-16-npu-a3 --nightly --continue-on-error --timeout-per-file 3600
check-all-jobs:
if: github.repository == 'sgl-project/sglang' && always()
needs:
- nighly-test-npu
- full-1-npu-a3
- full-2-npu-a3
- full-4-npu-a3
- full-16-npu-a3
runs-on: ubuntu-latest
container:
image: docker.m.daocloud.io/ubuntu:22.04
steps:
- name: Check if any job failed
run: |
if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then
echo "One or more nightly test jobs failed"
exit 1
fi
if [[ "${{ contains(needs.*.result, 'cancelled') }}" == "true" ]]; then
echo "One or more nightly test jobs were cancelled"
exit 1
fi
echo "All nightly test jobs passed"
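
The `set-image-config` job in the workflow above replaces every empty input with a fixed default before the values are passed to the downstream jobs. A minimal Python sketch of that fallback rule is shown here. The names `DEFAULTS` and `resolve_inputs` are assumptions for illustration; the default values are copied from the workflow.

```python
# Defaults as written in the "Set image config" step.
DEFAULTS = {
    "ref": "",
    "job_filter": "all",
    "image_a3": "swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11",
    "skip_install_flag": "false",
}


def resolve_inputs(inputs):
    """Return the effective settings, using the default for missing or empty inputs."""
    return {key: inputs.get(key) or default for key, default in DEFAULTS.items()}
```

An empty `ref` stays empty on purpose: the checkout steps then fall back to `github.ref` through the `||` expression in their `with:` block.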

@@ -0,0 +1,20 @@
name: Auto Label PRs
on:
pull_request_target:
types: [opened, synchronize, reopened]
permissions:
contents: read
pull-requests: write
jobs:
label:
runs-on: ubuntu-latest
steps:
- name: Auto-label by file changes
uses: actions/labeler@v5
with:
repo-token: "${{ secrets.GITHUB_TOKEN }}"
configuration-path: .github/labeler.yml
sync-labels: false

@@ -0,0 +1,39 @@
name: Lint
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.12"
- name: Install pre-commit hook
run: |
python -m pip install pre-commit
pre-commit install
- name: Run pre-commit checks
run: SKIP=no-commit-to-branch pre-commit run --all-files --show-diff-on-failure
- name: Run lychee docs checks (offline references)
uses: lycheeverse/lychee-action@8646ba30535128ac92d33dfc9133794bfdd9b411 # v2
with:
args: --config .github/linters/lychee.toml README.md "docs/**/*.md" "docs/**/*.rst" "docs/**/*.ipynb"
- name: Run sgl-kernel clang-format checks
uses: DoozyX/clang-format-lint-action@v0.20
with:
source: sgl-kernel
extensions: h,c,cpp,hpp,cu,cuh,cc
clangFormatVersion: 20
style: file

@@ -0,0 +1,317 @@
name: List Active Runs
on:
workflow_dispatch:
inputs:
workflows:
description: 'Space-separated list of workflow filenames to check'
required: false
type: string
default: 'pr-test.yml'
permissions:
actions: read
contents: read
pull-requests: read
jobs:
list-active-runs:
runs-on: ubuntu-latest
steps:
- name: Install GitHub CLI
run: sudo apt-get install -y gh jq
- name: List active runs grouped by PR
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
WORKFLOWS: ${{ github.event.inputs.workflows || 'pr-test.yml' }}
shell: bash
run: |
set -euo pipefail
echo "========================================="
echo "🔍 Active Workflow Runs Report"
echo "========================================="
echo ""
# Get all workflows or specific ones
read -r -a workflow_files <<< "${WORKFLOWS}"
echo "📋 Checking specified workflows: ${WORKFLOWS}"
echo ""
# Create a temporary file to store PR data
pr_data_file=$(mktemp)
# Process each workflow
for workflow_file in "${workflow_files[@]}"; do
echo "Scanning workflow: $workflow_file"
# Get all active runs (queued, waiting, in_progress)
active_runs=$(gh run list \
--repo "$REPO" \
--workflow "$workflow_file" \
--json databaseId,status,event,headBranch,createdAt,updatedAt,headSha,number,attempt \
--limit 500 \
| jq -c '.[] | select(.status=="queued" or .status=="waiting" or .status=="in_progress")')
if [ -z "$active_runs" ]; then
continue
fi
# Process each run
echo "$active_runs" | while read -r run; do
run_id=$(echo "$run" | jq -r '.databaseId')
run_status=$(echo "$run" | jq -r '.status')
run_event=$(echo "$run" | jq -r '.event')
created_at=$(echo "$run" | jq -r '.createdAt')
head_sha=$(echo "$run" | jq -r '.headSha')
run_number=$(echo "$run" | jq -r '.number')
run_attempt=$(echo "$run" | jq -r '.attempt // 1')
# Get detailed run information including jobs
run_details=$(gh api "repos/$REPO/actions/runs/$run_id" 2>/dev/null || true)
if [ -z "$run_details" ]; then
continue
fi
head_owner=$(echo "$run_details" | jq -r '.head_repository.owner.login // empty')
head_branch=$(echo "$run_details" | jq -r '.head_branch // empty')
if [ -z "$head_owner" ] || [ -z "$head_branch" ]; then
continue
fi
# Find PR number (may be empty for non-PR runs)
pr_number=$(gh api "repos/$REPO/pulls?state=open&head=${head_owner}:${head_branch}" \
--jq '.[0].number // empty' 2>/dev/null || true)
if [ -z "$pr_number" ]; then
pr_number="NO_PR"
fi
# Get jobs for this run (with pagination to avoid missing jobs)
jobs=$(gh api "repos/$REPO/actions/runs/$run_id/jobs" --paginate --jq '.jobs[]' | jq -s '.')
running_jobs=$(echo "$jobs" | jq '[.[] | select(.status=="in_progress")] | length')
queued_jobs=$(echo "$jobs" | jq '[.[] | select(.status=="queued" or .status=="waiting")] | length')
# Get runner info for running jobs
runners=$(echo "$jobs" | jq -r '.[] | select(.status=="in_progress") | .runner_name // "N/A"' | paste -sd "," -)
# Calculate queue time
current_time=$(date -u +%s)
created_time=$(date -u -d "$created_at" +%s 2>/dev/null || echo "$current_time")
queue_time=$((current_time - created_time))
queue_minutes=$((queue_time / 60))
# Store data in temporary file (unified format with event and branch)
echo "$pr_number|$workflow_file|$run_id|$run_status|$running_jobs|$queued_jobs|$runners|$queue_minutes|$created_at|$head_sha|$run_attempt|$run_event|$head_branch" >> "$pr_data_file"
done
done
echo ""
echo "========================================="
echo "📊 Active Runs Summary"
echo "========================================="
echo ""
if [ ! -s "$pr_data_file" ]; then
echo "✅ No active runs found"
rm -f "$pr_data_file"
exit 0
fi
# Get unique PR numbers (exclude NO_PR entries)
pr_numbers=$(cut -d'|' -f1 < "$pr_data_file" | grep -v '^NO_PR$' | sort -u || true)
# Separate high priority and normal PRs
high_priority_prs=()
normal_prs=()
for pr_num in $pr_numbers; do
labels=$(gh pr view "$pr_num" --repo "$REPO" --json labels \
| jq -r '.labels[].name' 2>/dev/null || true)
if echo "$labels" | grep -Fxq "high priority"; then
high_priority_prs+=("$pr_num")
else
normal_prs+=("$pr_num")
fi
done
# Combine: high priority first, then normal
sorted_pr_numbers=("${high_priority_prs[@]}" "${normal_prs[@]}")
pr_count=0
total_running=0
total_queued=0
for pr_num in "${sorted_pr_numbers[@]}"; do
pr_count=$((pr_count + 1))
# Get PR details
pr_info=$(gh pr view "$pr_num" --repo "$REPO" --json title,author,labels,url 2>/dev/null || true)
if [ -z "$pr_info" ]; then
continue
fi
pr_title=$(echo "$pr_info" | jq -r '.title')
pr_author=$(echo "$pr_info" | jq -r '.author.login')
pr_url=$(echo "$pr_info" | jq -r '.url')
# Join in jq: paste -d ", " would alternate "," and " " as delimiters instead of emitting ", ".
pr_labels=$(echo "$pr_info" | jq -r '[.labels[].name] | join(", ")')
if [ -z "$pr_labels" ]; then
pr_labels="(no labels)"
fi
# Add priority indicator
priority_indicator=""
if echo "$pr_labels" | grep -q "high priority"; then
priority_indicator="🔴 [HIGH PRIORITY] "
fi
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "🔗 ${priority_indicator}PR #$pr_num: $pr_title"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "👤 Author: $pr_author"
echo "🏷️ Labels: $pr_labels"
echo "🔗 URL: $pr_url"
echo ""
# Get all runs for this PR
pr_runs=$(grep "^$pr_num|" "$pr_data_file")
pr_running_total=0
pr_queued_total=0
echo "$pr_runs" | while read -r line; do
workflow=$(echo "$line" | cut -d'|' -f2)
run_id=$(echo "$line" | cut -d'|' -f3)
status=$(echo "$line" | cut -d'|' -f4)
running=$(echo "$line" | cut -d'|' -f5)
queued=$(echo "$line" | cut -d'|' -f6)
runners=$(echo "$line" | cut -d'|' -f7)
queue_min=$(echo "$line" | cut -d'|' -f8)
created=$(echo "$line" | cut -d'|' -f9)
attempt=$(echo "$line" | cut -d'|' -f11)
pr_running_total=$((pr_running_total + running))
pr_queued_total=$((pr_queued_total + queued))
run_url="https://github.com/$REPO/actions/runs/$run_id"
# Calculate retry count for this specific run
retry_count=$((attempt - 1))
# Show retry indicator
retry_indicator=""
if [ "$retry_count" -gt 0 ]; then
retry_indicator=" 🔄 Retry #$retry_count"
fi
echo " 📦 Workflow: $workflow (Run #$run_id)$retry_indicator"
echo " Status: $status"
echo " 🟢 Running jobs: $running"
echo " 🟡 Queued jobs: $queued"
if [ "$running" -gt 0 ] && [ "$runners" != "" ]; then
echo " 🖥️ Runners: $runners"
fi
if [ "$queue_min" -gt 0 ]; then
echo " ⏱️ Queue time: ${queue_min} minutes"
fi
echo " 🔗 Run URL: $run_url"
echo ""
done
# Summary for this PR
pr_running_total=$(grep "^$pr_num|" "$pr_data_file" | cut -d'|' -f5 | awk '{sum+=$1} END {print sum+0}')
pr_queued_total=$(grep "^$pr_num|" "$pr_data_file" | cut -d'|' -f6 | awk '{sum+=$1} END {print sum+0}')
total_running=$((total_running + pr_running_total))
total_queued=$((total_queued + pr_queued_total))
echo " 📊 PR Total: $pr_running_total running, $pr_queued_total queued"
echo ""
done
# --- Non-PR Runs Section ---
non_pr_runs=$(grep '^NO_PR|' "$pr_data_file" 2>/dev/null || true)
non_pr_running=0
non_pr_queued=0
if [ -n "$non_pr_runs" ]; then
echo "========================================="
echo "📦 Non-PR Runs (manual / scheduled / other)"
echo "========================================="
echo ""
echo "$non_pr_runs" | while read -r line; do
workflow=$(echo "$line" | cut -d'|' -f2)
run_id=$(echo "$line" | cut -d'|' -f3)
status=$(echo "$line" | cut -d'|' -f4)
running=$(echo "$line" | cut -d'|' -f5)
queued=$(echo "$line" | cut -d'|' -f6)
runners=$(echo "$line" | cut -d'|' -f7)
queue_min=$(echo "$line" | cut -d'|' -f8)
created=$(echo "$line" | cut -d'|' -f9)
attempt=$(echo "$line" | cut -d'|' -f11)
event=$(echo "$line" | cut -d'|' -f12)
branch=$(echo "$line" | cut -d'|' -f13)
run_url="https://github.com/$REPO/actions/runs/$run_id"
retry_count=$((attempt - 1))
retry_indicator=""
if [ "$retry_count" -gt 0 ]; then
retry_indicator=" 🔄 Retry #$retry_count"
fi
echo " 📦 Workflow: $workflow (Run #$run_id)$retry_indicator"
echo " Event: $event"
echo " Branch: $branch"
echo " Status: $status"
echo " 🟢 Running jobs: $running"
echo " 🟡 Queued jobs: $queued"
if [ "$running" -gt 0 ] && [ "$runners" != "" ]; then
echo " 🖥️ Runners: $runners"
fi
if [ "$queue_min" -gt 0 ]; then
echo " ⏱️ Queue time: ${queue_min} minutes"
fi
echo " 🔗 Run URL: $run_url"
echo ""
done
non_pr_running=$(echo "$non_pr_runs" | cut -d'|' -f5 | awk '{sum+=$1} END {print sum+0}')
non_pr_queued=$(echo "$non_pr_runs" | cut -d'|' -f6 | awk '{sum+=$1} END {print sum+0}')
non_pr_count=$(echo "$non_pr_runs" | wc -l | tr -d ' ')
total_running=$((total_running + non_pr_running))
total_queued=$((total_queued + non_pr_queued))
echo " 📊 Non-PR Total: $non_pr_running running, $non_pr_queued queued"
echo ""
fi
# Overall summary
echo "========================================="
echo "📈 Overall Summary"
echo "========================================="
echo "Total PRs with active runs: $pr_count"
echo "Total non-PR active runs: ${non_pr_count:-0}"
echo "Total running jobs: $total_running"
echo "Total queued jobs: $total_queued"
echo "========================================="
# Cleanup
rm -f "$pr_data_file"
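
The per-section totals in the script above come from summing one pipe-delimited field across all records with `cut` and `awk`. A minimal sketch of that aggregation follows, assuming made-up sample records and a hypothetical helper named `sum_field` (neither is part of the original script):

```shell
# Hypothetical sample records; the field order mirrors the layout parsed above:
# marker|workflow|run_id|status|running|queued
sample_runs='NO_PR|nightly-test|101|in_progress|3|1
NO_PR|release|102|queued|0|4'

# Sum one pipe-delimited field across all records.
# "sum+0" forces numeric output, so empty input prints 0 rather than nothing.
sum_field() {
  local records="$1" field="$2"
  printf '%s\n' "$records" | cut -d'|' -f"$field" | awk '{sum+=$1} END {print sum+0}'
}

running_total=$(sum_field "$sample_runs" 5)
queued_total=$(sum_field "$sample_runs" 6)
echo "running=$running_total queued=$queued_total"
```

Computing the sums from the captured text, rather than accumulating inside the `while read` loop, sidesteps the fact that a loop fed by a pipe runs in a subshell and cannot update variables in the parent shell.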

View File

@@ -0,0 +1,32 @@
name: Nightly Link Check
on:
schedule:
- cron: "0 2 * * *"
workflow_dispatch:
concurrency:
group: nightly-link-check-${{ github.ref }}
cancel-in-progress: true
jobs:
lychee-online:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-latest
timeout-minutes: 20
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Run lychee online link checks
uses: lycheeverse/lychee-action@8646ba30535128ac92d33dfc9133794bfdd9b411 # v2
with:
fail: true
args: >-
--config .github/linters/lychee-ci.toml
README.md
docs/**/*.md
docs/**/*.rst
docs/**/*.ipynb
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -0,0 +1,196 @@
# Nightly release workflow for SGLang Model Gateway
name: Nightly Release SGLang Model Gateway to PyPI
on:
schedule:
# Run at 2 AM UTC every day
- cron: '0 2 * * *'
workflow_dispatch: # Allow manual trigger
jobs:
build:
name: build on ${{ matrix.platform || matrix.os }} (${{ matrix.target }} - ${{ matrix.manylinux || 'auto' }})
runs-on: ${{ matrix.os }}-latest
strategy:
fail-fast: false
matrix:
os: [ubuntu, macos, windows]
target: [x86_64, aarch64]
manylinux: [auto]
include:
- os: ubuntu
platform: linux
- os: windows
ls: dir
target: x86_64
python-architecture: x64
interpreter: 3.9 3.10 3.11 3.12 3.13
- os: macos
target: aarch64
interpreter: 3.9 3.10 3.11 3.12 3.13
- os: ubuntu
platform: linux
target: aarch64
# musllinux
- os: ubuntu
platform: linux
target: x86_64
manylinux: musllinux_1_1
- os: ubuntu
platform: linux
target: aarch64
manylinux: musllinux_1_1
exclude:
- os: windows
target: aarch64
steps:
- uses: actions/checkout@v4
with:
path: sglang-repo
- name: Move sgl-model-gateway folder to root and delete sglang-repo
run: |
mv sglang-repo/sgl-model-gateway/* .
rm -rf sglang-repo
ls -alt
shell: bash
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.13"
architecture: ${{ matrix.python-architecture || 'x64' }}
- name: Modify version for nightly release
run: |
# Get current version from pyproject.toml
CURRENT_VERSION=$(python -c "import tomllib; print(tomllib.load(open('bindings/python/pyproject.toml', 'rb'))['project']['version'])" 2>/dev/null || python -c "import tomli; print(tomli.load(open('bindings/python/pyproject.toml', 'rb'))['project']['version'])")
# Create nightly version with date: e.g., 0.2.1.dev20250128
NIGHTLY_VERSION="${CURRENT_VERSION}.dev$(date +%Y%m%d)"
echo "Nightly version: $NIGHTLY_VERSION"
# Update pyproject.toml with nightly version (temporary, not committed)
sed -i.bak "s/version = \"${CURRENT_VERSION}\"/version = \"${NIGHTLY_VERSION}\"/" bindings/python/pyproject.toml
# Verify the change
grep "^version" bindings/python/pyproject.toml
shell: bash
- name: Install twine and tomli
run: pip install -U twine tomli
- name: Install protoc (macOS)
if: matrix.os == 'macos'
run: brew install protobuf
- name: Install protoc (Windows)
if: matrix.os == 'windows'
run: choco install protoc -y
- name: Build wheels
uses: PyO3/maturin-action@v1
with:
working-directory: bindings/python
target: ${{ matrix.target }}
manylinux: ${{ matrix.manylinux || 'auto' }}
args: --release --out dist --features vendored-openssl --interpreter ${{ matrix.interpreter || '3.9 3.10 3.11 3.12 3.13 3.14' }}
rust-toolchain: stable
docker-options: -e CI -e CC_aarch64_unknown_linux_gnu=aarch64-linux-gnu-gcc -e CXX_aarch64_unknown_linux_gnu=aarch64-linux-gnu-g++
before-script-linux: |
# Install build dependencies (perl/make for vendored OpenSSL, protoc for gRPC)
if command -v yum &> /dev/null; then
yum update -y && yum install -y wget unzip gcc gcc-c++ perl-core make
# Install cross-compilation toolchain for aarch64 if needed
if [ "${{ matrix.target }}" = "aarch64" ]; then
yum install -y gcc-aarch64-linux-gnu gcc-c++-aarch64-linux-gnu || true
fi
elif command -v apt-get &> /dev/null; then
apt-get update && apt-get install -y wget unzip gcc g++ perl make
# Install cross-compilation toolchain for aarch64 if needed
if [ "${{ matrix.target }}" = "aarch64" ]; then
apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu || true
fi
fi
(cd /tmp && \
wget https://github.com/protocolbuffers/protobuf/releases/download/v32.0/protoc-32.0-linux-x86_64.zip && \
unzip protoc-32.0-linux-x86_64.zip -d /usr/local && \
rm protoc-32.0-linux-x86_64.zip)
protoc --version
- name: List built packages
run: ${{ matrix.ls || 'ls -lh' }} bindings/python/dist/
- name: Check packages
run: twine check --strict bindings/python/dist/*
- uses: actions/upload-artifact@v4
with:
name: packages-${{ matrix.os }}-${{ matrix.target }}-${{ matrix.manylinux || 'auto' }}
path: bindings/python/dist/
build-sdist:
name: Build SDist
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
path: sglang-repo
- name: Move sgl-model-gateway folder to root and delete sglang-repo
run: |
mv sglang-repo/sgl-model-gateway/* .
rm -rf sglang-repo
ls -alt
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.13"
- name: Modify version for nightly release
run: |
# Get current version from pyproject.toml
CURRENT_VERSION=$(python -c "import tomllib; print(tomllib.load(open('bindings/python/pyproject.toml', 'rb'))['project']['version'])" 2>/dev/null || python -c "import tomli; print(tomli.load(open('bindings/python/pyproject.toml', 'rb'))['project']['version'])")
# Create nightly version with date: e.g., 0.2.1.dev20250128
NIGHTLY_VERSION="${CURRENT_VERSION}.dev$(date +%Y%m%d)"
echo "Nightly version: $NIGHTLY_VERSION"
# Update pyproject.toml with nightly version (temporary, not committed)
sed -i "s/version = \"${CURRENT_VERSION}\"/version = \"${NIGHTLY_VERSION}\"/" bindings/python/pyproject.toml
# Verify the change
grep "^version" bindings/python/pyproject.toml
- name: Build SDist
uses: PyO3/maturin-action@v1
with:
working-directory: bindings/python
command: sdist
args: --out dist
rust-toolchain: stable
- uses: actions/upload-artifact@v4
with:
name: sdist
path: bindings/python/dist/*.tar.gz
upload:
name: Upload to TestPyPI
if: github.repository == 'sgl-project/sglang' # Ensure this job only runs for the sgl-project/sglang repository
needs: [build, build-sdist]
runs-on: ubuntu-latest
steps:
- uses: actions/download-artifact@v4
with:
path: dist
merge-multiple: true
- name: Upload to TestPyPI
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.TEST_PYPI_TOKEN_ROUTER }}
run: |
pip install twine
twine upload --repository testpypi dist/* --verbose
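
The "Modify version for nightly release" steps above append a `.devYYYYMMDD` suffix to the project version and rewrite `pyproject.toml` in place. The sketch below reproduces that rewrite on a throwaway file; the file contents, the fixed date, and the variable names are assumptions for illustration, not values taken from the repository:

```shell
# Hypothetical stand-in for bindings/python/pyproject.toml
tmp_toml=$(mktemp)
printf '[project]\nname = "demo"\nversion = "0.2.1"\n' > "$tmp_toml"

current_version="0.2.1"
build_date="20250128"   # fixed for the example; the workflow uses $(date +%Y%m%d)
nightly_version="${current_version}.dev${build_date}"

# -i.bak is accepted by both GNU sed (Linux) and BSD sed (macOS)
sed -i.bak "s/version = \"${current_version}\"/version = \"${nightly_version}\"/" "$tmp_toml"

new_line=$(grep '^version' "$tmp_toml")
echo "$new_line"

rm -f "$tmp_toml" "$tmp_toml.bak"
```

Note that the dots in the version string act as regex wildcards inside the `sed` pattern. That is harmless for this substitution, but worth remembering if the pattern is reused elsewhere.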

View File

@@ -0,0 +1,33 @@
name: Nightly Test (Intel)
on:
schedule:
- cron: '0 0 * * *'
push:
branches:
- main
paths:
- "python/sglang/version.py"
workflow_dispatch:
workflow_call:
inputs:
ref:
description: "Branch, tag or SHA to checkout"
required: false
type: string
default: ""
concurrency:
group: nightly-test-intel-${{ inputs.ref || github.ref }}
cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
jobs:
# Placeholder for Intel GPU tests
# Add Intel-specific nightly test workflows here when available
placeholder:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-latest
steps:
- name: Placeholder
run: echo "Intel nightly tests will be added here"

View File

@@ -0,0 +1,428 @@
name: Nightly Test (NPU)
on:
schedule:
- cron: '0 18 * * *' # 18:00 UTC, i.e. 2:00 a.m. Beijing time (UTC+8), every day
pull_request:
branches:
- main
paths:
- ".github/workflows/nightly-test-npu.yml"
workflow_dispatch:
workflow_call:
inputs:
ref:
description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.'
required: false
type: string
default: ''
job_filter:
description: 'Select which job to run (leave empty or "all" to run all jobs)'
required: false
type: string
default: 'all'
image_a3:
description: 'Docker image used to run the A3 test jobs.'
required: false
type: string
default: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11'
skip_install_flag:
description: 'Indicates whether to skip the installation of sglang, defaulting to false.'
required: false
type: string
default: 'false'
concurrency:
group: nightly-test-npu-${{ inputs.ref || github.ref }}
cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
jobs:
set-image-config:
runs-on: ubuntu-latest
outputs:
ref: ${{ steps.set-vars.outputs.ref }}
job_filter: ${{ steps.set-vars.outputs.job_filter }}
image_a3: ${{ steps.set-vars.outputs.image_a3 }}
skip_install_flag: ${{ steps.set-vars.outputs.skip_install_flag }}
steps:
# When triggered by a PR, no input parameters are passed, so the defaults apply and the latest community code is tested.
- name: Set image config
id: set-vars
run: |
if [ -z "${{ inputs.ref }}" ]; then
echo "ref=" >> $GITHUB_OUTPUT
else
echo "ref=${{ inputs.ref }}" >> $GITHUB_OUTPUT
fi
if [ -z "${{ inputs.job_filter }}" ]; then
echo "job_filter=all" >> $GITHUB_OUTPUT
else
echo "job_filter=${{ inputs.job_filter }}" >> $GITHUB_OUTPUT
fi
if [ -z "${{ inputs.image_a3 }}" ]; then
echo "image_a3=swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11" >> $GITHUB_OUTPUT
else
echo "image_a3=${{ inputs.image_a3 }}" >> $GITHUB_OUTPUT
fi
if [ -z "${{ inputs.skip_install_flag }}" ]; then
echo "skip_install_flag=false" >> $GITHUB_OUTPUT
else
echo "skip_install_flag=${{ inputs.skip_install_flag }}" >> $GITHUB_OUTPUT
fi
nightly-1-npu-a3:
needs: [set-image-config]
if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
runs-on: linux-aarch64-a3-2
strategy:
fail-fast: false
matrix:
part: [0, 1]
container:
image: ${{ needs.set-image-config.outputs.image_a3 }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
bash scripts/ci/npu/npu_ci_install_dependency.sh a3
fi
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Print Log Information
run: |
bash scripts/ci/npu/npu_log_print.sh
- name: Run test
timeout-minutes: 240
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
pip install sglang_router
hf download lmms-lab/MMMU --repo-type dataset
pip install sentence_transformers torchaudio==2.8.0
# Quote specifiers containing '>' so the shell does not treat them as output redirections
pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.32 numpy==1.26.4 dotenv
git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd ./lmms-eval
nohup pip install . > lmmslog.txt 2>&1 &
sleep 120
export PYTHONPATH=$PYTHONPATH:$(pwd)
cd ../
cd test
python3 run_suite.py --hw npu --suite nightly-1-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2
nightly-2-npu-a3:
needs: [set-image-config]
if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
runs-on: linux-aarch64-a3-2
strategy:
fail-fast: false
matrix:
part: [0]
container:
image: ${{ needs.set-image-config.outputs.image_a3 }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
bash scripts/ci/npu/npu_ci_install_dependency.sh a3
fi
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Print Log Information
run: |
bash scripts/ci/npu/npu_log_print.sh
- name: Run test
timeout-minutes: 240
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
pip install sglang_router
hf download lmms-lab/MMMU --repo-type dataset
pip install sentence_transformers torchaudio==2.8.0
pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.32 numpy==1.26.4 dotenv
git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd ./lmms-eval
nohup pip install . > lmmslog.txt 2>&1 &
sleep 120
export PYTHONPATH=$PYTHONPATH:$(pwd)
cd ../
cd test
python3 run_suite.py --hw npu --suite nightly-2-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 1
nightly-4-npu-a3:
needs: [set-image-config]
if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
runs-on: linux-aarch64-a3-4
strategy:
fail-fast: false
matrix:
part: [0]
container:
image: ${{ needs.set-image-config.outputs.image_a3 }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
bash scripts/ci/npu/npu_ci_install_dependency.sh a3
fi
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Print Log Information
run: |
bash scripts/ci/npu/npu_log_print.sh
- name: Run test
timeout-minutes: 240
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
pip install sglang_router
hf download lmms-lab/MMMU --repo-type dataset
pip install sentence_transformers torchaudio==2.8.0
pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.32 numpy==1.26.4 dotenv
git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd ./lmms-eval
nohup pip install . > lmmslog.txt 2>&1 &
sleep 120
export PYTHONPATH=$PYTHONPATH:$(pwd)
cd ../
cd test
python3 run_suite.py --hw npu --suite nightly-4-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 1
nightly-8-npu-a3:
needs: [set-image-config]
if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
runs-on: linux-aarch64-a3-8
strategy:
fail-fast: false
matrix:
part: [0]
container:
image: ${{ needs.set-image-config.outputs.image_a3 }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
bash scripts/ci/npu/npu_ci_install_dependency.sh a3
fi
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Print Log Information
run: |
bash scripts/ci/npu/npu_log_print.sh
- name: Run test
timeout-minutes: 240
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
pip install sglang_router
hf download lmms-lab/MMMU --repo-type dataset
pip install sentence_transformers torchaudio==2.8.0
pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.32 numpy==1.26.4 dotenv
git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd ./lmms-eval
nohup pip install . > lmmslog.txt 2>&1 &
sleep 120
export PYTHONPATH=$PYTHONPATH:$(pwd)
cd ../
cd test
python3 run_suite.py --hw npu --suite nightly-8-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 1
nightly-16-npu-a3:
needs: [set-image-config]
if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
runs-on: linux-aarch64-a3-16
strategy:
fail-fast: false
matrix:
part: [0, 1]
container:
image: ${{ needs.set-image-config.outputs.image_a3 }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
bash scripts/ci/npu/npu_ci_install_dependency.sh a3
fi
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Print Log Information
run: |
bash scripts/ci/npu/npu_log_print.sh
- name: Run test
timeout-minutes: 240
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
pip install sglang_router
hf download lmms-lab/MMMU --repo-type dataset
pip install sentence_transformers torchaudio==2.8.0
pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.32 numpy==1.26.4 dotenv
git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd ./lmms-eval
nohup pip install . > lmmslog.txt 2>&1 &
sleep 120
export PYTHONPATH=$PYTHONPATH:$(pwd)
cd ../
cd test
python3 run_suite.py --hw npu --suite nightly-16-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2
check-all-jobs:
if: github.repository == 'sgl-project/sglang' && always()
needs:
- nightly-1-npu-a3
- nightly-2-npu-a3
- nightly-4-npu-a3
- nightly-8-npu-a3
- nightly-16-npu-a3
runs-on: ubuntu-latest
container:
image: docker.m.daocloud.io/ubuntu:22.04
steps:
- name: Check if any job failed
run: |
if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then
echo "One or more nightly test jobs failed"
exit 1
fi
if [[ "${{ contains(needs.*.result, 'cancelled') }}" == "true" ]]; then
echo "One or more nightly test jobs were cancelled"
exit 1
fi
echo "All nightly test jobs passed"
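
The `check-all-jobs` gate above fails when any upstream result is `failure` or `cancelled` and passes otherwise, which means `skipped` jobs do not fail the gate. The same decision rule can be sketched in plain shell; the helper name `all_jobs_ok` and the sample results are hypothetical, and in the workflow itself the results come from the `needs.*.result` expression:

```shell
# Hypothetical helper: succeed only if no result is "failure" or "cancelled".
# Other values such as "success" or "skipped" are treated as passing.
all_jobs_ok() {
  for result in "$@"; do
    case "$result" in
      failure|cancelled) return 1 ;;
    esac
  done
  return 0
}

if all_jobs_ok success success skipped; then
  echo "All nightly test jobs passed"
else
  echo "One or more nightly test jobs failed or were cancelled"
fi
```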

View File

@@ -0,0 +1,796 @@
name: Nightly Test (Nvidia)
on:
schedule:
- cron: '0 0 * * *'
workflow_dispatch:
inputs:
job_filter:
description: 'Select which job to run (leave empty or "all" to run all jobs)'
required: false
type: choice
default: 'all'
options:
- 'all'
- 'nightly-test-general-1-gpu-h100'
- 'nightly-test-general-4-gpu-h100'
- 'nightly-test-general-8-gpu-h200'
- 'nightly-test-general-8-gpu-h20'
- 'nightly-test-general-8-gpu-b200'
- 'nightly-test-text-accuracy-2-gpu-h100'
- 'nightly-test-text-perf-2-gpu-h100'
- 'nightly-test-vlm-accuracy-2-gpu-h100'
- 'nightly-test-vlm-perf-2-gpu-h100'
- 'nightly-test-multimodal-server-1-gpu'
- 'nightly-test-multimodal-server-2-gpu'
- 'nightly-test-perf-4-gpu-b200'
- 'nightly-test-perf-8-gpu-b200'
- 'nightly-test-specialized-8-gpu-b200'
- 'nightly-test-kernel-1-gpu-h100'
- 'nightly-test-diffusion-comparison'
- 'nightly-test-kernel-8-gpu-h200'
workflow_call:
inputs:
ref:
description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.'
required: false
type: string
default: ''
job_filter:
description: 'Select which job to run (leave empty or "all" to run all jobs)'
required: false
type: string
default: 'all'
concurrency:
group: nightly-test-nvidia-${{ inputs.ref || github.ref }}
cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
env:
SGLANG_IS_IN_CI: true
SGLANG_CUDA_COREDUMP: "1"
HF_HUB_DOWNLOAD_TIMEOUT: 300
HF_HUB_ETAG_TIMEOUT: 300
jobs:
# General tests - 1 GPU
nightly-test-general-1-gpu-h100:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-1-gpu-h100')
runs-on: 1-gpu-h100
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run test
timeout-minutes: 60
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-1-gpu --nightly --continue-on-error
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# JIT kernel full unit tests (expanded parameter ranges via SGLANG_JIT_KERNEL_RUN_FULL_TESTS)
nightly-test-kernel-1-gpu-h100:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-kernel-1-gpu-h100')
runs-on: 1-gpu-h100
timeout-minutes: 240
env:
# Full jit_kernel test grids (see sglang.jit_kernel.utils.should_run_full_tests)
SGLANG_JIT_KERNEL_RUN_FULL_TESTS: "1"
# Match pr-test-jit-kernel workflow for consistent JIT warmup behavior
SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true
# Allow maintenance bypass on default branch (same semantics as PR JIT workflow)
SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
timeout-minutes: 20
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run jit kernel nightly suite
timeout-minutes: 60
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-kernel-1-gpu --nightly --continue-on-error
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
nightly-test-kernel-8-gpu-h200:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-kernel-8-gpu-h200')
runs-on: 8-gpu-h200
timeout-minutes: 240
env:
SGLANG_JIT_KERNEL_RUN_FULL_TESTS: "1"
SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true
SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
timeout-minutes: 20
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run multi-GPU jit kernel nightly suite
timeout-minutes: 90
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-kernel-8-gpu-h200 --nightly --continue-on-error
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# General tests - 4 GPU H100
nightly-test-general-4-gpu-h100:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-4-gpu-h100')
runs-on: 4-gpu-h100
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run test
timeout-minutes: 30
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-4-gpu --nightly --continue-on-error
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# General tests - 8 GPU H200
nightly-test-general-8-gpu-h200:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-h200')
runs-on: 8-gpu-h200
strategy:
fail-fast: false
matrix:
partition: [0, 1, 2, 3]
env:
RUNNER_LABELS: 8-gpu-h200
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run common 8-GPU model tests
if: always()
timeout-minutes: 300
env:
TRACE_BASE_URL: https://raw.githubusercontent.com/sglang-bot/sglang-ci-data/main/traces/${{ github.run_id }}
PERFETTO_RELAY_URL: ${{ vars.PERFETTO_RELAY_URL }}
GPU_CONFIG: "8-gpu-h200"
IS_H200: "1"
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=18000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=4
- name: Publish traces to storage repo
if: always()
continue-on-error: true
env:
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
run: |
TRACE_ARGS=""
for dir in test/performance_profiles_*/; do
[ -d "$dir" ] && TRACE_ARGS="$TRACE_ARGS --traces-dir $dir"
done
if [ -n "$TRACE_ARGS" ]; then
python3 scripts/ci/utils/publish_traces.py $TRACE_ARGS
find test/performance_profiles_*/ -name '*.json.gz' -delete
else
echo "No trace directories found, skipping publish"
fi
- name: Run test
timeout-minutes: 30
env:
GPU_CONFIG: "8-gpu-h200"
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-8-gpu-h200 --nightly --continue-on-error
- name: Collect performance metrics
if: always()
run: |
python3 scripts/ci/utils/save_metrics.py \
--gpu-config 8-gpu-h200 \
--partition ${{ matrix.partition }} \
--run-id ${{ github.run_id }} \
--output test/metrics-8gpu-h200-partition-${{ matrix.partition }}.json \
--search-dir test/performance_profiles_8_gpu \
--search-dir test
- name: Upload partition metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: metrics-8gpu-h200-partition-${{ matrix.partition }}
path: test/metrics-8gpu-h200-partition-${{ matrix.partition }}.json
retention-days: 5
if-no-files-found: ignore
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.partition }}
# General tests - 8 GPU H20
nightly-test-general-8-gpu-h20:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-h20')
runs-on: 8-gpu-h20
env:
SGLANG_CI_RDMA_ALL_DEVICES: "mlx5_1,mlx5_2,mlx5_3,mlx5_4"
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run test
timeout-minutes: 30
env:
GPU_CONFIG: "8-gpu-h20"
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-8-gpu-h20 --nightly --continue-on-error
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# General tests - 8 GPU B200
nightly-test-general-8-gpu-b200:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-b200')
runs-on: 8-gpu-b200
strategy:
fail-fast: false
matrix:
partition: [0, 1, 2, 3]
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run common 8-GPU model tests
if: always()
timeout-minutes: 300
env:
TRACE_BASE_URL: https://raw.githubusercontent.com/sglang-bot/sglang-ci-data/main/traces/${{ github.run_id }}
PERFETTO_RELAY_URL: ${{ vars.PERFETTO_RELAY_URL }}
GPU_CONFIG: "8-gpu-b200"
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=12000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=4
- name: Publish traces to storage repo
if: always()
continue-on-error: true
env:
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
run: |
TRACE_ARGS=""
for dir in test/performance_profiles_*/; do
[ -d "$dir" ] && TRACE_ARGS="$TRACE_ARGS --traces-dir $dir"
done
if [ -n "$TRACE_ARGS" ]; then
python3 scripts/ci/utils/publish_traces.py $TRACE_ARGS
find test/performance_profiles_*/ -name '*.json.gz' -delete
else
echo "No trace directories found, skipping publish"
fi
- name: Collect performance metrics
if: always()
run: |
python3 scripts/ci/utils/save_metrics.py \
--gpu-config 8-gpu-b200 \
--partition ${{ matrix.partition }} \
--run-id ${{ github.run_id }} \
--output test/metrics-8gpu-b200-partition-${{ matrix.partition }}.json \
--search-dir test/performance_profiles_8_gpu \
--search-dir test
- name: Upload partition metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: metrics-8gpu-b200-partition-${{ matrix.partition }}
path: test/metrics-8gpu-b200-partition-${{ matrix.partition }}.json
retention-days: 5
if-no-files-found: ignore
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.partition }}
# Text model accuracy tests
nightly-test-text-accuracy-2-gpu-h100:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-accuracy-2-gpu-h100')
runs-on: 2-gpu-h100
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run eval test for text models
timeout-minutes: 120
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-eval-text-2-gpu --nightly --continue-on-error --timeout-per-file 4500
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# Text model performance tests
nightly-test-text-perf-2-gpu-h100:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-perf-2-gpu-h100')
runs-on: 2-gpu-h100
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run performance test for text models
timeout-minutes: 180
env:
TRACE_BASE_URL: https://raw.githubusercontent.com/sglang-bot/sglang-ci-data/main/traces/${{ github.run_id }}
PERFETTO_RELAY_URL: ${{ vars.PERFETTO_RELAY_URL }}
GPU_CONFIG: "2-gpu-h100"
run: |
cd test
rm -rf performance_profiles_text_models/
python3 run_suite.py --hw cuda --suite nightly-perf-text-2-gpu --nightly --continue-on-error --timeout-per-file 3600
- name: Publish traces to storage repo
env:
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
run: |
python3 scripts/ci/utils/publish_traces.py --traces-dir test/performance_profiles_text_models
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# VLM accuracy tests
nightly-test-vlm-accuracy-2-gpu-h100:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-accuracy-2-gpu-h100')
runs-on: 2-gpu-h100
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run eval test for VLM models (fixed MMMU-100)
timeout-minutes: 240
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-eval-vlm-2-gpu --nightly --continue-on-error --timeout-per-file 9000
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# VLM performance tests
nightly-test-vlm-perf-2-gpu-h100:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-perf-2-gpu-h100')
runs-on: 2-gpu-h100
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run perf test for VLM models (MMMU)
timeout-minutes: 240
env:
TRACE_BASE_URL: https://raw.githubusercontent.com/sglang-bot/sglang-ci-data/main/traces/${{ github.run_id }}
PERFETTO_RELAY_URL: ${{ vars.PERFETTO_RELAY_URL }}
GPU_CONFIG: "2-gpu-h100"
run: |
cd test
rm -rf performance_profiles_vlms/
python3 run_suite.py --hw cuda --suite nightly-perf-vlm-2-gpu --nightly --continue-on-error --timeout-per-file 3600
- name: Publish traces to storage repo
env:
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
run: |
python3 scripts/ci/utils/publish_traces.py --traces-dir test/performance_profiles_vlms
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# Diffusion performance tests
nightly-test-multimodal-server-1-gpu:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-multimodal-server-1-gpu')
runs-on: 1-gpu-h100
strategy:
fail-fast: false
max-parallel: 5
matrix:
part: [0, 1]
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh diffusion
pip install slack_sdk
- name: Run diffusion server tests
env:
SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }}
GITHUB_RUN_ID: ${{ github.run_id }}
GPU_CONFIG: "1-gpu-h100"
timeout-minutes: 90
run: |
cd python
python3 sglang/multimodal_gen/test/run_suite.py \
--suite 1-gpu \
--partition-id ${{ matrix.part }} \
--total-partitions 2
- name: Collect diffusion performance metrics
if: always()
run: |
python3 scripts/ci/utils/diffusion/save_diffusion_metrics.py \
--gpu-config 1-gpu-h100 \
--run-id ${{ github.run_id }} \
--output python/diffusion-metrics-1gpu-partition-${{ matrix.part }}.json \
--results-json python/diffusion-results.json
- name: Upload diffusion metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: diffusion-metrics-1gpu-partition-${{ matrix.part }}
path: python/diffusion-metrics-1gpu-partition-${{ matrix.part }}.json
retention-days: 90
if-no-files-found: ignore
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.part }}
nightly-test-multimodal-server-2-gpu:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-multimodal-server-2-gpu')
runs-on: 2-gpu-h100
strategy:
fail-fast: false
max-parallel: 5
matrix:
part: [0, 1]
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh diffusion
pip install slack_sdk
- name: Run diffusion server tests
env:
SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }}
GITHUB_RUN_ID: ${{ github.run_id }}
GPU_CONFIG: "2-gpu-h100"
timeout-minutes: 90
run: |
cd python
python3 sglang/multimodal_gen/test/run_suite.py \
--suite 2-gpu \
--partition-id ${{ matrix.part }} \
--total-partitions 2
- name: Collect diffusion performance metrics
if: always()
run: |
python3 scripts/ci/utils/diffusion/save_diffusion_metrics.py \
--gpu-config 2-gpu-h100 \
--run-id ${{ github.run_id }} \
--output python/diffusion-metrics-2gpu-partition-${{ matrix.part }}.json \
--results-json python/diffusion-results.json
- name: Upload diffusion metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: diffusion-metrics-2gpu-partition-${{ matrix.part }}
path: python/diffusion-metrics-2gpu-partition-${{ matrix.part }}.json
retention-days: 90
if-no-files-found: ignore
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.part }}
# B200 Performance tests - 4 GPU
nightly-test-perf-4-gpu-b200:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-perf-4-gpu-b200')
runs-on: 4-gpu-b200
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run test
timeout-minutes: 300
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-4-gpu-b200 --nightly --continue-on-error --timeout-per-file 12000
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# Specialized B200 tests - 8 GPU, for specific backends and configs
nightly-test-specialized-8-gpu-b200:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-perf-8-gpu-b200' || inputs.job_filter == 'nightly-test-specialized-8-gpu-b200')
runs-on: 8-gpu-b200
env:
RUNNER_LABELS: 8-gpu-b200
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run test
timeout-minutes: 120
env:
GPU_CONFIG: "8-gpu-b200"
run: |
cd test
python3 run_suite.py --hw cuda --suite nightly-8-gpu-b200 --nightly --continue-on-error --timeout-per-file 2400
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# Diffusion cross-framework comparison
nightly-test-diffusion-comparison:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-diffusion-comparison')
runs-on: 4-gpu-h100
timeout-minutes: 240
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Run cross-framework comparison
env:
GITHUB_SHA: ${{ github.sha }}
GITHUB_RUN_ID: ${{ github.run_id }}
PYTHONUNBUFFERED: "1"
timeout-minutes: 210
run: |
python3 -u scripts/ci/utils/diffusion/run_comparison.py \
--output comparison-results.json
- name: Generate dashboard
if: always()
env:
GH_PAT_FOR_NIGHTLY_CI_DATA: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
GH_TOKEN: ${{ github.token }}
run: |
python3 scripts/ci/utils/diffusion/generate_diffusion_dashboard.py \
--results comparison-results.json \
--output dashboard.md \
--charts-dir comparison-charts \
--fetch-history \
--step-summary
- name: Publish to sglang-ci-data
if: always()
env:
GH_PAT_FOR_NIGHTLY_CI_DATA: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
run: |
python3 scripts/ci/utils/diffusion/publish_comparison_results.py \
--results comparison-results.json \
--dashboard dashboard.md \
--charts-dir comparison-charts
- name: Upload comparison artifacts
if: always()
uses: actions/upload-artifact@v4
with:
name: diffusion-comparison-${{ github.run_id }}
path: |
comparison-results.json
dashboard.md
comparison-charts/
comparison-logs/
retention-days: 90
if-no-files-found: ignore
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# Consolidate performance metrics from all jobs
consolidate-metrics:
if: github.repository == 'sgl-project/sglang' && always()
needs:
- nightly-test-general-8-gpu-h200
- nightly-test-general-8-gpu-b200
- nightly-test-multimodal-server-1-gpu
- nightly-test-multimodal-server-2-gpu
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Download all partition metrics
uses: actions/download-artifact@v4
with:
pattern: "*metrics-*"
path: metrics/
merge-multiple: true
- name: List downloaded metrics
run: |
echo "Downloaded metrics files:"
find metrics/ -name "*.json" -type f 2>/dev/null || echo "No metrics files found"
- name: Merge metrics
run: |
python3 scripts/ci/utils/merge_metrics.py \
--input-dir metrics/ \
--output consolidated-metrics-${{ github.run_id }}.json \
--run-id ${{ github.run_id }} \
--commit-sha ${{ github.sha }} \
--branch ${{ github.ref_name }}
- name: Upload consolidated metrics
uses: actions/upload-artifact@v4
with:
name: consolidated-metrics-${{ github.run_id }}
path: consolidated-metrics-${{ github.run_id }}.json
retention-days: 90
if-no-files-found: warn
# Final check job
check-all-jobs:
if: github.repository == 'sgl-project/sglang' && always()
needs:
- nightly-test-general-1-gpu-h100
- nightly-test-general-4-gpu-h100
- nightly-test-general-8-gpu-h200
- nightly-test-general-8-gpu-h20
- nightly-test-general-8-gpu-b200
- nightly-test-text-accuracy-2-gpu-h100
- nightly-test-text-perf-2-gpu-h100
- nightly-test-vlm-accuracy-2-gpu-h100
- nightly-test-vlm-perf-2-gpu-h100
- nightly-test-multimodal-server-1-gpu
- nightly-test-multimodal-server-2-gpu
- nightly-test-perf-4-gpu-b200
- nightly-test-specialized-8-gpu-b200
- nightly-test-diffusion-comparison
- consolidate-metrics
runs-on: ubuntu-latest
steps:
- name: Check if any job failed
run: |
if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then
echo "One or more nightly test jobs failed"
exit 1
fi
if [[ "${{ contains(needs.*.result, 'cancelled') }}" == "true" ]]; then
echo "One or more nightly test jobs were cancelled"
exit 1
fi
echo "All nightly test jobs passed"
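
The gating used throughout this nightly workflow can be summarized in a short Python sketch. It is only an illustration of the `if:` expressions and the final shell check above, not code from the repository; the function names `should_run_job` and `final_check` are hypothetical. Note that `nightly-test-specialized-8-gpu-b200` additionally accepts `nightly-test-perf-8-gpu-b200` as a filter value, which the sketch leaves out.

```python
from typing import Iterable, Tuple

UPSTREAM_REPO = "sgl-project/sglang"


def should_run_job(repository: str, job_filter: str, job_name: str) -> bool:
    """Mirror the per-job `if:` guard (hypothetical helper).

    A job runs only in the upstream repository, and only when the filter
    is empty, 'all', or names this job exactly.
    """
    if repository != UPSTREAM_REPO:
        return False
    return job_filter in ("", "all", job_name)


def final_check(results: Iterable[str]) -> Tuple[int, str]:
    """Mirror check-all-jobs: return (exit_code, message) for the needs.*.result values."""
    results = list(results)
    if "failure" in results:
        return 1, "One or more nightly test jobs failed"
    if "cancelled" in results:
        return 1, "One or more nightly test jobs were cancelled"
    # 'success' and 'skipped' results both count as passing here.
    return 0, "All nightly test jobs passed"
```

A skipped job (for example one excluded by `job_filter`) therefore does not fail the final check; only `failure` and `cancelled` results do.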

@@ -0,0 +1,28 @@
name: Open A PR to Copy Code From OSS
on:
workflow_dispatch:
# schedule:
# - cron: '0 10 * * *'
permissions:
contents: write
jobs:
copy:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: 'main'
- name: Install GitHub CLI (if not present)
run: |
bash scripts/code_sync/install_github_cli.sh
- name: Copy from OSS code
env:
GH_TOKEN: ${{ secrets.GH_PAT_FOR_OPEN_PR_TO_PRIVATE }}
run: |
python3 scripts/code_sync/copy_from_oss.py

@@ -0,0 +1,31 @@
name: Open A PR to Copy Diff To OSS
on:
workflow_dispatch:
inputs:
commit_sha:
description: 'The commit SHA to copy. Defaults to LAST to copy the latest commit.'
required: false
default: 'LAST'
permissions:
contents: write
jobs:
copy:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Install GitHub CLI (if not present)
run: |
bash scripts/code_sync/install_github_cli.sh
- name: Copy to OSS code
env:
GH_TOKEN: ${{ secrets.GH_PAT_FOR_OPEN_PR_TO_OSS }}
run: |
python3 scripts/code_sync/copy_to_oss.py --commit ${{ github.event.inputs.commit_sha }}

@@ -0,0 +1,115 @@
name: Patch Docker Image
on:
workflow_dispatch:
inputs:
pr_numbers:
description: "Comma-separated PR numbers to apply (e.g. 18962,19010)"
required: false
default: ""
image_tag:
description: "Base image tag to patch (e.g. dev-x86, dev-x86-cu13)"
required: true
concurrency:
group: patch-docker-${{ inputs.image_tag }}
cancel-in-progress: true
jobs:
patch:
if: github.repository == 'sgl-project/sglang'
runs-on: x64-docker-build-node
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Pull base image and extract commit
run: |
IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
docker pull "${IMAGE}"
if BASE_SHA=$(docker run --rm "${IMAGE}" git -C /sgl-workspace/sglang rev-parse HEAD 2>/dev/null); then
echo "Image built from commit: ${BASE_SHA}"
else
BASE_SHA=""
echo "::warning::Image has no .git directory — cannot extract base commit"
fi
echo "BASE_SHA=${BASE_SHA}" >> "$GITHUB_ENV"
- name: Generate patches
run: |
git config --global --add safe.directory "$GITHUB_WORKSPACE"
git fetch origin main
mkdir -p /tmp/patch-ctx
if [ -n "${{ inputs.pr_numbers }}" ]; then
IFS=',' read -ra PRS <<< "${{ inputs.pr_numbers }}"
for pr in "${PRS[@]}"; do
pr=$(echo "${pr}" | xargs)
echo "Fetching PR #${pr}"
git fetch origin "pull/${pr}/head:pr-${pr}"
MERGE_BASE=$(git merge-base origin/main "pr-${pr}")
echo " PR #${pr}: merge-base=${MERGE_BASE}"
git diff "${MERGE_BASE}..pr-${pr}" > "/tmp/patch-ctx/${pr}.patch"
echo " PR #${pr}: $(wc -l < /tmp/patch-ctx/${pr}.patch) lines"
done
elif [ -n "${BASE_SHA}" ]; then
echo "Generating diff: image ${BASE_SHA} → latest main"
git fetch origin "${BASE_SHA}"
git diff "${BASE_SHA}..origin/main" > /tmp/patch-ctx/main.patch
echo " main: $(wc -l < /tmp/patch-ctx/main.patch) lines"
else
echo "::error::No PR numbers specified and image has no .git — cannot generate diff against main"
exit 1
fi
TOTAL=$(cat /tmp/patch-ctx/*.patch | wc -l)
if [ "${TOTAL}" -eq 0 ]; then
echo "::warning::All patches are empty — image is already up to date"
echo "SKIP_BUILD=true" >> "$GITHUB_ENV"
fi
- name: Build patched image
if: env.SKIP_BUILD != 'true'
run: |
IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
cat <<'DOCKERFILE' > /tmp/patch-ctx/Dockerfile
ARG BASE_IMAGE
FROM ${BASE_IMAGE}
COPY *.patch /tmp/patches/
RUN cd /sgl-workspace/sglang \
&& for p in /tmp/patches/*.patch; do \
if [ ! -s "${p}" ]; then \
echo "Skipping ${p} (empty)"; \
else \
echo "Applying ${p}..." \
&& patch -p1 --fuzz=2 --no-backup-if-mismatch -f < "${p}" \
|| { echo "ERROR: Failed to apply ${p}"; exit 1; }; \
fi; \
done \
&& rm -rf /tmp/patches
DOCKERFILE
docker build \
--no-cache \
--build-arg BASE_IMAGE="${IMAGE}" \
-t "${IMAGE}" \
/tmp/patch-ctx/
- name: Push patched image
if: env.SKIP_BUILD != 'true'
run: |
IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
docker push "${IMAGE}"
echo "### Patched \`${IMAGE}\`" >> "$GITHUB_STEP_SUMMARY"
echo "- **Base commit:** \`${BASE_SHA:-unknown (no .git)}\`" >> "$GITHUB_STEP_SUMMARY"
echo "- **Source:** ${{ inputs.pr_numbers && format('PRs: {0}', inputs.pr_numbers) || 'latest main' }}" >> "$GITHUB_STEP_SUMMARY"
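
The branch structure of the "Generate patches" step is easier to see outside of shell. The following Python sketch only models which patch files are produced and from which diff range; it does not run git, and the names `plan_patches` and `should_skip_build` are hypothetical.

```python
from typing import List, Tuple


def plan_patches(pr_numbers: str, base_sha: str) -> List[Tuple[str, str]]:
    """Return (patch_file, diff_range) pairs in the order the step writes them.

    pr_numbers: the comma-separated workflow input (may be empty).
    base_sha:   commit extracted from the image, or "" if it has no .git.
    """
    if pr_numbers:
        plan = []
        for raw in pr_numbers.split(","):
            pr = raw.strip()  # the workflow trims whitespace with `xargs`
            plan.append(
                (f"{pr}.patch", f"merge-base(origin/main, pr-{pr})..pr-{pr}")
            )
        return plan
    if base_sha:
        # No PRs requested: bring the image up to the latest main.
        return [("main.patch", f"{base_sha}..origin/main")]
    # Neither source is available, so the step exits with an error.
    raise ValueError("No PR numbers specified and image has no .git")


def should_skip_build(patch_line_counts: List[int]) -> bool:
    """The build is skipped when every generated patch is empty."""
    return sum(patch_line_counts) == 0
```

PR numbers take precedence: when `pr_numbers` is set, the image's base commit is not used to generate a diff, and each PR is diffed against its own merge-base with `origin/main`.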

@@ -0,0 +1,198 @@
name: PR Benchmark (SMG Components)
on:
push:
branches: [ main ]
paths:
- "sgl-model-gateway/**"
pull_request:
branches: [ main ]
paths:
- "sgl-model-gateway/**"
workflow_dispatch:
concurrency:
group: pr-benchmark-rust-${{ github.ref }}
cancel-in-progress: true
env:
RUSTC_WRAPPER: sccache
SCCACHE_GHA_ENABLED: "true"
permissions:
contents: read
pull-requests: write
issues: write
jobs:
benchmark-compile-check:
name: Benchmark Compilation Check
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_gateway_dependencies.sh
- name: Configure sccache
uses: mozilla-actions/sccache-action@v0.0.9
with:
version: "v0.12.0"
disable_annotations: true
- name: Rust cache
uses: Swatinem/rust-cache@v2
with:
workspaces: sgl-model-gateway
shared-key: "rust-cache"
save-if: true
cache-all-crates: true
cache-on-failure: true
- name: Check benchmarks compile
run: |
source "$HOME/.cargo/env"
cd sgl-model-gateway/
cargo check --benches
- name: Show sccache stats
if: always()
run: sccache --show-stats
benchmark:
name: Benchmark - ${{ matrix.name }}
if: |
github.repository == 'sgl-project/sglang' &&
(github.event_name == 'push' ||
github.event_name == 'workflow_dispatch' ||
(contains(github.event.pull_request.labels.*.name, 'router-benchmark') &&
contains(github.event.pull_request.labels.*.name, 'run-ci')))
strategy:
fail-fast: false
matrix:
include:
- name: Request Processing
bench_name: request_processing
bench_args: "benchmark_summary --exact"
runner: ubuntu-latest
sccache_version: "v0.12.0"
artifact_name: request-processing-results
artifact_path: criterion/benchmark_summary/
- name: Manual Policy
bench_name: manual_policy_benchmark
bench_args: ""
runner: ubuntu-latest
sccache_version: "v0.12.0"
artifact_name: manual-policy-results
artifact_path: criterion/manual_policy*/
runs-on: ${{ matrix.runner }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 100
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_gateway_dependencies.sh
- name: Configure sccache
uses: mozilla-actions/sccache-action@v0.0.9
with:
version: ${{ matrix.sccache_version }}
disable_annotations: true
- name: Rust cache
uses: Swatinem/rust-cache@v2
with:
workspaces: sgl-model-gateway
shared-key: "rust-cache"
cache-all-crates: true
cache-on-failure: true
save-if: true
- name: Run benchmark
timeout-minutes: 30
run: |
source "$HOME/.cargo/env"
cd sgl-model-gateway/
if command -v sccache &> /dev/null; then
echo "Testing sccache availability..."
export RUSTC_WRAPPER=sccache
export SCCACHE_GHA_ENABLED="true"
if sccache --start-server 2>/dev/null && sccache --show-stats 2>/dev/null; then
echo "sccache is working, using it for compilation"
else
echo "sccache failed to start, falling back to regular cargo"
unset RUSTC_WRAPPER
unset SCCACHE_GHA_ENABLED
fi
else
echo "sccache not available, using regular cargo"
fi
cargo bench --bench ${{ matrix.bench_name }} -- ${{ matrix.bench_args }} 2>&1 | tee benchmark_output.txt
- name: Upload benchmark results
if: always()
uses: actions/upload-artifact@v4
with:
name: ${{ matrix.artifact_name }}-${{ github.sha }}
path: |
sgl-model-gateway/target/${{ matrix.artifact_path }}
sgl-model-gateway/benchmark_output.txt
retention-days: 30
- name: Show sccache stats
if: always()
run: sccache --show-stats
benchmark-summary:
name: Benchmark Summary
needs: [benchmark]
if: always() && (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request')
runs-on: ubuntu-latest
steps:
- name: Download all benchmark results
uses: actions/download-artifact@v4
with:
pattern: '*-results-${{ github.sha }}'
path: benchmark-results
- name: Generate summary
run: |
generate_section() {
local title="$1" dir_name="$2" lines="${3:-100}"
local dir="benchmark-results/${dir_name}-${{ github.sha }}"
echo "### $title" >> summary.md
if [ -d "$dir" ]; then
echo "✅ **Completed**" >> summary.md
if [ -f "$dir/benchmark_output.txt" ]; then
echo -e "\n<details>\n<summary>View Results</summary>\n\n\`\`\`" >> summary.md
tail -n "$lines" "$dir/benchmark_output.txt" >> summary.md
echo -e "\`\`\`\n</details>" >> summary.md
fi
else
echo "❌ Failed or skipped" >> summary.md
fi
echo "" >> summary.md
}
echo "## 🚀 Benchmark Results Summary" > summary.md
echo "" >> summary.md
generate_section "Request Processing" "request-processing-results" 60
generate_section "Manual Policy (Sticky Sessions)" "manual-policy-results" 100
echo -e "---\n_Generated at $(date -u '+%Y-%m-%d %H:%M:%S UTC')_" >> summary.md
cat summary.md
cat summary.md >> "$GITHUB_STEP_SUMMARY"
- name: Upload summary
uses: actions/upload-artifact@v4
with:
name: benchmark-summary-${{ github.sha }}
path: summary.md
retention-days: 30
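
The `if:` condition on the `benchmark` job combines a repository check with three trigger paths. A minimal Python sketch of that predicate, assuming the PR labels are available as a list of strings (the function name `should_run_benchmark` is hypothetical):

```python
from typing import Iterable

REQUIRED_PR_LABELS = {"router-benchmark", "run-ci"}


def should_run_benchmark(
    repository: str, event_name: str, pr_labels: Iterable[str] = ()
) -> bool:
    """Mirror the benchmark job's `if:` expression (hypothetical helper)."""
    if repository != "sgl-project/sglang":
        return False
    if event_name in ("push", "workflow_dispatch"):
        return True
    # Otherwise both labels must be present on the pull request.
    return REQUIRED_PR_LABELS <= set(pr_labels)
```

On a pull request, either label alone is not enough: `should_run_benchmark("sgl-project/sglang", "pull_request", ["run-ci"])` is false. The `benchmark-compile-check` job has no such condition and is not covered by this sketch.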

@@ -0,0 +1,254 @@
on:
workflow_call:
inputs:
require-run-ci:
description: "Whether the PR must have the run-ci label"
type: boolean
default: true
cool-down-minutes:
description: "Cooldown period in minutes for low-permission users; 0 disables rate limiting"
type: number
default: 120
jobs:
pr-gate:
# Gating applies only to pull_request events:
# 1. Commits on main need no gating.
# 2. workflow_dispatch can only be triggered by users with write access.
runs-on: ubuntu-latest
steps:
- name: Fetch latest PR info
if: github.event_name == 'pull_request'
id: pr
uses: actions/github-script@v7
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
script: |
const pr = await github.rest.pulls.get({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: context.issue.number
});
core.setOutput("labels", JSON.stringify(pr.data.labels.map(l => l.name)));
core.setOutput("draft", pr.data.draft);
core.setOutput("user", pr.data.user.login);
- name: Log PR info
if: github.event_name == 'pull_request'
run: |
echo "===== PR Info ====="
echo "PR Event: ${{ github.event_name }}"
echo "PR Labels: ${{ steps.pr.outputs.labels }}"
echo "PR Draft: ${{ steps.pr.outputs.draft }}"
echo "PR User: ${{ steps.pr.outputs.user }}"
echo "Require run-ci: ${{ inputs.require-run-ci }}"
echo "Cool down minutes: ${{ inputs.cool-down-minutes }}"
echo "==================="
- name: Block draft PR
if: github.event_name == 'pull_request' && fromJson(steps.pr.outputs.draft)
run: |
echo "PR is draft. Blocking CI."
exit 1
- name: Require run-ci label (optional)
if: github.event_name == 'pull_request' && inputs.require-run-ci == true
run: |
labels='${{ steps.pr.outputs.labels }}'
if [[ "${{ contains(fromJson(steps.pr.outputs.labels), 'run-ci') }}" == "false" ]]; then
echo "Missing required label 'run-ci'. See https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests for more details."
exit 1
fi
- name: Enforce rate limit for low-permission actors (optional)
if: github.event_name == 'pull_request' && inputs.cool-down-minutes > 0
uses: actions/github-script@v7
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
script: |
const DEFAULT_MINUTES = Number("${{ inputs.cool-down-minutes }}");
const owner = context.repo.owner;
const repo = context.repo.repo;
const eventName = context.eventName;
const curRun = await github.rest.actions.getWorkflowRun({
owner, repo, run_id: context.runId
});
let triggeringActor = curRun.data.triggering_actor?.login || context.actor;
if (triggeringActor === "github-actions[bot]") {
triggeringActor = `${{ steps.pr.outputs.user }}`;
core.info(
`triggering_actor is github-actions[bot]; substituting PR author '${triggeringActor}'.`
);
}
async function hasHighPermission(username) {
try {
const { data } = await github.rest.repos.getCollaboratorPermissionLevel({ owner, repo, username });
const perm = data.permission || 'none';
return perm === 'write' || perm === 'maintain' || perm === 'admin';
} catch (e) {
if (e.status === 404 || e.status === 403) return false;
throw e;
}
}
if (await hasHighPermission(triggeringActor)) {
core.info(`Triggering user '${triggeringActor}' has high permission. No rate limit applied.`);
return;
}
let effectiveCooldownMinutes = DEFAULT_MINUTES;
let perUserCooldownMinutes = null;
try {
const contentResp = await github.rest.repos.getContent({
owner,
repo,
path: ".github/CI_PERMISSIONS.json",
ref: "main",
});
if (!Array.isArray(contentResp.data) && contentResp.data && "content" in contentResp.data) {
const raw = Buffer.from(
contentResp.data.content,
contentResp.data.encoding || "base64"
).toString();
const ciPermissions = JSON.parse(raw);
const userPerm = ciPermissions[triggeringActor];
if (userPerm && typeof userPerm.cooldown_interval_minutes === "number") {
perUserCooldownMinutes = userPerm.cooldown_interval_minutes;
core.info(
`Per-user cooldown for '${triggeringActor}' from CI_PERMISSIONS.json: ${perUserCooldownMinutes} minutes.`
);
} else {
core.info(`No per-user cooldown found for '${triggeringActor}' in CI_PERMISSIONS.json.`);
}
} else {
core.info("CI_PERMISSIONS.json content response is not a file; skipping per-user cooldown.");
}
} catch (e) {
core.info(`CI_PERMISSIONS.json not found or unreadable: ${e.message}. Using default rate limit only.`);
}
if (perUserCooldownMinutes !== null) {
effectiveCooldownMinutes = Math.min(effectiveCooldownMinutes, perUserCooldownMinutes);
}
if (effectiveCooldownMinutes <= 0) {
core.info(
`Effective cooldown for '${triggeringActor}' is 0 minutes; no rate limit enforced for this user.`
);
return;
}
const cutoff = new Date(Date.now() - effectiveCooldownMinutes * 60 * 1000);
core.info(
`Checking for workflow runs since ${cutoff.toISOString()} (last ${effectiveCooldownMinutes} minutes) for event '${eventName}'.`
);
const { data } = await github.rest.actions.listWorkflowRuns({
owner,
repo,
workflow_id: 'pr-test.yml',
event: eventName,
per_page: 100,
});
const runs = data.workflow_runs || [];
// Rate Limiting Logic:
// We only count workflow runs that actually consumed CI resources (i.e., passed the gate).
// A run "passes the gate" if any jobs beyond the gate jobs (check-changes, pr-gate, call-gate, pr-test-finish)
// actually executed (not skipped/cancelled). This prevents scenarios where:
// - User has PR A with missing 'run-ci' label (fails at gate)
// - User opens PR B with 'run-ci' label
// - PR B should be able to run even though PR A triggered a run recently
// Helper function to check if a run passed the gate (i.e., actually consumed CI resources)
async function didRunPassGate(run) {
try {
// Note: Fetching up to 100 jobs (API maximum). If a workflow has >100 jobs,
// we may miss some, but this is unlikely in practice.
const { data: jobsData } = await github.rest.actions.listJobsForWorkflowRun({
owner, repo, run_id: run.id, per_page: 100
});
const jobs = jobsData.jobs || [];
// If no jobs exist yet, the run hasn't started consuming resources
if (jobs.length === 0) {
core.info(`Run ${run.id} has no jobs yet; not counting against rate limit.`);
return false;
}
// Gate jobs that don't consume significant CI resources
const gateJobs = ['check-changes', 'pr-gate', 'call-gate', 'pr-test-finish'];
const jobsBeyondGate = jobs.filter(j => !gateJobs.some(g => j.name === g || j.name.startsWith(g + ' ')));
// A job "ran" if it reached a terminal conclusion state that indicates actual execution
const ranStates = ['success', 'failure', 'timed_out', 'action_required'];
const hasJobsThatRan = jobsBeyondGate.some(j => j.conclusion && ranStates.includes(j.conclusion));
return hasJobsThatRan;
} catch (e) {
core.warning(`Could not check jobs for run ${run.id}: ${e.message}`);
// If it's a rate limit error, count it conservatively to prevent abuse
if (e.status === 429) {
core.warning(`Hit rate limit checking run ${run.id}; counting it to be safe.`);
return true;
}
// Cancelled or skipped runs likely didn't consume resources
if (run.conclusion === 'cancelled' || run.conclusion === 'skipped') {
return false;
}
// Default to counting it to prevent abuse
return true;
}
}
// Limit the number of runs we'll check in detail to avoid API rate limits
const MAX_RUNS_TO_CHECK = 5;
let runsChecked = 0;
let runsSkippedAtGate = 0;
let recentFound = null;
for (const run of runs) {
if (String(run.id) === String(context.runId)) continue;
if (new Date(run.created_at) < cutoff) continue;
const isUserRun = (run.actor?.login === triggeringActor) || (run.triggering_actor?.login === triggeringActor);
if (!isUserRun) continue;
runsChecked++;
core.info(`Checking run ${run.id} (created: ${run.created_at}, conclusion: ${run.conclusion})`);
// Safety limit: if we've checked too many runs, assume the next one passed to be conservative
if (runsChecked > MAX_RUNS_TO_CHECK) {
core.warning(`Checked ${MAX_RUNS_TO_CHECK} runs; assuming this one passed gate to avoid API limits.`);
recentFound = run;
break;
}
// Only count runs that actually passed the gate and consumed CI resources
if (await didRunPassGate(run)) {
recentFound = run;
core.info(`Found recent run ${run.id} that passed gate.`);
break;
} else {
runsSkippedAtGate++;
core.info(`Run ${run.id} failed at gate; not counting against rate limit.`);
}
}
core.info(`Rate limit check summary: checked ${runsChecked} runs, ${runsSkippedAtGate} failed at gate.`);
if (recentFound) {
core.setFailed(
`User '${triggeringActor}' already triggered '${context.workflow}' via '${eventName}' at ${recentFound.created_at}. ` +
`Please wait ${effectiveCooldownMinutes} minutes before triggering again.`
);
} else {
core.info(
`No recent runs detected for '${triggeringActor}' within the last ${effectiveCooldownMinutes} minutes; proceeding.`
);
}
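
The two decisions at the core of the rate limiter, which cooldown applies and whether an earlier run counts against it, can be restated compactly. The Python sketch below mirrors the JavaScript above on plain dictionaries and makes no GitHub API calls; the function names are hypothetical, and the job dictionaries are assumed to carry `name` and `conclusion` keys like the API's job objects.

```python
from typing import Iterable, Mapping, Optional

GATE_JOBS = ("check-changes", "pr-gate", "call-gate", "pr-test-finish")
RAN_STATES = {"success", "failure", "timed_out", "action_required"}


def effective_cooldown(
    default_minutes: float, per_user_minutes: Optional[float]
) -> float:
    """A per-user value can only shorten the default cooldown, never extend it."""
    if per_user_minutes is None:
        return default_minutes
    return min(default_minutes, per_user_minutes)


def is_gate_job(name: str) -> bool:
    """A gate job matches a gate name exactly or as a prefix followed by a space."""
    return any(name == g or name.startswith(g + " ") for g in GATE_JOBS)


def did_run_pass_gate(jobs: Iterable[Mapping[str, Optional[str]]]) -> bool:
    """True if any non-gate job reached a conclusion that implies it executed."""
    jobs = list(jobs)
    if not jobs:
        # No jobs yet: the run has not consumed CI resources.
        return False
    return any(
        job.get("conclusion") in RAN_STATES
        for job in jobs
        if not is_gate_job(job["name"])
    )
```

A run that stopped at the gate, for example because the `run-ci` label was missing, leaves its non-gate jobs skipped, so it does not count toward the cooldown. The sketch omits the error handling and the `MAX_RUNS_TO_CHECK` safety limit of the original script.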

File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -0,0 +1,117 @@
name: PR Test - JIT Kernel
on:
workflow_call:
inputs:
jit_kernel:
required: true
type: string
pr_head_sha:
required: false
type: string
default: ''
git_ref:
required: false
type: string
default: ''
target_stage:
required: false
type: string
default: ''
test_parallel_dispatch:
required: false
type: string
default: 'false'
skip_stage_health_check:
required: false
type: boolean
default: false
# Workflow-level env is NOT inherited from the caller in reusable workflows (verified by CI test).
# The github context (including github.event_name) IS inherited from the caller.
env:
SGLANG_IS_IN_CI: true
SGLANG_CUDA_COREDUMP: "1"
SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true
SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }}
SKIP_STAGE_HEALTH_CHECK: ${{ inputs.skip_stage_health_check == true && 'true' || 'false' }}
jobs:
jit-kernel-unit-test:
if: |
github.event_name != 'schedule' &&
inputs.test_parallel_dispatch != 'true' &&
!inputs.target_stage
runs-on: 1-gpu-h100
timeout-minutes: 240
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
- uses: ./.github/actions/check-stage-health
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
timeout-minutes: 20
run: |
bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Run test
timeout-minutes: 30
run: |
cd test/
python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-1-gpu-large
jit-kernel-multigpu-unit-test:
if: |
github.event_name != 'schedule' &&
inputs.test_parallel_dispatch != 'true' &&
!inputs.target_stage
runs-on: 8-gpu-h200
timeout-minutes: 240
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
timeout-minutes: 20
run: |
bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Run multi-GPU test
timeout-minutes: 45
run: |
cd test/
python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-8-gpu-h200
jit-kernel-benchmark-test:
if: |
github.event_name != 'schedule' &&
inputs.test_parallel_dispatch != 'true' &&
!inputs.target_stage
runs-on: 1-gpu-h100
timeout-minutes: 240
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
- uses: ./.github/actions/check-stage-health
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
timeout-minutes: 20
run: |
bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Run benchmark tests
timeout-minutes: 45
run: |
cd test/
python3 run_suite.py --hw cuda --suite stage-b-kernel-benchmark-1-gpu-large
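Each job in the JIT kernel workflow shares the same `if:` gate and the same checkout `ref` fallback. The sketch below mirrors both in plain JavaScript to show how they evaluate. The function names are hypothetical, and the sketch assumes that JavaScript truthiness matches the Actions expression semantics for empty strings, which should hold for these string inputs.

```javascript
// Hypothetical helpers mirroring the workflow expressions; not part of the repository.

// ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
// The first non-empty value wins; the triggering commit SHA is the last resort.
function resolveCheckoutRef(inputs, githubSha) {
  return inputs.pr_head_sha || inputs.git_ref || githubSha;
}

// if: github.event_name != 'schedule' &&
//     inputs.test_parallel_dispatch != 'true' &&
//     !inputs.target_stage
function shouldRunJitKernelJob(eventName, inputs) {
  return (
    eventName !== 'schedule' &&
    inputs.test_parallel_dispatch !== 'true' &&
    !inputs.target_stage
  );
}
```

Under these rules the JIT kernel jobs are skipped on scheduled runs, on parallel-dispatch runs, and whenever a specific `target_stage` is requested.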

@@ -0,0 +1,245 @@
name: PR Test - Multimodal Gen
on:
workflow_call:
inputs:
multimodal_gen:
required: true
type: string
sgl_kernel:
required: true
type: string
b200_runner:
required: true
type: string
continue_on_error:
required: false
type: string
default: 'false'
pr_head_sha:
required: false
type: string
default: ''
git_ref:
required: false
type: string
default: ''
target_stage:
required: false
type: string
default: ''
test_parallel_dispatch:
required: false
type: string
default: 'false'
caller_needs_failure:
required: false
type: string
default: 'false'
skip_stage_health_check:
required: false
type: string
default: 'false'
# Workflow-level env is NOT inherited from the caller in reusable workflows.
# The github context (including github.event_name) IS inherited from the caller.
env:
SGLANG_IS_IN_CI: true
SGLANG_CUDA_COREDUMP: "1"
SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }}
SKIP_STAGE_HEALTH_CHECK: ${{ inputs.skip_stage_health_check == 'true' }}
jobs:
multimodal-gen-test-1-gpu:
if: |
(inputs.target_stage == 'multimodal-gen-test-1-gpu') ||
(
!inputs.target_stage &&
((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) &&
inputs.multimodal_gen == 'true'
)
runs-on: 1-gpu-h100
timeout-minutes: 240
strategy:
fail-fast: false
matrix:
part: [0, 1]
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
- uses: ./.github/actions/check-stage-health
- uses: ./.github/actions/check-maintenance
- name: Download artifacts
if: inputs.sgl_kernel == 'true'
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-python3.10-cuda12.9
- name: Install dependencies
timeout-minutes: 20
run: |
CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Run diffusion server tests
timeout-minutes: 240
env:
RUNAI_STREAMER_MEMORY_LIMIT: 0
CONTINUE_ON_ERROR_FLAG: ${{ inputs.continue_on_error == 'true' && '--continue-on-error' || '' }}
run: |
cd python
python3 sglang/multimodal_gen/test/run_suite.py \
--suite 1-gpu \
--partition-id ${{ matrix.part }} \
--total-partitions 2 \
$CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.part }}
multimodal-gen-test-2-gpu:
if: |
(inputs.target_stage == 'multimodal-gen-test-2-gpu') ||
(
!inputs.target_stage &&
((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) &&
inputs.multimodal_gen == 'true'
)
runs-on: 2-gpu-h100
timeout-minutes: 240
strategy:
fail-fast: false
matrix:
part: [0, 1]
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
- uses: ./.github/actions/check-stage-health
- uses: ./.github/actions/check-maintenance
- name: Download artifacts
if: inputs.sgl_kernel == 'true'
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-python3.10-cuda12.9
- name: Install dependencies
timeout-minutes: 20
run: |
CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Run diffusion server tests
timeout-minutes: 240
env:
RUNAI_STREAMER_MEMORY_LIMIT: 0
CONTINUE_ON_ERROR_FLAG: ${{ inputs.continue_on_error == 'true' && '--continue-on-error' || '' }}
run: |
cd python
python3 sglang/multimodal_gen/test/run_suite.py \
--suite 2-gpu \
--partition-id ${{ matrix.part }} \
--total-partitions 2 \
$CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.part }}
multimodal-gen-test-1-b200:
if: |
(inputs.target_stage == 'multimodal-gen-test-1-b200') ||
(
!inputs.target_stage &&
((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) &&
inputs.multimodal_gen == 'true'
)
runs-on: ${{ inputs.b200_runner }}
timeout-minutes: 240
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
- uses: ./.github/actions/check-maintenance
- name: Download artifacts
if: inputs.sgl_kernel == 'true'
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-python3.10-cuda12.9
- name: Install dependencies
timeout-minutes: 20
run: |
CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Run diffusion server tests
timeout-minutes: 240
env:
RUNAI_STREAMER_MEMORY_LIMIT: 0
CONTINUE_ON_ERROR_FLAG: ${{ inputs.continue_on_error == 'true' && '--continue-on-error' || '' }}
run: |
cd python
python3 sglang/multimodal_gen/test/run_suite.py \
--suite 1-gpu-b200 \
$CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
multimodal-gen-unit-test:
if: |
(inputs.target_stage == 'multimodal-gen-unit-test') ||
(
!inputs.target_stage &&
((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) &&
inputs.multimodal_gen == 'true'
)
runs-on: 1-gpu-h100
timeout-minutes: 120
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
- uses: ./.github/actions/check-stage-health
- uses: ./.github/actions/check-maintenance
- name: Download artifacts
if: inputs.sgl_kernel == 'true'
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-python3.10-cuda12.9
- name: Install dependencies
timeout-minutes: 20
run: |
CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Run diffusion unit tests
timeout-minutes: 60
run: |
cd python
python3 sglang/multimodal_gen/test/run_suite.py --suite unit
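The four multimodal jobs share one `if:` shape: an explicit `target_stage` match always runs the job, and otherwise the job runs only when no stage is targeted, the run is either forced or healthy, and `multimodal_gen` is `'true'`. A minimal sketch of that decision follows, assuming the hypothetical names `shouldRunStage`, `forced`, and `healthy`. It uses strict string equality, whereas Actions expressions compare strings case-insensitively.

```javascript
// Hypothetical sketch of the shared job condition; names are illustrative only.
function shouldRunStage(stageName, ctx) {
  const { eventName, inputs, cancelled } = ctx;
  // Explicitly targeting this stage bypasses every other check.
  if (inputs.target_stage === stageName) return true;
  // Scheduled and parallel-dispatch runs ignore upstream failures.
  const forced =
    eventName === 'schedule' || inputs.test_parallel_dispatch === 'true';
  // Otherwise the caller must report no failed dependency and no cancellation.
  const healthy = inputs.caller_needs_failure !== 'true' && !cancelled;
  return (
    !inputs.target_stage && (forced || healthy) && inputs.multimodal_gen === 'true'
  );
}
```

Note that targeting a different stage disables the job even when `multimodal_gen` is `'true'`, because `!inputs.target_stage` is then false.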

@@ -0,0 +1,453 @@
name: PR Test (NPU)
on:
push:
branches: [ main ]
pull_request:
branches: [ main ]
workflow_dispatch:
workflow_call:
inputs:
ref:
description: 'Git ref (branch, tag, or SHA) to test. If not provided, falls back to github.ref.'
required: false
type: string
default: ''
run_all_tests:
description: "Run all tests (for releasing or testing purpose)"
required: false
type: boolean
default: false
concurrency:
group: pr-test-npu-${{ inputs.ref || github.ref }}
cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
jobs:
# ==================== Check Changes ==================== #
check-changes:
runs-on: ubuntu-latest
outputs:
changes_exist: ${{ steps.filter.outputs.main_package == 'true' || steps.filter.outputs.multimodal_gen == 'true' || steps.run-mode.outputs.run_all_tests == 'true'}}
main_package: ${{ steps.filter.outputs.main_package == 'true' || steps.run-mode.outputs.run_all_tests == 'true' }}
multimodal_gen: ${{ steps.filter.outputs.multimodal_gen == 'true' || steps.run-mode.outputs.run_all_tests == 'true' }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Determine run mode
id: run-mode
run: |
# Run all tests when the run_all_tests input is true (e.g. for release or manual full runs)
# Note: github.event_name is inherited from the caller, so it cannot be used to detect workflow_call
if [[ "${{ inputs.run_all_tests }}" == "true" ]]; then
echo "run_all_tests=true" >> $GITHUB_OUTPUT
echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }})"
else
echo "run_all_tests=false" >> $GITHUB_OUTPUT
echo "Run mode: FILTERED (triggered by ${{ github.event_name }})"
fi
- name: Detect file changes
id: filter
uses: dorny/paths-filter@v3
if: steps.run-mode.outputs.run_all_tests != 'true'
with:
filters: |
main_package:
- "python/sglang/!(multimodal_gen)/**/!(*.md)"
- "python/pyproject_npu.toml"
- "scripts/ci/npu/npu_ci_install_dependency.sh"
- "test/srt/ascend/**"
- ".github/workflows/pr-test-npu.yml"
multimodal_gen:
- "python/sglang/multimodal_gen/**/*.!(md|ipynb)"
- "python/sglang/srt/**"
- "python/pyproject_npu.toml"
- "scripts/ci/npu/npu_ci_install_dependency.sh"
- ".github/workflows/pr-test-npu.yml"
# ==================== PR Gate ==================== #
pr-gate:
needs: check-changes
if: needs.check-changes.outputs.changes_exist == 'true'
uses: ./.github/workflows/pr-gate.yml
secrets: inherit
stage-b-test-1-npu-a2:
needs: [check-changes, pr-gate]
if: needs.check-changes.outputs.main_package == 'true'
runs-on: linux-aarch64-a2-1
strategy:
fail-fast: false
matrix:
part: [ 0, 1 ]
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-910b-ubuntu22.04-py3.11
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Mark repository safe
run: |
git config --system --add safe.directory ${GITHUB_WORKSPACE}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
bash scripts/ci/npu/npu_ci_install_dependency.sh 910b
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Run test
timeout-minutes: 60
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
cd test
python3 run_suite.py --hw npu --suite stage-b-test-1-npu-a2 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2
stage-b-test-2-npu-a2:
needs: [check-changes, pr-gate]
if: needs.check-changes.outputs.main_package == 'true'
runs-on: linux-aarch64-a2-2
strategy:
fail-fast: true
matrix:
part: [0, 1]
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-910b-ubuntu22.04-py3.11
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Mark repository safe
run: |
git config --system --add safe.directory ${GITHUB_WORKSPACE}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
bash scripts/ci/npu/npu_ci_install_dependency.sh 910b
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Run test
timeout-minutes: 60
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
cd test
python3 run_suite.py --hw npu --suite stage-b-test-2-npu-a2 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2
stage-b-test-4-npu-a3:
needs: [check-changes, pr-gate]
if: needs.check-changes.outputs.main_package == 'true'
runs-on: linux-aarch64-a3-4
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Mark repository safe
run: |
git config --system --add safe.directory ${GITHUB_WORKSPACE}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
bash scripts/ci/npu/npu_ci_install_dependency.sh a3
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Run test
timeout-minutes: 60
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
cd test
python3 run_suite.py --hw npu --suite stage-b-test-4-npu-a3 --timeout-per-file 3600
stage-b-test-16-npu-a3:
needs: [check-changes, pr-gate]
if: needs.check-changes.outputs.main_package == 'true'
runs-on: linux-aarch64-a3-16
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Mark repository safe
run: |
git config --system --add safe.directory ${GITHUB_WORKSPACE}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
bash scripts/ci/npu/npu_ci_install_dependency.sh a3
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Run test
timeout-minutes: 60
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
cd test
python3 run_suite.py --hw npu --suite stage-b-test-16-npu-a3 --timeout-per-file 3600
multimodal-gen-test-1-npu-a3:
needs: [check-changes, pr-gate]
if: needs.check-changes.outputs.multimodal_gen == 'true'
runs-on: linux-aarch64-a3-2
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-a3-ubuntu22.04-py3.11
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Mark repository safe
run: |
git config --system --add safe.directory ${GITHUB_WORKSPACE}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
bash scripts/ci/npu/npu_ci_install_dependency.sh a3 diffusion
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Run test
timeout-minutes: 60
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}"
cd python
python3 sglang/multimodal_gen/test/run_suite.py --suite 1-npu
multimodal-gen-test-2-npu-a3:
needs: [check-changes, pr-gate]
if: needs.check-changes.outputs.multimodal_gen == 'true'
runs-on: linux-aarch64-a3-16
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-a3-ubuntu22.04-py3.11
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Mark repository safe
run: |
git config --system --add safe.directory ${GITHUB_WORKSPACE}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
bash scripts/ci/npu/npu_ci_install_dependency.sh a3 diffusion
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Run test
timeout-minutes: 60
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}"
cd python
python3 sglang/multimodal_gen/test/run_suite.py --suite 2-npu
multimodal-gen-test-8-npu-a3:
needs: [check-changes, pr-gate]
if: needs.check-changes.outputs.multimodal_gen == 'true'
runs-on: linux-aarch64-a3-8
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Mark repository safe
run: |
git config --system --add safe.directory ${GITHUB_WORKSPACE}
- name: Install dependencies
env:
TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
run: |
# speed up by using infra cache services
CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
pip config set global.index-url http://${CACHING_URL}/pypi/simple
pip config set global.trusted-host "${CACHING_URL}"
bash scripts/ci/npu/npu_ci_install_dependency.sh a3 diffusion
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy gsm8k dataset
cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
- name: Run test
timeout-minutes: 60
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
STREAMS_PER_DEVICE: 32
run: |
cd python
python3 sglang/multimodal_gen/test/run_suite.py --suite 8-npu
pr-test-npu-finish:
needs:
[
check-changes,
stage-b-test-1-npu-a2,
stage-b-test-2-npu-a2,
stage-b-test-4-npu-a3,
stage-b-test-16-npu-a3,
multimodal-gen-test-1-npu-a3,
multimodal-gen-test-2-npu-a3,
multimodal-gen-test-8-npu-a3,
]
if: always()
runs-on: ubuntu-latest
steps:
- name: Check all dependent job statuses
run: |
# Convert the 'needs' context to a JSON string
json_needs='${{ toJson(needs) }}'
# Get a list of all job names from the JSON keys
job_names=$(echo "$json_needs" | jq -r 'keys_unsorted[]')
for job in $job_names; do
# For each job, extract its result
result=$(echo "$json_needs" | jq -r --arg j "$job" '.[$j].result')
# Print the job name and its result
echo "$job: $result"
# Check for failure or cancellation and exit if found
if [[ "$result" == "failure" || "$result" == "cancelled" ]]; then
echo "Job '$job' finished with result '$result'."
exit 1
fi
done
# If the loop completes, no job failed or was cancelled (skipped jobs are tolerated)
echo "All jobs completed successfully"
exit 0
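The finish step treats only `failure` and `cancelled` as blocking, so `skipped` jobs (for example, ones filtered out by `check-changes`) still let the workflow pass. A minimal sketch of the same rule over the `needs` object is below. `findBlockingJobs` is a hypothetical name, and unlike the shell loop, which exits on the first blocking job, it collects all of them.

```javascript
// Hypothetical sketch of the status aggregation; `needs` has the shape of toJson(needs).
function findBlockingJobs(needs) {
  return Object.entries(needs)
    // Only failure and cancellation block; success and skipped pass through.
    .filter(([, job]) => job.result === 'failure' || job.result === 'cancelled')
    .map(([name]) => name);
}
```

An empty array corresponds to the `exit 0` path; any entry corresponds to `exit 1`.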

@@ -0,0 +1,359 @@
name: PR Test (SMG)
on:
push:
branches: [ main ]
paths:
- "sgl-model-gateway/**"
pull_request:
branches: [ main ]
types: [opened, synchronize, reopened, labeled]
paths:
- "sgl-model-gateway/**"
workflow_dispatch:
concurrency:
group: gateway-tests-${{ github.ref }}
cancel-in-progress: true
env:
RUSTC_WRAPPER: sccache
SCCACHE_GHA_ENABLED: "true"
SGLANG_IS_IN_CI: true
jobs:
build-wheel:
if: |
github.event_name != 'pull_request' ||
(github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) ||
(github.event.action == 'labeled' && github.event.label.name == 'run-ci')
runs-on: 4-gpu-a10
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install rust dependencies
run: |
bash scripts/ci/cuda/ci_install_gateway_dependencies.sh
- name: Configure sccache
uses: mozilla-actions/sccache-action@v0.0.9
with:
version: "v0.12.0"
disable_annotations: true
- name: Rust cache
uses: Swatinem/rust-cache@v2
with:
workspaces: sgl-model-gateway
shared-key: "rust-cache"
cache-all-crates: true
cache-on-failure: true
save-if: true
- name: Build python binding
run: |
source "$HOME/.cargo/env"
export RUSTC_WRAPPER=sccache
cd sgl-model-gateway/bindings/python
python3 -m pip install --upgrade pip maturin
maturin build --profile ci --features vendored-openssl --out dist
- name: List built wheel
run: ls -lh sgl-model-gateway/bindings/python/dist/
- name: Upload wheel artifact
uses: actions/upload-artifact@v4
with:
name: smg-wheel
path: sgl-model-gateway/bindings/python/dist/*.whl
retention-days: 1
- name: Test wheel install
run: |
pip install sgl-model-gateway/bindings/python/dist/*.whl
python3 -c "import sglang_router; print('Python package: OK')"
python3 -c "from sglang_router.sglang_router_rs import Router; print('Rust extension: OK')"
python3 -m sglang_router.launch_router --help > /dev/null && echo "Entry point: OK"
python-unit-tests:
needs: build-wheel
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
path: sglang-repo
- name: Move sgl-model-gateway folder to root
run: |
mv sglang-repo/sgl-model-gateway/* .
rm -rf sglang-repo
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.13"
- name: Download wheel artifact
uses: actions/download-artifact@v4
with:
name: smg-wheel
path: dist/
- name: Install wheel
run: pip install dist/*.whl
- name: Run Python unit tests
run: |
cd bindings/python
python3 -m pip install pytest pytest-cov pytest-xdist
pytest -q tests --cov=sglang_router --cov-config=.coveragerc --cov-report=term-missing --cov-fail-under=80
unit-tests:
if: |
github.event_name != 'pull_request' ||
(github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) ||
(github.event.action == 'labeled' && github.event.label.name == 'run-ci')
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_gateway_dependencies.sh
- name: Configure sccache
uses: mozilla-actions/sccache-action@v0.0.9
with:
version: "v0.12.0"
disable_annotations: true
- name: Rust cache
uses: Swatinem/rust-cache@v2
with:
workspaces: sgl-model-gateway
shared-key: "rust-cache"
cache-all-crates: true
cache-on-failure: true
save-if: true
- name: Run lint
run: |
source "$HOME/.cargo/env"
cd sgl-model-gateway/
rustup component add clippy
cargo clippy --all-targets --all-features -- -D warnings
- name: Run fmt
run: |
source "$HOME/.cargo/env"
cd sgl-model-gateway/
rustup component add --toolchain nightly-x86_64-unknown-linux-gnu rustfmt
rustup toolchain install nightly --profile minimal
cargo +nightly fmt -- --check
- name: Generate vision golden fixtures
run: |
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install transformers pillow numpy scipy
cd sgl-model-gateway/
python scripts/generate_vision_golden.py
- name: Run Rust tests
timeout-minutes: 20
run: |
source "$HOME/.cargo/env"
cd sgl-model-gateway/
cargo test
- name: Show sccache stats
if: always()
run: sccache --show-stats
gateway-e2e:
name: ${{ matrix.name }}
needs: build-wheel
if: |
github.event_name != 'pull_request' ||
(github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) ||
(github.event.action == 'labeled' && github.event.label.name == 'run-ci')
strategy:
fail-fast: false
matrix:
include:
- name: benchmarks
timeout: 32
test_dirs: "e2e_test/benchmarks"
extra_deps: "genai-bench==0.0.3"
env_vars: ""
reruns: ""
upload_benchmarks: true
parallel_opts: "" # No parallel for benchmarks (performance measurement)
- name: responses
timeout: 45
test_dirs: "e2e_test/responses"
extra_deps: ""
env_vars: "SHOW_WORKER_LOGS=0 SHOW_ROUTER_LOGS=1"
reruns: "--reruns 2 --reruns-delay 5"
setup_oracle: true
setup_brave: true
parallel_opts: "" # Cloud backend tests not compatible with parallel execution
- name: e2e
timeout: 45
test_dirs: "e2e_test/router e2e_test/embeddings"
extra_deps: "pytest-parallel py" # py is required for pytest-parallel with newer pytest
env_vars: "SHOW_WORKER_LOGS=0 SHOW_ROUTER_LOGS=1"
reruns: "--reruns 2 --reruns-delay 5"
parallel_opts: "--workers 1 --tests-per-worker 4" # Thread-based parallelism
- name: chat-completions
timeout: 45
test_dirs: "e2e_test/chat_completions"
extra_deps: ""
env_vars: "SHOW_WORKER_LOGS=0 SHOW_ROUTER_LOGS=1"
reruns: "--reruns 2 --reruns-delay 5"
parallel_opts: ""
runs-on: 4-gpu-a10
timeout-minutes: ${{ matrix.timeout }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install SGLang dependencies
run: |
sudo --preserve-env=PATH bash scripts/ci/cuda/ci_install_dependency.sh
- name: Setup Oracle Instant Client
if: matrix.setup_oracle
run: |
sudo apt-get install -y unzip
INSTANT_CLIENT_DIR="/home/ubuntu/instant-client"
INSTANT_CLIENT_ZIP="instantclient-basic-linux.x64-23.9.0.25.07.zip"
if [ ! -d "$INSTANT_CLIENT_DIR/instantclient_23_9" ]; then
echo "Downloading Oracle Instant Client..."
mkdir -p "$INSTANT_CLIENT_DIR"
cd "$INSTANT_CLIENT_DIR"
wget https://download.oracle.com/otn_software/linux/instantclient/2390000/$INSTANT_CLIENT_ZIP
unzip $INSTANT_CLIENT_ZIP
rm $INSTANT_CLIENT_ZIP
else
echo "Oracle Instant Client already exists, skipping download"
fi
echo "LD_LIBRARY_PATH=/home/ubuntu/instant-client/instantclient_23_9:\$LD_LIBRARY_PATH" >> $GITHUB_ENV
- name: Start Oracle Database
if: matrix.setup_oracle
run: |
docker run -d -p 1521:1521 -e ORACLE_PASSWORD=oracle --name oracle-db gvenzl/oracle-xe:21-slim
echo "Starting Oracle DB..."
# Export Oracle connection environment variables
echo "ATP_USER=system" >> $GITHUB_ENV
echo "ATP_PASSWORD=oracle" >> $GITHUB_ENV
echo "ATP_DSN=localhost:1521/XEPDB1" >> $GITHUB_ENV
- name: Start Brave MCP Server
if: matrix.setup_brave
run: |
docker run -d --rm \
-p 8001:8080 \
-e BRAVE_API_KEY \
--name brave-search-server \
shoofio/brave-search-mcp-sse:1.0.10
echo "Starting Brave MCP Server..."
sleep 2
curl -f --max-time 1 http://localhost:8001/sse > /dev/null 2>&1 && echo "Brave MCP Server is healthy!" || echo "Brave MCP Server responded"
- name: Download wheel artifact
uses: actions/download-artifact@v4
with:
name: smg-wheel
path: wheel/
- name: Install wheel
run: |
pip uninstall -y sglang-router || true
pip install wheel/*.whl
- name: Install e2e test dependencies
run: |
python3 -m pip install pytest pytest-rerunfailures httpx openai grpcio grpcio-health-checking numpy
if [ -n "${{ matrix.extra_deps }}" ]; then
python3 -m pip --no-cache-dir install --upgrade ${{ matrix.extra_deps }}
fi
- name: Run E2E tests
run: |
python3 python/sglang/cli/killall.py
cd sgl-model-gateway
${{ matrix.env_vars }} ROUTER_LOCAL_MODEL_PATH="/home/ubuntu/models" pytest ${{ matrix.reruns }} ${{ matrix.parallel_opts }} ${{ matrix.test_dirs }} -s -vv -o log_cli=true --log-cli-level=INFO
- name: Upload benchmark results
if: matrix.upload_benchmarks && success()
uses: actions/upload-artifact@v4
with:
name: genai-bench-results-all-policies
path: sgl-model-gateway/benchmark_**/
- name: Cleanup Brave MCP Server
if: always() && matrix.setup_brave
run: |
docker stop brave-search-server || true
docker rm brave-search-server || true
- name: Cleanup Oracle Database
if: always() && matrix.setup_oracle
run: |
docker stop oracle-db || true
docker rm oracle-db || true
docker-build-test:
if: |
github.event_name != 'pull_request' ||
(github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) ||
(github.event.action == 'labeled' && github.event.label.name == 'run-ci')
runs-on: ubuntu-24.04
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build Docker image (no push)
uses: docker/build-push-action@v5
with:
context: .
file: docker/gateway.Dockerfile
push: false
tags: sgl-model-gateway:test
cache-from: type=gha
cache-to: type=gha,mode=max
finish:
needs: [build-wheel, python-unit-tests, unit-tests, gateway-e2e, docker-build-test]
runs-on: ubuntu-latest
steps:
- name: Finish
run: echo "This is an empty step to ensure that all jobs are completed."
summarize-benchmarks:
needs: gateway-e2e
runs-on: ubuntu-latest
if: success()
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Download benchmark results
uses: actions/download-artifact@v4
with:
name: genai-bench-results-all-policies
- name: Create benchmark summary
run: python3 sgl-model-gateway/e2e_test/benchmarks/summarize.py .

@@ -0,0 +1,214 @@
name: PR Test - SGL Kernel
on:
workflow_call:
inputs:
sgl_kernel:
required: true
type: string
b200_runner:
required: true
type: string
pr_head_sha:
required: false
type: string
default: ''
git_ref:
required: false
type: string
default: ''
skip_stage_health_check:
required: false
type: boolean
default: false
# Workflow-level env is NOT inherited from the caller in reusable workflows.
# The github context (including github.event_name) IS inherited from the caller.
env:
SGLANG_IS_IN_CI: true
SGLANG_CUDA_COREDUMP: "1"
SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }}
SKIP_STAGE_HEALTH_CHECK: ${{ inputs.skip_stage_health_check == true && 'true' || 'false' }}
jobs:
sgl-kernel-unit-test:
runs-on: 1-gpu-h100
timeout-minutes: 240
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
- uses: ./.github/actions/check-stage-health
- uses: ./.github/actions/check-maintenance
- name: Cleanup
run: |
ls -alh sgl-kernel/dist || true
rm -rf sgl-kernel/dist/* || true
- name: Download artifacts
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-python3.10-cuda12.9
- name: Install dependencies
timeout-minutes: 20
run: |
CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Run test
timeout-minutes: 30
run: |
cd sgl-kernel
pytest tests/
sgl-kernel-mla-test:
runs-on: 1-gpu-h100
timeout-minutes: 240
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
- uses: ./.github/actions/check-stage-health
- uses: ./.github/actions/check-maintenance
- name: Cleanup
run: |
ls -alh sgl-kernel/dist || true
rm -rf sgl-kernel/dist/* || true
- name: Download artifacts
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-python3.10-cuda12.9
- name: Install dependencies
timeout-minutes: 20
run: |
CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run test
timeout-minutes: 30
run: |
cd test/registered/mla
python3 test_mla_deepseek_v3.py
sgl-kernel-benchmark-test:
runs-on: 1-gpu-h100
timeout-minutes: 240
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
- uses: ./.github/actions/check-stage-health
- uses: ./.github/actions/check-maintenance
- name: Cleanup
run: |
ls -alh sgl-kernel/dist || true
rm -rf sgl-kernel/dist/* || true
- name: Download artifacts
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-python3.10-cuda12.9
- name: Install dependencies
timeout-minutes: 20
run: |
CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run benchmark tests
timeout-minutes: 45
run: |
cd sgl-kernel/benchmark
echo "Running sgl-kernel benchmark tests in CI mode..."
echo "CI environment variable: $CI"
echo "GITHUB_ACTIONS environment variable: $GITHUB_ACTIONS"
for bench_file in bench_*.py; do
echo "Testing $bench_file..."
timeout 60 python3 "$bench_file" || echo "Warning: $bench_file timed out or failed, continuing..."
echo "Completed $bench_file"
echo "---"
done
echo "All benchmark tests completed!"
sgl-kernel-b200-test:
runs-on: ${{ inputs.b200_runner }}
timeout-minutes: 240
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
- uses: ./.github/actions/check-stage-health
- uses: ./.github/actions/check-maintenance
- name: Cleanup
run: |
ls -alh sgl-kernel/dist || true
rm -rf sgl-kernel/dist/* || true
- name: Download artifacts
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-python3.10-cuda12.9
- name: Install dependencies
timeout-minutes: 20
run: |
CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
- name: Run sgl-kernel unit tests on B200
timeout-minutes: 30
run: |
cd sgl-kernel
pytest tests/
# A single CUDA 13 smoke test to verify that the kernel builds and runs.
# TODO: Re-enable this test once it passes in CI.
# cuda13-kernel-smoke-test:
# if: inputs.sgl_kernel == 'true'
# runs-on: x64-cu13-kernel-tests
# steps:
# - uses: actions/checkout@v4
# - name: Cleanup
# run: |
# ls -alh sgl-kernel/dist || true
# rm -rf sgl-kernel/dist/* || true
# - name: Download CUDA 13.0 artifacts
# uses: actions/download-artifact@v4
# with:
# path: sgl-kernel/dist/
# merge-multiple: true
# pattern: wheel-python3.10-cuda13.0
# - name: Install dependencies
# run: |
# CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh
# - name: Run kernel unit tests
# timeout-minutes: 30
# run: |
# cd sgl-kernel
# pytest tests/

@@ -0,0 +1,131 @@
name: PR Test (Xeon)
on:
push:
branches: [ main ]
pull_request:
branches: [ main ]
workflow_dispatch:
workflow_call:
inputs:
ref:
description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the ref that triggered the workflow.'
required: false
type: string
default: ''
run_all_tests:
description: "Run all tests (for release or testing purposes)"
required: false
type: boolean
default: false
concurrency:
group: pr-test-xeon-${{ inputs.ref || github.ref }}
cancel-in-progress: false
jobs:
# ==================== Check Changes ==================== #
check-changes:
runs-on: ubuntu-latest
outputs:
main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Determine run mode
id: run-mode
run: |
# Run all tests when the run_all_tests input is true (callers such as the release workflow set it).
# Note: github.event_name is inherited from the caller, so it cannot identify a workflow_call; the explicit input is checked instead.
if [[ "${{ inputs.run_all_tests }}" == "true" ]]; then
echo "run_all_tests=true" >> $GITHUB_OUTPUT
echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }})"
else
echo "run_all_tests=false" >> $GITHUB_OUTPUT
echo "Run mode: FILTERED (triggered by ${{ github.event_name }})"
fi
- name: Detect file changes
id: filter
uses: dorny/paths-filter@v3
if: steps.run-mode.outputs.run_all_tests != 'true'
with:
filters: |
main_package:
- "python/sglang/!(multimodal_gen)/**/!(*.md)"
- "python/pyproject_cpu.toml"
- "test/**/!(*.md)"
- "sgl-kernel/**/*.!(md|txt)"
- ".github/workflows/pr-test-xeon.yml"
- "docker/xeon.Dockerfile"
# ==================== PR Gate ==================== #
pr-gate:
needs: check-changes
if: needs.check-changes.outputs.main_package == 'true'
uses: ./.github/workflows/pr-gate.yml
secrets: inherit
build-test:
needs: [check-changes, pr-gate]
if: needs.check-changes.outputs.main_package == 'true'
runs-on: xeon-gnr
env:
HF_HOME: /home/sdp/.cache/huggingface
strategy:
matrix:
build_type: ['all']
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Build and Push
run: |
version=$(cat python/sglang/version.py | cut -d'"' -f2)
tag=v${version}-xeon
PR_REPO=${{ github.event.pull_request.head.repo.clone_url }}
PR_HEAD_REF=${{ github.head_ref }}
docker build \
${PR_REPO:+--build-arg SGLANG_REPO=$PR_REPO} \
${PR_HEAD_REF:+--build-arg VER_SGLANG=$PR_HEAD_REF} \
. -f docker/xeon.Dockerfile -t sglang_xeon --no-cache
- name: Run container
run: |
docker run -dt \
-v ${{ github.workspace }}:/sglang-checkout/ --ipc=host \
-v ${HF_HOME}:/root/.cache/huggingface \
--name ci_sglang_xeon \
sglang_xeon
- name: Check AMX support
id: check_amx
timeout-minutes: 5
run: |
docker exec -w /sglang-checkout/ ci_sglang_xeon \
bash -c "source /opt/.venv/bin/activate && python3 -c 'import torch; import sgl_kernel; assert torch._C._cpu._is_amx_tile_supported(); assert hasattr(torch.ops.sgl_kernel, \"convert_weight_packed\"); '"
- name: Run unit tests
timeout-minutes: 36
run: |
docker exec -w /sglang-checkout/ ci_sglang_xeon \
bash -c "source /opt/.venv/bin/activate && cd ./test/srt && python3 run_suite.py --suite per-commit-cpu --timeout-per-file 1500"
- name: Change permission
timeout-minutes: 2
run: |
docker exec -u root ci_sglang_xeon bash -c "
rm -rf /tmp/ci-home &&
chown -R $(id -u):$(id -g) /sglang-checkout/ 2>/dev/null || true
"
- name: Cleanup container
if: always()
run: |
docker rm -f ci_sglang_xeon || true

@@ -0,0 +1,143 @@
name: PR Test (XPU)
on:
push:
branches: [ main ]
pull_request:
branches: [ main ]
workflow_dispatch:
workflow_call:
inputs:
ref:
description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the ref that triggered the workflow.'
required: false
type: string
default: ''
run_all_tests:
description: "Run all tests (for release or testing purposes)"
required: false
type: boolean
default: false
concurrency:
group: pr-test-xpu-${{ inputs.ref || github.ref }}
cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
jobs:
# ==================== Check Changes ==================== #
check-changes:
runs-on: ubuntu-latest
outputs:
main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Determine run mode
id: run-mode
run: |
# Run all tests when the run_all_tests input is true (callers such as the release workflow set it).
# Note: github.event_name is inherited from the caller, so it cannot identify a workflow_call; the explicit input is checked instead.
if [[ "${{ inputs.run_all_tests }}" == "true" ]]; then
echo "run_all_tests=true" >> $GITHUB_OUTPUT
echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }})"
else
echo "run_all_tests=false" >> $GITHUB_OUTPUT
echo "Run mode: FILTERED (triggered by ${{ github.event_name }})"
fi
- name: Detect file changes
id: filter
uses: dorny/paths-filter@v3
if: steps.run-mode.outputs.run_all_tests != 'true'
with:
filters: |
main_package:
- "python/sglang/!(multimodal_gen)/**/!(*.md)"
- "python/pyproject_xpu.toml"
- "test/**/!(*.md)"
- "sgl-kernel/**/*.!(md|txt)"
- ".github/workflows/pr-test-xpu.yml"
- "docker/xpu.Dockerfile"
# ==================== PR Gate ==================== #
pr-gate:
needs: check-changes
if: needs.check-changes.outputs.main_package == 'true'
uses: ./.github/workflows/pr-gate.yml
secrets: inherit
build-and-test:
needs: [check-changes, pr-gate]
if: needs.check-changes.outputs.main_package == 'true'
runs-on: intel-bmg
env:
HF_HOME: /home/sdp/.cache/huggingface
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
ref: ${{ inputs.ref || github.ref }}
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build Docker image
run: |
PR_REPO=${{ github.event.pull_request.head.repo.clone_url }}
PR_HEAD_REF=${{ github.head_ref }}
docker build \
${PR_REPO:+--build-arg SG_LANG_REPO=$PR_REPO} \
${PR_HEAD_REF:+--build-arg SG_LANG_BRANCH=$PR_HEAD_REF} \
--no-cache --progress=plain -f docker/xpu.Dockerfile -t xpu_sglang_main:bmg .
- name: Run container
id: start_container
run: |
container_id=$(docker run -dt \
--group-add 992 \
--group-add $(getent group video | cut -d: -f3) \
-v ${HF_HOME}:/root/.cache/huggingface \
--device /dev/dri \
-e HF_TOKEN="$(cat ~/huggingface_token.txt)" \
xpu_sglang_main:bmg)
echo "Started container: $container_id"
echo "container_id=$container_id" >> "$GITHUB_OUTPUT"
- name: Install Dependency
timeout-minutes: 20
run: |
cid="${{ steps.start_container.outputs.container_id }}"
docker exec "$cid" /home/sdp/miniforge3/envs/py3.10/bin/python3 -m pip install --upgrade pip
docker exec "$cid" /home/sdp/miniforge3/envs/py3.10/bin/python3 -m pip install pytest expecttest ray huggingface_hub
docker exec "$cid" /home/sdp/miniforge3/envs/py3.10/bin/python3 -m pip uninstall -y flashinfer-python
docker exec "$cid" /bin/bash -c '/home/sdp/miniforge3/envs/py3.10/bin/hf auth login --token ${HF_TOKEN} '
- name: Run E2E Bfloat16 tests
timeout-minutes: 20
run: |
cid="${{ steps.start_container.outputs.container_id }}"
docker exec "$cid" bash -c "source /home/sdp/miniforge3/bin/activate && conda activate py3.10 && cd /home/sdp/sglang/test/srt && python3 run_suite.py --suite per-commit-xpu"
- name: Cleanup container
if: always()
run: |
cid="${{ steps.start_container.outputs.container_id }}"
docker rm -f "$cid" || true
finish:
if: always()
needs: [build-and-test, pr-gate]
runs-on: ubuntu-latest
steps:
- name: Check job status
run: |
result="${{ needs.build-and-test.result }}"
if [ "$result" != "success" ] && [ "$result" != "skipped" ]; then
echo "Job failed with result: $result"
exit 1
fi
echo "All jobs completed successfully (result: $result)"
exit 0

File diff suppressed because it is too large

@@ -0,0 +1,215 @@
name: Release Branch Cut
on:
workflow_dispatch:
inputs:
branch_name:
description: 'Branch name to create (e.g., release/v0.5.7)'
required: true
type: string
commit_sha:
description: 'Commit SHA from main to cut the release branch from (defaults to latest main)'
required: false
type: string
default: ''
permissions:
actions: write
contents: write
issues: read
pull-requests: read
jobs:
cut-release-branch:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-latest
environment: 'prod'
outputs:
branch_name: ${{ steps.set_output.outputs.branch_name }}
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: main
fetch-depth: 0
token: ${{ secrets.GITHUB_TOKEN }}
- name: Validate branch name
run: |
BRANCH_NAME="${{ github.event.inputs.branch_name }}"
if [ -z "$BRANCH_NAME" ]; then
echo "::error::Branch name is required"
exit 1
fi
# Warn (without failing) if the branch name does not start with release/
if [[ ! "$BRANCH_NAME" =~ ^release/ ]]; then
echo "::warning::Branch name '$BRANCH_NAME' does not follow convention 'release/vX.Y.Z'"
fi
echo "Branch name: $BRANCH_NAME"
- name: Validate commit SHA
id: validate
run: |
COMMIT_SHA="${{ github.event.inputs.commit_sha }}"
# If no commit SHA provided, use latest main
if [ -z "$COMMIT_SHA" ]; then
COMMIT_SHA=$(git rev-parse HEAD)
echo "No commit SHA provided, using latest main: $COMMIT_SHA"
fi
# Verify the commit exists and is on main
if ! git cat-file -t "$COMMIT_SHA" > /dev/null 2>&1; then
echo "::error::Commit SHA '$COMMIT_SHA' does not exist"
exit 1
fi
# Check if commit is an ancestor of main (i.e., is on main branch)
if ! git merge-base --is-ancestor "$COMMIT_SHA" main; then
echo "::error::Commit SHA '$COMMIT_SHA' is not on the main branch"
exit 1
fi
echo "COMMIT_SHA=$COMMIT_SHA" >> $GITHUB_OUTPUT
echo "Validated commit SHA: $COMMIT_SHA"
- name: Check if branch already exists
run: |
BRANCH_NAME="${{ github.event.inputs.branch_name }}"
if git ls-remote --heads origin "$BRANCH_NAME" | grep -q "$BRANCH_NAME"; then
echo "::error::Branch '$BRANCH_NAME' already exists"
exit 1
fi
echo "Branch '$BRANCH_NAME' does not exist, proceeding with creation"
- name: Create release branch
id: set_output
run: |
COMMIT_SHA="${{ steps.validate.outputs.COMMIT_SHA }}"
BRANCH_NAME="${{ github.event.inputs.branch_name }}"
git config user.name "sglang-bot"
git config user.email "sglang-bot@users.noreply.github.com"
# Create branch from the specified commit
git checkout -b "$BRANCH_NAME" "$COMMIT_SHA"
echo "branch_name=$BRANCH_NAME" >> $GITHUB_OUTPUT
echo "Successfully created branch '$BRANCH_NAME' from commit '$COMMIT_SHA'"
- name: Update version references in documentation
run: |
BRANCH_NAME="${{ github.event.inputs.branch_name }}"
# Extract version from branch name (e.g., release/v0.5.8 -> v0.5.8)
VERSION=$(echo "$BRANCH_NAME" | sed 's/release\///')
# Update git clone version references in docs
sed -i "s/git clone -b v[0-9]\+\.[0-9]\+\.[0-9]\+\.\?post\?[0-9]*/git clone -b $VERSION/" docs/get_started/install.md
sed -i "s/git clone -b v[0-9]\+\.[0-9]\+\.[0-9]\+\.\?post\?[0-9]*/git clone -b $VERSION/" docs/platforms/amd_gpu.md
# Check if any changes were made
if git diff --quiet; then
echo "No version references needed updating"
else
git add docs/get_started/install.md docs/platforms/amd_gpu.md
git commit -m "docs: update version references to $VERSION"
echo "Updated version references to $VERSION"
fi
- name: Push release branch
run: |
BRANCH_NAME="${{ steps.set_output.outputs.branch_name }}"
git push origin "$BRANCH_NAME"
echo "Successfully pushed branch '$BRANCH_NAME'"
- name: Summary
run: |
COMMIT_SHA="${{ steps.validate.outputs.COMMIT_SHA }}"
BRANCH_NAME="${{ github.event.inputs.branch_name }}"
echo "## Release Branch Cut Summary" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "| Property | Value |" >> $GITHUB_STEP_SUMMARY
echo "|----------|-------|" >> $GITHUB_STEP_SUMMARY
echo "| Branch | \`$BRANCH_NAME\` |" >> $GITHUB_STEP_SUMMARY
echo "| Commit | \`$COMMIT_SHA\` |" >> $GITHUB_STEP_SUMMARY
echo "| Triggered by | @${{ github.actor }} |" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### Next Steps" >> $GITHUB_STEP_SUMMARY
echo "1. Tests are automatically triggered on the release branch" >> $GITHUB_STEP_SUMMARY
echo "2. Apply any hotfixes if needed" >> $GITHUB_STEP_SUMMARY
echo "3. Create a tag to trigger release: \`gh workflow run release-tag.yml -f version=X.Y.Z -f ref=$BRANCH_NAME\`" >> $GITHUB_STEP_SUMMARY
run-pr-tests-nvidia:
needs: cut-release-branch
uses: ./.github/workflows/pr-test.yml
with:
git_ref: ${{ needs.cut-release-branch.outputs.branch_name }}
run_all_tests: true
skip_stage_health_check: true
secrets: inherit
run-pr-tests-amd:
needs: cut-release-branch
uses: ./.github/workflows/pr-test-amd.yml
with:
ref: ${{ needs.cut-release-branch.outputs.branch_name }}
run_all_tests: true
secrets: inherit
run-pr-test-npu:
needs: cut-release-branch
uses: ./.github/workflows/pr-test-npu.yml
with:
ref: ${{ needs.cut-release-branch.outputs.branch_name }}
run_all_tests: true
secrets: inherit
run-pr-tests-xeon:
needs: cut-release-branch
uses: ./.github/workflows/pr-test-xeon.yml
with:
ref: ${{ needs.cut-release-branch.outputs.branch_name }}
run_all_tests: true
secrets: inherit
run-pr-tests-xpu:
needs: cut-release-branch
uses: ./.github/workflows/pr-test-xpu.yml
with:
ref: ${{ needs.cut-release-branch.outputs.branch_name }}
run_all_tests: true
secrets: inherit
run-nightly-tests-nvidia:
needs: cut-release-branch
uses: ./.github/workflows/nightly-test-nvidia.yml
with:
ref: ${{ needs.cut-release-branch.outputs.branch_name }}
secrets: inherit
run-nightly-tests-amd:
needs: cut-release-branch
uses: ./.github/workflows/nightly-test-amd.yml
with:
ref: ${{ needs.cut-release-branch.outputs.branch_name }}
secrets: inherit
run-nightly-tests-npu:
needs: cut-release-branch
uses: ./.github/workflows/nightly-test-npu.yml
with:
ref: ${{ needs.cut-release-branch.outputs.branch_name }}
secrets: inherit
run-nightly-tests-intel:
needs: cut-release-branch
uses: ./.github/workflows/nightly-test-intel.yml
with:
ref: ${{ needs.cut-release-branch.outputs.branch_name }}
secrets: inherit

@@ -0,0 +1,182 @@
name: Release Docker Images Nightly (AMD)
on:
workflow_dispatch:
schedule:
- cron: '0 12 * * *'
concurrency:
# A PR number if a pull request and otherwise the commit hash. This cancels
# queued and in-progress runs for the same PR (presubmit) or commit
# (postsubmit). The workflow name is prepended to avoid conflicts between
# different workflows.
group: ${{ github.workflow }}-${{ github.event.number || github.sha }}
cancel-in-progress: true
jobs:
publish:
if: github.repository == 'sgl-project/sglang'
runs-on: amd-docker-scale
environment: 'prod'
strategy:
fail-fast: false
matrix:
gpu_arch: ['gfx942', 'gfx950']
build_type: ['all']
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0 # Fetch full history so that all version tags are available to git tag -l
- name: "Set Date"
run: |
echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV
- name: Get version from latest tag
id: version
run: |
# Get the latest version tag sorted by version number (e.g., v0.5.7 -> 0.5.7)
VERSION=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1 | sed 's/^v//')
if [ -z "$VERSION" ]; then
echo "::error::Could not determine version from git tags"
exit 1
fi
# Get short commit hash of current HEAD
COMMIT_HASH=$(git rev-parse --short HEAD)
# Compose pretend version for setuptools_scm: e.g., 0.5.8.dev20260129+g1a2b3c4
PRETEND_VERSION="${VERSION}.dev${{ env.DATE }}+g${COMMIT_HASH}"
echo "version=${VERSION}" >> $GITHUB_OUTPUT
echo "pretend_version=${PRETEND_VERSION}" >> $GITHUB_OUTPUT
echo "Detected version: ${VERSION}"
echo "Pretend version for pip: ${PRETEND_VERSION}"
- name: Login to Docker Hub (AMD)
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_AMD_USERNAME }}
password: ${{ secrets.DOCKERHUB_AMD_TOKEN }}
- name: Build and Push to rocm/sgl-dev
run: |
version=${{ steps.version.outputs.version }}
pretend_version=${{ steps.version.outputs.pretend_version }}
echo "Version: ${version}"
echo "Pretend version: ${pretend_version}"
if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
rocm_tag="rocm700-mi30x"
elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
rocm_tag="rocm700-mi35x"
else
echo "Unsupported gfx arch"
exit 1
fi
tag=v${version}-${rocm_tag}
echo "IMAGE_TAG=${tag}-${{ env.DATE }}" >> $GITHUB_ENV
docker build . -f docker/rocm.Dockerfile --build-arg SGL_BRANCH=${{ github.ref_name }} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache
docker push rocm/sgl-dev:${tag}-${{ env.DATE }}
- name: Login to Docker Hub (lmsys)
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Push to lmsysorg/sglang-rocm
run: |
docker tag rocm/sgl-dev:${{ env.IMAGE_TAG }} lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }}
docker push lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }}
# Temporarily disable docker cache seeding until performant storage is in place
cache:
if: false
# if: always() && github.repository == 'sgl-project/sglang'
runs-on: linux-mi300-gpu-1
environment: 'prod'
needs: publish
strategy:
fail-fast: false
matrix:
gpu_arch: ['gfx942']
build_type: ['all']
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0 # Fetch full history so that all version tags are available to git tag -l
- name: "Set Date"
run: |
echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV
- name: Get version from latest tag
id: version
run: |
# Get the latest version tag sorted by version number (e.g., v0.5.7 -> 0.5.7)
VERSION=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1 | sed 's/^v//')
if [ -z "$VERSION" ]; then
echo "::error::Could not determine version from git tags"
exit 1
fi
echo "version=${VERSION}" >> $GITHUB_OUTPUT
echo "Detected version: ${VERSION}"
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_AMD_USERNAME }}
password: ${{ secrets.DOCKERHUB_AMD_TOKEN }}
- name: Pull and Save Docker Image to Cache
run: |
set -euxo pipefail
version=${{ steps.version.outputs.version }}
echo "Version: ${version}"
if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
rocm_tag="rocm700-mi30x"
else
echo "Unsupported gfx arch"
exit 1
fi
tag=v${version}-${rocm_tag}
if [ "${{ matrix.build_type }}" = "all" ]; then
tag_suffix=""
else
echo "Unsupported build type"
exit 1
fi
image="rocm/sgl-dev:${tag}-${{ env.DATE }}${tag_suffix}"
# Determine target cache file name based on ROCm variant
if [[ "${rocm_tag}" == rocm700* ]]; then
final_path="/home/runner/sgl-data/docker/image-700.tar"
else
echo "Unexpected ROCm tag: ${rocm_tag}"
exit 1
fi
tmp_path="${final_path}.tmp"
echo "Pulling image: ${image}"
docker pull "${image}"
echo "Saving to temp file: ${tmp_path}"
docker save "${image}" -o "${tmp_path}"
echo "Moving to final path: ${final_path}"
mv -f "${tmp_path}" "${final_path}"
echo "Cache populated successfully at ${final_path}"

@@ -0,0 +1,94 @@
name: Release Docker Images ROCm 7.2.0 Nightly Preview (AMD)
on:
workflow_dispatch:
schedule:
- cron: '0 12 * * *'
concurrency:
# A PR number if a pull request and otherwise the commit hash. This cancels
# queued and in-progress runs for the same PR (presubmit) or commit
# (postsubmit). The workflow name is prepended to avoid conflicts between
# different workflows.
group: ${{ github.workflow }}-${{ github.event.number || github.sha }}
cancel-in-progress: true
jobs:
publish:
if: github.repository == 'sgl-project/sglang'
runs-on: amd-docker-scale
environment: 'prod'
strategy:
fail-fast: false
matrix:
gpu_arch: ['gfx942-rocm720', 'gfx950-rocm720']
build_type: ['all']
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0 # Fetch full history so that all version tags are available to git tag -l
- name: "Set Date"
run: |
echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV
- name: Get version from latest tag
id: version
run: |
# Get the latest version tag sorted by version number (e.g., v0.5.7 -> 0.5.7)
VERSION=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1 | sed 's/^v//')
if [ -z "$VERSION" ]; then
echo "::error::Could not determine version from git tags"
exit 1
fi
# Get short commit hash of current HEAD
COMMIT_HASH=$(git rev-parse --short HEAD)
# Compose pretend version for setuptools_scm: e.g., 0.5.8.post1.dev20260211+g1a2b3c4
PRETEND_VERSION="${VERSION}.dev${{ env.DATE }}+g${COMMIT_HASH}"
echo "version=${VERSION}" >> $GITHUB_OUTPUT
echo "pretend_version=${PRETEND_VERSION}" >> $GITHUB_OUTPUT
echo "Detected version: ${VERSION}"
echo "Pretend version for pip: ${PRETEND_VERSION}"
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_AMD_USERNAME }}
password: ${{ secrets.DOCKERHUB_AMD_TOKEN }}
- name: Build and Push to rocm/sgl-dev
run: |
version=${{ steps.version.outputs.version }}
pretend_version=${{ steps.version.outputs.pretend_version }}
echo "Version: ${version}"
echo "Pretend version: ${pretend_version}"
if [ "${{ matrix.gpu_arch }}" = "gfx942-rocm720" ]; then
rocm_tag="rocm720-mi30x"
elif [ "${{ matrix.gpu_arch }}" = "gfx950-rocm720" ]; then
rocm_tag="rocm720-mi35x"
else
echo "Unsupported gfx arch"
exit 1
fi
tag=v${version}-${rocm_tag}
echo "IMAGE_TAG=${tag}-${{ env.DATE }}" >> $GITHUB_ENV
docker build . -f docker/rocm.Dockerfile --build-arg SGL_BRANCH=${{ github.ref_name }} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache
docker push rocm/sgl-dev:${tag}-${{ env.DATE }}
- name: Login to Docker Hub (lmsys)
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Push to lmsysorg/sglang-rocm
run: |
docker tag rocm/sgl-dev:${{ env.IMAGE_TAG }} lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }}
docker push lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }}

@@ -0,0 +1,88 @@
name: Release Docker Images (AMD)
on:
push:
tags:
- 'v[0-9]+.*'
workflow_dispatch:
inputs:
version:
description: 'Version to build (without v prefix, e.g., 0.5.7)'
required: true
jobs:
publish:
if: github.repository == 'sgl-project/sglang'
runs-on: amd-docker-scale
environment: 'prod'
strategy:
matrix:
rocm_version: ['rocm700', 'rocm720']
gpu_arch: ['gfx942', 'gfx950']
build_type: ['all']
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Get version from tag
id: version
run: |
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
VERSION="${{ github.event.inputs.version }}"
else
# Extract version from tag (e.g., v0.5.7 -> 0.5.7)
VERSION="${GITHUB_REF_NAME#v}"
fi
# Validate version format
if [ -z "$VERSION" ]; then
echo "::error::Version is empty"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
exit 1
fi
echo "version=${VERSION}" >> $GITHUB_OUTPUT
- name: Build and Push
run: |
version=${{ steps.version.outputs.version }}
echo "Version: ${version}"
gpu_arch_suffix=""
if [ "${{ matrix.rocm_version }}" = "rocm700" ]; then
if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
rocm_tag="rocm700-mi30x"
elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
rocm_tag="rocm700-mi35x"
else
echo "Unsupported gfx arch"
exit 1
fi
elif [ "${{ matrix.rocm_version }}" = "rocm720" ]; then
gpu_arch_suffix="-${{ matrix.rocm_version }}"
if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
rocm_tag="rocm720-mi30x"
elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
rocm_tag="rocm720-mi35x"
else
echo "Unsupported gfx arch"
exit 1
fi
else
echo "Unsupported rocm version"
exit 1
fi
tag=v${version}-${rocm_tag}
# rocm.Dockerfile expects SGL_BRANCH with 'v' prefix for git tag checkout
docker build . -f docker/rocm.Dockerfile --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }}${gpu_arch_suffix} --build-arg SGL_BRANCH=v${version} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic -t lmsysorg/sglang:${tag} --no-cache
docker push lmsysorg/sglang:${tag}

@@ -0,0 +1,190 @@
name: Release CUDA 13 Framework Docker Images (Temporary)
# Temporary workflow to build only versioned cu13 framework images
# Can be deleted after use
on:
workflow_dispatch:
inputs:
version:
description: "Version to build (without v prefix, e.g., 0.5.8)"
required: true
jobs:
publish-x86:
if: github.repository == 'sgl-project/sglang'
runs-on: x64-docker-build-node
steps:
- name: Delete huge unnecessary tools folder
run: rm -rf /opt/hostedtoolcache
- name: Checkout repository
uses: actions/checkout@v4
- name: Free disk space
uses: jlumbroso/free-disk-space@main
with:
tool-cache: false
docker-images: false
android: true
dotnet: true
haskell: true
large-packages: true
swap-storage: false
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Validate version
id: version
run: |
VERSION="${{ github.event.inputs.version }}"
if [ -z "$VERSION" ]; then
echo "::error::Version is empty"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
exit 1
fi
echo "version=${VERSION}" >> $GITHUB_OUTPUT
- name: Build and Push AMD64 Framework (CUDA 13)
run: |
version=${{ steps.version.outputs.version }}
docker buildx build \
--target framework \
--platform linux/amd64 \
--output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=13.0.1 \
--build-arg BUILD_TYPE=all \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
--build-arg GRACE_BLACKWELL=0 \
--build-arg SGL_VERSION=${version} \
--metadata-file /tmp/metadata.json \
--no-cache \
.
DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata.json'))['containerimage.digest'])")
echo "Pushed digest: ${DIGEST}"
echo "${DIGEST}" > /tmp/digest-cu130-amd64-framework.txt
- name: Upload digest
uses: actions/upload-artifact@v4
with:
name: digest-cu130-amd64
path: /tmp/digest-cu130-amd64-framework.txt
retention-days: 1
publish-arm64:
if: github.repository == 'sgl-project/sglang'
runs-on: arm-docker-build-node
steps:
- name: Delete huge unnecessary tools folder
run: rm -rf /opt/hostedtoolcache
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Validate version
id: version
run: |
VERSION="${{ github.event.inputs.version }}"
if [ -z "$VERSION" ]; then
echo "::error::Version is empty"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
exit 1
fi
echo "version=${VERSION}" >> $GITHUB_OUTPUT
- name: Build and Push ARM64 Framework (CUDA 13)
run: |
version=${{ steps.version.outputs.version }}
docker buildx build \
--target framework \
--platform linux/arm64 \
--output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=13.0.1 \
--build-arg BUILD_TYPE=all \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
--build-arg GRACE_BLACKWELL=1 \
--build-arg SGL_VERSION=${version} \
--metadata-file /tmp/metadata.json \
--no-cache \
.
DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata.json'))['containerimage.digest'])")
echo "Pushed digest: ${DIGEST}"
echo "${DIGEST}" > /tmp/digest-cu130-arm64-framework.txt
- name: Upload digest
uses: actions/upload-artifact@v4
with:
name: digest-cu130-arm64
path: /tmp/digest-cu130-arm64-framework.txt
retention-days: 1
create-manifest:
runs-on: ubuntu-22.04
needs: [publish-x86, publish-arm64]
if: github.repository == 'sgl-project/sglang'
steps:
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Download amd64 digest
uses: actions/download-artifact@v4
with:
name: digest-cu130-amd64
path: /tmp/digests/amd64
- name: Download arm64 digest
uses: actions/download-artifact@v4
with:
name: digest-cu130-arm64
path: /tmp/digests/arm64
- name: Create multi-arch manifest
run: |
version=${{ github.event.inputs.version }}
AMD64_DIGEST=$(cat /tmp/digests/amd64/digest-cu130-amd64-framework.txt)
ARM64_DIGEST=$(cat /tmp/digests/arm64/digest-cu130-arm64-framework.txt)
# Create versioned CUDA 13 framework manifest
docker buildx imagetools create \
-t lmsysorg/sglang:v${version}-cu130 \
lmsysorg/sglang@${AMD64_DIGEST} \
lmsysorg/sglang@${ARM64_DIGEST}
# Create latest CUDA 13 framework manifest
docker buildx imagetools create \
-t lmsysorg/sglang:latest-cu130 \
lmsysorg/sglang@${AMD64_DIGEST} \
lmsysorg/sglang@${ARM64_DIGEST}
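
The "Validate version" step above can be tried outside CI. The following is a minimal sketch, assuming a hypothetical helper named `is_valid_version` that reuses the workflow's `grep -qE` pattern unchanged. Because the pattern is anchored only at the start, a value such as `0.5.8rc1` is accepted, while a value with a leading `v` is rejected.

```bash
# Sketch of the "Validate version" check.
# is_valid_version is a hypothetical helper name, not part of the workflow.
is_valid_version() {
  # Reject empty input first, as the workflow does.
  [ -n "$1" ] || return 1
  # Same pattern as the workflow: anchored at the start only.
  printf '%s\n' "$1" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'
}

# Illustrative inputs only.
for candidate in 0.5.8 0.5.8rc1 v0.5.8 0.5; do
  if is_valid_version "$candidate"; then
    echo "accepted: $candidate"
  else
    echo "rejected: $candidate"
  fi
done
```

With these inputs the loop accepts `0.5.8` and `0.5.8rc1` and rejects `v0.5.8` and `0.5`.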

@@ -0,0 +1,209 @@
name: Build and Push Development Docker Images
on:
workflow_dispatch:
inputs:
pr_number:
description: "PR number to build from (leave empty to use current branch)"
required: false
default: ""
tag:
description: "Custom tag suffix (overrides pr_number in tag). E.g. 'my-test' → dev-my-test, dev-cu13-my-test, etc."
required: false
default: ""
schedule:
- cron: "0 0 * * *"
concurrency:
group: release-docker-dev-${{ inputs.tag || inputs.pr_number || 'nightly' }}
cancel-in-progress: true
jobs:
build-dev:
if: ${{ github.repository == 'sgl-project/sglang' }}
runs-on: ${{ matrix.runner }}
strategy:
matrix:
include:
- runner: x64-docker-build-node
platform: linux/amd64
build_type: all
grace_blackwell: 0
arch_tag: x86
version: 12.9.1
- runner: arm-docker-build-node
platform: linux/arm64
build_type: all
grace_blackwell: 1
arch_tag: arm64
version: 12.9.1
- runner: x64-docker-build-node
platform: linux/amd64
build_type: all
grace_blackwell: 0
arch_tag: x86-cu13
version: 13.0.1
- runner: arm-docker-build-node
platform: linux/arm64
build_type: all
grace_blackwell: 1
arch_tag: arm64-cu13
version: 13.0.1
steps:
- name: Delete huge unnecessary tools folder
run: rm -rf /opt/hostedtoolcache
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || github.ref }}
- name: Free disk space
uses: jlumbroso/free-disk-space@main
with:
tool-cache: true
docker-images: true
android: true
dotnet: true
haskell: true
large-packages: true
swap-storage: true
- name: Prune Docker to reclaim disk space
run: |
docker buildx prune --filter "until=72h" -f
docker system prune -af --filter "until=72h"
docker volume prune -af
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Build and Push Dev Image
run: |
# Nightly (schedule) installs latest release; manual dispatch builds from checked-out source
if [ "${{ github.event_name }}" = "schedule" ]; then
SOURCE_ARG="--build-arg USE_LATEST_SGLANG=1"
else
SOURCE_ARG="--build-arg BRANCH_TYPE=local"
fi
docker buildx build \
--platform ${{ matrix.platform }} \
--output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
--target framework \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=${{ matrix.version }} \
--build-arg BUILD_TYPE=${{ matrix.build_type }} \
--build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \
--build-arg GRACE_BLACKWELL=${{ matrix.grace_blackwell }} \
${SOURCE_ARG} \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
--metadata-file /tmp/metadata.json \
--no-cache \
.
DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata.json'))['containerimage.digest'])")
echo "Pushed digest: ${DIGEST}"
echo "${DIGEST}" > /tmp/digest.txt
- name: Upload digest
uses: actions/upload-artifact@v4
with:
name: digest-${{ matrix.arch_tag }}
path: /tmp/digest.txt
retention-days: 1
create-manifests:
runs-on: ubuntu-22.04
needs: [build-dev]
if: ${{ github.repository == 'sgl-project/sglang' }}
strategy:
matrix:
variant:
- base: dev
x86: x86
arm64: arm64
- base: dev-cu13
x86: x86-cu13
arm64: arm64-cu13
steps:
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Download x86 digest
uses: actions/download-artifact@v4
with:
name: digest-${{ matrix.variant.x86 }}
path: /tmp/digests/x86
- name: Download arm64 digest
uses: actions/download-artifact@v4
with:
name: digest-${{ matrix.variant.arm64 }}
path: /tmp/digests/arm64
- name: Create multi-arch manifest
run: |
X86_DIGEST=$(cat /tmp/digests/x86/digest.txt)
ARM64_DIGEST=$(cat /tmp/digests/arm64/digest.txt)
SUFFIX=""
if [ -n "${{ inputs.tag }}" ]; then
SUFFIX="-${{ inputs.tag }}"
elif [ -n "${{ inputs.pr_number }}" ]; then
SUFFIX="-pr-${{ inputs.pr_number }}"
fi
TAG="${{ matrix.variant.base }}${SUFFIX}"
# For nightly (no suffix), also stamp a dated tag
EXTRA_TAG=""
if [ -z "${SUFFIX}" ]; then
SHORT_SHA="${{ github.sha }}"
EXTRA_TAG="-t lmsysorg/sglang:nightly-${TAG}-$(date +%Y%m%d)-${SHORT_SHA:0:8}"
fi
docker buildx imagetools create \
-t lmsysorg/sglang:${TAG} \
${EXTRA_TAG} \
lmsysorg/sglang@${X86_DIGEST} \
lmsysorg/sglang@${ARM64_DIGEST}
echo "✓ Published lmsysorg/sglang:${TAG}"
- name: Cleanup Old Nightly Builds
if: ${{ !inputs.tag && !inputs.pr_number }}
run: |
TOKEN=$(curl -s -H "Content-Type: application/json" \
-X POST -d '{"username": "${{ secrets.DOCKERHUB_USERNAME }}", "password": "${{ secrets.DOCKERHUB_TOKEN }}"}' \
https://hub.docker.com/v2/users/login/ | jq -r .token)
TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" \
"https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100")
TAGS=$(echo "$TAGS_RESPONSE" | jq -r \
'.results[] | select(.name | test("^nightly-${{ matrix.variant.base }}-[0-9]")) | "\(.last_updated)|\(.name)"' \
| sort -r | cut -d'|' -f2)
TAG_COUNT=$(echo "$TAGS" | wc -l)
if [ "$TAG_COUNT" -gt 14 ]; then
echo "Found $TAG_COUNT nightly builds, keeping only the 14 most recent"
TAGS_TO_DELETE=$(echo "$TAGS" | tail -n +15)
for tag in $TAGS_TO_DELETE; do
echo "Deleting tag: $tag"
curl -X DELETE -H "Authorization: JWT $TOKEN" \
"https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/$tag/"
done
else
echo "Only $TAG_COUNT nightly builds found, no cleanup needed"
fi
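
The tag naming and nightly retention logic above can be sketched as two small shell functions. `dev_tag` and `tags_to_delete` are hypothetical helper names, and the sample tag list is made up. The sketch assumes the same `last_updated|name` line format that the `jq` filter in the cleanup step produces.

```bash
# dev_tag BASE [TAG] [PR_NUMBER]
# Hypothetical helper mirroring the SUFFIX logic: a custom tag wins over a PR number.
dev_tag() (
  base=$1; custom=$2; pr=$3
  suffix=""
  if [ -n "$custom" ]; then
    suffix="-$custom"
  elif [ -n "$pr" ]; then
    suffix="-pr-$pr"
  fi
  printf '%s%s\n' "$base" "$suffix"
)

# tags_to_delete KEEP
# Reads "last_updated|name" lines on stdin, sorts newest first,
# and prints the names that fall beyond the newest KEEP entries.
tags_to_delete() (
  keep=$1
  sort -r | cut -d'|' -f2 | tail -n +"$((keep + 1))"
)

dev_tag dev my-test ""       # prints dev-my-test
dev_tag dev-cu13 "" 1234     # prints dev-cu13-pr-1234
dev_tag dev "" ""            # prints dev (the nightly case)

# Made-up sample data; with KEEP=2 only the oldest tag is printed.
printf '%s\n' \
  '2026-04-01T00:00:00Z|nightly-dev-20260401-aaaaaaaa' \
  '2026-04-03T00:00:00Z|nightly-dev-20260403-cccccccc' \
  '2026-04-02T00:00:00Z|nightly-dev-20260402-bbbbbbbb' |
  tags_to_delete 2           # prints nightly-dev-20260401-aaaaaaaa
```

One detail worth knowing when reading the cleanup step: `echo "$TAGS" | wc -l` reports 1 even when `$TAGS` is empty, which is harmless here because 1 is still below the threshold of 14.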

@@ -0,0 +1,39 @@
name: Release SGLang Model Gateway Docker Image
on:
push:
branches:
- main
paths:
- sgl-model-gateway/bindings/python/pyproject.toml
workflow_dispatch:
jobs:
publish:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-24.04
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Build and Push
run: |
version=$(cat sgl-model-gateway/bindings/python/src/sglang_router/version.py | cut -d'"' -f2)
tag=v${version}
docker buildx build . -f docker/gateway.Dockerfile \
--platform linux/amd64,linux/arm64 \
-t lmsysorg/sgl-model-gateway:${tag} \
-t lmsysorg/sgl-model-gateway:latest \
--push
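
The version lookup in the "Build and Push" step relies on `cut -d'"' -f2`, which returns the text between the first pair of double quotes on each line. A minimal sketch follows, assuming `version.py` holds a single double-quoted assignment. The file contents are not shown in this diff, so the sample line and the helper name `extract_version` are illustrative.

```bash
# extract_version FILE
# Hypothetical helper; it applies the same cut invocation as the workflow step.
extract_version() {
  cut -d'"' -f2 < "$1"
}

# Made-up stand-in for version.py (the real file's contents are an assumption).
sample=$(mktemp)
printf '__version__ = "0.3.2"\n' > "$sample"

version=$(extract_version "$sample")
echo "lmsysorg/sgl-model-gateway:v${version}"   # prints lmsysorg/sgl-model-gateway:v0.3.2

rm -f "$sample"
```

Note that `cut` passes through unchanged any line that contains no double quote, so this approach depends on the file containing nothing but the version assignment.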

@@ -0,0 +1,85 @@
name: Release Docker Images Nightly (NPU)
on:
pull_request:
branches:
- 'main'
paths:
- '.github/workflows/release-docker-npu-nightly.yml'
- 'docker/npu.Dockerfile'
workflow_dispatch:
schedule:
- cron: "0 16 * * *" # Runs daily at 16:00 UTC (00:00 Beijing time)
concurrency:
group: ${{ github.workflow }}-${{ github.sha }}
cancel-in-progress: true
jobs:
build:
runs-on: ubuntu-22.04-arm
strategy:
matrix:
cann_version: ["8.5.0"]
device_type: ["910b", "a3"]
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
- name: Setup Docker buildx
uses: docker/setup-buildx-action@v3
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
images: |
lmsysorg/sglang
# push with schedule event
# push with workflow_dispatch event
tags: |
type=ref,event=pr
type=ref,event=branch
type=schedule,pattern=main
flavor: |
latest=false
suffix=-cann${{ matrix.cann_version }}-${{ matrix.device_type }},onlatest=true
# Login against a Docker registry except on PR
# https://github.com/docker/login-action
- name: Log into docker hub
uses: docker/login-action@v3
if: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
# Enable Docker multi-architecture build environment
# Emulate non-native architectures
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
# Required for building and pushing multi-arch Docker images
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
# Build and push Docker image with Buildx (don't push on PR)
# https://github.com/docker/build-push-action
- name: Build and push Docker image
id: build-and-push
uses: docker/build-push-action@v6
with:
context: docker
file: docker/npu.Dockerfile
platforms: linux/arm64,linux/amd64
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags }}
push: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
provenance: false
build-args: |
SGLANG_KERNEL_NPU_TAG=2026.03.10.rc1
CANN_VERSION=${{ matrix.cann_version }}
DEVICE_TYPE=${{ matrix.device_type }}
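
Each matrix combination above gets its own `flavor` suffix. The loop below is only an illustration of how the `cann_version` and `device_type` values expand into suffix strings. The values are copied from the matrix, while `list_suffixes` is a hypothetical helper name. It does not model how `docker/metadata-action` composes the final tag.

```bash
# Values copied from the workflow matrix above.
cann_versions="8.5.0"
device_types="910b a3"

# list_suffixes: hypothetical helper that prints one suffix per matrix combination.
list_suffixes() {
  for cann in $cann_versions; do
    for device in $device_types; do
      printf '%s\n' "-cann${cann}-${device}"
    done
  done
}

list_suffixes
# prints:
# -cann8.5.0-910b
# -cann8.5.0-a3
```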

@@ -0,0 +1,93 @@
name: Release Docker Images (NPU)
on:
push:
tags:
- 'v[0-9]+.*'
workflow_dispatch:
inputs:
version:
description: 'Version to build (without v prefix, e.g., 0.5.7)'
required: true
jobs:
build:
runs-on: ubuntu-22.04-arm
strategy:
matrix:
cann_version: ["8.5.0"]
device_type: ["910b", "a3"]
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Free up disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
with:
tool-cache: true
docker-images: false
# push with tag
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
images: |
lmsysorg/sglang
tags: |
type=ref,event=pr
flavor: |
latest=false
# Login against a Docker registry except on PR
# https://github.com/docker/login-action
- name: Login to Docker Hub
uses: docker/login-action@v2
if: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Get version from tag
id: version
run: |
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
VERSION="${{ github.event.inputs.version }}"
else
# Extract version from tag (e.g., v0.5.7 -> 0.5.7)
VERSION="${GITHUB_REF_NAME#v}"
fi
# Validate version format
if [ -z "$VERSION" ]; then
echo "::error::Version is empty"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
exit 1
fi
echo "version=v${VERSION}" >> $GITHUB_OUTPUT
echo "TAG=lmsysorg/sglang:v${VERSION}-cann${{ matrix.cann_version }}-${{ matrix.device_type }}" >> $GITHUB_OUTPUT
# Enable Docker multi-architecture build environment
# Emulate non-native architectures
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
# Required for building and pushing multi-arch Docker images
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build and push Docker image
id: build-and-push
uses: docker/build-push-action@v6
with:
context: docker
file: docker/npu.Dockerfile
platforms: linux/arm64,linux/amd64
labels: ${{ steps.meta.outputs.labels }}
tags: ${{ steps.meta.outputs.tags || steps.version.outputs.TAG }}
push: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
provenance: false
build-args: |
SGLANG_KERNEL_NPU_TAG=2026.03.10.rc1
CANN_VERSION=${{ matrix.cann_version }}
DEVICE_TYPE=${{ matrix.device_type }}
SGLANG_TAG=${{ steps.version.outputs.version }}
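
The "Get version from tag" step above picks the version from either the manual input or the pushed tag name, then builds the image tag from it. A minimal sketch of that branch, with hypothetical helper names `resolve_version` and `npu_image_tag`:

```bash
# resolve_version EVENT INPUT_VERSION REF_NAME
# Hypothetical helper: manual runs use the input, tag pushes strip one leading "v".
resolve_version() (
  if [ "$1" = "workflow_dispatch" ]; then
    printf '%s\n' "$2"
  else
    ref_name=$3
    printf '%s\n' "${ref_name#v}"
  fi
)

# npu_image_tag VERSION CANN_VERSION DEVICE_TYPE
# Same string layout as the TAG value written to $GITHUB_OUTPUT.
npu_image_tag() {
  printf 'lmsysorg/sglang:v%s-cann%s-%s\n' "$1" "$2" "$3"
}

version=$(resolve_version push "" v0.5.7)
npu_image_tag "$version" 8.5.0 910b   # prints lmsysorg/sglang:v0.5.7-cann8.5.0-910b
```

The `${ref_name#v}` expansion removes only a single leading `v`, so a tag such as `v0.5.7` becomes `0.5.7` before the format check runs.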

@@ -0,0 +1,309 @@
name: Release Docker Runtime Images
#
# This workflow builds and publishes runtime Docker images (production-optimized, ~50% smaller):
# - lmsysorg/sglang:v{version}-runtime, lmsysorg/sglang:latest-runtime
# - lmsysorg/sglang:v{version}-cu130-runtime, lmsysorg/sglang:latest-cu130-runtime
#
on:
push:
tags:
- "v[0-9]+.*"
workflow_dispatch:
inputs:
version:
description: "Version to build (without v prefix, e.g., 0.5.7)"
required: true
jobs:
publish-x86:
if: github.repository == 'sgl-project/sglang'
environment: "prod"
strategy:
matrix:
variant:
- cuda_version: "12.9.1"
build_type: "all"
grace_blackwell: 0
runs-on: x64-docker-build-node
steps:
- name: Delete huge unnecessary tools folder
run: rm -rf /opt/hostedtoolcache
- name: Checkout repository
uses: actions/checkout@v4
- name: Free disk space
uses: jlumbroso/free-disk-space@main
with:
tool-cache: false
docker-images: false
android: true
dotnet: true
haskell: true
large-packages: true
swap-storage: false
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Get version from tag
id: version
run: |
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
VERSION="${{ github.event.inputs.version }}"
else
# Extract version from tag (e.g., v0.5.7 -> 0.5.7)
VERSION="${GITHUB_REF_NAME#v}"
fi
# Validate version format
if [ -z "$VERSION" ]; then
echo "::error::Version is empty"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
exit 1
fi
echo "version=${VERSION}" >> $GITHUB_OUTPUT
- name: Build and Push AMD64 Runtime
run: |
version=${{ steps.version.outputs.version }}
docker buildx build \
--target runtime \
--platform linux/amd64 \
--output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \
--build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
--build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
--build-arg SGL_VERSION=${version} \
--metadata-file /tmp/metadata-cu129-runtime.json \
--no-cache \
.
DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu129-runtime.json'))['containerimage.digest'])")
echo "Pushed digest: ${DIGEST}"
echo "${DIGEST}" > /tmp/digest-cu129-amd64-runtime.txt
- name: Build and Push AMD64 Runtime (CUDA 13)
run: |
version=${{ steps.version.outputs.version }}
docker buildx build \
--target runtime \
--platform linux/amd64 \
--output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=13.0.1 \
--build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
--build-arg GRACE_BLACKWELL=0 \
--build-arg SGL_VERSION=${version} \
--metadata-file /tmp/metadata-cu130-runtime.json \
--no-cache \
.
DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu130-runtime.json'))['containerimage.digest'])")
echo "Pushed digest: ${DIGEST}"
echo "${DIGEST}" > /tmp/digest-cu130-amd64-runtime.txt
- name: Upload digests
uses: actions/upload-artifact@v4
with:
name: digests-amd64
path: /tmp/digest-*.txt
retention-days: 1
publish-arm64:
if: github.repository == 'sgl-project/sglang'
environment: "prod"
strategy:
matrix:
variant:
- cuda_version: "12.9.1"
build_type: "all"
grace_blackwell: 1
runs-on: arm-docker-build-node
steps:
- name: Delete huge unnecessary tools folder
run: rm -rf /opt/hostedtoolcache
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Get version from tag
id: version
run: |
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
VERSION="${{ github.event.inputs.version }}"
else
# Extract version from tag (e.g., v0.5.7 -> 0.5.7)
VERSION="${GITHUB_REF_NAME#v}"
fi
# Validate version format
if [ -z "$VERSION" ]; then
echo "::error::Version is empty"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
exit 1
fi
echo "version=${VERSION}" >> $GITHUB_OUTPUT
- name: Build and Push ARM64 Runtime
run: |
version=${{ steps.version.outputs.version }}
docker buildx build \
--target runtime \
--platform linux/arm64 \
--output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \
--build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
--build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
--build-arg SGL_VERSION=${version} \
--metadata-file /tmp/metadata-cu129-runtime.json \
--no-cache \
.
DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu129-runtime.json'))['containerimage.digest'])")
echo "Pushed digest: ${DIGEST}"
echo "${DIGEST}" > /tmp/digest-cu129-arm64-runtime.txt
- name: Build and Push ARM64 Runtime (CUDA 13)
run: |
version=${{ steps.version.outputs.version }}
docker buildx build \
--target runtime \
--platform linux/arm64 \
--output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=13.0.1 \
--build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
--build-arg GRACE_BLACKWELL=1 \
--build-arg SGL_VERSION=${version} \
--metadata-file /tmp/metadata-cu130-runtime.json \
--no-cache \
.
DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu130-runtime.json'))['containerimage.digest'])")
echo "Pushed digest: ${DIGEST}"
echo "${DIGEST}" > /tmp/digest-cu130-arm64-runtime.txt
- name: Upload digests
uses: actions/upload-artifact@v4
with:
name: digests-arm64
path: /tmp/digest-*.txt
retention-days: 1
create-manifests:
runs-on: ubuntu-22.04
needs: [publish-x86, publish-arm64]
if: github.repository == 'sgl-project/sglang'
environment: "prod"
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Get version from tag
id: version
run: |
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
VERSION="${{ github.event.inputs.version }}"
else
# Extract version from tag (e.g., v0.5.7 -> 0.5.7)
VERSION="${GITHUB_REF_NAME#v}"
fi
# Validate version format
if [ -z "$VERSION" ]; then
echo "::error::Version is empty"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
exit 1
fi
echo "version=${VERSION}" >> $GITHUB_OUTPUT
- name: Download amd64 digests
uses: actions/download-artifact@v4
with:
name: digests-amd64
path: /tmp/digests/amd64
- name: Download arm64 digests
uses: actions/download-artifact@v4
with:
name: digests-arm64
path: /tmp/digests/arm64
- name: Create multi-arch manifests
run: |
version=${{ steps.version.outputs.version }}
CU129_AMD64_RT=$(cat /tmp/digests/amd64/digest-cu129-amd64-runtime.txt)
CU130_AMD64_RT=$(cat /tmp/digests/amd64/digest-cu130-amd64-runtime.txt)
CU129_ARM64_RT=$(cat /tmp/digests/arm64/digest-cu129-arm64-runtime.txt)
CU130_ARM64_RT=$(cat /tmp/digests/arm64/digest-cu130-arm64-runtime.txt)
# Create versioned runtime manifest
docker buildx imagetools create \
-t lmsysorg/sglang:v${version}-runtime \
lmsysorg/sglang@${CU129_AMD64_RT} \
lmsysorg/sglang@${CU129_ARM64_RT}
# Create latest runtime manifest
docker buildx imagetools create \
-t lmsysorg/sglang:latest-runtime \
lmsysorg/sglang@${CU129_AMD64_RT} \
lmsysorg/sglang@${CU129_ARM64_RT}
# Create versioned CUDA 13 runtime manifest
docker buildx imagetools create \
-t lmsysorg/sglang:v${version}-cu130-runtime \
lmsysorg/sglang@${CU130_AMD64_RT} \
lmsysorg/sglang@${CU130_ARM64_RT}
# Create latest CUDA 13 runtime manifest
docker buildx imagetools create \
-t lmsysorg/sglang:latest-cu130-runtime \
lmsysorg/sglang@${CU130_AMD64_RT} \
lmsysorg/sglang@${CU130_ARM64_RT}
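
The four runtime manifests above pair one amd64 digest with one arm64 digest per CUDA version. The dry-run sketch below prints the equivalent commands instead of running them, which can help when checking which digest ends up under which tag. `print_manifest_plan` is a hypothetical helper and the digests in the usage line are placeholders.

```bash
# print_manifest_plan VERSION CU129_AMD64 CU129_ARM64 CU130_AMD64 CU130_ARM64
# Hypothetical dry-run helper: it only prints commands and never calls docker.
print_manifest_plan() (
  repo="lmsysorg/sglang"
  version=$1
  # emit TAG AMD64_DIGEST ARM64_DIGEST
  emit() {
    echo "docker buildx imagetools create -t ${repo}:$1 ${repo}@$2 ${repo}@$3"
  }
  emit "v${version}-runtime"       "$2" "$3"
  emit "latest-runtime"            "$2" "$3"
  emit "v${version}-cu130-runtime" "$4" "$5"
  emit "latest-cu130-runtime"      "$4" "$5"
)

# Placeholder digests, for illustration only.
print_manifest_plan 0.5.10 sha256:aaaa sha256:bbbb sha256:cccc sha256:dddd
```

The versioned and `latest` tags for a given CUDA version point at the same pair of digests, so both resolve to identical images.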

@@ -0,0 +1,62 @@
name: Release Docker Xeon Images
on:
push:
tags:
- 'v[0-9]+.*'
workflow_dispatch:
inputs:
version:
description: 'Version to build (without v prefix, e.g., 0.5.7)'
required: true
jobs:
publish:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-24.04
environment: 'prod'
strategy:
matrix:
build_type: ['all']
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Get version from tag
id: version
run: |
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
VERSION="${{ github.event.inputs.version }}"
else
# Extract version from tag (e.g., v0.5.7 -> 0.5.7)
VERSION="${GITHUB_REF_NAME#v}"
fi
# Validate version format
if [ -z "$VERSION" ]; then
echo "::error::Version is empty"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
exit 1
fi
echo "version=${VERSION}" >> $GITHUB_OUTPUT
- name: Build and Push
run: |
version=${{ steps.version.outputs.version }}
tag=v${version}-xeon
docker build . -f docker/xeon.Dockerfile \
--build-arg VER_SGLANG=v${version} \
-t lmsysorg/sglang:${tag} \
--no-cache
docker push lmsysorg/sglang:${tag}

@@ -0,0 +1,294 @@
name: Release Docker Images
#
# This workflow builds and publishes framework Docker images (full development environment):
# - lmsysorg/sglang:v{version}, lmsysorg/sglang:latest
# - lmsysorg/sglang:v{version}-cu130, lmsysorg/sglang:latest-cu130
#
on:
push:
tags:
- "v[0-9]+.*"
workflow_dispatch:
inputs:
version:
description: "Version to build (without v prefix, e.g., 0.5.7)"
required: true
jobs:
publish-x86:
if: github.repository == 'sgl-project/sglang'
environment: "prod"
outputs:
digest-cu129: ${{ steps.build-cu129.outputs.digest }}
digest-cu130: ${{ steps.build-cu130.outputs.digest }}
strategy:
matrix:
variant:
- cuda_version: "12.9.1"
build_type: "all"
grace_blackwell: 0
runs-on: x64-docker-build-node
steps:
- name: Delete huge unnecessary tools folder
run: rm -rf /opt/hostedtoolcache
- name: Checkout repository
uses: actions/checkout@v4
- name: Free disk space
uses: jlumbroso/free-disk-space@main
with:
tool-cache: false
docker-images: false
android: true
dotnet: true
haskell: true
large-packages: true
swap-storage: false
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Get version from tag
id: version
run: |
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
VERSION="${{ github.event.inputs.version }}"
else
# Extract version from tag (e.g., v0.5.7 -> 0.5.7)
VERSION="${GITHUB_REF_NAME#v}"
fi
# Validate version format
if [ -z "$VERSION" ]; then
echo "::error::Version is empty"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
exit 1
fi
echo "version=${VERSION}" >> $GITHUB_OUTPUT
- name: Build and Push AMD64 Framework
id: build-cu129
run: |
version=${{ steps.version.outputs.version }}
docker buildx build \
--target framework \
--platform linux/amd64 \
--output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \
--build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
--build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
--build-arg SGL_VERSION=${version} \
--metadata-file /tmp/metadata-cu129-framework.json \
--no-cache \
.
DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu129-framework.json'))['containerimage.digest'])")
echo "Pushed digest: ${DIGEST}"
echo "digest=${DIGEST}" >> $GITHUB_OUTPUT
- name: Build and Push AMD64 Framework (CUDA 13)
id: build-cu130
run: |
version=${{ steps.version.outputs.version }}
docker buildx build \
--target framework \
--platform linux/amd64 \
--output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=13.0.1 \
--build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
--build-arg GRACE_BLACKWELL=0 \
--build-arg SGL_VERSION=${version} \
--metadata-file /tmp/metadata-cu130-framework.json \
--no-cache \
.
DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu130-framework.json'))['containerimage.digest'])")
echo "Pushed digest: ${DIGEST}"
echo "digest=${DIGEST}" >> $GITHUB_OUTPUT
publish-arm64:
if: github.repository == 'sgl-project/sglang'
environment: "prod"
outputs:
digest-cu129: ${{ steps.build-cu129.outputs.digest }}
digest-cu130: ${{ steps.build-cu130.outputs.digest }}
strategy:
matrix:
variant:
- cuda_version: "12.9.1"
build_type: "all"
grace_blackwell: 1
runs-on: arm-docker-build-node
steps:
- name: Delete huge unnecessary tools folder
run: rm -rf /opt/hostedtoolcache
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Get version from tag
id: version
run: |
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
VERSION="${{ github.event.inputs.version }}"
else
# Extract version from tag (e.g., v0.5.7 -> 0.5.7)
VERSION="${GITHUB_REF_NAME#v}"
fi
# Validate version format
if [ -z "$VERSION" ]; then
echo "::error::Version is empty"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
exit 1
fi
echo "version=${VERSION}" >> $GITHUB_OUTPUT
- name: Build and Push ARM64 Framework
id: build-cu129
run: |
version=${{ steps.version.outputs.version }}
docker buildx build \
--target framework \
--platform linux/arm64 \
--output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \
--build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
--build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
--build-arg SGL_VERSION=${version} \
--metadata-file /tmp/metadata-cu129-framework.json \
--no-cache \
.
DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu129-framework.json'))['containerimage.digest'])")
echo "Pushed digest: ${DIGEST}"
echo "digest=${DIGEST}" >> $GITHUB_OUTPUT
- name: Build and Push ARM64 Framework (CUDA 13)
id: build-cu130
run: |
version=${{ steps.version.outputs.version }}
docker buildx build \
--target framework \
--platform linux/arm64 \
--output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=13.0.1 \
--build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
--build-arg GRACE_BLACKWELL=1 \
--build-arg SGL_VERSION=${version} \
--metadata-file /tmp/metadata-cu130-framework.json \
--no-cache \
.
DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu130-framework.json'))['containerimage.digest'])")
echo "Pushed digest: ${DIGEST}"
echo "digest=${DIGEST}" >> $GITHUB_OUTPUT
create-manifests:
runs-on: ubuntu-22.04
needs: [publish-x86, publish-arm64]
if: github.repository == 'sgl-project/sglang'
environment: "prod"
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Get version from tag
id: version
run: |
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
VERSION="${{ github.event.inputs.version }}"
else
# Extract version from tag (e.g., v0.5.7 -> 0.5.7)
VERSION="${GITHUB_REF_NAME#v}"
fi
# Validate version format
if [ -z "$VERSION" ]; then
echo "::error::Version is empty"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
exit 1
fi
echo "version=${VERSION}" >> $GITHUB_OUTPUT
- name: Create multi-arch manifests
run: |
version=${{ steps.version.outputs.version }}
CU129_AMD64_FW=${{ needs.publish-x86.outputs.digest-cu129 }}
CU130_AMD64_FW=${{ needs.publish-x86.outputs.digest-cu130 }}
CU129_ARM64_FW=${{ needs.publish-arm64.outputs.digest-cu129 }}
CU130_ARM64_FW=${{ needs.publish-arm64.outputs.digest-cu130 }}
# Create versioned framework manifest (default)
docker buildx imagetools create \
-t lmsysorg/sglang:v${version} \
lmsysorg/sglang@${CU129_AMD64_FW} \
lmsysorg/sglang@${CU129_ARM64_FW}
# Create latest framework manifest (default)
docker buildx imagetools create \
-t lmsysorg/sglang:latest \
lmsysorg/sglang@${CU129_AMD64_FW} \
lmsysorg/sglang@${CU129_ARM64_FW}
# Create versioned CUDA 13 framework manifest
docker buildx imagetools create \
-t lmsysorg/sglang:v${version}-cu130 \
lmsysorg/sglang@${CU130_AMD64_FW} \
lmsysorg/sglang@${CU130_ARM64_FW}
# Create latest CUDA 13 framework manifest
docker buildx imagetools create \
-t lmsysorg/sglang:latest-cu130 \
lmsysorg/sglang@${CU130_AMD64_FW} \
lmsysorg/sglang@${CU130_ARM64_FW}

@@ -0,0 +1,89 @@
name: Release Documentation
on:
release:
types: [published]
push:
branches:
- main
paths:
- "docs/**"
- "python/sglang/version.py"
- "python/sglang/**"
workflow_dispatch:
concurrency:
group: release-docs-${{ github.ref }}
cancel-in-progress: true
env:
SGLANG_IS_IN_CI: true
jobs:
execute-and-deploy:
runs-on: 1-gpu-h100
if: github.repository == 'sgl-project/sglang'
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Fetch full git history for release index
if: github.event_name == 'release'
run: |
git fetch --prune --unshallow || git fetch --prune
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
pip install -r docs/requirements.txt
apt-get update && apt-get install -y pandoc parallel retry
ln -sf "$(which python3)" /usr/bin/python
- name: Setup Jupyter Kernel
run: |
python -m ipykernel install --user --name python3 --display-name "Python 3"
- name: Execute notebooks
timeout-minutes: 40
run: |
cd docs
make clean
make compile
- name: Push HTML to sgl-project.github.io
timeout-minutes: 30
env:
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_DOCUMENTATION }}
run: |
cd docs
make html
make markdown
python3 wrap_run_llm.py
if [[ "${{ github.event_name }}" == "release" ]]; then
python3 release_lookup/generate_index.py --output release_lookup/release_index.json
# Copy release lookup tool for official docs on published releases.
mkdir -p _build/html/release_lookup
cp release_lookup/index.html _build/html/release_lookup/
cp release_lookup/release_index.json _build/html/release_lookup/
fi
cd _build/html
git clone https://$GITHUB_TOKEN@github.com/sgl-project/sgl-project.github.io.git ../sgl-project.github.io --depth 1
if [[ "${{ github.event_name }}" == "release" ]]; then
find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete
else
find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -path "../sgl-project.github.io/release_lookup*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete
fi
cp -r * ../sgl-project.github.io
cp ../../README.md ../sgl-project.github.io/README.md
cd ../sgl-project.github.io
git config user.name "sglang-bot"
git config user.email "sglangbot@gmail.com"
git add .
git commit -m "Update $(date +'%Y-%m-%d %H:%M:%S')"
git push https://$GITHUB_TOKEN@github.com/sgl-project/sgl-project.github.io.git main
cd ..
rm -rf sgl-project.github.io

@@ -0,0 +1,167 @@
name: Release SGLang Model Gateway to PyPI
on:
push:
branches:
- main
paths:
- sgl-model-gateway/bindings/python/pyproject.toml
workflow_dispatch:
jobs:
build:
name: build on ${{ matrix.platform || matrix.os }} (${{ matrix.target }} - ${{ matrix.manylinux || 'auto' }})
runs-on: ${{ matrix.os }}-latest
strategy:
fail-fast: false
matrix:
os: [ubuntu, macos, windows]
target: [x86_64, aarch64]
manylinux: [auto]
include:
- os: ubuntu
platform: linux
- os: windows
ls: dir
target: x86_64
python-architecture: x64
interpreter: 3.9 3.10 3.11 3.12 3.13
- os: macos
target: aarch64
interpreter: 3.9 3.10 3.11 3.12 3.13
- os: ubuntu
platform: linux
target: aarch64
# musllinux
- os: ubuntu
platform: linux
target: x86_64
manylinux: musllinux_1_1
- os: ubuntu
platform: linux
target: aarch64
manylinux: musllinux_1_1
exclude:
- os: windows
target: aarch64
steps:
- uses: actions/checkout@v4
with:
path: sglang-repo
- name: Move sgl-model-gateway folder to root and delete sglang-repo
run: |
mv sglang-repo/sgl-model-gateway/* .
rm -rf sglang-repo
ls -alt
shell: bash
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.13"
architecture: ${{ matrix.python-architecture || 'x64' }}
- name: Install twine
run: pip install -U twine
- name: Install protoc (macOS)
if: matrix.os == 'macos'
run: brew install protobuf
- name: Install protoc (Windows)
if: matrix.os == 'windows'
run: choco install protoc -y
- name: Build wheels
uses: PyO3/maturin-action@v1
with:
working-directory: bindings/python
target: ${{ matrix.target }}
manylinux: ${{ matrix.manylinux || 'auto' }}
args: --release --out dist --features vendored-openssl --interpreter ${{ matrix.interpreter || '3.9 3.10 3.11 3.12 3.13 3.14' }}
rust-toolchain: stable
docker-options: -e CI -e CC_aarch64_unknown_linux_gnu=aarch64-linux-gnu-gcc -e CXX_aarch64_unknown_linux_gnu=aarch64-linux-gnu-g++
before-script-linux: |
# Install build dependencies (perl/make for vendored OpenSSL, protoc for gRPC)
if command -v yum &> /dev/null; then
yum update -y && yum install -y wget unzip gcc gcc-c++ perl-core make
# Install cross-compilation toolchain for aarch64 if needed
if [ "${{ matrix.target }}" = "aarch64" ]; then
yum install -y gcc-aarch64-linux-gnu gcc-c++-aarch64-linux-gnu || true
fi
elif command -v apt-get &> /dev/null; then
apt-get update && apt-get install -y wget unzip gcc g++ perl make
# Install cross-compilation toolchain for aarch64 if needed
if [ "${{ matrix.target }}" = "aarch64" ]; then
apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu || true
fi
fi
(cd /tmp && \
wget https://github.com/protocolbuffers/protobuf/releases/download/v32.0/protoc-32.0-linux-x86_64.zip && \
unzip protoc-32.0-linux-x86_64.zip -d /usr/local && \
rm protoc-32.0-linux-x86_64.zip)
protoc --version
- name: List built packages
run: ${{ matrix.ls || 'ls -lh' }} bindings/python/dist/
- name: Check packages
run: twine check --strict bindings/python/dist/*
- uses: actions/upload-artifact@v4
with:
name: packages-${{ matrix.os }}-${{ matrix.target }}-${{ matrix.manylinux || 'auto' }}
path: bindings/python/dist/
build-sdist:
name: Build SDist
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
path: sglang-repo
- name: Move sgl-model-gateway folder to root and delete sglang-repo
run: |
mv sglang-repo/sgl-model-gateway/* .
rm -rf sglang-repo
ls -alt
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.13"
- name: Build SDist
uses: PyO3/maturin-action@v1
with:
working-directory: bindings/python
command: sdist
args: --out dist
rust-toolchain: stable
- uses: actions/upload-artifact@v4
with:
name: sdist
path: bindings/python/dist/*.tar.gz
upload:
name: Upload to PyPI
if: github.repository == 'sgl-project/sglang' # Ensure this job only runs for the sgl-project/sglang repository
needs: [build, build-sdist]
runs-on: ubuntu-latest
steps:
- uses: actions/download-artifact@v4
with:
path: dist
merge-multiple: true
- name: Upload to PyPI
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN_ROUTER }}
run: |
pip install twine
twine upload dist/* --verbose

@@ -0,0 +1,169 @@
name: Release PyPI Nightly Wheels
on:
# Run daily at 2 AM UTC
schedule:
- cron: '0 2 * * *'
# Triggered by nightly Docker workflow to use same commit
repository_dispatch:
types: [nightly-release]
# Manual trigger for testing
workflow_dispatch:
inputs:
commit_sha:
description: 'Specific commit SHA to build (leave empty for latest)'
required: false
type: string
cuda_version:
description: 'CUDA version (e.g., 129 or 130)'
required: false
default: '129'
type: string
concurrency:
group: release-pypi-nightly-${{ github.ref }}
cancel-in-progress: true
jobs:
build-nightly-wheel:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-latest
outputs:
nightly_version: ${{ steps.build.outputs.nightly_version }}
commit_hash: ${{ steps.build.outputs.commit_hash }}
build_date: ${{ steps.build.outputs.build_date }}
steps:
- uses: actions/checkout@v4
with:
# Use commit from: 1) Docker workflow, 2) manual input, 3) latest main
ref: ${{ github.event.client_payload.commit_sha || inputs.commit_sha || github.sha }}
fetch-depth: 0 # Need full history for setuptools-scm
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Install build dependencies
run: |
pip install build wheel setuptools setuptools-scm
- name: Build wheel
id: build
run: |
cd python
cp ../README.md ../LICENSE .
# Parse git describe output to get latest tag
# Use same command as pyproject.toml to ensure version consistency
DESC=$(git tag --list --sort=-version:refname 'v*.*.*' | head -1 | xargs git describe --tags --long 2>/dev/null || echo 'v0.0.0-0-g0000000')
TAG=$(echo "$DESC" | cut -d- -f1)
HASH="g$(git rev-parse --short HEAD)"
BUILD_DATE=$(date -u +%Y%m%d)
# Increment patch version for nightlies (e.g., v0.5.9 -> 0.5.10)
# Must always increment so nightly > latest tag per PEP 440 ordering:
# X.Y.Z.devN < X.Y.Z.rcN < X.Y.Z < X.Y.(Z+1).devN
VERSION=${TAG#v} # Remove 'v' prefix
MAJOR=$(echo "$VERSION" | cut -d. -f1)
MINOR=$(echo "$VERSION" | cut -d. -f2)
PATCH_RAW=$(echo "$VERSION" | cut -d. -f3)
# Strip pre-/post-release suffixes (rc0, post1, etc.) to get the numeric patch
PATCH=$(echo "$PATCH_RAW" | sed 's/[^0-9].*//')
NEXT_PATCH=$((PATCH + 1))
NEXT_VERSION="${MAJOR}.${MINOR}.${NEXT_PATCH}"
# Use date-based dev number for correct chronological sorting
# e.g., 0.5.9.dev20260215+g4cf4f0859 > 0.5.9.dev20260214+g45a4697d4
FORCE_VERSION="${NEXT_VERSION}.dev${BUILD_DATE}+${HASH}"
echo "Forcing nightly version to: $FORCE_VERSION"
export SETUPTOOLS_SCM_PRETEND_VERSION="$FORCE_VERSION"
# Build wheel
python3 -m build --wheel
# Extract version from built wheel filename
WHEEL_FILE=$(ls dist/*.whl)
NIGHTLY_VERSION=$(echo "$WHEEL_FILE" | sed 's/.*sglang-\(.*\)-py3.*/\1/')
# Get commit info
COMMIT_HASH=$(git rev-parse --short HEAD)
BUILD_DATE=$(date -u +%Y-%m-%d)
echo "Built wheel: $WHEEL_FILE"
echo "Nightly version: ${NIGHTLY_VERSION}"
echo "Commit: ${COMMIT_HASH}"
echo "Build date: ${BUILD_DATE}"
echo "nightly_version=${NIGHTLY_VERSION}" >> $GITHUB_OUTPUT
echo "commit_hash=${COMMIT_HASH}" >> $GITHUB_OUTPUT
echo "build_date=${BUILD_DATE}" >> $GITHUB_OUTPUT
- name: Upload wheel artifact
uses: actions/upload-artifact@v4
with:
name: nightly-wheel
path: python/dist/*.whl
retention-days: 7
release-nightly:
needs: build-nightly-wheel
runs-on: ubuntu-latest
environment: 'prod'
steps:
- uses: actions/checkout@v4
- name: Download wheel artifact
uses: actions/download-artifact@v4
with:
name: nightly-wheel
path: dist/
- name: List downloaded wheels
run: |
echo "Downloaded wheel:"
ls -lh dist/
- name: Create GitHub Release for nightly wheel
uses: softprops/action-gh-release@v2
with:
tag_name: nightly-${{ needs.build-nightly-wheel.outputs.build_date }}-${{ needs.build-nightly-wheel.outputs.commit_hash }}
name: Nightly Build ${{ needs.build-nightly-wheel.outputs.build_date }} (${{ needs.build-nightly-wheel.outputs.commit_hash }})
repository: sgl-project/whl
token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
prerelease: true
body: |
Nightly build from commit ${{ needs.build-nightly-wheel.outputs.commit_hash }}
Build date: ${{ needs.build-nightly-wheel.outputs.build_date }}
Version: ${{ needs.build-nightly-wheel.outputs.nightly_version }}
files: |
dist/*.whl
- name: Clone wheel index repository
run: |
git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
cd sgl-whl
git config --local user.name "sglang-bot"
git config --local user.email "sglangbot@gmail.com"
env:
WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Update wheel index
run: |
python3 scripts/update_nightly_whl_index.py \
--commit-hash ${{ needs.build-nightly-wheel.outputs.commit_hash }} \
--nightly-version ${{ needs.build-nightly-wheel.outputs.nightly_version }} \
--cuda-version ${{ inputs.cuda_version || '129' }} \
--build-date ${{ needs.build-nightly-wheel.outputs.build_date }}
- name: Push wheel index
run: |
cd sgl-whl
git add -A
git diff --staged --quiet || git commit -m "Update nightly wheel index for commit ${{ needs.build-nightly-wheel.outputs.commit_hash }}"
git push

@@ -0,0 +1,183 @@
name: Release PyPI PR Wheels
on:
workflow_dispatch:
inputs:
pr_number:
description: 'PR number to build wheel for (works with both internal and fork PRs)'
required: true
type: string
concurrency:
group: build-pr-wheel-${{ github.event.inputs.pr_number }}
cancel-in-progress: true
jobs:
build-pr-wheel:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-latest
outputs:
wheel_version: ${{ steps.gen_version.outputs.wheel_version }}
commit_hash: ${{ steps.gen_version.outputs.commit_hash }}
build_date: ${{ steps.gen_version.outputs.build_date }}
steps:
- uses: actions/checkout@v4
with:
ref: refs/pull/${{ inputs.pr_number }}/head
fetch-depth: 0 # Need full history for version generation
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Generate PR wheel version
id: gen_version
run: |
# Get base version from the latest v*.*.* git tag directly
# Note: We cannot use setuptools_scm here because the [tool.setuptools_scm]
# config (with custom git_describe_command) lives in python/pyproject.toml,
# not at the repo root. Without that config, setuptools_scm falls back to
# default git describe which finds gateway-* tags instead of v*.*.* release tags.
LATEST_TAG=$(git tag --list --sort=-version:refname 'v*.*.*' | head -1)
BASE_VERSION=${LATEST_TAG#v}
echo "Latest release tag: ${LATEST_TAG}"
# Get commit info
COMMIT_HASH=$(git rev-parse --short HEAD)
COMMIT_COUNT=$(git rev-list --count HEAD)
# Get current date in YYYY-MM-DD format
BUILD_DATE=$(date -u +%Y-%m-%d)
# Always use pr-{number} format for suffix
SUFFIX="pr-${{ inputs.pr_number }}"
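# Generate the PR wheel version as a PEP 440 dev release with a local version label
# Format: {base_version}.dev{commit_count}+pr-{number}.g{commit_hash}
# (PEP 440 normalizes the '-' in the local label to '.', i.e. pr.{number})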
WHEEL_VERSION="${BASE_VERSION}.dev${COMMIT_COUNT}+${SUFFIX}.g${COMMIT_HASH}"
echo "Base version: ${BASE_VERSION}"
echo "PR wheel version: ${WHEEL_VERSION}"
echo "Commit: ${COMMIT_HASH}"
echo "Build date: ${BUILD_DATE}"
echo "wheel_version=${WHEEL_VERSION}" >> $GITHUB_OUTPUT
echo "commit_hash=${COMMIT_HASH}" >> $GITHUB_OUTPUT
echo "base_version=${BASE_VERSION}" >> $GITHUB_OUTPUT
echo "build_date=${BUILD_DATE}" >> $GITHUB_OUTPUT
- name: Update pyproject.toml with PR wheel version
run: |
cd python
WHEEL_VERSION="${{ steps.gen_version.outputs.wheel_version }}"
# Update pyproject.toml to use static version instead of dynamic
# Remove 'version' from dynamic list and add static version
sed -i 's/dynamic = \["version"\]/dynamic = []/' pyproject.toml
sed -i "/^name = \"sglang\"/a version = \"${WHEEL_VERSION}\"" pyproject.toml
# Verify update
echo "Updated version in pyproject.toml:"
grep "^version" pyproject.toml
grep "^dynamic" pyproject.toml
- name: Install build dependencies
run: |
cd python
pip install build wheel setuptools
- name: Build wheel
run: |
cd python
cp ../README.md ../LICENSE .
python3 -m build --wheel
# List built wheels
echo "Built wheel:"
ls -lh dist/
- name: Upload wheel artifact
uses: actions/upload-artifact@v4
with:
name: pr-wheel-${{ inputs.pr_number }}
path: python/dist/*.whl
retention-days: 30
release-pr-wheel:
needs: build-pr-wheel
runs-on: ubuntu-latest
environment: 'prod'
steps:
- uses: actions/checkout@v4
- name: Download wheel artifact
uses: actions/download-artifact@v4
with:
name: pr-wheel-${{ inputs.pr_number }}
path: dist/
- name: List downloaded wheels
run: |
echo "Downloaded wheel:"
ls -lh dist/
- name: Create GitHub Release for PR wheel
uses: softprops/action-gh-release@v2
with:
tag_name: pr-${{ inputs.pr_number }}-${{ needs.build-pr-wheel.outputs.build_date }}-${{ needs.build-pr-wheel.outputs.commit_hash }}
name: "PR #${{ inputs.pr_number }} Build (${{ needs.build-pr-wheel.outputs.commit_hash }})"
repository: sgl-project/whl
token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
prerelease: true
body: |
PR wheel build from PR #${{ inputs.pr_number }}
Commit: ${{ needs.build-pr-wheel.outputs.commit_hash }}
Build date: ${{ needs.build-pr-wheel.outputs.build_date }}
Version: ${{ needs.build-pr-wheel.outputs.wheel_version }}
**Installation via index (pip):**
```bash
pip install sglang==${{ needs.build-pr-wheel.outputs.wheel_version }} --index-url https://sgl-project.github.io/whl/pr/
```
**Installation via index (uv):**
```bash
uv pip install sglang==${{ needs.build-pr-wheel.outputs.wheel_version }} --index-url https://sgl-project.github.io/whl/pr/ --extra-index-url https://pypi.org/simple --index-strategy unsafe-best-match
```
**Direct installation:**
```bash
pip install https://github.com/sgl-project/whl/releases/download/pr-${{ inputs.pr_number }}-${{ needs.build-pr-wheel.outputs.build_date }}-${{ needs.build-pr-wheel.outputs.commit_hash }}/sglang-${{ needs.build-pr-wheel.outputs.wheel_version }}-py3-none-any.whl
```
files: |
dist/*.whl
- name: Clone wheel index repository
run: |
git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
cd sgl-whl
git config --local user.name "sglang-bot"
git config --local user.email "sglangbot@gmail.com"
env:
WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Update wheel index
run: |
python3 scripts/update_pr_whl_index.py \
--pr-number ${{ inputs.pr_number }} \
--commit-hash ${{ needs.build-pr-wheel.outputs.commit_hash }} \
--wheel-version ${{ needs.build-pr-wheel.outputs.wheel_version }} \
--build-date ${{ needs.build-pr-wheel.outputs.build_date }}
- name: Push wheel index
run: |
cd sgl-whl
git add -A
git diff --staged --quiet || git commit -m "Update PR wheel index for PR #${{ inputs.pr_number }} (commit ${{ needs.build-pr-wheel.outputs.commit_hash }})"
git push

@@ -0,0 +1,31 @@
name: Release PyPI
on:
push:
tags:
- 'v[0-9]+.*'
workflow_dispatch:
jobs:
publish:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-latest
environment: "prod"
steps:
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.10"
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0 # Required for setuptools-scm to determine version from tags
- name: Upload to pypi
run: |
cd python
cp ../README.md ../LICENSE .
pip install build wheel setuptools setuptools-scm
python3 -m build
pip install twine
python3 -m twine upload dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }}

@@ -0,0 +1,68 @@
name: Release Tag
# Creates a git tag to trigger release workflows (PyPI, Docker)
# Use this after testing on a release branch is complete
on:
workflow_dispatch:
inputs:
version:
description: 'Version to tag (without v prefix, e.g., 0.5.7)'
required: true
type: string
ref:
description: 'Branch or commit to tag (e.g., release/v0.5.7, main, or commit SHA)'
required: false
default: 'main'
type: string
permissions:
contents: write
jobs:
create-tag:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-latest
environment: 'prod'
steps:
- name: Validate version format
run: |
VERSION="${{ github.event.inputs.version }}"
if [ -z "$VERSION" ]; then
echo "::error::Version is required"
exit 1
fi
if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
echo "::error::Invalid version format: $VERSION (expected: X.Y.Z or X.Y.Z.postN)"
exit 1
fi
echo "Version validated: v$VERSION"
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: ${{ github.event.inputs.ref }}
fetch-depth: 0
token: ${{ secrets.GITHUB_TOKEN }}
- name: Check if tag already exists
run: |
TAG="v${{ github.event.inputs.version }}"
if git rev-parse "$TAG" >/dev/null 2>&1; then
echo "::error::Tag $TAG already exists"
exit 1
fi
echo "Tag $TAG does not exist, proceeding..."
- name: Create and push tag
run: |
TAG="v${{ github.event.inputs.version }}"
REF="${{ github.event.inputs.ref }}"
git config user.name "sglang-bot"
git config user.email "sglang-bot@users.noreply.github.com"
echo "Creating tag $TAG on ref $REF (commit: $(git rev-parse HEAD))"
git tag -a "$TAG" -m "Release $TAG"
git push origin "$TAG"
echo "::notice::Successfully created and pushed tag $TAG"
echo "This will trigger the release workflows (PyPI, Docker)"

@@ -0,0 +1,440 @@
name: Release SGLang Kernels
on:
push:
branches:
- main
paths:
- sgl-kernel/python/sgl_kernel/version.py
workflow_dispatch:
inputs:
target:
type: choice
description: 'Build target'
required: false
default: 'all'
options:
- 'all'
- 'cu129'
- 'cu130'
- 'rocm700'
- 'rocm720'
- 'musa43'
tag_name:
description: "Version number, must be in the form of vX.Y.Z (e.g. v0.4.0)"
type: string
required: false
pr_number:
description: "PR number to build from (e.g. 12345)"
type: string
required: false
concurrency:
group: release-sglang-kernels-${{ github.ref }}
cancel-in-progress: true
jobs:
build-cu129-matrix:
if: |
github.repository == 'sgl-project/sglang' &&
(github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'cu129')
strategy:
matrix:
python-version: ["3.10"]
cuda-version: ["12.9"]
arch: [x86_64, aarch64]
include:
- arch: x86_64
runner: x64-kernel-build-node
- arch: aarch64
runner: arm-kernel-build-node
runs-on: ${{ matrix.runner }}
steps:
- uses: actions/checkout@v4
with:
submodules: "recursive"
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Build wheels
run: |
cd sgl-kernel
chmod +x ./build.sh
./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" ${{ matrix.arch == 'aarch64' && 'aarch64' || '' }}
env:
BUILD_JOBS: 64
NVCC_THREADS: 8
- name: Upload to PyPI
working-directory: sgl-kernel
run: |
pip install twine
python3 -m twine upload --skip-existing dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN_SGLANG_KERNEL }}
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: wheel-python${{ matrix.python-version }}-cuda${{ matrix.cuda-version }}${{ matrix.arch == 'aarch64' && '-aarch64' || '' }}
path: sgl-kernel/dist/*
release-cu129:
needs: build-cu129-matrix
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Download artifacts
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-*
- name: Set tag name
id: set_tag_name
run: |
if [ -z "${{ inputs.tag_name }}" ]; then
TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)"
echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT
else
echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT
fi
- name: Release
uses: softprops/action-gh-release@v2
with:
tag_name: ${{ steps.set_tag_name.outputs.tag_name }}
repository: sgl-project/whl
token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
files: |
sgl-kernel/dist/*
- name: Clone wheel index
run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
env:
WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
- name: Update wheel index
run: python3 scripts/update_kernel_whl_index.py --cuda 129
- name: Push wheel index
run: |
cd sgl-whl
git config --local user.name "sglang-bot"
git config --local user.email "sglangbot@gmail.com"
git add -A
git commit -m "update whl index"
git push
# For now, CUDA 13.0 wheels are not released to PyPI.
build-cu130-matrix:
if: |
github.repository == 'sgl-project/sglang' &&
(github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'cu130')
strategy:
matrix:
python-version: ["3.10"]
cuda-version: ["13.0"]
arch: [x86_64, aarch64]
include:
- arch: x86_64
runner: x64-kernel-build-node
- arch: aarch64
runner: arm-kernel-build-node
runs-on: ${{ matrix.runner }}
steps:
- uses: actions/checkout@v4
with:
submodules: "recursive"
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Build wheels
run: |
cd sgl-kernel
chmod +x ./build.sh
./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" ${{ matrix.arch == 'aarch64' && 'aarch64' || '' }}
env:
BUILD_JOBS: 64
NVCC_THREADS: 8
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: wheel-python${{ matrix.python-version }}-cuda${{ matrix.cuda-version }}${{ matrix.arch == 'aarch64' && '-aarch64' || '' }}
path: sgl-kernel/dist/*
release-cu130:
needs: build-cu130-matrix
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Download artifacts
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-*
- name: Set tag name
id: set_tag_name
run: |
if [ -z "${{ inputs.tag_name }}" ]; then
TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)"
echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT
else
echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT
fi
- name: Release
uses: softprops/action-gh-release@v2
with:
tag_name: ${{ steps.set_tag_name.outputs.tag_name }}
repository: sgl-project/whl
token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
files: |
sgl-kernel/dist/*
- name: Clone wheel index
run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
env:
WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
- name: Update wheel index
run: python3 scripts/update_kernel_whl_index.py --cuda 130
- name: Push wheel index
run: |
cd sgl-whl
git config --local user.name "sglang-bot"
git config --local user.email "sglangbot@gmail.com"
git add -A
git commit -m "update whl index"
git push
build-rocm-matrix:
if: |
github.repository == 'sgl-project/sglang' &&
(github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'rocm700' || github.event.inputs.target == 'rocm720')
runs-on: amd-docker-scale
strategy:
matrix:
python-version: ["3.10"]
rocm-version: ["700", "720"]
steps:
- uses: actions/checkout@v4
with:
submodules: "recursive"
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Build wheels
run: |
cp 3rdparty/amd/wheel/sgl-kernel/* sgl-kernel/
cd sgl-kernel
chmod +x ./build_rocm.sh
./build_rocm.sh "${{ matrix.rocm-version }}"
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: wheel-python${{ matrix.python-version }}-rocm${{ matrix.rocm-version }}
path: sgl-kernel/dist/*
release-rocm700:
needs: build-rocm-matrix
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Download artifacts
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-*-rocm700
- name: Set tag name
id: set_tag_name
run: |
if [ -z "${{ inputs.tag_name }}" ]; then
TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)"
echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT
else
echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT
fi
- name: Release
uses: softprops/action-gh-release@v2
with:
tag_name: ${{ steps.set_tag_name.outputs.tag_name }}
repository: sgl-project/whl
token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
files: |
sgl-kernel/dist/*
- name: Clone wheel index
run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
env:
WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
- name: Update wheel index
run: python3 scripts/update_kernel_whl_index.py --rocm 700
- name: Push wheel index
run: |
cd sgl-whl
git config --local user.name "sglang-bot"
git config --local user.email "sglangbot@gmail.com"
git add -A
git commit -m "update whl index"
git push
release-rocm720:
needs: build-rocm-matrix
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Download artifacts
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-*-rocm720
- name: Set tag name
id: set_tag_name
run: |
if [ -z "${{ inputs.tag_name }}" ]; then
TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)"
echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT
else
echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT
fi
- name: Release
uses: softprops/action-gh-release@v2
with:
tag_name: ${{ steps.set_tag_name.outputs.tag_name }}
repository: sgl-project/whl
token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
files: |
sgl-kernel/dist/*
- name: Clone wheel index
run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
env:
WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
- name: Update wheel index
run: python3 scripts/update_kernel_whl_index.py --rocm 720
- name: Push wheel index
run: |
cd sgl-whl
git config --local user.name "sglang-bot"
git config --local user.email "sglangbot@gmail.com"
git add -A
git commit -m "update whl index"
git push
build-musa43:
if: |
github.repository == 'sgl-project/sglang' &&
(github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'musa43')
runs-on: kernel-build-node-musa
strategy:
matrix:
python-version: ["3.10"]
musa-version: ["43"]
steps:
- uses: actions/checkout@v4
with:
submodules: "recursive"
- name: Build wheels
run: |
cd sgl-kernel
mv pyproject_musa.toml pyproject.toml
python setup_musa.py sdist bdist_wheel
- name: Rename MUSA wheels
run: |
bash scripts/ci/musa/rename_wheels_musa.sh ${{ matrix.musa-version }} sgl-kernel/dist
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: wheel-python${{ matrix.python-version }}-musa${{ matrix.musa-version }}
path: sgl-kernel/dist/*
release-musa43:
needs: build-musa43
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Download artifacts
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-*
- name: Set tag name
id: set_tag_name
run: |
if [ -z "${{ inputs.tag_name }}" ]; then
TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)"
echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT
else
echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT
fi
- name: Release
uses: softprops/action-gh-release@v2
with:
tag_name: ${{ steps.set_tag_name.outputs.tag_name }}
repository: sgl-project/whl
token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
files: |
sgl-kernel/dist/*
- name: Clone wheel index
run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
env:
WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
- name: Update wheel index
run: python3 scripts/update_kernel_whl_index.py --musa 43
- name: Push wheel index
run: |
cd sgl-whl
git config --local user.name "sglang-bot"
git config --local user.email "sglangbot@gmail.com"
git add -A
git commit -m "update whl index"
git push

@@ -0,0 +1,136 @@
name: Rerun Test
run-name: ${{ inputs.pr_head_sha && format('[rerun-test] {0} {1}', inputs.test_command, inputs.pr_head_sha) || format('[rerun-test] {0}', inputs.test_command) }}
on:
workflow_dispatch:
inputs:
test_command:
description: "Test command(s) to run, one per line (e.g. 'registered/core/test_srt_endpoint.py TestSRTEndpoint.test_simple_decode')"
required: true
type: string
runner_label:
description: "Runner label"
required: true
type: choice
options:
- 1-gpu-h100
- 1-gpu-5090
- 2-gpu-h100
- 4-gpu-h100
- 4-gpu-a10
- 4-gpu-b200
- 8-gpu-h200
- 8-gpu-h20
- 8-gpu-b200
- ubuntu-latest
pr_head_sha:
description: "PR head SHA to checkout (for /rerun-test on fork PRs)"
required: false
type: string
default: ""
use_deepep:
description: "Use ci_install_deepep.sh instead of ci_install_dependency.sh"
required: false
type: string
default: "false"
is_cpu:
description: "Run as CPU-only test (uses ubuntu-latest with uv pip install)"
required: false
type: string
default: "false"
env:
SGLANG_IS_IN_CI: true
SGLANG_CUDA_COREDUMP: "1"
SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true
permissions:
actions: write
contents: read
issues: read
jobs:
rerun-test-cuda:
if: inputs.is_cpu != 'true'
runs-on: ${{ inputs.runner_label }}
timeout-minutes: 120
env:
RUNNER_LABELS: ${{ inputs.runner_label }}
SGLANG_CI_RDMA_ALL_DEVICES: ${{ inputs.runner_label == '8-gpu-h20' && 'mlx5_1,mlx5_2,mlx5_3,mlx5_4' || '' }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || github.sha }}
- uses: ./.github/actions/check-maintenance
- name: Install dependencies
timeout-minutes: 20
run: |
if [[ "${{ inputs.runner_label }}" == "1-gpu-5090" ]]; then
source /etc/profile.d/sglang-ci.sh
fi
if [[ "${{ inputs.use_deepep }}" == "true" ]]; then
bash scripts/ci/cuda/ci_install_deepep.sh
else
bash scripts/ci/cuda/ci_install_dependency.sh
fi
- name: Run test
timeout-minutes: 60
run: |
if [[ "${{ inputs.runner_label }}" == "1-gpu-5090" ]]; then
source /etc/profile.d/sglang-ci.sh
fi
cd test/
echo "${{ inputs.test_command }}" | while IFS= read -r cmd; do
[ -z "$cmd" ] && continue
echo ">>> Running: python3 $cmd"
python3 $cmd || exit 1
done
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
rerun-test-cpu:
if: inputs.is_cpu == 'true'
runs-on: ubuntu-latest
timeout-minutes: 120
steps:
- name: Free disk space
run: |
sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc
df -h
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || github.sha }}
- uses: ./.github/actions/check-maintenance
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install uv
uses: astral-sh/setup-uv@v5
- name: Install dependencies
timeout-minutes: 20
env:
UV_SYSTEM_PYTHON: "1"
run: |
uv pip install -e "python[dev]" --index-strategy unsafe-best-match --prerelease allow
- name: Run test
timeout-minutes: 60
run: |
cd test/
echo "${{ inputs.test_command }}" | while IFS= read -r cmd; do
[ -z "$cmd" ] && continue
echo ">>> Running: python3 $cmd"
python3 $cmd || exit 1
done

@@ -0,0 +1,30 @@
name: Retag Docker Image
on:
workflow_dispatch:
inputs:
source_tag:
description: "Existing image tag (e.g., v0.4.7-cu129-amd64)"
required: true
target_tag:
description: "New tag to apply (e.g., latest)"
required: true
jobs:
retag:
if: github.repository == 'sgl-project/sglang'
runs-on: ubuntu-22.04
environment: "prod"
steps:
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Retag image
run: |
echo "Retagging lmsysorg/sglang:${{ inputs.source_tag }} -> lmsysorg/sglang:${{ inputs.target_tag }}"
docker buildx imagetools create \
-t lmsysorg/sglang:${{ inputs.target_tag }} \
lmsysorg/sglang:${{ inputs.source_tag }}

@@ -0,0 +1,43 @@
name: Runner Utilization Report
on:
schedule:
- cron: '0 8 * * *' # Daily at 8 AM UTC
pull_request:
paths:
- '.github/workflows/runner-utilization.yml'
- 'scripts/ci/utils/runner_utilization_report.py'
workflow_dispatch:
inputs:
hours:
description: 'Time window in hours'
required: false
default: '24'
type: string
filter:
description: 'Filter runner labels (e.g., 5090, h200)'
required: false
type: string
jobs:
report:
name: Generate Report
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Generate Utilization Report
timeout-minutes: 30
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
python scripts/ci/utils/runner_utilization_report.py \
--repo ${{ github.repository }} \
--hours ${{ inputs.hours || '24' }} \
${{ inputs.filter && format('--filter {0}', inputs.filter) || '' }}

@@ -0,0 +1,99 @@
name: Slash Command Handler
on:
issue_comment:
types: [created, edited]
permissions:
contents: read
pull-requests: write # Required to add labels and reactions
actions: write # Required to rerun workflows
issues: write # Required for comment reactions in some contexts
jobs:
slash_command:
# Only run if it is a PR and the comment contains a recognized command
# Use contains() since startsWith() can't handle leading whitespace/newlines
if: >
github.event.issue.pull_request &&
(contains(github.event.comment.body, '/tag-run-ci-label') ||
contains(github.event.comment.body, '/rerun-failed-ci') ||
contains(github.event.comment.body, '/tag-and-rerun-ci') ||
contains(github.event.comment.body, '/rerun-stage') ||
contains(github.event.comment.body, '/rerun-test'))
runs-on: ubuntu-latest
steps:
# SECURITY: This workflow runs on issue_comment trigger with elevated permissions
# (pull-requests: write, actions: write). For non-fork PRs, we can safely checkout
# the PR branch to allow testing changes to this handler. For fork PRs, we MUST
# stay on main to prevent untrusted code execution with these elevated permissions.
- name: Get PR details
id: pr
shell: bash
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
PR_DATA=$(gh pr view ${{ github.event.issue.number }} --repo ${{ github.repository }} --json headRefName,headRepositoryOwner) || {
echo "::error::Failed to fetch PR data"
exit 1
}
# Use 'empty' filter to handle null/missing values (e.g., deleted forks)
HEAD_OWNER=$(echo "$PR_DATA" | jq -r '.headRepositoryOwner.login // empty')
REPO_OWNER="${{ github.repository_owner }}"
# Treat missing/null owner as fork for security (fail-safe)
if [[ -z "$HEAD_OWNER" || "$HEAD_OWNER" != "$REPO_OWNER" ]]; then
IS_FORK="true"
else
IS_FORK="false"
fi
echo "is_fork=$IS_FORK" >> $GITHUB_OUTPUT
echo "ref=$(echo "$PR_DATA" | jq -r '.headRefName')" >> $GITHUB_OUTPUT
echo "pr_ref=refs/pull/${{ github.event.issue.number }}/head" >> $GITHUB_OUTPUT
echo "PR owner: $HEAD_OWNER, Repo owner: $REPO_OWNER, Is fork: $IS_FORK"
- name: Check commenter permission for fork PRs
id: perm
if: steps.pr.outputs.is_fork == 'true'
shell: bash
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
PERM=$(gh api repos/${{ github.repository }}/collaborators/${{ github.event.comment.user.login }}/permission --jq '.permission') || {
PERM="none"
echo "::warning::Failed to check commenter permission, defaulting to none"
}
if [[ "$PERM" == "admin" || "$PERM" == "maintain" || "$PERM" == "write" ]]; then
echo "safe_to_checkout_pr=true" >> $GITHUB_OUTPUT
else
echo "safe_to_checkout_pr=false" >> $GITHUB_OUTPUT
fi
echo "Commenter ${{ github.event.comment.user.login }} permission: $PERM"
- name: Checkout code
uses: actions/checkout@v4
with:
# For non-fork PRs: checkout PR branch by name
# For fork PRs with trusted commenter: checkout via refs/pull/N/head
# For fork PRs with untrusted commenter: stay on main for security
ref: ${{ steps.pr.outputs.is_fork == 'false' && steps.pr.outputs.ref || (steps.perm.outputs.safe_to_checkout_pr == 'true' && steps.pr.outputs.pr_ref || '') }}
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install dependencies
run: |
pip install PyGithub
- name: Handle Slash Command
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO_FULL_NAME: ${{ github.repository }}
PR_NUMBER: ${{ github.event.issue.number }}
COMMENT_ID: ${{ github.event.comment.id }}
COMMENT_BODY: ${{ github.event.comment.body }}
USER_LOGIN: ${{ github.event.comment.user.login }}
run: |
python scripts/ci/utils/slash_command_handler.py

@@ -0,0 +1,44 @@
name: Stress Test
on:
workflow_dispatch:
inputs:
num_prompts:
description: 'Number of prompts per model'
required: true
default: '50000'
type: string
duration_minutes:
description: 'Timeout per model in minutes'
required: true
default: '45'
type: string
jobs:
stress-test:
if: github.repository == 'sgl-project/sglang'
runs-on: 8-gpu-h200
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run stress tests
timeout-minutes: 210
env:
NUM_PROMPTS: ${{ inputs.num_prompts }}
DURATION_MINUTES: ${{ inputs.duration_minutes }}
run: |
cd test
python3 run_suite.py --hw cuda --suite stress
- name: Upload results
if: always()
uses: actions/upload-artifact@v4
with:
name: stress-test-results
path: |
stress_test_*.jsonl

@@ -0,0 +1,85 @@
name: Trivy Scan Dev Docker Images
on:
# Run daily after nightly dev builds (which run at midnight UTC)
schedule:
- cron: "0 6 * * *"
workflow_dispatch:
inputs:
tag:
description: "Image tag to scan (e.g., dev, dev-cu13, latest)"
required: false
default: ""
jobs:
scan:
if: github.repository == 'sgl-project/sglang'
runs-on: x64-docker-build-node
timeout-minutes: 45
permissions:
contents: read
security-events: write
strategy:
fail-fast: false
matrix:
tag: ${{ inputs.tag && fromJSON(format('["{0}"]', inputs.tag)) || fromJSON('["dev", "dev-cu13"]') }}
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@v0.35.0
with:
image-ref: 'docker.io/lmsysorg/sglang:${{ matrix.tag }}'
scanners: 'vuln'
format: 'sarif'
output: 'trivy-results-${{ matrix.tag }}.sarif'
severity: 'CRITICAL,HIGH'
ignore-unfixed: true
skip-dirs: 'usr/local/go,opt/nvidia'
- name: Upload Trivy scan results to GitHub Security
uses: github/codeql-action/upload-sarif@v4
if: always() && hashFiles(format('trivy-results-{0}.sarif', matrix.tag)) != ''
with:
sarif_file: 'trivy-results-${{ matrix.tag }}.sarif'
category: 'trivy-${{ matrix.tag }}'
- name: Run Trivy (table output for logs)
if: success()
uses: aquasecurity/trivy-action@v0.35.0
with:
image-ref: 'docker.io/lmsysorg/sglang:${{ matrix.tag }}'
scanners: 'vuln'
format: 'table'
severity: 'CRITICAL,HIGH'
ignore-unfixed: true
skip-dirs: 'usr/local/go,opt/nvidia'
- name: Scan summary
if: always()
run: |
IMAGE="docker.io/lmsysorg/sglang:${{ matrix.tag }}"
SARIF="trivy-results-${{ matrix.tag }}.sarif"
echo "## Trivy Scan: \`${{ matrix.tag }}\`" >> "$GITHUB_STEP_SUMMARY"
if [ ! -f "${SARIF}" ]; then
echo "**Status:** Scan failed — no SARIF output produced" >> "$GITHUB_STEP_SUMMARY"
exit 0
fi
VULN_COUNT=$(python3 -c "
import json
data = json.load(open('${SARIF}'))
print(sum(len(run.get('results', [])) for run in data.get('runs', [])))
")
echo "- **Image**: \`${IMAGE}\`" >> "$GITHUB_STEP_SUMMARY"
echo "- **Findings**: ${VULN_COUNT}" >> "$GITHUB_STEP_SUMMARY"
if [ "${VULN_COUNT}" = "0" ]; then
echo "- **Result**: No CRITICAL/HIGH unfixed vulnerabilities found" >> "$GITHUB_STEP_SUMMARY"
else
echo "- **Result**: Found ${VULN_COUNT} finding(s) — check the Security tab for details" >> "$GITHUB_STEP_SUMMARY"
fi

@@ -0,0 +1,49 @@
name: Weekly Test (Nvidia)
on:
schedule:
- cron: '0 0 * * 0' # Run every Sunday at midnight UTC
workflow_dispatch:
inputs:
job_filter:
description: 'Select which job to run (leave empty or "all" to run all jobs)'
required: false
type: choice
default: 'all'
options:
- 'all'
- 'weekly-test-8-gpu-h200'
concurrency:
group: weekly-test-nvidia-${{ github.ref }}
cancel-in-progress: true
env:
SGLANG_IS_IN_CI: true
HF_HUB_DOWNLOAD_TIMEOUT: 300
HF_HUB_ETAG_TIMEOUT: 300
jobs:
# Weekly tests - 8 GPU H200
weekly-test-8-gpu-h200:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'weekly-test-8-gpu-h200')
runs-on: 8-gpu-h200
timeout-minutes: 120
env:
RUNNER_LABELS: 8-gpu-h200
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
- name: Run weekly 8-GPU H200 tests
timeout-minutes: 120
env:
GPU_CONFIG: "8-gpu-h200"
IS_H200: "1"
run: |
cd test
python3 run_suite.py --hw cuda --suite weekly-8-gpu-h200 --nightly --continue-on-error --timeout-per-file 7200

third_party/sglang/.gitignore

@@ -0,0 +1,274 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
**/build/
**/develop-eggs/
**/dist/
**/downloads/
**/eggs/
.eggs/
**/lib/
**/lib64/
**/parts/
**/sdist/
**/var/
**/wheels/
**/share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
# Tokenizer cache for tests
.tokenizer_cache/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
# MacOS
.DS_Store
# Vim
*.swp
# Documentation
docs/_build
# SGL
benchmark/mmlu/data
benchmark/mmlu/data.tar
benchmark/llava_bench/images
benchmark/llava_bench/mme_pack
*.jsonl
tmp*.txt
# Torch Compile logs
tl_out/
# Plots
*.png
*.pdf
# personal
work_dirs/
*.csv
!logo.png
# Prerequisites
*.d
# Compiled Object files
*.slo
*.lo
*.o
*.obj
# Precompiled Headers
*.gch
*.pch
# Compiled Dynamic libraries
*.so
*.dylib
*.dll
# Fortran module files
*.mod
*.smod
# Compiled Static libraries
*.lai
*.la
*.a
*.lib
# Executables
*.exe
*.out
*.app
*.iml
# VSCode
.vscode
# Autoenv
.env.leave
# Rust lib
Cargo.lock
# Generated vision test fixtures (regenerate with: python scripts/generate_vision_golden.py)
sgl-model-gateway/tests/fixtures/golden/
# Other repos
lmms-eval
**/.serena/
ctags/
outputs/
inputs/
# Eval Cache
.longbench_cache/
# CUDA kernel develop, profile and debug
.clangd
*.nsys-rep
*.ncu-rep
*.nvcudmp
# setuptools-scm generated version file
python/sglang/_version.py
# MUSA section
# Generated source files by torchada
sgl-kernel/csrc_musa/
sgl-kernel/include_musa/
sgl-kernel/csrc/**/*_musa/
# MUSA core dump files
*.mudmp
# Others
# diffusion 3D outputs
*.glb
*.ply
*.npz

third_party/sglang/.isort.cfg

@@ -0,0 +1,3 @@
[settings]
profile=black
known_first_party=sglang

@@ -0,0 +1,102 @@
default_stages: [pre-commit, pre-push, manual]
exclude: ^(python/sglang/multimodal_gen/csrc|python/sglang/jit_kernel/flash_attention/cute)
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v6.0.0
hooks:
- id: check-symlinks
- id: destroyed-symlinks
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
args: [--allow-multiple-documents]
- id: check-toml
- id: check-ast
- id: check-added-large-files
- id: check-merge-conflict
- id: check-shebang-scripts-are-executable
- id: detect-private-key
exclude: ^sgl-model-gateway/tests/.*_test\.rs$
- id: debug-statements
- id: no-commit-to-branch
- repo: https://github.com/PyCQA/isort
rev: 7.0.0
hooks:
- id: isort
exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$'
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.15.1
hooks:
- id: ruff
args:
- --select=F401,F821
- --fix
files: ^(benchmark/|docs/|examples/|python/sglang/|sgl-model-gateway/py_*|test/)
exclude: |
(?x)^(
.*/__init__\.py$|
.*\.ipynb$|
python/sglang/srt/grpc/.*_pb2\.py$|
python/sglang/srt/grpc/.*_pb2_grpc\.py$|
python/sglang/srt/grpc/.*_pb2\.pyi$|
python/sglang/srt/grpc/.*_pb2_grpc\.pyi$|
)$
- repo: https://github.com/psf/black
rev: 26.1.0
hooks:
- id: black-jupyter
exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$'
- repo: https://github.com/codespell-project/codespell
rev: v2.4.1
hooks:
- id: codespell
args: ['--config', '.codespellrc']
- repo: https://github.com/pre-commit/mirrors-clang-format
rev: v20.1.7
hooks:
- id: clang-format
types_or: [c++, cuda]
args: [--style=file, --verbose]
- repo: https://github.com/kynan/nbstripout
rev: 0.9.0
hooks:
- id: nbstripout
args:
- '--keep-output'
- '--extra-keys=metadata.kernelspec metadata.language_info.version'
- repo: local
hooks:
- id: check-chinese-characters
name: check chinese characters in multimodal_gen
entry: >-
python3 -c 'import sys, re; p=re.compile(r"[\u4e00-\u9fff]"); ec=0; [ ([(print(f"{f}:{i+1}: {l.strip()}") or (ec:=1)) for i,l in enumerate(open(f, "r", encoding="utf-8", errors="ignore")) if p.search(l)]) for f in sys.argv[1:] ]; sys.exit(ec)'
language: system
files: ^python/sglang/multimodal_gen/.*
exclude: ^(python/sglang/multimodal_gen/configs/sample|python/sglang/multimodal_gen/apps/ComfyUI_SGLDiffusion/workflows|python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages)(/|$)
types_or: [python, markdown, json, text]
- id: sort-ci-permissions
name: sort CI_PERMISSIONS.json
entry: python3 .github/update_ci_permission.py --sort-only
language: system
files: ^\.github/CI_PERMISSIONS\.json$
pass_filenames: false
- id: check-workflow-job-names
name: check for duplicate workflow job names
entry: python3 scripts/ci/check_workflow_job_names.py
language: system
files: ^\.github/workflows/.*\.yml$
pass_filenames: false
- repo: https://github.com/lycheeverse/lychee.git
rev: lychee-v0.22.0
hooks:
- id: lychee
name: check doc links (offline)
args: ["--config", ".github/linters/lychee.toml"]
stages: [manual]
exclude: ^docs/_build/
files: |
(?x)^(
README\.md|
docs/.*\.(md|rst|ipynb)
)$

@@ -0,0 +1,425 @@
## Profiling SGLang Infer System with AMD GPUs
This AppNote describes the SGLang profiling techniques, code changes, and run steps for systems with AMD Instinct GPUs; the same procedure may also work with Nvidia GPUs.
Examples and steps are given in detail so that results are easy to reproduce and performance problems can be localized before optimizing.
Two primary methods are covered:
- [RPD](https://github.com/ROCm/rocmProfileData.git)
- [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)
### Profiling SGLang Infer System with RPD Profiler
RPD is a low-overhead, cross-platform profiler, so the same RPD code changes work for profiling on ROCm/AMD GPUs and on CUDA/Nvidia GPUs. To profile SGLang with RPD, use the scripts and patch files included in this directory and follow the steps below:
1. Install RPD using install_rpd.sh, which applies rpd.patch during installation. Both files are in this directory.
install_rpd.sh
```bash
# download and install RPD
apt update && apt install -y sqlite3 libsqlite3-dev libfmt-dev
# install rpd module
git clone https://github.com/ROCmSoftwarePlatform/rocmProfileData
cd rocmProfileData
git checkout 976899e9c6dbc6dd2bccf770818e4e44125590ac
git apply rpd.patch
make && make install
cd rocpd_python && python setup.py install && cd ..
cd rpd_tracer && make clean;make install && python setup.py install && cd ..
```
rpd.patch
```diff
diff --git a/rpd_tracer/Makefile b/rpd_tracer/Makefile
index e9d9feb..b2e9e1a 100644
--- a/rpd_tracer/Makefile
+++ b/rpd_tracer/Makefile
@@ -16,7 +16,7 @@ ifneq (,$(HIP_PATH))
$(info Building with roctracer)
RPD_LIBS += -L/opt/rocm/lib -lroctracer64 -lroctx64 -lamdhip64 -lrocm_smi64
RPD_INCLUDES += -I/opt/rocm/include -I/opt/rocm/include/roctracer -I/opt/rocm/include/hsa
- RPD_SRCS += RoctracerDataSource.cpp RocmSmiDataSource.cpp
+ RPD_SRCS += RoctracerDataSource.cpp
RPD_INCLUDES += -D__HIP_PLATFORM_AMD__
endif
```
2. Add the loadTracer.sh file included in this directory to /sglang/python/sglang.
loadTracer.sh
```bash
#!/bin/bash
################################################################################
# Copyright (c) 2021 - 2023 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
################################################################################
OUTPUT_FILE="trace.rpd"
if [ "$1" = "-o" ] ; then
OUTPUT_FILE=$2
shift
shift
fi
if [ -e ${OUTPUT_FILE} ] ; then
rm ${OUTPUT_FILE}
fi
python3 -m rocpd.schema --create ${OUTPUT_FILE}
if [ $? != 0 ] ; then
echo "Error: Could not create rpd file. Please run 'python setup.py install' from the rocpd_python dir"
exit 1
fi
export RPDT_FILENAME=${OUTPUT_FILE}
export RPDT_AUTOSTART=0
LD_PRELOAD=librocm-smi_64:librpd_tracer.so "$@"
```
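When the server is launched from Python rather than from a shell, the environment that loadTracer.sh prepares can be mirrored roughly as in the sketch below. It only reproduces the three variables visible at the end of the script (`RPDT_FILENAME`, `RPDT_AUTOSTART`, `LD_PRELOAD`); the helper name `tracer_env` is hypothetical and not part of RPD or SGLang, and the empty trace file still has to be created beforehand with `python3 -m rocpd.schema --create`.

```python
import os


def tracer_env(output_file="trace.rpd", base_env=None):
    """Return a copy of the environment with the RPD tracer variables set.

    Mirrors the exports at the end of loadTracer.sh. `tracer_env` is a
    hypothetical helper used only for illustration.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["RPDT_FILENAME"] = output_file  # trace file named by the script
    # The script sets this to 0; the patches in this note then call
    # start() explicitly, so recording is assumed to wait for that call.
    env["RPDT_AUTOSTART"] = "0"
    env["LD_PRELOAD"] = "librocm-smi_64:librpd_tracer.so"
    return env


if __name__ == "__main__":
    env = tracer_env("trace.rpd", base_env={})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the server process.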
3. Apply the patch provided in this directory with "git apply rpd_profile_server_enable.patch" if the main goal is to collect information on GPU kernels along with limited CPU activity.
#### Common Notes 1
Although the example uses TP=8, the patch file purposely logs RPD profiling on only 2 ranks (i.e. tp_rank=0/1) for profiling and visualization convenience, because even Perfetto's streaming mode can load a JSON file of at most 8GB. With 2 ranks logged, we can still check for issues among ranks (e.g. load imbalance or NCCL issues) while recording a relatively longer time span before the JSON file generated from the RPD file reaches 8GB.
rpd_profile_server_enable.patch
```diff
diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
index 62d1ff9..9021c01 100644
--- a/python/sglang/srt/managers/scheduler.py
+++ b/python/sglang/srt/managers/scheduler.py
@@ -71,6 +71,8 @@ from sglang.srt.utils import (
suppress_other_loggers,
)
from sglang.utils import get_exception_traceback
+from rpdTracerControl import rpdTracerControl
+rpdTracerControl.skipCreate()
logger = logging.getLogger(__name__)
@@ -245,6 +247,7 @@ class Scheduler:
],
with_stack=True,
)
+ self.rpd = rpdTracerControl()
@torch.inference_mode()
def event_loop(self):
@@ -1027,15 +1030,24 @@ class Scheduler:
def start_profile(self) -> None:
if self.profiler is None:
raise RuntimeError("Profiler is not enabled.")
- self.profiler.start()
+ #self.profiler.start() #block pytorch profiler for rpd profiler enabling
+ if self.tp_rank == 0 or self.tp_rank == 1:
+ self.rpd.start()
+ self.rpd.rangePush("", "rpd profile range", "")
+ logger.info("rpd is enabled")
def stop_profile(self) -> None:
if self.profiler is None:
raise RuntimeError("Profiler is not enabled.")
- self.profiler.stop()
- self.profiler.export_chrome_trace(
- self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
- )
+ #self.profiler.stop()
+ #self.profiler.export_chrome_trace(
+ # self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
+ #)
+ if self.tp_rank ==0 or self.tp_rank ==1:
+ self.rpd.rangePop()
+ self.rpd.stop()
+ self.rpd.flush()
+ logger.info("rpd is done")
logger.info("Profiler is done")
```
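The rank gating used in the patch above can be tried outside the server with a stand-in tracer, as in the following sketch. `RecordingTracer` is a hypothetical stub that only records calls; it is not the real `rpdTracerControl`, and the helper names `start_rpd` and `stop_rpd` are assumptions made for illustration. The call order (start, rangePush, then rangePop, stop, flush) follows the patch.

```python
class RecordingTracer:
    """Hypothetical stand-in for rpdTracerControl that records calls."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def rangePush(self, arg0, name, arg2):
        # Three positional arguments, as called in the patch; the meaning
        # of the first and last one is not stated there.
        self.calls.append(f"rangePush:{name}")

    def rangePop(self):
        self.calls.append("rangePop")

    def stop(self):
        self.calls.append("stop")

    def flush(self):
        self.calls.append("flush")


PROFILED_RANKS = frozenset({0, 1})  # same two ranks as the patch


def start_rpd(tracer, tp_rank, ranks=PROFILED_RANKS):
    """Open a profiling range only on the selected tensor-parallel ranks."""
    if tp_rank not in ranks:
        return False
    tracer.start()
    tracer.rangePush("", "rpd profile range", "")
    return True


def stop_rpd(tracer, tp_rank, ranks=PROFILED_RANKS):
    """Close the range, stop, and flush, in the order used by the patch."""
    if tp_rank not in ranks:
        return False
    tracer.rangePop()
    tracer.stop()
    tracer.flush()
    return True
```

Widening `PROFILED_RANKS` records more ranks at the cost of a larger trace, which matters given the JSON size limit mentioned in Common Notes 1.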
#### Advanced Debugging with RPD Profiler
Sometimes we want the RPD profiler to capture more CPU and Python activity in order to debug challenging issues (e.g. the root cause of load imbalance across GPU processes, or of bubbles). Only in such cases, apply the patch with "git apply rpd_profile_server_enable_wCPU_activities.patch", which modifies 3 files.
rpd_profile_server_enable_wCPU_activities.patch
```diff
diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
index 62d1ff9..2edb427 100644
--- a/python/sglang/srt/managers/scheduler.py
+++ b/python/sglang/srt/managers/scheduler.py
@@ -71,6 +71,8 @@ from sglang.srt.utils import (
suppress_other_loggers,
)
from sglang.utils import get_exception_traceback
+from rpdTracerControl import rpdTracerControl
+rpdTracerControl.skipCreate()
logger = logging.getLogger(__name__)
@@ -245,6 +247,7 @@ class Scheduler:
],
with_stack=True,
)
+ self.rpd = rpdTracerControl()
@torch.inference_mode()
def event_loop(self):
@@ -1027,15 +1030,26 @@ class Scheduler:
def start_profile(self) -> None:
if self.profiler is None:
raise RuntimeError("Profiler is not enabled.")
- self.profiler.start()
+ #self.profiler.start()
+ logger.info("torch profiler is disabled")
+ if self.tp_rank == 0 or self.tp_rank == 1:
+ self.rpd.setPythonTrace(True)
+ self.rpd.start()
+ self.rpd.rangePush("", "scheduler", "")
+ logger.info("rpd is enabled inside scheduler profiling")
def stop_profile(self) -> None:
if self.profiler is None:
raise RuntimeError("Profiler is not enabled.")
- self.profiler.stop()
- self.profiler.export_chrome_trace(
- self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
- )
+ #self.profiler.stop()
+ #self.profiler.export_chrome_trace(
+ # self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
+ #)
+ if self.tp_rank ==0 or self.tp_rank ==1:
+ self.rpd.rangePop()
+ self.rpd.stop()
+ self.rpd.flush()
+ logger.info("rpd is done inside scheduler")
logger.info("Profiler is done")
diff --git a/python/sglang/srt/managers/tokenizer_manager.py b/python/sglang/srt/managers/tokenizer_manager.py
index 2621ccd..181df85 100644
--- a/python/sglang/srt/managers/tokenizer_manager.py
+++ b/python/sglang/srt/managers/tokenizer_manager.py
@@ -58,6 +58,10 @@ from sglang.srt.sampling.sampling_params import SamplingParams
from sglang.srt.server_args import PortArgs, ServerArgs
from sglang.srt.utils import is_generation_model, is_multimodal_model
+from rpdTracerControl import rpdTracerControl
+rpdTracerControl.skipCreate()
+
+
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
logger = logging.getLogger(__name__)
@@ -514,10 +518,20 @@ class TokenizerManager:
self.send_to_scheduler.send_pyobj(req)
def start_profile(self):
+ rpd = rpdTracerControl()
+ rpd.setPythonTrace(True)
+ rpd.start()
+ rpd.rangePush("", "tokenizer_manager", "")
+ logger.info("tokenizer_manager rpd profiling started!")
req = ProfileReq.START_PROFILE
self.send_to_scheduler.send_pyobj(req)
def stop_profile(self):
+ rpd = rpdTracerControl()
+ rpd.rangePop()
+ rpd.stop()
+ rpd.flush()
+ logger.info("rpd profiling is done inside tokenizer_manager!")
req = ProfileReq.STOP_PROFILE
self.send_to_scheduler.send_pyobj(req)
diff --git a/python/sglang/srt/server.py b/python/sglang/srt/server.py
index 7111c93..2bd722c 100644
--- a/python/sglang/srt/server.py
+++ b/python/sglang/srt/server.py
@@ -30,6 +30,8 @@ import threading
import time
from http import HTTPStatus
from typing import Dict, List, Optional, Union
+from rpdTracerControl import rpdTracerControl
+rpdTracerControl.skipCreate()
# Fix a bug of Python threading
setattr(threading, "_register_atexit", lambda *args, **kwargs: None)
@@ -152,6 +154,11 @@ async def flush_cache():
@app.post("/start_profile")
async def start_profile():
"""Start profiling."""
+ rpd = rpdTracerControl()
+ rpd.setPythonTrace(True)
+ rpd.start()
+ rpd.rangePush("", "server rpd profile range", "")
+ logger.info("rpd profiling started in server.py!")
tokenizer_manager.start_profile()
return Response(
content="Start profiling.\n",
@@ -164,6 +171,11 @@ async def start_profile():
async def stop_profile():
"""Stop profiling."""
tokenizer_manager.stop_profile()
+ rpd = rpdTracerControl()
+ rpd.rangePop()
+ rpd.stop()
+ rpd.flush()
+ logger.info("rpd profiling is done in server.py!")
return Response(
content="Stop profiling. This will take some time.\n",
status_code=200,
```
4. As an example for grok1 profiling, create a dummy_grok1 directory containing config.json (content below) and copy it to the path passed as "--model-path" if you want to use the example server.sh file provided.
```bash
cat ../dummy_grok1/config.json
{
"architectures": [
"Grok1ModelForCausalLM"
],
"embedding_multiplier_scale": 78.38367176906169,
"output_multiplier_scale": 0.5773502691896257,
"vocab_size": 131072,
"hidden_size": 6144,
"intermediate_size": 32768,
"max_position_embeddings": 8192,
"num_experts_per_tok": 2,
"num_local_experts": 8,
"num_attention_heads": 48,
"num_hidden_layers": 64,
"num_key_value_heads": 8,
"head_dim": 128,
"rms_norm_eps": 1e-05,
"rope_theta": 10000.0,
"model_type": "mixtral",
"torch_dtype": "bfloat16"
}
```
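Before launching with a hand-written dummy config, a quick consistency check can catch typos in the numbers. The sketch below applies common transformer conventions to the values shown above; these are not checks that SGLang is known to perform, and `check_config` is a hypothetical helper. The `tp_size` argument corresponds to the `--tp` value used when launching the server.

```python
import json

# Subset of the dummy_grok1 config.json fields that the checks use.
DUMMY_GROK1_CONFIG = json.loads("""
{
  "hidden_size": 6144,
  "num_attention_heads": 48,
  "num_key_value_heads": 8,
  "head_dim": 128,
  "num_local_experts": 8,
  "num_experts_per_tok": 2
}
""")


def check_config(cfg, tp_size):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    if cfg["hidden_size"] != cfg["num_attention_heads"] * cfg["head_dim"]:
        problems.append("hidden_size != num_attention_heads * head_dim")
    if cfg["num_attention_heads"] % cfg["num_key_value_heads"] != 0:
        problems.append("attention heads not divisible by KV heads")
    if cfg["num_attention_heads"] % tp_size != 0:
        problems.append("attention heads not divisible by tp_size")
    if cfg["num_experts_per_tok"] > cfg["num_local_experts"]:
        problems.append("more experts per token than experts available")
    return problems


if __name__ == "__main__":
    print(check_config(DUMMY_GROK1_CONFIG, tp_size=8))
```

For the values above and `tp_size=8` the function returns an empty list, since 48 heads of dimension 128 give the hidden size of 6144 and 48 is divisible by both 8 KV heads and 8 ranks.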
5. Launch the server with the RPD-enabled script ./server.sh in one terminal inside the docker container.
#### Common Notes 2
- Remember to change model-path to the correct path.
- loadTracer.sh is needed for RPD profiling.
- SGLANG_TORCH_PROFILER_DIR is used by the default torch profiler.
- Do not use loadTracer.sh if you are using the torch profiler; simply use python3 -m sglang.launch_server.
server.sh
```bash
#!/bin/bash
# export SGLANG_TORCH_PROFILER_DIR=/data/sglang/
export SGLANG_TORCH_PROFILER_DIR=/sgl-workspace/sglang/profile/
# Get the current timestamp
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
# Define the log file with a timestamp
LOGFILE="sglang_server_log_$TIMESTAMP.json"
# Run the Python command and save the output to the log file
loadTracer.sh python3 -m sglang.launch_server \
--model-path /sgl-workspace/sglang/dummy_grok1 \
--tokenizer-path Xenova/grok-1-tokenizer \
--load-format dummy \
--quantization fp8 \
--tp 8 \
--port 30000 \
--disable-radix-cache 2>&1 | tee "$LOGFILE"
```
6. Open another terminal in the same docker container and run the RPD-enabled ./client.sh after you see the "The server is fired up and is ready to roll!" message in the server-side terminal.
#### Common Notes 3
- Use curl http://localhost:30000/start_profile and curl http://localhost:30000/stop_profile to control the start and end of profiling. Check sglang/python/sglang/srt/managers/scheduler.py for more details.
- Do not use the RPD profiler together with the PyTorch profiler, to avoid interference.
- The rocmProfileData/tools/rpd2tracing.py file is used to generate a JSON file from the RPD file.
client.sh
```bash
#!/bin/bash
# Start profiling via API
curl http://localhost:30000/start_profile -H "Content-Type: application/json"
# Benchmark serving using sglang with random dataset and tokenizer
# Define the log file with a timestamp
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOGFILE="sglang_client_log_$TIMESTAMP.json"
# Run the benchmark with specified parameters and save logs
python3 -m sglang.bench_serving \
    --backend sglang \
    --tokenizer Xenova/grok-1-tokenizer \
    --dataset-name random \
    --random-input 1024 \
    --random-output 1024 \
    --num-prompts 120 \
    --request-rate 8 \
    --output-file online.jsonl 2>&1 | tee "$LOGFILE"
# Stop profiling via API
curl http://localhost:30000/stop_profile -H "Content-Type: application/json"
# Convert tracing file to csv & json
sqlite3 trace.rpd ".mode csv" ".header on" ".output trace.csv" "select * from top;" ".output stdout"
python3 ./rocmProfileData/tools/rpd2tracing.py trace.rpd trace.json
```
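Step 6 relies on watching the server terminal for the ready message before starting the client. That wait could instead be scripted with a generic retry loop. This is a sketch: `wait_until` is a hypothetical helper, and the probe command is supplied by the caller rather than fixed here.

```bash
#!/bin/bash
# Sketch: retry a probe command until it succeeds or attempts run out.
# wait_until is a hypothetical helper; the probe is passed by the caller.
wait_until() {
    local attempts="$1" delay="$2"
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        # Run the probe quietly; success ends the wait.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}
```

A possible use is `wait_until 120 5 curl -sf http://localhost:30000/health && ./client.sh`. The `/health` path is an assumption here; check which readiness endpoint your SGLang version exposes before relying on it.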
7. Follow the [Perfetto docs](https://perfetto.dev/docs/visualization/large-traces) to visualize large JSON files. Adjust the parameters so that the trace.json file stays below 9GB.
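The size limit in step 7 can be checked before opening the trace. This is a minimal sketch: `trace_within_limit` is a hypothetical helper, and it assumes that 9GB means 9 × 2^30 bytes.

```bash
#!/bin/bash
# Sketch: report whether a trace file is below a byte limit.
# trace_within_limit is a hypothetical helper; the default limit assumes
# "9GB" means 9 * 2^30 bytes.
trace_within_limit() {
    local file="$1"
    local limit_bytes="${2:-$((9 * 1024 * 1024 * 1024))}"
    local size
    # wc -c gives the byte count without relying on GNU-only stat flags.
    size=$(wc -c < "$file") || return 2
    size=$((size))  # normalize padding that some wc implementations add
    if [ "$size" -lt "$limit_bytes" ]; then
        echo "ok: $file is $size bytes"
        return 0
    fi
    echo "too large: $file is $size bytes (limit $limit_bytes)" >&2
    return 1
}
```

Running `trace_within_limit trace.json` after client.sh finishes tells you whether the benchmark parameters need to be reduced before loading the trace.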
### Profiling SGLang Infer System with PyTorch Profiler
Follow these steps:
1. Apply the patch torch_profiler.patch. Note that you can modify "if self.tp_rank == 0" in the patch to allow more ranks to be recorded during profiling.
torch_profiler.patch
```diff
diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
index 62d1ff9..6ecd78c 100644
--- a/python/sglang/srt/managers/scheduler.py
+++ b/python/sglang/srt/managers/scheduler.py
@@ -240,7 +240,6 @@ class Scheduler:
             )
             self.profiler = torch.profiler.profile(
                 activities=[
-                    torch.profiler.ProfilerActivity.CPU,
                     torch.profiler.ProfilerActivity.CUDA,
                 ],
                 with_stack=True,
@@ -1033,9 +1032,11 @@ class Scheduler:
         if self.profiler is None:
             raise RuntimeError("Profiler is not enabled.")
         self.profiler.stop()
-        self.profiler.export_chrome_trace(
-            self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
-        )
+        if self.tp_rank == 0:
+            with open(f"stats_repro_{int(time.time())}.txt", "w") as f:
+                print(self.profiler.key_averages(group_by_input_shape=True).table(sort_by="cuda_time_total", row_limit=-1), file=f)
+            print("Profiling stats done.")
+
         logger.info("Profiler is done")
```
2. Create the model directory and, if you want to use the provided server.sh file, copy it to the path passed to "--model-path".
3. Modify the included server.sh by removing "loadTracer.sh" before the python command, then launch ./server.sh in one terminal inside the docker container.
4. Proceed as in step 6 of the RPD profiling section, but first remove the last 2 lines of client.sh, which convert the RPD file into CSV and JSON files. Then run the modified client.sh for PyTorch profiling.
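The script changes in steps 3 and 4 can be sketched as one helper that derives torch-profiler variants of the two scripts. This is a sketch, not a provided tool: `make_torch_variants` and the output file names are hypothetical, and it matches the conversion commands by content instead of by line position.

```bash
#!/bin/bash
# Sketch: derive PyTorch-profiler variants of server.sh and client.sh.
# make_torch_variants and the *_torch.sh names are hypothetical.
make_torch_variants() {
    local server_in="$1" client_in="$2" out_dir="$3"
    mkdir -p "$out_dir"

    # Drop the loadTracer.sh wrapper in front of the python3 command.
    sed 's/^loadTracer\.sh[[:space:]]\{1,\}//' "$server_in" \
        > "$out_dir/server_torch.sh"

    # Drop the RPD-to-CSV/JSON conversion commands at the end of client.sh.
    grep -v -e '^sqlite3 trace\.rpd' -e 'rpd2tracing\.py' "$client_in" \
        > "$out_dir/client_torch.sh"

    chmod +x "$out_dir/server_torch.sh" "$out_dir/client_torch.sh"
}
```

Calling `make_torch_variants server.sh client.sh torch_scripts` leaves the original scripts untouched, so the RPD workflow remains available alongside the torch one.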
-------
- [Torch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)
