chore: vendor sglang v0.5.10 snapshot

2026-04-24 12:29:36 +00:00
parent 78f0d15221
commit bded08301f
4308 changed files with 1200894 additions and 2 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -12,6 +12,7 @@ src/*.egg-info
 .deps/
 outputs/
-# Local heavyweight checkouts and generated experiment artifacts
+# Vendored dependencies. Track only the maintained SGLang fork/snapshot.
-third_party/
+third_party/*
+!third_party/sglang/
 *.log
--- a/third_party/sglang/.claude/skills/add-jit-kernel/SKILL.md
+++ b/third_party/sglang/.claude/skills/add-jit-kernel/SKILL.md
@@ -0,0 +1,607 @@
 ---
 name: add-jit-kernel
 description: Step-by-step tutorial for adding a new lightweight JIT CUDA kernel to sglang's jit_kernel module
 ---
 # Tutorial: Adding a New JIT Kernel to SGLang
 This tutorial walks through adding a simple element-wise scale operation as a JIT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
 ## Goal
 Add a new operation that scales each element of a tensor by a scalar factor:
 - Input: tensor `x` (CUDA) and scalar `factor` (float, passed at runtime)
 - Output: `x * factor` (element-wise), allocated internally unless the caller passes a pre-allocated `out`
 - Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
 ## When to use JIT vs AOT (`sgl-kernel`)
 - **JIT (`jit_kernel`)**: prefer this first for kernels that do **not** depend on CUTLASS or another large C++ project. It is the default choice for lightweight kernels that benefit from rapid iteration and first-use compilation.
 - **AOT (`sgl-kernel`)**: prefer this when the kernel **does** depend on CUTLASS or another large C++ project, or when it should live in `sgl-kernel/` and participate in the wheel build / torch op registration flow.
 - **Exception**: kernels that depend on `flashinfer`, or on CUTLASS that is already provided through `flashinfer`, can still be implemented as `jit_kernel`.
 ---
 ## Common Abstractions in `python/sglang/jit_kernel/include/sgl_kernel/`
 **Always prefer these abstractions over raw CUDA primitives.** They provide safety, readability, and consistency with the rest of the codebase.
 **Important include rule:** for every `#include <sgl_kernel/...>` line, add a short trailing comment explaining why that header is included (for example `// For TensorMatcher, SymbolicSize, SymbolicDevice`). This matches the current JIT kernel style and keeps include usage self-documenting.
 ### `utils.h` — Host-side utilities
 ```cpp
 #include <sgl_kernel/utils.h>
 ```
 - **`host::RuntimeCheck(cond, args...)`** — Assert a condition at runtime; throws `PanicError` with file/line info on failure. Prefer this over bare `assert`.
 - **`host::Panic(args...)`** — Unconditionally throw a `PanicError` with a descriptive message.
 - **`host::div_ceil(a, b)`** — Integer ceiling division `(a + b - 1) / b`.
 - **`host::irange(n)`** / **`host::irange(start, end)`** — Range views for cleaner loops.
 - **`host::pointer::offset(ptr, offsets...)`** — Byte-safe pointer arithmetic on `void*`. Use this instead of raw casts.
 ### `utils.cuh` — Device-side utilities + `LaunchKernel`
 ```cpp
 #include <sgl_kernel/utils.cuh>
 ```
 - **Type aliases**: `fp16_t`, `bf16_t`, `fp32_t`, `fp8_e4m3_t`, `fp8_e5m2_t` and their packed variants `fp16x2_t`, `bf16x2_t`, `fp32x2_t`, etc.
 - **`SGL_DEVICE`** — Expands to `__forceinline__ __device__`. Use on all device functions.
 - **`device::kWarpThreads`** — Constant `32`.
 - **`device::load_as<T>(ptr, offset)`** / **`device::store_as<T>(ptr, val, offset)`** — Type-safe loads/stores from `void*`.
 - **`device::pointer::offset(ptr, offsets...)`** — Pointer arithmetic on device.
 - **`host::LaunchKernel(grid, block, device_or_stream [, smem])`** — RAII kernel launcher that:
  - Resolves the CUDA stream from a `DLDevice` via TVM-FFI automatically.
  - Checks the CUDA error with file/line info after launch via `operator()(kernel, args...)`.
  - Supports `.enable_pdl(bool)` for PDL (Programmatic Dependent Launch, SM90+).
 - **`host::RuntimeDeviceCheck(cudaError_t)`** — Check a CUDA error; throw on failure.
 ### `tensor.h` — Tensor validation (`TensorMatcher`, Symbolic types)
 ```cpp
 #include <sgl_kernel/tensor.h>
 ```
 This is the **primary validation API** for all kernel launchers. Use it to validate every `tvm::ffi::TensorView` argument.
 - **`host::SymbolicSize{"name"}`** — A named symbolic dimension. Call `.set_value(n)` to pin it, `.unwrap()` to extract after verification.
 - **`host::SymbolicDType`** — Symbolic dtype. Use `.set_options<Ts...>()` to restrict allowed types.
 - **`host::SymbolicDevice`** — Symbolic device. Use `.set_options<kDLCUDA>()` to restrict to CUDA.
 - **`host::TensorMatcher({dims...})`** — Fluent builder for tensor validation:
  - `.with_dtype<T>()` — require a specific C++ type (e.g. `fp16_t`)
  - `.with_dtype<T1, T2, ...>()` — allow a set of types
  - `.with_device<kDLCUDA>(device_sym)` — require CUDA and bind the checked device to a `SymbolicDevice`
  - `.with_strides({strides...})` — validate strides (omit to require contiguous)
  - `.verify(tensor_view)` — execute the check; throws `PanicError` with full context on failure; **chainable** (`verify(a).verify(b)` to check multiple tensors with the same shape)
 **Typical pattern:**
 ```cpp
 auto N = SymbolicSize{"num_elements"};
 auto device = SymbolicDevice{};
 device.set_options<kDLCUDA>();
 TensorMatcher({N})  //
    .with_dtype<fp16_t>()
    .with_device<kDLCUDA>(device)
    .verify(dst)
    .verify(src);  // same shape, dtype, device as dst
 const size_t n = N.unwrap();
 const DLDevice dev = device.unwrap();
 ```
 ### `type.cuh` — `dtype_trait<T>` and `packed_t<T>`
 ```cpp
 #include <sgl_kernel/type.cuh>
 ```
 - **`dtype_trait<T>`** — Static trait struct for each scalar type. Provides:
  - `dtype_trait<T>::from(value)` — convert from another type (e.g. `fp32_t` → `fp16_t`)
  - `dtype_trait<T>::abs/sqrt/rsqrt/exp/sin/cos(x)` — type-dispatched unary math (primarily for `fp32_t`)
  - `dtype_trait<T>::max/min(x, y)` — type-dispatched binary math (primarily for `fp32_t`)
 - **`packed_t<T>`** — Two-element packed alias: `packed_t<fp16_t>` = `fp16x2_t`, `packed_t<bf16_t>` = `bf16x2_t`, `packed_t<fp32_t>` = `fp32x2_t`. Use for vectorized loads/stores.
 - **`device::cast<To, From>(value)`** — Type-safe cast using `dtype_trait`, e.g. `cast<fp32x2_t, fp16x2_t>(v)`.
 ### `vec.cuh` — Vectorized memory access (`AlignedVector`)
 ```cpp
 #include <sgl_kernel/vec.cuh>
 ```
 - **`device::AlignedVector<T, N>`** — Aligned storage for N elements of type T. N must be a power of two, `sizeof(T)*N <= 32`. Enables vectorized loads/stores for bandwidth efficiency. In terms of API/codegen constraints, the upper bound is 256-bit; in practice, 128-bit is the portable default, while 256-bit vectorization is typically only viable on `SM100+` and should be gated by an architecture check when needed.
  - `.load(ptr, offset)` — vectorized load from `ptr[offset]`
  - `.store(ptr, offset)` — vectorized store to `ptr[offset]`
  - `.fill(value)` — fill all lanes
  - `operator[](i)` — element access
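To make the size constraint concrete, the following Python sketch computes how many elements fit in a 128-bit or 256-bit vector for each dtype and checks the two constraints stated above. The helper name `lanes_for` and the byte-size table are illustrative assumptions, not part of SGLang.

```python
# Element sizes in bytes for the dtypes used in this tutorial.
ELEM_BYTES = {"fp16_t": 2, "bf16_t": 2, "fp32_t": 4}


def lanes_for(dtype: str, vector_bits: int = 128) -> int:
    """Return N such that sizeof(T) * N equals the requested vector width."""
    vector_bytes = vector_bits // 8
    n = vector_bytes // ELEM_BYTES[dtype]
    # Constraints stated above: N is a power of two and sizeof(T) * N <= 32.
    assert n > 0 and n & (n - 1) == 0, "N must be a power of two"
    assert ELEM_BYTES[dtype] * n <= 32, "vector wider than 256 bits"
    return n


for name in ELEM_BYTES:
    print(name, lanes_for(name, 128), lanes_for(name, 256))
```

The 128-bit column corresponds to the portable default; the 256-bit column is only meaningful where the architecture check mentioned above passes.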
 ### `tile.cuh` — `tile::Memory` (strided memory access pattern)
 ```cpp
 #include <sgl_kernel/tile.cuh>
 ```
 - `tile::Memory<T>` is fundamentally a **1D cooperative accessor** over a contiguous region.
 - **`device::tile::Memory<T>::cta(blockDim.x)`** — Creates a tile accessor where each thread handles `tid = threadIdx.x` with stride `tsize` (for `cta(blockDim.x)`, this is `blockDim.x`). Common for loops over a 1D array.
 - **`.load(ptr, offset)`** — loads `ptr[tid + offset * tsize]`
 - **`.store(ptr, val, offset)`** — stores to `ptr[tid + offset * tsize]`
 - **`.in_bound(n, offset)`** — boundary check
 For a **2D tile**, either flatten `(row, col)` into a linear tile index first, or compute the address manually with `ptr[row * stride + col]` using your thread/block coordinates.
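The indexing rule `ptr[tid + offset * tsize]` can be checked with a small host-side model. This Python sketch only simulates which indices each thread would touch; `tile_indices` is a hypothetical helper, and treating `in_bound(n, offset)` as `tid + offset * tsize < n` is an assumption based on the description above.

```python
def tile_indices(tid: int, tsize: int, n: int) -> list:
    """Indices visited by thread `tid` when it walks offsets 0, 1, 2, ..."""
    visited = []
    offset = 0
    # Assumed meaning of in_bound(n, offset): the computed index is below n.
    while tid + offset * tsize < n:
        visited.append(tid + offset * tsize)
        offset += 1
    return visited


tsize, n = 4, 10
all_visited = sorted(i for tid in range(tsize) for i in tile_indices(tid, tsize, n))
print(tile_indices(1, tsize, n))
print(all_visited == list(range(n)))
```

Under that assumption, the threads of one CTA cover the region exactly once, and neighbouring threads touch neighbouring addresses at every step.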
 ### `math.cuh` — Device math (`device::math::`)
 ```cpp
 #include <sgl_kernel/math.cuh>
 ```
 - `device::math::max/min<T>(a, b)` — type-dispatched binary math via `dtype_trait`
 - `device::math::abs/sqrt/rsqrt/exp/sin/cos<T>(x)` — type-dispatched unary math via `dtype_trait`
 ### `warp.cuh` — Warp-level primitives
 ```cpp
 #include <sgl_kernel/warp.cuh>
 ```
 - `device::warp::reduce_sum<T>(value)` — warp-level sum reduction via `__shfl_xor_sync`
 - `device::warp::reduce_max<T>(value)` — warp-level max reduction
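A butterfly reduction built on XOR shuffles can be modelled on the host to see why every lane ends up with the full result. The Python sketch below simulates the exchange pattern only; the order in which masks are applied inside `warp.cuh` is an assumption, and `warp_reduce_sum` here is a stand-in, not the device function.

```python
WARP_THREADS = 32  # same value as device::kWarpThreads


def warp_reduce_sum(values):
    """Simulate a butterfly reduction: each lane adds the value held by lane ^ mask."""
    lanes = list(values)
    assert len(lanes) == WARP_THREADS
    mask = WARP_THREADS // 2
    while mask >= 1:
        # Every lane reads its partner's value from the previous round.
        lanes = [lanes[i] + lanes[i ^ mask] for i in range(WARP_THREADS)]
        mask //= 2
    return lanes  # every lane now holds the full sum


result = warp_reduce_sum(range(WARP_THREADS))
print(result[0], len(set(result)))
```

Five rounds (masks 16, 8, 4, 2, 1) are enough for 32 lanes, and no shared memory is involved.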
 ### `cta.cuh` — CTA-level primitives
 ```cpp
 #include <sgl_kernel/cta.cuh>
 ```
 - `device::cta::reduce_max<T>(value, smem, min_value)` — CTA-wide max using shared memory + warp reduction. Caller is responsible for a `__syncthreads()` after if the result in `smem[0]` is needed.
 ### `atomic.cuh` — Atomic operations
 ```cpp
 #include <sgl_kernel/atomic.cuh>
 ```
 - `device::atomic::max(float* addr, float value)` — float atomic max (handles negative values correctly via bit tricks).
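The source only says that negative values are handled "via bit tricks". The Python sketch below shows why a trick is needed and illustrates one common approach: mapping float bits to an unsigned key whose integer order matches float order. This mapping is an assumption chosen for illustration, not a transcription of `atomic.cuh`; `float_bits` and `ordered_key` are hypothetical names.

```python
import struct


def float_bits(x: float) -> int:
    """Raw IEEE-754 bits of a float32 as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ordered_key(x: float) -> int:
    """Map float32 bits to an unsigned key whose integer order matches float order."""
    bits = float_bits(x)
    if bits & 0x80000000:        # negative: flip all bits
        return bits ^ 0xFFFFFFFF
    return bits | 0x80000000     # non-negative: set the sign bit


values = [2.0, -3.5, 0.5, -0.25, 100.0, 0.0]
# Comparing raw bits misorders negatives: -2.0 has larger bits than -1.0.
print(float_bits(-2.0) > float_bits(-1.0))
print(sorted(values, key=ordered_key) == sorted(values))
```

With an order-preserving key, an integer atomic max over the keys selects the same element as a float max would.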
 ### `runtime.cuh` — Occupancy and device info
 ```cpp
 #include <sgl_kernel/runtime.cuh>
 ```
 - `host::runtime::get_blocks_per_sm(kernel, block_dim)` — max active blocks per SM (occupancy)
 - `host::runtime::get_sm_count(device_id)` — number of SMs on the device
 - `host::runtime::get_cc_major(device_id)` — compute capability major version
 **Persistent kernel pattern** (cap blocks to SM count × occupancy):
 ```cpp
 static const uint32_t max_occ = runtime::get_blocks_per_sm(kernel, kBlockSize);
 static const uint32_t num_sm  = runtime::get_sm_count(device.unwrap().device_id);
 const auto num_blocks = std::min(num_sm * max_occ, div_ceil(n, kBlockSize));
 LaunchKernel(num_blocks, kBlockSize, device.unwrap())(kernel, params);
 ```
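The grid-size arithmetic in the persistent pattern can be sanity-checked on the host. In this Python sketch the SM count and occupancy are made-up example numbers (132 and 8), and `persistent_grid` is a hypothetical helper mirroring the `std::min(...)` expression above.

```python
def div_ceil(a: int, b: int) -> int:
    """Integer ceiling division, same formula as host::div_ceil."""
    return (a + b - 1) // b


def persistent_grid(n: int, block_size: int, num_sm: int, max_occ: int) -> int:
    """Blocks to launch: enough to cover n, but never more than can be resident."""
    return min(num_sm * max_occ, div_ceil(n, block_size))


# Example values only; real numbers come from the runtime queries above.
small = persistent_grid(1_000, 256, num_sm=132, max_occ=8)
large = persistent_grid(10_000_000, 256, num_sm=132, max_occ=8)
print(small, large)
```

Small inputs launch just the blocks they need, while large inputs are capped at the resident-block limit and rely on each block looping over the remaining work.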
 ---
 ## Step 0 (optional): Generate a `.clangd` config for better IDE support
 ```bash
 python -m sglang.jit_kernel
 ```
 ---
 ## Step 1: Implement the CUDA kernel in `jit_kernel/csrc/`
 Create `python/sglang/jit_kernel/csrc/elementwise/scale.cuh`.
 The implementation fully uses the project abstractions described above:
 ```cpp
 #include <sgl_kernel/tensor.h>   // For TensorMatcher, SymbolicSize, SymbolicDevice
 #include <sgl_kernel/type.cuh>   // For dtype_trait, fp16_t, bf16_t, fp32_t
 #include <sgl_kernel/utils.h>    // For RuntimeCheck, div_ceil
 #include <sgl_kernel/utils.cuh>  // For LaunchKernel, SGL_DEVICE
 #include <sgl_kernel/vec.cuh>    // For AlignedVector
 #include <dlpack/dlpack.h>
 #include <tvm/ffi/container/tensor.h>
 namespace {
 // ----------------------------------------------------------------
 // Kernel: element-wise scale using vectorized 128-bit loads/stores
 // T       = fp16_t | bf16_t | fp32_t
 // kVecN   = number of elements per vector load (e.g. 8 for fp16)
 // factor  = runtime scale factor
 // ----------------------------------------------------------------
 template <typename T, int kVecN>
 __global__ void scale_kernel(T* __restrict__ dst,
                              const T* __restrict__ src,
                              float factor,
                              uint32_t n_total) {
  using vec_t = device::AlignedVector<T, kVecN>;
  const uint32_t n_vecs = n_total / kVecN;
  // --- vectorised body ---
  const uint32_t vec_stride = blockDim.x * gridDim.x;
  for (uint32_t vi = blockIdx.x * blockDim.x + threadIdx.x;
       vi < n_vecs;
       vi += vec_stride) {
    vec_t v;
    v.load(src, vi);
 #pragma unroll
    for (int i = 0; i < kVecN; ++i) {
      v[i] = static_cast<T>(static_cast<float>(v[i]) * factor);
    }
    v.store(dst, vi);
  }
  // --- scalar tail ---
  const uint32_t base = n_vecs * kVecN;
  const uint32_t scalar_stride = blockDim.x * gridDim.x;
  for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
       base + i < n_total;
       i += scalar_stride) {
    dst[base + i] = static_cast<T>(static_cast<float>(src[base + i]) * factor);
  }
 }
 // ----------------------------------------------------------------
 // Launcher: validates tensors, selects vector width, launches kernel
 // ----------------------------------------------------------------
 template <typename T>
 void scale(tvm::ffi::TensorView dst, tvm::ffi::TensorView src, float factor) {
  using namespace host;
  // 1. Validate input tensors with TensorMatcher
  SymbolicSize N = {"num_elements"};
  SymbolicDevice device_;
  device_.set_options<kDLCUDA>();
  TensorMatcher({N})  //
      .with_dtype<T>()
      .with_device<kDLCUDA>(device_)
      .verify(dst)
      .verify(src);  // same shape / dtype / device as dst
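  const size_t n_raw = N.unwrap();
  // Guard the narrowing cast: the kernel indexes with uint32_t
  RuntimeCheck(n_raw <= UINT32_MAX, "scale: num_elements exceeds uint32 range, got ", n_raw);
  const uint32_t n = static_cast<uint32_t>(n_raw);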
  const DLDevice device = device_.unwrap();
  RuntimeCheck(n > 0, "scale: num_elements must be > 0, got ", n);
  // 2. Choose vector width for 128-bit loads (16 bytes)
  //    fp16/bf16: 8 elements × 2 bytes = 16 bytes
  //    fp32:      4 elements × 4 bytes = 16 bytes
  constexpr int kVecN = 16 / sizeof(T);
  const uint32_t n_work_items = div_ceil(n, static_cast<uint32_t>(kVecN));
  // 3. Launch
  constexpr uint32_t kBlockSize = 256;
  const uint32_t grid = div_ceil(n_work_items, kBlockSize);
  LaunchKernel(grid, kBlockSize, device)(
      scale_kernel<T, kVecN>,
      static_cast<T*>(dst.data_ptr()),
      static_cast<const T*>(src.data_ptr()),
      factor,
      n);
 }
 }  // namespace
 ```
 **Key points:**
 - Include headers from `sgl_kernel/` — **not** raw CUDA headers for anything already covered
 - Add a short trailing `// For ...` explanation to every `#include <sgl_kernel/...>` line
 - Use `TensorMatcher` for all tensor validation; never manually check shape/dtype/device
 - Use `AlignedVector` for vectorised 128-bit loads/stores — significant bandwidth win
 - Use `LaunchKernel` — it resolves the stream and checks errors automatically
 - Use `RuntimeCheck` for runtime assertions with useful error messages
 - Prefer passing runtime scalars like `factor` directly unless compile-time specialisation is genuinely required
 - `fp16_t` / `bf16_t` / `fp32_t` are the project's type aliases (from `utils.cuh`)
 - `device::cast<To, From>` or `dtype_trait<T>::from(val)` for cross-type conversions
 - `device::math::` functions for device math instead of bare `__` intrinsics
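The split between the vectorised body and the scalar tail is easy to get wrong, so it helps to model the index arithmetic on the host. The Python sketch below mirrors the two grid-stride loops and the launcher's grid computation, then checks that every element is written exactly once. It is a model of the indexing only; `covered_indices` and `launch_grid` are hypothetical helpers.

```python
def launch_grid(n_total: int, vec_n: int, block: int = 256) -> int:
    """Grid size as computed in the launcher: div_ceil over work items."""
    n_work_items = (n_total + vec_n - 1) // vec_n
    return (n_work_items + block - 1) // block


def covered_indices(n_total: int, vec_n: int, block: int, grid: int) -> list:
    """Return every element index written by the modelled launch."""
    written = []
    n_vecs = n_total // vec_n
    base = n_vecs * vec_n
    stride = block * grid
    for tid in range(stride):  # tid = blockIdx.x * blockDim.x + threadIdx.x
        # Vectorised body: vector index vi covers vec_n consecutive elements.
        for vi in range(tid, n_vecs, stride):
            written.extend(range(vi * vec_n, (vi + 1) * vec_n))
        # Scalar tail: the remaining n_total % vec_n elements.
        for i in range(tid, n_total - base, stride):
            written.append(base + i)
    return written


for n in (1, 127, 128, 1024, 4097):
    grid = launch_grid(n, 8)
    assert sorted(covered_indices(n, 8, 256, grid)) == list(range(n))
print("all sizes covered exactly once")
```

The sizes match the ones used in the test file in Step 4, which are chosen so that both the empty-body case (`n < kVecN`) and a non-empty tail are exercised.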
 ---
 ## Step 2: Add the Python wrapper in `jit_kernel/`
 Create `python/sglang/jit_kernel/scale.py`:
 ```python
 from __future__ import annotations
 from typing import TYPE_CHECKING
 import torch
 from sglang.jit_kernel.utils import cache_once, load_jit, make_cpp_args
 if TYPE_CHECKING:
    from tvm_ffi.module import Module
 @cache_once
 def _jit_scale_module(dtype: torch.dtype) -> Module:
    """Compile and cache the JIT scale module for a given dtype."""
    args = make_cpp_args(dtype)
    return load_jit(
        "scale",
        *args,
        cuda_files=["elementwise/scale.cuh"],
        cuda_wrappers=[("scale", f"scale<{args}>")],
    )
 def scale(src: torch.Tensor, factor: float, out: torch.Tensor | None = None) -> torch.Tensor:
    """
    Element-wise scale: dst = src * factor.
    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.
    Parameters
    ----------
    src    : CUDA tensor (FP16 / BF16 / FP32)
    factor : scale factor
    out    : optional pre-allocated output tensor (same shape/dtype as src)
    Returns
    -------
    Scaled tensor (dst = src * factor).
    """
    if not src.is_cuda:
        raise RuntimeError("src must be a CUDA tensor")
    if src.dtype not in (torch.float16, torch.bfloat16, torch.float32):
        raise RuntimeError(
            f"Unsupported dtype {src.dtype}. Supported: float16, bfloat16, float32"
        )
    if out is None:
        out = torch.empty_like(src)
    else:
        if out.shape != src.shape:
            raise RuntimeError("out shape must match src")
        if out.dtype != src.dtype:
            raise RuntimeError("out dtype must match src")
        if out.device != src.device:
            raise RuntimeError("out device must match src")
    # Keep the Python wrapper thin, but still enforce the basic preconditions
    # that the current JIT/FFI path does not reject safely on its own.
    module = _jit_scale_module(src.dtype)
    module.scale(out, src, factor)
    return out
 ```
 **Key points:**
 - Use `cache_once` — **not** `functools.lru_cache` (incompatible with `torch.compile`)
 - `load_jit` first arg(s) form the unique build marker; same marker = same cached binary
 - Only include compile-time specialisation knobs in the build marker; runtime values like `factor` should stay runtime unless the kernel truly needs templating
 - `cuda_wrappers`: `(export_name, kernel_symbol)` — `export_name` is called from Python
 - Keep Python launchers thin, but still validate the basic invariants (`is_cuda`, supported dtype, `out` metadata). In the current JIT/FFI path, invalid tensors are not always rejected safely before launch
 - `make_cpp_args(dtype, ...)` converts `torch.dtype` to C++ type alias:
 | `torch.dtype`      | C++ type   |
 |--------------------|------------|
 | `torch.float16`    | `fp16_t`   |
 | `torch.bfloat16`   | `bf16_t`   |
 | `torch.float32`    | `fp32_t`   |
 ---
 ## Step 3 (optional): Tune JIT build flags
 ```python
 return load_jit(
    "scale",
    *args,
    cuda_files=["elementwise/scale.cuh"],
    cuda_wrappers=[("scale", f"scale<{args}>")],
    extra_cuda_cflags=["-O3", "--use_fast_math"],
 )
 ```
 If your kernel requires SM90+, raise a clear Python error before calling `load_jit`:
 ```python
 if torch.cuda.get_device_capability()[0] < 9:
    raise RuntimeError("This kernel requires SM90 (Hopper) or later")
 ```
 ---
 ## Step 4: Write tests (required)
 JIT kernel tests live under `python/sglang/jit_kernel/tests/`. **CI does not run `pytest` in that directory directly.** The unified runner `test/run_suite.py` discovers every `test_*.py` there (and every `bench_*.py` under `benchmark/`), collects `register_*_ci(...)` calls by **statically parsing each file’s AST**, and executes the selected suite. Every test file must register at least one CUDA entry or the collector fails its sanity check.
 - **PR / per-commit CUDA suites** (see `test/run_suite.py` → `PER_COMMIT_SUITES`): JIT unit tests use `stage-b-kernel-unit-1-gpu-large` (see `.github/workflows/pr-test-jit-kernel.yml`: `python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-1-gpu-large`).
 - **Nightly kernel suite**: `nightly-kernel-1-gpu` with `--nightly` — typically used with `SGLANG_JIT_KERNEL_RUN_FULL_TESTS=1` in CI for expanded parameter grids (see `python/sglang/jit_kernel/utils.py` → `should_run_full_tests` / `get_ci_test_range`). Wired in `.github/workflows/nightly-test-nvidia.yml` (e.g. `python3 run_suite.py --hw cuda --suite nightly-kernel-1-gpu --nightly --continue-on-error`).
 Registration pattern (module level, **literal** `est_time` and `suite` strings — required for AST parsing):
 ```python
 from sglang.test.ci.ci_register import register_cuda_ci
 register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large")
 # Optional second registration: same file also listed under the nightly kernel suite
 # register_cuda_ci(est_time=120, suite="nightly-kernel-1-gpu", nightly=True)
 ```
 Keep `est_time` and `suite` as literal values. `run_suite.py` collects them from the file AST, so computed values and helper wrappers can break CI discovery.
 Use `register_cuda_ci(..., disabled="reason")` if the file must stay in-tree but should be skipped in CI (e.g. multi-GPU only).
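To see why literal arguments matter, here is a minimal sketch of static collection with Python's `ast` module. It is not the collector in `run_suite.py`; `collect_registrations` is a hypothetical helper that only demonstrates the general idea that a parser working on source text cannot evaluate names or helper calls.

```python
import ast

SOURCE_OK = 'register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large")\n'
SOURCE_BAD = (
    'SUITE = "stage-b-kernel-unit-1-gpu-large"\n'
    "register_cuda_ci(est_time=30, suite=SUITE)\n"
)


def collect_registrations(source: str) -> list:
    """Collect keyword arguments of module-level register_cuda_ci(...) calls."""
    found = []
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Expr) and isinstance(node.value, ast.Call)):
            continue
        call = node.value
        if not (isinstance(call.func, ast.Name) and call.func.id == "register_cuda_ci"):
            continue
        # literal_eval accepts only literal nodes; a name such as SUITE raises ValueError.
        found.append({kw.arg: ast.literal_eval(kw.value) for kw in call.keywords})
    return found


print(collect_registrations(SOURCE_OK))
try:
    collect_registrations(SOURCE_BAD)
except ValueError:
    print("non-literal argument cannot be collected statically")
```

The file is never imported or executed in this model, which is why a value that is obvious at runtime is still invisible to the collector.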
 **Run like CI** (from repo root):
 ```bash
 cd test && python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-1-gpu-large
 ```
 For fast iteration you can still run `pytest` on a single file locally; CI coverage is via `run_suite.py`.
 Create `python/sglang/jit_kernel/tests/test_scale.py`:
 ```python
 import pytest
 import torch
 from sglang.jit_kernel.scale import scale
 from sglang.test.ci.ci_register import register_cuda_ci
 register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large")
 @pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
 @pytest.mark.parametrize("size", [1, 127, 128, 1024, 4097])  # cover tail remainder
 @pytest.mark.parametrize("factor", [0.5, 1.0, 2.0, 3.0])
 def test_scale_correctness(dtype, size, factor):
    src = torch.randn(size, dtype=dtype, device="cuda")
    out = scale(src, factor)
    expected = src * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)
 @pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
 def test_scale_out_param(dtype):
    src = torch.randn(1024, dtype=dtype, device="cuda")
    out = torch.empty_like(src)
    result = scale(src, 2.0, out=out)
    assert result is out
    torch.testing.assert_close(out, src * 2.0, rtol=1e-2, atol=1e-2)
 def test_scale_cpu_error():
    src = torch.randn(128, dtype=torch.float16)  # CPU tensor
    with pytest.raises(RuntimeError, match="CUDA"):
        scale(src, 2.0)
 def test_scale_unsupported_dtype():
    src = torch.randint(0, 10, (128,), dtype=torch.int32, device="cuda")
    with pytest.raises(RuntimeError, match="dtype"):
        scale(src, 2.0)
 if __name__ == "__main__":
    import sys
    sys.exit(pytest.main([__file__, "-v", "-s"]))
 ```
 ---
 ## Step 5: Add a benchmark (required)
 Benchmarks are `bench_*.py` files under `python/sglang/jit_kernel/benchmark/`. They are picked up by the same `run_suite.py` machinery as unit tests. Register them for **`stage-b-kernel-benchmark-1-gpu-large`** (PR JIT benchmark job: `python3 run_suite.py --hw cuda --suite stage-b-kernel-benchmark-1-gpu-large`).
 Create `python/sglang/jit_kernel/benchmark/bench_scale.py`:
 ```python
 import itertools
 import torch
 import triton
 import triton.testing
 from sglang.jit_kernel.benchmark.utils import (
    DEFAULT_DEVICE,
    DEFAULT_DTYPE,
    get_benchmark_range,
    run_benchmark,
 )
 from sglang.jit_kernel.scale import scale as jit_scale
 from sglang.test.ci.ci_register import register_cuda_ci
 register_cuda_ci(est_time=6, suite="stage-b-kernel-benchmark-1-gpu-large")
 SIZE_LIST = get_benchmark_range(
    full_range=[2**n for n in range(10, 20)],  # 1K … 512K elements
    ci_range=[4096, 65536],
 )
 configs = list(itertools.product(SIZE_LIST))
 @triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["jit", "torch"],
        line_names=["SGL JIT Kernel", "PyTorch"],
        styles=[("blue", "-"), ("red", "--")],
        ylabel="us",
        plot_name="scale-performance",
        args={},
    )
 )
 def benchmark(size: int, provider: str):
    src = torch.randn(size, dtype=DEFAULT_DTYPE, device=DEFAULT_DEVICE)
    factor = 2.0
    if provider == "jit":
        fn = lambda: jit_scale(src, factor)
    else:
        fn = lambda: src * factor
    return run_benchmark(fn)
 if __name__ == "__main__":
    benchmark.run(print_data=True)
 ```
 Run locally:
 ```bash
 python python/sglang/jit_kernel/benchmark/bench_scale.py
 ```
 Run the benchmark suite the way CI does:
 ```bash
 cd test && python3 run_suite.py --hw cuda --suite stage-b-kernel-benchmark-1-gpu-large
 ```
 ---
 ## Troubleshooting
 - **`No CI registry found in ...` from `run_suite.py`**: add a module-level `register_cuda_ci(...)` with literal `est_time` and `suite` (and optional `nightly=True`); starred args and non-literal values break AST collection
 - **JIT compilation fails**: ensure the `.cuh` file is under `python/sglang/jit_kernel/csrc/`; reduce template argument combinations
 - **CUDA crash / illegal memory access**: `CUDA_LAUNCH_BLOCKING=1`; `compute-sanitizer --tool memcheck python ...`
 - **Unstable benchmark results**: `run_benchmark` uses CUDA-graph-based timing by default
 ---
 ## References
 - `docs/developer_guide/development_jit_kernel_guide.md`
 - `test/run_suite.py` — suite names, discovery of `jit_kernel/tests/` and `jit_kernel/benchmark/`, execution entrypoint for CI
 - `python/sglang/test/ci/ci_register.py` — `register_cuda_ci` and AST registration rules
 - `python/sglang/jit_kernel/utils.py` — `cache_once`, `load_jit`, `make_cpp_args`, `should_run_full_tests`, `get_ci_test_range`
 - `python/sglang/jit_kernel/include/sgl_kernel/tensor.h` — `TensorMatcher`, `SymbolicSize/DType/Device`
 - `python/sglang/jit_kernel/include/sgl_kernel/utils.cuh` — type aliases, `LaunchKernel`, `SGL_DEVICE`
 - `python/sglang/jit_kernel/include/sgl_kernel/vec.cuh` — `AlignedVector`
 - `python/sglang/jit_kernel/include/sgl_kernel/tile.cuh` — `tile::Memory`
 - `python/sglang/jit_kernel/include/sgl_kernel/type.cuh` — `dtype_trait`, `packed_t`, `device::cast`
 - `python/sglang/jit_kernel/include/sgl_kernel/math.cuh` — `device::math::`
 - `python/sglang/jit_kernel/include/sgl_kernel/warp.cuh` — `warp::reduce_sum/max`
 - `python/sglang/jit_kernel/include/sgl_kernel/cta.cuh` — `cta::reduce_max`
 - `python/sglang/jit_kernel/include/sgl_kernel/atomic.cuh` — `atomic::max`
 - `python/sglang/jit_kernel/include/sgl_kernel/runtime.cuh` — occupancy / SM count helpers
 - `python/sglang/jit_kernel/csrc/add_constant.cuh` — minimal runnable reference
 - `python/sglang/jit_kernel/csrc/elementwise/rmsnorm.cuh` — real example using `TensorMatcher` + `LaunchKernel` + `tile::Memory`
 - `python/sglang/jit_kernel/csrc/elementwise/qknorm.cuh` — real example using `runtime::get_blocks_per_sm` + persistent kernel pattern
 - `python/sglang/jit_kernel/benchmark/utils.py` — benchmark helpers
 ## Summary of Files Created
 ```
 python/sglang/jit_kernel/csrc/elementwise/scale.cuh   # NEW: CUDA kernel
 python/sglang/jit_kernel/scale.py                     # NEW: Python wrapper
 python/sglang/jit_kernel/tests/test_scale.py          # NEW: Tests
 python/sglang/jit_kernel/benchmark/bench_scale.py     # NEW: Benchmark
 ```
--- a/third_party/sglang/.claude/skills/add-sgl-kernel/SKILL.md
+++ b/third_party/sglang/.claude/skills/add-sgl-kernel/SKILL.md
@@ -0,0 +1,363 @@
 ---
 name: add-sgl-kernel
 description: Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)
 ---
 # Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)
 This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
 ## Goal
 Add a new operation that scales each element of a tensor by a scalar factor:
 - Input: tensor `x` (CUDA) and scalar `factor` (float)
 - Output: `x * factor` (element-wise), written into `out`; the Python wrapper allocates `out` when none is supplied
 - Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
  - Dispatched via `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro (defined in `sgl-kernel/include/utils.h`)
 ## Two rules of thumb (must follow)
 1. **Prefer `python/sglang/jit_kernel` first** when the kernel does **not** depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.
 2. **Prefer `sgl-kernel`** when the kernel **does** depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow.
    - **Exception**: if the dependency is `flashinfer`, or CUTLASS that is already provided through `flashinfer`, the kernel can still be implemented as `jit_kernel`.
 In addition, every new kernel must ship with:
 - **Tests** (pytest)
 - **A benchmark script** (triton.testing)
 ---
 ## Repository integration map
 You will typically touch these files/areas:
 - Implementation: `sgl-kernel/csrc/elementwise/scale.cu` (pick the right subdirectory)
 - Public declarations: `sgl-kernel/include/sgl_kernel_ops.h`
 - Torch extension registration: `sgl-kernel/csrc/common_extension.cc`
 - Build: `sgl-kernel/CMakeLists.txt` (`set(SOURCES ...)`)
 - Python API: `sgl-kernel/python/sgl_kernel/` and `sgl-kernel/python/sgl_kernel/__init__.py`
 - Tests: `sgl-kernel/tests/test_scale.py`
 - Benchmarks: `sgl-kernel/benchmark/bench_scale.py`
 ---
 ## Step 1: Implement the kernel in `csrc/`
 Pick the right subdirectory:
 - `csrc/elementwise/` — for element-wise ops (our example)
 - `csrc/gemm/`, `csrc/attention/`, `csrc/moe/` — for other categories
 Create `sgl-kernel/csrc/elementwise/scale.cu`:
 ```cpp
 #include <ATen/cuda/CUDAContext.h>
 #include <c10/cuda/CUDAGuard.h>
 #include <torch/all.h>
 #include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16
 // scale_kernel: out[i] = input[i] * factor
 // Supports float, half (__half), __nv_bfloat16 via template T
 template <typename T>
 __global__ void scale_kernel(T* __restrict__ out,
                              const T* __restrict__ input,
                              float factor,
                              int64_t n) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
 }
 void scale(at::Tensor& out, const at::Tensor& input, double factor) {
  TORCH_CHECK(input.is_cuda(),       "input must be a CUDA tensor");
  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
  TORCH_CHECK(out.is_cuda(),         "out must be a CUDA tensor");
  TORCH_CHECK(out.is_contiguous(),   "out must be contiguous");
  TORCH_CHECK(out.sizes() == input.sizes(),  "out and input must have the same shape");
  TORCH_CHECK(out.scalar_type() == input.scalar_type(),
              "out and input must have the same dtype");
  const int64_t n = input.numel();
  const int threads = 256;
  const int blocks  = (n + threads - 1) / threads;
  // Set the device guard first so the stream queried below belongs to input's device
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  // Dispatches over float, float16, bfloat16
  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
        static_cast<c_type*>(out.data_ptr()),
        static_cast<const c_type*>(input.data_ptr()),
        static_cast<float>(factor),
        n);
    cudaError_t status = cudaGetLastError();
    TORCH_CHECK(status == cudaSuccess,
                "scale_kernel launch failed: ", cudaGetErrorString(status));
    return true;
  });
 }
 ```
 **Key points:**
 - Use `at::Tensor` (PyTorch tensors), `TORCH_CHECK` for validation, `at::cuda::getCurrentCUDAStream()` for stream
 - Keep Python wrappers thin; do shape/dtype/device validation in C++ right around the launch path
 - `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` covers `float`, `half` (FP16), `__nv_bfloat16` (BF16)
 - Add device error checking after every kernel launch
 - If a kernel only works on certain architectures, enforce that with `TORCH_CHECK` and skip logic in tests
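The launcher above uses one thread per element with an `if (idx < n)` guard. The Python sketch below reproduces the block-count arithmetic so the rounding and the number of guarded-out threads can be checked without a GPU; `launch_geometry` is a hypothetical helper.

```python
def launch_geometry(n: int, threads: int = 256):
    """Return (blocks, idle_threads) for a one-thread-per-element launch."""
    blocks = (n + threads - 1) // threads   # same rounding as the C++ launcher
    idle = blocks * threads - n             # threads rejected by `if (idx < n)`
    return blocks, idle


for n in (0, 1, 128, 1024, 65536):
    print(n, launch_geometry(n))
```

Note that `n == 0` yields zero blocks. A CUDA launch with an empty grid is generally reported as an invalid configuration, so a launcher that may receive empty tensors would likely want an early return before the launch; that is a suggestion, not something the tutorial code does.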
 ---
 ## Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`
 Edit `sgl-kernel/include/sgl_kernel_ops.h`, add to the elementwise section:
 ```cpp
 void scale(at::Tensor& out, const at::Tensor& input, double factor);
 ```
 ---
 ## Step 3: Register the op in `csrc/common_extension.cc`
 Edit `sgl-kernel/csrc/common_extension.cc`, inside `TORCH_LIBRARY_FRAGMENT(sgl_kernel, m)`:
 ```cpp
 // From csrc/elementwise
 m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
 m.impl("scale", torch::kCUDA, &scale);
 ```
 **Key points:**
 - `Tensor!` means in-place / mutable output argument
 - The schema is important for `torch.compile` and for consistent call signatures
 - Write the torch schema with PyTorch scalar types (`float` here). A schema `float` is passed to C++ as `double`, so the launcher signature takes `double factor` and narrows it to `float` before the kernel launch
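Because the schema `float` arrives in C++ as a 64-bit `double` and the launcher narrows it with `static_cast<float>(factor)`, some factors change value slightly before the multiply. The sketch below reproduces that narrowing in plain Python with `struct`; the helper name `to_float32` is illustrative only.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# 0.5 is exactly representable in 32 bits, 0.1 is not.
print(to_float32(0.5) == 0.5)
print(to_float32(0.1) == 0.1)
print(abs(to_float32(0.1) - 0.1))
```

This is one reason the tests compare with tolerances rather than exact equality, and why factors such as `0.5`, `1.0`, and `2.0` are convenient test values.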
 ---
 ## Step 4: Add the new source file to `CMakeLists.txt`
 Edit `sgl-kernel/CMakeLists.txt`, add to `set(SOURCES ...)`:
 ```cmake
 csrc/elementwise/scale.cu
 ```
 **Key points:**
 - Keep the list **alphabetically sorted** (the file explicitly requires this)
 - If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic
 ---
 ## Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`
 Prefer following the existing module organization first. For elementwise kernels, the usual pattern is:
 - implement the Python wrapper in `sgl-kernel/python/sgl_kernel/elementwise.py`
 - then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py`
 For example, in `sgl-kernel/python/sgl_kernel/elementwise.py`, add:
 ```python
 import torch
 def scale(
    input: torch.Tensor,
    factor: float,
    out: torch.Tensor | None = None,
 ) -> torch.Tensor:
    """
    Element-wise scale: out = input * factor.
    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.
    Parameters
    ----------
    input  : CUDA input tensor
    factor : scale factor (float)
    out    : optional pre-allocated CUDA output tensor (same shape/dtype as input)
    """
    if out is None:
        out = torch.empty_like(input)
    torch.ops.sgl_kernel.scale.default(out, input, factor)
    return out
 ```
 Then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py` following the existing import style used by other kernels.
 ---
 ## Step 6: Write tests (required)
 Create `sgl-kernel/tests/test_scale.py`:
 ```python
 import pytest
 import torch
 import sgl_kernel
@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
 def test_scale_correctness(dtype, size, factor):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out   = torch.empty_like(input)
    result = sgl_kernel.scale(input, factor, out=out)
    assert result is out
    expected = input * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)
 def test_scale_shape_mismatch():
    input = torch.randn(128, dtype=torch.float16, device="cuda")
    out   = torch.empty(256, dtype=torch.float16, device="cuda")
    with pytest.raises(RuntimeError, match="same shape"):
        sgl_kernel.scale(input, 2.0, out=out)
 def test_scale_cpu_input():
    input = torch.randn(128, dtype=torch.float16)  # CPU
    out   = torch.empty_like(input)
    with pytest.raises(RuntimeError, match="CUDA"):
        sgl_kernel.scale(input, 2.0, out=out)
 if __name__ == "__main__":
    import sys
    sys.exit(pytest.main([__file__, "-q"]))
 ```
 ---
 ## Step 7: Add a benchmark (required)
 Create `sgl-kernel/benchmark/bench_scale.py`:
 ```python
 import itertools
 import torch
 import triton
 import triton.testing
 import sgl_kernel
 from sglang.utils import is_in_ci
 IS_CI = is_in_ci()
 dtypes  = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
 sizes   = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K … 512K
 factors = [2.0]
 configs = list(itertools.product(dtypes, sizes))
 def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
    return input * factor
@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["dtype", "size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["sglang", "torch"],
        line_names=["SGL Kernel", "PyTorch"],
        styles=[("green", "-"), ("red", "--")],
        ylabel="µs (median)",
        plot_name="scale-performance",
        args={},
    )
 )
 def benchmark(dtype, size, provider):
    input  = torch.randn(size, dtype=dtype, device="cuda")
    out    = torch.empty_like(input)
    factor = 2.0
    if provider == "sglang":
        fn = lambda: sgl_kernel.scale(input, factor, out=out)
    else:
        fn = lambda: torch_scale(input, factor)
    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
        fn, quantiles=[0.5, 0.2, 0.8]
    )
    return 1000 * ms, 1000 * min_ms, 1000 * max_ms  # perf_report expects (y, y_min, y_max)
 if __name__ == "__main__":
    benchmark.run(print_data=True)
 ```
 ---
 ## Step 8: Build
 Run the build from the `sgl-kernel` directory:
 ```bash
 cd sgl-kernel
 make build -j16
 ```
 If you need to limit host resource usage:
 ```bash
 cd sgl-kernel
 make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
 ```
 ---
 ## Step 9: Validate
 After building successfully, run the test and benchmark:
 ```bash
 pytest sgl-kernel/tests/test_scale.py -q
 python sgl-kernel/benchmark/bench_scale.py
 ```
 ---
 ## Troubleshooting
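 - **Async CUDA errors**: rerun with `CUDA_LAUNCH_BLOCKING=1` so the error is reported at the failing launch instead of at a later synchronization point
 - **Memory errors**: run under `compute-sanitizer --tool memcheck python ...`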
 - **Build is too slow / OOM**: reduce `MAX_JOBS` and `SGL_KERNEL_COMPILE_THREADS`
 - **Binary bloat**: use `sgl-kernel/analyze_whl_kernel_sizes.py`
 - **CMake sources list**: if your `.cu` file is missing from `SOURCES`, the symbol will be undefined at link time
 ---
 ## References
 - `sgl-kernel/README.md`
 - `sgl-kernel/include/sgl_kernel_ops.h`
 - `sgl-kernel/csrc/common_extension.cc`
 - `sgl-kernel/CMakeLists.txt`
 - `sgl-kernel/include/utils.h` — `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro and friends
 - `sgl-kernel/csrc/elementwise/activation.cu` — reference for the FP16/BF16/FP32 dispatch pattern
 ## Summary of Files Created/Modified
 ```
 sgl-kernel/csrc/elementwise/scale.cu          # NEW: CUDA kernel + launcher
 sgl-kernel/include/sgl_kernel_ops.h           # MODIFIED: C++ declaration
 sgl-kernel/csrc/common_extension.cc           # MODIFIED: schema + dispatch registration
 sgl-kernel/CMakeLists.txt                     # MODIFIED: add source file (alphabetical)
 sgl-kernel/python/sgl_kernel/elementwise.py   # MODIFIED: Python wrapper
 sgl-kernel/python/sgl_kernel/__init__.py      # MODIFIED: re-export Python API
 sgl-kernel/tests/test_scale.py                # NEW: tests
 sgl-kernel/benchmark/bench_scale.py           # NEW: benchmark
 ```
--- a/third_party/sglang/.claude/skills/ci-workflow-guide/SKILL.md
+++ b/third_party/sglang/.claude/skills/ci-workflow-guide/SKILL.md
@@ -0,0 +1,386 @@
 ---
 name: ci-workflow-guide
 description: Guide to SGLang CI workflow orchestration — stage ordering, fast-fail, gating, partitioning, execution modes, and debugging CI failures. Use when modifying CI workflows, adding stages, debugging CI pipeline issues, or understanding how tests are dispatched and gated across stages.
 ---
 # SGLang CI Workflow Orchestration Guide
 This skill covers the CI **infrastructure** layer — how tests are dispatched, gated, and fast-failed across stages. For test authoring (templates, fixtures, registration, model selection), see the [write-sglang-test skill](../write-sglang-test/SKILL.md).
 ---
 ## Naming Conventions
 - **Suite**: `stage-{a,b,c}-test-{gpu_count}-gpu-{hardware}` (e.g., `stage-b-test-1-gpu-small`)
 - **CI runner**: `{gpu_count}-gpu-{hardware}` (e.g., `1-gpu-5090`, `4-gpu-h100`, `8-gpu-h200`)
 ---
 ## Key Files
 | File | Role |
 |------|------|
 | `.github/workflows/pr-test.yml` | Main workflow — all stages, jobs, conditions, matrix definitions |
 | `.github/workflows/pr-gate.yml` | PR gating: draft check, `run-ci` label, per-user rate limiting |
 | `.github/actions/check-stage-health/action.yml` | Cross-job fast-fail: queries API for any failed job |
 | `.github/actions/wait-for-jobs/action.yml` | Stage gating: polls API until stage jobs complete |
 | `.github/actions/check-maintenance/action.yml` | Maintenance mode check |
 | `test/run_suite.py` | Suite runner: collects, filters, partitions, executes tests |
 | `python/sglang/test/ci/ci_register.py` | Test registration (AST-parsed markers), LPT auto-partition |
 | `python/sglang/test/ci/ci_utils.py` | `run_unittest_files()`: execution, retry, continue-on-error |
 | `scripts/ci/utils/slash_command_handler.py` | Handles slash commands from PR comments |
 ---
 ## Architecture Overview
 ```
 ┌──────────────┐
 │ build kernel │
 └──────┬───────┘
        │
        ├─ check-changes ──── detects which packages changed
        │                      (main_package, sgl_kernel, jit_kernel, multimodal_gen)
        │
        ├─ call-gate ──────── pr-gate.yml (draft? label? rate limit?)
        │
        ├─────────────────────────────────────────────────────┐
        │                                                     │
        ▼                                                     │
 ┌─────────────────────────────────────┐                      │
 │          Stage A (~3 min)           │                      │
 │         pre-flight check            │                      │
 │                                     │                      │
 │  ┌─────────────────────────────┐    │                      │
 │  │ stage-a-test-1-gpu-small    │    │                      │
 │  │ (small GPUs)                │    │                      │
 │  └─────────────────────────────┘    │                      │
 │  ┌─────────────────────────────┐    │                      │
 │  │ stage-a-test-cpu            │    │                      │
 │  │ (CPU)                       │    │                      │
 │  └─────────────────────────────┘    │                      │
 └──────┬──────────────────────────────┘                      │
        │                                                     │
        ▼                                                     ▼
 ┌─────────────────────────────────────┐          ┌──────────────────────────┐
 │          Stage B (~30 min)          │          │      kernel test         │
 │           basic tests               │          └──────────────────────────┘
 │                                     │          ┌──────────────────────────┐
 │  ┌─────────────────────────────┐    │          │   multimodal gen test    │
 │  │ stage-b-test-1-gpu-small    │    │          └──────────────────────────┘
 │  │ (small GPUs, e.g. 5090)     │    │
 │  └─────────────────────────────┘    │
 │  ┌─────────────────────────────┐    │
 │  │ stage-b-test-1-gpu-large    │    │
 │  │ (large GPUs, e.g. H100)     │    │
 │  └─────────────────────────────┘    │
 │  ┌─────────────────────────────┐    │
 │  │ stage-b-test-2-gpu-large    │    │
 │  │ (large GPUs, e.g. H100)     │    │
 │  └─────────────────────────────┘    │
 └──────┬──────────────────────────────┘
        │
        ▼
 ┌─────────────────────────────────────┐
 │          Stage C (~30 min)          │
 │          advanced tests             │
 │                                     │
 │  ┌─────────────────────────────┐    │
 │  │ stage-c-test-4-gpu-h100     │    │
 │  │ (H100 GPUs)                 │    │
 │  └─────────────────────────────┘    │
 │  ┌─────────────────────────────┐    │
 │  │ stage-c-test-8-gpu-h200     │    │
 │  │ (8 x H200 GPUs)             │    │
 │  └─────────────────────────────┘    │
 │  ┌─────────────────────────────┐    │
 │  │ stage-c-test-4-gpu-b200     │    │
 │  │ (4 x B200 GPUs)             │    │
 │  └─────────────────────────────┘    │
 │  ┌─────────────────────────────┐    │
 │  │ Other advanced tests        │    │
 │  │ (DeepEP, PD Disagg, GB300)  │    │
 │  └─────────────────────────────┘    │
 └──────┬──────────────────────────────┘
        │
        ▼
 ┌─────────────────────────────────────┐
 │         pr-test-finish              │
 │  aggregates all results, fails if   │
 │  any job failed/cancelled           │
 └─────────────────────────────────────┘
 ```
 **Every stage test job** includes a `check-stage-health` step after checkout — if any job in the run has already failed, the job fast-fails (red X) with a root cause annotation.
 **Scheduled runs** skip `wait-for-stage-*` jobs, running all stages in parallel. Fast-fail is also disabled.
 ---
 ## Fast-Fail Layers
 Fast-fail operates at four layers, from finest to coarsest:
 | Layer | Mechanism | Granularity | Disabled on schedule? |
 |-------|-----------|-------------|----------------------|
 | **1. Test method → file** | `unittest -f` (failfast) | One test method fails → entire test file stops immediately | Yes |
 | **2. File → suite** | `run_unittest_files()` default | One test file fails → entire suite stops (`--continue-on-error` off) | Yes |
 | **3. Job → job (same stage)** | `check-stage-health` action | One job fails → other waiting jobs in same stage fast-fail (red X) | Yes |
 | **4. Stage → stage (cross-stage)** | `wait-for-stage` + `needs` | Stage A fails → stage B/C jobs skip entirely (never get a runner) | Yes (wait jobs skipped) |
 - **Layer 1**: `-f` flag appended to all `python3 -m pytest` / `unittest` invocations in `ci_utils.py`
 - **Layer 2**: `--continue-on-error` flag in `run_suite.py` — off for PRs, on for scheduled runs
 - **Layer 3**: `check-stage-health` auto-detects `schedule` event and skips; filters out cascade failures to show only root cause jobs
 - **Layer 4**: `wait-for-stage-*` jobs are conditioned on `github.event_name == 'pull_request'` — skipped for scheduled runs
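Layer 1 relies on the standard `unittest` fail-fast behavior. The following is a minimal sketch of that behavior in isolation, using the programmatic `failfast=True` option (the equivalent of the `-f` command-line flag). The `run_failfast_demo` helper and its two test methods are hypothetical and are not part of `ci_utils.py`.

```python
import io
import unittest


def run_failfast_demo() -> unittest.TestResult:
    """Run a two-test case with failfast enabled and return the result."""

    class Demo(unittest.TestCase):
        def test_1_fails(self):
            self.assertEqual(1, 2)  # first failure stops the run

        def test_2_never_runs(self):
            self.assertTrue(True)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Demo)
    # failfast=True is the programmatic equivalent of `python3 -m unittest -f`.
    runner = unittest.TextTestRunner(stream=io.StringIO(), failfast=True)
    return runner.run(suite)


result = run_failfast_demo()
print(f"tests run: {result.testsRun}, failures: {len(result.failures)}")
```

Only the first test method executes; the second is never started because the runner stops at the first failure.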
 ---
 ## Execution Modes
 | Aspect | PR (`pull_request`) | Scheduled (`cron`, every 6h) | `/rerun-stage` (`workflow_dispatch`) |
 |--------|---------------------|------------------------------|--------------------------------------|
 | **Stage ordering** | Sequential: A → B → C via `wait-for-stage-*` | Parallel (all at once) | Single target stage only |
 | **Cross-job fast-fail** | Yes (`check-stage-health`) | No (`check-stage-health` skips on `schedule`) | Yes |
 | **continue-on-error** | No (stop at first failure within suite) | Yes (run all tests) | No |
 | **Retry** | Enabled | Enabled | Enabled |
 | **max_parallel** | 3 (default), 14 if `high priority` label | 14 | 3 (default), 14 if `high priority` |
 | **PR gate** | Yes (draft, label, rate limit) | Skipped | Skipped |
 | **Concurrency** | `cancel-in-progress: true` per branch | Queue (no cancel) | Isolated per stage+SHA |
 ---
 ## Stage Gating (`wait-for-jobs` action)
 `wait-for-stage-a` and `wait-for-stage-b` are lightweight `ubuntu-latest` jobs that poll the GitHub Actions API.
 **How it works:**
 1. Calls `listJobsForWorkflowRun` to list all jobs in the current run
 2. Matches jobs by exact name or prefix (for matrix jobs, e.g., `stage-b-test-1-gpu-small (3)`)
 3. If any matched job has `conclusion === 'failure'` → fail immediately (fast-fail)
 4. If all matched jobs are completed and count matches `expected_count` → success
 5. Otherwise → sleep `poll-interval-seconds` (default: 60s) and retry
 6. Timeout after `max-wait-minutes` (240 min for stage-a, 480 min for stage-b)
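One polling round (steps 2 to 4 above) can be sketched in Python as a pure function. This is an illustration under stated assumptions, not the action's actual implementation: `evaluate_stage` is a hypothetical name, and the job dictionaries only mimic the `name`, `status`, and `conclusion` fields of the jobs returned by `listJobsForWorkflowRun`.

```python
from typing import Dict, List


def evaluate_stage(jobs: List[Dict], specs: List[Dict]) -> str:
    """Return 'failure', 'success', or 'wait' for one polling round."""
    all_done = True
    for spec in specs:
        prefix = spec["prefix"]
        # Match the exact job name, or matrix jobs named "<prefix> (<index>)".
        matched = [
            j for j in jobs
            if j["name"] == prefix or j["name"].startswith(prefix + " (")
        ]
        # Any failed job in the stage fails the wait job immediately.
        if any(j.get("conclusion") == "failure" for j in matched):
            return "failure"
        completed = [j for j in matched if j.get("status") == "completed"]
        if len(completed) != len(matched) or len(matched) != spec["expected_count"]:
            all_done = False
    return "success" if all_done else "wait"


specs = [{"prefix": "stage-b-test-2-gpu-large", "expected_count": 2}]
jobs = [
    {"name": "stage-b-test-2-gpu-large (0)", "status": "completed", "conclusion": "success"},
    {"name": "stage-b-test-2-gpu-large (1)", "status": "in_progress", "conclusion": None},
]
state = evaluate_stage(jobs, specs)  # one matrix job is still running
print(state)
```

A caller would sleep for the poll interval whenever the result is `wait` and give up once the maximum wait time is exceeded. The sketch also shows why a wrong `expected_count` never reaches `success`: the completed count can never equal it.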
 **Job specs example** (stage-b):
 ```json
 [
  {"prefix": "stage-b-test-1-gpu-small", "expected_count": 8},
  {"prefix": "stage-b-test-1-gpu-large", "expected_count": 14},
  {"prefix": "stage-b-test-2-gpu-large", "expected_count": 4},
  {"prefix": "stage-b-test-4-gpu-b200", "expected_count": 1}
 ]
 ```
 > **Critical**: `expected_count` must match the matrix size. If you add/remove matrix entries, update the wait job's spec accordingly.
 **PR only**: Condition `github.event_name == 'pull_request' && !inputs.target_stage` — scheduled runs and `/rerun-stage` skip these entirely, allowing parallel execution.
 ---
 ## Cross-Job Fast-Fail (`check-stage-health` action)
 Composite action called after checkout in every stage test job (21 jobs total across `pr-test.yml`, `pr-test-multimodal-gen.yml`, `pr-test-sgl-kernel.yml`, `pr-test-jit-kernel.yml`).
 **How it works:**
 1. Queries `listJobsForWorkflowRun` for the current workflow run
 2. Filters for **root cause failures only** — jobs with `conclusion === 'failure'` whose failing step is NOT `check-stage-health` (excludes cascade failures)
 3. If root cause failures found → calls `core.setFailed()` with the list of root cause job names
 4. If none → does nothing (step succeeds)
 **Cascade filtering**: When job A fast-fails due to health check, it also has `conclusion: failure`. Without filtering, job B would list both the original failure AND job A's fast-fail. The filter checks each failed job's `steps` array — if the failing step name contains `check-stage-health` or `Check stage health`, it's excluded from the root cause list.
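The cascade filter can be sketched in Python as follows. This is a simplified illustration, not the action's code: `root_cause_jobs` and `HEALTH_STEP_MARKERS` are hypothetical names, and the job dictionaries only mimic the `name`, `conclusion`, and `steps` fields of the API response.

```python
from typing import Dict, List

# Step-name fragments that identify the health check itself.
HEALTH_STEP_MARKERS = ("check-stage-health", "Check stage health")


def root_cause_jobs(jobs: List[Dict]) -> List[str]:
    """Names of failed jobs whose failing step is not the health check."""
    roots = []
    for job in jobs:
        if job.get("conclusion") != "failure":
            continue
        failed_steps = [
            s["name"] for s in job.get("steps", []) if s.get("conclusion") == "failure"
        ]
        # A job that failed in the health check is a cascade, not a root cause.
        is_cascade = any(
            marker in name for name in failed_steps for marker in HEALTH_STEP_MARKERS
        )
        if not is_cascade:
            roots.append(job["name"])
    return roots


jobs = [
    {"name": "stage-b-test-1-gpu-small (0)", "conclusion": "failure",
     "steps": [{"name": "Run test", "conclusion": "failure"}]},
    {"name": "stage-b-test-1-gpu-small (1)", "conclusion": "failure",
     "steps": [{"name": "Check stage health", "conclusion": "failure"}]},
    {"name": "stage-a-test-cpu", "conclusion": "success", "steps": []},
]
print(root_cause_jobs(jobs))
```

Only the job that failed in `Run test` is reported; the job that failed in the health check is treated as a cascade and dropped.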
 **Usage pattern:**
 ```yaml
 steps:
  - name: Checkout code
    uses: actions/checkout@v4
    ...
  - uses: ./.github/actions/check-stage-health
    id: stage-health
  - name: Install dependencies        # skipped automatically if health check failed
    ...                                # (default if: success() is false)
  - name: Run test                     # also skipped
    ...
 ```
 **Visual effect**: Job shows **red X** (failure) with error annotation showing root cause job names. Subsequent steps are naturally skipped (default `if: success()` is false after a failed step). No per-step `if` guards needed.
 **No stage filtering**: Checks ALL jobs in the run, not just the current stage. Any failure anywhere triggers fast-fail.
 **Error message example:**
 ```
 Fast-fail: skipping — root cause job(s): stage-b-test-1-gpu-small (0), stage-b-test-1-gpu-small (1)
 ```
 ---
 ## Within-Suite Failure Handling
 Controlled by `run_unittest_files()` in `python/sglang/test/ci/ci_utils.py`.
 ### Flags
 | Flag | PR default | Scheduled default | Effect |
 |------|------------|-------------------|--------|
 | `--continue-on-error` | Off | On | Off: stop at first failure. On: run all files, report all failures at end |
 | `--enable-retry` | On | On | Retry retriable failures (accuracy/perf assertions) |
 | `--max-attempts` | 2 | 2 | Max attempts per file including initial run |
 ### Retry Classification
 When a test fails and retry is enabled, the output is classified:
 **Non-retriable** (checked first — real code errors):
 `SyntaxError`, `ImportError`, `ModuleNotFoundError`, `NameError`, `TypeError`, `AttributeError`, `RuntimeError`, `CUDA out of memory`, `OOM`, `Segmentation fault`, `core dumped`, `ConnectionRefusedError`, `FileNotFoundError`
 **Retriable** (accuracy/performance):
 `AssertionError` with comparison patterns (`not greater than`, `not less than`, `not equal to`), `accuracy`, `score`, `latency`, `throughput`, `timeout`
 **Default**: Unknown `AssertionError` → retriable. Other unknown failures → not retriable.
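The classification order can be sketched as below. This is a simplified illustration, not the code in `ci_utils.py`: it assumes plain case-sensitive substring matching, and the tuple contents are copied from the lists above rather than from the source file.

```python
NON_RETRIABLE_PATTERNS = (
    "SyntaxError", "ImportError", "ModuleNotFoundError", "NameError",
    "TypeError", "AttributeError", "RuntimeError", "CUDA out of memory",
    "OOM", "Segmentation fault", "core dumped", "ConnectionRefusedError",
    "FileNotFoundError",
)
RETRIABLE_PATTERNS = (
    "not greater than", "not less than", "not equal to",
    "accuracy", "score", "latency", "throughput", "timeout",
)


def is_retriable(output: str) -> bool:
    """Classify failed test output; non-retriable patterns are checked first."""
    if any(p in output for p in NON_RETRIABLE_PATTERNS):
        return False  # real code error: retrying will not help
    if any(p in output for p in RETRIABLE_PATTERNS):
        return True   # accuracy/performance flake
    # Default: an unrecognized AssertionError is retried, anything else is not.
    return "AssertionError" in output


print(is_retriable("AssertionError: 0.71 not greater than 0.75"))
print(is_retriable("RuntimeError: CUDA error: device-side assert triggered"))
```

Because the non-retriable check runs first, output that contains both an accuracy keyword and, say, a `RuntimeError` is not retried.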
 ### How `continue_on_error` is set
 In `pr-test.yml`'s `check-changes` job:
 - `schedule` runs or `run_all_tests` flag → `continue_on_error = 'true'`
 - PR runs → `continue_on_error = 'false'`
 Each test job propagates via:
 ```yaml
 env:
  CONTINUE_ON_ERROR_FLAG: ${{ needs.check-changes.outputs.continue_on_error == 'true' && '--continue-on-error' || '' }}
 run: |
  python3 run_suite.py --hw cuda --suite <name> $CONTINUE_ON_ERROR_FLAG
 ```
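What the flag changes inside the suite runner can be sketched as follows. This is a deliberately simplified model, not `run_unittest_files()` itself: `run_files` and `fake_run` are hypothetical, and every failure is retried here, whereas the real runner only retries failures classified as retriable.

```python
from typing import Callable, List


def run_files(
    files: List[str],
    run_one: Callable[[str], bool],
    continue_on_error: bool = False,
    max_attempts: int = 2,
) -> List[str]:
    """Run each file with up to max_attempts tries; return the failed files."""
    failed = []
    for path in files:
        # any() stops at the first passing attempt.
        passed = any(run_one(path) for _ in range(max_attempts))
        if not passed:
            failed.append(path)
            if not continue_on_error:
                break  # PR default: stop at the first failing file
    return failed


executed = []


def fake_run(path: str) -> bool:
    executed.append(path)
    return path != "test_b.py"  # test_b.py always fails


files = ["test_a.py", "test_b.py", "test_c.py"]
pr_failed = run_files(files, fake_run)  # test_c.py is never started
print(pr_failed, executed)
```

With `continue_on_error=True` the same input still reports `test_b.py` as failed, but `test_c.py` also runs, which matches the scheduled-run behavior of reporting all failures at the end.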
 ---
 ## Test Partitioning
 Large suites are split across matrix jobs using the **LPT (Longest Processing Time) heuristic** in `ci_register.py:auto_partition()`:
 1. Sort tests by `est_time` descending, filename as tie-breaker (deterministic)
 2. Greedily assign each test to the partition with smallest cumulative time
 3. Result: roughly equal total time per partition
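The three steps above can be sketched in a few lines of Python. This is an illustration of the LPT heuristic only: `lpt_partition` and the sample test list are hypothetical and do not reproduce the signature or data structures of `auto_partition()`.

```python
from typing import List, Tuple


def lpt_partition(tests: List[Tuple[str, float]], num_partitions: int) -> List[List[str]]:
    """Greedy LPT: longest tests first, each to the currently lightest partition."""
    # Sort by estimated time descending; the filename breaks ties deterministically.
    ordered = sorted(tests, key=lambda t: (-t[1], t[0]))
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    loads = [0.0] * num_partitions
    for name, est_time in ordered:
        # Pick the partition with the smallest cumulative time (lowest index on ties).
        target = min(range(num_partitions), key=lambda i: (loads[i], i))
        partitions[target].append(name)
        loads[target] += est_time
    return partitions


tests = [("test_a.py", 300), ("test_b.py", 200), ("test_c.py", 200), ("test_d.py", 100)]
parts = lpt_partition(tests, 2)
print(parts)
```

In this example both partitions end up with 400 seconds of estimated work. A job with `--auto-partition-id k` would then run only the files in `parts[k]`.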
 **Partition table** (CUDA per-commit suites):
 | Suite | Partitions | Runner | max_parallel |
 |-------|-----------|--------|-------------|
 | `stage-a-test-1-gpu-small` | 1 (no matrix) | `1-gpu-5090` | — |
 | `stage-a-test-cpu` | 1 (no matrix) | `ubuntu-latest` | — |
 | `stage-b-test-1-gpu-small` | 8 | `1-gpu-5090` | 8 |
 | `stage-b-test-1-gpu-large` | 14 | `1-gpu-h100` | dynamic (3 or 14) |
 | `stage-b-test-2-gpu-large` | 4 | `2-gpu-h100` | — |
 | `stage-b-test-4-gpu-b200` | 1 (no matrix) | `4-gpu-b200` | — |
 | `stage-b-kernel-unit-1-gpu-large` | 1 (no matrix) | `1-gpu-h100` | — |
 | `stage-b-kernel-unit-8-gpu-h200` | 1 (no matrix) | `8-gpu-h200` | — |
 | `stage-b-kernel-benchmark-1-gpu-large` | 1 (no matrix) | `1-gpu-h100` | — |
 | `stage-c-test-4-gpu-h100` | 3 | `4-gpu-h100` | — |
 | `stage-c-test-8-gpu-h200` | 4 | `8-gpu-h200` | — |
 | `stage-c-test-8-gpu-h20` | 2 | `8-gpu-h20` | — |
 | `stage-c-test-deepep-4-gpu-h100` | 1 (no matrix) | `4-gpu-h100` | — |
 | `stage-c-test-deepep-8-gpu-h200` | 1 (no matrix) | `8-gpu-h200` | — |
 | `stage-c-test-4-gpu-b200` | 4 | `4-gpu-b200` | — |
 | `stage-c-test-4-gpu-gb200` | 1 (no matrix) | `4-gpu-gb200` | — |
 > **Note**: Kernel suites (`stage-b-kernel-*`) run via `pr-test-jit-kernel.yml` and `pr-test-sgl-kernel.yml`, not the main `pr-test.yml`. Multimodal diffusion uses `python/sglang/multimodal_gen/test/run_suite.py`, not `test/run_suite.py`.
 **Workflow usage:**
 ```yaml
 strategy:
  matrix:
    partition: [0, 1, 2, 3, 4, 5, 6, 7]
 steps:
  - run: |
      python3 run_suite.py --hw cuda --suite stage-b-test-1-gpu-small \
        --auto-partition-id ${{ matrix.partition }} --auto-partition-size 8
 ```
 ---
 ## check-changes Job
 Determines which test suites to run based on file changes.
 ### Detection Methods
 | Trigger | Method | Details |
 |---------|--------|---------|
 | `pull_request` | `dorny/paths-filter` | Detects changes via GitHub diff |
 | `workflow_dispatch` (with `pr_head_sha`) | GitHub API | `repos/{repo}/compare/main...{sha}` |
 | `schedule` / `run_all_tests` | Force all true | Runs everything |
 ### Output Flags
 | Output | Triggers |
 |--------|----------|
 | `main_package` | Stage A/B/C test suites |
 | `sgl_kernel` | Kernel wheel builds + kernel test suites |
 | `jit_kernel` | JIT kernel test workflow |
 | `multimodal_gen` | Multimodal-gen test workflow |
 > **Note**: `sgl_kernel` is forced to `false` when `target_stage` is set, because `sgl-kernel-build-wheels` won't run and wheel artifacts won't be available.
 ---
 ## Concurrency Control
 ```
 group: pr-test-{event_name}-{branch}-{pr_sha}-{stage}
 ```
 | Segment | Source | Purpose |
 |---------|--------|---------|
 | `event_name` | `github.event_name` | Prevents scheduled runs colliding with fork PRs named `main` |
 | `branch` | `github.head_ref \|\| github.ref_name` | Per-branch isolation |
 | `pr_sha` | `inputs.pr_head_sha \|\| 'current'` | Isolates `/rerun-stage` from main runs |
 | `stage` | `inputs.target_stage \|\| 'all'` | Allows parallel stage dispatches |
 `cancel-in-progress: true` for `pull_request` events (new push cancels old run), `false` for `workflow_call`.
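To see how the four segments keep runs apart, the group expression can be mirrored in Python. This is only a model of the expression in the table: `concurrency_group` is a hypothetical helper, and the branch name, SHA, and stage value in the examples are made up.

```python
from typing import Optional


def concurrency_group(
    event_name: str,
    ref_name: str,
    head_ref: Optional[str] = None,
    pr_head_sha: Optional[str] = None,
    target_stage: Optional[str] = None,
) -> str:
    """Mirror of pr-test-{event_name}-{branch}-{pr_sha}-{stage}."""
    branch = head_ref or ref_name      # github.head_ref || github.ref_name
    pr_sha = pr_head_sha or "current"  # inputs.pr_head_sha || 'current'
    stage = target_stage or "all"      # inputs.target_stage || 'all'
    return f"pr-test-{event_name}-{branch}-{pr_sha}-{stage}"


pr_group = concurrency_group("pull_request", "42/merge", head_ref="fix-typo")
rerun_group = concurrency_group(
    "workflow_dispatch", "main", pr_head_sha="abc1234", target_stage="stage-b"
)
print(pr_group)
print(rerun_group)
```

The two groups differ in every segment after the `pr-test-` prefix, so a `/rerun-stage` dispatch neither cancels nor queues behind the regular PR run.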
 ---
 ## How To: Add a New Stage Job
 1. Define the job in `pr-test.yml` with `needs: [check-changes, call-gate, wait-for-stage-X, ...]`
 2. Copy the `if:` condition pattern from an existing same-stage job (handles `target_stage`, `schedule`, `main_package`)
 3. Add `checkout` step
 4. Add `check-stage-health` step (after checkout) — if any prior job failed, `core.setFailed()` fires and all subsequent steps auto-skip via default `if: success()`
 5. Add `check-maintenance` step
 6. Add `download-artifact` step if `sgl_kernel` changed
 7. Add `install dependencies` step
 8. Add `run test` step with `$CONTINUE_ON_ERROR_FLAG`
 9. Add `upload-cuda-coredumps` step with `if: always()`
 10. Register the suite name in `PER_COMMIT_SUITES` in `test/run_suite.py`
 11. If using matrix, add `--auto-partition-id` and `--auto-partition-size` to the run command
 12. **Update `wait-for-stage-X`** job spec with the new job name and `expected_count` (if matrix)
 13. **Add the job to `pr-test-finish.needs`** list
 ---
 ## How To: Debug CI Failures
 | Symptom | Likely cause | What to check |
 |---------|-------------|---------------|
 | Many stage-B/C jobs show a red X with their remaining steps skipped | Earlier job failed, `check-stage-health` triggered | Find the root cause job named in the fast-fail annotation |
 | `wait-for-stage-b` timeout | `expected_count` doesn't match matrix size | Verify job spec counts match `matrix:` array length |
 | `pr-test-finish` fails but all jobs green | A job was `cancelled` (counts as failure in finish) | Check concurrency cancellation |
 | Tests pass locally but fail in CI | Partition assignment, runner GPU type, or `est_time` inaccuracy | Check which partition the test lands in; verify runner label |
 | Flaky test retried and passed | Retriable failure (accuracy/perf) | Check `[CI Retry]` markers in job logs |
 | Flaky test NOT retried | Matched non-retriable pattern | Check if error matches `NON_RETRIABLE_PATTERNS` in `ci_utils.py` |
 ---
 ## Slash Commands
 | Command | Effect |
 |---------|--------|
 | `/tag-run-ci-label` | Adds `run-ci` label to PR |
 | `/rerun-failed-ci` | Reruns failed jobs in the latest workflow run |
 | `/tag-and-rerun-ci` | Adds label + reruns |
 | `/rerun-stage <stage>` | Dispatches `pr-test.yml` with `target_stage=<stage>` |
 | `/rerun-test <test-file>` | Reruns a specific test file via `rerun-test.yml` |
 Handled by `scripts/ci/utils/slash_command_handler.py`, which runs from `.github/workflows/slash-command-handler.yml`.
--- a/third_party/sglang/.claude/skills/debug-cuda-crash/SKILL.md
+++ b/third_party/sglang/.claude/skills/debug-cuda-crash/SKILL.md
@@ -0,0 +1,657 @@
 ---
 name: debug-cuda-crash
 description: Call this skill when you need to debug CUDA crashes in SGLang using kernel API logging
 ---
 # Tutorial: Debugging CUDA Crashes with Kernel API Logging
 This tutorial shows you how to debug CUDA crashes and errors in SGLang using the `@debug_kernel_api` logging decorator.
 ## Goal
 When your code crashes with CUDA errors such as illegal memory access, device-side assert, out-of-bounds, or NaN/Inf, use kernel API logging to:
 - Capture input tensors BEFORE the crash occurs
 - Understand what data caused the problem
 - Track tensor shapes, dtypes, and values through the call boundary that triggered the crash
 - Detect numerical issues such as NaN, Inf, or obviously wrong shapes
 ## Why Use Kernel API Logging?
 **Problem**: CUDA errors often crash the program before normal debugging output is flushed.
 **Solution**: SGLang's `@debug_kernel_api` decorator logs inputs before execution, so you can still see what caused the crash even after the program aborts.
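The idea of logging before execution can be illustrated with a small stand-alone decorator. This is a conceptual sketch, not SGLang's `@debug_kernel_api`: `log_before_call` and `bad_lookup` are hypothetical, and a Python exception stands in for the CUDA crash.

```python
import functools
import io


def log_before_call(stream):
    """Hypothetical stand-in for a kernel-API logging decorator."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Write and flush BEFORE running fn, so the record survives a crash in fn.
            stream.write(f"Kernel API Call: {fn.__name__} args={args!r}\n")
            stream.flush()
            return fn(*args, **kwargs)

        return wrapper

    return decorator


log = io.StringIO()


@log_before_call(log)
def bad_lookup(index, table_size):
    if index >= table_size:
        raise RuntimeError("index out of range")  # stands in for a CUDA crash
    return index


try:
    bad_lookup(7, 4)
except RuntimeError:
    pass

print(log.getvalue(), end="")
```

The log record for the failing call is already written when the exception fires, which is the property that makes this style of logging useful when the process aborts.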
 ## What Is Covered?
 The current logging coverage focuses on the highest-value kernel boundaries in SGLang:
 - Custom ops registered through `register_custom_op(...)`
 - External custom ops registered through `register_custom_op_from_extern(...)`
 - LLM attention, linear, quantization, and multi-platform wrapper entry points
 - Diffusion attention impl, linear, rotary, and custom-op wrapper entry points
 - Selected direct `torch.ops.sglang.*` hotspots and model-specific bypasses
 This means the logging is useful for both LLM and diffusion kernel debugging, but it does not automatically cover every pure PyTorch call in the repository.
 ## Step 1: Enable Kernel API Logging
 ### Basic Logging (Function Names Only)
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=1
 export SGLANG_KERNEL_API_LOGDEST=stdout
 python my_script.py
 ```
 Output:
 ```
 ================================================================================
 [2026-03-19 00:47:06] SGLang Kernel API Call: RMSNorm.forward
 ================================================================================
 [2026-03-19 00:47:06] SGLang Kernel API Call: sglang.quant_method.UnquantizedLinearMethod.apply
 ================================================================================
 [2026-03-19 00:47:06] SGLang Kernel API Call: sglang.custom_op.fused_inplace_qknorm
 ```
 This is a real level-1 excerpt captured from `Qwen/Qwen3-0.6B`.
 ### Detailed Logging (Inputs with Metadata)
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=3
 export SGLANG_KERNEL_API_LOGDEST=debug.log
 python my_script.py
 ```
 Output in `debug.log`:
 ```
 ================================================================================
 [2026-03-19 00:47:30] SGLang Kernel API Call: sglang.quant_method.UnquantizedLinearMethod.apply
 Positional input arguments:
  arg[0]=QKVParallelLinear(
      repr=QKVParallelLinear(in_features=1024, output_features=4096, bias=False, tp_size=1, gather_output=False)
    )
  arg[1]=Tensor(
      shape=(1, 1024)
      dtype=torch.bfloat16
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
  arg[2]=None
 Output:
  return=Tensor(
      shape=(1, 4096)
      dtype=torch.bfloat16
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
 ```
 This is a real level-3 excerpt captured from `Qwen/Qwen3-0.6B`.
 ### Full Logging (With Tensor Statistics)
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=5
 export SGLANG_KERNEL_API_LOGDEST=debug.log
 python my_script.py
 ```
 Additional output:
 ```
 ================================================================================
 [2026-03-19 01:00:42] SGLang Kernel API Call: diffusion.quant_method.UnquantizedLinearMethod.apply
 Positional input arguments:
  arg[1]=Tensor(
      shape=(1, 77, 768)
      dtype=torch.bfloat16
      device=cuda:0
      requires_grad=False
      is_contiguous=True
      min=-27.250000
      max=28.500000
      mean=0.011723
      nan_count=0
      inf_count=0
    )
 Output:
  return=Tensor(
      shape=(1, 77, 2304)
      dtype=torch.bfloat16
      device=cuda:0
      requires_grad=False
      is_contiguous=True
      min=-8.937500
      max=9.375000
      mean=0.009460
      nan_count=0
      inf_count=0
    )
 ```
 This is a real level-5 excerpt captured from `black-forest-labs/FLUX.1-dev`.
 ### Crash-Safe Dumps (Inputs Saved Before Execution)
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=10
 export SGLANG_KERNEL_API_LOGDEST=debug.log
 export SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_kernel_api_dumps
 python my_script.py
 ```
 At level 10, SGLang saves the inputs before execution. If the kernel crashes, the dump directory still contains the inputs and exception metadata.
 If CUDA graph capture is active, tensor dumps are skipped automatically to avoid capture-time CUDA errors. In that case, you still get the kernel API call log, but not `inputs.pt` / `outputs.pt`.
 Level-10 dumps are best understood as crash-safe call snapshots. They always preserve the observed call boundary. They do not guarantee one-click replay for every method, because some methods depend on module state that is not serialized into the dump.
 Real level-10 dump layout from `Qwen/Qwen3-0.6B`:
 ```text
 /tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps
 /tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001
 /tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001/inputs.pt
 /tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001/metadata.json
 /tmp/sglang_kernel_api_validation/qwen_qwen3_0_6b_level10_dumps/20260319_004821_182_pid919286_RotaryEmbedding.forward_call0001/outputs.pt
 ```
 Real `metadata.json` excerpt:
 ```json
 {
  "function_name": "RotaryEmbedding.forward",
  "timestamp": "20260319_004821_182",
  "process_id": 919286,
  "execution_status": "completed",
  "input_tensor_keys": ["arg_0", "arg_1", "arg_2"],
  "output_tensor_keys": ["result_0", "result_1"]
 }
 ```
 ## Step 2: Reproduce an LLM CUDA Crash
 Create a temporary reproducer:
 ```bash
 python3 - <<'PY'
 from pathlib import Path
 Path("/tmp/sglang_llm_crash.py").write_text(
    "import torch\n"
    "import torch.nn.functional as F\n"
    "from sglang.srt.utils.custom_op import register_custom_op\n\n"
    "def _fake_embedding(indices, table):\n"
    "    return torch.empty((*indices.shape, table.shape[-1]), device=table.device, dtype=table.dtype)\n\n"
    "@register_custom_op(op_name='mock_llm_cuda_crash', fake_impl=_fake_embedding)\n"
    "def mock_llm_cuda_crash(indices, table):\n"
    "    out = F.embedding(indices, table)\n"
    "    torch.cuda.synchronize()\n"
    "    return out\n\n"
    "table = torch.randn(4, 8, device='cuda', dtype=torch.float16)\n"
    "indices = torch.tensor([0, 7], device='cuda', dtype=torch.long)\n"
    "mock_llm_cuda_crash(indices, table)\n"
 )
 PY
 SGLANG_KERNEL_API_LOGLEVEL=1 \
 SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_llm_level1.log \
 python3 /tmp/sglang_llm_crash.py
 ```
 What to expect:
 - The script exits with a CUDA `device-side assert`
 - The log still contains the last API boundary before the crash
 Try the same example at level 3:
 ```bash
 SGLANG_KERNEL_API_LOGLEVEL=3 \
 SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_llm_level3.log \
 python3 /tmp/sglang_llm_crash.py
 ```
 Now the log shows tensor metadata before the crash.
 Try level 10:
 ```bash
 SGLANG_KERNEL_API_LOGLEVEL=10 \
 SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_llm_level10.log \
 SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_llm_level10_dumps \
 python3 /tmp/sglang_llm_crash.py
 ```
 Now you should see:
 - A log entry for `sglang.custom_op.mock_llm_cuda_crash`
 - A dump directory with `inputs.pt`
 - `metadata.json` showing `execution_status: "exception"`
 - No `outputs.pt`, because the kernel crashed before producing output
 To capture level-10 dumps from the success path of a real model run, it is often easier to temporarily disable CUDA graph and piecewise CUDA graph for the debug run, because tensor dumps are skipped while a CUDA graph is being captured.
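The checks listed above can be scripted once a run has produced many dump directories. The following is a minimal sketch, not part of SGLang: `find_crashed_calls` is a hypothetical helper, and it assumes the layout seen in the examples in this guide (one subdirectory per call containing `metadata.json` with `function_name` and `execution_status`, plus `outputs.pt` only when the call returned).

```python
import json
from pathlib import Path


def find_crashed_calls(dump_root):
    """Return (function_name, dump_dir, has_outputs) for calls that did not complete.

    Hypothetical helper: assumes one subdirectory per call that holds
    metadata.json, with outputs.pt present only for completed calls.
    """
    crashed = []
    for meta_path in sorted(Path(dump_root).rglob("metadata.json")):
        meta = json.loads(meta_path.read_text())
        if meta.get("execution_status") != "completed":
            has_outputs = (meta_path.parent / "outputs.pt").exists()
            crashed.append(
                (meta.get("function_name", "<unknown>"), str(meta_path.parent), has_outputs)
            )
    return crashed
```

Calling `find_crashed_calls("/tmp/sglang_llm_level10_dumps")` after the level-10 run should point at the directory whose `inputs.pt` is worth loading first.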
 ## Step 3: Reproduce a Diffusion CUDA Crash
 Create a temporary diffusion-side reproducer:
 ```bash
 python3 - <<'PY'
 from pathlib import Path
 Path("/tmp/sglang_diffusion_crash.py").write_text(
    "import torch\n"
    "import torch.nn.functional as F\n"
    "from sglang.multimodal_gen.runtime.layers.utils import register_custom_op\n\n"
    "def _fake_embedding(positions, cache):\n"
    "    return torch.empty((*positions.shape, cache.shape[-1]), device=cache.device, dtype=cache.dtype)\n\n"
    "@register_custom_op(op_name='mock_diffusion_cuda_crash', fake_impl=_fake_embedding)\n"
    "def mock_diffusion_cuda_crash(positions, cache):\n"
    "    out = F.embedding(positions, cache)\n"
    "    torch.cuda.synchronize()\n"
    "    return out\n\n"
    "cache = torch.randn(4, 64, device='cuda', dtype=torch.float16)\n"
    "positions = torch.tensor([0, 9], device='cuda', dtype=torch.long)\n"
    "mock_diffusion_cuda_crash(positions, cache)\n"
 )
 PY
 SGLANG_KERNEL_API_LOGLEVEL=1 \
 SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_diffusion_level1.log \
 python3 /tmp/sglang_diffusion_crash.py
 ```
 Try level 3:
 ```bash
 SGLANG_KERNEL_API_LOGLEVEL=3 \
 SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_diffusion_level3.log \
 python3 /tmp/sglang_diffusion_crash.py
 ```
 Try level 10:
 ```bash
 SGLANG_KERNEL_API_LOGLEVEL=10 \
 SGLANG_KERNEL_API_LOGDEST=/tmp/sglang_diffusion_level10.log \
 SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_diffusion_level10_dumps \
 python3 /tmp/sglang_diffusion_crash.py
 ```
 If your local environment has unrelated FlashInfer import issues, resolve them in the shell before running the example. The example itself does not set any `FLASHINFER_*` environment variable.
 ## Step 4: Multi-Process Debugging
 When running with multiple GPUs or worker processes, use `%i` in the log path. It expands to the process ID (not the rank), so each process writes its own file:
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=3
 export SGLANG_KERNEL_API_LOGDEST=debug_rank_%i.log
 torchrun --nproc_per_node=4 my_script.py
 ```
 This creates separate logs such as:
 - `debug_rank_12345.log`
 - `debug_rank_12346.log`
 - `debug_rank_12347.log`
 - `debug_rank_12348.log`
 Real multi-process example from a 2-GPU `Qwen/Qwen2.5-0.5B-Instruct` run:
 ```text
 /tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950201.log
 /tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950349.log
 /tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950350.log
 /tmp/sglang_kernel_api_validation_multi/qwen_qwen2_5_0_5b_instruct_level3_950351.log
 ```
 You should usually do the same for level-10 dump directories:
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=10
 export SGLANG_KERNEL_API_LOGDEST=debug_rank_%i.log
 export SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_kernel_api_dumps_%i
 ```
 This avoids multiple ranks writing into the same dump directory tree.
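To make the `%i` behavior concrete, here is an illustrative sketch of the expansion. It assumes a plain substring substitution with the current PID; `expand_log_path` is a hypothetical name and SGLang's actual implementation may differ.

```python
import os


def expand_log_path(template, pid=None):
    """Replace every %i in the template with a process ID (illustrative only)."""
    if pid is None:
        pid = os.getpid()  # each worker process sees its own PID
    return template.replace("%i", str(pid))
```

For example, `expand_log_path("debug_rank_%i.log", pid=12345)` yields `debug_rank_12345.log`. Because the number is a PID, you still need the launcher output or the log contents to map a file back to a rank.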
 ## Step 5: Filter Level-10 Dumps
 If level 10 is too noisy, restrict dumps to specific APIs:
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=10
 export SGLANG_KERNEL_API_LOGDEST=debug.log
 export SGLANG_KERNEL_API_DUMP_DIR=/tmp/sglang_kernel_api_dumps
 export SGLANG_KERNEL_API_DUMP_INCLUDE='sglang.custom_op.*'
 export SGLANG_KERNEL_API_DUMP_EXCLUDE='*.fake_impl'
 ```
 `SGLANG_KERNEL_API_DUMP_INCLUDE` and `SGLANG_KERNEL_API_DUMP_EXCLUDE` use shell-style wildcard matching.
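A quick way to test a pattern before a long run is Python's `fnmatch` module, which implements shell-style wildcards. The sketch below is an assumption about how the two lists combine (include first, then exclude), not SGLang's actual code; `should_dump` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def should_dump(api_name, include=(), exclude=()):
    """Hypothetical filter: dump when the name matches an include pattern
    (or no include patterns are given) and matches no exclude pattern."""
    included = not include or any(fnmatchcase(api_name, p) for p in include)
    excluded = any(fnmatchcase(api_name, p) for p in exclude)
    return included and not excluded
```

With the settings above, `should_dump("sglang.custom_op.mock_llm_cuda_crash", ["sglang.custom_op.*"], ["*.fake_impl"])` is true, while `RotaryEmbedding.forward` would be filtered out by the include pattern.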
 ## Step 6: Common CUDA Errors and What to Check
 ### Illegal Memory Access or Device-Side Assert
 **Typical errors**:
 ```text
 RuntimeError: CUDA error: an illegal memory access was encountered
 torch.AcceleratorError: CUDA error: device-side assert triggered
 ```
 Use:
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=3
 ```
 Check in the logs:
 - ✅ Tensor shapes
 - ✅ Tensor dtypes
 - ✅ CUDA vs CPU device placement
 - ✅ Tensor stride / contiguity
 - ✅ Whether the failing call has inputs logged but no outputs logged
 Typical shape-mismatch pattern:
 ```text
 SGLang Kernel API Call: ...
 arg[0]=Tensor(shape=(..., 128), ...)   # ✅ expected dimension
 arg[1]=Tensor(shape=(..., 64), ...)    # ❌ mismatch
 ```
 This often points to a head-dim, hidden-dim, or cache-layout mismatch rather than a random CUDA failure.
 ### NaN or Inf
 Use:
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=5
 ```
 Check:
 - `min`
 - `max`
 - `mean`
 - `nan_count`
 - `inf_count`
 Typical bad pattern:
 ```text
 Tensor(
  ...
  min=-1234567.000000   # ❌ suspiciously large
  max=9876543.000000    # ❌ suspiciously large
  mean=nan              # ❌ bad
  nan_count=128         # ❌ found NaNs
  inf_count=0           # ✅ no Infs here
 )
 ```
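 When this pattern shows up on a call's inputs, the bad values were already present before that kernel ran, so trace back through earlier calls in the log to find where `nan_count` first becomes nonzero.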
 ### Out of Memory
 Use:
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=3
 ```
 Check:
 - Unexpectedly large tensor shapes
 - Batch size
 - Sequence length
 - Frame count or image resolution in diffusion workloads
 Also check whether a supposedly per-token or per-frame tensor accidentally became full-sequence or full-image sized.
 Typical bad pattern:
 ```text
 Tensor(
  shape=(1024, 8192, 128, 128)   # ❌ way too large
  ...
 )
 ```
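To judge whether a logged shape is plausible, it helps to convert it to bytes. This is a small worked example; the dtype of the tensor above is elided in the log excerpt, so `float16` is an assumption, and `tensor_gib` is an illustrative helper rather than an SGLang function.

```python
import math

# Bytes per element for dtypes that appear in this guide.
DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4, "int64": 8}


def tensor_gib(shape, dtype):
    """Memory needed for a dense tensor of this shape, in GiB."""
    return math.prod(shape) * DTYPE_BYTES[dtype] / 2**30


# 1024 * 8192 * 128 * 128 elements * 2 bytes = 2**38 bytes
print(tensor_gib((1024, 8192, 128, 128), "float16"))  # prints 256.0
```

At 256 GiB, that single allocation exceeds the memory of any one GPU, which is why the shape alone is enough to flag it.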
 ### Example: Spot a Shape Bug from the Log
 Suppose the failing API log looks like this:
 ```text
 [2026-03-19 00:47:30] SGLang Kernel API Call: RotaryEmbedding.forward
 Positional input arguments:
  arg[0]=Tensor(shape=(1, 8), dtype=torch.int64, ...)
  arg[1]=Tensor(shape=(1, 8, 8, 256), dtype=torch.bfloat16, ...)    # ✅ query
  arg[2]=Tensor(shape=(1, 8, 4, 64), dtype=torch.bfloat16, ...)     # ❌ key head_dim mismatch
 ```
 What this tells you:
 - ✅ positions look reasonable
 - ✅ query looks plausible
 - ❌ key last dimension is inconsistent with the expected rotary/head dimension
 That usually means the bug is in projection layout, head packing, or cache format rather than in the rotary kernel itself.
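The same comparison can be done mechanically when the log is long. The sketch below parses the `arg[N]=Tensor(shape=(...), ...)` lines exactly as they appear in the excerpt above; the regular expression is an assumption based on that excerpt and only handles fully numeric shapes.

```python
import re

# Matches lines such as: arg[1]=Tensor(shape=(1, 8, 8, 256), dtype=...)
ARG_SHAPE = re.compile(r"arg\[(\d+)\]=Tensor\(shape=\(([^)]*)\)")


def parse_arg_shapes(log_text):
    """Map positional argument index -> shape tuple for one logged call."""
    shapes = {}
    for index, dims in ARG_SHAPE.findall(log_text):
        shapes[int(index)] = tuple(int(d) for d in dims.split(",") if d.strip())
    return shapes


LOG = """\
arg[0]=Tensor(shape=(1, 8), dtype=torch.int64, ...)
arg[1]=Tensor(shape=(1, 8, 8, 256), dtype=torch.bfloat16, ...)
arg[2]=Tensor(shape=(1, 8, 4, 64), dtype=torch.bfloat16, ...)
"""

shapes = parse_arg_shapes(LOG)
query_head_dim, key_head_dim = shapes[1][-1], shapes[2][-1]
if query_head_dim != key_head_dim:
    # prints: head_dim mismatch: query=256, key=64
    print(f"head_dim mismatch: query={query_head_dim}, key={key_head_dim}")
```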
 ## Step 7: Combine with compute-sanitizer
 For harder bugs, combine kernel API logging with CUDA memory checking:
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=3
 export SGLANG_KERNEL_API_LOGDEST=debug.log
 compute-sanitizer --tool memcheck python3 /tmp/sglang_llm_crash.py
 ```
 Use `debug.log` to see the exact inputs that reached the crashing API boundary.
 Typical `compute-sanitizer` output:
 ```text
 ========= COMPUTE-SANITIZER
 ========= Invalid __global__ write of size 4 bytes
 =========     at 0x1234 in SomeKernel
 =========     by thread (256,0,0) in block (10,0,0)
 =========     Address 0x... is out of bounds
 ```
 Use the sanitizer output to identify the failing kernel and use `debug.log` to identify the exact tensors that reached the API boundary right before it.
 If you need more synchronous host-side error reporting, you can try `CUDA_LAUNCH_BLOCKING=1` as a separate follow-up experiment. It is not part of the default workflow because it changes execution timing and can hide concurrency-related behavior.
 ## Step 8: Combine with cuda-gdb
 For crashes that need a stack trace instead of only memory diagnostics:
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=3
 export SGLANG_KERNEL_API_LOGDEST=debug.log
 cuda-gdb --args python3 /tmp/sglang_llm_crash.py
 ```
 Inside `cuda-gdb`:
 ```text
 (cuda-gdb) run
 (cuda-gdb) where
 ```
 Then correlate the backtrace with `debug.log`.
 ## Step 9: Kernel-Level Debugging with printf()
 When you own the CUDA kernel, `printf()` is still useful for narrowing down bad indices, bad launch geometry, or broken state propagation.
 Basic pattern:
 ```cpp
 __global__ void MyKernel(const float* input, float* output, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    printf("n=%d input0=%f\n", n, input[0]);
  }
  if (idx < n) {
    output[idx] = input[idx] * 2.0f;
  }
 }
 ```
 After launch, force the output to flush:
 ```python
 my_kernel(...)
 torch.cuda.synchronize()
 ```
 For warp-specialized kernels, do not blindly print only on `threadIdx.x == 0`. Pick one representative thread per warp or per specialization group instead.
 ### Warp-Specialized Kernels: Choosing the Right Print Thread
 Problem:
 - `threadIdx.x == 0` only prints from the first warp in the block
 - for warp-specialized kernels, that often misses the warp or group that is actually wrong
 Better pattern:
 ```cpp
 __global__ void WarpSpecializedKernel(...) {
  // Example: first lane of each warp
  if ((threadIdx.x % 32) == 0) {
    printf("warp=%d\n", threadIdx.x / 32);
  }
 }
 ```
 Or, if the kernel is organized in larger specialization groups, print once per group instead of once per block.
 Common mistake:
 ```cpp
 // Only warp 0 prints
 if (threadIdx.x == 0) {
  printf("warp=%d\n", threadIdx.x / 32);
 }
 ```
 ### Quick Reference
 | Kernel Type | Print Condition | Notes |
 |----------|----------|-------------|
 | Simple kernel | `threadIdx.x == 0` | One thread per block is usually enough |
 | Warp-specialized kernel | one representative lane per warp | e.g. `threadIdx.x % 32 == 0` |
 | Group-specialized kernel | one representative lane per group | choose based on the kernel's scheduling layout |
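 ### Other Kernel Debugging Tools
 Device-side `assert` checks a runtime invariant inside the kernel and surfaces on the host as a `device-side assert triggered` error; `static_assert` rejects invalid compile-time constants before the kernel ever runs: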
 ```cpp
 assert(value >= 0.0f && "value must be non-negative");
 static_assert(BLOCK_SIZE % 32 == 0, "BLOCK_SIZE must be warp aligned");
 ```
 ## Environment Variables Reference
 | Variable | Values | Description |
 |----------|--------|-------------|
 | `SGLANG_KERNEL_API_LOGLEVEL` | `0` | No logging (default) |
 |  | `1` | Function names only |
 |  | `3` | Inputs and outputs with metadata |
 |  | `5` | Level 3 plus tensor statistics |
 |  | `10` | Level 5 plus crash-safe tensor dumps |
 | `SGLANG_KERNEL_API_LOGDEST` | `stdout` | Log to stdout |
 |  | `stderr` | Log to stderr |
 |  | `<path>` | Log to file |
 |  | `log_%i.txt` | `%i` expands to process ID |
 | `SGLANG_KERNEL_API_DUMP_DIR` | `<path>` | Directory for level-10 dumps |
 | `SGLANG_KERNEL_API_DUMP_INCLUDE` | wildcard list | Only dump matching API names |
 | `SGLANG_KERNEL_API_DUMP_EXCLUDE` | wildcard list | Skip matching API names |
 ## Best Practices
 ### 1. Start with Level 3
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=3
 ```
 Level 3 is usually enough to catch wrong shapes, wrong dtypes, and wrong devices.
 ### 2. Use Level 5 for Numerical Issues
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=5
 ```
 Use it when you suspect NaN or Inf values.
 ### 3. Use Level 10 for Crash Reproduction
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=10
 ```
 This is the most useful mode when the process crashes before you can inspect live tensors.
 If you need successful input/output dumps from a real model run, temporarily disable CUDA graph for that debug session.
 When level 10 is too noisy, pair it with `SGLANG_KERNEL_API_DUMP_INCLUDE` / `SGLANG_KERNEL_API_DUMP_EXCLUDE` instead of dumping every covered API.
 ### 4. Log to File for Crashes
 ```bash
 export SGLANG_KERNEL_API_LOGDEST=crash.log
 ```
 File logs are safer than stdout when the process aborts.
 ### 5. Disable Logging in Production
 ```bash
 unset SGLANG_KERNEL_API_LOGLEVEL
 ```
 When disabled, the decorator returns the original callable and adds no runtime logging overhead.
 ## Troubleshooting
 ### No Logs Appear
 Check:
 1. `echo $SGLANG_KERNEL_API_LOGLEVEL`
 2. `echo $SGLANG_KERNEL_API_LOGDEST`
 3. Whether the failing path goes through a covered API boundary
 ### Too Much Output
 Reduce the level:
 ```bash
 export SGLANG_KERNEL_API_LOGLEVEL=3
 ```
 ### Statistics Are Skipped During CUDA Graph Capture
 If you see:
 ```text
 statistics=[skipped: CUDA graph capture in progress]
 ```
 That is expected. Level-5 statistics are intentionally skipped during CUDA graph capture to avoid synchronization side effects.
 ### Tensor Dumps Are Skipped During CUDA Graph Capture
 If you see:
 ```text
 Tensor dump skipped: CUDA graph capture in progress
 ```
 That is also expected. Level-10 dumps require copying tensors to CPU, which is not allowed during CUDA graph capture.
--- a/third_party/sglang/.claude/skills/generate-profile/SKILL.md
+++ b/third_party/sglang/.claude/skills/generate-profile/SKILL.md
@@ -0,0 +1,141 @@
 ---
 name: generate-profile
 description: Generate an e2e profiling trace of an SGLang server run. Launches a server, validates accuracy, captures a Chrome-compatible trace, and returns the profile path.
 ---
 # Generate an E2E Profile of an SGLang Server Run
 This skill launches an SGLang server, validates it with a quick accuracy test, generates a profiling trace, and returns the profile file path.
 ## Prerequisites
 - A working SGLang installation (`pip install -e .` or equivalent)
 - At least one available CUDA GPU
 ## Step-by-step Workflow
 ### Step 1: Launch the server
 ```bash
 CUDA_VISIBLE_DEVICES=<gpu_id> sglang serve --model-path <model> --port <port> &
 ```
 - Default model: `Qwen/Qwen3-8B` (good balance of speed and quality)
 - Default port: `30000`
 - The server runs in the background. Save the PID for cleanup.
 - Use the GPU specified by the user's preferences (check memory files for GPU preferences).
 ### Step 2: Wait for server readiness
 Poll the health endpoint until the server is ready:
 ```bash
 for i in $(seq 1 120); do
  if curl -sf -o /dev/null http://127.0.0.1:<port>/health; then
    echo "Server ready"
    break
  fi
  sleep 5
 done
 ```
 The server prints **"The server is fired up and ready to roll!"** to stdout when ready. The health endpoint returns 200 once the server can accept requests.
 Typical startup time: 30-90 seconds depending on model size and whether CUDA graphs are being captured.
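If the surrounding automation is written in Python rather than shell, the same polling loop can be sketched as follows. It relies only on the status code (the text above states that the health endpoint returns 200 when ready); `health_ok` and `wait_until_ready` are hypothetical helper names, and the defaults mirror the 120 attempts and 5-second sleep of the shell loop.

```python
import time
import urllib.error
import urllib.request


def health_ok(url, timeout=2.0):
    """Return True when GET url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe, attempts=120, delay=5.0):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

A typical call would be `wait_until_ready(lambda: health_ok("http://127.0.0.1:30000/health"))`. Unlike the shell loop, the return value tells the caller when the server never came up, so profiling can be skipped.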
 ### Step 3: Validate accuracy (sanity check)
 ```bash
 python3 -m sglang.test.few_shot_gsm8k --num-q 20
 ```
 - Expected accuracy: **> 0.8** for capable models (Qwen3-8B, Llama-3.1-8B-Instruct, etc.)
 - This is a quick sanity check, not a rigorous benchmark.
 - If accuracy is unexpectedly low, something is wrong — do not proceed to profiling.
 ### Step 4: Generate the profile
 ```bash
 python3 -m sglang.test.send_one --profile
 ```
 This command:
 1. Sends a request to the server
 2. Triggers the profiler for 5 steps (default)
 3. Generates a trace file under `/tmp/<timestamp>/`
 The trace directory contains:
   - `<timestamp>-TP-0.trace.json.gz` — Chrome trace format (open in `chrome://tracing` or Perfetto)
   - `server_args.json` — the server configuration used
 **Output format:**
 ```
 Dump profiling traces to /tmp/<timestamp>
 ```
 The profile path is printed to stdout. Parse it from the output.
 **Optional flags:**
 - `--profile-steps N` — number of profiling steps (default: 5)
 - `--profile-by-stage` — profile by stage (prefill/decode separately)
 - `--profile-prefix <path>` — custom output prefix
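Parsing the profile path from the captured stdout can be as small as one regular expression. This sketch assumes the output line has exactly the `Dump profiling traces to <path>` form shown above and that the path contains no spaces; `parse_profile_dir` is an illustrative name.

```python
import re

TRACE_LINE = re.compile(r"Dump profiling traces to (\S+)")


def parse_profile_dir(stdout_text):
    """Return the trace directory from the send_one output, or None if absent."""
    match = TRACE_LINE.search(stdout_text)
    return match.group(1) if match else None
```

A `None` result is a useful signal that profiling did not run, for example because the server was not reachable.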
 ### Step 5: Kill the server
 ```bash
 pkill -9 -f "sglang.launch_server|sglang serve|sglang.srt"
 ```
 Wait a moment and verify no sglang processes remain:
 ```bash
 sleep 2 && pgrep -af "sglang serve" || echo "Server killed"
 ```
 ### Step 6: Report the profile path
 Return the profile directory path (e.g., `/tmp/1773999986.4769795`) and list its contents so the user knows what files were generated.
 ## Example Full Run
 ```bash
 # 1. Launch server
 source cleanup/bin/activate
 CUDA_VISIBLE_DEVICES=1 sglang serve --model-path Qwen/Qwen3-8B --port 30000 &
 # 2. Wait for ready
 for i in $(seq 1 120); do
  curl -s http://127.0.0.1:30000/health | grep -q "ok" && break
  sleep 5
 done
 # 3. Accuracy check
 python3 -m sglang.test.few_shot_gsm8k --num-q 20
 # Expected: Accuracy > 0.8
 # 4. Profile
 python3 -m sglang.test.send_one --profile
 # Output: "Dump profiling traces to /tmp/1773999986.4769795"
 # 5. Cleanup
 pkill -9 -f "sglang.launch_server|sglang serve|sglang.srt"
 sleep 2
 # 6. Check output
 ls -la /tmp/1773999986.4769795/
 # 1773999986.4851577-TP-0.trace.json.gz  (Chrome trace)
 # server_args.json                        (server config)
 ```
 ## Customization
 - **Different port**: Pass `--port <port>` to `sglang serve`, and pass `--host 127.0.0.1 --port <port>` to the test commands
 - **Multi-GPU**: Use `--tp <N>` for tensor parallelism; trace files will be generated per TP rank
 - **Longer profile**: Use `--profile-steps 10` for more steps in the trace
 - **Stage profiling**: Use `--profile-by-stage` to separate prefill and decode phases
 ## Viewing the Profile
 Open the `.trace.json.gz` file in:
 - **Perfetto UI**: https://ui.perfetto.dev/ (drag and drop the file)
 - **Chrome tracing**: `chrome://tracing` (load the file)
 Both support the gzipped Chrome trace format natively.
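For a quick text summary without opening a viewer, the trace can also be read directly. The sketch below assumes the standard Chrome trace event layout (either a top-level list of events or an object with a `traceEvents` list, with complete events marked `"ph": "X"` and durations in `dur`); `top_events` is a hypothetical helper, and it loads the whole file into memory, so it suits short profiles.

```python
import gzip
import json
from collections import Counter


def top_events(trace_path, limit=10):
    """Sum `dur` (microseconds) per event name over complete ("X") events."""
    with gzip.open(trace_path, "rt", encoding="utf-8") as handle:
        trace = json.load(handle)
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = Counter()
    for event in events:
        if event.get("ph") == "X":
            totals[event.get("name", "<unnamed>")] += event.get("dur", 0)
    return totals.most_common(limit)
```

The result is a list of `(event_name, total_duration)` pairs sorted from longest to shortest.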
--- a/third_party/sglang/.claude/skills/sglang-bisect-ci-regression/SKILL.md
+++ b/third_party/sglang/.claude/skills/sglang-bisect-ci-regression/SKILL.md
@@ -0,0 +1,219 @@
 # SGLang Bisect CI Regression
 Investigate a consistently failing CI test to find the root cause - whether it's a code regression from a specific PR, a hardware/runner-specific issue, or an environment change. Optionally reproduce the failure on a remote GPU server.
 ## Slash Command
 `/sglang-bisect-ci-regression <test_name_or_ci_url> [ssh_target] [docker_container]`
 ## When to Use This Skill
 - A CI test is failing consistently on main (scheduled runs)
 - You need to find which PR introduced a regression
 - You suspect a runner-specific or GPU-specific issue
 - You want to reproduce a CI failure on a remote server
 ## Arguments
 - **First argument (required)**: Test file name (e.g. `test_lora_tp.py`) or a GitHub Actions job URL
 - **Second argument (optional)**: SSH target for remote reproduction (e.g. `user@host`)
 - **Third argument (optional)**: Docker container name on the SSH target (e.g. `sglang_dev`)
 If SSH target and docker container are not provided, the skill will only perform the CI log analysis and bisection, without remote reproduction. **Ask the user** for these if reproduction is needed and they weren't provided.
 ## Background: Scheduled CI Runs
 SGLang uses the `pr-test.yml` workflow with **scheduled runs** (cron-triggered) to periodically test the `main` branch. These runs are the primary data source for detecting regressions:
 - **Workflow**: `pr-test.yml` with `event: schedule`
 - **Branch**: `main`
 - **Dashboard**: https://github.com/sgl-project/sglang/actions/workflows/pr-test.yml?query=event%3Aschedule
 - **Frequency**: Runs multiple times daily, each pinned to the HEAD of `main` at trigger time
 - **Purpose**: Catches regressions that slip through PR-level CI (e.g., interaction bugs between merged PRs, hardware-specific issues)
 Always use these scheduled runs (not PR-triggered runs) when bisecting regressions on `main`. The `--event schedule` filter in `gh run list` ensures you only see these periodic main-branch runs.
 ## Workflow
 ### Phase 1: Extract the Failure Signature
 1. **Get the failing test details from CI logs.** If given a URL, fetch logs directly. If given a test name, find recent scheduled runs of `pr-test.yml` on `main` that failed:
 ```bash
 # List recent scheduled runs targeting main (the primary source of truth for regressions)
 # These are cron-triggered runs visible at:
 # https://github.com/sgl-project/sglang/actions/workflows/pr-test.yml?query=event%3Aschedule
 gh run list --repo sgl-project/sglang --workflow="pr-test.yml" --event schedule --branch main --limit 20 --json databaseId,conclusion,createdAt,headSha
 # Find the job containing the test
 gh run view {RUN_ID} --repo sgl-project/sglang --json jobs --jq '.jobs[] | select(.conclusion == "failure") | {name, conclusion, databaseId}'
 # Get the failure details
 gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E -B 5 -A 30 "AssertionError|FAIL|Error|{TEST_NAME}"
 ```
 2. **Record the failure signature:**
   - Exact error message and assertion
   - Affected test method name
   - Model/config involved
   - Numeric values (e.g., tolerance diffs, scores)
   - Whether the failure is deterministic (same values across runs)
 ### Phase 2: Temporal Bisection
 3. **Find the boundary between passing and failing runs.** Walk through the scheduled run history (from the `pr-test.yml` schedule runs on `main`) to identify:
   - Last known PASSING run (sha + date)
   - First known FAILING run (sha + date)
 ```bash
 # For each scheduled run, check the specific partition/job status
 gh run view {RUN_ID} --repo sgl-project/sglang --json jobs --jq '.jobs[] | select(.name == "{JOB_NAME}") | {conclusion, databaseId}'
 # Verify a specific test passed or failed in a run
 gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "{TEST_NAME}|PASSED|FAILED|logprobs mismatch" | head -10
 ```
 4. **List commits between the boundary:**
 ```bash
 git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA}
 ```
 5. **Filter for relevant commits** that touch files related to the failing test (model layers, kernels, test utilities, etc.):
 ```bash
 git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA} -- {relevant_paths}
 ```
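The boundary search in this phase can be sketched in a few lines once the run history is available as JSON. The record keys below (`createdAt`, `headSha`, `conclusion`) are the ones requested from `gh run list` in Phase 1; treating `conclusion` as the result of the specific job or test you care about is an assumption, and `find_boundary` is a hypothetical helper.

```python
def find_boundary(runs):
    """Return (last_pass, first_fail) around the most recent pass-to-fail transition.

    Each run is a dict with createdAt (ISO 8601 string), headSha, and conclusion.
    Runs with other conclusions (for example "cancelled") are ignored.
    """
    ordered = sorted(runs, key=lambda run: run["createdAt"])
    last_pass = None
    first_fail = None
    for run in ordered:
        if run["conclusion"] == "success":
            # A later pass resets the boundary.
            last_pass, first_fail = run, None
        elif run["conclusion"] == "failure" and first_fail is None:
            first_fail = run
    return last_pass, first_fail
```

The two `headSha` values returned are what goes into `{LAST_PASS_SHA}` and `{FIRST_FAIL_SHA}` in the `git log` commands above. If passes keep appearing after failures, the history is intermittent and the boundary is less meaningful.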
 ### Phase 3: Runner/Hardware Analysis
 6. **Check if the failure is runner-specific.** Extract the runner identity from each failing and passing run:
 ```bash
 # Get runner name and machine
 gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "Runner name|Machine name" | head -5
 # Get GPU/driver info
 gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -i -E "NVIDIA-SMI|Driver Version|CUDA Version" | head -5
 # Get package versions
 gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "sgl.kernel.*==|flashinfer.*==" | head -5
 ```
 7. **Correlate runners with pass/fail outcomes.** Build a table:
 | Run ID | Date | Runner | GPU Type | Driver | Result |
 |--------|------|--------|----------|--------|--------|
 If all failures map to a specific runner type/GPU and all passes map to another, the issue is **hardware-specific**, not a code regression.
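Once the table is filled in, the runner-specific check can be automated. This is a rough sketch under stated assumptions: each row is a dict with hypothetical `runner` and `result` keys mirroring the table columns, and results are the strings `PASS` or `FAIL`.

```python
from collections import defaultdict


def outcomes_by_runner(rows):
    """Group the set of observed results by runner name."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["runner"]].add(row["result"])
    return dict(grouped)


def looks_runner_specific(rows):
    """True when no runner has mixed results but both outcomes occur overall."""
    grouped = outcomes_by_runner(rows)
    all_results = set().union(*grouped.values())
    no_mixed = all(len(results) == 1 for results in grouped.values())
    return no_mixed and all_results == {"PASS", "FAIL"}
```

A `True` result is only suggestive. The strongest evidence is still the same SHA passing on one runner and failing on another.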
 ### Phase 4: Code Analysis
 8. **If a code regression is suspected** (failures not runner-specific), examine the candidate commits:
   - Read the changed files
   - Understand how the changes could affect the failing test
   - Look for prefill-vs-decode differences, TP-specific paths, kernel changes
 9. **If a hardware issue is suspected**, analyze:
   - Kernel compatibility (CUDA compute capability)
   - Driver version differences
   - All-reduce / NCCL behavior differences
   - CUDA graph capture differences across GPU architectures
 ### Phase 5: Remote Reproduction (Optional)
 Only if SSH target and docker container were provided.
 10. **Verify the remote environment:**
 ```bash
 ssh {SSH_TARGET} "docker exec {CONTAINER} nvidia-smi --query-gpu=name,driver_version --format=csv"
 ssh {SSH_TARGET} "docker exec {CONTAINER} pip show sgl-kernel sglang flashinfer-python 2>&1 | grep -E 'Name:|Version:'"
 ```
 11. **Ensure latest code is installed.** If the container is stale, update:
 ```bash
 # Try fetching latest main
 ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && git fetch origin main && git checkout origin/main'"
 # Reinstall after the fetch
 ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && pip install -e \"python[all]\"'"
 # Or, if git auth fails, download and install from a tarball instead
 ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /tmp && curl -L https://github.com/sgl-project/sglang/archive/refs/heads/main.tar.gz | tar xz && cd sglang-main && pip install -e \"python[all]\"'"
 # Install test dependencies if needed
 ssh {SSH_TARGET} "docker exec {CONTAINER} pip install peft rouge-score"
 ```
 12. **Create a minimal reproduction script** that:
    - Uses `if __name__ == '__main__'` with `mp.set_start_method("spawn")`
    - Runs the specific failing test configuration
    - Prints key metrics (diffs, scores, outputs)
    - Exits with code 1 on failure
 13. **Copy and run the reproduction script:**
 ```bash
 scp /tmp/repro_script.py {SSH_TARGET}:/tmp/
 ssh {SSH_TARGET} "docker cp /tmp/repro_script.py {CONTAINER}:/tmp/"
 ssh {SSH_TARGET} "docker exec -e CUDA_VISIBLE_DEVICES=0,1 {CONTAINER} python3 /tmp/repro_script.py"
 ```
 14. **Run control experiments** to isolate the variable:
    - If suspecting TP issue: run with TP=1 as control
    - If suspecting GPU issue: compare same code on different GPU
    - If suspecting a specific commit: test before/after that commit
 ### Phase 6: Report
 15. **Produce a structured report:**
 ```markdown
 ## CI Regression Bisection Report
 ### Failure Signature
 - **Test**: {test_file}::{test_method}
 - **Error**: {exact error message}
 - **Key metrics**: {numeric values}
 - **Deterministic**: Yes/No
 ### Root Cause Classification
 One of:
 - **Code Regression**: PR #{number} introduced the bug
 - **Hardware-Specific**: Fails on {GPU_TYPE}, passes on others
 - **Environment Change**: New runner/driver/package version
 - **Pre-existing Flakiness**: Intermittent, not a new regression
 ### Evidence
 | Condition | Result |
 |-----------|--------|
 | {condition1} | PASS/FAIL |
 | {condition2} | PASS/FAIL |
 ### Timeline
 - {date}: Last known pass ({sha}, {runner})
 - {date}: First known fail ({sha}, {runner})
 - {date}: Confirmed reproduction on {server}
 ### Recommended Fix
 - **Short-term**: {workaround}
 - **Long-term**: {proper fix}
 ```
 ## Key Patterns to Recognize
 | Pattern | Diagnosis |
 |---------|-----------|
 | Same SHA passes on runner A, fails on runner B | Hardware/runner-specific |
 | All runners fail after commit X | Code regression from commit X |
 | Intermittent - same runner sometimes passes/fails | Flaky test or race condition |
 | Prefill OK but decode fails | TP/all-reduce issue in decode path |
 | Works with TP=1, fails with TP>1 | Tensor parallelism bug |
 | Exact same numeric diff every time | Deterministic bug, not flakiness |
 ## Important Notes
 - **Always check runner identity** before concluding it's a code regression. Many "consistent" failures are actually runner-specific.
 - **Test partition assignments change over time** as tests are added/removed. A test may move between partitions, landing on different runner types.
 - **H200 runners** use `/root/actions-runner/` path and machine names like `gpu-h200-worker-*`. Non-H200 runners use `/public_sglang_ci/runner-*` paths.
 - When running remote reproduction, use `run_in_background` for long-running tests and check output with `TaskOutput`.
 - Container environments may be stale - always verify package versions match CI before drawing conclusions.
--- a/third_party/sglang/.claude/skills/write-sglang-test/SKILL.md
+++ b/third_party/sglang/.claude/skills/write-sglang-test/SKILL.md
@@ -0,0 +1,444 @@
 ---
 name: write-sglang-test
 description: Guide for writing SGLang CI/UT tests. Covers CustomTestCase, CI registration, server fixtures, model selection, mock testing, and test placement. Always read test/README.md for the full CI layout, how to run tests, and extra tips. Use when creating new tests, adding CI test cases, writing unit tests, or when the user asks to add tests for SGLang features.
 ---
 # Writing SGLang CI / UT Tests
 This skill covers **how to write and register tests**. For CI pipeline internals (stage ordering, fast-fail, gating, partitioning, debugging CI failures), see the [CI workflow guide](../ci-workflow-guide/SKILL.md).
 ## Core Rules
 1. **Always use `CustomTestCase`** — never raw `unittest.TestCase`. It ensures `tearDownClass` runs even when `setUpClass` fails, preventing resource leaks in CI.
 2. **`tearDownClass` must be defensive** — use `hasattr`/null checks before accessing resources (e.g. `cls.process`) that `setUpClass` may not have finished allocating.
 3. **Place tests in `test/registered/<category>/`** — except JIT kernel tests and benchmarks, which live in `python/sglang/jit_kernel/tests/` and `python/sglang/jit_kernel/benchmark/` (nested subfolders are allowed)
 4. **Reuse server fixtures** — inherit from `DefaultServerBase` or write `setUpClass`/`tearDownClass` with `popen_launch_server`
 5. **Prefer mock over real server** — when testing logic that doesn't need a server / engine launch (middleware, request routing, config validation, argument parsing), use `unittest.mock.patch` / `MagicMock` and place tests in `test/registered/unit/`. Only launch a real server when the test genuinely needs inference results or server lifecycle behavior.
 JIT kernel exception:
 - If the task is adding or updating code under `python/sglang/jit_kernel/`, prefer the `add-jit-kernel` skill first.
 - JIT kernel correctness tests use `python/sglang/jit_kernel/tests/**/test_*.py`.
 - JIT kernel benchmarks use `python/sglang/jit_kernel/benchmark/**/bench_*.py`.
 - Those files are still executed by `test/run_suite.py`, but through dedicated kernel suites rather than `test/registered/`.
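Rules 1 and 2 can be illustrated with a small sketch. The import path for `CustomTestCase` is an assumption (check your checkout), the fallback to `unittest.TestCase` exists only so the sketch runs without SGLang installed, and the child process is a hypothetical stand-in for a launched server.

```python
import subprocess
import sys
import unittest

try:
    # Assumed import path; verify where CustomTestCase lives in your checkout.
    from sglang.test.test_utils import CustomTestCase
except ImportError:
    # Stand-in so this sketch runs without SGLang. Real tests must use
    # CustomTestCase, never raw unittest.TestCase.
    CustomTestCase = unittest.TestCase


class TestDefensiveTeardown(CustomTestCase):
    process = None  # class-level default so tearDownClass can always read it

    @classmethod
    def setUpClass(cls):
        # Hypothetical resource: a sleeping child process standing in for a server.
        cls.process = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(30)"]
        )

    @classmethod
    def tearDownClass(cls):
        # Defensive: setUpClass may have failed before assigning cls.process.
        process = getattr(cls, "process", None)
        if process is not None and process.poll() is None:
            process.terminate()
            process.wait(timeout=10)

    def test_process_is_running(self):
        self.assertIsNone(self.process.poll())
```

The important part is the guard in `tearDownClass`: it must be safe to call whether `setUpClass` finished, failed halfway, or never assigned the resource at all.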
 ---
 ## Model & Backend Selection
 | Scenario | Model | CI Registration | Suite |
 |----------|-------|-----------------|-------|
 | **Unit tests** (no server / engine launch) | None | `register_cpu_ci` (preferred) or `register_cuda_ci` | `stage-a-test-cpu` or `stage-b-test-1-gpu-small` |
 | **Common / backend-independent** (middleware, abort, routing, config, arg parsing) | `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` (1B) | `register_cuda_ci` only | `stage-b-test-1-gpu-small` |
 | **Model-agnostic functionality** (sampling, session, OpenAI API features) | `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` (1B) | `register_cuda_ci` (+ AMD if relevant) | `stage-b-test-1-gpu-small` |
 | **General performance** (single node, no spec/DP/parallelism) | `DEFAULT_MODEL_NAME_FOR_TEST` (8B) | `register_cuda_ci` | `stage-b-test-1-gpu-large` |
 | **Bigger features** (spec, DP, TP, disaggregation) | Case by case | Case by case | See suite table below |
 **Key principle for E2E tests**: Do NOT add `register_amd_ci` unless the test specifically exercises AMD/ROCm code paths. Common E2E tests just need any GPU to run — duplicating across backends wastes CI time with no extra coverage.
 ### All model constants
 Defined in `python/sglang/test/test_utils.py`:
 | Constant | Model | When to use |
 |----------|-------|-------------|
 | `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` | Llama-3.2-1B-Instruct | Common features, model-agnostic tests |
 | `DEFAULT_SMALL_MODEL_NAME_FOR_TEST_BASE` | Llama-3.2-1B | Base (non-instruct) model tests |
 | `DEFAULT_MODEL_NAME_FOR_TEST` | Llama-3.1-8B-Instruct | General performance (single node) |
 | `DEFAULT_MOE_MODEL_NAME_FOR_TEST` | Mixtral-8x7B-Instruct | MoE-specific tests |
 | `DEFAULT_SMALL_EMBEDDING_MODEL_NAME_FOR_TEST` | — | Embedding tests |
 | `DEFAULT_SMALL_VLM_MODEL_NAME_FOR_TEST` | — | Vision-language tests |
 ### Naming Conventions
 - **Suite**: `stage-{a,b,c}-test-{gpu_count}-gpu-{hardware}` (e.g., `stage-b-test-1-gpu-small`)
 - **CI runner**: `{gpu_count}-gpu-{hardware}` (e.g., `1-gpu-5090`, `4-gpu-h100`, `8-gpu-h200`)
 ### All CI Suites
 #### Per-commit (CUDA)
 | Suite | Runner (label) | Description |
 |-------|----------------|-------------|
 | `stage-a-test-1-gpu-small` | `1-gpu-5090` | Quick checks on a small NVIDIA GPU before heavier stages |
 | `stage-a-test-cpu` | `ubuntu-latest` | CPU-only unit tests |
 | `stage-b-test-1-gpu-small` | `1-gpu-5090` | Core engine tests that fit a 5090-class card |
 | `stage-b-test-1-gpu-large` | `1-gpu-h100` | Tests that need H100-class memory or kernels (e.g. FA3) |
 | `stage-b-test-2-gpu-large` | `2-gpu-h100` | Two-GPU correctness and parallelism (TP/PP) on H100 |
 | `stage-b-test-4-gpu-b200` | `4-gpu-b200` | Early Blackwell coverage (SM100+ paths) on four GPUs |
 | `stage-b-kernel-unit-1-gpu-large` | `1-gpu-h100` | JIT kernel correctness tests under `python/sglang/jit_kernel/tests/` |
 | `stage-b-kernel-unit-8-gpu-h200` | `8-gpu-h200` | Multi-GPU JIT kernel correctness tests under `python/sglang/jit_kernel/tests/` |
 | `stage-b-kernel-benchmark-1-gpu-large` | `1-gpu-h100` | JIT kernel benchmark files under `python/sglang/jit_kernel/benchmark/` |
 | `stage-c-test-4-gpu-h100` | `4-gpu-h100` | Large 4-GPU H100 integration and scaling tests |
 | `stage-c-test-8-gpu-h200` | `8-gpu-h200` | Large 8-GPU H200 runs for big models and parallelism |
 | `stage-c-test-8-gpu-h20` | `8-gpu-h20` | Large 8-GPU H20 runs for big models |
 | `stage-c-test-deepep-4-gpu-h100` | `4-gpu-h100` | DeepEP expert-parallel and networking on four H100s |
 | `stage-c-test-deepep-8-gpu-h200` | `8-gpu-h200` | DeepEP at 8-GPU H200 scale |
 | `stage-c-test-8-gpu-b200` | `8-gpu-b200` | 8-GPU B200 suite (registered but not yet wired to a workflow) |
 | `stage-c-test-4-gpu-b200` | `4-gpu-b200` | 4-GPU B200 suite for large models on Blackwell |
 | `stage-c-test-4-gpu-gb200` | `4-gpu-gb200` | 4-GPU GB200 suite for large models on Grace Blackwell |
 #### Per-commit (AMD)
 | Suite | Runner (label) | Description |
 |-------|----------------|-------------|
 | `stage-a-test-1-gpu-small-amd` | `linux-mi325-1gpu-sglang` | Quick checks on one MI325-class GPU |
 | `stage-b-test-1-gpu-small-amd` | `linux-mi325-1gpu-sglang` | Core 1-GPU AMD tests (14 partitions) |
 | `stage-b-test-1-gpu-small-amd-nondeterministic` | `linux-mi325-1gpu-sglang` | Non-deterministic 1-GPU AMD tests |
 | `stage-b-test-1-gpu-small-amd-mi35x` | `linux-mi35x-gpu-1` | 1-GPU tests on MI35x hardware |
 | `stage-b-test-1-gpu-large-amd` | `linux-mi325-1gpu-sglang` | Large 1-GPU AMD tests (2 partitions) |
 | `stage-b-test-2-gpu-large-amd` | `linux-mi325-2gpu-sglang` | 2-GPU ROCm correctness and parallel setups |
 | `stage-b-test-large-8-gpu-35x-disaggregation-amd` | `linux-mi35x-gpu-8.fabric` | PD disaggregation and RDMA on 8×MI35x fabric |
 | `stage-c-test-4-gpu-amd` | `linux-mi325-4gpu-sglang` | 4-GPU AMD integration (2 partitions) |
 | `stage-c-test-large-8-gpu-amd` | `linux-mi325-8gpu-sglang` | 8-GPU MI325 scaling and integration |
 | `stage-c-test-large-8-gpu-amd-mi35x` | `linux-mi35x-gpu-8` | 8-GPU MI35x scaling (2 partitions) |
 #### Per-commit (Ascend NPU)
 | Suite | Runner (label) | Description |
 | --- | --- | --- |
 | `per-commit-1-npu-a2` | `linux-aarch64-a2-1` | 1-NPU LLM CI machine |
 | `per-commit-2-npu-a2` | `linux-aarch64-a2-2` | 2-NPU LLM CI machine |
 | `per-commit-4-npu-a3` | `linux-aarch64-a3-4` | 4-NPU LLM CI machine |
 | `per-commit-16-npu-a3` | `linux-aarch64-a3-16` | 16-NPU LLM CI machine |
 | `multimodal-gen-test-1-npu-a3` | `linux-aarch64-a3-2` | 1-NPU multimodal CI machine |
 | `multimodal-gen-test-2-npu-a3` | `linux-aarch64-a3-16` | 2-NPU multimodal CI machine |
 | `multimodal-gen-test-8-npu-a3` | `linux-aarch64-a3-16` | 8-NPU multimodal CI machine |
 #### Nightly
 Nightly suites are listed in `NIGHTLY_SUITES` in [`test/run_suite.py`](../../../test/run_suite.py). They run via `nightly-test-nvidia.yml`, `nightly-test-amd.yml`, and `nightly-test-npu.yml`, not `pr-test.yml`. Examples:
 - `nightly-1-gpu` (CUDA)
 - `nightly-kernel-1-gpu` (CUDA, JIT kernel full grids)
 - `nightly-kernel-8-gpu-h200` (CUDA, multi-GPU JIT kernel nightly)
 - `nightly-8-gpu-h200` (CUDA)
 - `nightly-eval-vlm-2-gpu` (CUDA)
 - `nightly-amd` (AMD)
 - `nightly-amd-8-gpu-mi35x` (AMD)
 - `nightly-1-npu-a3` (NPU)
 - `nightly-2-npu-a3` (NPU)
 - `nightly-4-npu-a3` (NPU)
 - `nightly-8-npu-a3` (NPU)
 - `nightly-16-npu-a3` (NPU)
 > **Note**: Multimodal diffusion uses `python/sglang/multimodal_gen/test/run_suite.py`, not `test/run_suite.py`.
 ### Choosing a Suite
 Use the lightest suite that meets your test's needs:
 - **No GPU required** → `stage-a-test-cpu`
 - **Most small GPU tests** → `stage-b-test-1-gpu-small` (default choice)
 - **Need H100 memory or Hopper features** → `stage-b-test-1-gpu-large`
 - **JIT kernel correctness** → `stage-b-kernel-unit-1-gpu-large`
 - **JIT kernel benchmarks** → `stage-b-kernel-benchmark-1-gpu-large`
 - **Multi-GPU** → only when the test actually needs multiple GPUs
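The selection rules above can be written down as a small decision function. This is only a sketch for single-GPU and CPU per-commit CUDA suites; `choose_suite` and its parameters are hypothetical names, and multi-GPU suites are left out because they should be picked by hand when a test really needs them.

```python
def choose_suite(needs_gpu=True, needs_h100=False, jit_kernel=None):
    """Pick the lightest per-commit CUDA suite for a CPU or single-GPU test.

    jit_kernel: None, "unit" (correctness tests), or "benchmark".
    """
    if jit_kernel == "unit":
        return "stage-b-kernel-unit-1-gpu-large"
    if jit_kernel == "benchmark":
        return "stage-b-kernel-benchmark-1-gpu-large"
    if not needs_gpu:
        return "stage-a-test-cpu"
    if needs_h100:
        # H100 memory or Hopper-only features
        return "stage-b-test-1-gpu-large"
    # Default choice for most small GPU tests
    return "stage-b-test-1-gpu-small"


print(choose_suite())
print(choose_suite(needs_gpu=False))
```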
 ---
 ## Test File Templates
 ### Unit Tests (no server / engine launch)
 See `test/registered/unit/README.md` for quick-start and rules. Unit tests live in `test/registered/unit/`, mirroring `python/sglang/srt/`:
 ```python
 """Unit tests for srt/<module>"""
 import unittest
 from unittest.mock import MagicMock, patch
 from sglang.srt.<module> import TargetClass
 from sglang.test.ci.ci_register import register_cpu_ci
 from sglang.test.test_utils import CustomTestCase
 register_cpu_ci(est_time=5, suite="stage-a-test-cpu")
 # Prefer CPU. Only use register_cuda_ci when the test truly needs a GPU.
 class TestTargetClass(CustomTestCase):
    def test_basic_behavior(self):
        obj = TargetClass(...)
        self.assertEqual(obj.method(), expected)
    @patch("sglang.srt.<module>.some_dependency")
    def test_with_mock(self, mock_dep):
        mock_dep.return_value = MagicMock()
        # test logic with dependency mocked
        ...
 if __name__ == "__main__":
    unittest.main()
 ```
 Use `unittest.mock.patch` / `MagicMock` to mock dependencies and isolate the logic under test. If the module transitively imports GPU-only packages (e.g. `sgl_kernel`), they can be stubbed so the test runs on CPU CI. See `test/registered/unit/README.md` for details and examples.
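One way to stub a GPU-only package is to place a fake module in `sys.modules` for the duration of a test. The sketch below assumes a hypothetical package name, `fake_gpu_only_pkg`, with a hypothetical `fused_op` function. It uses plain `unittest.TestCase` to stay self-contained, and the README referenced above remains the authoritative guide for the project's own conventions.

```python
import importlib
import sys
import types
import unittest
from unittest.mock import MagicMock, patch


class TestWithStubbedGpuPackage(unittest.TestCase):
    def test_import_succeeds_without_gpu_package(self):
        # "fake_gpu_only_pkg" is a hypothetical stand-in for a GPU-only dependency.
        stub = types.ModuleType("fake_gpu_only_pkg")
        stub.fused_op = MagicMock(return_value=42)
        with patch.dict(sys.modules, {"fake_gpu_only_pkg": stub}):
            # Any import of the package inside this block resolves to the stub.
            module = importlib.import_module("fake_gpu_only_pkg")
            self.assertEqual(module.fused_op(1, 2), 42)
        # patch.dict restores sys.modules when the block exits.
        self.assertNotIn("fake_gpu_only_pkg", sys.modules)
```

Scoping the stub with `patch.dict` keeps it from leaking into other tests that run in the same process.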
 **Quality bar** — test real logic (validation boundaries, state transitions, error paths, branching, etc.). Skip tests that just verify Python itself works (e.g., "does calling an abstract method raise `NotImplementedError`?", "does a dataclass store the field I assigned?"). Consolidate repetitive patterns into parameterized tests. No production code changes in test PRs.
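As an illustration of consolidating repetitive cases, the sketch below uses `unittest`'s built-in `subTest`. The function under test, `clamp_max_new_tokens`, is a hypothetical helper invented for this example and does not exist in SGLang.

```python
import unittest


def clamp_max_new_tokens(value, limit=4096):
    """Hypothetical validation helper used only for this illustration."""
    if value <= 0:
        raise ValueError("max_new_tokens must be positive")
    return min(value, limit)


class TestClampMaxNewTokens(unittest.TestCase):
    def test_valid_values(self):
        # Boundary cases share one test body instead of three near-identical methods.
        cases = [(1, 1), (4096, 4096), (4097, 4096)]
        for value, expected in cases:
            with self.subTest(value=value):
                self.assertEqual(clamp_max_new_tokens(value), expected)

    def test_invalid_values(self):
        for value in (0, -1):
            with self.subTest(value=value):
                with self.assertRaises(ValueError):
                    clamp_max_new_tokens(value)
```

Each `subTest` reports its own failure with the offending `value`, so one failing case does not hide the others.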
 ### E2E test (small model, server needed)
 ```python
 import unittest
 import requests
 from sglang.srt.utils import kill_process_tree
 from sglang.test.ci.ci_register import register_cuda_ci
 from sglang.test.test_utils import (
    DEFAULT_SMALL_MODEL_NAME_FOR_TEST,
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
    DEFAULT_URL_FOR_TEST,
    CustomTestCase,
    popen_launch_server,
 )
 register_cuda_ci(est_time=60, suite="stage-b-test-1-gpu-small")
 class TestMyFeature(CustomTestCase):
    @classmethod
    def setUpClass(cls):
        cls.model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
            other_args=["--arg1", "value1"],  # feature-specific args
        )
    @classmethod
    def tearDownClass(cls):
        if hasattr(cls, "process") and cls.process:
            kill_process_tree(cls.process.pid)
    def test_basic_functionality(self):
        response = requests.post(
            self.base_url + "/generate",
            json={"text": "Hello", "sampling_params": {"max_new_tokens": 32}},
        )
        self.assertEqual(response.status_code, 200)
 if __name__ == "__main__":
    unittest.main(verbosity=3)
 ```
 ### E2E test (8B model, server needed, performance)
 ```python
 import time
 import unittest
 import requests
 from sglang.srt.utils import kill_process_tree
 from sglang.test.ci.ci_register import register_cuda_ci
 from sglang.test.test_utils import (
    DEFAULT_MODEL_NAME_FOR_TEST,
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
    DEFAULT_URL_FOR_TEST,
    CustomTestCase,
    popen_launch_server,
 )
 register_cuda_ci(est_time=300, suite="stage-b-test-1-gpu-large")
 class TestMyFeaturePerf(CustomTestCase):
    @classmethod
    def setUpClass(cls):
        cls.model = DEFAULT_MODEL_NAME_FOR_TEST
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
        )
    @classmethod
    def tearDownClass(cls):
        if hasattr(cls, "process") and cls.process:
            kill_process_tree(cls.process.pid)
    def test_latency(self):
        start = time.perf_counter()
        response = requests.post(
            self.base_url + "/generate",
            json={"text": "Hello", "sampling_params": {"max_new_tokens": 128}},
        )
        elapsed = time.perf_counter() - start
        self.assertEqual(response.status_code, 200)
        self.assertLess(elapsed, 5.0, "Latency exceeded threshold")
 if __name__ == "__main__":
    unittest.main(verbosity=3)
 ```
 ---
 ## Server Fixture Reuse
 For tests that only need a standard server, inherit from `DefaultServerBase` and override class attributes:
 ```python
 from sglang.test.server_fixtures.default_fixture import DefaultServerBase
 from sglang.test.test_utils import DEFAULT_SMALL_MODEL_NAME_FOR_TEST
 class TestMyFeature(DefaultServerBase):
    model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST
    other_args = ["--enable-my-feature"]
    def test_something(self):
        ...
 ```
 Available fixtures in `python/sglang/test/server_fixtures/`:
 | Fixture | Use case |
 |---------|----------|
 | `DefaultServerBase` | Standard single-server tests |
 | `EagleServerBase` | EAGLE speculative decoding |
 | `PDDisaggregationServerBase` | Disaggregated prefill/decode |
 | `MMMUServerBase` | Multimodal VLM tests |
 ---
 ## CI Registration
 Every CI-discovered test file must call a registration function at module level:
 ```python
 from sglang.test.ci.ci_register import (
    register_cuda_ci,
    register_amd_ci,
    register_cpu_ci,
    register_npu_ci,
 )
 # Per-commit test (small 1-gpu, runs on 5090)
 register_cuda_ci(est_time=80, suite="stage-b-test-1-gpu-small")
 # Per-commit test (large 1-gpu, runs on H100)
 register_cuda_ci(est_time=120, suite="stage-b-test-1-gpu-large")
 # Nightly-only test
 register_cuda_ci(est_time=200, suite="nightly-1-gpu", nightly=True)
 # Multi-backend test (only when testing backend-specific code paths)
 register_cuda_ci(est_time=80, suite="stage-a-test-1-gpu-small")
 register_amd_ci(est_time=120, suite="stage-a-test-1-gpu-small-amd")
 register_npu_ci(est_time=400, suite="nightly-8-npu-a3", nightly=True)
 # Temporarily disabled test
 register_cuda_ci(est_time=80, suite="stage-b-test-1-gpu-small", disabled="flaky - see #12345")
 ```
 Parameters:
 - `est_time`: estimated runtime in seconds (used for CI partitioning)
 - `suite`: which CI suite to run in (see suite tables above)
 - `nightly=True`: for nightly-only tests (default `False` = per-commit)
 - `disabled="reason"`: temporarily disable with explanation
 **Key principle**: Only add `register_amd_ci` / `register_npu_ci` when the test exercises backend-specific code paths. Common E2E tests just need `register_cuda_ci` — duplicating across backends wastes CI time.
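To see why an accurate `est_time` matters for partitioning, consider a greedy balancing scheme. The sketch below is an assumption for illustration only: the actual algorithm in `run_suite.py` is not shown in this document and may differ, and `partition_by_est_time` is a hypothetical name.

```python
import heapq


def partition_by_est_time(tests, num_partitions):
    """Greedy balancing: give the longest remaining test to the least-loaded partition.

    tests: list of (filename, est_time_seconds) pairs.
    Returns a list of partitions, each a list of filenames.
    """
    partitions = [[] for _ in range(num_partitions)]
    # Heap entries are (total estimated seconds, partition index).
    heap = [(0, index) for index in range(num_partitions)]
    heapq.heapify(heap)
    for filename, est_time in sorted(tests, key=lambda item: -item[1]):
        load, index = heapq.heappop(heap)
        partitions[index].append(filename)
        heapq.heappush(heap, (load + est_time, index))
    return partitions


tests = [("test_a.py", 300), ("test_b.py", 120), ("test_c.py", 80), ("test_d.py", 60)]
print(partition_by_est_time(tests, 2))
```

Under this scheme, a test whose real runtime far exceeds its `est_time` makes one partition much slower than the others, which is why the estimate should be measured rather than guessed.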
 ### JIT Kernel Registration
 JIT kernel files live outside `test/registered/` but still use registration:
 ```python
 from sglang.test.ci.ci_register import register_cuda_ci
 # Correctness tests in python/sglang/jit_kernel/tests/
 register_cuda_ci(est_time=30, suite="stage-b-kernel-unit-1-gpu-large")
 register_cuda_ci(est_time=120, suite="stage-b-kernel-unit-8-gpu-h200")
 # Benchmarks in python/sglang/jit_kernel/benchmark/
 register_cuda_ci(est_time=6, suite="stage-b-kernel-benchmark-1-gpu-large")
 # Optional nightly registration
 register_cuda_ci(est_time=120, suite="nightly-kernel-1-gpu", nightly=True)
 register_cuda_ci(est_time=120, suite="nightly-kernel-8-gpu-h200", nightly=True)
 ```
 Keep `est_time` and `suite` as **literal values**: `run_suite.py` collects them by AST parsing, which reads the source statically rather than evaluating expressions.
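A rough sketch of AST-based collection shows why literals are required. It uses only the standard-library `ast` module; `collect_registrations` is a hypothetical function and is not the code in `run_suite.py`, which may handle more cases or report errors differently.

```python
import ast

REGISTER_FUNCS = {"register_cuda_ci", "register_amd_ci", "register_cpu_ci", "register_npu_ci"}


def collect_registrations(source):
    """Return [(func_name, {kwarg: literal_value})] for module-level register_*_ci calls."""
    found = []
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Expr) and isinstance(node.value, ast.Call)):
            continue
        call = node.value
        if not (isinstance(call.func, ast.Name) and call.func.id in REGISTER_FUNCS):
            continue
        kwargs = {}
        for keyword in call.keywords:
            # ast.literal_eval raises ValueError for anything that is not a literal.
            kwargs[keyword.arg] = ast.literal_eval(keyword.value)
        found.append((call.func.id, kwargs))
    return found


literal = 'register_cuda_ci(est_time=80, suite="stage-b-test-1-gpu-small")'
print(collect_registrations(literal))

computed = 'BASE = 40\nregister_cuda_ci(est_time=BASE * 2, suite="stage-b-test-1-gpu-small")'
try:
    collect_registrations(computed)
except ValueError as error:
    print("non-literal argument rejected:", error)
```

Because the file is parsed and never executed, an expression such as `BASE * 2` has no value the collector can read.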
 ---
 ## Test Placement
 ```
 test/
 ├── registered/          # CI tests (auto-discovered by run_suite.py)
 │   ├── unit/            # No server / engine launch (see test/registered/unit/README.md)
 │   ├── kernels/         # CUDA kernel correctness (no server, GPU required)
 │   ├── sampling/        # test_penalty.py, test_sampling_params.py ...
 │   ├── sessions/        # test_session_control.py ...
 │   ├── openai_server/   # basic/, features/, validation/ ...
 │   ├── spec/            # eagle/, utils/ ...
 │   ├── models/          # model-specific accuracy tests
 │   ├── perf/            # performance benchmarks
 │   └── <category>/      # create new category if needed
 ├── manual/              # Non-CI: debugging, one-off, manual verification
 └── run_suite.py         # CI runner (scans registered/ plus jit_kernel test/benchmark files)
 python/sglang/jit_kernel/
 ├── tests/               # JIT kernel correctness tests (CI-discovered by test/run_suite.py)
 └── benchmark/           # JIT kernel benchmarks (CI-discovered by test/run_suite.py)
 ```
 **Decision rule** (see also `test/registered/README.md`):
 - Component logic, no server → `registered/unit/`
 - JIT kernel correctness / benchmarks → `python/sglang/jit_kernel/tests/` or `python/sglang/jit_kernel/benchmark/`
 - Other kernel correctness → `registered/kernels/`
 - Server needed → `registered/<category>/`
 - Local debugging → `manual/`
 ---
 ## Eval Accuracy Mixins
 **Design philosophy**: Most test files don't care about eval logic — they only need a "does this feature break model output quality?" sanity check. The mixin pattern separates **what to test** (threshold) from **how to test** (run_eval, assertions, CI summary). Test classes declare thresholds as class attributes; the mixin provides the `test_*` method. Override when you need extra assertions (e.g. EAGLE accept length).
 Available mixins in `python/sglang/test/kits/eval_accuracy_kit.py`: `MMLUMixin`, `HumanEvalMixin`, `MGSMEnMixin`, `GSM8KMixin`. Can be combined freely. Read the source for attrs and defaults.
 ```python
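 from sglang.test.kits.eval_accuracy_kit import MMLUMixin
 from sglang.test.test_utils import CustomTestCase
 class TestMyFeature(CustomTestCase, MMLUMixin):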
    mmlu_score_threshold = 0.65
    mmlu_num_examples = 64
    mmlu_num_threads = 32
    # test_mmlu is inherited — no code needed
 ```
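The mechanics of the pattern can be shown without any SGLang imports. The sketch below is a self-contained illustration: `FakeEvalMixin`, `run_fake_eval`, and the hard-coded score are hypothetical stand-ins, not the implementation in `eval_accuracy_kit.py`.

```python
import unittest


class FakeEvalMixin:
    """Hypothetical mixin: subclasses set the threshold, the mixin owns the test method."""

    fake_score_threshold = 0.5  # default; subclasses override it as a class attribute

    def run_fake_eval(self):
        # Stand-in for a real eval run against a launched server.
        return 0.7

    def test_fake_eval(self):
        score = self.run_fake_eval()
        self.assertGreaterEqual(score, self.fake_score_threshold)


class TestMyFeatureQuality(unittest.TestCase, FakeEvalMixin):
    fake_score_threshold = 0.65
    # test_fake_eval is inherited from the mixin and discovered by unittest
```

To add extra assertions, a subclass can define its own `test_fake_eval` and call `super().test_fake_eval()` first.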
 ---
 ## Key Utilities
 ```python
 from sglang.test.test_utils import (
    CustomTestCase,              # base class with retry logic
    popen_launch_server,         # launch server subprocess
    DEFAULT_URL_FOR_TEST,        # auto-configured base URL
    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,  # 600s default
    run_bench_serving,           # benchmark helper (launch + bench)
 )
 from sglang.srt.utils import kill_process_tree  # cleanup server
 ```
 ---
 ## Checklist
 Before submitting a test:
 - [ ] Inherits from `CustomTestCase` (not `unittest.TestCase`)
 - [ ] Has `register_*_ci(...)` call at module level
 - [ ] Placed in `test/registered/<category>/`, unless this is a JIT kernel test/benchmark
 - [ ] JIT kernel work: files live in `python/sglang/jit_kernel/tests/` or `python/sglang/jit_kernel/benchmark/`
 - [ ] Backend-independent tests: `register_cuda_ci` only + smallest model
 - [ ] Logic that doesn't need a server / engine launch → unit test in `registered/unit/` (see Unit Tests section)
 - [ ] `setUpClass` launches server, `tearDownClass` kills it (if server-based)
 - [ ] `tearDownClass` is defensive — uses `hasattr`/null checks before accessing resources that may not have been allocated
 - [ ] Has `if __name__ == "__main__": unittest.main()`
 - [ ] `est_time` is reasonable (measure locally)
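For the last checklist item, one simple way to measure a test class locally is to time a `unittest` run. This is a sketch with hypothetical names (`measure_est_time`, `TestSleep`). Runtime on a developer machine can differ considerably from the CI runner, so treat the number as a starting point.

```python
import time
import unittest


def measure_est_time(test_case_class):
    """Run one TestCase class and return (wall-clock seconds, unittest result)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case_class)
    start = time.perf_counter()
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return time.perf_counter() - start, result


class TestSleep(unittest.TestCase):  # hypothetical placeholder test
    def test_sleep(self):
        time.sleep(0.05)


elapsed, result = measure_est_time(TestSleep)
print(f"measured {elapsed:.2f}s; consider rounding up for est_time")
```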
--- a/third_party/sglang/.codespellrc
+++ b/third_party/sglang/.codespellrc
@@ -0,0 +1,3 @@
 [codespell]
 ignore-words-list = ans, als, hel, boostrap, childs, te, vas, hsa, ment, cann, thi, makro, wil, rouge, PRIS
 skip = *.json,*.jsonl,*.patch,*.txt
--- a/third_party/sglang/.coveragerc
+++ b/third_party/sglang/.coveragerc
@@ -0,0 +1,16 @@
 [run]
 source = python/sglang/srt
 omit =
    */test/*
    */__pycache__/*
 [report]
 show_missing = true
 exclude_lines =
    pragma: no cover
    if __name__ == .__main__.:
    raise NotImplementedError
    if TYPE_CHECKING
 [html]
 directory = htmlcov
--- a/third_party/sglang/.devcontainer/Dockerfile
+++ b/third_party/sglang/.devcontainer/Dockerfile
@@ -0,0 +1,35 @@
 FROM lmsysorg/sglang:dev
 # Create non-root user with specified UID and GID
 # NOTE: Replace with your own UID and GID. This is a workaround from https://github.com/microsoft/vscode-remote-release/issues/49#issuecomment-489060908.
 ARG HOST_UID=1003
 ARG HOST_GID=1003
 RUN groupadd -g $HOST_GID devuser && \
    useradd -m -u $HOST_UID -g $HOST_GID -s /bin/zsh devuser
 # Give devuser sudo access
 RUN apt-get update && apt-get install -y sudo && \
    echo "devuser ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/devuser && \
    rm -rf /var/lib/apt/lists/* && \
    apt-get clean
 # Set up oh-my-zsh for devuser
 RUN cp -r /root/.oh-my-zsh /home/devuser/.oh-my-zsh && \
    cp /root/.zshrc /home/devuser/.zshrc && \
    cp /root/.vimrc /home/devuser/.vimrc && \
    cp /root/.tmux.conf /home/devuser/.tmux.conf && \
    sed -i 's|/root/.oh-my-zsh|/home/devuser/.oh-my-zsh|g' /home/devuser/.zshrc && \
    chown -R devuser:devuser /home/devuser/
 # Set workspace directory and ownership
 WORKDIR /sgl-workspace/sglang
 RUN chown -R devuser:devuser /sgl-workspace
 # Switch to devuser
 USER devuser
 # Install uv
 RUN curl -LsSf https://astral.sh/uv/install.sh | sh
 # Install rust
 RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
--- a/third_party/sglang/.devcontainer/devcontainer.json
+++ b/third_party/sglang/.devcontainer/devcontainer.json
@@ -0,0 +1,30 @@
 {
    "name": "sglang",
    "build": {
        "dockerfile": "Dockerfile"
    },
    "remoteUser": "devuser",
    "customizations": {
        "vscode": {
            "extensions": [
                // Python development
                "ms-python.python",
                "charliermarsh.ruff",
                // Rust development
                "rust-lang.rust-analyzer",
                "tamasfe.even-better-toml"
            ]
        }
    },
    "forwardPorts": [],
    "runArgs": [
        "--gpus",
        "all"
    ],
    // The two lines below ensure that local changes in the sglang
    // repo are automatically synced to the sglang pip package installed
    // in the dev docker container. You can remove / comment out these
    // two lines if you prefer to sync code changes manually.
    "workspaceMount": "source=${localWorkspaceFolder},target=/sgl-workspace/sglang,type=bind",
    "workspaceFolder": "/sgl-workspace/sglang"
 }
--- a/third_party/sglang/.dockerignore
+++ b/third_party/sglang/.dockerignore
@@ -0,0 +1 @@
 .gitignore
--- a/third_party/sglang/.github/CI_PERMISSIONS.json
+++ b/third_party/sglang/.github/CI_PERMISSIONS.json
--- a/third_party/sglang/.github/CODEOWNERS
+++ b/third_party/sglang/.github/CODEOWNERS
@@ -0,0 +1,74 @@
 .github @merrymercy @Fridge003 @ispobock @Kangyan-Zhou @bingxche
 /docker @Fridge003 @ispobock @HaiShaw @ishandhanani @yctseng0211
 /docker/npu.Dockerfile @ping1jing2 @iforgetmyname
 /python/pyproject.toml @merrymercy @Fridge003 @ispobock
 /python/sglang/jit_kernel @DarkSharpness @BBuf @celve @HydraQYH @yuan-luo
 /python/sglang/jit_kernel/diffusion @yingluosanqian @BBuf @mickqian
 /python/sglang/multimodal_gen @mickqian @yhyang201 @ping1jing2
 /python/sglang/multimodal_gen/runtime/cache @DefTruth
 /python/sglang/multimodal_gen/runtime/layers @mickqian @yhyang201 @BBuf @yingluosanqian @ping1jing2
 /python/sglang/multimodal_gen/runtime/models/dits @mickqian @yhyang201 @BBuf @yingluosanqian @ping1jing2
 /python/sglang/srt/batch_invariant_ops @Fridge003 @hebiao064
 /python/sglang/srt/compilation @hebiao064 @Oasis-Git
 /python/sglang/srt/constrained @hnyls2002 @DarkSharpness
 /python/sglang/srt/disaggregation @ByronHsu @hnyls2002 @ShangmingCai
 /python/sglang/srt/disaggregation/ascend @ping1jing2 @iforgetmyname
 /python/sglang/srt/distributed @yizhang2077 @merrymercy @ch-wan
 /python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py @ShangmingCai @stmatengss
 /python/sglang/srt/dllm @ClawSeven @btw616
 /python/sglang/srt/entrypoints @ispobock @CatherineSue @slin1237 @merrymercy @JustinTong0323
 /python/sglang/srt/entrypoints/engine_score_mixin.py @sundar24295s @chanh @fortunecookiee
 /python/sglang/srt/entrypoints/grpc_server.py @CatherineSue @slin1237
 /python/sglang/srt/entrypoints/openai/serving_score.py @sundar24295s @chanh @fortunecookiee
 /python/sglang/srt/eplb @fzyzcjy @ch-wan
 /python/sglang/srt/function_call @CatherineSue @JustinTong0323
 /python/sglang/srt/grpc @CatherineSue @slin1237
 /python/sglang/srt/hardware_backend/npu @ping1jing2 @iforgetmyname
 /python/sglang/srt/hardware_backend/npu/quantization @OrangeRedeng @TamirBaydasov @iforgetmyname
 /python/sglang/srt/layers @merrymercy @Ying1123 @Fridge003 @ispobock @HaiShaw @ch-wan @BBuf @Edwardf0t1
 /python/sglang/srt/layers/attention @merrymercy @Fridge003 @ispobock @Qiaolin-Yu @hebiao064 @HaiShaw
 /python/sglang/srt/layers/attention/fla @yizhang2077 @hebiao064 @yuan-luo
 /python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py @yizhang2077 @hebiao064 @hanming-lu @yuan-luo
 /python/sglang/srt/layers/attention/mamba @yizhang2077 @hebiao064
 /python/sglang/srt/layers/attention/nsa @1am9trash @hubertlu-tw @kkHuang-amd @HaiShaw @Fridge003 @hlu1 @rainj-me
 /python/sglang/srt/layers/attention/vision.py @mickqian @yuan-luo @yhyang201
 /python/sglang/srt/layers/quantization @ch-wan @BBuf @Edwardf0t1 @FlamingoPg @AniZpZ @HaiShaw @b8zhong
 /python/sglang/srt/layers/quantization/quark @kkHuang-amd @yichiche @hubertlu-tw @1am9trash @BowenBao
 /python/sglang/srt/lora @Ying1123 @Fridge003 @lifuhuang @yushengsu-thu
 /python/sglang/srt/managers @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
 /python/sglang/srt/managers/scheduler_pp_mixin.py @ShangmingCai @XucSh
 /python/sglang/srt/managers/tokenizer_manager_score_mixin.py @sundar24295s @chanh @fortunecookiee
 /python/sglang/srt/mem_cache @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann @hanming-lu @yizhang2077 @hzh0425 @ispobock
 /python/sglang/srt/model_executor @merrymercy @Ying1123 @hnyls2002 @Fridge003 @ispobock
 /python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py @hebiao064
 /python/sglang/srt/models/deepseek_common @Fridge003 @ispobock @fzyzcjy @ch-wan
 /python/sglang/srt/models/deepseek_v2.py @fzyzcjy @zhyncs @ispobock @ch-wan @merrymercy @Fridge003
 /python/sglang/srt/models/transformers.py @adarshxs
 /python/sglang/srt/multimodal @mickqian @JustinTong0323 @yhyang201 @yuan-luo
 /python/sglang/srt/observability @merrymercy @fzyzcjy @sufeng-buaa
 /python/sglang/srt/ray @Qiaolin-Yu @xyuzh
 /python/sglang/srt/speculative @Ying1123 @merrymercy @hnyls2002
 /sgl-kernel @ispobock @BBuf @yizhang2077 @merrymercy @FlamingoPg @HaiShaw
 /sgl-model-gateway @slin1237 @CatherineSue
 /sgl-model-gateway/benches @slin1237
 /sgl-model-gateway/bindings/python @CatherineSue @key4ng @slin1237
 /sgl-model-gateway/e2e_test @CatherineSue @key4ng
 /sgl-model-gateway/examples/wasm @slin1237
 /sgl-model-gateway/src/config @slin1237
 /sgl-model-gateway/src/core @slin1237
 /sgl-model-gateway/src/data_connector @key4ng
 /sgl-model-gateway/src/grpc_client @CatherineSue @slin1237
 /sgl-model-gateway/src/mcp @key4ng @slin1237
 /sgl-model-gateway/src/policies @slin1237 @ByronHsu
 /sgl-model-gateway/src/proto @CatherineSue @slin1237
 /sgl-model-gateway/src/protocols @CatherineSue @key4ng
 /sgl-model-gateway/src/reasoning_parser @CatherineSue
 /sgl-model-gateway/src/routers @CatherineSue @key4ng @slin1237
 /sgl-model-gateway/src/tokenizer @slin1237 @CatherineSue
 /sgl-model-gateway/src/tool_parser @slin1237 @CatherineSue
 /sgl-model-gateway/src/wasm @slin1237
 /test/registered/core/test_score_api.py @sundar24295s @chanh @fortunecookiee
 /benchmark/prefill_only/bench_score.py @sundar24295s @chanh @fortunecookiee
 /test/srt/ascend @ping1jing2 @iforgetmyname
 /test/srt/test_modelopt* @Edwardf0t1
--- a/third_party/sglang/.github/FOLDER_README.md
+++ b/third_party/sglang/.github/FOLDER_README.md
@@ -0,0 +1,12 @@
 # Maintenance Tools
 This folder contains tools and workflows for automating maintenance tasks.
 ## CI Permissions
 `CI_PERMISSIONS.json` defines the CI permissions granted to each user.
 Maintainers can directly edit the file to add entries with `"reason": "custom override"`.
 Maintainers can also run `update_ci_permission.py` to update it with some auto rules (e.g., top contributors in the last 90 days get full permissions).
 ## Others
 - `MAINTAINER.md` defines the code maintenance model.
--- a/third_party/sglang/.github/ISSUE_TEMPLATE/1-bug-report.yml
+++ b/third_party/sglang/.github/ISSUE_TEMPLATE/1-bug-report.yml
@@ -0,0 +1,35 @@
 name: 🐞 Bug report
 description: Report a bug to help us reproduce and fix it.
 title: "[Bug] "
 labels: ['Bug']
 body:
 - type: checkboxes
  attributes:
    label: Checklist
    options:
      - label: I searched related issues but found no solution.
      - label: The bug persists in the latest version.
      - label: Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
      - label: If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
      - label: Please use English. Otherwise, it will be closed.
 - type: textarea
  attributes:
    label: Describe the bug
    description: A clear, concise description of the bug.
  validations:
    required: true
 - type: textarea
  attributes:
    label: Reproduction
    description: Command/script run and model used.
    placeholder: Paste the command here.
  validations:
    required: true
 - type: textarea
  attributes:
    label: Environment
    description: Run `python3 -m sglang.check_env` and paste output here. Issues without this will be closed.
    placeholder: Paste environment output here.
  validations:
    required: true
--- a/third_party/sglang/.github/ISSUE_TEMPLATE/2-feature-request.yml
+++ b/third_party/sglang/.github/ISSUE_TEMPLATE/2-feature-request.yml
@@ -0,0 +1,23 @@
 name: 🚀 Feature request
 description: Suggest an idea for this project
 title: "[Feature] "
 body:
 - type: checkboxes
  attributes:
    label: Checklist
    options:
      - label: If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
      - label: Please use English. Otherwise, it will be closed.
 - type: textarea
  attributes:
    label: Motivation
    description: |
      Clearly and concisely describe the feature's motivation.
  validations:
    required: true
 - type: textarea
  attributes:
    label: Related resources
    description: |
      Provide official releases or third-party implementations if available.
--- a/third_party/sglang/.github/MAINTAINER.md
+++ b/third_party/sglang/.github/MAINTAINER.md
@@ -0,0 +1,154 @@
 # SGLang Code Maintenance Model
 This document describes the code maintenance model for the SGLang project.
 Since SGLang is a large project involving multiple organizations and hardware platforms, we designed this model with the following goals:
 - Ensure a responsive and smooth review process.
 - Allow for fast iteration, so maintainers can sometimes bypass flaky CI tests for important PRs.
 ## Role Descriptions
 There are four roles in this maintenance model. Some are custom roles, while others are predefined by GitHub.
 - **Merge Oncall**: The person who drives the PR merge process. They have strong area-specific expertise and uphold a high bar for code quality.
  - Permission: Merge PRs. Bypass branch protection rules if needed.
  - Responsibility: Shepherd the merge of PRs assigned to their area. Revert or hotfix any issues related to their merge (especially if they bypass).
 - **Codeowner**: The person who protects critical code. Without a bypass, each PR needs at least one Codeowner approval for each modified file protected by [CODEOWNERS](./CODEOWNERS). Please note that this role is not an honor but a significant responsibility because PRs cannot be merged without your approval (except when bypassed by a Merge Oncall).
  - Permission: Approve PRs, allowing them to be merged without a bypass.
  - Responsibility: Review PRs in a timely manner.
 - **Write**: A person with write permission to the SGLang repo.
  - Permission: Merge PRs if they have passed required tests and been approved by Codeowners. This role cannot bypass branch protection rules.
  - Responsibility: Review and merge PRs in a timely manner.
 - **CI Oncall**: A person who manages CI runners for specific hardware platforms.
  - Permission: Add CI runners.
  - Responsibility: Keep the CI runners up and running.
 __Note__: Difference between Merge Oncall and Codeowner
 - The Merge Oncall is an active role held by someone who actively tries to help merge PRs and can bypass CI if needed.
 - The Codeowner is a passive protection role provided by GitHub; it prevents accidental changes to critical code.
 - The list of Merge Oncalls is attached below. The list of Codeowners is in the [CODEOWNERS](./CODEOWNERS) file.
 __Note__: The permissions to trigger CI tests are defined separately according to these [rules](https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests).
 ## Pull Request Merge Process
 1. The author submits a pull request (PR) and fills out the PR checklist.
 2. A bot assigns this PR to a Merge Oncall and @-mentions them. At the same time, GitHub will automatically request reviews from Codeowners.
 3. Someone tags the PR with a `run-ci` label ([help](https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests)). Then the author can trigger CI by pushing new commits.
 4. The Merge Oncall coordinates the review (e.g., asking people to review) and approves the PR; the Codeowners also approve the PR. If the assigned Merge Oncall is not responsive, the author can ping other related Merge Oncalls and Reviewers in the list below.
 5. The code can now be merged:
   - **Ideal case:** For each modified file, one Codeowner has approved the PR. The PR has also passed the required CI tests. Then, anyone with write permission can merge the PR.
   - **Exception:** In cases where it is difficult to meet all requirements (due to flaky CI or slow responses), a Merge Oncall can bypass branch protection to merge the PR.
 If you run into any issues during the merge, you can discuss them in the [Slack channels](https://slack.sglang.io/): #pull-request, #ci-cd-build-release, #dev.
 ## The List of Merge Oncalls and Reviewers
 This section lists the oncalls for each module or feature.
 The format is @github-username (Slack username).
 ### Scheduler
 [@merrymercy](https://github.com/merrymercy) (Lianmin Zheng), [@hnyls2002](https://github.com/hnyls2002) (Liangsheng Yin), [@cctry](https://github.com/cctry) (Shiyang Chen)
 related files
 - python/sglang/srt/managers
 - python/sglang/srt/model_executor
 ### Diffusion
 [@mickqian](https://github.com/mickqian) (Mick), [@BBuf](https://github.com/BBuf) (BBuf)
 related files
 - python/sglang/multimodal_gen
 ### PD disaggregation
 [@ByronHsu](https://github.com/ByronHsu) (Byron Hsu), [@cctry](https://github.com/cctry) (Shiyang Chen), [@ShangmingCai](https://github.com/ShangmingCai) (Shangming Cai)
 related files
 - python/sglang/srt/disaggregation
 ### KV Cache
 [@ispobock](https://github.com/ispobock) (Ke Bao), [@xiezhq-hermann](https://github.com/xiezhq-hermann) (Zhiqiang Xie)
 related files
 - python/sglang/srt/mem_cache
 ### Parallelism
 [@ch-wan](https://github.com/ch-wan) (Cheng Wan), [@fzyzcjy](https://github.com/fzyzcjy) (Tom)
 related files
 - python/sglang/srt/eplb
 - python/sglang/srt/distributed
 - python/sglang/srt/layers/dp_attention.py
 ### Kernel
 [@BBuf](https://github.com/BBuf) (BBuf)
 related files
 - python/sglang/jit_kernel
 - sgl-kernel
 ### Speculative decoding
 [@hnyls2002](https://github.com/hnyls2002) (Liangsheng Yin), [@Qiaolin-Yu](https://github.com/Qiaolin-Yu) (Qiaolin Yu)
 related files
 - python/sglang/srt/speculative
 ### NV and model-specific optimizations
 [@Fridge003](https://github.com/Fridge003) (Baizhou Zhang), [@ishandhanani](https://github.com/ishandhanani) (Ishan Dhanani), [@Qiaolin-Yu](https://github.com/Qiaolin-Yu) (Qiaolin Yu)
 related files
 - python/sglang/srt/models
 - python/sglang/srt/layers/attention
 ### AMD optimizations
 [@HaiShaw](https://github.com/HaiShaw) (Henry HAI)
 ### NPU optimizations
 [@iforgetmyname](https://github.com/iforgetmyname) (Even Zhou)
 related files
 - python/sglang/srt/hardware_backend/npu
 ### CI, Release, Package
 [@Kangyan-Zhou](https://github.com/Kangyan-Zhou) (Kangyan Zhou), [@Fridge003](https://github.com/Fridge003) (Baizhou Zhang)
 related files
 - .github/workflows
 ### Router, API
 [@slin1237](https://github.com/slin1237) (Simo Lin)
 related files
 - sgl-model-gateway
 - python/sglang/srt/grpc
 - python/sglang/srt/entrypoints
 ### Other Notes
 There are currently many Merge Oncalls, mainly because the CI is flaky and CODEOWNERS is too coarse-grained.
 In the future, we hope the CI improves so that bypassing checks is rarely needed. At that point, most Merge Oncalls can be converted back to Write and CODEOWNERS.
 This list is based on the current situation. If you or someone you know is qualified and would like to take on more responsibility, please ping [Lianmin Zheng](https://github.com/merrymercy) and [Ying Sheng](https://github.com/Ying1123) in the Slack channel. They will start a nomination and internal review process.
 ## The List of CI Oncalls
 This section lists the oncalls for each hardware platform. The format is @github-username (Slack username).
 ### NVIDIA GPUs
 [@Kangyan-Zhou](https://github.com/Kangyan-Zhou) (Kangyan Zhou), [@ch-wan](https://github.com/ch-wan) (Cheng Wan), [@HanHan009527](https://github.com/HanHan009527) (hanhan), [@ishandhanani](https://github.com/ishandhanani) (Ishan Dhanani), [@ShangmingCai](https://github.com/ShangmingCai) (Shangming Cai), [@alisonshao](https://github.com/alisonshao) (Alison Shao)
 ### AMD GPUs
 [@saienduri](https://github.com/saienduri) (Sai Enduri), [@HaiShaw](https://github.com/HaiShaw) (Henry HAI)
 ### Intel CPU and XPU
 [@mingfeima](https://github.com/mingfeima) (Mingfei Ma), [@DiweiSun](https://github.com/DiweiSun) (Diwei Sun)
 ### Ascend NPUs
 [@iforgetmyname](https://github.com/iforgetmyname) (Even Zhou)
 This list is based on the current situation. If you or someone you know would like to donate machines for CI, the donor can serve as the CI oncall for those machines. Please ping [Lianmin Zheng](https://github.com/merrymercy) and [Ying Sheng](https://github.com/Ying1123) in the Slack channel. They will start a nomination and internal review process.
 ## CI Maintenance Mode
 When the CI is unhealthy (e.g., the scheduled pr-test on `main` is broken for consecutive runs), the project enters **CI Maintenance Mode** by opening [issue #21065](https://github.com/sgl-project/sglang/issues/21065). While active:
 - All PR CI runs are paused, except for PRs that carry the `bypass-maintenance` label. Resources are allocated to PRs that fix the CI.
 - **Merging non-CI-fix PRs is prohibited.** Only PRs that fix the CI may be merged. In severe cases, merge permissions may be revoked.
 Maintenance mode ends when `pr-test.yml` is all green on `main` and the issue is closed.
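The gating rules above can be sketched as a small pure function. This is only an illustrative Python model of the decision order used by the `check-maintenance` composite action (which is implemented in bash); the function name `maintenance_gate` and its parameters are hypothetical and do not exist in the repository.

```python
from typing import Iterable, Optional

# Label name taken from the check-maintenance action; everything else here is illustrative.
BYPASS_LABEL = "bypass-maintenance"


def maintenance_gate(
    issue_state: str,
    pr_labels: Optional[Iterable[str]] = None,
    bypass_on_main: bool = False,
) -> bool:
    """Return True if CI may proceed, False if maintenance mode blocks it.

    issue_state   : state of the maintenance issue ("OPEN", "CLOSED", or
                    "UNKNOWN" when the API lookup failed).
    pr_labels     : labels on the PR, or None when the run is not for a PR.
    bypass_on_main: models SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN == "true".
    """
    # 1. PR Test runs on main skip the gate entirely.
    if bypass_on_main:
        return True
    # 2. Fail-open: anything other than an open issue lets CI proceed.
    if issue_state != "OPEN":
        return True
    # 3. A PR with the bypass label is allowed through.
    if pr_labels is not None and BYPASS_LABEL in set(pr_labels):
        return True
    # 4. Otherwise maintenance mode blocks the run.
    return False
```

Note the fail-open choice in step 2: if the issue state cannot be determined, CI is allowed to run rather than being blocked by an API error.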
 ## Suspending Permissions
 If a Merge Oncall bypasses checks to merge a PR that breaks the `main` branch, merges a non-CI-fix PR during CI Maintenance Mode, or repeatedly breaks the CI, their privileges will be suspended for at least two days; the length of the suspension depends on the severity of the incident.
--- a/third_party/sglang/.github/actions/check-maintenance/action.yml
+++ b/third_party/sglang/.github/actions/check-maintenance/action.yml
@@ -0,0 +1,63 @@
 name: Check Maintenance Mode
 description: Blocks CI when maintenance mode is active (issue #21065 is open), unless the PR has the bypass-maintenance label, or env SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN=true (PR Test workflow on main only). Merging non-CI-fix PRs is prohibited during maintenance mode; in severe cases, merge permissions may be revoked.
 inputs:
  github-token:
    description: GitHub token for API access
    required: false
    default: ${{ github.token }}
 runs:
  using: composite
  steps:
    - name: Check maintenance mode
      shell: bash
      env:
        GH_TOKEN: ${{ inputs.github-token }}
      run: |
        MAINTENANCE_ISSUE=21065
        REPO="${{ github.repository }}"
        PR_NUMBER="${{ github.event.pull_request.number }}"
        # PR Test workflow only: scheduled runs and runs on main (dispatch / workflow_call) set this env
        if [[ "${SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN:-}" == "true" ]]; then
          echo "✅ PR Test on main branch; bypassing maintenance gate."
          exit 0
        fi
        # Check if maintenance issue is open (fail-open: if API errors, allow CI to proceed)
        ISSUE_STATE=$(gh issue view "$MAINTENANCE_ISSUE" --repo "$REPO" --json state --jq '.state' 2>/dev/null || echo "UNKNOWN")
        if [[ "$ISSUE_STATE" != "OPEN" ]]; then
          echo "✅ Maintenance mode is OFF. Proceeding with CI."
          exit 0
        fi
        # For PRs, check if bypass-maintenance label is present
        if [[ -n "$PR_NUMBER" ]]; then
          HAS_BYPASS=$(gh pr view "$PR_NUMBER" --repo "$REPO" --json labels --jq '[.labels[].name] | map(select(. == "bypass-maintenance")) | length' 2>/dev/null || echo "0")
          if [[ "$HAS_BYPASS" -gt 0 ]]; then
            echo "✅ PR #$PR_NUMBER has 'bypass-maintenance' label. Bypassing maintenance mode."
            exit 0
          fi
        fi
        MSG=$(printf "%s\n" \
          "## ⚠️ CI Maintenance Mode is Active" \
          "The CI infrastructure is currently under maintenance." \
          "All PR CI runs are paused until maintenance is complete." \
          "**Merging non-CI-fix PRs is prohibited during maintenance mode.** In severe cases, merge permissions may be revoked." \
          "You might also experience unexpected failures during this period." \
          "The team is working on the issue and will update the status as soon as possible." \
          "" \
          "What should you do?" \
          "- **Do NOT merge non-CI-fix PRs** until maintenance mode is lifted" \
          "- Check back later (~12 hours)" \
          "- Follow CI Maintenance Mode issue: https://github.com/$REPO/issues/$MAINTENANCE_ISSUE for status updates")
        echo "$MSG" >> "$GITHUB_STEP_SUMMARY"
        while IFS= read -r line; do
          echo "::error::$line"
        done <<< "$MSG"
        exit 1
--- a/third_party/sglang/.github/actions/check-stage-health/action.yml
+++ b/third_party/sglang/.github/actions/check-stage-health/action.yml
@@ -0,0 +1,50 @@
 name: Check Stage Health
 description: Fail fast if any job in the current workflow run has already failed. Auto-skips for scheduled runs.
 inputs:
  github-token:
    description: 'GitHub token for API calls'
    required: false
    default: ${{ github.token }}
 runs:
  using: composite
  steps:
    - name: Check stage health
      uses: actions/github-script@v7
      env:
        SKIP_STAGE_HEALTH_CHECK: ${{ env.SKIP_STAGE_HEALTH_CHECK }}
      with:
        github-token: ${{ inputs.github-token }}
        script: |
          // Skip when explicitly requested via env var (e.g. release branch cut)
          if (process.env.SKIP_STAGE_HEALTH_CHECK === 'true') {
            core.info('Skipping health check (SKIP_STAGE_HEALTH_CHECK=true)');
            return;
          }
          // Skip for scheduled runs — they should collect all failures, not fast-fail
          if (context.eventName === 'schedule') {
            core.info('Skipping health check for scheduled run');
            return;
          }
          const jobs = await github.paginate(github.rest.actions.listJobsForWorkflowRun, {
            owner: context.repo.owner,
            repo: context.repo.repo,
            run_id: context.runId,
            per_page: 100,
          });
          // Find jobs that failed from a real error, not from fast-fail cascade
          const rootCauseFailures = jobs.filter(j => {
            if (j.status !== 'completed' || j.conclusion !== 'failure') return false;
            // If the failing step is the health check, it's a cascade — skip it
            const failedStep = (j.steps || []).find(s => s.conclusion === 'failure');
            if (failedStep && (failedStep.name.includes('check-stage-health') || failedStep.name.includes('Check stage health'))) {
              return false;
            }
            return true;
          });
          if (rootCauseFailures.length > 0) {
            core.setFailed(`Fast-fail: skipping — root cause job(s): ${rootCauseFailures.map(j => j.name).join(', ')}`);
          }
--- a/third_party/sglang/.github/actions/upload-cuda-coredumps/action.yml
+++ b/third_party/sglang/.github/actions/upload-cuda-coredumps/action.yml
@@ -0,0 +1,27 @@
 name: Upload CUDA Coredumps
 description: Upload CUDA coredump files as artifacts and clean up the directory.
 inputs:
  artifact-suffix:
    description: Suffix appended to the artifact name (e.g. matrix partition id)
    required: false
    default: ""
  retention-days:
    description: Number of days to retain the artifact
    required: false
    default: "7"
 runs:
  using: composite
  steps:
    - name: Upload CUDA coredumps
      uses: actions/upload-artifact@v4
      with:
        name: cuda-coredumps-${{ github.job }}${{ inputs.artifact-suffix && format('-{0}', inputs.artifact-suffix) }}
        path: ${{ env.SGLANG_CUDA_COREDUMP_DIR || '/tmp/sglang_cuda_coredumps' }}/
        retention-days: ${{ inputs.retention-days }}
        if-no-files-found: ignore
    - name: Cleanup CUDA coredumps
      shell: bash
      run: rm -rf "${{ env.SGLANG_CUDA_COREDUMP_DIR || '/tmp/sglang_cuda_coredumps' }}"
--- a/third_party/sglang/.github/actions/wait-for-jobs/action.yml
+++ b/third_party/sglang/.github/actions/wait-for-jobs/action.yml
@@ -0,0 +1,177 @@
 name: Wait for Jobs
 description: Poll and wait for specified jobs in the current workflow run to complete
 inputs:
  stage-name:
    description: 'Human-readable stage name for log messages (e.g. "stage-a")'
    required: true
  jobs:
    description: |
      JSON array of job specs to wait for. Each element is either:
        - a string: exact job name (e.g. "stage-a-test-1-gpu-small")
        - an object { "prefix": "...", "expected_count": N }: for matrix jobs
    required: true
  max-wait-minutes:
    description: 'Maximum time to wait before timing out'
    required: false
    default: '240'
  poll-interval-seconds:
    description: 'Seconds between polling attempts'
    required: false
    default: '60'
  github-token:
    description: 'GitHub token for API calls'
    required: false
    default: ${{ github.token }}
 outputs:
  result:
    description: 'Overall result: success, failure, or timeout'
    value: ${{ steps.wait.outputs.result }}
 runs:
  using: composite
  steps:
    - name: Wait for jobs to complete
      id: wait
      uses: actions/github-script@v7
      env:
        INPUT_STAGE_NAME: ${{ inputs.stage-name }}
        INPUT_JOBS: ${{ inputs.jobs }}
        INPUT_MAX_WAIT_MINUTES: ${{ inputs.max-wait-minutes }}
        INPUT_POLL_INTERVAL_SECONDS: ${{ inputs.poll-interval-seconds }}
      with:
        github-token: ${{ inputs.github-token }}
        script: |
          const stageName = process.env.INPUT_STAGE_NAME;
          const jobSpecs = JSON.parse(process.env.INPUT_JOBS);
          const maxWaitMinutes = parseInt(process.env.INPUT_MAX_WAIT_MINUTES, 10);
          const pollIntervalSeconds = parseInt(process.env.INPUT_POLL_INTERVAL_SECONDS, 10);
          const maxAttempts = Math.ceil((maxWaitMinutes * 60) / pollIntervalSeconds);
          // Normalize job specs into a uniform format
          const normalizedSpecs = jobSpecs.map(spec => {
            if (typeof spec === 'string') {
              return { prefix: spec, expected_count: 1, exact: true };
            }
            return { ...spec, exact: false };
          });
          const totalExpectedJobs = normalizedSpecs.reduce((sum, s) => sum + s.expected_count, 0);
          const matchesSpec = (jobName, spec) => {
            if (spec.exact) {
              return jobName === spec.prefix;
            }
            return jobName === spec.prefix || jobName.startsWith(spec.prefix + ' (');
          };
          // Use ETag conditional requests to avoid consuming rate limit when nothing changed.
          // GitHub returns 304 Not Modified for unchanged data, which is FREE (no rate limit cost).
          let lastEtag = '';
          let lastJobs = null;
          let apiCalls = 0;
          let cachedCalls = 0;
          async function fetchJobs() {
            const url = `GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs`;
            const params = {
              owner: context.repo.owner,
              repo: context.repo.repo,
              run_id: context.runId,
              per_page: 100,
              headers: {},
            };
            if (lastEtag) {
              params.headers['if-none-match'] = lastEtag;
            }
            try {
              const response = await github.request(url, params);
              apiCalls++;
              const rateRemaining = response.headers['x-ratelimit-remaining'] || '?';
              const rateLimit = response.headers['x-ratelimit-limit'] || '?';
              console.log(`[rate-limit] ${rateRemaining}/${rateLimit} remaining (ETag: ${lastEtag ? 'sent' : 'none'}) | this session: ${apiCalls} paid, ${cachedCalls} free`);
              lastEtag = response.headers.etag || '';
              const jobs = response.data.jobs;
              // Handle pagination if >100 jobs
              // ETag only covers page 1, so invalidate it to avoid stale cache
              // when later pages change but page 1 doesn't.
              if (response.data.total_count > 100) {
                lastEtag = '';
                for (let page = 2; page <= Math.ceil(response.data.total_count / 100); page++) {
                  const { data: pageData } = await github.request(url, {
                    ...params,
                    page,
                    headers: {},
                  });
                  jobs.push(...pageData.jobs);
                }
              }
              lastJobs = jobs;
              return { jobs, cached: false };
            } catch (err) {
              if (err.status === 304 && lastJobs) {
                cachedCalls++;
                console.log(`[rate-limit] 304 Not Modified | this session: ${apiCalls} paid, ${cachedCalls} free`);
                return { jobs: lastJobs, cached: true };
              }
              throw err;
            }
          }
          for (let attempt = 0; attempt < maxAttempts; attempt++) {
            const { jobs, cached } = await fetchJobs();
            let allCompleted = true;
            let failedJobs = [];
            let completedCount = 0;
            let totalCount = 0;
            for (const spec of normalizedSpecs) {
              const matchingJobs = jobs.filter(job => matchesSpec(job.name, spec));
              for (const job of matchingJobs) {
                totalCount++;
                if (!cached) {
                  console.log(`${job.name}: status=${job.status}, conclusion=${job.conclusion}`);
                }
                if (job.status === 'completed') {
                  completedCount++;
                  if (job.conclusion !== 'success' && job.conclusion !== 'skipped') {
                    failedJobs.push(job.name);
                  }
                } else {
                  allCompleted = false;
                }
              }
              if (matchingJobs.length < spec.expected_count) {
                console.log(`${spec.prefix}: found ${matchingJobs.length}/${spec.expected_count} jobs (waiting for more)`);
                allCompleted = false;
              }
            }
            console.log(`[${stageName}] Progress: ${completedCount}/${totalCount} jobs completed (expected ${totalExpectedJobs})${cached ? ' (cached, no rate limit cost)' : ''}`);
            // Fail fast if any jobs failed
            if (failedJobs.length > 0) {
              core.setOutput('result', 'failure');
              core.setFailed(`${stageName} jobs failed: ${failedJobs.join(', ')}`);
              return;
            }
            if (allCompleted && totalCount >= totalExpectedJobs) {
              core.setOutput('result', 'success');
              return;
            }
            console.log(`Waiting ${pollIntervalSeconds}s... (attempt ${attempt + 1}/${maxAttempts})`);
            await new Promise(resolve => setTimeout(resolve, pollIntervalSeconds * 1000));
          }
          core.setFailed(`Timeout waiting for ${stageName} jobs`);
          core.setOutput('result', 'timeout');
--- a/third_party/sglang/.github/audit_permission.py
+++ b/third_party/sglang/.github/audit_permission.py
@@ -0,0 +1,411 @@
 """
 Audit GitHub repository collaborators with elevated access.
 This script will:
 1. Fetch all collaborators with triage permission or higher on this repo.
 2. Show their GitHub username, display name (nickname), and role (e.g., admin,
   maintain, custom org role, write, triage).
 3. Show their last activity related to this repo (last commit, last issue,
   last pull request) as YYYY-MM-DD dates. A "last_activity_date" column precedes these three breakdown columns in the CSV.
 4. Show activity on other repos: repos touched via public events in the last 90 days by default (Push, PR, Issues, etc.), sorted by number of events.
 5. Write results to a CSV sorted by role (admin, maintain, custom org role, write, triage) and then by last activity date (most recent first).
 Usage:
    export GH_TOKEN="your_github_token"
    python3 audit_permission.py [--output path] [--repo owner/name]
 Requires: requests, and a token with permission to list collaborators (push+
 access to the repo).
 """
 from __future__ import annotations
 import argparse
 import csv
 import os
 import sys
 import time
 from collections import Counter
 from datetime import datetime, timedelta, timezone
 from typing import Any
 try:
    import requests
 except ImportError:
    requests = None  # type: ignore
 DEFAULT_OWNER = "sgl-project"
 DEFAULT_NAME = "sglang"
 HEADERS: dict[str, str] = {}
 def _request(
    method: str,
    url: str,
    *,
    params: dict[str, Any] | None = None,
    max_retries: int = 3,
 ) -> requests.Response:
    if requests is None:
        raise RuntimeError("Install the requests package: pip install requests")
    for attempt in range(max_retries):
        r = requests.request(method, url, headers=HEADERS, params=params, timeout=60)
        if r.status_code in (403, 429) and "rate limit" in (r.text or "").lower():
            reset = r.headers.get("X-RateLimit-Reset")
            wait = 60
            if reset:
                try:
                    wait = max(1, int(reset) - int(time.time()) + 2)
                except ValueError:
                    pass
            print(f"Rate limited; sleeping {wait}s...", file=sys.stderr)
            time.sleep(min(wait, 3600))
            continue
        return r
    return r
 def paginate_list(url: str, params: dict[str, Any] | None = None) -> list[Any]:
    out: list[Any] = []
    next_url: str | None = url
    next_params = params
    while next_url:
        r = _request("GET", next_url, params=next_params)
        next_params = None
        if r.status_code != 200:
            print(
                f"Error {r.status_code} GET {next_url}: {r.text[:500]}",
                file=sys.stderr,
            )
            break
        data = r.json()
        if isinstance(data, list):
            out.extend(data)
        else:
            break
        next_url = None
        link = r.headers.get("Link", "")
        for part in link.split(", "):
            if 'rel="next"' in part:
                start = part.find("<") + 1
                end = part.find(">")
                if start > 0 and end > start:
                    next_url = part[start:end]
                break
    return out
 def collaborator_role(collab: dict[str, Any]) -> str:
    role_name = collab.get("role_name")
    if isinstance(role_name, str) and role_name.strip():
        return role_name.strip()
    perms = collab.get("permissions") or {}
    if perms.get("admin"):
        return "admin"
    if perms.get("maintain"):
        return "maintain"
    if perms.get("push"):
        return "write"
    if perms.get("triage"):
        return "triage"
    return "read"
 def has_write_plus(collab: dict[str, Any]) -> bool:
    perms = collab.get("permissions") or {}
    return bool(
        perms.get("admin")
        or perms.get("maintain")
        or perms.get("push")
        or perms.get("triage")
    )
 def role_sort_tier(collab: dict[str, Any]) -> int:
    """Sort order: admin (0), maintain (1), custom org role (2), write (3), triage (4)."""
    rn = collab.get("role_name")
    if isinstance(rn, str) and rn.strip():
        k = rn.strip().lower()
        if k == "admin":
            return 0
        if k == "maintain":
            return 1
        if k == "write":
            return 3
        if k == "triage":
            return 4
        if k == "read":
            return 5
        return 2
    perms = collab.get("permissions") or {}
    if perms.get("admin"):
        return 0
    if perms.get("maintain"):
        return 1
    if perms.get("push"):
        return 3
    if perms.get("triage"):
        return 4
    return 5
 def fetch_display_name(login: str) -> str:
    url = f"https://api.github.com/users/{login}"
    r = _request("GET", url)
    if r.status_code != 200:
        return ""
    data = r.json()
    if not isinstance(data, dict):
        return ""
    n = data.get("name")
    return n.strip() if isinstance(n, str) else ""
 def parse_github_ts(s: str) -> datetime | None:
    if not s:
        return None
    s = s.replace("Z", "+00:00")
    try:
        return datetime.fromisoformat(s)
    except ValueError:
        return None
 def iso_timestamp_to_ymd(iso: str | None) -> str:
    if not iso:
        return ""
    p = parse_github_ts(iso)
    if not p:
        return ""
    return p.date().isoformat()
 def max_date_ymd(*iso_dates: str | None) -> str:
    best: datetime | None = None
    for d in iso_dates:
        p = parse_github_ts(d or "")
        if p and (best is None or p > best):
            best = p
    return best.date().isoformat() if best else ""
 def parse_ymd(s: str) -> datetime | None:
    if not s:
        return None
    try:
        return datetime.strptime(s, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    except ValueError:
        return None
 def last_commit_date(owner: str, repo: str, login: str) -> str | None:
    url = f"https://api.github.com/repos/{owner}/{repo}/commits"
    r = _request("GET", url, params={"author": login, "per_page": 1})
    if r.status_code != 200:
        return None
    data = r.json()
    if not isinstance(data, list) or not data:
        return None
    commit = data[0].get("commit") or {}
    c = commit.get("committer") or commit.get("author") or {}
    d = c.get("date")
    return d if isinstance(d, str) else None
 def search_repo_item(
    owner: str, repo: str, login: str, kind: str
 ) -> dict[str, Any] | None:
    q = f"repo:{owner}/{repo} is:{kind} author:{login}"
    url = "https://api.github.com/search/issues"
    r = _request(
        "GET",
        url,
        params={"q": q, "sort": "updated", "order": "desc", "per_page": 1},
    )
    if r.status_code != 200:
        return None
    payload = r.json()
    items = payload.get("items")
    if not items:
        return None
    return items[0] if isinstance(items[0], dict) else None
 def last_issue_pr_dates(
    owner: str, repo: str, login: str
 ) -> tuple[str | None, str | None]:
    issue = search_repo_item(owner, repo, login, "issue")
    pr = search_repo_item(owner, repo, login, "pr")
    issue_dt = None
    pr_dt = None
    if issue:
        issue_dt = issue.get("updated_at") or issue.get("created_at")
        if not isinstance(issue_dt, str):
            issue_dt = None
    if pr:
        pr_dt = pr.get("updated_at") or pr.get("created_at")
        if not isinstance(pr_dt, str):
            pr_dt = None
    return issue_dt, pr_dt
 def other_repos_activity_column(
    login: str, owner: str, repo: str, days: int = 90
 ) -> str:
    """Repos other than this one touched in the window, sorted by event count (desc)."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    full = f"{owner}/{repo}"
    counts: Counter[str] = Counter()
    url: str | None = f"https://api.github.com/users/{login}/events/public"
    params: dict[str, Any] = {"per_page": 100}
    while url:
        r = _request("GET", url, params=params)
        params = {}
        if r.status_code != 200:
            break
        events = r.json()
        if not isinstance(events, list):
            break
        oldest_in_page: datetime | None = None
        for ev in events:
            if not isinstance(ev, dict):
                continue
            created = parse_github_ts(ev.get("created_at") or "")
            if created:
                if oldest_in_page is None or created < oldest_in_page:
                    oldest_in_page = created
            if created and created < cutoff:
                continue
            rinfo = ev.get("repo")
            name = None
            if isinstance(rinfo, dict):
                name = rinfo.get("name")
            if isinstance(name, str) and name and name != full:
                counts[name] += 1
        next_url = None
        link = r.headers.get("Link", "")
        for part in link.split(", "):
            if 'rel="next"' in part:
                s, e = part.find("<") + 1, part.find(">")
                if s > 0 and e > s:
                    next_url = part[s:e]
                break
        if oldest_in_page and oldest_in_page < cutoff:
            break
        url = next_url
        if not events:
            break
    ordered = sorted(counts.items(), key=lambda x: (-x[1], x[0]))
    return ";".join(f"{n}:{c}" for n, c in ordered)
 def main() -> None:
    parser = argparse.ArgumentParser(description="Audit repo collaborator permissions.")
    parser.add_argument(
        "--repo",
        default=f"{DEFAULT_OWNER}/{DEFAULT_NAME}",
        help=f"owner/name (default: {DEFAULT_OWNER}/{DEFAULT_NAME})",
    )
    parser.add_argument(
        "--output",
        "-o",
        default=os.path.join(os.path.dirname(__file__), "permission_audit.csv"),
        help="Output CSV path",
    )
    parser.add_argument(
        "--events-days",
        type=int,
        default=90,
        help="Window for other-repo activity via public events",
    )
    args = parser.parse_args()
    if "/" not in args.repo:
        print("Error: --repo must be owner/name", file=sys.stderr)
        sys.exit(1)
    owner, name = args.repo.split("/", 1)
    gh_token = os.getenv("GH_TOKEN")
    if not gh_token:
        print("Error: GH_TOKEN environment variable is not set.", file=sys.stderr)
        sys.exit(1)
    global HEADERS
    HEADERS = {
        "Authorization": f"Bearer {gh_token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    collab_url = f"https://api.github.com/repos/{owner}/{name}/collaborators"
    print(f"Fetching collaborators for {owner}/{name}...", file=sys.stderr)
    collaborators = paginate_list(
        collab_url, params={"per_page": 100, "affiliation": "all"}
    )
    rows: list[dict[str, Any]] = []
    elevated = [c for c in collaborators if isinstance(c, dict) and has_write_plus(c)]
    print(
        f"Found {len(elevated)} collaborators with admin/maintain/write/triage.",
        file=sys.stderr,
    )
    for i, col in enumerate(elevated, start=1):
        login = col.get("login")
        if not isinstance(login, str):
            continue
        print(f"  [{i}/{len(elevated)}] {login}", file=sys.stderr)
        role = collaborator_role(col)
        nickname = fetch_display_name(login)
        cd = last_commit_date(owner, name, login)
        issue_dt, pr_dt = last_issue_pr_dates(owner, name, login)
        last_act_ymd = max_date_ymd(cd, issue_dt, pr_dt)
        others = other_repos_activity_column(login, owner, name, days=args.events_days)
        rows.append(
            {
                "_role_tier": role_sort_tier(col),
                "github_username": login,
                "nickname": nickname,
                "role": role,
                "last_activity_date": last_act_ymd,
                "last_commit_date": iso_timestamp_to_ymd(cd),
                "last_issue_date": iso_timestamp_to_ymd(issue_dt),
                "last_pr_date": iso_timestamp_to_ymd(pr_dt),
                "other_repos_90d": others,
            }
        )
    def sort_key(r: dict[str, Any]) -> tuple[int, float]:
        tier = r["_role_tier"]
        act = parse_ymd(r.get("last_activity_date") or "")
        ts = act.timestamp() if act else 0.0
        return (tier, -ts)
    rows.sort(key=sort_key)
    fieldnames = [
        "github_username",
        "nickname",
        "role",
        "last_activity_date",
        "last_commit_date",
        "last_issue_date",
        "last_pr_date",
        "other_repos_90d",
    ]
    for r in rows:
        del r["_role_tier"]
    with open(args.output, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)
    print(f"Wrote {len(rows)} rows to {args.output}", file=sys.stderr)
 if __name__ == "__main__":
    main()
--- a/third_party/sglang/.github/labeler.yml
+++ b/third_party/sglang/.github/labeler.yml
@@ -0,0 +1,122 @@
 # Configuration for the GitHub Labeler action
 # Automatically adds labels to PRs based on the files changed
 # Router specific (Rust code in sgl-model-gateway)
 model-gateway:
  - changed-files:
    - any-glob-to-any-file: 'sgl-model-gateway/**/*'
 # Kernel specific
 sgl-kernel:
  - changed-files:
    - any-glob-to-any-file: 'sgl-kernel/**/*'
 # JIT kernel specific
 jit-kernel:
  - changed-files:
    - any-glob-to-any-file: 'python/sglang/jit_kernel/**/*'
 # Documentation
 documentation:
  - changed-files:
    - any-glob-to-any-file:
      - '**/*.md'
      - 'docs/**/*'
      - 'README*'
 # Dependencies
 dependencies:
  - changed-files:
    - any-glob-to-any-file:
      - '**/requirements*.txt'
      - '**/Cargo.toml'
      - '**/Cargo.lock'
      - '**/pyproject*.toml'
      - '**/setup.py'
      - '**/poetry.lock'
      - '**/package.json'
      - '**/package-lock.json'
 # Multi-modal
 Multi-modal:
  - changed-files:
    - any-glob-to-any-file:
      - '**/*multimodal*'
      - '**/*vision*'
      - '**/*vlm*'
 # Diffusion
 diffusion:
  - changed-files:
    - any-glob-to-any-file: 'python/sglang/multimodal_gen/**/*'
 # LoRA
 lora:
  - changed-files:
    - any-glob-to-any-file:
      - '**/*lora*'
 # Quantization
 quant:
  - changed-files:
    - any-glob-to-any-file:
      - '**/*quant*'
      - '**/*quantization*'
 # Speculative decoding
 speculative-decoding:
  - changed-files:
    - any-glob-to-any-file:
      - '**/*speculative*'
 # AMD specific
 amd:
  - changed-files:
    - any-glob-to-any-file:
      - '**/*amd*'
      - '**/*rocm*'
 # NPU specific
 npu:
  - changed-files:
    - any-glob-to-any-file:
      - '**/*npu*'
      - '**/*ascend*'
 # Blackwell
 blackwell:
  - changed-files:
    - any-glob-to-any-file:
      - '**/*nvfp4*'
      - 'sgl-kernel/csrc/attention/cutlass_sm100_mla/**/*'
      - 'python/sglang/srt/layers/attention/trtllm_mla_backend.py'
      - 'python/sglang/srt/layers/attention/trtllm_mha_backend.py'
 # DeepSeek specific
 deepseek:
  - changed-files:
    - any-glob-to-any-file:
      - '**/*deepseek*'
 # HiCache
 hicache:
  - changed-files:
    - any-glob-to-any-file:
      - '**/*hicache*'
 # Deterministic
 deterministic:
  - changed-files:
    - any-glob-to-any-file: 'python/sglang/srt/batch_invariant_ops/**/*'
 # Piecewise CUDA Graph
 piecewise-cuda-graph:
  - changed-files:
    - any-glob-to-any-file: 'python/sglang/srt/compilation/**/*'
 # Moore Threads specific
 mthreads:
  - changed-files:
    - any-glob-to-any-file:
      - '**/*mthreads*'
      - '**/*musa*'
--- a/third_party/sglang/.github/linters/lychee-ci.toml
+++ b/third_party/sglang/.github/linters/lychee-ci.toml
@@ -0,0 +1,42 @@
 no_progress = true
 verbose = "warn"
 timeout = 20
 max_concurrency = 8
 retry_wait_time = 2
 max_retries = 2
 # CI should validate external links over the network.
 offline = false
 scheme = ["http", "https"]
 exclude_path = [
  # Exclude generated Sphinx build artifacts.
  # - "(\\./)?" allows both "docs/..." and "./docs/..."
  # - "[/\\\\]" supports both slash styles in CI environments
  "^(\\./)?docs[/\\\\]_build[/\\\\]",
 ]
 exclude = [
  # Local-only endpoints referenced in docs/examples.
  # These are expected to be unreachable in GitHub-hosted CI.
  "^https?://localhost(:[0-9]+)?(/|$)",
  "^http://127\\.0\\.0\\.1(:[0-9]+)?(/|$)",
  # Vendor pages that frequently block/deny CI user-agents (transient 403/anti-bot).
  "^https://www\\.intel\\.com/content/www/us/en/ark/products/series/240391/intel-arc-b-series-graphics\\.html$",
  "^https://www\\.intel\\.com/content/www/us/en/ark/products/series/242616/intel-arc-pro-b-series-graphics\\.html$",
  "^https://www\\.intel\\.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications\\.html$",
  # Non-routable bind address used in examples, never externally reachable.
  "^http://0\\.0\\.0\\.0(/|$)",
  # Large doc portals with anti-bot/rate-limit behavior in CI.
  # We keep API docs references in content but do not fail CI on access policy.
  "^https://platform\\.openai\\.com/docs/",
  "^https://gamma\\.app/docs/Optimizing-RL-with-SGLang-y0kqgj877k34779$",
  "^https://aflah02\\.substack\\.com/p/multi-node-llm-inference-with-sglang/?$",
  # Known noisy image URLs used in notebook-rendered examples.
  "^https://github\\.com/sgl-project/sglang/blob/main/examples/assets/example_image\\.png\\?raw=true$",
  "^https://raw\\.githubusercontent\\.com/sgl-project/sglang/main/examples/assets/example_image\\.png/?$",
  "^https://raw\\.githubusercontent\\.com/sgl-project/sglang/main/assets/logo\\.png/?$",
 ]
--- a/third_party/sglang/.github/linters/lychee.toml
+++ b/third_party/sglang/.github/linters/lychee.toml
@@ -0,0 +1,18 @@
 # .github/linters/lychee.toml
 no_progress = true
 verbose = "warn"
 timeout = 20
 max_concurrency = 8
 offline = true
 # Ignore generated docs output; check source docs only.
 exclude_path = [
  "^(\\./)?docs[/\\\\]_build[/\\\\]",
 ]
 exclude = [
  "^https?://localhost(:[0-9]+)?(/|$)",
  "^http://127\\.0\\.0\\.1(:[0-9]+)?(/|$)",
  "^http://0\\.0\\.0\\.0(/|$)",
 ]
--- a/third_party/sglang/.github/pull_request_template.md
+++ b/third_party/sglang/.github/pull_request_template.md
@@ -0,0 +1,33 @@
 <!-- Thank you for your contribution! Please follow these guidelines to enhance your pull request. If anything is unclear, submit your PR and reach out to maintainers for assistance. Join our Slack community at https://slack.sglang.io to discuss further. -->
 ## Motivation
 <!-- Describe the purpose and goals of this pull request. -->
 ## Modifications
 <!-- Detail the changes made in this pull request. -->
 ## Accuracy Tests
 <!-- If this pull request affects model outputs (e.g., changes to the kernel or model forward code), provide accuracy test results. -->
 ## Speed Tests and Profiling
 <!-- If this pull request impacts inference speed, provide benchmarking and profiling results. -->
 ## Checklist
 - [ ] Format your code according to [Format code with pre-commit](https://docs.sglang.io/developer_guide/contribution_guide.html#format-code-with-pre-commit).
 - [ ] Add unit tests according to [Run and add unit tests](https://docs.sglang.io/developer_guide/contribution_guide.html#run-and-add-unit-tests).
 - [ ] Update documentation according to [Write documentations](https://docs.sglang.io/developer_guide/contribution_guide.html#write-documentations).
 - [ ] Provide accuracy and speed benchmark results according to [Test the accuracy](https://docs.sglang.io/developer_guide/contribution_guide.html#test-the-accuracy) and [Benchmark the speed](https://docs.sglang.io/developer_guide/contribution_guide.html#benchmark-the-speed).
 - [ ] Follow the SGLang code style [guidance](https://docs.sglang.io/developer_guide/contribution_guide.html#code-style-guidance).
 ## Review and Merge Process
 1. Ping Merge Oncalls to start the process. See the [PR Merge Process](https://github.com/sgl-project/sglang/blob/main/.github/MAINTAINER.md#pull-request-merge-process).
 2. Get approvals from [CODEOWNERS](https://github.com/sgl-project/sglang/blob/main/.github/CODEOWNERS) and other reviewers.
 3. Trigger CI tests with [comments](https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests) or contact authorized users to do so.
   - Common commands include `/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`
 4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.
--- a/third_party/sglang/.github/update_ci_permission.py
+++ b/third_party/sglang/.github/update_ci_permission.py
@@ -0,0 +1,244 @@
 """
 Update the CI permissions configuration file.
 This script updates the `CI_PERMISSIONS.json` file, which defines the CI permissions granted to each user.
 The format of `CI_PERMISSIONS.json` is as follows (entries written by this script also include `can_rerun_stage`):
 {
    "username1": {
        "can_tag_run_ci_label": true,
        "can_rerun_failed_ci": true,
        "cooldown_interval_minutes": 0,
        "reason": "top contributor"
    },
    "username2": {
        "can_tag_run_ci_label": true,
        "can_rerun_failed_ci": true,
        "cooldown_interval_minutes": 60,
        "reason": "custom override"
    }
 }
 Permissions are assigned according to the following rules:
 1. Add the top 50 contributors from the last 120 days with full permissions, no cooldown, and the reason "top contributor".
 2. Load all users from the existing `CI_PERMISSIONS.json` file and update their entries as follows:
   - If a user is already covered by rule 1, skip that user.
   - If the old reason of a user is "top contributor" but they are not in the current top contributors list, change their configuration to:
       {
           "can_tag_run_ci_label": true,
           "can_rerun_failed_ci": true,
           "cooldown_interval_minutes": 60,
           "reason": "custom override"
       }
   - For all other cases, preserve the original configuration unchanged.
 3. All other users receive no permissions and a 120-minute cooldown (they are omitted from the file).
 Usage:
    export GH_TOKEN="your_github_token"
    python3 update_ci_permission.py
    # Sort-only mode (no network calls, no GH_TOKEN required)
    python3 update_ci_permission.py --sort-only
 """
 import argparse
 import json
 import os
 from collections import Counter
 from datetime import datetime, timedelta, timezone
 try:
    import requests
 except ImportError:
    requests = None  # Only needed for non-sort-only runs
 # Configuration
 REPO_OWNER = "sgl-project"
 REPO_NAME = "sglang"
 FILE_NAME = os.path.join(os.path.dirname(__file__), "CI_PERMISSIONS.json")
 HEADERS = {}
 def github_api_get(endpoint, params=None):
    """Helper to make paginated GitHub API requests."""
    if requests is None:
        raise RuntimeError(
            "The requests package is required. Install it or use --sort-only."
        )
    if not HEADERS:
        raise RuntimeError(
            "GitHub headers not initialized. Set GH_TOKEN or use --sort-only."
        )
    results = []
    url = f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/{endpoint}"
    while url:
        response = requests.get(url, headers=HEADERS, params=params, timeout=30)
        if response.status_code != 200:
            print(f"Error fetching {url}: {response.status_code} {response.text}")
            # On a failed request, stop paginating and return what was collected so far (possibly an empty list) instead of raising
            break
        data = response.json()
        if isinstance(data, list):
            results.extend(data)
        else:
            return data  # Non-list response (not paginated usually)
        # Handle pagination
        url = None
        if "link" in response.headers:
            links = response.headers["link"].split(", ")
            for link in links:
                if 'rel="next"' in link:
                    url = link[link.find("<") + 1 : link.find(">")]
                    params = None  # Params are included in the next link
                    break
    return results
 def get_write_access_users():
    """Fetches users with push (write) or admin access."""
    print("Fetching collaborators with write access...")
    # Note: This endpoint usually requires admin rights on the token.
    collaborators = github_api_get("collaborators", params={"per_page": 100})
    writers = set()
    for col in collaborators:
        perms = col.get("permissions", {})
        # Check for admin, maintain, or push rights
        if perms.get("admin") or perms.get("maintain") or perms.get("push"):
            writers.add(col["login"])
    print(f"Found {len(writers)} users with write access.")
    return writers
 def get_top_contributors(days, limit):
    """Fetches top contributors based on commit count in the last N days."""
    print(f"Fetching commits from the last {days} days...")
    since_date = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    # Fetch commits
    commits = github_api_get("commits", params={"since": since_date, "per_page": 100})
    author_counts = Counter()
    for commit in commits:
        # commit['author'] contains the GitHub user object (can be None if not linked)
        if commit.get("author") and "login" in commit["author"]:
            author_counts[commit["author"]["login"]] += 1
    top_users = [user for user, _ in author_counts.most_common(limit)]
    print(f"Found {len(top_users)} top contributors in the last {days} days.")
    return set(top_users)
 def load_existing_permissions():
    if os.path.exists(FILE_NAME):
        try:
            with open(FILE_NAME, "r") as f:
                return json.load(f)
        except json.JSONDecodeError:
            print(f"Warning: {FILE_NAME} is invalid JSON. Starting fresh.")
    return {}
 def sort_permissions_file():
    """Sort the existing CI permissions file alphabetically and exit."""
    if not os.path.exists(FILE_NAME):
        print(f"{FILE_NAME} not found. Nothing to sort.")
        return
    old_permissions = load_existing_permissions()
    sorted_permissions = dict(sorted(old_permissions.items()))
    with open(FILE_NAME, "w") as f:
        json.dump(sorted_permissions, f, indent=4)
        f.write("\n")
    print(f"Sorted {FILE_NAME}. Total users: {len(sorted_permissions)}")
 def main():
    parser = argparse.ArgumentParser(description="Update or sort CI permissions.")
    parser.add_argument(
        "--sort-only",
        action="store_true",
        help="Only sort CI_PERMISSIONS.json alphabetically without fetching data.",
    )
    args = parser.parse_args()
    if args.sort_only:
        sort_permissions_file()
        return
    gh_token = os.getenv("GH_TOKEN")
    if not gh_token:
        raise ValueError("Error: GH_TOKEN environment variable is not set.")
    global HEADERS
    HEADERS = {
        "Authorization": f"Bearer {gh_token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    # Gather data (write_access_users is fetched for logging only; the rules below do not use it)
    try:
        write_access_users = get_write_access_users()
    except Exception as e:
        print(f"Warning: Could not fetch collaborators (check token scope). Error: {e}")
        write_access_users = set()
    top_contributors = get_top_contributors(days=120, limit=50)
    old_permissions = load_existing_permissions()
    new_permissions = {}
    # Rule 1: Add Top 50 Contributors
    for user in top_contributors:
        new_permissions[user] = {
            "can_tag_run_ci_label": True,
            "can_rerun_failed_ci": True,
            "can_rerun_stage": True,
            "cooldown_interval_minutes": 0,
            "reason": "top contributor",
        }
    # Rule 2: Process Existing Users (Merge Logic)
    for user, config in old_permissions.items():
        if user in new_permissions:
            # Already handled by Rule 1
            continue
        old_reason = config.get("reason", "")
        # If they fell off the top contributor list
        if old_reason == "top contributor":
            new_permissions[user] = {
                "can_tag_run_ci_label": True,
                "can_rerun_failed_ci": True,
                "can_rerun_stage": True,
                "cooldown_interval_minutes": 60,
                "reason": "custom override",
            }
        else:
            # Preserve custom overrides
            new_permissions[user] = config
    # Save and Sort
    # Sorting keys for cleaner diffs
    sorted_permissions = dict(sorted(new_permissions.items()))
    with open(FILE_NAME, "w") as f:
        json.dump(sorted_permissions, f, indent=4)
        f.write("\n")  # Add trailing newline
    print(f"Successfully updated {FILE_NAME}. Total users: {len(sorted_permissions)}")
 if __name__ == "__main__":
    main()
--- a/third_party/sglang/.github/workflows/amd-aiter-scout.yml
+++ b/third_party/sglang/.github/workflows/amd-aiter-scout.yml
@@ -0,0 +1,161 @@
 name: AMD AITER Scout
 on:
  schedule:
    - cron: '0 20 * * 1'   # Monday 20:00 UTC
    - cron: '0 20 * * 4'   # Thursday 20:00 UTC
  workflow_dispatch:
    inputs:
      aiter_ref:
        description: 'AITER git ref (branch, tag, or SHA). Default: main (latest commit)'
        required: false
        type: string
        default: 'main'
      job_filter:
        description: 'Comma-separated workflows to run: nightly-amd, nightly-amd-rocm720, pr-test-amd, pr-test-amd-rocm720. Default: all'
        required: false
        type: string
        default: 'all'
      continue_on_error:
        description: 'Continue running other workflows even if one fails'
        required: false
        type: boolean
        default: true
 concurrency:
  group: amd-aiter-scout-${{ github.run_id }}
  cancel-in-progress: true
 jobs:
  resolve-aiter:
    runs-on: ubuntu-latest
    outputs:
      aiter_sha: ${{ steps.resolve.outputs.sha }}
      run_nightly_amd: ${{ steps.parse.outputs.run_nightly_amd }}
      run_nightly_amd_rocm720: ${{ steps.parse.outputs.run_nightly_amd_rocm720 }}
      run_pr_test_amd: ${{ steps.parse.outputs.run_pr_test_amd }}
      run_pr_test_amd_rocm720: ${{ steps.parse.outputs.run_pr_test_amd_rocm720 }}
    steps:
      - name: Resolve AITER commit
        id: resolve
        run: |
          REF="${{ inputs.aiter_ref || 'main' }}"
          echo "Resolving AITER ref: ${REF}"
          SHA=$(git ls-remote https://github.com/ROCm/aiter.git "refs/heads/${REF}" | head -1 | cut -f1)
          if [ -z "$SHA" ]; then
            SHA=$(git ls-remote https://github.com/ROCm/aiter.git "refs/tags/${REF}" | head -1 | cut -f1)
          fi
          if [ -z "$SHA" ]; then
            SHA=$(git ls-remote https://github.com/ROCm/aiter.git "${REF}" | head -1 | cut -f1)
          fi
          if [ -z "$SHA" ]; then
            SHA="${REF}"
          fi
          echo "sha=${SHA}" >> $GITHUB_OUTPUT
          echo "### AITER Ref Resolution" >> $GITHUB_STEP_SUMMARY
          echo "- **Requested ref:** \`${REF}\`" >> $GITHUB_STEP_SUMMARY
          echo "- **Resolved SHA:** \`${SHA}\`" >> $GITHUB_STEP_SUMMARY
          echo "- **AITER commit:** https://github.com/ROCm/aiter/commit/${SHA}" >> $GITHUB_STEP_SUMMARY
      - name: Parse job filter
        id: parse
        run: |
          FILTER="${{ inputs.job_filter || 'all' }}"
          echo "Job filter: ${FILTER}"
          if [[ "$FILTER" == "all" ]]; then
            echo "run_nightly_amd=true" >> $GITHUB_OUTPUT
            echo "run_nightly_amd_rocm720=true" >> $GITHUB_OUTPUT
            echo "run_pr_test_amd=true" >> $GITHUB_OUTPUT
            echo "run_pr_test_amd_rocm720=true" >> $GITHUB_OUTPUT
          else
            # Wrap with commas so each name matches only as a whole comma-delimited token (avoids "nightly-amd" matching "nightly-amd-rocm720")
            PADDED=",${FILTER// /},"
            echo "run_nightly_amd=$(echo "$PADDED" | grep -q ',nightly-amd,' && echo true || echo false)" >> $GITHUB_OUTPUT
            echo "run_nightly_amd_rocm720=$(echo "$PADDED" | grep -q ',nightly-amd-rocm720,' && echo true || echo false)" >> $GITHUB_OUTPUT
            echo "run_pr_test_amd=$(echo "$PADDED" | grep -q ',pr-test-amd,' && echo true || echo false)" >> $GITHUB_OUTPUT
            echo "run_pr_test_amd_rocm720=$(echo "$PADDED" | grep -q ',pr-test-amd-rocm720,' && echo true || echo false)" >> $GITHUB_OUTPUT
          fi
          echo "### Job Filter" >> $GITHUB_STEP_SUMMARY
          echo "- **Filter:** \`${FILTER}\`" >> $GITHUB_STEP_SUMMARY
  call-nightly-amd:
    if: needs.resolve-aiter.outputs.run_nightly_amd == 'true'
    needs: resolve-aiter
    uses: ./.github/workflows/nightly-test-amd.yml
    secrets: inherit
    with:
      ref: ${{ github.sha }}
      aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
      job_filter: 'all'
      continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
  call-nightly-amd-rocm720:
    if: needs.resolve-aiter.outputs.run_nightly_amd_rocm720 == 'true'
    needs: resolve-aiter
    uses: ./.github/workflows/nightly-test-amd-rocm720.yml
    secrets: inherit
    with:
      ref: ${{ github.sha }}
      aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
      job_filter: 'all'
      continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
  call-pr-test-amd:
    if: needs.resolve-aiter.outputs.run_pr_test_amd == 'true'
    needs: resolve-aiter
    uses: ./.github/workflows/pr-test-amd.yml
    secrets: inherit
    with:
      run_all_tests: true
      aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
      continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
  call-pr-test-amd-rocm720:
    if: needs.resolve-aiter.outputs.run_pr_test_amd_rocm720 == 'true'
    needs: resolve-aiter
    uses: ./.github/workflows/pr-test-amd-rocm720.yml
    secrets: inherit
    with:
      run_all_tests: true
      aiter_ref: ${{ needs.resolve-aiter.outputs.aiter_sha }}
      continue_on_error: ${{ inputs.continue_on_error == '' && true || inputs.continue_on_error }}
  check-all-jobs:
    if: always()
    needs:
      - resolve-aiter
      - call-nightly-amd
      - call-nightly-amd-rocm720
      - call-pr-test-amd
      - call-pr-test-amd-rocm720
    runs-on: ubuntu-latest
    steps:
      - name: Summary
        run: |
          echo "## AMD AITER Scout Results" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "- **AITER SHA:** \`${{ needs.resolve-aiter.outputs.aiter_sha }}\`" >> $GITHUB_STEP_SUMMARY
          echo "- **AITER commit:** https://github.com/ROCm/aiter/commit/${{ needs.resolve-aiter.outputs.aiter_sha }}" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "| Workflow | Result |" >> $GITHUB_STEP_SUMMARY
          echo "|----------|--------|" >> $GITHUB_STEP_SUMMARY
          echo "| Nightly AMD (AITER Latest) | \`${{ needs.call-nightly-amd.result }}\` |" >> $GITHUB_STEP_SUMMARY
          echo "| Nightly AMD ROCm 7.2 | \`${{ needs.call-nightly-amd-rocm720.result }}\` |" >> $GITHUB_STEP_SUMMARY
          echo "| PR Test AMD (AITER Latest) | \`${{ needs.call-pr-test-amd.result }}\` |" >> $GITHUB_STEP_SUMMARY
          echo "| PR Test AMD ROCm 7.2 | \`${{ needs.call-pr-test-amd-rocm720.result }}\` |" >> $GITHUB_STEP_SUMMARY
      - name: Check if any job failed
        run: |
          if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then
            echo "One or more workflows failed"
            exit 1
          fi
          if [[ "${{ contains(needs.*.result, 'cancelled') }}" == "true" ]]; then
            echo "One or more workflows were cancelled"
            exit 1
          fi
          echo "All workflows passed"
--- a/third_party/sglang/.github/workflows/amd-ci-job-monitor.yml
+++ b/third_party/sglang/.github/workflows/amd-ci-job-monitor.yml
@@ -0,0 +1,338 @@
 name: AMD CI Job Monitor
 on:
  schedule:
    - cron: '0 0 * * *'  # Daily at midnight UTC
  pull_request:
    paths:
      - '.github/workflows/amd-ci-job-monitor.yml'
      - 'scripts/ci/utils/query_job_status.py'
  workflow_dispatch:
    inputs:
      hours:
        description: 'Time window in hours'
        required: false
        default: '24'
        type: string
      job_filter:
        description: 'Job name filter (leave empty for all AMD jobs)'
        required: false
        type: string
 jobs:
  fetch-actions-data:
    name: Fetch Actions Snapshot
    runs-on: ubuntu-latest
    env:
      GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install tabulate
      - name: Select workflows for snapshot
        id: select-workflows
        run: |
          if [[ -n "${{ inputs.job_filter }}" ]]; then
            echo "workflows=pr-test-amd.yml" >> "$GITHUB_OUTPUT"
          else
            echo "workflows=pr-test-amd.yml,nightly-test-amd.yml,pr-test-amd-rocm720.yml,nightly-test-amd-rocm720.yml" >> "$GITHUB_OUTPUT"
          fi
      - name: Fetch Actions data snapshot
        timeout-minutes: 30
        run: |
          python scripts/ci/utils/query_job_status.py \
            --repo ${{ github.repository }} \
            --workflow "${{ steps.select-workflows.outputs.workflows }}" \
            --hours ${{ inputs.hours || '24' }} \
            --dump-data-file actions-job-snapshot.json
      - name: Upload Actions data snapshot
        uses: actions/upload-artifact@v4
        with:
          name: actions-job-snapshot
          path: actions-job-snapshot.json
          if-no-files-found: error
  # Single job filter mode
  custom-report:
    name: Custom Job Report
    if: ${{ inputs.job_filter }}
    needs: fetch-actions-data
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install tabulate
      - name: Download Actions data snapshot
        uses: actions/download-artifact@v4
        with:
          name: actions-job-snapshot
          path: ci-data
      - name: Generate Custom Job Report
        timeout-minutes: 30
        run: |
          python scripts/ci/utils/query_job_status.py \
            --repo ${{ github.repository }} \
            --job "${{ inputs.job_filter }}" \
            --workflow "pr-test-amd.yml" \
            --hours ${{ inputs.hours || '24' }} \
            --input-data-file ci-data/actions-job-snapshot.json \
            --summary
  # Parse workflow files to get job names dynamically
  parse-workflows:
    name: Parse Workflow Jobs
    if: ${{ !inputs.job_filter }}
    runs-on: ubuntu-latest
    outputs:
      pr_jobs: ${{ steps.parse.outputs.pr_jobs }}
      nightly_jobs: ${{ steps.parse.outputs.nightly_jobs }}
      pr_rocm720_jobs: ${{ steps.parse.outputs.pr_rocm720_jobs }}
      nightly_rocm720_jobs: ${{ steps.parse.outputs.nightly_rocm720_jobs }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Parse workflow files
        id: parse
        run: |
          # Parse pr-test-amd.yml and extract job names (exclude utility jobs)
          # Excluded: call-gate, check-changes, pr-test-amd-finish, cancel, check-all-jobs
          pr_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/pr-test-amd.yml | \
            grep -v -E '^(call-gate|check-changes|pr-test-amd-finish|cancel|check-all-jobs)$' | \
            jq -R -s -c 'split("\n") | map(select(length > 0))')
          echo "pr_jobs=$pr_jobs" >> $GITHUB_OUTPUT
          echo "PR jobs: $pr_jobs"
          # Parse nightly-test-amd.yml and extract job names (exclude utility jobs)
          # Excluded: check-all-jobs
          nightly_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/nightly-test-amd.yml | \
            grep -v -E '^(check-all-jobs)$' | \
            jq -R -s -c 'split("\n") | map(select(length > 0))')
          echo "nightly_jobs=$nightly_jobs" >> $GITHUB_OUTPUT
          echo "Nightly jobs: $nightly_jobs"
          # Parse pr-test-amd-rocm720.yml (exclude utility jobs)
          # Excluded: call-gate, check-changes, pr-test-amd-finish, cancel, check-all-jobs
          pr_rocm720_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/pr-test-amd-rocm720.yml | \
            grep -v -E '^(call-gate|check-changes|pr-test-amd-finish|cancel|check-all-jobs)$' | \
            jq -R -s -c 'split("\n") | map(select(length > 0))')
          echo "pr_rocm720_jobs=$pr_rocm720_jobs" >> $GITHUB_OUTPUT
          echo "PR ROCm 7.2 jobs: $pr_rocm720_jobs"
          # Parse nightly-test-amd-rocm720.yml (exclude utility jobs)
          # Excluded: check-all-jobs
          nightly_rocm720_jobs=$(yq -r '.jobs | keys | .[]' .github/workflows/nightly-test-amd-rocm720.yml | \
            grep -v -E '^(check-all-jobs)$' | \
            jq -R -s -c 'split("\n") | map(select(length > 0))')
          echo "nightly_rocm720_jobs=$nightly_rocm720_jobs" >> $GITHUB_OUTPUT
          echo "Nightly ROCm 7.2 jobs: $nightly_rocm720_jobs"
  # PR CI reports using dynamic matrix
  pr-ci-reports:
    name: PR - ${{ matrix.job_name }}
    needs: [parse-workflows, fetch-actions-data]
    if: ${{ !inputs.job_filter }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        job_name: ${{ fromJson(needs.parse-workflows.outputs.pr_jobs) }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install tabulate
      - name: Download Actions data snapshot
        uses: actions/download-artifact@v4
        with:
          name: actions-job-snapshot
          path: ci-data
      - name: Generate Report
        timeout-minutes: 15
        run: |
          python scripts/ci/utils/query_job_status.py \
            --repo ${{ github.repository }} \
            --job "${{ matrix.job_name }}" \
            --workflow "pr-test-amd.yml" \
            --hours ${{ inputs.hours || '24' }} \
            --input-data-file ci-data/actions-job-snapshot.json \
            --summary
  # Nightly AMD test reports using dynamic matrix
  nightly-reports:
    name: Nightly - ${{ matrix.job_name }}
    needs: [parse-workflows, fetch-actions-data]
    if: ${{ !inputs.job_filter }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        job_name: ${{ fromJson(needs.parse-workflows.outputs.nightly_jobs) }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install tabulate
      - name: Download Actions data snapshot
        uses: actions/download-artifact@v4
        with:
          name: actions-job-snapshot
          path: ci-data
      - name: Generate Nightly Report
        timeout-minutes: 15
        run: |
          python scripts/ci/utils/query_job_status.py \
            --repo ${{ github.repository }} \
            --job "${{ matrix.job_name }}" \
            --workflow "nightly-test-amd.yml" \
            --hours ${{ inputs.hours || '24' }} \
            --input-data-file ci-data/actions-job-snapshot.json \
            --summary
  # PR ROCm 7.2 CI reports using dynamic matrix
  pr-rocm720-ci-reports:
    name: PR ROCm720 - ${{ matrix.job_name }}
    needs: [parse-workflows, fetch-actions-data]
    if: ${{ !inputs.job_filter }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        job_name: ${{ fromJson(needs.parse-workflows.outputs.pr_rocm720_jobs) }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install tabulate
      - name: Download Actions data snapshot
        uses: actions/download-artifact@v4
        with:
          name: actions-job-snapshot
          path: ci-data
      - name: Generate PR ROCm 7.2 Report
        timeout-minutes: 15
        run: |
          python scripts/ci/utils/query_job_status.py \
            --repo ${{ github.repository }} \
            --job "${{ matrix.job_name }}" \
            --workflow "pr-test-amd-rocm720.yml" \
            --hours ${{ inputs.hours || '24' }} \
            --input-data-file ci-data/actions-job-snapshot.json \
            --summary
  # Nightly ROCm 7.2 reports using dynamic matrix
  nightly-rocm720-reports:
    name: Nightly ROCm720 - ${{ matrix.job_name }}
    needs: [parse-workflows, fetch-actions-data]
    if: ${{ !inputs.job_filter }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        job_name: ${{ fromJson(needs.parse-workflows.outputs.nightly_rocm720_jobs) }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install tabulate
      - name: Download Actions data snapshot
        uses: actions/download-artifact@v4
        with:
          name: actions-job-snapshot
          path: ci-data
      - name: Generate Nightly ROCm 7.2 Report
        timeout-minutes: 15
        run: |
          python scripts/ci/utils/query_job_status.py \
            --repo ${{ github.repository }} \
            --job "${{ matrix.job_name }}" \
            --workflow "nightly-test-amd-rocm720.yml" \
            --hours ${{ inputs.hours || '24' }} \
            --input-data-file ci-data/actions-job-snapshot.json \
            --summary
  # Runner fleet report - cross-workflow runner analytics in a single pass
  runner-fleet-report:
    name: Runner Fleet Report
    if: ${{ !inputs.job_filter }}
    needs: fetch-actions-data
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install tabulate
      - name: Download Actions data snapshot
        uses: actions/download-artifact@v4
        with:
          name: actions-job-snapshot
          path: ci-data
      - name: Generate Runner Fleet Report
        timeout-minutes: 30
        run: |
          python scripts/ci/utils/query_job_status.py \
            --repo ${{ github.repository }} \
            --runner-report \
            --workflow "pr-test-amd.yml,nightly-test-amd.yml,pr-test-amd-rocm720.yml,nightly-test-amd-rocm720.yml" \
            --hours ${{ inputs.hours || '24' }} \
            --input-data-file ci-data/actions-job-snapshot.json \
            --summary
--- a/third_party/sglang/.github/workflows/auto-tune.yml
+++ b/third_party/sglang/.github/workflows/auto-tune.yml
@@ -0,0 +1,10 @@
 name: Auto tune
 on:
  workflow_dispatch:
 jobs:
  auto-tune-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
--- a/third_party/sglang/.github/workflows/bot-bump-flashinfer-version.yml
+++ b/third_party/sglang/.github/workflows/bot-bump-flashinfer-version.yml
@@ -0,0 +1,50 @@
 name: Bot Bump Flashinfer Version
 on:
  workflow_dispatch:
    inputs:
      new_version:
        description: 'New flashinfer version (e.g., 0.6.4)'
        required: true
        type: string
 permissions:
  contents: write
  pull-requests: write
 jobs:
  bump-flashinfer-version:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install Python dependencies
        run: |
          pip install tomli
      - name: Configure Git and branch
        run: |
          git config user.name "sglang-bot"
          git config user.email "sglang-bot@users.noreply.github.com"
          RANDOM_SUFFIX=$(echo $RANDOM | md5sum | head -c 4)
          BRANCH_NAME="bot/bump-flashinfer-version-${{ github.event.inputs.new_version }}-${RANDOM_SUFFIX}"
          git checkout -b "$BRANCH_NAME"
          echo "BRANCH_NAME=$BRANCH_NAME" >> $GITHUB_ENV
      - name: Run flashinfer version bump script
        run: |
          python scripts/release/bump_flashinfer_version.py "${{ github.event.inputs.new_version }}"
      - name: Commit and create PR
        env:
          GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
        run: |
          bash scripts/release/commit_and_pr.sh "flashinfer" "${{ github.event.inputs.new_version }}" "$BRANCH_NAME"
--- a/third_party/sglang/.github/workflows/bot-bump-kernel-version-to-sglang.yml
+++ b/third_party/sglang/.github/workflows/bot-bump-kernel-version-to-sglang.yml
@@ -0,0 +1,60 @@
 name: Bot Bump Kernel Version to SGLang
 on:
  workflow_dispatch:
 permissions:
  contents: write
  pull-requests: write
 jobs:
  bump-kernel-version-to-sglang:
    runs-on: ubuntu-latest
    outputs:
      branch_name: ${{ steps.set_output.outputs.branch_name }}
      needs_sync: ${{ steps.check_sync.outputs.needs_sync }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install Python dependencies
        run: |
          pip install tomli
      - name: Check if sync is needed
        id: check_sync
        run: |
          python scripts/release/check_kernel_version_to_sglang.py
      - name: Configure Git and branch
        if: steps.check_sync.outputs.needs_sync == 'true'
        id: set_output
        run: |
          git config user.name "sglang-bot"
          git config user.email "sglang-bot@users.noreply.github.com"
          RANDOM_SUFFIX=$(echo $RANDOM | md5sum | head -c 4)
          KERNEL_VERSION="${{ steps.check_sync.outputs.kernel_version }}"
          BRANCH_NAME="bot/bump-kernel-version-to-sglang-${KERNEL_VERSION}-${RANDOM_SUFFIX}"
          git checkout -b "$BRANCH_NAME"
          echo "BRANCH_NAME=$BRANCH_NAME" >> $GITHUB_ENV
          echo "KERNEL_VERSION=$KERNEL_VERSION" >> $GITHUB_ENV
          echo "branch_name=$BRANCH_NAME" >> $GITHUB_OUTPUT
      - name: Run kernel version bump script
        if: steps.check_sync.outputs.needs_sync == 'true'
        run: |
          python scripts/release/bump_kernel_version_to_sglang.py
      - name: Commit and create PR
        if: steps.check_sync.outputs.needs_sync == 'true'
        env:
          GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
        run: |
          bash scripts/release/commit_and_pr_kernel_to_sglang.sh "$KERNEL_VERSION" "$BRANCH_NAME"
--- a/third_party/sglang/.github/workflows/bot-bump-kernel-version.yml
+++ b/third_party/sglang/.github/workflows/bot-bump-kernel-version.yml
@@ -0,0 +1,50 @@
 name: Bot Bump Kernel Version
 on:
  workflow_dispatch:
    inputs:
      new_version:
        description: 'New sgl-kernel version (e.g., 0.3.12)'
        required: true
        type: string
 permissions:
  contents: write
  pull-requests: write
 jobs:
  bump-kernel-version:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install Python dependencies
        run: |
          pip install tomli
      - name: Configure Git and branch
        run: |
          git config user.name "sglang-bot"
          git config user.email "sglang-bot@users.noreply.github.com"
          RANDOM_SUFFIX=$(echo $RANDOM | md5sum | head -c 4)
          BRANCH_NAME="bot/bump-kernel-version-${{ github.event.inputs.new_version }}-${RANDOM_SUFFIX}"
          git checkout -b "$BRANCH_NAME"
          echo "BRANCH_NAME=$BRANCH_NAME" >> $GITHUB_ENV
      - name: Run kernel version bump script
        run: |
          python scripts/release/bump_kernel_version.py "${{ github.event.inputs.new_version }}"
      - name: Commit and create PR
        env:
          GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
        run: |
          bash scripts/release/commit_and_pr.sh "sgl-kernel" "${{ github.event.inputs.new_version }}" "$BRANCH_NAME"
--- a/third_party/sglang/.github/workflows/bot-bump-sglang-version.yml
+++ b/third_party/sglang/.github/workflows/bot-bump-sglang-version.yml
@@ -0,0 +1,89 @@
 name: Bot Bump SGLang Version
 on:
  workflow_dispatch:
    inputs:
      new_version:
        description: 'New SGLang version (e.g., 0.5.3 or 0.5.3rc0)'
        required: true
        type: string
 permissions:
  contents: write
  pull-requests: write
 jobs:
  bump-sglang-version:
    runs-on: ubuntu-latest
    outputs:
      branch_name: ${{ steps.set_output.outputs.branch_name }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install Python dependencies
        run: |
          pip install tomli
      - name: Configure Git and branch
        id: set_output
        run: |
          git config user.name "sglang-bot"
          git config user.email "sglang-bot@users.noreply.github.com"
          RANDOM_SUFFIX=$(echo $RANDOM | md5sum | head -c 4)
          BRANCH_NAME="bot/bump-sglang-version-${{ github.event.inputs.new_version }}-${RANDOM_SUFFIX}"
          git checkout -b "$BRANCH_NAME"
          echo "BRANCH_NAME=$BRANCH_NAME" >> $GITHUB_ENV
          echo "branch_name=$BRANCH_NAME" >> $GITHUB_OUTPUT
      - name: Run SGLang version bump script
        run: |
          python scripts/release/bump_sglang_version.py "${{ github.event.inputs.new_version }}"
      - name: Commit and create PR
        env:
          GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
        run: |
          bash scripts/release/commit_and_pr.sh "SGLang" "${{ github.event.inputs.new_version }}" "$BRANCH_NAME"
  run-nightly-tests-nvidia:
    needs: bump-sglang-version
    uses: ./.github/workflows/nightly-test-nvidia.yml
    with:
      ref: ${{ needs.bump-sglang-version.outputs.branch_name }}
    secrets: inherit
  run-nightly-tests-amd:
    needs: bump-sglang-version
    uses: ./.github/workflows/nightly-test-amd.yml
    with:
      ref: ${{ needs.bump-sglang-version.outputs.branch_name }}
    secrets: inherit
  run-nightly-tests-npu:
    needs: bump-sglang-version
    uses: ./.github/workflows/nightly-test-npu.yml
    with:
      ref: ${{ needs.bump-sglang-version.outputs.branch_name }}
    secrets: inherit
  run-pr-tests-xeon:
    needs: bump-sglang-version
    uses: ./.github/workflows/pr-test-xeon.yml
    with:
      ref: ${{ needs.bump-sglang-version.outputs.branch_name }}
    secrets: inherit
  run-pr-tests-xpu:
    needs: bump-sglang-version
    uses: ./.github/workflows/pr-test-xpu.yml
    with:
      ref: ${{ needs.bump-sglang-version.outputs.branch_name }}
    secrets: inherit
--- a/third_party/sglang/.github/workflows/bot-cherry-pick.yml
+++ b/third_party/sglang/.github/workflows/bot-cherry-pick.yml
@@ -0,0 +1,182 @@
 name: Bot Cherry Pick to Release Branch
 on:
  workflow_dispatch:
    inputs:
      commit_sha:
        description: 'Commit SHA to cherry-pick (full or short hash)'
        required: true
        type: string
      target_branch:
        description: 'Target release branch (e.g., release/v0.5.7)'
        required: true
        type: string
      create_pr:
        description: 'Create a PR instead of pushing directly'
        required: false
        type: boolean
        default: true
 permissions:
  contents: write
  pull-requests: write
 concurrency:
  group: cherry-pick-${{ github.event.inputs.target_branch }}
  cancel-in-progress: false
 jobs:
  cherry-pick:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-latest
    environment: 'prod'
    steps:
      - name: Validate inputs
        env:
          TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
        run: |
          if [[ ! "$TARGET_BRANCH" =~ ^release/v[0-9]+\.[0-9]+(\.[0-9]+)?$ ]]; then
            echo "::error::Target branch must match pattern 'release/vX.Y' or 'release/vX.Y.Z' (e.g., release/v0.5.7)"
            exit 1
          fi
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          token: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
      - name: Configure Git
        run: |
          git config user.name "sglang-bot"
          git config user.email "sglang-bot@users.noreply.github.com"
      - name: Validate target branch exists
        env:
          TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
        run: |
          git fetch origin
          if ! git ls-remote --exit-code --heads origin "$TARGET_BRANCH" > /dev/null 2>&1; then
            echo "::error::Target branch '$TARGET_BRANCH' does not exist on remote"
            exit 1
          fi
      - name: Get commit info
        id: commit_info
        env:
          COMMIT_SHA_INPUT: ${{ github.event.inputs.commit_sha }}
        run: |
          # Verify commit exists
          if ! git cat-file -t "$COMMIT_SHA_INPUT" > /dev/null 2>&1; then
            echo "::error::Commit SHA '$COMMIT_SHA_INPUT' does not exist"
            exit 1
          fi
          # Get full SHA if short hash provided
          FULL_SHA=$(git rev-parse "$COMMIT_SHA_INPUT")
          COMMIT_TITLE=$(git log -1 --format="%s" "$FULL_SHA")
          SHORT_SHA=$(git rev-parse --short "$FULL_SHA")
          echo "full_sha=$FULL_SHA" >> $GITHUB_OUTPUT
          echo "short_sha=$SHORT_SHA" >> $GITHUB_OUTPUT
          # Use delimiter for multiline-safe output
          {
            echo "commit_title<<EOF"
            echo "$COMMIT_TITLE"
            echo "EOF"
          } >> $GITHUB_OUTPUT
          echo "Cherry-picking commit: $SHORT_SHA - $COMMIT_TITLE"
      - name: Cherry-pick commit
        id: cherry_pick
        env:
          TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
          FULL_SHA: ${{ steps.commit_info.outputs.full_sha }}
          SHORT_SHA: ${{ steps.commit_info.outputs.short_sha }}
          CREATE_PR: ${{ github.event.inputs.create_pr }}
        run: |
          if [[ "$CREATE_PR" == "true" ]]; then
            # Create a new branch for the PR
            RANDOM_SUFFIX=$(head -c 4 /dev/urandom | xxd -p)
            NEW_BRANCH="cherry-pick/${SHORT_SHA}-to-${TARGET_BRANCH#release/}-${RANDOM_SUFFIX}"
            git checkout -b "$NEW_BRANCH" "origin/$TARGET_BRANCH"
            echo "new_branch=$NEW_BRANCH" >> $GITHUB_OUTPUT
          else
            # Checkout target branch directly
            git checkout "$TARGET_BRANCH"
          fi
          # Attempt cherry-pick
          if git cherry-pick "$FULL_SHA"; then
            echo "cherry_pick_success=true" >> $GITHUB_OUTPUT
          else
            echo "::error::Cherry-pick failed due to conflicts. Please resolve manually."
            git cherry-pick --abort || true
            echo "cherry_pick_success=false" >> $GITHUB_OUTPUT
            exit 1
          fi
      - name: Push changes
        if: steps.cherry_pick.outputs.cherry_pick_success == 'true'
        env:
          CREATE_PR: ${{ github.event.inputs.create_pr }}
          TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
          NEW_BRANCH: ${{ steps.cherry_pick.outputs.new_branch }}
        run: |
          if [[ "$CREATE_PR" == "true" ]]; then
            git push origin "$NEW_BRANCH"
          else
            git push origin "$TARGET_BRANCH"
          fi
      - name: Create Pull Request
        if: steps.cherry_pick.outputs.cherry_pick_success == 'true' && github.event.inputs.create_pr == 'true'
        env:
          GH_TOKEN: ${{ secrets.GH_PAT_FOR_PULL_REQUEST }}
          TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
          SHORT_SHA: ${{ steps.commit_info.outputs.short_sha }}
          COMMIT_TITLE: ${{ steps.commit_info.outputs.commit_title }}
          FULL_SHA: ${{ steps.commit_info.outputs.full_sha }}
          NEW_BRANCH: ${{ steps.cherry_pick.outputs.new_branch }}
        run: |
          PR_TITLE="[Cherry-pick] ${COMMIT_TITLE} to ${TARGET_BRANCH}"
          gh pr create \
            --title "$PR_TITLE" \
            --base "$TARGET_BRANCH" \
            --head "$NEW_BRANCH" \
            --label "cherry-pick" \
            --body-file - <<EOF
          Cherry-pick of commit ${FULL_SHA} to \`${TARGET_BRANCH}\`
          **Original commit:** ${FULL_SHA}
          **Original title:** ${COMMIT_TITLE}
          ---
          *This PR was automatically created by the cherry-pick workflow.*
          EOF
      - name: Summary
        if: always()
        env:
          FULL_SHA: ${{ steps.commit_info.outputs.full_sha }}
          COMMIT_TITLE: ${{ steps.commit_info.outputs.commit_title }}
          TARGET_BRANCH: ${{ github.event.inputs.target_branch }}
          CHERRY_PICK_SUCCESS: ${{ steps.cherry_pick.outputs.cherry_pick_success }}
          CREATE_PR: ${{ github.event.inputs.create_pr }}
          NEW_BRANCH: ${{ steps.cherry_pick.outputs.new_branch }}
          ACTOR: ${{ github.actor }}
        run: |
          echo "## Cherry-Pick Summary" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "- **Triggered by:** @${ACTOR}" >> $GITHUB_STEP_SUMMARY
          echo "- **Commit:** ${FULL_SHA}" >> $GITHUB_STEP_SUMMARY
          echo "- **Title:** ${COMMIT_TITLE}" >> $GITHUB_STEP_SUMMARY
          echo "- **Target Branch:** ${TARGET_BRANCH}" >> $GITHUB_STEP_SUMMARY
          if [[ "$CHERRY_PICK_SUCCESS" == "true" ]]; then
            echo "- **Status:** ✅ Success" >> $GITHUB_STEP_SUMMARY
          else
            echo "- **Status:** ❌ Failed" >> $GITHUB_STEP_SUMMARY
          fi
          if [[ "$CREATE_PR" == "true" && "$CHERRY_PICK_SUCCESS" == "true" ]]; then
            echo "- **PR Branch:** ${NEW_BRANCH}" >> $GITHUB_STEP_SUMMARY
          fi
--- a/third_party/sglang/.github/workflows/cancel-pr-workflow-on-merge.yml
+++ b/third_party/sglang/.github/workflows/cancel-pr-workflow-on-merge.yml
@@ -0,0 +1,22 @@
 name: Cancel PR Workflows on Merge
 on:
  pull_request_target:
    types:
      - closed
 permissions:
  actions: write
 jobs:
  cancel:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    steps:
      - name: Cancel Previous Runs
        uses: styfle/cancel-workflow-action@0.12.1
        with:
          workflow_id: all
          access_token: ${{ secrets.GITHUB_TOKEN }}
          ignore_sha: true
          pr_number: ${{ github.event.pull_request.number }}
--- a/third_party/sglang/.github/workflows/cancel-unfinished-pr-tests.yml
+++ b/third_party/sglang/.github/workflows/cancel-unfinished-pr-tests.yml
@@ -0,0 +1,155 @@
 name: Cancel Unfinished PR Runs
 on:
  workflow_dispatch:
    inputs:
      workflows:
        description: 'Space-separated list of workflow filenames to cancel'
        required: true
        type: string
        default: 'pr-test.yml'
      include_high_priority:
        description: 'Also cancel runs from high-priority PRs'
        required: false
        type: boolean
        default: false
 permissions:
  actions: write   # Needed to cancel runs
  contents: read   # Needed to read repo info
  pull-requests: read  # Needed for gh pr view (labels)
 jobs:
  cancel-unfinished-pr-runs:
    runs-on: ubuntu-latest
    steps:
      - name: Install GitHub CLI
        run: sudo apt-get install -y gh jq
      - name: Cancel unfinished PR-associated runs (skip high-priority PRs)
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO: ${{ github.repository }}
          WORKFLOWS: ${{ github.event.inputs.workflows || 'pr-test.yml' }}
          INCLUDE_HIGH_PRIORITY: ${{ github.event.inputs.include_high_priority || 'false' }}
        shell: bash
        run: |
          set -euo pipefail
          # Read the space-separated string from the input into a bash array
          read -r -a WORKFLOW_FILES <<< "${WORKFLOWS}"
          echo "Targeting ${#WORKFLOW_FILES[@]} workflow(s): ${WORKFLOWS}"
          echo ""
          for workflow_file in "${WORKFLOW_FILES[@]}"; do
            echo "========================================="
            echo "Workflow: $workflow_file"
            echo "========================================="
            # Get all unfinished runs
            all_runs=$(gh run list \
              --repo "$REPO" \
              --workflow "$workflow_file" \
              --json databaseId,status,event,url,createdAt \
              --limit 1000 \
            | jq -c '.[] | select(.status=="queued" or .status=="waiting" or .status=="in_progress")')
            if [ -z "$all_runs" ]; then
              echo "✅ No unfinished runs found"
              echo ""
              continue
            fi
            # Count runs by event type
            total_runs=$(echo "$all_runs" | wc -l)
            pr_runs=$(echo "$all_runs" | jq -s '[.[] | select(.event=="pull_request")] | length')
            other_runs=$(echo "$all_runs" | jq -s '[.[] | select(.event!="pull_request")] | length')
            echo "📊 Summary: $total_runs unfinished runs ($pr_runs PR-related, $other_runs other)"
            echo ""
            # Process non-PR runs first
            if [ "$other_runs" -gt 0 ]; then
              echo "--- Non-PR Runs ---"
              echo "$all_runs" | jq -c 'select(.event!="pull_request")' | while read -r run; do
                run_url=$(echo "$run" | jq -r '.url')
                run_event=$(echo "$run" | jq -r '.event')
                run_status=$(echo "$run" | jq -r '.status')
                echo "  • $run_event ($run_status): $run_url"
              done
              echo ""
            fi
            # Process PR runs
            if [ "$pr_runs" -gt 0 ]; then
              echo "--- PR Runs (checking for cancellation) ---"
              echo "$all_runs" | jq -c 'select(.event=="pull_request")' | while read -r run; do
                run_id=$(echo "$run" | jq -r '.databaseId')
                run_url=$(echo "$run" | jq -r '.url')
                run_status=$(echo "$run" | jq -r '.status')
                echo ""
                echo "Run ($run_status): $run_url"
                # Fetch full run details to get head repository and branch info
                run_details=$(gh api -H "Accept: application/vnd.github+json" \
                  "repos/$REPO/actions/runs/$run_id" 2>/dev/null || true)
                if [ -z "$run_details" ]; then
                  echo "  ⚠️  Could not fetch run details, skipping"
                  continue
                fi
                # Get head owner and branch (works for both fork and non-fork PRs)
                head_owner=$(echo "$run_details" | jq -r '.head_repository.owner.login // empty')
                head_branch=$(echo "$run_details" | jq -r '.head_branch // empty')
                if [ -z "$head_owner" ] || [ -z "$head_branch" ]; then
                  echo "  ⚠️  Missing head info, skipping"
                  continue
                fi
                echo "  Branch: ${head_owner}:${head_branch}"
                # Find PR by searching with head=owner:branch
                pr_number=$(gh api -H "Accept: application/vnd.github+json" \
                  "repos/$REPO/pulls?state=open&head=${head_owner}:${head_branch}" \
                  --jq '.[0].number // empty' 2>/dev/null || true)
                if [ -z "$pr_number" ]; then
                  echo "  ⚠️  No open PR found, skipping"
                  continue
                fi
                pr_url="https://github.com/$REPO/pull/$pr_number"
                echo "  PR: $pr_url"
                # Check labels that protect the PR from cancellation (bypass-maintenance, high priority)
                labels=$(gh pr view "$pr_number" --repo "$REPO" --json labels \
                  | jq -r '.labels[].name' 2>/dev/null || true)
                if echo "$labels" | grep -Fxq "bypass-maintenance"; then
                  echo "  🛑 Skipping (bypass-maintenance label, never cancelled)"
                  continue
                fi
                if echo "$labels" | grep -Fxq "high priority"; then
                  if [ "$INCLUDE_HIGH_PRIORITY" != "true" ]; then
                    echo "  🛑 Skipping (high priority label)"
                    continue
                  fi
                  echo "  ⚠️  High priority PR, but include_high_priority is enabled"
                fi
                echo "  🚫 Cancelling..."
                gh run cancel "$run_id" --repo "$REPO" || echo "  ⚠️  Cancellation failed"
              done
            fi
            echo ""
          done
          echo "========================================="
          echo "✅ Processing complete"
          echo "========================================="
--- a/third_party/sglang/.github/workflows/ci-coverage-overview.yml
+++ b/third_party/sglang/.github/workflows/ci-coverage-overview.yml
@@ -0,0 +1,154 @@
 name: CI Coverage Overview
 on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM UTC
  pull_request:
    paths:
      - '.github/workflows/ci-coverage-overview.yml'
      - 'scripts/ci/utils/ci_coverage_report.py'
      - 'test/registered/**'
  workflow_dispatch:
    inputs:
      output_format:
        description: 'Output format'
        required: false
        default: 'markdown'
        type: choice
        options:
          - markdown
          - json
 jobs:
  summary:
    name: Summary
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Generate Summary Report
        run: |
          python scripts/ci/utils/ci_coverage_report.py --section summary
  by-folder:
    name: Tests by Folder
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Generate Tests by Folder Report
        run: |
          python scripts/ci/utils/ci_coverage_report.py --section by-folder
  by-suite:
    name: Tests by Suite
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Generate Tests by Suite Report
        run: |
          python scripts/ci/utils/ci_coverage_report.py --section by-suite
  unit-test-coverage:
    name: Unit Test Code Coverage
    if: github.event_name != 'pull_request'
    runs-on: 1-gpu-h100
    timeout-minutes: 30
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install dependencies
        timeout-minutes: 10
        run: |
          pip install -e "python/[test]"
      - name: Run unit tests with coverage
        timeout-minutes: 10
        run: |
          pytest test/registered/unit/ \
            --cov --cov-config=.coveragerc \
            --cov-report=term-missing:skip-covered \
            --continue-on-collection-errors \
            -v | tee coverage_output.txt
      - name: Write coverage to summary
        if: always()
        run: |
          echo "## Unit Test Code Coverage" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "**Commit:** \`${GITHUB_SHA::8}\` | **Branch:** \`${GITHUB_REF_NAME}\`" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          # Test result line (e.g., "== 42 passed, 1 failed in 23.5s ==")
          echo '```' >> $GITHUB_STEP_SUMMARY
          grep -E '^=+.*passed' coverage_output.txt >> $GITHUB_STEP_SUMMARY || true
          echo "" >> $GITHUB_STEP_SUMMARY
          # Coverage total
          grep -E '^TOTAL ' coverage_output.txt >> $GITHUB_STEP_SUMMARY || true
          echo '```' >> $GITHUB_STEP_SUMMARY
          # Partially covered core modules (1-49%) — most actionable for contributors
          # Only show modules with testable logic; skip configs, models, layers, etc.
          LOW_COV=$(awk '/^python\/.*%/ {
            for (i=1; i<=NF; i++) {
              if ($i ~ /^[0-9]+%$/) {
                pct = $i + 0
                if (pct >= 1 && pct < 50) printf "%-80s %5s  %s\n", $1, $(i-2), $i
                break
              }
            }
          }' coverage_output.txt \
            | grep -E '/(mem_cache|managers|sampling|parser|observability|function_call|entrypoints|speculative|multimodal|utils)/' \
            | head -40 || true)
          if [ -n "$LOW_COV" ]; then
            echo "" >> $GITHUB_STEP_SUMMARY
            echo "<details><summary>Core modules with coverage below 50% — good candidates for more unit tests</summary>" >> $GITHUB_STEP_SUMMARY
            echo "" >> $GITHUB_STEP_SUMMARY
            echo '```' >> $GITHUB_STEP_SUMMARY
            echo "$LOW_COV" >> $GITHUB_STEP_SUMMARY
            echo '```' >> $GITHUB_STEP_SUMMARY
            echo "</details>" >> $GITHUB_STEP_SUMMARY
          fi
  json-export:
    name: JSON Export
    runs-on: ubuntu-latest
    if: inputs.output_format == 'json'
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Generate JSON Report
        run: |
          python scripts/ci/utils/ci_coverage_report.py --output-format json > ci_coverage.json
      - name: Upload JSON artifact
        uses: actions/upload-artifact@v4
        with:
          name: ci-coverage-report
          path: ci_coverage.json
--- a/third_party/sglang/.github/workflows/ci-failure-monitor.yml
+++ b/third_party/sglang/.github/workflows/ci-failure-monitor.yml
@@ -0,0 +1,72 @@
 name: CI Failure Monitor
 on:
  schedule:
    - cron: '0 */12 * * *' # Every 12 hours
  workflow_dispatch:
 concurrency:
  group: ci-failure-monitor-${{ github.ref }}
  cancel-in-progress: true
 permissions:
  contents: read
  actions: read
 jobs:
  failure-analysis:
    if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.14'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install requests slack_sdk
      - name: Run Failure Analysis
        env:
          GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
          GH_PAT_FOR_RUNNER_ADMIN: ${{ secrets.GH_PAT_FOR_RUNNER_ADMIN }}
          PYTHONUNBUFFERED: 1
          PYTHONIOENCODING: utf-8
        run: |
          cd scripts/ci_monitor
          python ci_failures_analysis.py \
            --token $GITHUB_TOKEN \
            --limit 100 \
            --output ci_failure_analysis_$(date +%Y%m%d_%H%M%S).json
      - name: Upload Analysis Results
        uses: actions/upload-artifact@v4
        with:
          name: ci-failure-analysis-${{ github.run_number }}
          path: |
            scripts/ci_monitor/ci_failure_analysis_*.json
          retention-days: 7
      - name: Send Slack Notification
        if: always()
        env:
          SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }}
        run: |
          cd scripts/ci_monitor
          LATEST_REPORT=$(ls -t ci_failure_analysis_*.json | head -1)
          if [ ! -f "$LATEST_REPORT" ]; then
            echo "No report found, so skipping Slack notification"
            exit 0
          fi
          if [ -n "$SGLANG_DIFFUSION_SLACK_TOKEN" ]; then
            python3 post_ci_failures_to_slack.py --report-file "$LATEST_REPORT"
          else
            echo "SGLANG_DIFFUSION_SLACK_TOKEN not configured, skipping notification"
          fi
--- a/third_party/sglang/.github/workflows/close-inactive-issues.yml
+++ b/third_party/sglang/.github/workflows/close-inactive-issues.yml
@@ -0,0 +1,96 @@
 name: Close Inactive Issues
 on:
  schedule:
    - cron: '0 0 * * *'
  workflow_dispatch:
 permissions:
  issues: write
  contents: read
 jobs:
  close-inactive-issues:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-latest
    steps:
      - name: Check and close inactive issues
        uses: actions/github-script@v6
        with:
          github-token: ${{secrets.GITHUB_TOKEN}}
          script: |
            const sixtyDaysAgo = new Date(Date.now() - 60 * 24 * 60 * 60 * 1000);
            const [owner, repo] = process.env.GITHUB_REPOSITORY.split('/');
            console.log(`Owner: ${owner}, Repo: ${repo}`);
            async function fetchIssues(page = 1) {
              console.log(`Fetching issues for ${owner}/${repo}, page ${page}`);
              return await github.rest.issues.listForRepo({
                owner,
                repo,
                state: 'open',
                sort: 'updated',
                direction: 'asc',
                per_page: 100,
                page: page
              });
            }
            async function processIssues() {
              console.log('Starting to process issues');
              console.log(`Repository: ${owner}/${repo}`);
              let page = 1;
              let hasMoreIssues = true;
              while (hasMoreIssues) {
                try {
                  const issues = await fetchIssues(page);
                  console.log(`Fetched ${issues.data.length} issues on page ${page}`);
                  if (issues.data.length === 0) {
                    hasMoreIssues = false;
                    break;
                  }
                  for (const issue of issues.data) {
                    // Skip if the issue has 'good first issue' label
                    if (issue.labels.some(label => label.name === 'good first issue')) {
                      console.log(`Skipping issue #${issue.number} as it's marked as 'good first issue'`);
                      continue;
                    }
                    if (new Date(issue.updated_at) < sixtyDaysAgo) {
                      try {
                        await github.rest.issues.update({
                          owner,
                          repo,
                          issue_number: issue.number,
                          state: 'closed',
                          labels: [...issue.labels.map(l => l.name), 'inactive']
                        });
                        await github.rest.issues.createComment({
                          owner,
                          repo,
                          issue_number: issue.number,
                          body: 'This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.'
                        });
                        console.log(`Closed issue #${issue.number} due to inactivity.`);
                      } catch (error) {
                        console.error(`Failed to close issue #${issue.number}: ${error.message}`);
                      }
                    } else {
                      console.log(`Issue #${issue.number} is still active. Stopping processing.`);
                      hasMoreIssues = false;
                      break;
                    }
                  }
                  page += 1;
                } catch (error) {
                  console.error(`Error fetching issues on page ${page}: ${error.message}`);
                  hasMoreIssues = false;
                }
              }
              console.log('Finished processing issues');
            }
            await processIssues();
--- a/third_party/sglang/.github/workflows/diffusion-ci-gt-gen.yml
+++ b/third_party/sglang/.github/workflows/diffusion-ci-gt-gen.yml
@@ -0,0 +1,115 @@
 name: Diffusion CI Ground Truth Generation
 on:
  workflow_dispatch:
    inputs:
      ref:
        description: 'Git ref to checkout'
        required: false
        default: ''
        type: string
      case_ids:
        description: 'Specific case IDs to run (space-separated, optional)'
        required: false
        default: ''
        type: string
 concurrency:
  group: diffusion-ci-gt-gen-${{ github.ref }}
  cancel-in-progress: true
 permissions:
  contents: write
  actions: read
 jobs:
  multimodal-diffusion-gen-1gpu:
    if: github.repository == 'sgl-project/sglang'
    runs-on: 1-gpu-h100
    strategy:
      matrix:
        part: [0, 1]
    timeout-minutes: 150
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Install dependencies
        run: bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Generate outputs
        run: |
          cd python
          python -m sglang.multimodal_gen.test.scripts.gen_diffusion_ci_outputs \
            --suite 1-gpu \
            --partition-id ${{ matrix.part }} \
            --total-partitions 2 \
            --out-dir ./diffusion-ci-outputs \
            ${{ inputs.case_ids != '' && format('--case-ids {0}', inputs.case_ids) || '' }}
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: diffusion-gen-1gpu-part${{ matrix.part }}
          path: python/diffusion-ci-outputs
          retention-days: 7
  multimodal-diffusion-gen-2gpu:
    if: github.repository == 'sgl-project/sglang'
    runs-on: 2-gpu-h100
    strategy:
      matrix:
        part: [0, 1]
    timeout-minutes: 150
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Install dependencies
        run: bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Generate outputs
        run: |
          cd python
          python -m sglang.multimodal_gen.test.scripts.gen_diffusion_ci_outputs \
            --suite 2-gpu \
            --partition-id ${{ matrix.part }} \
            --total-partitions 2 \
            --out-dir ./diffusion-ci-outputs \
            ${{ inputs.case_ids != '' && format('--case-ids {0}', inputs.case_ids) || '' }}
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: diffusion-gen-2gpu-part${{ matrix.part }}
          path: python/diffusion-ci-outputs
          retention-days: 7
  diffusion-ci-push:
    needs: [multimodal-diffusion-gen-1gpu, multimodal-diffusion-gen-2gpu]
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          pattern: diffusion-gen-*
          path: combined
          merge-multiple: true
      - name: Collect image files
        run: |
          mkdir -p gt_images
          find combined \( -name "*.png" -o -name "*.jpg" -o -name "*.jpeg" -o -name "*.webp" \) -type f -exec cp -f {} gt_images/ \;
      - name: Publish GT images to sglang-bot/sglang-ci-data
        env:
          GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
        run: python scripts/ci/utils/diffusion/publish_diffusion_gt.py --source-dir gt_images
--- a/third_party/sglang/.github/workflows/execute-notebook.yml
+++ b/third_party/sglang/.github/workflows/execute-notebook.yml
@@ -0,0 +1,74 @@
 name: Execute Notebooks
 on:
  pull_request:
    branches: [ main ]
    types: [opened, synchronize, reopened, labeled]
    paths:
      - "python/sglang/**"
      - "docs/**"
      - "!python/sglang/**/*.md"
      - "!docs/**/*.md"
  workflow_dispatch:
 concurrency:
  group: execute-notebook-${{ github.ref }}
  cancel-in-progress: true
 env:
  SGLANG_IS_IN_CI: true
 jobs:
  call-gate:
    # Align with PR Test: fail fast if PR doesn't have run-ci label.
    # This makes /tag-and-rerun-ci work by rerunning this failed workflow.
    uses: ./.github/workflows/pr-gate.yml
    secrets: inherit
  run-all-notebooks:
    needs: [call-gate]
    runs-on: 1-gpu-h100
    if: github.event_name != 'pull_request' || needs.call-gate.result == 'success'
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
          pip install -r docs/requirements.txt
          apt-get update && apt-get install -y pandoc parallel retry
          ln -sf "$(which python3)" /usr/bin/python
      - name: Setup Jupyter Kernel
        run: |
          python -m ipykernel install --user --name python3 --display-name "Python 3"
      - name: Execute notebooks
        timeout-minutes: 40
        run: |
          cd docs
          make clean
          make compile
  notebook-finish:
    needs: [
      call-gate,
      run-all-notebooks
    ]
    runs-on: ubuntu-latest
    if: always() && needs.run-all-notebooks.result != 'skipped'
    steps:
      - name: Check all dependent job statuses
        run: |
          results=(${{ join(needs.*.result, ' ') }})
          for result in "${results[@]}"; do
            if [ "$result" = "failure" ] || [ "$result" = "cancelled" ]; then
              echo "Job failed with result: $result"
              exit 1
            fi
          done
          echo "All jobs completed successfully"
          exit 0
--- a/third_party/sglang/.github/workflows/full-test-npu.yml
+++ b/third_party/sglang/.github/workflows/full-test-npu.yml
@@ -0,0 +1,355 @@
 name: Full Test (NPU)
 on:
 #  pull_request:
 #    branches:
 #      - main
 #    paths:
 #      - ".github/workflows/full-test-npu.yml"
  workflow_dispatch:
    inputs:
      ref:
        description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.'
        required: false
        type: string
        default: ''
      job_filter:
        description: 'Select which job to run (leave empty or "all" to run all jobs)'
        required: false
        type: string
        default: 'all'
      image_a3:
        description: 'Docker image used to run the A3 test jobs.'
        required: false
        type: string
        default: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11'
      skip_install_flag:
        description: 'Indicates whether to skip the installation of sglang, defaulting to false.'
        required: false
        type: string
        default: 'false'
 concurrency:
  group: full-test-npu-${{ inputs.ref || github.ref }}
  cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
 jobs:
  set-image-config:
    runs-on: ubuntu-latest
    outputs:
      ref: ${{ steps.set-vars.outputs.ref }}
      job_filter: ${{ steps.set-vars.outputs.job_filter }}
      image_a3: ${{ steps.set-vars.outputs.image_a3 }}
      skip_install_flag: ${{ steps.set-vars.outputs.skip_install_flag }}
    steps:
      # When triggered by a PR, no input parameters are used; the latest community code is tested by default.
      - name: Set image config
        id: set-vars
        run: |
          if [ -z "${{ inputs.ref }}" ]; then
            echo "ref=" >> $GITHUB_OUTPUT
          else
            echo "ref=${{ inputs.ref }}" >> $GITHUB_OUTPUT
          fi
          if [ -z "${{ inputs.job_filter }}" ]; then
            echo "job_filter=all" >> $GITHUB_OUTPUT
          else
            echo "job_filter=${{ inputs.job_filter }}" >> $GITHUB_OUTPUT
          fi
          if [ -z "${{ inputs.image_a3 }}" ]; then
            echo "image_a3=swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11" >> $GITHUB_OUTPUT
          else
            echo "image_a3=${{ inputs.image_a3 }}" >> $GITHUB_OUTPUT
          fi
          if [ -z "${{ inputs.skip_install_flag }}" ]; then
            echo "skip_install_flag=false" >> $GITHUB_OUTPUT
          else
            echo "skip_install_flag=${{ inputs.skip_install_flag }}" >> $GITHUB_OUTPUT
          fi
  nighly-test-npu:
    needs: [set-image-config]
    name: nightly-test-npu
    if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
    uses: ./.github/workflows/nightly-test-npu.yml
    with:
      ref:  ${{ needs.set-image-config.outputs.ref }}
      job_filter: ${{ needs.set-image-config.outputs.job_filter }}
      image_a3: ${{ needs.set-image-config.outputs.image_a3 }}
      skip_install_flag: ${{ needs.set-image-config.outputs.skip_install_flag }}
    secrets: inherit
  full-1-npu-a3:
    needs: [set-image-config]
    if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
    runs-on: linux-aarch64-a3-2
    container:
      image: ${{ needs.set-image-config.outputs.image_a3 }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
            bash scripts/ci/npu/npu_ci_install_dependency.sh a3
          fi
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Print Log Information
        run: |
          bash scripts/ci/npu/npu_log_print.sh
      - name: Run test
        timeout-minutes: 240
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          pip install sglang_router
          hf download lmms-lab/MMMU --repo-type dataset
          pip install sentence_transformers torchaudio==2.8.0
          pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
          pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
          pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv
          git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
          cd ./lmms-eval
          nohup pip install . > lmmslog.txt 2>&1 &
          sleep 120
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          cd ../
          cd test
          python3 run_suite.py --hw npu --suite full-1-npu-a3 --nightly --continue-on-error --timeout-per-file 3600
  full-2-npu-a3:
    needs: [set-image-config]
    if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
    runs-on: linux-aarch64-a3-2
    container:
      image: ${{ needs.set-image-config.outputs.image_a3 }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
            bash scripts/ci/npu/npu_ci_install_dependency.sh a3
          fi
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Print Log Information
        run: |
          bash scripts/ci/npu/npu_log_print.sh
      - name: Run test
        timeout-minutes: 240
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          pip install sglang_router
          hf download lmms-lab/MMMU --repo-type dataset
          pip install sentence_transformers torchaudio==2.8.0
          pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
          pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
          pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv
          git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
          cd ./lmms-eval
          nohup pip install . > lmmslog.txt 2>&1 &
          sleep 120
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          cd ../
          cd test
          python3 run_suite.py --hw npu --suite full-2-npu-a3 --nightly --continue-on-error --timeout-per-file 3600
  full-4-npu-a3:
    needs: [set-image-config]
    if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
    runs-on: linux-aarch64-a3-4
    container:
      image: ${{ needs.set-image-config.outputs.image_a3 }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
            bash scripts/ci/npu/npu_ci_install_dependency.sh a3
          fi
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Print Log Information
        run: |
          bash scripts/ci/npu/npu_log_print.sh
      - name: Run test
        timeout-minutes: 240
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          pip install sglang_router
          hf download lmms-lab/MMMU --repo-type dataset
          pip install sentence_transformers torchaudio==2.8.0
          pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
          pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
          pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv
          git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
          cd ./lmms-eval
          nohup pip install . > lmmslog.txt 2>&1 &
          sleep 120
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          cd ../
          cd test
          python3 run_suite.py --hw npu --suite full-4-npu-a3 --nightly --continue-on-error --timeout-per-file 3600
  full-16-npu-a3:
    needs: [set-image-config]
    if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
    runs-on: linux-aarch64-a3-16
    container:
      image: ${{ needs.set-image-config.outputs.image_a3 }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
            bash scripts/ci/npu/npu_ci_install_dependency.sh a3
          fi
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Print Log Information
        run: |
          bash scripts/ci/npu/npu_log_print.sh
      - name: Run test
        timeout-minutes: 240
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          pip install sglang_router
          hf download lmms-lab/MMMU --repo-type dataset
          pip install sentence_transformers torchaudio==2.8.0
          pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
          pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
          pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.25 numpy==1.26.4 dotenv
          git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
          cd ./lmms-eval
          nohup pip install . > lmmslog.txt 2>&1 &
          sleep 120
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          cd ../
          cd test
          python3 run_suite.py --hw npu --suite full-16-npu-a3 --nightly --continue-on-error --timeout-per-file 3600
  check-all-jobs:
    if: github.repository == 'sgl-project/sglang' && always()
    needs:
      - nighly-test-npu
      - full-1-npu-a3
      - full-2-npu-a3
      - full-4-npu-a3
      - full-16-npu-a3
    runs-on: ubuntu-latest
    container:
      image: docker.m.daocloud.io/ubuntu:22.04
    steps:
      - name: Check if any job failed
        run: |
          if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then
            echo "One or more nightly test jobs failed"
            exit 1
          fi
          if [[ "${{ contains(needs.*.result, 'cancelled') }}" == "true" ]]; then
            echo "One or more nightly test jobs were cancelled"
            exit 1
          fi
          echo "All nightly test jobs passed"
--- a/third_party/sglang/.github/workflows/labeler.yml
+++ b/third_party/sglang/.github/workflows/labeler.yml
@@ -0,0 +1,20 @@
 name: Auto Label PRs
 on:
  pull_request_target:
    types: [opened, synchronize, reopened]
 permissions:
  contents: read
  pull-requests: write
 jobs:
  label:
    runs-on: ubuntu-latest
    steps:
      - name: Auto-label by file changes
        uses: actions/labeler@v5
        with:
          repo-token: "${{ secrets.GITHUB_TOKEN }}"
          configuration-path: .github/labeler.yml
          sync-labels: false
--- a/third_party/sglang/.github/workflows/lint.yml
+++ b/third_party/sglang/.github/workflows/lint.yml
@@ -0,0 +1,39 @@
 name: Lint
 on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
 jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.12"
      - name: Install pre-commit hook
        run: |
          python -m pip install pre-commit
          pre-commit install
      - name: Run pre-commit checks
        run: SKIP=no-commit-to-branch pre-commit run --all-files --show-diff-on-failure
      - name: Run lychee docs checks (offline references)
        uses: lycheeverse/lychee-action@8646ba30535128ac92d33dfc9133794bfdd9b411 # v2
        with:
          args: --config .github/linters/lychee.toml README.md "docs/**/*.md" "docs/**/*.rst" "docs/**/*.ipynb"
      - name: Run sgl-kernel clang-format checks
        uses: DoozyX/clang-format-lint-action@v0.20
        with:
          source: sgl-kernel
          extensions: h,c,cpp,hpp,cu,cuh,cc
          clangFormatVersion: 20
          style: file
--- a/third_party/sglang/.github/workflows/list-active-pr-runs.yml
+++ b/third_party/sglang/.github/workflows/list-active-pr-runs.yml
@@ -0,0 +1,317 @@
 name: List Active Runs
 on:
  workflow_dispatch:
    inputs:
      workflows:
        description: 'Space-separated list of workflow filenames to check'
        required: false
        type: string
        default: 'pr-test.yml'
 permissions:
  actions: read
  contents: read
  pull-requests: read
 jobs:
  list-active-runs:
    runs-on: ubuntu-latest
    steps:
      - name: Install GitHub CLI
        run: sudo apt-get install -y gh jq
      - name: List active runs grouped by PR
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO: ${{ github.repository }}
          WORKFLOWS: ${{ github.event.inputs.workflows || 'pr-test.yml' }}
        shell: bash
        run: |
          set -euo pipefail
          echo "========================================="
          echo "🔍 Active Workflow Runs Report"
          echo "========================================="
          echo ""
          # Parse the space-separated list of workflow files to check
          read -r -a workflow_files <<< "${WORKFLOWS}"
          echo "📋 Checking specified workflows: ${WORKFLOWS}"
          echo ""
          # Create a temporary file to store PR data
          pr_data_file=$(mktemp)
          # Process each workflow
          for workflow_file in "${workflow_files[@]}"; do
            echo "Scanning workflow: $workflow_file"
            # Get all active runs (queued, waiting, in_progress)
            active_runs=$(gh run list \
              --repo "$REPO" \
              --workflow "$workflow_file" \
              --json databaseId,status,event,headBranch,createdAt,updatedAt,headSha,number,attempt \
              --limit 500 \
              | jq -c '.[] | select(.status=="queued" or .status=="waiting" or .status=="in_progress")')
            if [ -z "$active_runs" ]; then
              continue
            fi
            # Process each run
            echo "$active_runs" | while read -r run; do
              run_id=$(echo "$run" | jq -r '.databaseId')
              run_status=$(echo "$run" | jq -r '.status')
              run_event=$(echo "$run" | jq -r '.event')
              created_at=$(echo "$run" | jq -r '.createdAt')
              head_sha=$(echo "$run" | jq -r '.headSha')
              run_number=$(echo "$run" | jq -r '.number')
              run_attempt=$(echo "$run" | jq -r '.attempt // 1')
              # Get detailed run information (head repository and branch)
              run_details=$(gh api "repos/$REPO/actions/runs/$run_id" 2>/dev/null || true)
              if [ -z "$run_details" ]; then
                continue
              fi
              head_owner=$(echo "$run_details" | jq -r '.head_repository.owner.login // empty')
              head_branch=$(echo "$run_details" | jq -r '.head_branch // empty')
              if [ -z "$head_owner" ] || [ -z "$head_branch" ]; then
                continue
              fi
              # Find PR number (may be empty for non-PR runs)
              pr_number=$(gh api "repos/$REPO/pulls?state=open&head=${head_owner}:${head_branch}" \
                --jq '.[0].number // empty' 2>/dev/null || true)
              if [ -z "$pr_number" ]; then
                pr_number="NO_PR"
              fi
              # Get jobs for this run (with pagination to avoid missing jobs)
              jobs=$(gh api "repos/$REPO/actions/runs/$run_id/jobs" --paginate --jq '.jobs[]' | jq -s '.')
              running_jobs=$(echo "$jobs" | jq '[.[] | select(.status=="in_progress")] | length')
              queued_jobs=$(echo "$jobs" | jq '[.[] | select(.status=="queued" or .status=="waiting")] | length')
              # Get runner info for running jobs
              runners=$(echo "$jobs" | jq -r '.[] | select(.status=="in_progress") | .runner_name // "N/A"' | paste -sd "," -)
              # Calculate queue time
              current_time=$(date -u +%s)
              created_time=$(date -u -d "$created_at" +%s 2>/dev/null || echo "$current_time")
              queue_time=$((current_time - created_time))
              queue_minutes=$((queue_time / 60))
              # Store data in temporary file (unified format with event and branch)
              echo "$pr_number|$workflow_file|$run_id|$run_status|$running_jobs|$queued_jobs|$runners|$queue_minutes|$created_at|$head_sha|$run_attempt|$run_event|$head_branch" >> "$pr_data_file"
            done
          done
          echo ""
          echo "========================================="
          echo "📊 Active Runs Summary"
          echo "========================================="
          echo ""
          if [ ! -s "$pr_data_file" ]; then
            echo "✅ No active runs found"
            rm -f "$pr_data_file"
            exit 0
          fi
          # Get unique PR numbers (exclude NO_PR entries)
          pr_numbers=$(cut -d'|' -f1 < "$pr_data_file" | grep -v '^NO_PR$' | sort -u || true)
          # Separate high priority and normal PRs
          high_priority_prs=()
          normal_prs=()
          for pr_num in $pr_numbers; do
            labels=$(gh pr view "$pr_num" --repo "$REPO" --json labels \
              | jq -r '.labels[].name' 2>/dev/null || true)
            if echo "$labels" | grep -Fxq "high priority"; then
              high_priority_prs+=("$pr_num")
            else
              normal_prs+=("$pr_num")
            fi
          done
          # Combine: high priority first, then normal
          sorted_pr_numbers=("${high_priority_prs[@]}" "${normal_prs[@]}")
          pr_count=0
          total_running=0
          total_queued=0
          for pr_num in "${sorted_pr_numbers[@]}"; do
            pr_count=$((pr_count + 1))
            # Get PR details
            pr_info=$(gh pr view "$pr_num" --repo "$REPO" --json title,author,labels,url 2>/dev/null || true)
            if [ -z "$pr_info" ]; then
              continue
            fi
            pr_title=$(echo "$pr_info" | jq -r '.title')
            pr_author=$(echo "$pr_info" | jq -r '.author.login')
            pr_url=$(echo "$pr_info" | jq -r '.url')
            pr_labels=$(echo "$pr_info" | jq -r '.labels[].name' | paste -sd ", " -)
            if [ -z "$pr_labels" ]; then
              pr_labels="(no labels)"
            fi
            # Add priority indicator
            priority_indicator=""
            if echo "$pr_labels" | grep -q "high priority"; then
              priority_indicator="🔴 [HIGH PRIORITY] "
            fi
            echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
            echo "🔗 ${priority_indicator}PR #$pr_num: $pr_title"
            echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
            echo "👤 Author: $pr_author"
            echo "🏷️  Labels: $pr_labels"
            echo "🔗 URL: $pr_url"
            echo ""
            # Get all runs for this PR
            pr_runs=$(grep "^$pr_num|" "$pr_data_file")
            pr_running_total=0
            pr_queued_total=0
            echo "$pr_runs" | while read -r line; do
              workflow=$(echo "$line" | cut -d'|' -f2)
              run_id=$(echo "$line" | cut -d'|' -f3)
              status=$(echo "$line" | cut -d'|' -f4)
              running=$(echo "$line" | cut -d'|' -f5)
              queued=$(echo "$line" | cut -d'|' -f6)
              runners=$(echo "$line" | cut -d'|' -f7)
              queue_min=$(echo "$line" | cut -d'|' -f8)
              created=$(echo "$line" | cut -d'|' -f9)
              attempt=$(echo "$line" | cut -d'|' -f11)
              pr_running_total=$((pr_running_total + running))
              pr_queued_total=$((pr_queued_total + queued))
              run_url="https://github.com/$REPO/actions/runs/$run_id"
              # Calculate retry count for this specific run
              retry_count=$((attempt - 1))
              # Show retry indicator
              retry_indicator=""
              if [ "$retry_count" -gt 0 ]; then
                retry_indicator=" 🔄 Retry #$retry_count"
              fi
              echo "  📦 Workflow: $workflow (Run #$run_id)$retry_indicator"
              echo "     Status: $status"
              echo "     🟢 Running jobs: $running"
              echo "     🟡 Queued jobs: $queued"
              if [ "$running" -gt 0 ] && [ "$runners" != "" ]; then
                echo "     🖥️  Runners: $runners"
              fi
              if [ "$queue_min" -gt 0 ]; then
                echo "     ⏱️  Queue time: ${queue_min} minutes"
              fi
              echo "     🔗 Run URL: $run_url"
              echo ""
            done
            # Summary for this PR (recomputed from the data file because the piped while-loop above runs in a subshell)
            pr_running_total=$(grep "^$pr_num|" "$pr_data_file" | cut -d'|' -f5 | awk '{sum+=$1} END {print sum+0}')
            pr_queued_total=$(grep "^$pr_num|" "$pr_data_file" | cut -d'|' -f6 | awk '{sum+=$1} END {print sum+0}')
            total_running=$((total_running + pr_running_total))
            total_queued=$((total_queued + pr_queued_total))
            echo "  📊 PR Total: $pr_running_total running, $pr_queued_total queued"
            echo ""
          done
          # --- Non-PR Runs Section ---
          non_pr_runs=$(grep '^NO_PR|' "$pr_data_file" 2>/dev/null || true)
          non_pr_running=0
          non_pr_queued=0
          if [ -n "$non_pr_runs" ]; then
            echo "========================================="
            echo "📦 Non-PR Runs (manual / scheduled / other)"
            echo "========================================="
            echo ""
            echo "$non_pr_runs" | while read -r line; do
              workflow=$(echo "$line" | cut -d'|' -f2)
              run_id=$(echo "$line" | cut -d'|' -f3)
              status=$(echo "$line" | cut -d'|' -f4)
              running=$(echo "$line" | cut -d'|' -f5)
              queued=$(echo "$line" | cut -d'|' -f6)
              runners=$(echo "$line" | cut -d'|' -f7)
              queue_min=$(echo "$line" | cut -d'|' -f8)
              created=$(echo "$line" | cut -d'|' -f9)
              attempt=$(echo "$line" | cut -d'|' -f11)
              event=$(echo "$line" | cut -d'|' -f12)
              branch=$(echo "$line" | cut -d'|' -f13)
              run_url="https://github.com/$REPO/actions/runs/$run_id"
              retry_count=$((attempt - 1))
              retry_indicator=""
              if [ "$retry_count" -gt 0 ]; then
                retry_indicator=" 🔄 Retry #$retry_count"
              fi
              echo "  📦 Workflow: $workflow (Run #$run_id)$retry_indicator"
              echo "     Event: $event"
              echo "     Branch: $branch"
              echo "     Status: $status"
              echo "     🟢 Running jobs: $running"
              echo "     🟡 Queued jobs: $queued"
              if [ "$running" -gt 0 ] && [ "$runners" != "" ]; then
                echo "     🖥️  Runners: $runners"
              fi
              if [ "$queue_min" -gt 0 ]; then
                echo "     ⏱️  Queue time: ${queue_min} minutes"
              fi
              echo "     🔗 Run URL: $run_url"
              echo ""
            done
            non_pr_running=$(echo "$non_pr_runs" | cut -d'|' -f5 | awk '{sum+=$1} END {print sum+0}')
            non_pr_queued=$(echo "$non_pr_runs" | cut -d'|' -f6 | awk '{sum+=$1} END {print sum+0}')
            non_pr_count=$(echo "$non_pr_runs" | wc -l | tr -d ' ')
            total_running=$((total_running + non_pr_running))
            total_queued=$((total_queued + non_pr_queued))
            echo "  📊 Non-PR Total: $non_pr_running running, $non_pr_queued queued"
            echo ""
          fi
          # Overall summary
          echo "========================================="
          echo "📈 Overall Summary"
          echo "========================================="
          echo "Total PRs with active runs: $pr_count"
          echo "Total non-PR active runs: ${non_pr_count:-0}"
          echo "Total running jobs: $total_running"
          echo "Total queued jobs: $total_queued"
          echo "========================================="
          # Cleanup
          rm -f "$pr_data_file"
--- a/third_party/sglang/.github/workflows/nightly-link-check.yml
+++ b/third_party/sglang/.github/workflows/nightly-link-check.yml
@@ -0,0 +1,32 @@
 name: Nightly Link Check
 on:
  schedule:
    - cron: "0 2 * * *"
  workflow_dispatch:
 concurrency:
  group: nightly-link-check-${{ github.ref }}
  cancel-in-progress: true
 jobs:
  lychee-online:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Run lychee online link checks
        uses: lycheeverse/lychee-action@8646ba30535128ac92d33dfc9133794bfdd9b411 # v2
        with:
          fail: true
          args: >-
            --config .github/linters/lychee-ci.toml
            README.md
            docs/**/*.md
            docs/**/*.rst
            docs/**/*.ipynb
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
--- a/third_party/sglang/.github/workflows/nightly-release-gateway.yml
+++ b/third_party/sglang/.github/workflows/nightly-release-gateway.yml
@@ -0,0 +1,196 @@
 # Nightly release workflow for SGLang Model Gateway
 name: Nightly Release SGLang Model Gateway to PyPI
 on:
  schedule:
    # Run at 2 AM UTC every day
    - cron: '0 2 * * *'
  workflow_dispatch:  # Allow manual trigger
 jobs:
  build:
    name: build on ${{ matrix.platform || matrix.os }} (${{ matrix.target }} - ${{ matrix.manylinux || 'auto' }})
    runs-on: ${{ matrix.os }}-latest
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu, macos, windows]
        target: [x86_64, aarch64]
        manylinux: [auto]
        include:
          - os: ubuntu
            platform: linux
          - os: windows
            ls: dir
            target: x86_64
            python-architecture: x64
            interpreter: 3.9 3.10 3.11 3.12 3.13
          - os: macos
            target: aarch64
            interpreter: 3.9 3.10 3.11 3.12 3.13
          - os: ubuntu
            platform: linux
            target: aarch64
          # musllinux
          - os: ubuntu
            platform: linux
            target: x86_64
            manylinux: musllinux_1_1
          - os: ubuntu
            platform: linux
            target: aarch64
            manylinux: musllinux_1_1
        exclude:
          - os: windows
            target: aarch64
    steps:
      - uses: actions/checkout@v4
        with:
          path: sglang-repo
      - name: Move sgl-model-gateway folder to root and delete sglang-repo
        run: |
          mv sglang-repo/sgl-model-gateway/* .
          rm -rf sglang-repo
          ls -alt
        shell: bash
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.13"
          architecture: ${{ matrix.python-architecture || 'x64' }}
      - name: Modify version for nightly release
        run: |
          # Get current version from pyproject.toml
          CURRENT_VERSION=$(python -c "import tomllib; print(tomllib.load(open('bindings/python/pyproject.toml', 'rb'))['project']['version'])" 2>/dev/null || python -c "import tomli; print(tomli.load(open('bindings/python/pyproject.toml', 'rb'))['project']['version'])")
          # Create nightly version with date: e.g., 0.2.1.dev20250128
          NIGHTLY_VERSION="${CURRENT_VERSION}.dev$(date +%Y%m%d)"
          echo "Nightly version: $NIGHTLY_VERSION"
          # Update pyproject.toml with nightly version (temporary, not committed)
          sed -i.bak "s/version = \"${CURRENT_VERSION}\"/version = \"${NIGHTLY_VERSION}\"/" bindings/python/pyproject.toml
          # Verify the change
          cat bindings/python/pyproject.toml | grep "^version"
        shell: bash
      - name: Install twine and tomli
        run: pip install -U twine tomli
      - name: Install protoc (macOS)
        if: matrix.os == 'macos'
        run: brew install protobuf
      - name: Install protoc (Windows)
        if: matrix.os == 'windows'
        run: choco install protoc -y
      - name: Build wheels
        uses: PyO3/maturin-action@v1
        with:
          working-directory: bindings/python
          target: ${{ matrix.target }}
          manylinux: ${{ matrix.manylinux || 'auto' }}
          args: --release --out dist --features vendored-openssl --interpreter ${{ matrix.interpreter || '3.9 3.10 3.11 3.12 3.13 3.14' }}
          rust-toolchain: stable
          docker-options: -e CI -e CC_aarch64_unknown_linux_gnu=aarch64-linux-gnu-gcc -e CXX_aarch64_unknown_linux_gnu=aarch64-linux-gnu-g++
          before-script-linux: |
            # Install build dependencies (perl/make for vendored OpenSSL, protoc for gRPC)
            if command -v yum &> /dev/null; then
              yum update -y && yum install -y wget unzip gcc gcc-c++ perl-core make
              # Install cross-compilation toolchain for aarch64 if needed
              if [ "${{ matrix.target }}" = "aarch64" ]; then
                yum install -y gcc-aarch64-linux-gnu gcc-c++-aarch64-linux-gnu || true
              fi
            elif command -v apt-get &> /dev/null; then
              apt-get update && apt-get install -y wget unzip gcc g++ perl make
              # Install cross-compilation toolchain for aarch64 if needed
              if [ "${{ matrix.target }}" = "aarch64" ]; then
                apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu || true
              fi
            fi
            (cd /tmp && \
             wget https://github.com/protocolbuffers/protobuf/releases/download/v32.0/protoc-32.0-linux-x86_64.zip && \
             unzip protoc-32.0-linux-x86_64.zip -d /usr/local && \
             rm protoc-32.0-linux-x86_64.zip)
            protoc --version
      - name: List built packages
        run: ${{ matrix.ls || 'ls -lh' }} bindings/python/dist/
      - name: Check packages
        run: twine check --strict bindings/python/dist/*
      - uses: actions/upload-artifact@v4
        with:
          name: packages-${{ matrix.os }}-${{ matrix.target }}-${{ matrix.manylinux || 'auto' }}
          path: bindings/python/dist/
  build-sdist:
    name: Build SDist
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          path: sglang-repo
      - name: Move sgl-model-gateway folder to root and delete sglang-repo
        run: |
          mv sglang-repo/sgl-model-gateway/* .
          rm -rf sglang-repo
          ls -alt
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.13"
      - name: Modify version for nightly release
        run: |
          # Get current version from pyproject.toml
          CURRENT_VERSION=$(python -c "import tomllib; print(tomllib.load(open('bindings/python/pyproject.toml', 'rb'))['project']['version'])" 2>/dev/null || python -c "import tomli; print(tomli.load(open('bindings/python/pyproject.toml', 'rb'))['project']['version'])")
          # Create nightly version with date: e.g., 0.2.1.dev20250128
          NIGHTLY_VERSION="${CURRENT_VERSION}.dev$(date +%Y%m%d)"
          echo "Nightly version: $NIGHTLY_VERSION"
          # Update pyproject.toml with nightly version (temporary, not committed)
          sed -i "s/version = \"${CURRENT_VERSION}\"/version = \"${NIGHTLY_VERSION}\"/" bindings/python/pyproject.toml
          # Verify the change
          cat bindings/python/pyproject.toml | grep "^version"
      - name: Build SDist
        uses: PyO3/maturin-action@v1
        with:
          working-directory: bindings/python
          command: sdist
          args: --out dist
          rust-toolchain: stable
      - uses: actions/upload-artifact@v4
        with:
          name: sdist
          path: bindings/python/dist/*.tar.gz
  upload:
    name: Upload to TestPyPI
    if: github.repository == 'sgl-project/sglang'  # Ensure this job only runs for the sgl-project/sglang repository
    needs: [build, build-sdist]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          path: dist
          merge-multiple: true
      - name: Upload to TestPyPI
        env:
          TWINE_USERNAME: __token__
          TWINE_PASSWORD: ${{ secrets.TEST_PYPI_TOKEN_ROUTER }}
        run: |
          pip install twine
          twine upload --repository testpypi dist/* --verbose
--- a/third_party/sglang/.github/workflows/nightly-test-amd-rocm720.yml
+++ b/third_party/sglang/.github/workflows/nightly-test-amd-rocm720.yml
--- a/third_party/sglang/.github/workflows/nightly-test-amd.yml
+++ b/third_party/sglang/.github/workflows/nightly-test-amd.yml
--- a/third_party/sglang/.github/workflows/nightly-test-intel.yml
+++ b/third_party/sglang/.github/workflows/nightly-test-intel.yml
@@ -0,0 +1,33 @@
 name: Nightly Test (Intel)
 on:
  schedule:
    - cron: '0 0 * * *'
  push:
    branches:
      - main
    paths:
      - "python/sglang/version.py"
  workflow_dispatch:
  workflow_call:
    inputs:
      ref:
        description: "Branch, tag or SHA to checkout"
        required: false
        type: string
        default: ""
 concurrency:
  group: nightly-test-intel-${{ inputs.ref || github.ref }}
  cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
 jobs:
  # Placeholder for Intel GPU tests
  # Add Intel-specific nightly test workflows here when available
  placeholder:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-latest
    steps:
      - name: Placeholder
        run: echo "Intel nightly tests will be added here"
--- a/third_party/sglang/.github/workflows/nightly-test-npu.yml
+++ b/third_party/sglang/.github/workflows/nightly-test-npu.yml
@@ -0,0 +1,428 @@
 name: Nightly Test (NPU)
 on:
  schedule:
    - cron: '0 18 * * *'  # Execute at 2:00 a.m. Beijing Time every day
  pull_request:
    branches:
      - main
    paths:
      - ".github/workflows/nightly-test-npu.yml"
  workflow_dispatch:
  workflow_call:
    inputs:
      ref:
        description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.'
        required: false
        type: string
        default: ''
      job_filter:
        description: 'Select which job to run (leave empty or "all" to run all jobs)'
        required: false
        type: string
        default: 'all'
      image_a3:
        description: 'Docker image used to run the a3 test jobs.'
        required: false
        type: string
        default: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11'
      skip_install_flag:
        description: 'Whether to skip installing sglang (defaults to false).'
        required: false
        type: string
        default: 'false'
 concurrency:
  group: nightly-test-npu-${{ inputs.ref || github.ref }}
  cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
 jobs:
  set-image-config:
    runs-on: ubuntu-latest
    outputs:
      ref: ${{ steps.set-vars.outputs.ref }}
      job_filter: ${{ steps.set-vars.outputs.job_filter }}
      image_a3: ${{ steps.set-vars.outputs.image_a3 }}
      skip_install_flag: ${{ steps.set-vars.outputs.skip_install_flag }}
    steps:
      # When triggered by a PR, no input parameters are passed, so the latest community code is tested by default.
      - name: Set image config
        id: set-vars
        run: |
          if [ -z "${{ inputs.ref }}" ]; then
            echo "ref=" >> $GITHUB_OUTPUT
          else
            echo "ref=${{ inputs.ref }}" >> $GITHUB_OUTPUT
          fi
          if [ -z "${{ inputs.job_filter }}" ]; then
            echo "job_filter=all" >> $GITHUB_OUTPUT
          else
            echo "job_filter=${{ inputs.job_filter }}" >> $GITHUB_OUTPUT
          fi
          if [ -z "${{ inputs.image_a3 }}" ]; then
            echo "image_a3=swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11" >> $GITHUB_OUTPUT
          else
            echo "image_a3=${{ inputs.image_a3 }}" >> $GITHUB_OUTPUT
          fi
          if [ -z "${{ inputs.skip_install_flag }}" ]; then
            echo "skip_install_flag=false" >> $GITHUB_OUTPUT
          else
            echo "skip_install_flag=${{ inputs.skip_install_flag }}" >> $GITHUB_OUTPUT
          fi
  nightly-1-npu-a3:
    needs: [set-image-config]
    if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
    runs-on: linux-aarch64-a3-2
    strategy:
      fail-fast: false
      matrix:
        part: [0, 1]
    container:
      image: ${{ needs.set-image-config.outputs.image_a3 }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
            bash scripts/ci/npu/npu_ci_install_dependency.sh a3
          fi
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Print Log Information
        run: |
          bash scripts/ci/npu/npu_log_print.sh
      - name: Run test
        timeout-minutes: 240
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          pip install sglang_router
          hf download lmms-lab/MMMU --repo-type dataset
          pip install sentence_transformers torchaudio==2.8.0
          pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
          pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
          pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.32 numpy==1.26.4 dotenv
          git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
          cd ./lmms-eval
          nohup pip install . > lmmslog.txt 2>&1 &
          sleep 120
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          cd ../
          cd test
          python3 run_suite.py --hw npu --suite nightly-1-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2
  nightly-2-npu-a3:
    needs: [set-image-config]
    if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
    runs-on: linux-aarch64-a3-2
    strategy:
      fail-fast: false
      matrix:
        part: [0]
    container:
      image: ${{ needs.set-image-config.outputs.image_a3 }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
            bash scripts/ci/npu/npu_ci_install_dependency.sh a3
          fi
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Print Log Information
        run: |
          bash scripts/ci/npu/npu_log_print.sh
      - name: Run test
        timeout-minutes: 240
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          pip install sglang_router
          hf download lmms-lab/MMMU --repo-type dataset
          pip install sentence_transformers torchaudio==2.8.0
          pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
          pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
          pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.32 numpy==1.26.4 dotenv
          git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
          cd ./lmms-eval
          nohup pip install . > lmmslog.txt 2>&1 &
          sleep 120
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          cd ../
          cd test
          python3 run_suite.py --hw npu --suite nightly-2-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 1
  nightly-4-npu-a3:
    needs: [set-image-config]
    if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
    runs-on: linux-aarch64-a3-4
    strategy:
      fail-fast: false
      matrix:
        part: [0]
    container:
      image: ${{ needs.set-image-config.outputs.image_a3 }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
            bash scripts/ci/npu/npu_ci_install_dependency.sh a3
          fi
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Print Log Information
        run: |
          bash scripts/ci/npu/npu_log_print.sh
      - name: Run test
        timeout-minutes: 240
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          pip install sglang_router
          hf download lmms-lab/MMMU --repo-type dataset
          pip install sentence_transformers torchaudio==2.8.0
          pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
          pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
          pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.32 numpy==1.26.4 dotenv
          git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
          cd ./lmms-eval
          nohup pip install . > lmmslog.txt 2>&1 &
          sleep 120
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          cd ../
          cd test
          python3 run_suite.py --hw npu --suite nightly-4-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 1
  nightly-8-npu-a3:
    needs: [set-image-config]
    if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
    runs-on: linux-aarch64-a3-8
    strategy:
      fail-fast: false
      matrix:
        part: [0]
    container:
      image: ${{ needs.set-image-config.outputs.image_a3 }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
            bash scripts/ci/npu/npu_ci_install_dependency.sh a3
          fi
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Print Log Information
        run: |
          bash scripts/ci/npu/npu_log_print.sh
      - name: Run test
        timeout-minutes: 240
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          pip install sglang_router
          hf download lmms-lab/MMMU --repo-type dataset
          pip install sentence_transformers torchaudio==2.8.0
          pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
          pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
          pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.32 numpy==1.26.4 dotenv
          git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
          cd ./lmms-eval
          nohup pip install . > lmmslog.txt 2>&1 &
          sleep 120
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          cd ../
          cd test
          python3 run_suite.py --hw npu --suite nightly-8-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 1
  nightly-16-npu-a3:
    needs: [set-image-config]
    if: ${{ (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') }}
    runs-on: linux-aarch64-a3-16
    strategy:
      fail-fast: false
      matrix:
        part: [0, 1]
    container:
      image: ${{ needs.set-image-config.outputs.image_a3 }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ needs.set-image-config.outputs.ref || github.ref }}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          if [ "${{ needs.set-image-config.outputs.skip_install_flag }}" != "true" ]; then
            bash scripts/ci/npu/npu_ci_install_dependency.sh a3
          fi
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Print Log Information
        run: |
          bash scripts/ci/npu/npu_log_print.sh
      - name: Run test
        timeout-minutes: 240
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          pip install sglang_router
          hf download lmms-lab/MMMU --repo-type dataset
          pip install sentence_transformers torchaudio==2.8.0
          pip install protobuf==6.31.1 zss pre-commit "wandb>=0.16.0" tenacity==8.3.0 loguru openpyxl latex2sympy2 zstandard transformers-stream-generator tqdm-multiprocess pycocoevalcap
          pip install yt-dlp sentencepiece==0.1.99 nltk av ftfy sqlitedict==2.1.0 "sacrebleu>=1.5.0" pytablewriter black==24.1.0 isort==5.13.2 "peft>=0.2.0" "accelerate>=0.29.1"
          pip install jsonlines httpx==0.25.0 "evaluate>=0.4.0" datasets==2.16.1 numexpr xgrammar==0.1.32 numpy==1.26.4 dotenv
          git clone --branch v0.3.3 --depth 1 https://github.com/EvolvingLMMs-Lab/lmms-eval.git
          cd ./lmms-eval
          nohup pip install . > lmmslog.txt 2>&1 &
          sleep 120
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          cd ../
          cd test
          python3 run_suite.py --hw npu --suite nightly-16-npu-a3 --nightly --continue-on-error --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2
  check-all-jobs:
    if: github.repository == 'sgl-project/sglang' && always()
    needs:
      - nightly-1-npu-a3
      - nightly-2-npu-a3
      - nightly-4-npu-a3
      - nightly-8-npu-a3
      - nightly-16-npu-a3
    runs-on: ubuntu-latest
    container:
      image: docker.m.daocloud.io/ubuntu:22.04
    steps:
      - name: Check if any job failed
        run: |
          if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then
            echo "One or more nightly test jobs failed"
            exit 1
          fi
          if [[ "${{ contains(needs.*.result, 'cancelled') }}" == "true" ]]; then
            echo "One or more nightly test jobs were cancelled"
            exit 1
          fi
          echo "All nightly test jobs passed"
--- a/third_party/sglang/.github/workflows/nightly-test-nvidia.yml
+++ b/third_party/sglang/.github/workflows/nightly-test-nvidia.yml
@@ -0,0 +1,796 @@
 name: Nightly Test (Nvidia)
 on:
  schedule:
    - cron: '0 0 * * *'
  workflow_dispatch:
    inputs:
      job_filter:
        description: 'Select which job to run (leave empty or "all" to run all jobs)'
        required: false
        type: choice
        default: 'all'
        options:
          - 'all'
          - 'nightly-test-general-1-gpu-h100'
          - 'nightly-test-general-4-gpu-h100'
          - 'nightly-test-general-8-gpu-h200'
          - 'nightly-test-general-8-gpu-h20'
          - 'nightly-test-general-8-gpu-b200'
          - 'nightly-test-text-accuracy-2-gpu-h100'
          - 'nightly-test-text-perf-2-gpu-h100'
          - 'nightly-test-vlm-accuracy-2-gpu-h100'
          - 'nightly-test-vlm-perf-2-gpu-h100'
          - 'nightly-test-multimodal-server-1-gpu'
          - 'nightly-test-multimodal-server-2-gpu'
          - 'nightly-test-perf-4-gpu-b200'
          - 'nightly-test-perf-8-gpu-b200'
          - 'nightly-test-specialized-8-gpu-b200'
          - 'nightly-test-kernel-1-gpu-h100'
          - 'nightly-test-diffusion-comparison'
          - 'nightly-test-kernel-8-gpu-h200'
  workflow_call:
    inputs:
      ref:
        description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.'
        required: false
        type: string
        default: ''
      job_filter:
        description: 'Select which job to run (leave empty or "all" to run all jobs)'
        required: false
        type: string
        default: 'all'
 concurrency:
  group: nightly-test-nvidia-${{ inputs.ref || github.ref }}
  cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
 env:
  SGLANG_IS_IN_CI: true
  SGLANG_CUDA_COREDUMP: "1"
  HF_HUB_DOWNLOAD_TIMEOUT: 300
  HF_HUB_ETAG_TIMEOUT: 300
 jobs:
  # General tests - 1 GPU
  nightly-test-general-1-gpu-h100:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-1-gpu-h100')
    runs-on: 1-gpu-h100
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run test
        timeout-minutes: 60
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-1-gpu --nightly --continue-on-error
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  # JIT kernel full unit tests (expanded parameter ranges via SGLANG_JIT_KERNEL_RUN_FULL_TESTS)
  nightly-test-kernel-1-gpu-h100:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-kernel-1-gpu-h100')
    runs-on: 1-gpu-h100
    timeout-minutes: 240
    env:
      # Full jit_kernel test grids (see sglang.jit_kernel.utils.should_run_full_tests)
      SGLANG_JIT_KERNEL_RUN_FULL_TESTS: "1"
      # Match pr-test-jit-kernel workflow for consistent JIT warmup behavior
      SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true
      # Allow maintenance bypass on default branch (same semantics as PR JIT workflow)
      SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run jit kernel nightly suite
        timeout-minutes: 60
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-kernel-1-gpu --nightly --continue-on-error
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  nightly-test-kernel-8-gpu-h200:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-kernel-8-gpu-h200')
    runs-on: 8-gpu-h200
    timeout-minutes: 240
    env:
      SGLANG_JIT_KERNEL_RUN_FULL_TESTS: "1"
      SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true
      SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run multi-GPU jit kernel nightly suite
        timeout-minutes: 90
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-kernel-8-gpu-h200 --nightly --continue-on-error
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  # General tests - 4 GPU H100
  nightly-test-general-4-gpu-h100:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-4-gpu-h100')
    runs-on: 4-gpu-h100
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run test
        timeout-minutes: 30
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-4-gpu --nightly --continue-on-error
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  # General tests - 8 GPU H200
  nightly-test-general-8-gpu-h200:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-h200')
    runs-on: 8-gpu-h200
    strategy:
      fail-fast: false
      matrix:
        partition: [0, 1, 2, 3]
    env:
      RUNNER_LABELS: 8-gpu-h200
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run common 8-GPU model tests
        if: always()
        timeout-minutes: 300
        env:
          TRACE_BASE_URL: https://raw.githubusercontent.com/sglang-bot/sglang-ci-data/main/traces/${{ github.run_id }}
          PERFETTO_RELAY_URL: ${{ vars.PERFETTO_RELAY_URL }}
          GPU_CONFIG: "8-gpu-h200"
          IS_H200: "1"
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=18000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=4
      - name: Publish traces to storage repo
        if: always()
        continue-on-error: true
        env:
          GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
          GITHUB_RUN_ID: ${{ github.run_id }}
          GITHUB_RUN_NUMBER: ${{ github.run_number }}
        run: |
          TRACE_ARGS=""
          for dir in test/performance_profiles_*/; do
            [ -d "$dir" ] && TRACE_ARGS="$TRACE_ARGS --traces-dir $dir"
          done
          if [ -n "$TRACE_ARGS" ]; then
            python3 scripts/ci/utils/publish_traces.py $TRACE_ARGS
            find test/performance_profiles_*/ -name '*.json.gz' -delete
          else
            echo "No trace directories found, skipping publish"
          fi
      - name: Run test
        timeout-minutes: 30
        env:
          GPU_CONFIG: "8-gpu-h200"
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-8-gpu-h200 --nightly --continue-on-error
      - name: Collect performance metrics
        if: always()
        run: |
          python3 scripts/ci/utils/save_metrics.py \
            --gpu-config 8-gpu-h200 \
            --partition ${{ matrix.partition }} \
            --run-id ${{ github.run_id }} \
            --output test/metrics-8gpu-h200-partition-${{ matrix.partition }}.json \
            --search-dir test/performance_profiles_8_gpu \
            --search-dir test
      - name: Upload partition metrics
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: metrics-8gpu-h200-partition-${{ matrix.partition }}
          path: test/metrics-8gpu-h200-partition-${{ matrix.partition }}.json
          retention-days: 5
          if-no-files-found: ignore
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
        with:
          artifact-suffix: ${{ matrix.partition }}
  # General tests - 8 GPU H20
  nightly-test-general-8-gpu-h20:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-h20')
    runs-on: 8-gpu-h20
    env:
      SGLANG_CI_RDMA_ALL_DEVICES: "mlx5_1,mlx5_2,mlx5_3,mlx5_4"
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run test
        timeout-minutes: 30
        env:
          GPU_CONFIG: "8-gpu-h20"
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-8-gpu-h20 --nightly --continue-on-error
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  # General tests - 8 GPU B200
  nightly-test-general-8-gpu-b200:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-b200')
    runs-on: 8-gpu-b200
    strategy:
      fail-fast: false
      matrix:
        partition: [0, 1, 2, 3]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run common 8-GPU model tests
        if: always()
        timeout-minutes: 300
        env:
          TRACE_BASE_URL: https://raw.githubusercontent.com/sglang-bot/sglang-ci-data/main/traces/${{ github.run_id }}
          PERFETTO_RELAY_URL: ${{ vars.PERFETTO_RELAY_URL }}
          GPU_CONFIG: "8-gpu-b200"
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=12000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=4
      - name: Publish traces to storage repo
        if: always()
        continue-on-error: true
        env:
          GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
          GITHUB_RUN_ID: ${{ github.run_id }}
          GITHUB_RUN_NUMBER: ${{ github.run_number }}
        run: |
          TRACE_ARGS=""
          for dir in test/performance_profiles_*/; do
            [ -d "$dir" ] && TRACE_ARGS="$TRACE_ARGS --traces-dir $dir"
          done
          if [ -n "$TRACE_ARGS" ]; then
            python3 scripts/ci/utils/publish_traces.py $TRACE_ARGS
            find test/performance_profiles_*/ -name '*.json.gz' -delete
          else
            echo "No trace directories found, skipping publish"
          fi
      - name: Collect performance metrics
        if: always()
        run: |
          python3 scripts/ci/utils/save_metrics.py \
            --gpu-config 8-gpu-b200 \
            --partition ${{ matrix.partition }} \
            --run-id ${{ github.run_id }} \
            --output test/metrics-8gpu-b200-partition-${{ matrix.partition }}.json \
            --search-dir test/performance_profiles_8_gpu \
            --search-dir test
      - name: Upload partition metrics
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: metrics-8gpu-b200-partition-${{ matrix.partition }}
          path: test/metrics-8gpu-b200-partition-${{ matrix.partition }}.json
          retention-days: 5
          if-no-files-found: ignore
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
        with:
          artifact-suffix: ${{ matrix.partition }}
  # Text model accuracy tests
  nightly-test-text-accuracy-2-gpu-h100:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-accuracy-2-gpu-h100')
    runs-on: 2-gpu-h100
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run eval test for text models
        timeout-minutes: 120
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-eval-text-2-gpu --nightly --continue-on-error --timeout-per-file 4500
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  # Text model performance tests
  nightly-test-text-perf-2-gpu-h100:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-perf-2-gpu-h100')
    runs-on: 2-gpu-h100
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run performance test for text models
        timeout-minutes: 180
        env:
          TRACE_BASE_URL: https://raw.githubusercontent.com/sglang-bot/sglang-ci-data/main/traces/${{ github.run_id }}
          PERFETTO_RELAY_URL: ${{ vars.PERFETTO_RELAY_URL }}
          GPU_CONFIG: "2-gpu-h100"
        run: |
          cd test
          rm -rf performance_profiles_text_models/
          python3 run_suite.py --hw cuda --suite nightly-perf-text-2-gpu --nightly --continue-on-error --timeout-per-file 3600
      - name: Publish traces to storage repo
        env:
          GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
          GITHUB_RUN_ID: ${{ github.run_id }}
          GITHUB_RUN_NUMBER: ${{ github.run_number }}
        run: |
          python3 scripts/ci/utils/publish_traces.py --traces-dir test/performance_profiles_text_models
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  # VLM accuracy tests
  nightly-test-vlm-accuracy-2-gpu-h100:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-accuracy-2-gpu-h100')
    runs-on: 2-gpu-h100
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run eval test for VLM models (fixed MMMU-100)
        timeout-minutes: 240
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-eval-vlm-2-gpu --nightly --continue-on-error --timeout-per-file 9000
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  # VLM performance tests
  nightly-test-vlm-perf-2-gpu-h100:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-perf-2-gpu-h100')
    runs-on: 2-gpu-h100
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run perf test for VLM models (MMMU)
        timeout-minutes: 240
        env:
          TRACE_BASE_URL: https://raw.githubusercontent.com/sglang-bot/sglang-ci-data/main/traces/${{ github.run_id }}
          PERFETTO_RELAY_URL: ${{ vars.PERFETTO_RELAY_URL }}
          GPU_CONFIG: "2-gpu-h100"
        run: |
          cd test
          rm -rf performance_profiles_vlms/
          python3 run_suite.py --hw cuda --suite nightly-perf-vlm-2-gpu --nightly --continue-on-error --timeout-per-file 3600
      - name: Publish traces to storage repo
        env:
          GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
          GITHUB_RUN_ID: ${{ github.run_id }}
          GITHUB_RUN_NUMBER: ${{ github.run_number }}
        run: |
          python3 scripts/ci/utils/publish_traces.py --traces-dir test/performance_profiles_vlms
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  # Diffusion performance tests
  nightly-test-multimodal-server-1-gpu:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-multimodal-server-1-gpu')
    runs-on: 1-gpu-h100
    strategy:
      fail-fast: false
      max-parallel: 5
      matrix:
        part: [0, 1]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh diffusion
          pip install slack_sdk
      - name: Run diffusion server tests
        env:
          SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }}
          GITHUB_RUN_ID: ${{ github.run_id }}
          GPU_CONFIG: "1-gpu-h100"
        timeout-minutes: 90
        run: |
          cd python
          python3 sglang/multimodal_gen/test/run_suite.py \
            --suite 1-gpu \
            --partition-id ${{ matrix.part }} \
            --total-partitions 2
      - name: Collect diffusion performance metrics
        if: always()
        run: |
          python3 scripts/ci/utils/diffusion/save_diffusion_metrics.py \
            --gpu-config 1-gpu-h100 \
            --run-id ${{ github.run_id }} \
            --output python/diffusion-metrics-1gpu-partition-${{ matrix.part }}.json \
            --results-json python/diffusion-results.json
      - name: Upload diffusion metrics
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: diffusion-metrics-1gpu-partition-${{ matrix.part }}
          path: python/diffusion-metrics-1gpu-partition-${{ matrix.part }}.json
          retention-days: 90
          if-no-files-found: ignore
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
        with:
          artifact-suffix: ${{ matrix.part }}
  nightly-test-multimodal-server-2-gpu:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-multimodal-server-2-gpu')
    runs-on: 2-gpu-h100
    strategy:
      fail-fast: false
      max-parallel: 5
      matrix:
        part: [0, 1]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh diffusion
          pip install slack_sdk
      - name: Run diffusion server tests
        env:
          SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }}
          GITHUB_RUN_ID: ${{ github.run_id }}
          GPU_CONFIG: "2-gpu-h100"
        timeout-minutes: 90
        run: |
          cd python
          python3 sglang/multimodal_gen/test/run_suite.py \
            --suite 2-gpu \
            --partition-id ${{ matrix.part }} \
            --total-partitions 2
      - name: Collect diffusion performance metrics
        if: always()
        run: |
          python3 scripts/ci/utils/diffusion/save_diffusion_metrics.py \
            --gpu-config 2-gpu-h100 \
            --run-id ${{ github.run_id }} \
            --output python/diffusion-metrics-2gpu-partition-${{ matrix.part }}.json \
            --results-json python/diffusion-results.json
      - name: Upload diffusion metrics
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: diffusion-metrics-2gpu-partition-${{ matrix.part }}
          path: python/diffusion-metrics-2gpu-partition-${{ matrix.part }}.json
          retention-days: 90
          if-no-files-found: ignore
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
        with:
          artifact-suffix: ${{ matrix.part }}
  # B200 Performance tests - 4 GPU
  nightly-test-perf-4-gpu-b200:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-perf-4-gpu-b200')
    runs-on: 4-gpu-b200
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run test
        timeout-minutes: 300
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-4-gpu-b200 --nightly --continue-on-error --timeout-per-file 12000
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  # Specialized B200 tests - 8 GPU, for specific backends and configs
  nightly-test-specialized-8-gpu-b200:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-perf-8-gpu-b200' || inputs.job_filter == 'nightly-test-specialized-8-gpu-b200')
    runs-on: 8-gpu-b200
    env:
      RUNNER_LABELS: 8-gpu-b200
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run test
        timeout-minutes: 120
        env:
          GPU_CONFIG: "8-gpu-b200"
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite nightly-8-gpu-b200 --nightly --continue-on-error --timeout-per-file 2400
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  # Diffusion cross-framework comparison
  nightly-test-diffusion-comparison:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-diffusion-comparison')
    runs-on: 4-gpu-h100
    timeout-minutes: 240
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Run cross-framework comparison
        env:
          GITHUB_SHA: ${{ github.sha }}
          GITHUB_RUN_ID: ${{ github.run_id }}
          PYTHONUNBUFFERED: "1"
        timeout-minutes: 210
        run: |
          python3 -u scripts/ci/utils/diffusion/run_comparison.py \
            --output comparison-results.json
      - name: Generate dashboard
        if: always()
        env:
          GH_PAT_FOR_NIGHTLY_CI_DATA: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
          GH_TOKEN: ${{ github.token }}
        run: |
          python3 scripts/ci/utils/diffusion/generate_diffusion_dashboard.py \
            --results comparison-results.json \
            --output dashboard.md \
            --charts-dir comparison-charts \
            --fetch-history \
            --step-summary
      - name: Publish to sglang-ci-data
        if: always()
        env:
          GH_PAT_FOR_NIGHTLY_CI_DATA: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
        run: |
          python3 scripts/ci/utils/diffusion/publish_comparison_results.py \
            --results comparison-results.json \
            --dashboard dashboard.md \
            --charts-dir comparison-charts
      - name: Upload comparison artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: diffusion-comparison-${{ github.run_id }}
          path: |
            comparison-results.json
            dashboard.md
            comparison-charts/
            comparison-logs/
          retention-days: 90
          if-no-files-found: ignore
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  # Consolidate performance metrics from all jobs
  consolidate-metrics:
    if: github.repository == 'sgl-project/sglang' && always()
    needs:
      - nightly-test-general-8-gpu-h200
      - nightly-test-general-8-gpu-b200
      - nightly-test-multimodal-server-1-gpu
      - nightly-test-multimodal-server-2-gpu
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Download all partition metrics
        uses: actions/download-artifact@v4
        with:
          pattern: "*metrics-*"
          path: metrics/
          merge-multiple: true
      - name: List downloaded metrics
        run: |
          echo "Downloaded metrics files:"
          find metrics/ -name "*.json" -type f 2>/dev/null || echo "No metrics files found"
      - name: Merge metrics
        run: |
          python3 scripts/ci/utils/merge_metrics.py \
            --input-dir metrics/ \
            --output consolidated-metrics-${{ github.run_id }}.json \
            --run-id ${{ github.run_id }} \
            --commit-sha ${{ github.sha }} \
            --branch ${{ github.ref_name }}
      - name: Upload consolidated metrics
        uses: actions/upload-artifact@v4
        with:
          name: consolidated-metrics-${{ github.run_id }}
          path: consolidated-metrics-${{ github.run_id }}.json
          retention-days: 90
          if-no-files-found: warn
  # Final check job
  check-all-jobs:
    if: github.repository == 'sgl-project/sglang' && always()
    needs:
      - nightly-test-general-1-gpu-h100
      - nightly-test-general-4-gpu-h100
      - nightly-test-general-8-gpu-h200
      - nightly-test-general-8-gpu-h20
      - nightly-test-general-8-gpu-b200
      - nightly-test-text-accuracy-2-gpu-h100
      - nightly-test-text-perf-2-gpu-h100
      - nightly-test-vlm-accuracy-2-gpu-h100
      - nightly-test-vlm-perf-2-gpu-h100
      - nightly-test-multimodal-server-1-gpu
      - nightly-test-multimodal-server-2-gpu
      - nightly-test-perf-4-gpu-b200
      - nightly-test-specialized-8-gpu-b200
      - nightly-test-diffusion-comparison
      - consolidate-metrics
    runs-on: ubuntu-latest
    steps:
      - name: Check if any job failed
        run: |
          if [[ "${{ contains(needs.*.result, 'failure') }}" == "true" ]]; then
            echo "One or more nightly test jobs failed"
            exit 1
          fi
          if [[ "${{ contains(needs.*.result, 'cancelled') }}" == "true" ]]; then
            echo "One or more nightly test jobs were cancelled"
            exit 1
          fi
          echo "All nightly test jobs passed"
--- a/third_party/sglang/.github/workflows/open-pr-copy-from-oss.yml
+++ b/third_party/sglang/.github/workflows/open-pr-copy-from-oss.yml
@@ -0,0 +1,28 @@
 name: Open A PR to Copy Code From OSS
 on:
  workflow_dispatch:
  # schedule:
  #   - cron: '0 10 * * *'
 permissions:
  contents: write
 jobs:
  copy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          ref: 'main'
      - name: Install GitHub CLI (if not present)
        run: |
          bash scripts/code_sync/install_github_cli.sh
      - name: Copy from OSS code
        env:
          GH_TOKEN: ${{ secrets.GH_PAT_FOR_OPEN_PR_TO_PRIVATE }}
        run: |
          python3 scripts/code_sync/copy_from_oss.py
--- a/third_party/sglang/.github/workflows/open-pr-copy-to-oss.yml
+++ b/third_party/sglang/.github/workflows/open-pr-copy-to-oss.yml
@@ -0,0 +1,31 @@
 name: Open A PR to Copy Diff To OSS
 on:
  workflow_dispatch:
    inputs:
      commit_sha:
        description: 'The commit SHA to copy. Defaults to LAST to copy the latest commit.'
        required: false
        default: 'LAST'
 permissions:
  contents: write
 jobs:
  copy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Install GitHub CLI (if not present)
        run: |
          bash scripts/code_sync/install_github_cli.sh
      - name: Copy to OSS code
        env:
          GH_TOKEN: ${{ secrets.GH_PAT_FOR_OPEN_PR_TO_OSS }}
        run: |
          python3 scripts/code_sync/copy_to_oss.py --commit "${{ github.event.inputs.commit_sha }}"
--- a/third_party/sglang/.github/workflows/patch-docker-dev.yml
+++ b/third_party/sglang/.github/workflows/patch-docker-dev.yml
@@ -0,0 +1,115 @@
 name: Patch Docker Image
 on:
  workflow_dispatch:
    inputs:
      pr_numbers:
        description: "Comma-separated PR numbers to apply (e.g. 18962,19010)"
        required: false
        default: ""
      image_tag:
        description: "Base image tag to patch (e.g. dev-x86, dev-x86-cu13)"
        required: true
 concurrency:
  group: patch-docker-${{ inputs.image_tag }}
  cancel-in-progress: true
 jobs:
  patch:
    if: github.repository == 'sgl-project/sglang'
    runs-on: x64-docker-build-node
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Pull base image and extract commit
        run: |
          IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
          docker pull "${IMAGE}"
          if BASE_SHA=$(docker run --rm "${IMAGE}" git -C /sgl-workspace/sglang rev-parse HEAD 2>/dev/null); then
            echo "Image built from commit: ${BASE_SHA}"
          else
            BASE_SHA=""
            echo "::warning::Image has no .git directory — cannot extract base commit"
          fi
          echo "BASE_SHA=${BASE_SHA}" >> "$GITHUB_ENV"
      - name: Generate patches
        run: |
          git config --global --add safe.directory "$GITHUB_WORKSPACE"
          git fetch origin main
          mkdir -p /tmp/patch-ctx
          if [ -n "${{ inputs.pr_numbers }}" ]; then
            IFS=',' read -ra PRS <<< "${{ inputs.pr_numbers }}"
            for pr in "${PRS[@]}"; do
              pr=$(echo "${pr}" | xargs)
              echo "Fetching PR #${pr}"
              git fetch origin "pull/${pr}/head:pr-${pr}"
              MERGE_BASE=$(git merge-base origin/main "pr-${pr}")
              echo "  PR #${pr}: merge-base=${MERGE_BASE}"
              git diff "${MERGE_BASE}..pr-${pr}" > "/tmp/patch-ctx/${pr}.patch"
              echo "  PR #${pr}: $(wc -l < "/tmp/patch-ctx/${pr}.patch") lines"
            done
          elif [ -n "${BASE_SHA}" ]; then
            echo "Generating diff: image ${BASE_SHA} → latest main"
            git fetch origin "${BASE_SHA}"
            git diff "${BASE_SHA}..origin/main" > /tmp/patch-ctx/main.patch
            echo "  main: $(wc -l < /tmp/patch-ctx/main.patch) lines"
          else
            echo "::error::No PR numbers specified and image has no .git — cannot generate diff against main"
            exit 1
          fi
          TOTAL=$(cat /tmp/patch-ctx/*.patch | wc -l)
          if [ "${TOTAL}" -eq 0 ]; then
            echo "::warning::All patches are empty — image is already up to date"
            echo "SKIP_BUILD=true" >> "$GITHUB_ENV"
          fi
      - name: Build patched image
        if: env.SKIP_BUILD != 'true'
        run: |
          IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
          cat <<'DOCKERFILE' > /tmp/patch-ctx/Dockerfile
          ARG BASE_IMAGE
          FROM ${BASE_IMAGE}
          COPY *.patch /tmp/patches/
          RUN cd /sgl-workspace/sglang \
              && for p in /tmp/patches/*.patch; do \
                   if [ ! -s "${p}" ]; then \
                     echo "Skipping ${p} (empty)"; \
                   else \
                     echo "Applying ${p}..." \
                     && patch -p1 --fuzz=2 --no-backup-if-mismatch -f < "${p}" \
                     || { echo "ERROR: Failed to apply ${p}"; exit 1; }; \
                   fi; \
                 done \
              && rm -rf /tmp/patches
          DOCKERFILE
          docker build \
            --no-cache \
            --build-arg BASE_IMAGE="${IMAGE}" \
            -t "${IMAGE}" \
            /tmp/patch-ctx/
      - name: Push patched image
        if: env.SKIP_BUILD != 'true'
        run: |
          IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
          docker push "${IMAGE}"
          echo "### Patched \`${IMAGE}\`" >> "$GITHUB_STEP_SUMMARY"
          echo "- **Base commit:** \`${BASE_SHA:-unknown (no .git)}\`" >> "$GITHUB_STEP_SUMMARY"
          echo "- **Source:** ${{ inputs.pr_numbers && format('PRs: {0}', inputs.pr_numbers) || 'latest main' }}" >> "$GITHUB_STEP_SUMMARY"
--- a/third_party/sglang/.github/workflows/pr-benchmark-rust.yml
+++ b/third_party/sglang/.github/workflows/pr-benchmark-rust.yml
@@ -0,0 +1,198 @@
 name: PR Benchmark (SMG Components)
 on:
  push:
    branches: [ main ]
    paths:
      - "sgl-model-gateway/**"
  pull_request:
    branches: [ main ]
    paths:
      - "sgl-model-gateway/**"
  workflow_dispatch:
 concurrency:
  group: pr-benchmark-rust-${{ github.ref }}
  cancel-in-progress: true
 env:
  RUSTC_WRAPPER: sccache
  SCCACHE_GHA_ENABLED: "true"
 permissions:
  contents: read
  pull-requests: write
  issues: write
 jobs:
  benchmark-compile-check:
    name: Benchmark Compilation Check
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_gateway_dependencies.sh
      - name: Configure sccache
        uses: mozilla-actions/sccache-action@v0.0.9
        with:
          version: "v0.12.0"
          disable_annotations: true
      - name: Rust cache
        uses: Swatinem/rust-cache@v2
        with:
          workspaces: sgl-model-gateway
          shared-key: "rust-cache"
          save-if: true
          cache-all-crates: true
          cache-on-failure: true
      - name: Check benchmarks compile
        run: |
          source "$HOME/.cargo/env"
          cd sgl-model-gateway/
          cargo check --benches
      - name: Show sccache stats
        if: always()
        run: sccache --show-stats
  benchmark:
    name: Benchmark - ${{ matrix.name }}
    if: |
      github.repository == 'sgl-project/sglang' &&
      (github.event_name == 'push' ||
       github.event_name == 'workflow_dispatch' ||
       (contains(github.event.pull_request.labels.*.name, 'router-benchmark') &&
        contains(github.event.pull_request.labels.*.name, 'run-ci')))
    strategy:
      fail-fast: false
      matrix:
        include:
          - name: Request Processing
            bench_name: request_processing
            bench_args: "benchmark_summary --exact"
            runner: ubuntu-latest
            sccache_version: "v0.12.0"
            artifact_name: request-processing-results
            artifact_path: criterion/benchmark_summary/
          - name: Manual Policy
            bench_name: manual_policy_benchmark
            bench_args: ""
            runner: ubuntu-latest
            sccache_version: "v0.12.0"
            artifact_name: manual-policy-results
            artifact_path: criterion/manual_policy*/
    runs-on: ${{ matrix.runner }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 100
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_gateway_dependencies.sh
      - name: Configure sccache
        uses: mozilla-actions/sccache-action@v0.0.9
        with:
          version: ${{ matrix.sccache_version }}
          disable_annotations: true
      - name: Rust cache
        uses: Swatinem/rust-cache@v2
        with:
          workspaces: sgl-model-gateway
          shared-key: "rust-cache"
          cache-all-crates: true
          cache-on-failure: true
          save-if: true
      - name: Run benchmark
        timeout-minutes: 30
        run: |
          source "$HOME/.cargo/env"
          cd sgl-model-gateway/
          if command -v sccache &> /dev/null; then
            echo "Testing sccache availability..."
            export RUSTC_WRAPPER=sccache
            export SCCACHE_GHA_ENABLED="true"
            if sccache --start-server 2>/dev/null && sccache --show-stats 2>/dev/null; then
              echo "sccache is working, using it for compilation"
            else
              echo "sccache failed to start, falling back to regular cargo"
              unset RUSTC_WRAPPER
              unset SCCACHE_GHA_ENABLED
            fi
          else
            echo "sccache not available, using regular cargo"
          fi
          cargo bench --bench ${{ matrix.bench_name }} -- ${{ matrix.bench_args }} 2>&1 | tee benchmark_output.txt
      - name: Upload benchmark results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: ${{ matrix.artifact_name }}-${{ github.sha }}
          path: |
            sgl-model-gateway/target/${{ matrix.artifact_path }}
            sgl-model-gateway/benchmark_output.txt
          retention-days: 30
      - name: Show sccache stats
        if: always()
        run: sccache --show-stats
  benchmark-summary:
    name: Benchmark Summary
    needs: [benchmark]
    if: always() && (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request')
    runs-on: ubuntu-latest
    steps:
      - name: Download all benchmark results
        uses: actions/download-artifact@v4
        with:
          pattern: '*-results-${{ github.sha }}'
          path: benchmark-results
      - name: Generate summary
        run: |
          generate_section() {
            local title="$1" dir_name="$2" lines="${3:-100}"
            local dir="benchmark-results/${dir_name}-${{ github.sha }}"
            echo "### $title" >> summary.md
            if [ -d "$dir" ]; then
              echo "✅ **Completed**" >> summary.md
              if [ -f "$dir/benchmark_output.txt" ]; then
                echo -e "\n<details>\n<summary>View Results</summary>\n\n\`\`\`" >> summary.md
                tail -n "$lines" "$dir/benchmark_output.txt" >> summary.md
                echo -e "\`\`\`\n</details>" >> summary.md
              fi
            else
              echo "❌ Failed or skipped" >> summary.md
            fi
            echo "" >> summary.md
          }
          echo "## 🚀 Benchmark Results Summary" > summary.md
          echo "" >> summary.md
          generate_section "Request Processing" "request-processing-results" 60
          generate_section "Manual Policy (Sticky Sessions)" "manual-policy-results" 100
          echo -e "---\n_Generated at $(date -u '+%Y-%m-%d %H:%M:%S UTC')_" >> summary.md
          cat summary.md
          cat summary.md >> "$GITHUB_STEP_SUMMARY"
      - name: Upload summary
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-summary-${{ github.sha }}
          path: summary.md
          retention-days: 30
--- a/third_party/sglang/.github/workflows/pr-gate.yml
+++ b/third_party/sglang/.github/workflows/pr-gate.yml
@@ -0,0 +1,254 @@
 on:
  workflow_call:
    inputs:
      require-run-ci:
        description: "Whether the PR must have the run-ci label"
        type: boolean
        default: true
      cool-down-minutes:
        description: "Cooldown period in minutes for low-permission users; 0 disables rate limiting"
        type: number
        default: 120
 jobs:
  pr-gate:
    # 1. for commits on main: no gating needed
    # 2. for workflow_dispatch: this can only be triggered by users with write access
    runs-on: ubuntu-latest
    steps:
      - name: Fetch latest PR info
        if: github.event_name == 'pull_request'
        id: pr
        uses: actions/github-script@v7
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            const pr = await github.rest.pulls.get({
              owner: context.repo.owner,
              repo: context.repo.repo,
              pull_number: context.issue.number
            });
            core.setOutput("labels", JSON.stringify(pr.data.labels.map(l => l.name)));
            core.setOutput("draft", pr.data.draft);
            core.setOutput("user", pr.data.user.login);
      - name: Log PR info
        if: github.event_name == 'pull_request'
        run: |
          echo "===== PR Info ====="
          echo "PR Event: ${{ github.event_name }}"
          echo "PR Labels: ${{ steps.pr.outputs.labels }}"
          echo "PR Draft: ${{ steps.pr.outputs.draft }}"
          echo "PR User: ${{ steps.pr.outputs.user }}"
          echo "Require run-ci: ${{ inputs.require-run-ci }}"
          echo "Cool down minutes: ${{ inputs.cool-down-minutes }}"
          echo "==================="
      - name: Block draft PR
        if: github.event_name == 'pull_request' && fromJson(steps.pr.outputs.draft)
        run: |
          echo "PR is draft. Blocking CI."
          exit 1
      - name: Require run-ci label (optional)
        if: github.event_name == 'pull_request' && inputs.require-run-ci == true
        run: |
          if [[ "${{ contains(fromJson(steps.pr.outputs.labels), 'run-ci') }}" == "false" ]]; then
            echo "Missing required label 'run-ci'. See https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests for more details."
            exit 1
          fi
      - name: Enforce rate limit for low-permission actors (optional)
        if: github.event_name == 'pull_request' && inputs.cool-down-minutes > 0
        uses: actions/github-script@v7
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            const DEFAULT_MINUTES = Number("${{ inputs.cool-down-minutes }}");
            const owner = context.repo.owner;
            const repo = context.repo.repo;
            const eventName = context.eventName;
            const curRun = await github.rest.actions.getWorkflowRun({
              owner, repo, run_id: context.runId
            });
            let triggeringActor = curRun.data.triggering_actor?.login || context.actor;
            if (triggeringActor === "github-actions[bot]") {
              triggeringActor = `${{ steps.pr.outputs.user }}`;
              core.info(
                `triggering_actor is github-actions[bot]; substituting PR author '${triggeringActor}'.`
              );
            }
            async function hasHighPermission(username) {
              try {
                const { data } = await github.rest.repos.getCollaboratorPermissionLevel({ owner, repo, username });
                const perm = data.permission || 'none';
                return perm === 'write' || perm === 'maintain' || perm === 'admin';
              } catch (e) {
                if (e.status === 404 || e.status === 403) return false;
                throw e;
              }
            }
            if (await hasHighPermission(triggeringActor)) {
              core.info(`Triggering user '${triggeringActor}' has high permission. No rate limit applied.`);
              return;
            }
            let effectiveCooldownMinutes = DEFAULT_MINUTES;
            let perUserCooldownMinutes = null;
            try {
              const contentResp = await github.rest.repos.getContent({
                owner,
                repo,
                path: ".github/CI_PERMISSIONS.json",
                ref: "main",
              });
              if (!Array.isArray(contentResp.data) && contentResp.data && "content" in contentResp.data) {
                const raw = Buffer.from(
                  contentResp.data.content,
                  contentResp.data.encoding || "base64"
                ).toString();
                const ciPermissions = JSON.parse(raw);
                const userPerm = ciPermissions[triggeringActor];
                if (userPerm && typeof userPerm.cooldown_interval_minutes === "number") {
                  perUserCooldownMinutes = userPerm.cooldown_interval_minutes;
                  core.info(
                    `Per-user cooldown for '${triggeringActor}' from CI_PERMISSIONS.json: ${perUserCooldownMinutes} minutes.`
                  );
                } else {
                  core.info(`No per-user cooldown found for '${triggeringActor}' in CI_PERMISSIONS.json.`);
                }
              } else {
                core.info("CI_PERMISSIONS.json content response is not a file; skipping per-user cooldown.");
              }
            } catch (e) {
              core.info(`CI_PERMISSIONS.json not found or unreadable: ${e.message}. Using default rate limit only.`);
            }
            if (perUserCooldownMinutes !== null) {
              effectiveCooldownMinutes = Math.min(effectiveCooldownMinutes, perUserCooldownMinutes);
            }
            if (effectiveCooldownMinutes <= 0) {
              core.info(
                `Effective cooldown for '${triggeringActor}' is 0 minutes; no rate limit enforced for this user.`
              );
              return;
            }
            const cutoff = new Date(Date.now() - effectiveCooldownMinutes * 60 * 1000);
            core.info(
              `Checking for workflow runs since ${cutoff.toISOString()} (last ${effectiveCooldownMinutes} minutes) for event '${eventName}'.`
            );
            const { data } = await github.rest.actions.listWorkflowRuns({
              owner,
              repo,
              workflow_id: 'pr-test.yml',
              event: eventName,
              per_page: 100,
            });
            const runs = data.workflow_runs || [];
            // Rate Limiting Logic:
            // We only count workflow runs that actually consumed CI resources (i.e., passed the gate).
            // A run "passes the gate" if any jobs beyond the gate jobs (check-changes, pr-gate, call-gate, pr-test-finish)
            // actually executed (not skipped/cancelled). This prevents scenarios where:
            // - User has PR A with missing 'run-ci' label (fails at gate)
            // - User opens PR B with 'run-ci' label
            // - PR B should be able to run even though PR A triggered a run recently
            // Helper function to check if a run passed the gate (i.e., actually consumed CI resources)
            async function didRunPassGate(run) {
              try {
                // Note: Fetching up to 100 jobs (API maximum). If a workflow has >100 jobs,
                // we may miss some, but this is unlikely in practice.
                const { data: jobsData } = await github.rest.actions.listJobsForWorkflowRun({
                  owner, repo, run_id: run.id, per_page: 100
                });
                const jobs = jobsData.jobs || [];
                // If no jobs exist yet, the run hasn't started consuming resources
                if (jobs.length === 0) {
                  core.info(`Run ${run.id} has no jobs yet; not counting against rate limit.`);
                  return false;
                }
                // Gate jobs that don't consume significant CI resources
                const gateJobs = ['check-changes', 'pr-gate', 'call-gate', 'pr-test-finish'];
                const jobsBeyondGate = jobs.filter(j => !gateJobs.some(g => j.name === g || j.name.startsWith(g + ' ')));
                // A job "ran" if it reached a terminal conclusion state that indicates actual execution
                const ranStates = ['success', 'failure', 'timed_out', 'action_required'];
                const hasJobsThatRan = jobsBeyondGate.some(j => j.conclusion && ranStates.includes(j.conclusion));
                return hasJobsThatRan;
              } catch (e) {
                core.warning(`Could not check jobs for run ${run.id}: ${e.message}`);
                // If it's a rate limit error, count it conservatively to prevent abuse
                if (e.status === 429) {
                  core.warning(`Hit rate limit checking run ${run.id}; counting it to be safe.`);
                  return true;
                }
                // Cancelled or skipped runs most likely did not consume resources
                if (run.conclusion === 'cancelled' || run.conclusion === 'skipped') {
                  return false;
                }
                // Default to counting it to prevent abuse
                return true;
              }
            }
            // Limit the number of runs we'll check in detail to avoid API rate limits
            const MAX_RUNS_TO_CHECK = 5;
            let runsChecked = 0;
            let runsSkippedAtGate = 0;
            let recentFound = null;
            for (const run of runs) {
              if (String(run.id) === String(context.runId)) continue;
              if (new Date(run.created_at) < cutoff) continue;
              const isUserRun = (run.actor?.login === triggeringActor) || (run.triggering_actor?.login === triggeringActor);
              if (!isUserRun) continue;
              runsChecked++;
              core.info(`Checking run ${run.id} (created: ${run.created_at}, conclusion: ${run.conclusion})`);
              // Safety limit: if we've checked too many runs, assume the next one passed to be conservative
              if (runsChecked > MAX_RUNS_TO_CHECK) {
                core.warning(`Checked ${MAX_RUNS_TO_CHECK} runs; assuming this one passed gate to avoid API limits.`);
                recentFound = run;
                break;
              }
              // Only count runs that actually passed the gate and consumed CI resources
              if (await didRunPassGate(run)) {
                recentFound = run;
                core.info(`Found recent run ${run.id} that passed gate.`);
                break;
              } else {
                runsSkippedAtGate++;
                core.info(`Run ${run.id} failed at gate; not counting against rate limit.`);
              }
            }
            core.info(`Rate limit check summary: checked ${runsChecked} runs, ${runsSkippedAtGate} failed at gate.`);
            if (recentFound) {
              core.setFailed(
                `User '${triggeringActor}' already triggered '${context.workflow}' via '${eventName}' at ${recentFound.created_at}. ` +
                `Please wait ${effectiveCooldownMinutes} minutes before triggering again.`
              );
            } else {
              core.info(
                `No recent runs detected for '${triggeringActor}' within the last ${effectiveCooldownMinutes} minutes; proceeding.`
              );
            }
--- a/third_party/sglang/.github/workflows/pr-test-amd-rocm720.yml
+++ b/third_party/sglang/.github/workflows/pr-test-amd-rocm720.yml
--- a/third_party/sglang/.github/workflows/pr-test-amd.yml
+++ b/third_party/sglang/.github/workflows/pr-test-amd.yml
--- a/third_party/sglang/.github/workflows/pr-test-jit-kernel.yml
+++ b/third_party/sglang/.github/workflows/pr-test-jit-kernel.yml
@@ -0,0 +1,117 @@
 name: PR Test - JIT Kernel
 on:
  workflow_call:
    inputs:
      jit_kernel:
        required: true
        type: string
      pr_head_sha:
        required: false
        type: string
        default: ''
      git_ref:
        required: false
        type: string
        default: ''
      target_stage:
        required: false
        type: string
        default: ''
      test_parallel_dispatch:
        required: false
        type: string
        default: 'false'
      skip_stage_health_check:
        required: false
        type: boolean
        default: false
 # Workflow-level env is NOT inherited from the caller in reusable workflows (verified by CI test).
 # The github context (including github.event_name) IS inherited from the caller.
 env:
  SGLANG_IS_IN_CI: true
  SGLANG_CUDA_COREDUMP: "1"
  SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true
  SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }}
  SKIP_STAGE_HEALTH_CHECK: ${{ inputs.skip_stage_health_check == true && 'true' || 'false' }}
 jobs:
  jit-kernel-unit-test:
    if: |
      github.event_name != 'schedule' &&
      inputs.test_parallel_dispatch != 'true' &&
      !inputs.target_stage
    runs-on: 1-gpu-h100
    timeout-minutes: 240
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
      - uses: ./.github/actions/check-stage-health
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Run test
        timeout-minutes: 30
        run: |
          cd test/
          python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-1-gpu-large
  jit-kernel-multigpu-unit-test:
    if: |
      github.event_name != 'schedule' &&
      inputs.test_parallel_dispatch != 'true' &&
      !inputs.target_stage
    runs-on: 8-gpu-h200
    timeout-minutes: 240
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Run multi-GPU test
        timeout-minutes: 45
        run: |
          cd test/
          python3 run_suite.py --hw cuda --suite stage-b-kernel-unit-8-gpu-h200
  jit-kernel-benchmark-test:
    if: |
      github.event_name != 'schedule' &&
      inputs.test_parallel_dispatch != 'true' &&
      !inputs.target_stage
    runs-on: 1-gpu-h100
    timeout-minutes: 240
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
      - uses: ./.github/actions/check-stage-health
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Run benchmark tests
        timeout-minutes: 45
        run: |
          cd test/
          python3 run_suite.py --hw cuda --suite stage-b-kernel-benchmark-1-gpu-large
--- a/third_party/sglang/.github/workflows/pr-test-multimodal-gen.yml
+++ b/third_party/sglang/.github/workflows/pr-test-multimodal-gen.yml
@@ -0,0 +1,245 @@
 name: PR Test - Multimodal Gen
 on:
  workflow_call:
    inputs:
      multimodal_gen:
        required: true
        type: string
      sgl_kernel:
        required: true
        type: string
      b200_runner:
        required: true
        type: string
      continue_on_error:
        required: false
        type: string
        default: 'false'
      pr_head_sha:
        required: false
        type: string
        default: ''
      git_ref:
        required: false
        type: string
        default: ''
      target_stage:
        required: false
        type: string
        default: ''
      test_parallel_dispatch:
        required: false
        type: string
        default: 'false'
      caller_needs_failure:
        required: false
        type: string
        default: 'false'
      skip_stage_health_check:
        required: false
        type: string
        default: 'false'
 # Workflow-level env is NOT inherited from the caller in reusable workflows.
 # The github context (including github.event_name) IS inherited from the caller.
 env:
  SGLANG_IS_IN_CI: true
  SGLANG_CUDA_COREDUMP: "1"
  SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }}
  SKIP_STAGE_HEALTH_CHECK: ${{ inputs.skip_stage_health_check == 'true' }}
 jobs:
  multimodal-gen-test-1-gpu:
    if: |
      (inputs.target_stage == 'multimodal-gen-test-1-gpu') ||
      (
        !inputs.target_stage &&
        ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) &&
        inputs.multimodal_gen == 'true'
      )
    runs-on: 1-gpu-h100
    timeout-minutes: 240
    strategy:
      fail-fast: false
      matrix:
        part: [0, 1]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
      - uses: ./.github/actions/check-stage-health
      - uses: ./.github/actions/check-maintenance
      - name: Download artifacts
        if: inputs.sgl_kernel == 'true'
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-python3.10-cuda12.9
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Run diffusion server tests
        timeout-minutes: 240
        env:
          RUNAI_STREAMER_MEMORY_LIMIT: 0
          CONTINUE_ON_ERROR_FLAG: ${{ inputs.continue_on_error == 'true' && '--continue-on-error' || '' }}
        run: |
          cd python
          python3 sglang/multimodal_gen/test/run_suite.py \
            --suite 1-gpu \
            --partition-id ${{ matrix.part }} \
            --total-partitions 2 \
            $CONTINUE_ON_ERROR_FLAG
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
        with:
          artifact-suffix: ${{ matrix.part }}
  multimodal-gen-test-2-gpu:
    if: |
      (inputs.target_stage == 'multimodal-gen-test-2-gpu') ||
      (
        !inputs.target_stage &&
        ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) &&
        inputs.multimodal_gen == 'true'
      )
    runs-on: 2-gpu-h100
    timeout-minutes: 240
    strategy:
      fail-fast: false
      matrix:
        part: [0, 1]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
      - uses: ./.github/actions/check-stage-health
      - uses: ./.github/actions/check-maintenance
      - name: Download artifacts
        if: inputs.sgl_kernel == 'true'
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-python3.10-cuda12.9
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Run diffusion server tests
        timeout-minutes: 240
        env:
          RUNAI_STREAMER_MEMORY_LIMIT: 0
          CONTINUE_ON_ERROR_FLAG: ${{ inputs.continue_on_error == 'true' && '--continue-on-error' || '' }}
        run: |
          cd python
          python3 sglang/multimodal_gen/test/run_suite.py \
            --suite 2-gpu \
            --partition-id ${{ matrix.part }} \
            --total-partitions 2 \
            $CONTINUE_ON_ERROR_FLAG
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
        with:
          artifact-suffix: ${{ matrix.part }}
  multimodal-gen-test-1-b200:
    if: |
      (inputs.target_stage == 'multimodal-gen-test-1-b200') ||
      (
        !inputs.target_stage &&
        ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) &&
        inputs.multimodal_gen == 'true'
      )
    runs-on: ${{ inputs.b200_runner }}
    timeout-minutes: 240
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
      - uses: ./.github/actions/check-maintenance
      - name: Download artifacts
        if: inputs.sgl_kernel == 'true'
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-python3.10-cuda12.9
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Run diffusion server tests
        timeout-minutes: 240
        env:
          RUNAI_STREAMER_MEMORY_LIMIT: 0
          CONTINUE_ON_ERROR_FLAG: ${{ inputs.continue_on_error == 'true' && '--continue-on-error' || '' }}
        run: |
          cd python
          python3 sglang/multimodal_gen/test/run_suite.py \
            --suite 1-gpu-b200 \
            $CONTINUE_ON_ERROR_FLAG
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  multimodal-gen-unit-test:
    if: |
      (inputs.target_stage == 'multimodal-gen-unit-test') ||
      (
        !inputs.target_stage &&
        ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == 'true') || (inputs.caller_needs_failure != 'true' && !cancelled())) &&
        inputs.multimodal_gen == 'true'
      )
    runs-on: 1-gpu-h100
    timeout-minutes: 120
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
      - uses: ./.github/actions/check-stage-health
      - uses: ./.github/actions/check-maintenance
      - name: Download artifacts
        if: inputs.sgl_kernel == 'true'
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-python3.10-cuda12.9
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Run diffusion unit tests
        timeout-minutes: 60
        run: |
          cd python
          python3 sglang/multimodal_gen/test/run_suite.py --suite unit
--- a/third_party/sglang/.github/workflows/pr-test-npu.yml
+++ b/third_party/sglang/.github/workflows/pr-test-npu.yml
@@ -0,0 +1,453 @@
 name: PR Test (NPU)
 on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  workflow_dispatch:
  workflow_call:
    inputs:
      ref:
        description: 'Git ref (branch, tag, or SHA) to test. If not provided, falls back to github.ref of the triggering event.'
        required: false
        type: string
        default: ''
      run_all_tests:
        description: "Run all tests (for release or testing purposes)"
        required: false
        type: boolean
        default: false
 concurrency:
  group: pr-test-npu-${{ inputs.ref || github.ref }}
  cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
 jobs:
  # ==================== Check Changes ==================== #
  check-changes:
    runs-on: ubuntu-latest
    outputs:
      changes_exist: ${{ steps.filter.outputs.main_package == 'true' || steps.filter.outputs.multimodal_gen == 'true' || steps.run-mode.outputs.run_all_tests == 'true'}}
      main_package: ${{ steps.filter.outputs.main_package == 'true' || steps.run-mode.outputs.run_all_tests == 'true' }}
      multimodal_gen: ${{ steps.filter.outputs.multimodal_gen == 'true' || steps.run-mode.outputs.run_all_tests == 'true' }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Determine run mode
        id: run-mode
        run: |
          # Run all tests when the caller sets the run_all_tests input (e.g. for releases).
          # Note: github.event_name is inherited from the caller, so it cannot be used to detect workflow_call.
          if [[ "${{ inputs.run_all_tests }}" == "true" ]]; then
            echo "run_all_tests=true" >> $GITHUB_OUTPUT
            echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }})"
          else
            echo "run_all_tests=false" >> $GITHUB_OUTPUT
            echo "Run mode: FILTERED (triggered by ${{ github.event_name }})"
          fi
      - name: Detect file changes
        id: filter
        uses: dorny/paths-filter@v3
        if: steps.run-mode.outputs.run_all_tests != 'true'
        with:
          filters: |
            main_package:
              - "python/sglang/!(multimodal_gen)/**/!(*.md)"
              - "python/pyproject_npu.toml"
              - "scripts/ci/npu/npu_ci_install_dependency.sh"
              - "test/srt/ascend/**"
              - ".github/workflows/pr-test-npu.yml"
            multimodal_gen:
              - "python/sglang/multimodal_gen/**/*.!(md|ipynb)"
              - "python/sglang/srt/**"
              - "python/pyproject_npu.toml"
              - "scripts/ci/npu/npu_ci_install_dependency.sh"
              - ".github/workflows/pr-test-npu.yml"
  # ==================== PR Gate ==================== #
  pr-gate:
    needs: check-changes
    if: needs.check-changes.outputs.changes_exist == 'true'
    uses: ./.github/workflows/pr-gate.yml
    secrets: inherit
  stage-b-test-1-npu-a2:
    needs: [check-changes, pr-gate]
    if: needs.check-changes.outputs.main_package == 'true'
    runs-on: linux-aarch64-a2-1
    strategy:
      fail-fast: false
      matrix:
        part: [ 0, 1 ]
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-910b-ubuntu22.04-py3.11
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Mark repository safe
        run: |
          git config --system --add safe.directory ${GITHUB_WORKSPACE}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          bash scripts/ci/npu/npu_ci_install_dependency.sh 910b
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Run test
        timeout-minutes: 60
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          cd test
          python3 run_suite.py --hw npu --suite stage-b-test-1-npu-a2 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2
  stage-b-test-2-npu-a2:
    needs: [check-changes, pr-gate]
    if: needs.check-changes.outputs.main_package == 'true'
    runs-on: linux-aarch64-a2-2
    strategy:
      fail-fast: true
      matrix:
        part: [0, 1]
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-910b-ubuntu22.04-py3.11
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Mark repository safe
        run: |
          git config --system --add safe.directory ${GITHUB_WORKSPACE}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          bash scripts/ci/npu/npu_ci_install_dependency.sh 910b
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Run test
        timeout-minutes: 60
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          cd test
          python3 run_suite.py --hw npu --suite stage-b-test-2-npu-a2 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2
  stage-b-test-4-npu-a3:
    needs: [check-changes, pr-gate]
    if: needs.check-changes.outputs.main_package == 'true'
    runs-on: linux-aarch64-a3-4
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Mark repository safe
        run: |
          git config --system --add safe.directory ${GITHUB_WORKSPACE}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          bash scripts/ci/npu/npu_ci_install_dependency.sh a3
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Run test
        timeout-minutes: 60
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          cd test
          python3 run_suite.py --hw npu --suite stage-b-test-4-npu-a3 --timeout-per-file 3600
  stage-b-test-16-npu-a3:
    needs: [check-changes, pr-gate]
    if: needs.check-changes.outputs.main_package == 'true'
    runs-on: linux-aarch64-a3-16
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Mark repository safe
        run: |
          git config --system --add safe.directory ${GITHUB_WORKSPACE}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          bash scripts/ci/npu/npu_ci_install_dependency.sh a3
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Run test
        timeout-minutes: 60
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          cd test
          python3 run_suite.py --hw npu --suite stage-b-test-16-npu-a3 --timeout-per-file 3600
  multimodal-gen-test-1-npu-a3:
    needs: [check-changes, pr-gate]
    if: needs.check-changes.outputs.multimodal_gen == 'true'
    runs-on: linux-aarch64-a3-2
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-a3-ubuntu22.04-py3.11
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Mark repository safe
        run: |
          git config --system --add safe.directory ${GITHUB_WORKSPACE}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          bash scripts/ci/npu/npu_ci_install_dependency.sh a3 diffusion
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Run test
        timeout-minutes: 60
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}"
          cd python
          python3 sglang/multimodal_gen/test/run_suite.py --suite 1-npu
  multimodal-gen-test-2-npu-a3:
    needs: [check-changes, pr-gate]
    if: needs.check-changes.outputs.multimodal_gen == 'true'
    runs-on: linux-aarch64-a3-16
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.3.rc2-a3-ubuntu22.04-py3.11
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Mark repository safe
        run: |
          git config --system --add safe.directory ${GITHUB_WORKSPACE}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          bash scripts/ci/npu/npu_ci_install_dependency.sh a3 diffusion
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Run test
        timeout-minutes: 60
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}"
          cd python
          python3 sglang/multimodal_gen/test/run_suite.py --suite 2-npu
  multimodal-gen-test-8-npu-a3:
    needs: [check-changes, pr-gate]
    if: needs.check-changes.outputs.multimodal_gen == 'true'
    runs-on: linux-aarch64-a3-8
    container:
      image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.5.0-a3-ubuntu22.04-py3.11
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Mark repository safe
        run: |
          git config --system --add safe.directory ${GITHUB_WORKSPACE}
      - name: Install dependencies
        env:
          TORCH_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/whl/cpu"
          PYPI_CACHE_URL: "http://cache-service.nginx-pypi-cache.svc.cluster.local/pypi/simple"
          GITHUB_PROXY_URL: "https://gh-proxy.test.osinfra.cn/"
        run: |
          # speed up by using infra cache services
          CACHING_URL="cache-service.nginx-pypi-cache.svc.cluster.local"
          sed -Ei "s@(ports|archive).ubuntu.com@${CACHING_URL}:8081@g" /etc/apt/sources.list
          pip config set global.index-url http://${CACHING_URL}/pypi/simple
          pip config set global.trusted-host "${CACHING_URL}"
          bash scripts/ci/npu/npu_ci_install_dependency.sh a3 diffusion
          # copy required file from our daily cache
          cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
          # copy gsm8k dataset
          cp ~/.cache/modelscope/hub/datasets/tmp/test.jsonl /tmp
      - name: Run test
        timeout-minutes: 60
        env:
          SGLANG_USE_MODELSCOPE: true
          SGLANG_IS_IN_CI: true
          HF_ENDPOINT: https://hf-mirror.com
          TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
          PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
          STREAMS_PER_DEVICE: 32
        run: |
          cd python
          python3 sglang/multimodal_gen/test/run_suite.py --suite 8-npu
  pr-test-npu-finish:
    needs:
      [
        check-changes,
        stage-b-test-1-npu-a2,
        stage-b-test-2-npu-a2,
        stage-b-test-4-npu-a3,
        stage-b-test-16-npu-a3,
        multimodal-gen-test-1-npu-a3,
        multimodal-gen-test-2-npu-a3,
        multimodal-gen-test-8-npu-a3,
      ]
    if: always()
    runs-on: ubuntu-latest
    steps:
      - name: Check all dependent job statuses
        run: |
          # Convert the 'needs' context to a JSON string
          json_needs='${{ toJson(needs) }}'
          # Get a list of all job names from the JSON keys
          job_names=$(echo "$json_needs" | jq -r 'keys_unsorted[]')
          for job in $job_names; do
            # For each job, extract its result
            result=$(echo "$json_needs" | jq -r --arg j "$job" '.[$j].result')
            # Print the job name and its result
            echo "$job: $result"
            # Check for failure or cancellation and exit if found
            if [[ "$result" == "failure" || "$result" == "cancelled" ]]; then
              echo "The above jobs failed."
              exit 1
            fi
          done
          # If the loop completes, no job failed or was cancelled (skipped jobs are tolerated)
          echo "All jobs completed successfully"
          exit 0
--- a/third_party/sglang/.github/workflows/pr-test-rust.yml
+++ b/third_party/sglang/.github/workflows/pr-test-rust.yml
@@ -0,0 +1,359 @@
 name: PR Test (SMG)
 on:
  push:
    branches: [ main ]
    paths:
      - "sgl-model-gateway/**"
  pull_request:
    branches: [ main ]
    types: [opened, synchronize, reopened, labeled]
    paths:
      - "sgl-model-gateway/**"
  workflow_dispatch:
 concurrency:
  group: gateway-tests-${{ github.ref }}
  cancel-in-progress: true
 env:
  RUSTC_WRAPPER: sccache
  SCCACHE_GHA_ENABLED: "true"
  SGLANG_IS_IN_CI: true
 jobs:
  build-wheel:
    if: |
      github.event_name != 'pull_request' ||
      (github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) ||
      (github.event.action == 'labeled' && github.event.label.name == 'run-ci')
    runs-on: 4-gpu-a10
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install rust dependencies
        run: |
          bash scripts/ci/cuda/ci_install_gateway_dependencies.sh
      - name: Configure sccache
        uses: mozilla-actions/sccache-action@v0.0.9
        with:
          version: "v0.12.0"
          disable_annotations: true
      - name: Rust cache
        uses: Swatinem/rust-cache@v2
        with:
          workspaces: sgl-model-gateway
          shared-key: "rust-cache"
          cache-all-crates: true
          cache-on-failure: true
          save-if: true
      - name: Build python binding
        run: |
          source "$HOME/.cargo/env"
          export RUSTC_WRAPPER=sccache
          cd sgl-model-gateway/bindings/python
          python3 -m pip install --upgrade pip maturin
          maturin build --profile ci --features vendored-openssl --out dist
      - name: List built wheel
        run: ls -lh sgl-model-gateway/bindings/python/dist/
      - name: Upload wheel artifact
        uses: actions/upload-artifact@v4
        with:
          name: smg-wheel
          path: sgl-model-gateway/bindings/python/dist/*.whl
          retention-days: 1
      - name: Test wheel install
        run: |
          pip install sgl-model-gateway/bindings/python/dist/*.whl
          python3 -c "import sglang_router; print('Python package: OK')"
          python3 -c "from sglang_router.sglang_router_rs import Router; print('Rust extension: OK')"
          python3 -m sglang_router.launch_router --help > /dev/null && echo "Entry point: OK"
  python-unit-tests:
    needs: build-wheel
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          path: sglang-repo
      - name: Move sgl-model-gateway folder to root
        run: |
          mv sglang-repo/sgl-model-gateway/* .
          rm -rf sglang-repo
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.13"
      - name: Download wheel artifact
        uses: actions/download-artifact@v4
        with:
          name: smg-wheel
          path: dist/
      - name: Install wheel
        run: pip install dist/*.whl
      - name: Run Python unit tests
        run: |
          cd bindings/python
          python3 -m pip install pytest pytest-cov pytest-xdist
          pytest -q tests --cov=sglang_router --cov-config=.coveragerc --cov-report=term-missing --cov-fail-under=80
  unit-tests:
    if: |
      github.event_name != 'pull_request' ||
      (github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) ||
      (github.event.action == 'labeled' && github.event.label.name == 'run-ci')
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_gateway_dependencies.sh
      - name: Configure sccache
        uses: mozilla-actions/sccache-action@v0.0.9
        with:
          version: "v0.12.0"
          disable_annotations: true
      - name: Rust cache
        uses: Swatinem/rust-cache@v2
        with:
          workspaces: sgl-model-gateway
          shared-key: "rust-cache"
          cache-all-crates: true
          cache-on-failure: true
          save-if: true
      - name: Run lint
        run: |
          source "$HOME/.cargo/env"
          cd sgl-model-gateway/
          rustup component add clippy
          cargo clippy --all-targets --all-features -- -D warnings
      - name: Run fmt
        run: |
          source "$HOME/.cargo/env"
          cd sgl-model-gateway/
          rustup component add --toolchain nightly-x86_64-unknown-linux-gnu rustfmt
          rustup toolchain install nightly --profile minimal
          cargo +nightly fmt -- --check
      - name: Generate vision golden fixtures
        run: |
          pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
          pip install transformers pillow numpy scipy
          cd sgl-model-gateway/
          python scripts/generate_vision_golden.py
      - name: Run Rust tests
        timeout-minutes: 20
        run: |
          source "$HOME/.cargo/env"
          cd sgl-model-gateway/
          cargo test
      - name: Show sccache stats
        if: always()
        run: sccache --show-stats
  gateway-e2e:
    name: ${{ matrix.name }}
    needs: build-wheel
    if: |
      github.event_name != 'pull_request' ||
      (github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) ||
      (github.event.action == 'labeled' && github.event.label.name == 'run-ci')
    strategy:
      fail-fast: false
      matrix:
        include:
          - name: benchmarks
            timeout: 32
            test_dirs: "e2e_test/benchmarks"
            extra_deps: "genai-bench==0.0.3"
            env_vars: ""
            reruns: ""
            upload_benchmarks: true
            parallel_opts: ""  # No parallel for benchmarks (performance measurement)
          - name: responses
            timeout: 45
            test_dirs: "e2e_test/responses"
            extra_deps: ""
            env_vars: "SHOW_WORKER_LOGS=0 SHOW_ROUTER_LOGS=1"
            reruns: "--reruns 2 --reruns-delay 5"
            setup_oracle: true
            setup_brave: true
            parallel_opts: ""  # Cloud backend tests not compatible with parallel execution
          - name: e2e
            timeout: 45
            test_dirs: "e2e_test/router e2e_test/embeddings"
            extra_deps: "pytest-parallel py"  # py is required for pytest-parallel with newer pytest
            env_vars: "SHOW_WORKER_LOGS=0 SHOW_ROUTER_LOGS=1"
            reruns: "--reruns 2 --reruns-delay 5"
            parallel_opts: "--workers 1 --tests-per-worker 4"  # Thread-based parallelism
          - name: chat-completions
            timeout: 45
            test_dirs: "e2e_test/chat_completions"
            extra_deps: ""
            env_vars: "SHOW_WORKER_LOGS=0 SHOW_ROUTER_LOGS=1"
            reruns: "--reruns 2 --reruns-delay 5"
            parallel_opts: ""
    runs-on: 4-gpu-a10
    timeout-minutes: ${{ matrix.timeout }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install SGLang dependencies
        run: |
          sudo --preserve-env=PATH bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Setup Oracle Instant Client
        if: matrix.setup_oracle
        run: |
          sudo apt-get install -y unzip
          INSTANT_CLIENT_DIR="/home/ubuntu/instant-client"
          INSTANT_CLIENT_ZIP="instantclient-basic-linux.x64-23.9.0.25.07.zip"
          if [ ! -d "$INSTANT_CLIENT_DIR/instantclient_23_9" ]; then
            echo "Downloading Oracle Instant Client..."
            mkdir -p "$INSTANT_CLIENT_DIR"
            cd "$INSTANT_CLIENT_DIR"
            wget https://download.oracle.com/otn_software/linux/instantclient/2390000/$INSTANT_CLIENT_ZIP
            unzip $INSTANT_CLIENT_ZIP
            rm $INSTANT_CLIENT_ZIP
          else
            echo "Oracle Instant Client already exists, skipping download"
          fi
          echo "LD_LIBRARY_PATH=/home/ubuntu/instant-client/instantclient_23_9:\$LD_LIBRARY_PATH" >> $GITHUB_ENV
      - name: Start Oracle Database
        if: matrix.setup_oracle
        run: |
          docker run -d -p 1521:1521 -e ORACLE_PASSWORD=oracle --name oracle-db gvenzl/oracle-xe:21-slim
          echo "Starting Oracle DB..."
          # Export Oracle connection environment variables
          echo "ATP_USER=system" >> $GITHUB_ENV
          echo "ATP_PASSWORD=oracle" >> $GITHUB_ENV
          echo "ATP_DSN=localhost:1521/XEPDB1" >> $GITHUB_ENV
      - name: Start Brave MCP Server
        if: matrix.setup_brave
        run: |
          docker run -d --rm \
            -p 8001:8080 \
            -e BRAVE_API_KEY \
            --name brave-search-server \
            shoofio/brave-search-mcp-sse:1.0.10
          echo "Starting Brave MCP Server..."
          sleep 2
          curl -f --max-time 1 http://localhost:8001/sse > /dev/null 2>&1 && echo "Brave MCP Server is healthy!" || echo "Brave MCP Server health check did not return within 1s (expected for a streaming SSE endpoint, or the server is not up yet)"
      - name: Download wheel artifact
        uses: actions/download-artifact@v4
        with:
          name: smg-wheel
          path: wheel/
      - name: Install wheel
        run: |
          pip uninstall -y sglang-router || true
          pip install wheel/*.whl
      - name: Install e2e test dependencies
        run: |
          python3 -m pip install pytest pytest-rerunfailures httpx openai grpcio grpcio-health-checking numpy
          if [ -n "${{ matrix.extra_deps }}" ]; then
            python3 -m pip --no-cache-dir install --upgrade ${{ matrix.extra_deps }}
          fi
      - name: Run E2E tests
        run: |
          python3 python/sglang/cli/killall.py
          cd sgl-model-gateway
          ${{ matrix.env_vars }} ROUTER_LOCAL_MODEL_PATH="/home/ubuntu/models" pytest ${{ matrix.reruns }} ${{ matrix.parallel_opts }} ${{ matrix.test_dirs }} -s -vv -o log_cli=true --log-cli-level=INFO
      - name: Upload benchmark results
        if: matrix.upload_benchmarks && success()
        uses: actions/upload-artifact@v4
        with:
          name: genai-bench-results-all-policies
          path: sgl-model-gateway/benchmark_**/
      - name: Cleanup Brave MCP Server
        if: always() && matrix.setup_brave
        run: |
          docker stop brave-search-server || true
          docker rm brave-search-server || true
      - name: Cleanup Oracle Database
        if: always() && matrix.setup_oracle
        run: |
          docker stop oracle-db || true
          docker rm oracle-db || true
  docker-build-test:
    if: |
      github.event_name != 'pull_request' ||
      (github.event.action != 'labeled' && contains(github.event.pull_request.labels.*.name, 'run-ci')) ||
      (github.event.action == 'labeled' && github.event.label.name == 'run-ci')
    runs-on: ubuntu-24.04
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build Docker image (no push)
        uses: docker/build-push-action@v5
        with:
          context: .
          file: docker/gateway.Dockerfile
          push: false
          tags: sgl-model-gateway:test
          cache-from: type=gha
          cache-to: type=gha,mode=max
  finish:
    needs: [build-wheel, python-unit-tests, unit-tests, gateway-e2e, docker-build-test]
    runs-on: ubuntu-latest
    steps:
      - name: Finish
        run: echo "This is an empty step to ensure that all jobs are completed."
  summarize-benchmarks:
    needs: gateway-e2e
    runs-on: ubuntu-latest
    if: success()
    steps:
    - name: Checkout code
      uses: actions/checkout@v4
    - name: Download benchmark results
      uses: actions/download-artifact@v4
      with:
        name: genai-bench-results-all-policies
    - name: Create benchmark summary
      run: python3 sgl-model-gateway/e2e_test/benchmarks/summarize.py .
--- a/third_party/sglang/.github/workflows/pr-test-sgl-kernel.yml
+++ b/third_party/sglang/.github/workflows/pr-test-sgl-kernel.yml
@@ -0,0 +1,214 @@
 name: PR Test - SGL Kernel
 on:
  workflow_call:
    inputs:
      sgl_kernel:
        required: true
        type: string
      b200_runner:
        required: true
        type: string
      pr_head_sha:
        required: false
        type: string
        default: ''
      git_ref:
        required: false
        type: string
        default: ''
      skip_stage_health_check:
        required: false
        type: boolean
        default: false
 # Workflow-level env is NOT inherited from the caller in reusable workflows.
 # The github context (including github.event_name) IS inherited from the caller.
 env:
  SGLANG_IS_IN_CI: true
  SGLANG_CUDA_COREDUMP: "1"
  SGLANG_PR_TEST_BYPASS_MAINTENANCE_ON_MAIN: ${{ github.ref == 'refs/heads/main' && 'true' || 'false' }}
  SKIP_STAGE_HEALTH_CHECK: ${{ inputs.skip_stage_health_check == true && 'true' || 'false' }}
 jobs:
  sgl-kernel-unit-test:
    runs-on: 1-gpu-h100
    timeout-minutes: 240
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
      - uses: ./.github/actions/check-stage-health
      - uses: ./.github/actions/check-maintenance
      - name: Cleanup
        run: |
          ls -alh sgl-kernel/dist || true
          rm -rf sgl-kernel/dist/* || true
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-python3.10-cuda12.9
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Run test
        timeout-minutes: 30
        run: |
          cd sgl-kernel
          pytest tests/
  sgl-kernel-mla-test:
    runs-on: 1-gpu-h100
    timeout-minutes: 240
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
      - uses: ./.github/actions/check-stage-health
      - uses: ./.github/actions/check-maintenance
      - name: Cleanup
        run: |
          ls -alh sgl-kernel/dist || true
          rm -rf sgl-kernel/dist/* || true
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-python3.10-cuda12.9
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run test
        timeout-minutes: 30
        run: |
          cd test/registered/mla
          python3 test_mla_deepseek_v3.py
  sgl-kernel-benchmark-test:
    runs-on: 1-gpu-h100
    timeout-minutes: 240
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
      - uses: ./.github/actions/check-stage-health
      - uses: ./.github/actions/check-maintenance
      - name: Cleanup
        run: |
          ls -alh sgl-kernel/dist || true
          rm -rf sgl-kernel/dist/* || true
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-python3.10-cuda12.9
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run benchmark tests
        timeout-minutes: 45
        run: |
          cd sgl-kernel/benchmark
          echo "Running sgl-kernel benchmark tests in CI mode..."
          echo "CI environment variable: $CI"
          echo "GITHUB_ACTIONS environment variable: $GITHUB_ACTIONS"
          for bench_file in bench_*.py; do
            echo "Testing $bench_file..."
            timeout 60 python3 "$bench_file" || echo "Warning: $bench_file timed out or failed, continuing..."
            echo "Completed $bench_file"
            echo "---"
          done
          echo "All benchmark tests completed!"
  sgl-kernel-b200-test:
    runs-on: ${{ inputs.b200_runner }}
    timeout-minutes: 240
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || inputs.git_ref || github.sha }}
      - uses: ./.github/actions/check-stage-health
      - uses: ./.github/actions/check-maintenance
      - name: Cleanup
        run: |
          ls -alh sgl-kernel/dist || true
          rm -rf sgl-kernel/dist/* || true
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-python3.10-cuda12.9
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh diffusion
      - name: Run sgl-kernel unit tests on B200
        timeout-minutes: 30
        run: |
          cd sgl-kernel
          pytest tests/
  # Single CUDA 13 smoke test to verify that the kernel builds and runs (currently disabled)
  # TODO: Re-enable this test once it passes on CI
  # cuda13-kernel-smoke-test:
  #   if: inputs.sgl_kernel == 'true'
  #   runs-on: x64-cu13-kernel-tests
  #   steps:
  #     - uses: actions/checkout@v4
  #     - name: Cleanup
  #       run: |
  #         ls -alh sgl-kernel/dist || true
  #         rm -rf sgl-kernel/dist/* || true
  #     - name: Download CUDA 13.0 artifacts
  #       uses: actions/download-artifact@v4
  #       with:
  #         path: sgl-kernel/dist/
  #         merge-multiple: true
  #         pattern: wheel-python3.10-cuda13.0
  #     - name: Install dependencies
  #       run: |
  #         CUSTOM_BUILD_SGL_KERNEL=${{inputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh
  #     - name: Run kernel unit tests
  #       timeout-minutes: 30
  #       run: |
  #         cd sgl-kernel
  #         pytest tests/
--- a/third_party/sglang/.github/workflows/pr-test-xeon.yml
+++ b/third_party/sglang/.github/workflows/pr-test-xeon.yml
@@ -0,0 +1,131 @@
 name: PR Test (Xeon)
 on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  workflow_dispatch:
  workflow_call:
    inputs:
      ref:
        description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.'
        required: false
        type: string
        default: ''
      run_all_tests:
        description: "Run all tests (for releasing or testing purpose)"
        required: false
        type: boolean
        default: false
 concurrency:
  group: pr-test-xeon-${{ inputs.ref || github.ref }}
  cancel-in-progress: false
 jobs:
  # ==================== Check Changes ==================== #
  check-changes:
    runs-on: ubuntu-latest
    outputs:
      main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests}}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Determine run mode
        id: run-mode
        run: |
          # Run all tests when the run_all_tests input is true (e.g. set by a calling workflow)
          # Note: github.event_name is inherited from the caller, so it cannot be used to detect workflow_call
          if [[ "${{ inputs.run_all_tests }}" == "true" ]]; then
            echo "run_all_tests=true" >> $GITHUB_OUTPUT
            echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }})"
          else
            echo "run_all_tests=false" >> $GITHUB_OUTPUT
            echo "Run mode: FILTERED (triggered by ${{ github.event_name }})"
          fi
      - name: Detect file changes
        id: filter
        uses: dorny/paths-filter@v3
        if: steps.run-mode.outputs.run_all_tests != 'true'
        with:
          filters: |
            main_package:
              - "python/sglang/!(multimodal_gen)/**/!(*.md)"
              - "python/pyproject_cpu.toml"
              - "test/**/!(*.md)"
              - "sgl-kernel/**/*.!(md|txt)"
              - ".github/workflows/pr-test-xeon.yml"
              - "docker/xeon.Dockerfile"
  # ==================== PR Gate ==================== #
  pr-gate:
    needs: check-changes
    if: needs.check-changes.outputs.main_package == 'true'
    uses: ./.github/workflows/pr-gate.yml
    secrets: inherit
  build-test:
    needs: [check-changes, pr-gate]
    if: needs.check-changes.outputs.main_package == 'true'
    runs-on: xeon-gnr
    env:
      HF_HOME: /home/sdp/.cache/huggingface
    strategy:
      matrix:
        build_type: ['all']
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Build and Push
        run: |
          version=$(cat python/sglang/version.py | cut -d'"' -f2)
          tag=v${version}-xeon
          PR_REPO=${{ github.event.pull_request.head.repo.clone_url }}
          PR_HEAD_REF=${{ github.head_ref }}
          docker build \
            ${PR_REPO:+--build-arg SGLANG_REPO=$PR_REPO} \
            ${PR_HEAD_REF:+--build-arg VER_SGLANG=$PR_HEAD_REF} \
            . -f docker/xeon.Dockerfile  -t sglang_xeon --no-cache
      - name: Run container
        run: |
          docker run -dt \
            -v ${{ github.workspace }}:/sglang-checkout/ --ipc=host \
            -v ${HF_HOME}:/root/.cache/huggingface \
            --name ci_sglang_xeon \
            sglang_xeon
      - name: Check AMX support
        id: check_amx
        timeout-minutes: 5
        run: |
          docker exec -w /sglang-checkout/ ci_sglang_xeon \
            bash -c "source /opt/.venv/bin/activate && python3 -c 'import torch; import sgl_kernel; assert torch._C._cpu._is_amx_tile_supported(); assert hasattr(torch.ops.sgl_kernel, \"convert_weight_packed\"); '"
      - name: Run unit tests
        timeout-minutes: 36
        run: |
          docker exec -w /sglang-checkout/ ci_sglang_xeon \
            bash -c "source /opt/.venv/bin/activate && cd ./test/srt && python3 run_suite.py --suite per-commit-cpu --timeout-per-file 1500"
      - name: Change permission
        timeout-minutes: 2
        run: |
          docker exec -u root ci_sglang_xeon bash -c "
            rm -rf /tmp/ci-home  &&
            chown -R  $(id -u):$(id -g) /sglang-checkout/ 2>/dev/null || true
          "
      - name: Cleanup container
        if: always()
        run: |
          docker rm -f ci_sglang_xeon || true
--- a/third_party/sglang/.github/workflows/pr-test-xpu.yml
+++ b/third_party/sglang/.github/workflows/pr-test-xpu.yml
@@ -0,0 +1,143 @@
 name: PR Test (XPU)
 on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  workflow_dispatch:
  workflow_call:
    inputs:
      ref:
        description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.'
        required: false
        type: string
        default: ''
      run_all_tests:
        description: "Run all tests (for releasing or testing purpose)"
        required: false
        type: boolean
        default: false
 concurrency:
  group: pr-test-xpu-${{ inputs.ref || github.ref }}
  cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
 jobs:
  # ==================== Check Changes ==================== #
  check-changes:
    runs-on: ubuntu-latest
    outputs:
      main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref || github.ref }}
      - name: Determine run mode
        id: run-mode
        run: |
          # Run all tests when the run_all_tests input is true (e.g. set by a calling workflow)
          # Note: github.event_name is inherited from the caller, so it cannot be used to detect workflow_call
          if [[ "${{ inputs.run_all_tests }}" == "true" ]]; then
            echo "run_all_tests=true" >> $GITHUB_OUTPUT
            echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }})"
          else
            echo "run_all_tests=false" >> $GITHUB_OUTPUT
            echo "Run mode: FILTERED (triggered by ${{ github.event_name }})"
          fi
      - name: Detect file changes
        id: filter
        uses: dorny/paths-filter@v3
        if: steps.run-mode.outputs.run_all_tests != 'true'
        with:
          filters: |
            main_package:
              - "python/sglang/!(multimodal_gen)/**/!(*.md)"
              - "python/pyproject_xpu.toml"
              - "test/**/!(*.md)"
              - "sgl-kernel/**/*.!(md|txt)"
              - ".github/workflows/pr-test-xpu.yml"
              - "docker/xpu.Dockerfile"
  # ==================== PR Gate ==================== #
  pr-gate:
    needs: check-changes
    if: needs.check-changes.outputs.main_package == 'true'
    uses: ./.github/workflows/pr-gate.yml
    secrets: inherit
  build-and-test:
    needs: [check-changes, pr-gate]
    if: needs.check-changes.outputs.main_package == 'true'
    runs-on: intel-bmg
    env:
      HF_HOME: /home/sdp/.cache/huggingface
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          ref: ${{ inputs.ref || github.ref }}
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build Docker image
        run: |
          PR_REPO=${{ github.event.pull_request.head.repo.clone_url }}
          PR_HEAD_REF=${{ github.head_ref }}
          docker build \
            ${PR_REPO:+--build-arg SG_LANG_REPO=$PR_REPO} \
            ${PR_HEAD_REF:+--build-arg SG_LANG_BRANCH=$PR_HEAD_REF} \
            --no-cache --progress=plain -f docker/xpu.Dockerfile -t xpu_sglang_main:bmg .
      - name: Run container
        id: start_container
        run: |
          container_id=$(docker run -dt \
            --group-add 992 \
            --group-add $(getent group video | cut -d: -f3) \
            -v ${HF_HOME}:/root/.cache/huggingface \
            --device /dev/dri \
            -e HF_TOKEN="$(cat ~/huggingface_token.txt)" \
            xpu_sglang_main:bmg)
          echo "Started container: $container_id"
          echo "container_id=$container_id" >> "$GITHUB_OUTPUT"
      - name: Install Dependency
        timeout-minutes: 20
        run: |
          cid="${{ steps.start_container.outputs.container_id }}"
          docker exec "$cid" /home/sdp/miniforge3/envs/py3.10/bin/python3 -m pip install --upgrade pip
          docker exec "$cid" /home/sdp/miniforge3/envs/py3.10/bin/python3 -m pip install pytest expecttest ray huggingface_hub
          docker exec "$cid" /home/sdp/miniforge3/envs/py3.10/bin/python3 -m pip uninstall -y flashinfer-python
          docker exec "$cid" /bin/bash -c '/home/sdp/miniforge3/envs/py3.10/bin/hf auth login --token ${HF_TOKEN} '
      - name: Run E2E Bfloat16 tests
        timeout-minutes: 20
        run: |
          cid="${{ steps.start_container.outputs.container_id }}"
          docker exec "$cid" bash -c "source /home/sdp/miniforge3/bin/activate && conda activate py3.10 && cd /home/sdp/sglang/test/srt && python3 run_suite.py --suite per-commit-xpu"
      - name: Cleanup container
        if: always()
        run: |
          cid="${{ steps.start_container.outputs.container_id }}"
          docker rm -f "$cid" || true
  finish:
    if: always()
    needs: [build-and-test, pr-gate]
    runs-on: ubuntu-latest
    steps:
      - name: Check job status
        run: |
          result="${{ needs.build-and-test.result }}"
          if [ "$result" != "success" ] && [ "$result" != "skipped" ]; then
            echo "Job failed with result: $result"
            exit 1
          fi
          echo "All jobs completed successfully (result: $result)"
          exit 0
--- a/third_party/sglang/.github/workflows/pr-test.yml
+++ b/third_party/sglang/.github/workflows/pr-test.yml
--- a/third_party/sglang/.github/workflows/release-branch-cut.yml
+++ b/third_party/sglang/.github/workflows/release-branch-cut.yml
@@ -0,0 +1,215 @@
 name: Release Branch Cut
 on:
  workflow_dispatch:
    inputs:
      branch_name:
        description: 'Branch name to create (e.g., release/v0.5.7)'
        required: true
        type: string
      commit_sha:
        description: 'Commit SHA from main to cut the release branch from (defaults to latest main)'
        required: false
        type: string
        default: ''
 permissions:
  actions: write
  contents: write
  issues: read
  pull-requests: read
 jobs:
  cut-release-branch:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-latest
    environment: 'prod'
    outputs:
      branch_name: ${{ steps.set_output.outputs.branch_name }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          ref: main
          fetch-depth: 0
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Validate branch name
        run: |
          BRANCH_NAME="${{ github.event.inputs.branch_name }}"
          if [ -z "$BRANCH_NAME" ]; then
            echo "::error::Branch name is required"
            exit 1
          fi
          # Validate branch name format (should start with release/)
          if [[ ! "$BRANCH_NAME" =~ ^release/ ]]; then
            echo "::warning::Branch name '$BRANCH_NAME' does not follow convention 'release/vX.Y.Z'"
          fi
          echo "Branch name: $BRANCH_NAME"
      - name: Validate commit SHA
        id: validate
        run: |
          COMMIT_SHA="${{ github.event.inputs.commit_sha }}"
          # If no commit SHA provided, use latest main
          if [ -z "$COMMIT_SHA" ]; then
            COMMIT_SHA=$(git rev-parse HEAD)
            echo "No commit SHA provided, using latest main: $COMMIT_SHA"
          fi
          # Verify the commit exists and is on main
          if ! git cat-file -t "$COMMIT_SHA" > /dev/null 2>&1; then
            echo "::error::Commit SHA '$COMMIT_SHA' does not exist"
            exit 1
          fi
          # Check if commit is an ancestor of main (i.e., is on main branch)
          if ! git merge-base --is-ancestor "$COMMIT_SHA" main; then
            echo "::error::Commit SHA '$COMMIT_SHA' is not on the main branch"
            exit 1
          fi
          echo "COMMIT_SHA=$COMMIT_SHA" >> $GITHUB_OUTPUT
          echo "Validated commit SHA: $COMMIT_SHA"
      - name: Check if branch already exists
        run: |
          BRANCH_NAME="${{ github.event.inputs.branch_name }}"
          if git ls-remote --heads origin "$BRANCH_NAME" | grep -q "$BRANCH_NAME"; then
            echo "::error::Branch '$BRANCH_NAME' already exists"
            exit 1
          fi
          echo "Branch '$BRANCH_NAME' does not exist, proceeding with creation"
      - name: Create release branch
        id: set_output
        run: |
          COMMIT_SHA="${{ steps.validate.outputs.COMMIT_SHA }}"
          BRANCH_NAME="${{ github.event.inputs.branch_name }}"
          git config user.name "sglang-bot"
          git config user.email "sglang-bot@users.noreply.github.com"
          # Create branch from the specified commit
          git checkout -b "$BRANCH_NAME" "$COMMIT_SHA"
          echo "branch_name=$BRANCH_NAME" >> $GITHUB_OUTPUT
          echo "Successfully created branch '$BRANCH_NAME' from commit '$COMMIT_SHA'"
      - name: Update version references in documentation
        run: |
          BRANCH_NAME="${{ github.event.inputs.branch_name }}"
          # Extract version from branch name (e.g., release/v0.5.8 -> v0.5.8)
          VERSION=$(echo "$BRANCH_NAME" | sed 's/release\///')
          # Update git clone version references in docs
          sed -i "s/git clone -b v[0-9]\+\.[0-9]\+\.[0-9]\+\.\?post\?[0-9]*/git clone -b $VERSION/" docs/get_started/install.md
          sed -i "s/git clone -b v[0-9]\+\.[0-9]\+\.[0-9]\+\.\?post\?[0-9]*/git clone -b $VERSION/" docs/platforms/amd_gpu.md
          # Check if any changes were made
          if git diff --quiet; then
            echo "No version references needed updating"
          else
            git add docs/get_started/install.md docs/platforms/amd_gpu.md
            git commit -m "docs: update version references to $VERSION"
            echo "Updated version references to $VERSION"
          fi
      - name: Push release branch
        run: |
          BRANCH_NAME="${{ steps.set_output.outputs.branch_name }}"
          git push origin "$BRANCH_NAME"
          echo "Successfully pushed branch '$BRANCH_NAME'"
      - name: Summary
        run: |
          COMMIT_SHA="${{ steps.validate.outputs.COMMIT_SHA }}"
          BRANCH_NAME="${{ github.event.inputs.branch_name }}"
          echo "## Release Branch Cut Summary" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "| Property | Value |" >> $GITHUB_STEP_SUMMARY
          echo "|----------|-------|" >> $GITHUB_STEP_SUMMARY
          echo "| Branch | \`$BRANCH_NAME\` |" >> $GITHUB_STEP_SUMMARY
          echo "| Commit | \`$COMMIT_SHA\` |" >> $GITHUB_STEP_SUMMARY
          echo "| Triggered by | @${{ github.actor }} |" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "### Next Steps" >> $GITHUB_STEP_SUMMARY
          echo "1. Tests are automatically triggered on the release branch" >> $GITHUB_STEP_SUMMARY
          echo "2. Apply any hotfixes if needed" >> $GITHUB_STEP_SUMMARY
          echo "3. Create a tag to trigger release: \`gh workflow run release-tag.yml -f version=X.Y.Z -f ref=$BRANCH_NAME\`" >> $GITHUB_STEP_SUMMARY
  run-pr-tests-nvidia:
    needs: cut-release-branch
    uses: ./.github/workflows/pr-test.yml
    with:
      git_ref: ${{ needs.cut-release-branch.outputs.branch_name }}
      run_all_tests: true
      skip_stage_health_check: true
    secrets: inherit
  run-pr-tests-amd:
    needs: cut-release-branch
    uses: ./.github/workflows/pr-test-amd.yml
    with:
      ref: ${{ needs.cut-release-branch.outputs.branch_name }}
      run_all_tests: true
    secrets: inherit
  run-pr-test-npu:
    needs: cut-release-branch
    uses: ./.github/workflows/pr-test-npu.yml
    with:
      ref: ${{ needs.cut-release-branch.outputs.branch_name }}
      run_all_tests: true
    secrets: inherit
  run-pr-tests-xeon:
    needs: cut-release-branch
    uses: ./.github/workflows/pr-test-xeon.yml
    with:
      ref: ${{ needs.cut-release-branch.outputs.branch_name }}
      run_all_tests: true
    secrets: inherit
  run-pr-tests-xpu:
    needs: cut-release-branch
    uses: ./.github/workflows/pr-test-xpu.yml
    with:
      ref: ${{ needs.cut-release-branch.outputs.branch_name }}
      run_all_tests: true
    secrets: inherit
  run-nightly-tests-nvidia:
    needs: cut-release-branch
    uses: ./.github/workflows/nightly-test-nvidia.yml
    with:
      ref: ${{ needs.cut-release-branch.outputs.branch_name }}
    secrets: inherit
  run-nightly-tests-amd:
    needs: cut-release-branch
    uses: ./.github/workflows/nightly-test-amd.yml
    with:
      ref: ${{ needs.cut-release-branch.outputs.branch_name }}
    secrets: inherit
  run-nightly-tests-npu:
    needs: cut-release-branch
    uses: ./.github/workflows/nightly-test-npu.yml
    with:
      ref: ${{ needs.cut-release-branch.outputs.branch_name }}
    secrets: inherit
  run-nightly-tests-intel:
    needs: cut-release-branch
    uses: ./.github/workflows/nightly-test-intel.yml
    with:
      ref: ${{ needs.cut-release-branch.outputs.branch_name }}
    secrets: inherit
--- a/third_party/sglang/.github/workflows/release-docker-amd-nightly.yml
+++ b/third_party/sglang/.github/workflows/release-docker-amd-nightly.yml
@@ -0,0 +1,182 @@
 name: Release Docker Images Nightly (AMD)
 on:
  workflow_dispatch:
  schedule:
    - cron: '0 12 * * *'
 concurrency:
  # A PR number if a pull request and otherwise the commit hash. This cancels
  # queued and in-progress runs for the same PR (presubmit) or commit
  # (postsubmit). The workflow name is prepended to avoid conflicts between
  # different workflows.
  group: ${{ github.workflow }}-${{ github.event.number || github.sha }}
  cancel-in-progress: true
 jobs:
  publish:
    if: github.repository == 'sgl-project/sglang'
    runs-on: amd-docker-scale
    environment: 'prod'
    strategy:
      fail-fast: false
      matrix:
        gpu_arch: ['gfx942', 'gfx950']
        build_type: ['all']
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history so that `git tag -l` can see all version tags
      - name: "Set Date"
        run: |
          echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV
      - name: Get version from latest tag
        id: version
        run: |
          # Get the latest version tag sorted by version number (e.g., v0.5.7 -> 0.5.7)
          VERSION=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1 | sed 's/^v//')
          if [ -z "$VERSION" ]; then
            echo "::error::Could not determine version from git tags"
            exit 1
          fi
          # Get short commit hash of current HEAD
          COMMIT_HASH=$(git rev-parse --short HEAD)
          # Compose pretend version for setuptools_scm: e.g., 0.5.8.dev20260129+g1a2b3c4
          PRETEND_VERSION="${VERSION}.dev${{ env.DATE }}+g${COMMIT_HASH}"
          echo "version=${VERSION}" >> $GITHUB_OUTPUT
          echo "pretend_version=${PRETEND_VERSION}" >> $GITHUB_OUTPUT
          echo "Detected version: ${VERSION}"
          echo "Pretend version for pip: ${PRETEND_VERSION}"
      - name: Login to Docker Hub (AMD)
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_AMD_USERNAME }}
          password: ${{ secrets.DOCKERHUB_AMD_TOKEN }}
      - name: Build and Push to rocm/sgl-dev
        run: |
          version=${{ steps.version.outputs.version }}
          pretend_version=${{ steps.version.outputs.pretend_version }}
          echo "Version: ${version}"
          echo "Pretend version: ${pretend_version}"
          if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
            rocm_tag="rocm700-mi30x"
          elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
            rocm_tag="rocm700-mi35x"
          else
            echo "Unsupported gfx arch"
            exit 1
          fi
          tag=v${version}-${rocm_tag}
          echo "IMAGE_TAG=${tag}-${{ env.DATE }}" >> $GITHUB_ENV
          docker build . -f docker/rocm.Dockerfile --build-arg SGL_BRANCH=${{ github.ref_name }} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache
          docker push rocm/sgl-dev:${tag}-${{ env.DATE }}
      - name: Login to Docker Hub (lmsys)
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Push to lmsysorg/sglang-rocm
        run: |
          docker tag rocm/sgl-dev:${{ env.IMAGE_TAG }} lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }}
          docker push lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }}
  # Temporarily disable docker cache seeding until performant storage is in place
  cache:
    if: false
    # if: always() && github.repository == 'sgl-project/sglang'
    runs-on: linux-mi300-gpu-1
    environment: 'prod'
    needs: publish
    strategy:
      fail-fast: false
      matrix:
        gpu_arch: ['gfx942']
        build_type: ['all']
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history so that `git tag -l` can see all version tags
      - name: "Set Date"
        run: |
          echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV
      - name: Get version from latest tag
        id: version
        run: |
          # Get the latest version tag sorted by version number (e.g., v0.5.7 -> 0.5.7)
          VERSION=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1 | sed 's/^v//')
          if [ -z "$VERSION" ]; then
            echo "::error::Could not determine version from git tags"
            exit 1
          fi
          echo "version=${VERSION}" >> $GITHUB_OUTPUT
          echo "Detected version: ${VERSION}"
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_AMD_USERNAME }}
          password: ${{ secrets.DOCKERHUB_AMD_TOKEN }}
      - name: Pull and Save Docker Image to Cache
        run: |
          set -euxo pipefail
          version=${{ steps.version.outputs.version }}
          echo "Version: ${version}"
          if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
            rocm_tag="rocm700-mi30x"
          else
            echo "Unsupported gfx arch"
            exit 1
          fi
          tag=v${version}-${rocm_tag}
          if [ "${{ matrix.build_type }}" = "all" ]; then
            tag_suffix=""
          else
            echo "Unsupported build type"
            exit 1
          fi
          image="rocm/sgl-dev:${tag}-${{ env.DATE }}${tag_suffix}"
          # Determine target cache file name based on ROCm variant
          if [[ "${rocm_tag}" == rocm700* ]]; then
            final_path="/home/runner/sgl-data/docker/image-700.tar"
          else
            echo "Unexpected ROCm tag: ${rocm_tag}"
            exit 1
          fi
          tmp_path="${final_path}.tmp"
          echo "Pulling image: ${image}"
          docker pull "${image}"
          echo "Saving to temp file: ${tmp_path}"
          docker save "${image}" -o "${tmp_path}"
          echo "Moving to final path: ${final_path}"
          mv -f "${tmp_path}" "${final_path}"
          echo "Cache populated successfully at ${final_path}"
--- a/third_party/sglang/.github/workflows/release-docker-amd-rocm720-nightly.yml
+++ b/third_party/sglang/.github/workflows/release-docker-amd-rocm720-nightly.yml
@@ -0,0 +1,94 @@
 name: Release Docker Images ROCm 7.2.0 Nightly Preview (AMD)
 on:
  workflow_dispatch:
  schedule:
    - cron: '0 12 * * *'
 concurrency:
  # A PR number if a pull request and otherwise the commit hash. This cancels
  # queued and in-progress runs for the same PR (presubmit) or commit
  # (postsubmit). The workflow name is prepended to avoid conflicts between
  # different workflows.
  group: ${{ github.workflow }}-${{ github.event.number || github.sha }}
  cancel-in-progress: true
 jobs:
  publish:
    if: github.repository == 'sgl-project/sglang'
    runs-on: amd-docker-scale
    environment: 'prod'
    strategy:
      fail-fast: false
      matrix:
        gpu_arch: ['gfx942-rocm720', 'gfx950-rocm720']
        build_type: ['all']
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history so that `git tag -l` can see all version tags
      - name: "Set Date"
        run: |
          echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV
      - name: Get version from latest tag
        id: version
        run: |
          # Get the latest version tag sorted by version number (e.g., v0.5.7 -> 0.5.7)
          VERSION=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1 | sed 's/^v//')
          if [ -z "$VERSION" ]; then
            echo "::error::Could not determine version from git tags"
            exit 1
          fi
          # Get short commit hash of current HEAD
          COMMIT_HASH=$(git rev-parse --short HEAD)
          # Compose pretend version for setuptools_scm: e.g., 0.5.8.post1.dev20260211+g1a2b3c4
          PRETEND_VERSION="${VERSION}.dev${{ env.DATE }}+g${COMMIT_HASH}"
          echo "version=${VERSION}" >> $GITHUB_OUTPUT
          echo "pretend_version=${PRETEND_VERSION}" >> $GITHUB_OUTPUT
          echo "Detected version: ${VERSION}"
          echo "Pretend version for pip: ${PRETEND_VERSION}"
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_AMD_USERNAME }}
          password: ${{ secrets.DOCKERHUB_AMD_TOKEN }}
      - name: Build and Push to rocm/sgl-dev
        run: |
          version=${{ steps.version.outputs.version }}
          pretend_version=${{ steps.version.outputs.pretend_version }}
          echo "Version: ${version}"
          echo "Pretend version: ${pretend_version}"
          if [ "${{ matrix.gpu_arch }}" = "gfx942-rocm720" ]; then
            rocm_tag="rocm720-mi30x"
          elif [ "${{ matrix.gpu_arch }}" = "gfx950-rocm720" ]; then
            rocm_tag="rocm720-mi35x"
          else
            echo "Unsupported gfx arch"
            exit 1
          fi
          tag=v${version}-${rocm_tag}
          echo "IMAGE_TAG=${tag}-${{ env.DATE }}" >> $GITHUB_ENV
          docker build . -f docker/rocm.Dockerfile --build-arg SGL_BRANCH=${{ github.ref_name }} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache
          docker push rocm/sgl-dev:${tag}-${{ env.DATE }}
      - name: Login to Docker Hub (lmsys)
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Push to lmsysorg/sglang-rocm
        run: |
          docker tag rocm/sgl-dev:${{ env.IMAGE_TAG }} lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }}
          docker push lmsysorg/sglang-rocm:${{ env.IMAGE_TAG }}
--- a/third_party/sglang/.github/workflows/release-docker-amd.yml
+++ b/third_party/sglang/.github/workflows/release-docker-amd.yml
@@ -0,0 +1,88 @@
 name: Release Docker Images (AMD)
 on:
  push:
    tags:
      - 'v[0-9]+.*'
  workflow_dispatch:
    inputs:
      version:
        description: 'Version to build (without v prefix, e.g., 0.5.7)'
        required: true
 jobs:
  publish:
    if: github.repository == 'sgl-project/sglang'
    runs-on: amd-docker-scale
    environment: 'prod'
    strategy:
      matrix:
        rocm_version: ['rocm700', 'rocm720']
        gpu_arch: ['gfx942', 'gfx950']
        build_type: ['all']
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Get version from tag
        id: version
        run: |
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            VERSION="${{ github.event.inputs.version }}"
          else
            # Extract version from tag (e.g., v0.5.7 -> 0.5.7)
            VERSION="${GITHUB_REF_NAME#v}"
          fi
          # Validate version format
          if [ -z "$VERSION" ]; then
            echo "::error::Version is empty"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
            exit 1
          fi
          echo "version=${VERSION}" >> $GITHUB_OUTPUT
      - name: Build and Push
        run: |
          version=${{ steps.version.outputs.version }}
          echo "Version: ${version}"
          gpu_arch_suffix=""
          if [ "${{ matrix.rocm_version }}" = "rocm700" ]; then
            if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
              rocm_tag="rocm700-mi30x"
            elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
              rocm_tag="rocm700-mi35x"
            else
              echo "Unsupported gfx arch"
              exit 1
            fi
          elif [ "${{ matrix.rocm_version }}" = "rocm720" ]; then
            gpu_arch_suffix="-${{ matrix.rocm_version }}"
            if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
              rocm_tag="rocm720-mi30x"
            elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
              rocm_tag="rocm720-mi35x"
            else
              echo "Unsupported gfx arch"
              exit 1
            fi
          else
            echo "Unsupported rocm version"
            exit 1
          fi
          tag=v${version}-${rocm_tag}
          # rocm.Dockerfile expects SGL_BRANCH with 'v' prefix for git tag checkout
          docker build . -f docker/rocm.Dockerfile --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }}${gpu_arch_suffix} --build-arg SGL_BRANCH=v${version} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic -t lmsysorg/sglang:${tag} --no-cache
          docker push lmsysorg/sglang:${tag}
--- a/third_party/sglang/.github/workflows/release-docker-cu13-framework.yml
+++ b/third_party/sglang/.github/workflows/release-docker-cu13-framework.yml
@@ -0,0 +1,190 @@
 name: Release CUDA 13 Framework Docker Images (Temporary)
 # Temporary workflow that builds only the CUDA 13 framework images (v{version}-cu130 and latest-cu130)
 # Can be deleted after use
 on:
  workflow_dispatch:
    inputs:
      version:
        description: "Version to build (without v prefix, e.g., 0.5.8)"
        required: true
 jobs:
  publish-x86:
    if: github.repository == 'sgl-project/sglang'
    runs-on: x64-docker-build-node
    steps:
      - name: Delete huge unnecessary tools folder
        run: rm -rf /opt/hostedtoolcache
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          tool-cache: false
          docker-images: false
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: false
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Validate version
        id: version
        run: |
          VERSION="${{ github.event.inputs.version }}"
          if [ -z "$VERSION" ]; then
            echo "::error::Version is empty"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
            exit 1
          fi
          echo "version=${VERSION}" >> $GITHUB_OUTPUT
      - name: Build and Push AMD64 Framework (CUDA 13)
        run: |
          version=${{ steps.version.outputs.version }}
          docker buildx build \
            --target framework \
            --platform linux/amd64 \
            --output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
            -f docker/Dockerfile \
            --build-arg CUDA_VERSION=13.0.1 \
            --build-arg BUILD_TYPE=all \
            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
            --build-arg GRACE_BLACKWELL=0 \
            --build-arg SGL_VERSION=${version} \
            --metadata-file /tmp/metadata.json \
            --no-cache \
            .
          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata.json'))['containerimage.digest'])")
          echo "Pushed digest: ${DIGEST}"
          echo "${DIGEST}" > /tmp/digest-cu130-amd64-framework.txt
      - name: Upload digest
        uses: actions/upload-artifact@v4
        with:
          name: digest-cu130-amd64
          path: /tmp/digest-cu130-amd64-framework.txt
          retention-days: 1
  publish-arm64:
    if: github.repository == 'sgl-project/sglang'
    runs-on: arm-docker-build-node
    steps:
      - name: Delete huge unnecessary tools folder
        run: rm -rf /opt/hostedtoolcache
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Validate version
        id: version
        run: |
          VERSION="${{ github.event.inputs.version }}"
          if [ -z "$VERSION" ]; then
            echo "::error::Version is empty"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
            exit 1
          fi
          echo "version=${VERSION}" >> $GITHUB_OUTPUT
      - name: Build and Push ARM64 Framework (CUDA 13)
        run: |
          version=${{ steps.version.outputs.version }}
          docker buildx build \
            --target framework \
            --platform linux/arm64 \
            --output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
            -f docker/Dockerfile \
            --build-arg CUDA_VERSION=13.0.1 \
            --build-arg BUILD_TYPE=all \
            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
            --build-arg GRACE_BLACKWELL=1 \
            --build-arg SGL_VERSION=${version} \
            --metadata-file /tmp/metadata.json \
            --no-cache \
            .
          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata.json'))['containerimage.digest'])")
          echo "Pushed digest: ${DIGEST}"
          echo "${DIGEST}" > /tmp/digest-cu130-arm64-framework.txt
      - name: Upload digest
        uses: actions/upload-artifact@v4
        with:
          name: digest-cu130-arm64
          path: /tmp/digest-cu130-arm64-framework.txt
          retention-days: 1
  create-manifest:
    runs-on: ubuntu-22.04
    needs: [publish-x86, publish-arm64]
    if: github.repository == 'sgl-project/sglang'
    steps:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Download amd64 digest
        uses: actions/download-artifact@v4
        with:
          name: digest-cu130-amd64
          path: /tmp/digests/amd64
      - name: Download arm64 digest
        uses: actions/download-artifact@v4
        with:
          name: digest-cu130-arm64
          path: /tmp/digests/arm64
      - name: Create multi-arch manifest
        run: |
          version=${{ github.event.inputs.version }}
          AMD64_DIGEST=$(cat /tmp/digests/amd64/digest-cu130-amd64-framework.txt)
          ARM64_DIGEST=$(cat /tmp/digests/arm64/digest-cu130-arm64-framework.txt)
          # Create versioned CUDA 13 framework manifest
          docker buildx imagetools create \
            -t lmsysorg/sglang:v${version}-cu130 \
            lmsysorg/sglang@${AMD64_DIGEST} \
            lmsysorg/sglang@${ARM64_DIGEST}
          # Create latest CUDA 13 framework manifest
          docker buildx imagetools create \
            -t lmsysorg/sglang:latest-cu130 \
            lmsysorg/sglang@${AMD64_DIGEST} \
            lmsysorg/sglang@${ARM64_DIGEST}
--- a/third_party/sglang/.github/workflows/release-docker-dev.yml
+++ b/third_party/sglang/.github/workflows/release-docker-dev.yml
@@ -0,0 +1,209 @@
 name: Build and Push Development Docker Images
 on:
  workflow_dispatch:
    inputs:
      pr_number:
        description: "PR number to build from (leave empty to use current branch)"
        required: false
        default: ""
      tag:
        description: "Custom tag suffix (overrides pr_number in tag). E.g. 'my-test' → dev-my-test, dev-cu13-my-test, etc."
        required: false
        default: ""
  schedule:
    - cron: "0 0 * * *"
 concurrency:
  group: release-docker-dev-${{ inputs.tag || inputs.pr_number || 'nightly' }}
  cancel-in-progress: true
 jobs:
  build-dev:
    if: ${{ github.repository == 'sgl-project/sglang' }}
    runs-on: ${{ matrix.runner }}
    strategy:
      matrix:
        include:
          - runner: x64-docker-build-node
            platform: linux/amd64
            build_type: all
            grace_blackwell: 0
            arch_tag: x86
            version: 12.9.1
          - runner: arm-docker-build-node
            platform: linux/arm64
            build_type: all
            grace_blackwell: 1
            arch_tag: arm64
            version: 12.9.1
          - runner: x64-docker-build-node
            platform: linux/amd64
            build_type: all
            grace_blackwell: 0
            arch_tag: x86-cu13
            version: 13.0.1
          - runner: arm-docker-build-node
            platform: linux/arm64
            build_type: all
            grace_blackwell: 1
            arch_tag: arm64-cu13
            version: 13.0.1
    steps:
      - name: Delete huge unnecessary tools folder
        run: rm -rf /opt/hostedtoolcache
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || github.ref }}
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          tool-cache: true
          docker-images: true
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: true
      - name: Prune Docker to reclaim disk space
        run: |
          docker buildx prune --filter "until=72h" -f
          docker system prune -af --filter "until=72h"
          docker volume prune -af
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and Push Dev Image
        run: |
          # Nightly (schedule) installs latest release; manual dispatch builds from checked-out source
          if [ "${{ github.event_name }}" = "schedule" ]; then
            SOURCE_ARG="--build-arg USE_LATEST_SGLANG=1"
          else
            SOURCE_ARG="--build-arg BRANCH_TYPE=local"
          fi
          docker buildx build \
            --platform ${{ matrix.platform }} \
            --output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
            --target framework \
            -f docker/Dockerfile \
            --build-arg CUDA_VERSION=${{ matrix.version }} \
            --build-arg BUILD_TYPE=${{ matrix.build_type }} \
            --build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \
            --build-arg GRACE_BLACKWELL=${{ matrix.grace_blackwell }} \
            ${SOURCE_ARG} \
            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
            --metadata-file /tmp/metadata.json \
            --no-cache \
            .
          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata.json'))['containerimage.digest'])")
          echo "Pushed digest: ${DIGEST}"
          echo "${DIGEST}" > /tmp/digest.txt
      - name: Upload digest
        uses: actions/upload-artifact@v4
        with:
          name: digest-${{ matrix.arch_tag }}
          path: /tmp/digest.txt
          retention-days: 1
  create-manifests:
    runs-on: ubuntu-22.04
    needs: [build-dev]
    if: ${{ github.repository == 'sgl-project/sglang' }}
    strategy:
      matrix:
        variant:
          - base: dev
            x86: x86
            arm64: arm64
          - base: dev-cu13
            x86: x86-cu13
            arm64: arm64-cu13
    steps:
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Download x86 digest
        uses: actions/download-artifact@v4
        with:
          name: digest-${{ matrix.variant.x86 }}
          path: /tmp/digests/x86
      - name: Download arm64 digest
        uses: actions/download-artifact@v4
        with:
          name: digest-${{ matrix.variant.arm64 }}
          path: /tmp/digests/arm64
      - name: Create multi-arch manifest
        run: |
          X86_DIGEST=$(cat /tmp/digests/x86/digest.txt)
          ARM64_DIGEST=$(cat /tmp/digests/arm64/digest.txt)
          SUFFIX=""
          if [ -n "${{ inputs.tag }}" ]; then
            SUFFIX="-${{ inputs.tag }}"
          elif [ -n "${{ inputs.pr_number }}" ]; then
            SUFFIX="-pr-${{ inputs.pr_number }}"
          fi
          TAG="${{ matrix.variant.base }}${SUFFIX}"
          # For nightly (no suffix), also stamp a dated tag
          EXTRA_TAG=""
          if [ -z "${SUFFIX}" ]; then
            SHORT_SHA="${{ github.sha }}"
            EXTRA_TAG="-t lmsysorg/sglang:nightly-${TAG}-$(date +%Y%m%d)-${SHORT_SHA:0:8}"
          fi
          docker buildx imagetools create \
            -t lmsysorg/sglang:${TAG} \
            ${EXTRA_TAG} \
            lmsysorg/sglang@${X86_DIGEST} \
            lmsysorg/sglang@${ARM64_DIGEST}
          echo "✓ Published lmsysorg/sglang:${TAG}"
      - name: Cleanup Old Nightly Builds
        if: ${{ !inputs.tag && !inputs.pr_number }}
        run: |
          TOKEN=$(curl -s -H "Content-Type: application/json" \
            -X POST -d '{"username": "${{ secrets.DOCKERHUB_USERNAME }}", "password": "${{ secrets.DOCKERHUB_TOKEN }}"}' \
            https://hub.docker.com/v2/users/login/ | jq -r .token)
          TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" \
            "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100")
          TAGS=$(echo "$TAGS_RESPONSE" | jq -r \
            '.results[] | select(.name | test("^nightly-${{ matrix.variant.base }}-[0-9]")) | "\(.last_updated)|\(.name)"' \
            | sort -r | cut -d'|' -f2)
          TAG_COUNT=$(echo "$TAGS" | grep -c . || true)  # count non-empty lines; wc -l reports 1 for an empty list
          if [ "$TAG_COUNT" -gt 14 ]; then
            echo "Found $TAG_COUNT nightly builds, keeping only the 14 most recent"
            TAGS_TO_DELETE=$(echo "$TAGS" | tail -n +15)
            for tag in $TAGS_TO_DELETE; do
              echo "Deleting tag: $tag"
              curl -X DELETE -H "Authorization: JWT $TOKEN" \
                "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/$tag/"
            done
          else
            echo "Only $TAG_COUNT nightly builds found, no cleanup needed"
          fi
--- a/third_party/sglang/.github/workflows/release-docker-gateway.yml
+++ b/third_party/sglang/.github/workflows/release-docker-gateway.yml
@@ -0,0 +1,39 @@
 name: Release SGLang Model Gateway Docker Image
 on:
  push:
    branches:
      - main
    paths:
      - sgl-model-gateway/bindings/python/pyproject.toml
  workflow_dispatch:
 jobs:
  publish:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-24.04
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and Push
        run: |
          version=$(cat sgl-model-gateway/bindings/python/src/sglang_router/version.py | cut -d'"' -f2)
          tag=v${version}
          docker buildx build . -f docker/gateway.Dockerfile \
            --platform linux/amd64,linux/arm64 \
            -t lmsysorg/sgl-model-gateway:${tag} \
            -t lmsysorg/sgl-model-gateway:latest \
            --push
--- a/third_party/sglang/.github/workflows/release-docker-npu-nightly.yml
+++ b/third_party/sglang/.github/workflows/release-docker-npu-nightly.yml
@@ -0,0 +1,85 @@
 name: Release Docker Images Nightly (NPU)
 on:
  pull_request:
    branches:
      - 'main'
    paths:
      - '.github/workflows/release-docker-npu-nightly.yml'
      - 'docker/npu.Dockerfile'
  workflow_dispatch:
  schedule:
    - cron: "0 16 * * *" # Runs daily at 16:00 UTC (00:00 Beijing Time)
 concurrency:
  group: ${{ github.workflow }}-${{ github.sha }}
  cancel-in-progress: true
 jobs:
  build:
    runs-on: ubuntu-22.04-arm
    strategy:
      matrix:
        cann_version: ["8.5.0"]
        device_type: ["910b", "a3"]
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Free up disk space
        uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
        with:
          tool-cache: true
          docker-images: false
      - name: Setup Docker buildx
        uses: docker/setup-buildx-action@v3
      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: |
            lmsysorg/sglang
          # push with schedule event
          # push with workflow_dispatch event
          tags: |
            type=ref,event=pr
            type=ref,event=branch
            type=schedule,pattern=main
          flavor: |
            latest=false
            suffix=-cann${{ matrix.cann_version }}-${{ matrix.device_type }},onlatest=true
      # Login against a Docker registry except on PR
      # https://github.com/docker/login-action
      - name: Log into docker hub
        uses: docker/login-action@v3
        if: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      # Enable Docker multi-architecture build environment
      # Emulate non-native architectures
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
      # Required for building and pushing multi-arch Docker images
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      # Build and push Docker image with Buildx (don't push on PR)
      # https://github.com/docker/build-push-action
      - name: Build and push Docker image
        id: build-and-push
        uses: docker/build-push-action@v6
        with:
          context: docker
          file: docker/npu.Dockerfile
          platforms: linux/arm64,linux/amd64
          labels: ${{ steps.meta.outputs.labels }}
          tags: ${{ steps.meta.outputs.tags }}
          push: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
          provenance: false
          build-args: |
            SGLANG_KERNEL_NPU_TAG=2026.03.10.rc1
            CANN_VERSION=${{ matrix.cann_version }}
            DEVICE_TYPE=${{ matrix.device_type }}
--- a/third_party/sglang/.github/workflows/release-docker-npu.yml
+++ b/third_party/sglang/.github/workflows/release-docker-npu.yml
@@ -0,0 +1,93 @@
 name: Release Docker Images (NPU)
 on:
  push:
    tags:
      - 'v[0-9]+.*'
  workflow_dispatch:
    inputs:
      version:
        description: 'Version to build (without v prefix, e.g., 0.5.7)'
        required: true
 jobs:
  build:
    runs-on: ubuntu-22.04-arm
    strategy:
      matrix:
        cann_version: ["8.5.0"]
        device_type: ["910b", "a3"]
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Free up disk space
        uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
        with:
          tool-cache: true
          docker-images: false
      # push with tag
      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: |
            lmsysorg/sglang
          tags: |
            type=ref,event=pr
          flavor: |
            latest=false
      # Login against a Docker registry except on PR
      # https://github.com/docker/login-action
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        if: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Get version from tag
        id: version
        run: |
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            VERSION="${{ github.event.inputs.version }}"
          else
            # Extract version from tag (e.g., v0.5.7 -> 0.5.7)
            VERSION="${GITHUB_REF_NAME#v}"
          fi
          # Validate version format
          if [ -z "$VERSION" ]; then
            echo "::error::Version is empty"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
            exit 1
          fi
          echo "version=v${VERSION}" >> $GITHUB_OUTPUT
          echo "TAG=lmsysorg/sglang:v${VERSION}-cann${{ matrix.cann_version }}-${{ matrix.device_type }}" >> $GITHUB_OUTPUT
      # Enable Docker multi-architecture build environment
      # Emulate non-native architectures
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
      # Required for building and pushing multi-arch Docker images
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build and push Docker image
        id: build-and-push
        uses: docker/build-push-action@v6
        with:
          context: docker
          file: docker/npu.Dockerfile
          platforms: linux/arm64,linux/amd64
          labels: ${{ steps.meta.outputs.labels }}
          tags: ${{ steps.meta.outputs.tags || steps.version.outputs.TAG }}
          push: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
          provenance: false
          build-args: |
            SGLANG_KERNEL_NPU_TAG=2026.03.10.rc1
            CANN_VERSION=${{ matrix.cann_version }}
            DEVICE_TYPE=${{ matrix.device_type }}
            SGLANG_TAG=${{ steps.version.outputs.version }}
--- a/third_party/sglang/.github/workflows/release-docker-runtime.yml
+++ b/third_party/sglang/.github/workflows/release-docker-runtime.yml
@@ -0,0 +1,309 @@
 name: Release Docker Runtime Images
 #
 # This workflow builds and publishes runtime Docker images (production-optimized, ~50% smaller):
 #   - lmsysorg/sglang:v{version}-runtime, lmsysorg/sglang:latest-runtime
 #   - lmsysorg/sglang:v{version}-cu130-runtime, lmsysorg/sglang:latest-cu130-runtime
 #
 on:
  push:
    tags:
      - "v[0-9]+.*"
  workflow_dispatch:
    inputs:
      version:
        description: "Version to build (without v prefix, e.g., 0.5.7)"
        required: true
 jobs:
  publish-x86:
    if: github.repository == 'sgl-project/sglang'
    environment: "prod"
    strategy:
      matrix:
        variant:
          - cuda_version: "12.9.1"
            build_type: "all"
            grace_blackwell: 0
    runs-on: x64-docker-build-node
    steps:
      - name: Delete huge unnecessary tools folder
        run: rm -rf /opt/hostedtoolcache
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          tool-cache: false
          docker-images: false
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: false
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Get version from tag
        id: version
        run: |
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            VERSION="${{ github.event.inputs.version }}"
          else
            # Extract version from tag (e.g., v0.5.7 -> 0.5.7)
            VERSION="${GITHUB_REF_NAME#v}"
          fi
          # Validate version format
          if [ -z "$VERSION" ]; then
            echo "::error::Version is empty"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
            exit 1
          fi
          echo "version=${VERSION}" >> $GITHUB_OUTPUT
      - name: Build and Push AMD64 Runtime
        run: |
          version=${{ steps.version.outputs.version }}
          docker buildx build \
            --target runtime \
            --platform linux/amd64 \
            --output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
            -f docker/Dockerfile \
            --build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \
            --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
            --build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \
            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
            --build-arg SGL_VERSION=${version} \
            --metadata-file /tmp/metadata-cu129-runtime.json \
            --no-cache \
            .
          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu129-runtime.json'))['containerimage.digest'])")
          echo "Pushed digest: ${DIGEST}"
          echo "${DIGEST}" > /tmp/digest-cu129-amd64-runtime.txt
      - name: Build and Push AMD64 Runtime (CUDA 13)
        run: |
          version=${{ steps.version.outputs.version }}
          docker buildx build \
            --target runtime \
            --platform linux/amd64 \
            --output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
            -f docker/Dockerfile \
            --build-arg CUDA_VERSION=13.0.1 \
            --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
            --build-arg GRACE_BLACKWELL=0 \
            --build-arg SGL_VERSION=${version} \
            --metadata-file /tmp/metadata-cu130-runtime.json \
            --no-cache \
            .
          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu130-runtime.json'))['containerimage.digest'])")
          echo "Pushed digest: ${DIGEST}"
          echo "${DIGEST}" > /tmp/digest-cu130-amd64-runtime.txt
      - name: Upload digests
        uses: actions/upload-artifact@v4
        with:
          name: digests-amd64
          path: /tmp/digest-*.txt
          retention-days: 1
  publish-arm64:
    if: github.repository == 'sgl-project/sglang'
    environment: "prod"
    strategy:
      matrix:
        variant:
          - cuda_version: "12.9.1"
            build_type: "all"
            grace_blackwell: 1
    runs-on: arm-docker-build-node
    steps:
      - name: Delete huge unnecessary tools folder
        run: rm -rf /opt/hostedtoolcache
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Get version from tag
        id: version
        run: |
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            VERSION="${{ github.event.inputs.version }}"
          else
            # Extract version from tag (e.g., v0.5.7 -> 0.5.7)
            VERSION="${GITHUB_REF_NAME#v}"
          fi
          # Validate version format
          if [ -z "$VERSION" ]; then
            echo "::error::Version is empty"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
            exit 1
          fi
          echo "version=${VERSION}" >> $GITHUB_OUTPUT
      - name: Build and Push ARM64 Runtime
        run: |
          version=${{ steps.version.outputs.version }}
          docker buildx build \
            --target runtime \
            --platform linux/arm64 \
            --output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
            -f docker/Dockerfile \
            --build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \
            --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
            --build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \
            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
            --build-arg SGL_VERSION=${version} \
            --metadata-file /tmp/metadata-cu129-runtime.json \
            --no-cache \
            .
          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu129-runtime.json'))['containerimage.digest'])")
          echo "Pushed digest: ${DIGEST}"
          echo "${DIGEST}" > /tmp/digest-cu129-arm64-runtime.txt
      - name: Build and Push ARM64 Runtime (CUDA 13)
        run: |
          version=${{ steps.version.outputs.version }}
          docker buildx build \
            --target runtime \
            --platform linux/arm64 \
            --output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
            -f docker/Dockerfile \
            --build-arg CUDA_VERSION=13.0.1 \
            --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
            --build-arg GRACE_BLACKWELL=1 \
            --build-arg SGL_VERSION=${version} \
            --metadata-file /tmp/metadata-cu130-runtime.json \
            --no-cache \
            .
          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu130-runtime.json'))['containerimage.digest'])")
          echo "Pushed digest: ${DIGEST}"
          echo "${DIGEST}" > /tmp/digest-cu130-arm64-runtime.txt
      - name: Upload digests
        uses: actions/upload-artifact@v4
        with:
          name: digests-arm64
          path: /tmp/digest-*.txt
          retention-days: 1
  create-manifests:
    runs-on: ubuntu-22.04
    needs: [publish-x86, publish-arm64]
    if: github.repository == 'sgl-project/sglang'
    environment: "prod"
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Get version from tag
        id: version
        run: |
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            VERSION="${{ github.event.inputs.version }}"
          else
            # Extract version from tag (e.g., v0.5.7 -> 0.5.7)
            VERSION="${GITHUB_REF_NAME#v}"
          fi
          # Validate version format
          if [ -z "$VERSION" ]; then
            echo "::error::Version is empty"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
            exit 1
          fi
          echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
      - name: Download amd64 digests
        uses: actions/download-artifact@v4
        with:
          name: digests-amd64
          path: /tmp/digests/amd64
      - name: Download arm64 digests
        uses: actions/download-artifact@v4
        with:
          name: digests-arm64
          path: /tmp/digests/arm64
      - name: Create multi-arch manifests
        run: |
          version=${{ steps.version.outputs.version }}
          CU129_AMD64_RT=$(cat /tmp/digests/amd64/digest-cu129-amd64-runtime.txt)
          CU130_AMD64_RT=$(cat /tmp/digests/amd64/digest-cu130-amd64-runtime.txt)
          CU129_ARM64_RT=$(cat /tmp/digests/arm64/digest-cu129-arm64-runtime.txt)
          CU130_ARM64_RT=$(cat /tmp/digests/arm64/digest-cu130-arm64-runtime.txt)
          # Create versioned runtime manifest
          docker buildx imagetools create \
            -t lmsysorg/sglang:v${version}-runtime \
            lmsysorg/sglang@${CU129_AMD64_RT} \
            lmsysorg/sglang@${CU129_ARM64_RT}
          # Create latest runtime manifest
          docker buildx imagetools create \
            -t lmsysorg/sglang:latest-runtime \
            lmsysorg/sglang@${CU129_AMD64_RT} \
            lmsysorg/sglang@${CU129_ARM64_RT}
          # Create versioned CUDA 13 runtime manifest
          docker buildx imagetools create \
            -t lmsysorg/sglang:v${version}-cu130-runtime \
            lmsysorg/sglang@${CU130_AMD64_RT} \
            lmsysorg/sglang@${CU130_ARM64_RT}
          # Create latest CUDA 13 runtime manifest
          docker buildx imagetools create \
            -t lmsysorg/sglang:latest-cu130-runtime \
            lmsysorg/sglang@${CU130_AMD64_RT} \
            lmsysorg/sglang@${CU130_ARM64_RT}
--- a/third_party/sglang/.github/workflows/release-docker-xeon.yml
+++ b/third_party/sglang/.github/workflows/release-docker-xeon.yml
@@ -0,0 +1,62 @@
 name: Release Docker Xeon Images
 on:
  push:
    tags:
      - 'v[0-9]+.*'
  workflow_dispatch:
    inputs:
      version:
        description: 'Version to build (without v prefix, e.g., 0.5.7)'
        required: true
 jobs:
  publish:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-24.04
    environment: 'prod'
    strategy:
      matrix:
        build_type: ['all']
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Get version from tag
        id: version
        run: |
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            VERSION="${{ github.event.inputs.version }}"
          else
            # Extract version from tag (e.g., v0.5.7 -> 0.5.7)
            VERSION="${GITHUB_REF_NAME#v}"
          fi
          # Validate version format
          if [ -z "$VERSION" ]; then
            echo "::error::Version is empty"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
            exit 1
          fi
          echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
      - name: Build and Push
        run: |
          version=${{ steps.version.outputs.version }}
          tag=v${version}-xeon
          docker build . -f docker/xeon.Dockerfile \
            --build-arg VER_SGLANG=v${version} \
            -t lmsysorg/sglang:${tag} \
            --no-cache
          docker push lmsysorg/sglang:${tag}
--- a/third_party/sglang/.github/workflows/release-docker.yml
+++ b/third_party/sglang/.github/workflows/release-docker.yml
@@ -0,0 +1,294 @@
 name: Release Docker Images
 #
 # This workflow builds and publishes framework Docker images (full development environment):
 #   - lmsysorg/sglang:v{version}, lmsysorg/sglang:latest
 #   - lmsysorg/sglang:v{version}-cu130, lmsysorg/sglang:latest-cu130
 #
 on:
  push:
    tags:
      - "v[0-9]+.*"
  workflow_dispatch:
    inputs:
      version:
        description: "Version to build (without v prefix, e.g., 0.5.7)"
        required: true
 jobs:
  publish-x86:
    if: github.repository == 'sgl-project/sglang'
    environment: "prod"
    outputs:
      digest-cu129: ${{ steps.build-cu129.outputs.digest }}
      digest-cu130: ${{ steps.build-cu130.outputs.digest }}
    strategy:
      matrix:
        variant:
          - cuda_version: "12.9.1"
            build_type: "all"
            grace_blackwell: 0
    runs-on: x64-docker-build-node
    steps:
      - name: Delete huge unnecessary tools folder
        run: rm -rf /opt/hostedtoolcache
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          tool-cache: false
          docker-images: false
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: false
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Get version from tag
        id: version
        run: |
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            VERSION="${{ github.event.inputs.version }}"
          else
            # Extract version from tag (e.g., v0.5.7 -> 0.5.7)
            VERSION="${GITHUB_REF_NAME#v}"
          fi
          # Validate version format
          if [ -z "$VERSION" ]; then
            echo "::error::Version is empty"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
            exit 1
          fi
          echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
      - name: Build and Push AMD64 Framework
        id: build-cu129
        run: |
          version=${{ steps.version.outputs.version }}
          docker buildx build \
            --target framework \
            --platform linux/amd64 \
            --output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
            -f docker/Dockerfile \
            --build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \
            --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
            --build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \
            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
            --build-arg SGL_VERSION=${version} \
            --metadata-file /tmp/metadata-cu129-framework.json \
            --no-cache \
            .
          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu129-framework.json'))['containerimage.digest'])")
          echo "Pushed digest: ${DIGEST}"
          echo "digest=${DIGEST}" >> "$GITHUB_OUTPUT"
      - name: Build and Push AMD64 Framework (CUDA 13)
        id: build-cu130
        run: |
          version=${{ steps.version.outputs.version }}
          docker buildx build \
            --target framework \
            --platform linux/amd64 \
            --output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
            -f docker/Dockerfile \
            --build-arg CUDA_VERSION=13.0.1 \
            --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
            --build-arg GRACE_BLACKWELL=0 \
            --build-arg SGL_VERSION=${version} \
            --metadata-file /tmp/metadata-cu130-framework.json \
            --no-cache \
            .
          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu130-framework.json'))['containerimage.digest'])")
          echo "Pushed digest: ${DIGEST}"
          echo "digest=${DIGEST}" >> "$GITHUB_OUTPUT"
  publish-arm64:
    if: github.repository == 'sgl-project/sglang'
    environment: "prod"
    outputs:
      digest-cu129: ${{ steps.build-cu129.outputs.digest }}
      digest-cu130: ${{ steps.build-cu130.outputs.digest }}
    strategy:
      matrix:
        variant:
          - cuda_version: "12.9.1"
            build_type: "all"
            grace_blackwell: 1
    runs-on: arm-docker-build-node
    steps:
      - name: Delete huge unnecessary tools folder
        run: rm -rf /opt/hostedtoolcache
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Get version from tag
        id: version
        run: |
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            VERSION="${{ github.event.inputs.version }}"
          else
            # Extract version from tag (e.g., v0.5.7 -> 0.5.7)
            VERSION="${GITHUB_REF_NAME#v}"
          fi
          # Validate version format
          if [ -z "$VERSION" ]; then
            echo "::error::Version is empty"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
            exit 1
          fi
          echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
      - name: Build and Push ARM64 Framework
        id: build-cu129
        run: |
          version=${{ steps.version.outputs.version }}
          docker buildx build \
            --target framework \
            --platform linux/arm64 \
            --output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
            -f docker/Dockerfile \
            --build-arg CUDA_VERSION=${{ matrix.variant.cuda_version }} \
            --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
            --build-arg GRACE_BLACKWELL=${{ matrix.variant.grace_blackwell }} \
            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
            --build-arg SGL_VERSION=${version} \
            --metadata-file /tmp/metadata-cu129-framework.json \
            --no-cache \
            .
          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu129-framework.json'))['containerimage.digest'])")
          echo "Pushed digest: ${DIGEST}"
          echo "digest=${DIGEST}" >> "$GITHUB_OUTPUT"
      - name: Build and Push ARM64 Framework (CUDA 13)
        id: build-cu130
        run: |
          version=${{ steps.version.outputs.version }}
          docker buildx build \
            --target framework \
            --platform linux/arm64 \
            --output type=image,name=lmsysorg/sglang,push-by-digest=true,name-canonical=true,push=true \
            -f docker/Dockerfile \
            --build-arg CUDA_VERSION=13.0.1 \
            --build-arg BUILD_TYPE=${{ matrix.variant.build_type }} \
            --build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
            --build-arg GRACE_BLACKWELL=1 \
            --build-arg SGL_VERSION=${version} \
            --metadata-file /tmp/metadata-cu130-framework.json \
            --no-cache \
            .
          DIGEST=$(python3 -c "import json; print(json.load(open('/tmp/metadata-cu130-framework.json'))['containerimage.digest'])")
          echo "Pushed digest: ${DIGEST}"
          echo "digest=${DIGEST}" >> "$GITHUB_OUTPUT"
  create-manifests:
    runs-on: ubuntu-22.04
    needs: [publish-x86, publish-arm64]
    if: github.repository == 'sgl-project/sglang'
    environment: "prod"
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Get version from tag
        id: version
        run: |
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            VERSION="${{ github.event.inputs.version }}"
          else
            # Extract version from tag (e.g., v0.5.7 -> 0.5.7)
            VERSION="${GITHUB_REF_NAME#v}"
          fi
          # Validate version format
          if [ -z "$VERSION" ]; then
            echo "::error::Version is empty"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z)"
            exit 1
          fi
          echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
      - name: Create multi-arch manifests
        run: |
          version=${{ steps.version.outputs.version }}
          CU129_AMD64_FW=${{ needs.publish-x86.outputs.digest-cu129 }}
          CU130_AMD64_FW=${{ needs.publish-x86.outputs.digest-cu130 }}
          CU129_ARM64_FW=${{ needs.publish-arm64.outputs.digest-cu129 }}
          CU130_ARM64_FW=${{ needs.publish-arm64.outputs.digest-cu130 }}
          # Create versioned framework manifest (default)
          docker buildx imagetools create \
            -t lmsysorg/sglang:v${version} \
            lmsysorg/sglang@${CU129_AMD64_FW} \
            lmsysorg/sglang@${CU129_ARM64_FW}
          # Create latest framework manifest (default)
          docker buildx imagetools create \
            -t lmsysorg/sglang:latest \
            lmsysorg/sglang@${CU129_AMD64_FW} \
            lmsysorg/sglang@${CU129_ARM64_FW}
          # Create versioned CUDA 13 framework manifest
          docker buildx imagetools create \
            -t lmsysorg/sglang:v${version}-cu130 \
            lmsysorg/sglang@${CU130_AMD64_FW} \
            lmsysorg/sglang@${CU130_ARM64_FW}
          # Create latest CUDA 13 framework manifest
          docker buildx imagetools create \
            -t lmsysorg/sglang:latest-cu130 \
            lmsysorg/sglang@${CU130_AMD64_FW} \
            lmsysorg/sglang@${CU130_ARM64_FW}
--- a/third_party/sglang/.github/workflows/release-docs.yml
+++ b/third_party/sglang/.github/workflows/release-docs.yml
@@ -0,0 +1,89 @@
 name: Release Documentation
 on:
  release:
    types: [published]
  push:
    branches:
      - main
    paths:
      - "docs/**"
      - "python/sglang/version.py"
      - "python/sglang/**"
  workflow_dispatch:
 concurrency:
  group: release-docs-${{ github.ref }}
  cancel-in-progress: true
 env:
  SGLANG_IS_IN_CI: true
 jobs:
  execute-and-deploy:
    runs-on: 1-gpu-h100
    if: github.repository == 'sgl-project/sglang'
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Fetch full git history for release index
        if: github.event_name == 'release'
        run: |
          git fetch --prune --unshallow || git fetch --prune
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
          pip install -r docs/requirements.txt
          apt-get update && apt-get install -y pandoc parallel retry
          ln -sf "$(which python3)" /usr/bin/python
      - name: Setup Jupyter Kernel
        run: |
          python -m ipykernel install --user --name python3 --display-name "Python 3"
      - name: Execute notebooks
        timeout-minutes: 40
        run: |
          cd docs
          make clean
          make compile
      - name: Push HTML to sgl-project.github.io
        timeout-minutes: 30
        env:
          GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_DOCUMENTATION }}
        run: |
          cd docs
          make html
          make markdown
          python3 wrap_run_llm.py
          if [[ "${{ github.event_name }}" == "release" ]]; then
            python3 release_lookup/generate_index.py --output release_lookup/release_index.json
            # Copy release lookup tool for official docs on published releases.
            mkdir -p _build/html/release_lookup
            cp release_lookup/index.html _build/html/release_lookup/
            cp release_lookup/release_index.json _build/html/release_lookup/
          fi
          cd _build/html
          git clone https://$GITHUB_TOKEN@github.com/sgl-project/sgl-project.github.io.git ../sgl-project.github.io --depth 1
          if [[ "${{ github.event_name }}" == "release" ]]; then
            find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete
          else
            find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -path "../sgl-project.github.io/release_lookup*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete
          fi
          cp -r * ../sgl-project.github.io
          cp ../../README.md ../sgl-project.github.io/README.md
          cd ../sgl-project.github.io
          git config user.name "sglang-bot"
          git config user.email "sglangbot@gmail.com"
          git add .
          git commit -m "Update $(date +'%Y-%m-%d %H:%M:%S')"
          git push https://$GITHUB_TOKEN@github.com/sgl-project/sgl-project.github.io.git main
          cd ..
          rm -rf sgl-project.github.io
--- a/third_party/sglang/.github/workflows/release-pypi-gateway.yml
+++ b/third_party/sglang/.github/workflows/release-pypi-gateway.yml
@@ -0,0 +1,167 @@
 name: Release SGLang Model Gateway to PyPI
 on:
  push:
    branches:
      - main
    paths:
      - sgl-model-gateway/bindings/python/pyproject.toml
  workflow_dispatch:
 jobs:
  build:
    name: build on ${{ matrix.platform || matrix.os }} (${{ matrix.target }} - ${{ matrix.manylinux || 'auto' }})
    runs-on: ${{ matrix.os }}-latest
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu, macos, windows]
        target: [x86_64, aarch64]
        manylinux: [auto]
        include:
          - os: ubuntu
            platform: linux
          - os: windows
            ls: dir
            target: x86_64
            python-architecture: x64
            interpreter: 3.9 3.10 3.11 3.12 3.13
          - os: macos
            target: aarch64
            interpreter: 3.9 3.10 3.11 3.12 3.13
          - os: ubuntu
            platform: linux
            target: aarch64
          # musllinux
          - os: ubuntu
            platform: linux
            target: x86_64
            manylinux: musllinux_1_1
          - os: ubuntu
            platform: linux
            target: aarch64
            manylinux: musllinux_1_1
        exclude:
          - os: windows
            target: aarch64
    steps:
      - uses: actions/checkout@v4
        with:
          path: sglang-repo
      - name: Move sgl-model-gateway folder to root and delete sglang-repo
        run: |
          mv sglang-repo/sgl-model-gateway/* .
          rm -rf sglang-repo
          ls -alt
        shell: bash
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.13"
          architecture: ${{ matrix.python-architecture || 'x64' }}
      - name: Install twine
        run: pip install -U twine
      - name: Install protoc (macOS)
        if: matrix.os == 'macos'
        run: brew install protobuf
      - name: Install protoc (Windows)
        if: matrix.os == 'windows'
        run: choco install protoc -y
      - name: Build wheels
        uses: PyO3/maturin-action@v1
        with:
          working-directory: bindings/python
          target: ${{ matrix.target }}
          manylinux: ${{ matrix.manylinux || 'auto' }}
          args: --release --out dist --features vendored-openssl --interpreter ${{ matrix.interpreter || '3.9 3.10 3.11 3.12 3.13 3.14' }}
          rust-toolchain: stable
          docker-options: -e CI -e CC_aarch64_unknown_linux_gnu=aarch64-linux-gnu-gcc -e CXX_aarch64_unknown_linux_gnu=aarch64-linux-gnu-g++
          before-script-linux: |
            # Install build dependencies (perl/make for vendored OpenSSL, protoc for gRPC)
            if command -v yum &> /dev/null; then
              yum update -y && yum install -y wget unzip gcc gcc-c++ perl-core make
              # Install cross-compilation toolchain for aarch64 if needed
              if [ "${{ matrix.target }}" = "aarch64" ]; then
                yum install -y gcc-aarch64-linux-gnu gcc-c++-aarch64-linux-gnu || true
              fi
            elif command -v apt-get &> /dev/null; then
              apt-get update && apt-get install -y wget unzip gcc g++ perl make
              # Install cross-compilation toolchain for aarch64 if needed
              if [ "${{ matrix.target }}" = "aarch64" ]; then
                apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu || true
              fi
            fi
            (cd /tmp && \
             wget https://github.com/protocolbuffers/protobuf/releases/download/v32.0/protoc-32.0-linux-x86_64.zip && \
             unzip protoc-32.0-linux-x86_64.zip -d /usr/local && \
             rm protoc-32.0-linux-x86_64.zip)
            protoc --version
      - name: List built packages
        run: ${{ matrix.ls || 'ls -lh' }} bindings/python/dist/
      - name: Check packages
        run: twine check --strict bindings/python/dist/*
      - uses: actions/upload-artifact@v4
        with:
          name: packages-${{ matrix.os }}-${{ matrix.target }}-${{ matrix.manylinux || 'auto' }}
          path: bindings/python/dist/
  build-sdist:
    name: Build SDist
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          path: sglang-repo
      - name: Move sgl-model-gateway folder to root and delete sglang-repo
        run: |
          mv sglang-repo/sgl-model-gateway/* .
          rm -rf sglang-repo
          ls -alt
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.13"
      - name: Build SDist
        uses: PyO3/maturin-action@v1
        with:
          working-directory: bindings/python
          command: sdist
          args: --out dist
          rust-toolchain: stable
      - uses: actions/upload-artifact@v4
        with:
          name: sdist
          path: bindings/python/dist/*.tar.gz
  upload:
    name: Upload to PyPI
    if: github.repository == 'sgl-project/sglang'  # Ensure this job only runs for the sgl-project/sglang repository
    needs: [build, build-sdist]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          path: dist
          merge-multiple: true
      - name: Upload to PyPI
        env:
          TWINE_USERNAME: __token__
          TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN_ROUTER }}
        run: |
          pip install twine
          twine upload dist/* --verbose
--- a/third_party/sglang/.github/workflows/release-pypi-nightly.yml
+++ b/third_party/sglang/.github/workflows/release-pypi-nightly.yml
@@ -0,0 +1,169 @@
 name: Release PyPI Nightly Wheels
 on:
  # Run daily at 2 AM UTC
  schedule:
    - cron: '0 2 * * *'
  # Triggered by the nightly Docker workflow so both builds use the same commit
  repository_dispatch:
    types: [nightly-release]
  # Manual trigger for testing
  workflow_dispatch:
    inputs:
      commit_sha:
        description: 'Specific commit SHA to build (leave empty for latest)'
        required: false
        type: string
      cuda_version:
        description: 'CUDA version (e.g., 129 or 130)'
        required: false
        default: '129'
        type: string
 concurrency:
  group: release-pypi-nightly-${{ github.ref }}
  cancel-in-progress: true
 jobs:
  build-nightly-wheel:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-latest
    outputs:
      nightly_version: ${{ steps.build.outputs.nightly_version }}
      commit_hash: ${{ steps.build.outputs.commit_hash }}
      build_date: ${{ steps.build.outputs.build_date }}
    steps:
      - uses: actions/checkout@v4
        with:
          # Use commit from: 1) Docker workflow, 2) manual input, 3) latest main
          ref: ${{ github.event.client_payload.commit_sha || inputs.commit_sha || github.sha }}
          fetch-depth: 0  # Need full history for setuptools-scm
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install build dependencies
        run: |
          pip install build wheel setuptools setuptools-scm
      - name: Build wheel
        id: build
        run: |
          cd python
          cp ../README.md ../LICENSE .
          # Parse git describe output to get latest tag
          # Use same command as pyproject.toml to ensure version consistency
          DESC=$(git tag --list --sort=-version:refname 'v*.*.*' | head -1 | xargs git describe --tags --long 2>/dev/null || echo 'v0.0.0-0-g0000000')
          TAG=$(echo "$DESC" | cut -d- -f1)
          HASH="g$(git rev-parse --short HEAD)"
          BUILD_DATE=$(date -u +%Y%m%d)
          # Increment patch version for nightlies (e.g., v0.5.9 -> 0.5.10)
          # Must always increment so nightly > latest tag per PEP 440 ordering:
          #   X.Y.Z.devN < X.Y.Z.rcN < X.Y.Z < X.Y.(Z+1).devN
          VERSION=${TAG#v}  # Remove 'v' prefix
          MAJOR=$(echo "$VERSION" | cut -d. -f1)
          MINOR=$(echo "$VERSION" | cut -d. -f2)
          PATCH_RAW=$(echo "$VERSION" | cut -d. -f3)
          # Strip pre-/post-release suffixes (rc0, post1, etc.) to get numeric patch
          PATCH=$(echo "$PATCH_RAW" | sed 's/[^0-9].*//')
          NEXT_PATCH=$((PATCH + 1))
          NEXT_VERSION="${MAJOR}.${MINOR}.${NEXT_PATCH}"
          # Use date-based dev number for correct chronological sorting
          # e.g., 0.5.10.dev20260215+g4cf4f0859 > 0.5.10.dev20260214+g45a4697d4
          FORCE_VERSION="${NEXT_VERSION}.dev${BUILD_DATE}+${HASH}"
          echo "Forcing nightly version to: $FORCE_VERSION"
          export SETUPTOOLS_SCM_PRETEND_VERSION="$FORCE_VERSION"
          # Build wheel
          python3 -m build --wheel
          # Extract version from built wheel filename
          WHEEL_FILE=$(ls dist/*.whl)
          NIGHTLY_VERSION=$(echo "$WHEEL_FILE" | sed 's/.*sglang-\(.*\)-py3.*/\1/')
          # Get commit info
          COMMIT_HASH=$(git rev-parse --short HEAD)
          BUILD_DATE=$(date -u +%Y-%m-%d)
          echo "Built wheel: $WHEEL_FILE"
          echo "Nightly version: ${NIGHTLY_VERSION}"
          echo "Commit: ${COMMIT_HASH}"
          echo "Build date: ${BUILD_DATE}"
          echo "nightly_version=${NIGHTLY_VERSION}" >> "$GITHUB_OUTPUT"
          echo "commit_hash=${COMMIT_HASH}" >> "$GITHUB_OUTPUT"
          echo "build_date=${BUILD_DATE}" >> "$GITHUB_OUTPUT"
      - name: Upload wheel artifact
        uses: actions/upload-artifact@v4
        with:
          name: nightly-wheel
          path: python/dist/*.whl
          retention-days: 7
  release-nightly:
    needs: build-nightly-wheel
    runs-on: ubuntu-latest
    environment: 'prod'
    steps:
      - uses: actions/checkout@v4
      - name: Download wheel artifact
        uses: actions/download-artifact@v4
        with:
          name: nightly-wheel
          path: dist/
      - name: List downloaded wheels
        run: |
          echo "Downloaded wheel:"
          ls -lh dist/
      - name: Create GitHub Release for nightly wheel
        uses: softprops/action-gh-release@v2
        with:
          tag_name: nightly-${{ needs.build-nightly-wheel.outputs.build_date }}-${{ needs.build-nightly-wheel.outputs.commit_hash }}
          name: Nightly Build ${{ needs.build-nightly-wheel.outputs.build_date }} (${{ needs.build-nightly-wheel.outputs.commit_hash }})
          repository: sgl-project/whl
          token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
          prerelease: true
          body: |
            Nightly build from commit ${{ needs.build-nightly-wheel.outputs.commit_hash }}
            Build date: ${{ needs.build-nightly-wheel.outputs.build_date }}
            Version: ${{ needs.build-nightly-wheel.outputs.nightly_version }}
          files: |
            dist/*.whl
      - name: Clone wheel index repository
        run: |
          git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
          cd sgl-whl
          git config --local user.name "sglang-bot"
          git config --local user.email "sglangbot@gmail.com"
        env:
          WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Update wheel index
        run: |
          python3 scripts/update_nightly_whl_index.py \
            --commit-hash ${{ needs.build-nightly-wheel.outputs.commit_hash }} \
            --nightly-version ${{ needs.build-nightly-wheel.outputs.nightly_version }} \
            --cuda-version ${{ inputs.cuda_version || '129' }} \
            --build-date ${{ needs.build-nightly-wheel.outputs.build_date }}
      - name: Push wheel index
        run: |
          cd sgl-whl
          git add -A
          git diff --staged --quiet || git commit -m "Update nightly wheel index for commit ${{ needs.build-nightly-wheel.outputs.commit_hash }}"
          git push
--- a/third_party/sglang/.github/workflows/release-pypi-pr.yml
+++ b/third_party/sglang/.github/workflows/release-pypi-pr.yml
@@ -0,0 +1,183 @@
 name: Release PyPI PR Wheels
 on:
  workflow_dispatch:
    inputs:
      pr_number:
        description: 'PR number to build wheel for (works with both internal and fork PRs)'
        required: true
        type: string
 concurrency:
  group: build-pr-wheel-${{ github.event.inputs.pr_number }}
  cancel-in-progress: true
 jobs:
  build-pr-wheel:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-latest
    outputs:
      wheel_version: ${{ steps.gen_version.outputs.wheel_version }}
      commit_hash: ${{ steps.gen_version.outputs.commit_hash }}
      build_date: ${{ steps.gen_version.outputs.build_date }}
    steps:
      - uses: actions/checkout@v4
        with:
          ref: refs/pull/${{ inputs.pr_number }}/head
          fetch-depth: 0  # Need full history for version generation
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Generate PR wheel version
        id: gen_version
        run: |
          # Get base version from the latest v*.*.* git tag directly
          # Note: We cannot use setuptools_scm here because the [tool.setuptools_scm]
          # config (with custom git_describe_command) lives in python/pyproject.toml,
          # not at the repo root. Without that config, setuptools_scm falls back to
          # default git describe which finds gateway-* tags instead of v*.*.* release tags.
          LATEST_TAG=$(git tag --list --sort=-version:refname 'v*.*.*' | head -1)
          BASE_VERSION=${LATEST_TAG#v}
          echo "Latest release tag: ${LATEST_TAG}"
          # Get commit info
          COMMIT_HASH=$(git rev-parse --short HEAD)
          COMMIT_COUNT=$(git rev-list --count HEAD)
          # Get current date in YYYY-MM-DD format
          BUILD_DATE=$(date -u +%Y-%m-%d)
          # Always use pr-{number} format for suffix
          SUFFIX="pr-${{ inputs.pr_number }}"
          # Generate a PEP 440 version: a .devN development release plus a local version label
          # Format: {base_version}.dev{commit_count}+pr-{number}.g{commit_hash} (PEP 440 normalizes '-' in the local label to '.')
          WHEEL_VERSION="${BASE_VERSION}.dev${COMMIT_COUNT}+${SUFFIX}.g${COMMIT_HASH}"
          echo "Base version: ${BASE_VERSION}"
          echo "PR wheel version: ${WHEEL_VERSION}"
          echo "Commit: ${COMMIT_HASH}"
          echo "Build date: ${BUILD_DATE}"
          echo "wheel_version=${WHEEL_VERSION}" >> $GITHUB_OUTPUT
          echo "commit_hash=${COMMIT_HASH}" >> $GITHUB_OUTPUT
          echo "base_version=${BASE_VERSION}" >> $GITHUB_OUTPUT
          echo "build_date=${BUILD_DATE}" >> $GITHUB_OUTPUT
      - name: Update pyproject.toml with PR wheel version
        run: |
          cd python
          WHEEL_VERSION="${{ steps.gen_version.outputs.wheel_version }}"
          # Update pyproject.toml to use static version instead of dynamic
          # Remove 'version' from dynamic list and add static version
          sed -i 's/dynamic = \["version"\]/dynamic = []/' pyproject.toml
          sed -i "/^name = \"sglang\"/a version = \"${WHEEL_VERSION}\"" pyproject.toml
          # Verify update
          echo "Updated version in pyproject.toml:"
          grep "^version" pyproject.toml
          grep "^dynamic" pyproject.toml
      - name: Install build dependencies
        run: |
          cd python
          pip install build wheel setuptools
      - name: Build wheel
        run: |
          cd python
          cp ../README.md ../LICENSE .
          python3 -m build --wheel
          # List built wheels
          echo "Built wheel:"
          ls -lh dist/
      - name: Upload wheel artifact
        uses: actions/upload-artifact@v4
        with:
          name: pr-wheel-${{ inputs.pr_number }}
          path: python/dist/*.whl
          retention-days: 30
  release-pr-wheel:
    needs: build-pr-wheel
    runs-on: ubuntu-latest
    environment: 'prod'
    steps:
      - uses: actions/checkout@v4
      - name: Download wheel artifact
        uses: actions/download-artifact@v4
        with:
          name: pr-wheel-${{ inputs.pr_number }}
          path: dist/
      - name: List downloaded wheels
        run: |
          echo "Downloaded wheel:"
          ls -lh dist/
      - name: Create GitHub Release for PR wheel
        uses: softprops/action-gh-release@v2
        with:
          tag_name: pr-${{ inputs.pr_number }}-${{ needs.build-pr-wheel.outputs.build_date }}-${{ needs.build-pr-wheel.outputs.commit_hash }}
          name: "PR #${{ inputs.pr_number }} Build (${{ needs.build-pr-wheel.outputs.commit_hash }})"
          repository: sgl-project/whl
          token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
          prerelease: true
          body: |
            PR wheel build from PR #${{ inputs.pr_number }}
            Commit: ${{ needs.build-pr-wheel.outputs.commit_hash }}
            Build date: ${{ needs.build-pr-wheel.outputs.build_date }}
            Version: ${{ needs.build-pr-wheel.outputs.wheel_version }}
            **Installation via index (pip):**
            ```bash
            pip install sglang==${{ needs.build-pr-wheel.outputs.wheel_version }} --index-url https://sgl-project.github.io/whl/pr/
            ```
            **Installation via index (uv):**
            ```bash
            uv pip install sglang==${{ needs.build-pr-wheel.outputs.wheel_version }} --index-url https://sgl-project.github.io/whl/pr/ --extra-index-url https://pypi.org/simple --index-strategy unsafe-best-match
            ```
            **Direct installation:**
            ```bash
            pip install https://github.com/sgl-project/whl/releases/download/pr-${{ inputs.pr_number }}-${{ needs.build-pr-wheel.outputs.build_date }}-${{ needs.build-pr-wheel.outputs.commit_hash }}/sglang-${{ needs.build-pr-wheel.outputs.wheel_version }}-py3-none-any.whl
            ```
          files: |
            dist/*.whl
      - name: Clone wheel index repository
        run: |
          git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
          cd sgl-whl
          git config --local user.name "sglang-bot"
          git config --local user.email "sglangbot@gmail.com"
        env:
          WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Update wheel index
        run: |
          python3 scripts/update_pr_whl_index.py \
            --pr-number ${{ inputs.pr_number }} \
            --commit-hash ${{ needs.build-pr-wheel.outputs.commit_hash }} \
            --wheel-version ${{ needs.build-pr-wheel.outputs.wheel_version }} \
            --build-date ${{ needs.build-pr-wheel.outputs.build_date }}
      - name: Push wheel index
        run: |
          cd sgl-whl
          git add -A
          git diff --staged --quiet || git commit -m "Update PR wheel index for PR #${{ inputs.pr_number }} (commit ${{ needs.build-pr-wheel.outputs.commit_hash }})"
          git push
--- a/third_party/sglang/.github/workflows/release-pypi.yml
+++ b/third_party/sglang/.github/workflows/release-pypi.yml
@@ -0,0 +1,31 @@
 name: Release PyPI
 on:
  push:
    tags:
      - 'v[0-9]+.*'
  workflow_dispatch:
 jobs:
  publish:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-latest
    environment: "prod"
    steps:
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Required for setuptools-scm to determine version from tags
      - name: Upload to pypi
        run: |
          cd python
          cp ../README.md ../LICENSE .
          pip install build wheel setuptools setuptools-scm
          python3 -m build
          pip install twine
          python3 -m twine upload dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }}
--- a/third_party/sglang/.github/workflows/release-tag.yml
+++ b/third_party/sglang/.github/workflows/release-tag.yml
@@ -0,0 +1,68 @@
 name: Release Tag
 # Creates a git tag to trigger release workflows (PyPI, Docker)
 # Use this after testing on a release branch is complete
 on:
  workflow_dispatch:
    inputs:
      version:
        description: 'Version to tag (without v prefix, e.g., 0.5.7)'
        required: true
        type: string
      ref:
        description: 'Branch or commit to tag (e.g., release/v0.5.7, main, or commit SHA)'
        required: false
        default: 'main'
        type: string
 permissions:
  contents: write
 jobs:
  create-tag:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-latest
    environment: 'prod'
    steps:
      - name: Validate version format
        run: |
          VERSION="${{ github.event.inputs.version }}"
          if [ -z "$VERSION" ]; then
            echo "::error::Version is required"
            exit 1
          fi
          if ! echo "$VERSION" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+'; then
            echo "::error::Invalid version format: $VERSION (expected: X.Y.Z, optionally followed by a suffix such as .postN)"
            exit 1
          fi
          echo "Version validated: v$VERSION"
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.inputs.ref }}
          fetch-depth: 0
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Check if tag already exists
        run: |
          TAG="v${{ github.event.inputs.version }}"
          if git rev-parse "$TAG" >/dev/null 2>&1; then
            echo "::error::Tag $TAG already exists"
            exit 1
          fi
          echo "Tag $TAG does not exist, proceeding..."
      - name: Create and push tag
        run: |
          TAG="v${{ github.event.inputs.version }}"
          REF="${{ github.event.inputs.ref }}"
          git config user.name "sglang-bot"
          git config user.email "sglang-bot@users.noreply.github.com"
          echo "Creating tag $TAG on ref $REF (commit: $(git rev-parse HEAD))"
          git tag -a "$TAG" -m "Release $TAG"
          git push origin "$TAG"
          echo "::notice::Successfully created and pushed tag $TAG"
          echo "This will trigger the release workflows (PyPI, Docker)"
--- a/third_party/sglang/.github/workflows/release-whl-kernel.yml
+++ b/third_party/sglang/.github/workflows/release-whl-kernel.yml
@@ -0,0 +1,440 @@
 name: Release SGLang Kernels
 on:
  push:
    branches:
      - main
    paths:
      - sgl-kernel/python/sgl_kernel/version.py
  workflow_dispatch:
    inputs:
      target:
        type: choice
        description: 'Build target'
        required: false
        default: 'all'
        options:
          - 'all'
          - 'cu129'
          - 'cu130'
          - 'rocm700'
          - 'rocm720'
          - 'musa43'
      tag_name:
        description: "Version number, must be in the form of vX.Y.Z (e.g. v0.4.0)"
        type: string
        required: false
      pr_number:
        description: "PR number to build from (e.g. 12345)"
        type: string
        required: false
 concurrency:
  group: release-sglang-kernels-${{ github.ref }}
  cancel-in-progress: true
 jobs:
  build-cu129-matrix:
    if: |
      github.repository == 'sgl-project/sglang' &&
      (github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'cu129')
    strategy:
      matrix:
        python-version: ["3.10"]
        cuda-version: ["12.9"]
        arch: [x86_64, aarch64]
        include:
          - arch: x86_64
            runner: x64-kernel-build-node
          - arch: aarch64
            runner: arm-kernel-build-node
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"
          ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Build wheels
        run: |
          cd sgl-kernel
          chmod +x ./build.sh
          ./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" ${{ matrix.arch == 'aarch64' && 'aarch64' || '' }}
        env:
          BUILD_JOBS: 64
          NVCC_THREADS: 8
      - name: Upload to PyPI
        working-directory: sgl-kernel
        run: |
          pip install twine
          python3 -m twine upload --skip-existing dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN_SGLANG_KERNEL }}
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: wheel-python${{ matrix.python-version }}-cuda${{ matrix.cuda-version }}${{ matrix.arch == 'aarch64' && '-aarch64' || '' }}
          path: sgl-kernel/dist/*
  release-cu129:
    needs: build-cu129-matrix
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-*
      - name: Set tag name
        id: set_tag_name
        run: |
          if [ -z "${{ inputs.tag_name }}" ]; then
            TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)"
            echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT
          else
            echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT
          fi
      - name: Release
        uses: softprops/action-gh-release@v2
        with:
          tag_name: ${{ steps.set_tag_name.outputs.tag_name }}
          repository: sgl-project/whl
          token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
          files: |
            sgl-kernel/dist/*
      - name: Clone wheel index
        run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
        env:
          WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
      - name: Update wheel index
        run: python3 scripts/update_kernel_whl_index.py --cuda 129
      - name: Push wheel index
        run: |
          cd sgl-whl
          git config --local user.name "sglang-bot"
          git config --local user.email "sglangbot@gmail.com"
          git add -A
          git diff --staged --quiet || git commit -m "update whl index"
          git push
  # For now, CUDA 13.0 wheels are not published to PyPI
  build-cu130-matrix:
    if: |
      github.repository == 'sgl-project/sglang' &&
      (github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'cu130')
    strategy:
      matrix:
        python-version: ["3.10"]
        cuda-version: ["13.0"]
        arch: [x86_64, aarch64]
        include:
          - arch: x86_64
            runner: x64-kernel-build-node
          - arch: aarch64
            runner: arm-kernel-build-node
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"
          ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Build wheels
        run: |
          cd sgl-kernel
          chmod +x ./build.sh
          ./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" ${{ matrix.arch == 'aarch64' && 'aarch64' || '' }}
        env:
          BUILD_JOBS: 64
          NVCC_THREADS: 8
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: wheel-python${{ matrix.python-version }}-cuda${{ matrix.cuda-version }}${{ matrix.arch == 'aarch64' && '-aarch64' || '' }}
          path: sgl-kernel/dist/*
  release-cu130:
    needs: build-cu130-matrix
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-*
      - name: Set tag name
        id: set_tag_name
        run: |
          if [ -z "${{ inputs.tag_name }}" ]; then
            TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)"
            echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT
          else
            echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT
          fi
      - name: Release
        uses: softprops/action-gh-release@v2
        with:
          tag_name: ${{ steps.set_tag_name.outputs.tag_name }}
          repository: sgl-project/whl
          token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
          files: |
            sgl-kernel/dist/*
      - name: Clone wheel index
        run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
        env:
          WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
      - name: Update wheel index
        run: python3 scripts/update_kernel_whl_index.py --cuda 130
      - name: Push wheel index
        run: |
          cd sgl-whl
          git config --local user.name "sglang-bot"
          git config --local user.email "sglangbot@gmail.com"
          git add -A
          git diff --staged --quiet || git commit -m "update whl index"
          git push
  build-rocm-matrix:
    if: |
      github.repository == 'sgl-project/sglang' &&
      (github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'rocm700' || github.event.inputs.target == 'rocm720')
    runs-on: amd-docker-scale
    strategy:
      matrix:
        python-version: ["3.10"]
        rocm-version: ["700", "720"]
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"
          ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Build wheels
        run: |
          cp 3rdparty/amd/wheel/sgl-kernel/* sgl-kernel/
          cd sgl-kernel
          chmod +x ./build_rocm.sh
          ./build_rocm.sh "${{ matrix.rocm-version }}"
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: wheel-python${{ matrix.python-version }}-rocm${{ matrix.rocm-version }}
          path: sgl-kernel/dist/*
  release-rocm700:
    needs: build-rocm-matrix
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-*-rocm700
      - name: Set tag name
        id: set_tag_name
        run: |
          if [ -z "${{ inputs.tag_name }}" ]; then
            TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)"
            echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT
          else
            echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT
          fi
      - name: Release
        uses: softprops/action-gh-release@v2
        with:
          tag_name: ${{ steps.set_tag_name.outputs.tag_name }}
          repository: sgl-project/whl
          token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
          files: |
            sgl-kernel/dist/*
      - name: Clone wheel index
        run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
        env:
          WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
      - name: Update wheel index
        run: python3 scripts/update_kernel_whl_index.py --rocm 700
      - name: Push wheel index
        run: |
          cd sgl-whl
          git config --local user.name "sglang-bot"
          git config --local user.email "sglangbot@gmail.com"
          git add -A
          git diff --staged --quiet || git commit -m "update whl index"
          git push
  release-rocm720:
    needs: build-rocm-matrix
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-*-rocm720
      - name: Set tag name
        id: set_tag_name
        run: |
          if [ -z "${{ inputs.tag_name }}" ]; then
            TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)"
            echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT
          else
            echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT
          fi
      - name: Release
        uses: softprops/action-gh-release@v2
        with:
          tag_name: ${{ steps.set_tag_name.outputs.tag_name }}
          repository: sgl-project/whl
          token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
          files: |
            sgl-kernel/dist/*
      - name: Clone wheel index
        run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
        env:
          WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
      - name: Update wheel index
        run: python3 scripts/update_kernel_whl_index.py --rocm 720
      - name: Push wheel index
        run: |
          cd sgl-whl
          git config --local user.name "sglang-bot"
          git config --local user.email "sglangbot@gmail.com"
          git add -A
          git diff --staged --quiet || git commit -m "update whl index"
          git push
  build-musa43:
    if: |
      github.repository == 'sgl-project/sglang' &&
      (github.event_name == 'push' || github.event.inputs.target == 'all' || github.event.inputs.target == 'musa43')
    runs-on: kernel-build-node-musa
    strategy:
      matrix:
        python-version: ["3.10"]
        musa-version: ["43"]
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: "recursive"
      - name: Build wheels
        run: |
          cd sgl-kernel
          mv pyproject_musa.toml pyproject.toml
          python setup_musa.py sdist bdist_wheel
      - name: Rename MUSA wheels
        run: |
          bash scripts/ci/musa/rename_wheels_musa.sh ${{ matrix.musa-version }} sgl-kernel/dist
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: wheel-python${{ matrix.python-version }}-musa${{ matrix.musa-version }}
          path: sgl-kernel/dist/*
  release-musa43:
    needs: build-musa43
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          path: sgl-kernel/dist/
          merge-multiple: true
          pattern: wheel-*
      - name: Set tag name
        id: set_tag_name
        run: |
          if [ -z "${{ inputs.tag_name }}" ]; then
            TAG_NAME="v$(cat sgl-kernel/python/sgl_kernel/version.py | cut -d'"' -f2)"
            echo "tag_name=$TAG_NAME" >> $GITHUB_OUTPUT
          else
            echo "tag_name=${{ inputs.tag_name }}" >> $GITHUB_OUTPUT
          fi
      - name: Release
        uses: softprops/action-gh-release@v2
        with:
          tag_name: ${{ steps.set_tag_name.outputs.tag_name }}
          repository: sgl-project/whl
          token: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
          files: |
            sgl-kernel/dist/*
      - name: Clone wheel index
        run: git clone https://oauth2:${WHL_TOKEN}@github.com/sgl-project/whl.git sgl-whl
        env:
          WHL_TOKEN: ${{ secrets.GH_PAT_FOR_WHL_RELEASE }}
      - name: Update wheel index
        run: python3 scripts/update_kernel_whl_index.py --musa 43
      - name: Push wheel index
        run: |
          cd sgl-whl
          git config --local user.name "sglang-bot"
          git config --local user.email "sglangbot@gmail.com"
          git add -A
          git diff --staged --quiet || git commit -m "update whl index"
          git push
--- a/third_party/sglang/.github/workflows/rerun-test.yml
+++ b/third_party/sglang/.github/workflows/rerun-test.yml
@@ -0,0 +1,136 @@
 name: Rerun Test
 run-name: ${{ inputs.pr_head_sha && format('[rerun-test] {0} {1}', inputs.test_command, inputs.pr_head_sha) || format('[rerun-test] {0}', inputs.test_command) }}
 on:
  workflow_dispatch:
    inputs:
      test_command:
        description: "Test command(s) to run, one per line (e.g. 'registered/core/test_srt_endpoint.py TestSRTEndpoint.test_simple_decode')"
        required: true
        type: string
      runner_label:
        description: "Runner label"
        required: true
        type: choice
        options:
          - 1-gpu-h100
          - 1-gpu-5090
          - 2-gpu-h100
          - 4-gpu-h100
          - 4-gpu-a10
          - 4-gpu-b200
          - 8-gpu-h200
          - 8-gpu-h20
          - 8-gpu-b200
          - ubuntu-latest
      pr_head_sha:
        description: "PR head SHA to checkout (for /rerun-test on fork PRs)"
        required: false
        type: string
        default: ""
      use_deepep:
        description: "Use ci_install_deepep.sh instead of ci_install_dependency.sh"
        required: false
        type: string
        default: "false"
      is_cpu:
        description: "Run as CPU-only test (uses ubuntu-latest with uv pip install)"
        required: false
        type: string
        default: "false"
 env:
  SGLANG_IS_IN_CI: true
  SGLANG_CUDA_COREDUMP: "1"
  SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true
 permissions:
  actions: write
  contents: read
  issues: read
 jobs:
  rerun-test-cuda:
    if: inputs.is_cpu != 'true'
    runs-on: ${{ inputs.runner_label }}
    timeout-minutes: 120
    env:
      RUNNER_LABELS: ${{ inputs.runner_label }}
      SGLANG_CI_RDMA_ALL_DEVICES: ${{ inputs.runner_label == '8-gpu-h20' && 'mlx5_1,mlx5_2,mlx5_3,mlx5_4' || '' }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || github.sha }}
      - uses: ./.github/actions/check-maintenance
      - name: Install dependencies
        timeout-minutes: 20
        run: |
          if [[ "${{ inputs.runner_label }}" == "1-gpu-5090" ]]; then
            source /etc/profile.d/sglang-ci.sh
          fi
          if [[ "${{ inputs.use_deepep }}" == "true" ]]; then
            bash scripts/ci/cuda/ci_install_deepep.sh
          else
            bash scripts/ci/cuda/ci_install_dependency.sh
          fi
      - name: Run test
        timeout-minutes: 60
        run: |
          if [[ "${{ inputs.runner_label }}" == "1-gpu-5090" ]]; then
            source /etc/profile.d/sglang-ci.sh
          fi
          cd test/
          echo "${{ inputs.test_command }}" | while IFS= read -r cmd; do
            [ -z "$cmd" ] && continue
            echo ">>> Running: python3 $cmd"
            python3 $cmd || exit 1
          done
      - uses: ./.github/actions/upload-cuda-coredumps
        if: always()
  rerun-test-cpu:
    if: inputs.is_cpu == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 120
    steps:
      - name: Free disk space
        run: |
          sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc
          df -h
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.pr_head_sha || github.sha }}
      - uses: ./.github/actions/check-maintenance
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install uv
        uses: astral-sh/setup-uv@v5
      - name: Install dependencies
        timeout-minutes: 20
        env:
          UV_SYSTEM_PYTHON: "1"
        run: |
          uv pip install -e "python[dev]" --index-strategy unsafe-best-match --prerelease allow
      - name: Run test
        timeout-minutes: 60
        run: |
          cd test/
          echo "${{ inputs.test_command }}" | while IFS= read -r cmd; do
            [ -z "$cmd" ] && continue
            echo ">>> Running: python3 $cmd"
            python3 $cmd || exit 1
          done
--- a/third_party/sglang/.github/workflows/retag-docker.yml
+++ b/third_party/sglang/.github/workflows/retag-docker.yml
@@ -0,0 +1,30 @@
 name: Retag Docker Image
 on:
  workflow_dispatch:
    inputs:
      source_tag:
        description: "Existing image tag (e.g., v0.4.7-cu129-amd64)"
        required: true
      target_tag:
        description: "New tag to apply (e.g., latest)"
        required: true
 jobs:
  retag:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-22.04
    environment: "prod"
    steps:
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Retag image
        run: |
          echo "Retagging lmsysorg/sglang:${{ inputs.source_tag }} -> lmsysorg/sglang:${{ inputs.target_tag }}"
          docker buildx imagetools create \
            -t lmsysorg/sglang:${{ inputs.target_tag }} \
            lmsysorg/sglang:${{ inputs.source_tag }}
--- a/third_party/sglang/.github/workflows/runner-utilization.yml
+++ b/third_party/sglang/.github/workflows/runner-utilization.yml
@@ -0,0 +1,43 @@
 name: Runner Utilization Report
 on:
  schedule:
    - cron: '0 8 * * *'  # Daily at 8 AM UTC
  pull_request:
    paths:
      - '.github/workflows/runner-utilization.yml'
      - 'scripts/ci/utils/runner_utilization_report.py'
  workflow_dispatch:
    inputs:
      hours:
        description: 'Time window in hours'
        required: false
        default: '24'
        type: string
      filter:
        description: 'Filter runner labels (e.g., 5090, h200)'
        required: false
        type: string
 jobs:
  report:
    name: Generate Report
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Generate Utilization Report
        timeout-minutes: 30
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          python scripts/ci/utils/runner_utilization_report.py \
            --repo ${{ github.repository }} \
            --hours ${{ inputs.hours || '24' }} \
            ${{ inputs.filter && format('--filter {0}', inputs.filter) || '' }}
--- a/third_party/sglang/.github/workflows/slash-command-handler.yml
+++ b/third_party/sglang/.github/workflows/slash-command-handler.yml
@@ -0,0 +1,99 @@
 name: Slash Command Handler
 on:
  issue_comment:
    types: [created, edited]
 permissions:
  contents: read
  pull-requests: write # Required to add labels and reactions
  actions: write       # Required to rerun workflows
  issues: write        # Required for comment reactions in some contexts
 jobs:
  slash_command:
    # Only run if it is a PR and the comment contains a recognized command
    # Use contains() since startsWith() can't handle leading whitespace/newlines
    if: >
      github.event.issue.pull_request &&
      (contains(github.event.comment.body, '/tag-run-ci-label') ||
       contains(github.event.comment.body, '/rerun-failed-ci') ||
       contains(github.event.comment.body, '/tag-and-rerun-ci') ||
       contains(github.event.comment.body, '/rerun-stage') ||
       contains(github.event.comment.body, '/rerun-test'))
    runs-on: ubuntu-latest
    steps:
      # SECURITY: This workflow runs on issue_comment trigger with elevated permissions
      # (pull-requests: write, actions: write). For non-fork PRs, we can safely checkout
      # the PR branch to allow testing changes to this handler. For fork PRs, we MUST
      # stay on main to prevent untrusted code execution with these elevated permissions.
      - name: Get PR details
        id: pr
        shell: bash
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          PR_DATA=$(gh pr view ${{ github.event.issue.number }} --repo ${{ github.repository }} --json headRefName,headRepositoryOwner) || {
            echo "::error::Failed to fetch PR data"
            exit 1
          }
          # Use 'empty' filter to handle null/missing values (e.g., deleted forks)
          HEAD_OWNER=$(echo "$PR_DATA" | jq -r '.headRepositoryOwner.login // empty')
          REPO_OWNER="${{ github.repository_owner }}"
          # Treat missing/null owner as fork for security (fail-safe)
          if [[ -z "$HEAD_OWNER" || "$HEAD_OWNER" != "$REPO_OWNER" ]]; then
            IS_FORK="true"
          else
            IS_FORK="false"
          fi
          echo "is_fork=$IS_FORK" >> $GITHUB_OUTPUT
          echo "ref=$(echo "$PR_DATA" | jq -r '.headRefName')" >> $GITHUB_OUTPUT
          echo "pr_ref=refs/pull/${{ github.event.issue.number }}/head" >> $GITHUB_OUTPUT
          echo "PR owner: $HEAD_OWNER, Repo owner: $REPO_OWNER, Is fork: $IS_FORK"
      - name: Check commenter permission for fork PRs
        id: perm
        if: steps.pr.outputs.is_fork == 'true'
        shell: bash
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          PERM=$(gh api repos/${{ github.repository }}/collaborators/${{ github.event.comment.user.login }}/permission --jq '.permission') || {
            PERM="none"
            echo "::warning::Failed to check commenter permission, defaulting to none"
          }
          if [[ "$PERM" == "admin" || "$PERM" == "maintain" || "$PERM" == "write" ]]; then
            echo "safe_to_checkout_pr=true" >> $GITHUB_OUTPUT
          else
            echo "safe_to_checkout_pr=false" >> $GITHUB_OUTPUT
          fi
          echo "Commenter ${{ github.event.comment.user.login }} permission: $PERM"
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          # For non-fork PRs: checkout PR branch by name
          # For fork PRs with trusted commenter: checkout via refs/pull/N/head
          # For fork PRs with untrusted commenter: stay on main for security
          ref: ${{ steps.pr.outputs.is_fork == 'false' && steps.pr.outputs.ref || (steps.perm.outputs.safe_to_checkout_pr == 'true' && steps.pr.outputs.pr_ref || '') }}
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install PyGithub
      - name: Handle Slash Command
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO_FULL_NAME: ${{ github.repository }}
          PR_NUMBER: ${{ github.event.issue.number }}
          COMMENT_ID: ${{ github.event.comment.id }}
          COMMENT_BODY: ${{ github.event.comment.body }}
          USER_LOGIN: ${{ github.event.comment.user.login }}
        run: |
          python scripts/ci/utils/slash_command_handler.py
--- a/third_party/sglang/.github/workflows/stress-test.yml
+++ b/third_party/sglang/.github/workflows/stress-test.yml
@@ -0,0 +1,44 @@
 name: Stress Test
 on:
  workflow_dispatch:
    inputs:
      num_prompts:
        description: 'Number of prompts per model'
        required: true
        default: '50000'
        type: string
      duration_minutes:
        description: 'Timeout per model in minutes'
        required: true
        default: '45'
        type: string
 jobs:
  stress-test:
    if: github.repository == 'sgl-project/sglang'
    runs-on: 8-gpu-h200
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run stress tests
        timeout-minutes: 210
        env:
          NUM_PROMPTS: ${{ inputs.num_prompts }}
          DURATION_MINUTES: ${{ inputs.duration_minutes }}
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite stress
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: stress-test-results
          path: |
            stress_test_*.jsonl
--- a/third_party/sglang/.github/workflows/trivy-scan-dev.yml
+++ b/third_party/sglang/.github/workflows/trivy-scan-dev.yml
@@ -0,0 +1,85 @@
 name: Trivy Scan Dev Docker Images
 on:
  # Run daily after nightly dev builds (which run at midnight UTC)
  schedule:
    - cron: "0 6 * * *"
  workflow_dispatch:
    inputs:
      tag:
        description: "Image tag to scan (e.g., dev, dev-cu13, latest)"
        required: false
        default: ""
 jobs:
  scan:
    if: github.repository == 'sgl-project/sglang'
    runs-on: x64-docker-build-node
    timeout-minutes: 45
    permissions:
      contents: read
      security-events: write
    strategy:
      fail-fast: false
      matrix:
        tag: ${{ inputs.tag && fromJSON(format('["{0}"]', inputs.tag)) || fromJSON('["dev", "dev-cu13"]') }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@v0.35.0
        with:
          image-ref: 'docker.io/lmsysorg/sglang:${{ matrix.tag }}'
          scanners: 'vuln'
          format: 'sarif'
          output: 'trivy-results-${{ matrix.tag }}.sarif'
          severity: 'CRITICAL,HIGH'
          ignore-unfixed: true
          skip-dirs: 'usr/local/go,opt/nvidia'
      - name: Upload Trivy scan results to GitHub Security
        uses: github/codeql-action/upload-sarif@v4
        if: always() && hashFiles(format('trivy-results-{0}.sarif', matrix.tag)) != ''
        with:
          sarif_file: 'trivy-results-${{ matrix.tag }}.sarif'
          category: 'trivy-${{ matrix.tag }}'
      - name: Run Trivy (table output for logs)
        if: success()
        uses: aquasecurity/trivy-action@v0.35.0
        with:
          image-ref: 'docker.io/lmsysorg/sglang:${{ matrix.tag }}'
          scanners: 'vuln'
          format: 'table'
          severity: 'CRITICAL,HIGH'
          ignore-unfixed: true
          skip-dirs: 'usr/local/go,opt/nvidia'
      - name: Scan summary
        if: always()
        run: |
          IMAGE="docker.io/lmsysorg/sglang:${{ matrix.tag }}"
          SARIF="trivy-results-${{ matrix.tag }}.sarif"
          echo "## Trivy Scan: \`${{ matrix.tag }}\`" >> "$GITHUB_STEP_SUMMARY"
          if [ ! -f "${SARIF}" ]; then
            echo "**Status:** Scan failed — no SARIF output produced" >> "$GITHUB_STEP_SUMMARY"
            exit 0
          fi
          VULN_COUNT=$(python3 -c "
          import json
          data = json.load(open('${SARIF}'))
          print(sum(len(run.get('results', [])) for run in data.get('runs', [])))
          ")
          echo "- **Image**: \`${IMAGE}\`" >> "$GITHUB_STEP_SUMMARY"
          echo "- **Findings**: ${VULN_COUNT}" >> "$GITHUB_STEP_SUMMARY"
          if [ "${VULN_COUNT}" = "0" ]; then
            echo "- **Result**: No CRITICAL/HIGH unfixed vulnerabilities found" >> "$GITHUB_STEP_SUMMARY"
          else
            echo "- **Result**: Found ${VULN_COUNT} finding(s) — check the Security tab for details" >> "$GITHUB_STEP_SUMMARY"
          fi
--- a/third_party/sglang/.github/workflows/weekly-test-nvidia.yml
+++ b/third_party/sglang/.github/workflows/weekly-test-nvidia.yml
@@ -0,0 +1,49 @@
 name: Weekly Test (Nvidia)
 on:
  schedule:
    - cron: '0 0 * * 0'  # Run every Sunday at midnight UTC
  workflow_dispatch:
    inputs:
      job_filter:
        description: 'Select which job to run (leave empty or "all" to run all jobs)'
        required: false
        type: choice
        default: 'all'
        options:
          - 'all'
          - 'weekly-test-8-gpu-h200'
 concurrency:
  group: weekly-test-nvidia-${{ github.ref }}
  cancel-in-progress: true
 env:
  SGLANG_IS_IN_CI: true
  HF_HUB_DOWNLOAD_TIMEOUT: 300
  HF_HUB_ETAG_TIMEOUT: 300
 jobs:
  # Weekly tests - 8 GPU H200
  weekly-test-8-gpu-h200:
    if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'weekly-test-8-gpu-h200')
    runs-on: 8-gpu-h200
    timeout-minutes: 120
    env:
      RUNNER_LABELS: 8-gpu-h200
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install dependencies
        run: |
          bash scripts/ci/cuda/ci_install_dependency.sh
      - name: Run weekly 8-GPU H200 tests
        timeout-minutes: 120
        env:
          GPU_CONFIG: "8-gpu-h200"
          IS_H200: "1"
        run: |
          cd test
          python3 run_suite.py --hw cuda --suite weekly-8-gpu-h200 --nightly --continue-on-error --timeout-per-file 7200
--- a/third_party/sglang/.gitignore
+++ b/third_party/sglang/.gitignore
@@ -0,0 +1,274 @@
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
 *$py.class
 # C extensions
 *.so
 # Distribution / packaging
 .Python
 **/build/
 **/develop-eggs/
 **/dist/
 **/downloads/
 **/eggs/
 .eggs/
 **/lib/
 **/lib64/
 **/parts/
 **/sdist/
 **/var/
 **/wheels/
 **/share/python-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
 MANIFEST
 # PyInstaller
 #  Usually these files are written by a python script from a template
 #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 *.manifest
 *.spec
 # Installer logs
 pip-log.txt
 pip-delete-this-directory.txt
 # Unit test / coverage reports
 htmlcov/
 .tox/
 .nox/
 .coverage
 .coverage.*
 .cache
 nosetests.xml
 coverage.xml
 *.cover
 *.py,cover
 .hypothesis/
 # Tokenizer cache for tests
 .tokenizer_cache/
 .pytest_cache/
 cover/
 # Translations
 *.mo
 *.pot
 # Django stuff:
 *.log
 local_settings.py
 db.sqlite3
 db.sqlite3-journal
 # Flask stuff:
 instance/
 .webassets-cache
 # Scrapy stuff:
 .scrapy
 # Sphinx documentation
 docs/_build/
 # PyBuilder
 .pybuilder/
 target/
 # Jupyter Notebook
 .ipynb_checkpoints
 # IPython
 profile_default/
 ipython_config.py
 # pyenv
 #   For a library or package, you might want to ignore these files since the code is
 #   intended to run in multiple environments; otherwise, check them in:
 # .python-version
 # pipenv
 #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 #   install all needed dependencies.
 #Pipfile.lock
 # poetry
 #   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
 #   This is especially recommended for binary packages to ensure reproducibility, and is more
 #   commonly ignored for libraries.
 #   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
 #poetry.lock
 # pdm
 #   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
 #pdm.lock
 #   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
 #   in version control.
 #   https://pdm.fming.dev/#use-with-ide
 .pdm.toml
 # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
 __pypackages__/
 # Celery stuff
 celerybeat-schedule
 celerybeat.pid
 # SageMath parsed files
 *.sage.py
 # Environments
 .env
 .venv
 env/
 venv/
 ENV/
 env.bak/
 venv.bak/
 # Spyder project settings
 .spyderproject
 .spyproject
 # Rope project settings
 .ropeproject
 # mkdocs documentation
 /site
 # mypy
 .mypy_cache/
 .dmypy.json
 dmypy.json
 # Pyre type checker
 .pyre/
 # pytype static type analyzer
 .pytype/
 # Cython debug symbols
 cython_debug/
 # PyCharm
 #  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
 #  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
 #  and can be added to the global gitignore or merged into this file.  For a more nuclear
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
 .idea/
 # MacOS
 .DS_Store
 # Vim
 *.swp
 # Documentation
 docs/_build
 # SGL
 benchmark/mmlu/data
 benchmark/mmlu/data.tar
 benchmark/llava_bench/images
 benchmark/llava_bench/mme_pack
 *.jsonl
 tmp*.txt
 # Torch Compile logs
 tl_out/
 # Plots
 *.png
 *.pdf
 # personal
 work_dirs/
 *.csv
 !logo.png
 # Prerequisites
 *.d
 # Compiled Object files
 *.slo
 *.lo
 *.o
 *.obj
 # Precompiled Headers
 *.gch
 *.pch
 # Compiled Dynamic libraries
 *.so
 *.dylib
 *.dll
 # Fortran module files
 *.mod
 *.smod
 # Compiled Static libraries
 *.lai
 *.la
 *.a
 *.lib
 # Executables
 *.exe
 *.out
 *.app
 *.iml
 # VSCode
 .vscode
 # Autoenv
 .env.leave
 # Rust lib
 Cargo.lock
 # Generated vision test fixtures (regenerate with: python scripts/generate_vision_golden.py)
 sgl-model-gateway/tests/fixtures/golden/
 # Other repos
 lmms-eval
 **/.serena/
 ctags/
 outputs/
 inputs/
 # Eval Cache
 .longbench_cache/
 # CUDA kernel develop, profile and debug
 .clangd
 *.nsys-rep
 *.ncu-rep
 *.nvcudmp
 # setuptools-scm generated version file
 python/sglang/_version.py
 # MUSA section
 # Generated source files by torchada
 sgl-kernel/csrc_musa/
 sgl-kernel/include_musa/
 sgl-kernel/csrc/**/*_musa/
 # MUSA core dump files
 *.mudmp
 # Others
 # diffusion 3D outputs
 *.glb
 *.ply
 *.npz
--- a/third_party/sglang/.isort.cfg
+++ b/third_party/sglang/.isort.cfg
@@ -0,0 +1,3 @@
 [settings]
 profile=black
 known_first_party=sglang
--- a/third_party/sglang/.pre-commit-config.yaml
+++ b/third_party/sglang/.pre-commit-config.yaml
@@ -0,0 +1,102 @@
 default_stages: [pre-commit, pre-push, manual]
 exclude: ^(python/sglang/multimodal_gen/csrc|python/sglang/jit_kernel/flash_attention/cute)
 repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v6.0.0
    hooks:
      - id: check-symlinks
      - id: destroyed-symlinks
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
        args: [--allow-multiple-documents]
      - id: check-toml
      - id: check-ast
      - id: check-added-large-files
      - id: check-merge-conflict
      - id: check-shebang-scripts-are-executable
      - id: detect-private-key
        exclude: ^sgl-model-gateway/tests/.*_test\.rs$
      - id: debug-statements
      - id: no-commit-to-branch
  - repo: https://github.com/PyCQA/isort
    rev: 7.0.0
    hooks:
      - id: isort
        exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$'
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.15.1
    hooks:
      - id: ruff
        args:
          - --select=F401,F821
          - --fix
        files: ^(benchmark/|docs/|examples/|python/sglang/|sgl-model-gateway/py_*|test/)
        exclude: |
          (?x)^(
          .*/__init__\.py$|
          .*\.ipynb$|
          python/sglang/srt/grpc/.*_pb2\.py$|
          python/sglang/srt/grpc/.*_pb2_grpc\.py$|
          python/sglang/srt/grpc/.*_pb2\.pyi$|
          python/sglang/srt/grpc/.*_pb2_grpc\.pyi$|
          )$
  - repo: https://github.com/psf/black
    rev: 26.1.0
    hooks:
      - id: black-jupyter
        exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$'
  - repo: https://github.com/codespell-project/codespell
    rev: v2.4.1
    hooks:
      - id: codespell
        args: ['--config', '.codespellrc']
  - repo: https://github.com/pre-commit/mirrors-clang-format
    rev: v20.1.7
    hooks:
    - id: clang-format
      types_or: [c++, cuda]
      args: [--style=file, --verbose]
  - repo: https://github.com/kynan/nbstripout
    rev: 0.9.0
    hooks:
      - id: nbstripout
        args:
          - '--keep-output'
          - '--extra-keys=metadata.kernelspec metadata.language_info.version'
  - repo: local
    hooks:
      - id: check-chinese-characters
        name: check chinese characters in multimodal_gen
        entry: >-
          python3 -c 'import sys, re; p=re.compile(r"[\u4e00-\u9fff]"); ec=0; [ ([(print(f"{f}:{i+1}: {l.strip()}") or (ec:=1)) for i,l in enumerate(open(f, "r", encoding="utf-8", errors="ignore")) if p.search(l)]) for f in sys.argv[1:] ]; sys.exit(ec)'
        language: system
        files: ^python/sglang/multimodal_gen/.*
        exclude: ^(python/sglang/multimodal_gen/configs/sample|python/sglang/multimodal_gen/apps/ComfyUI_SGLDiffusion/workflows|python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages)(/|$)
        types_or: [python, markdown, json, text]
      - id: sort-ci-permissions
        name: sort CI_PERMISSIONS.json
        entry: python3 .github/update_ci_permission.py --sort-only
        language: system
        files: ^\.github/CI_PERMISSIONS\.json$
        pass_filenames: false
      - id: check-workflow-job-names
        name: check for duplicate workflow job names
        entry: python3 scripts/ci/check_workflow_job_names.py
        language: system
        files: ^\.github/workflows/.*\.yml$
        pass_filenames: false
  - repo: https://github.com/lycheeverse/lychee.git
    rev: lychee-v0.22.0
    hooks:
      - id: lychee
        name: check doc links (offline)
        args: ["--config", ".github/linters/lychee.toml"]
        stages: [manual]
        exclude: ^docs/_build/
        files: |
          (?x)^(
            README\.md|
            docs/.*\.(md|rst|ipynb)
          )$
--- a/third_party/sglang/3rdparty/amd/profiling/PROFILING.md
+++ b/third_party/sglang/3rdparty/amd/profiling/PROFILING.md
@@ -0,0 +1,425 @@
 ## Profiling SGLang Infer System with AMD GPUs
 This AppNote describes the SGLang profiling techniques, code changes, and running steps for systems with AMD Instinct GPUs; the same procedure may also work with Nvidia GPUs.
 Examples and steps are provided in detail so that they are easy to reproduce and can be used to localize performance problems and guide optimizations.
 Two primary methods are covered:
 - [RPD](https://github.com/ROCm/rocmProfileData.git)
 - [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)
 ### Profiling SGLang Infer System with RPD Profiler
 The RPD profiler is a low-overhead, cross-platform profiler, so the same RPD code changes work for profiling on both ROCm/AMD GPUs and CUDA/Nvidia GPUs. To run RPD profiling on the SGLang repository, use the scripts and patch files included in this directory and follow the steps below:
 1. Install RPD using install_rpd.sh, which applies rpd.patch during installation. Both files are in this directory.
 install_rpd.sh
 ```bash
 # download and install RPD
 apt update && apt install -y sqlite3 libsqlite3-dev libfmt-dev
 # install rpd module
 git clone https://github.com/ROCmSoftwarePlatform/rocmProfileData
 cd rocmProfileData
 git checkout 976899e9c6dbc6dd2bccf770818e4e44125590ac
 git apply rpd.patch
 make && make install
 cd rocpd_python && python setup.py install && cd ..
 cd rpd_tracer && make clean;make install && python setup.py install && cd ..
 ```
 rpd.patch
 ```bash
 diff --git a/rpd_tracer/Makefile b/rpd_tracer/Makefile
 index e9d9feb..b2e9e1a 100644
 --- a/rpd_tracer/Makefile
 +++ b/rpd_tracer/Makefile
@@ -16,7 +16,7 @@ ifneq (,$(HIP_PATH))
         $(info Building with roctracer)
         RPD_LIBS += -L/opt/rocm/lib -lroctracer64 -lroctx64 -lamdhip64 -lrocm_smi64
         RPD_INCLUDES += -I/opt/rocm/include -I/opt/rocm/include/roctracer -I/opt/rocm/include/hsa
 -        RPD_SRCS += RoctracerDataSource.cpp RocmSmiDataSource.cpp
 +        RPD_SRCS += RoctracerDataSource.cpp
         RPD_INCLUDES += -D__HIP_PLATFORM_AMD__
 endif
 ```
 2. Add the loadTracer.sh file included in this directory to /sglang/python/sglang.
 loadTracer.sh
 ```bash
 #!/bin/bash
 ################################################################################
 # Copyright (c) 2021 - 2023 Advanced Micro Devices, Inc. All rights reserved.
 #
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
 # in the Software without restriction, including without limitation the rights
 # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 # copies of the Software, and to permit persons to whom the Software is
 # furnished to do so, subject to the following conditions:
 #
 # The above copyright notice and this permission notice shall be included in
 # all copies or substantial portions of the Software.
 #
 # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
 # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 # THE SOFTWARE.
 ################################################################################
 OUTPUT_FILE="trace.rpd"
 if [ "$1" = "-o" ] ; then
  OUTPUT_FILE=$2
  shift
  shift
 fi
 if [ -e ${OUTPUT_FILE} ] ; then
  rm ${OUTPUT_FILE}
 fi
 python3 -m rocpd.schema --create ${OUTPUT_FILE}
 if [ $? != 0 ] ; then
  echo "Error: Could not create rpd file. Please run 'python setup.py install' from the rocpd_python dir"
  exit 1
 fi
 export RPDT_FILENAME=${OUTPUT_FILE}
 export RPDT_AUTOSTART=0
 LD_PRELOAD=librocm-smi_64:librpd_tracer.so "$@"
 ```
 3. Apply the patch provided in this directory with "git apply rpd_profile_server_enable.patch" if the main purpose of profiling is to get information on GPU kernels along with limited CPU activity information.
 #### Common Notes 1
 Please note that although the example uses TP=8, the patch file deliberately logs RPD profiling on only 2 ranks (i.e., tp_rank=0/1) for profiling and visualization convenience, because even Perfetto streaming mode can load a JSON file of at most 8GB for visualization. With 2 ranks logged, we can still check for issues across ranks (e.g., load imbalance or NCCL issues), and we can log a relatively longer time duration before the JSON file generated from the RPD file reaches 8GB.
 rpd_profile_server_enable.patch
 ```bash
 diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
 index 62d1ff9..9021c01 100644
 --- a/python/sglang/srt/managers/scheduler.py
 +++ b/python/sglang/srt/managers/scheduler.py
@@ -71,6 +71,8 @@ from sglang.srt.utils import (
     suppress_other_loggers,
 )
 from sglang.utils import get_exception_traceback
 +from rpdTracerControl import rpdTracerControl
 +rpdTracerControl.skipCreate()
 logger = logging.getLogger(__name__)
@@ -245,6 +247,7 @@ class Scheduler:
                 ],
                 with_stack=True,
             )
 +            self.rpd = rpdTracerControl()
     @torch.inference_mode()
     def event_loop(self):
@@ -1027,15 +1030,24 @@ class Scheduler:
     def start_profile(self) -> None:
         if self.profiler is None:
             raise RuntimeError("Profiler is not enabled.")
 -        self.profiler.start()
 +        #self.profiler.start() #block pytorch profiler for rpd profiler enabling
 +        if self.tp_rank == 0 or self.tp_rank == 1:
 +            self.rpd.start()
 +            self.rpd.rangePush("", "rpd profile range", "")
 +            logger.info("rpd is enabled")
     def stop_profile(self) -> None:
         if self.profiler is None:
             raise RuntimeError("Profiler is not enabled.")
 -        self.profiler.stop()
 -        self.profiler.export_chrome_trace(
 -            self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
 -        )
 +        #self.profiler.stop()
 +        #self.profiler.export_chrome_trace(
 +        #    self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
 +        #)
 +        if self.tp_rank ==0 or self.tp_rank ==1:
 +            self.rpd.rangePop()
 +            self.rpd.stop()
 +            self.rpd.flush()
 +            logger.info("rpd is done")
         logger.info("Profiler is done")
 ```
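The rank gating used in the patch above (only tp_rank 0 and 1 call into RPD) can be sketched in isolation. This is a minimal illustration, not part of the patch: `PROFILED_RANKS` and `should_profile_rank` are hypothetical names, and the patch itself inlines the check as `self.tp_rank == 0 or self.tp_rank == 1`.

```python
# Hypothetical helper illustrating the rank gating from the patch.
# The patch inlines this check; these names are not part of SGLang or RPD.
PROFILED_RANKS = frozenset({0, 1})


def should_profile_rank(tp_rank: int, profiled_ranks=PROFILED_RANKS) -> bool:
    """Return True if this tensor-parallel rank should record an RPD trace."""
    return tp_rank in profiled_ranks


# With TP=8 (ranks 0..7), only the gated ranks are traced.
traced = [rank for rank in range(8) if should_profile_rank(rank)]
print(traced)
```

Keeping the set of traced ranks small is what bounds the size of the JSON file generated from the RPD file; widening the set trades a shorter capture window for more cross-rank visibility.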
 #### Advanced Debugging with RPD Profiler
 Sometimes we want to use the RPD profiler to capture more CPU and Python activity in order to debug challenging issues (e.g., the root cause of load imbalance across GPU processes or the root cause of bubbles). Only in such cases, apply the patch with "git apply rpd_profile_server_enable_wCPU_activities.patch", which modifies 3 files.
 rpd_profile_server_enable_wCPU_activities.patch
 ```bash
 diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
 index 62d1ff9..2edb427 100644
 --- a/python/sglang/srt/managers/scheduler.py
 +++ b/python/sglang/srt/managers/scheduler.py
@@ -71,6 +71,8 @@ from sglang.srt.utils import (
     suppress_other_loggers,
 )
 from sglang.utils import get_exception_traceback
 +from rpdTracerControl import rpdTracerControl
 +rpdTracerControl.skipCreate()
 logger = logging.getLogger(__name__)
@@ -245,6 +247,7 @@ class Scheduler:
                 ],
                 with_stack=True,
             )
 +            self.rpd = rpdTracerControl()
     @torch.inference_mode()
     def event_loop(self):
@@ -1027,15 +1030,26 @@ class Scheduler:
     def start_profile(self) -> None:
         if self.profiler is None:
             raise RuntimeError("Profiler is not enabled.")
 -        self.profiler.start()
 +        #self.profiler.start()
 +        logger.info("torch profiler is disabled")
 +        if self.tp_rank == 0 or self.tp_rank == 1:
 +            self.rpd.setPythonTrace(True)
 +            self.rpd.start()
 +            self.rpd.rangePush("", "scheduler", "")
 +        logger.info("rpd is enabled inside scheduler profiling")
     def stop_profile(self) -> None:
         if self.profiler is None:
             raise RuntimeError("Profiler is not enabled.")
 -        self.profiler.stop()
 -        self.profiler.export_chrome_trace(
 -            self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
 -        )
 +        #self.profiler.stop()
 +        #self.profiler.export_chrome_trace(
 +        #    self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
 +        #)
 +        if self.tp_rank ==0 or self.tp_rank ==1:
 +            self.rpd.rangePop()
 +            self.rpd.stop()
 +            self.rpd.flush()
 +            logger.info("rpd is done inside scheduler")
         logger.info("Profiler is done")
 diff --git a/python/sglang/srt/managers/tokenizer_manager.py b/python/sglang/srt/managers/tokenizer_manager.py
 index 2621ccd..181df85 100644
 --- a/python/sglang/srt/managers/tokenizer_manager.py
 +++ b/python/sglang/srt/managers/tokenizer_manager.py
@@ -58,6 +58,10 @@ from sglang.srt.sampling.sampling_params import SamplingParams
 from sglang.srt.server_args import PortArgs, ServerArgs
 from sglang.srt.utils import is_generation_model, is_multimodal_model
 +from rpdTracerControl import rpdTracerControl
 +rpdTracerControl.skipCreate()
 +
 +
 asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
 logger = logging.getLogger(__name__)
@@ -514,10 +518,20 @@ class TokenizerManager:
         self.send_to_scheduler.send_pyobj(req)
     def start_profile(self):
 +        rpd = rpdTracerControl()
 +        rpd.setPythonTrace(True)
 +        rpd.start()
 +        rpd.rangePush("", "tokenizer_manager", "")
 +        logger.info("tokenizer_manager rpd profiling started!")
         req = ProfileReq.START_PROFILE
         self.send_to_scheduler.send_pyobj(req)
     def stop_profile(self):
 +        rpd = rpdTracerControl()
 +        rpd.rangePop()
 +        rpd.stop()
 +        rpd.flush()
 +        logger.info("rpd profiling is done inside tokenizer_manager!")
         req = ProfileReq.STOP_PROFILE
         self.send_to_scheduler.send_pyobj(req)
 diff --git a/python/sglang/srt/server.py b/python/sglang/srt/server.py
 index 7111c93..2bd722c 100644
 --- a/python/sglang/srt/server.py
 +++ b/python/sglang/srt/server.py
@@ -30,6 +30,8 @@ import threading
 import time
 from http import HTTPStatus
 from typing import Dict, List, Optional, Union
 +from rpdTracerControl import rpdTracerControl
 +rpdTracerControl.skipCreate()
 # Fix a bug of Python threading
 setattr(threading, "_register_atexit", lambda *args, **kwargs: None)
@@ -152,6 +154,11 @@ async def flush_cache():
 @app.post("/start_profile")
 async def start_profile():
     """Start profiling."""
 +    rpd = rpdTracerControl()
 +    rpd.setPythonTrace(True)
 +    rpd.start()
 +    rpd.rangePush("", "server rpd profile range", "")
 +    logger.info("rpd profiling started in server.py!")
     tokenizer_manager.start_profile()
     return Response(
         content="Start profiling.\n",
@@ -164,6 +171,11 @@ async def start_profile():
 async def stop_profile():
     """Stop profiling."""
     tokenizer_manager.stop_profile()
 +    rpd = rpdTracerControl()
 +    rpd.rangePop()
 +    rpd.stop()
 +    rpd.flush()
 +    logger.info("rpd profiling is done in server.py!")
     return Response(
         content="Stop profiling. This will take some time.\n",
         status_code=200,
 ```
 4. As an example for grok1 profiling, create a dummy_grok1 directory inside this directory containing config.json (see the content below), then copy that directory to the path passed to "--model-path" if you want to use the example server.sh file provided.
 ```bash
 cat ../dummy_grok1/config.json
 {
  "architectures": [
    "Grok1ModelForCausalLM"
  ],
  "embedding_multiplier_scale": 78.38367176906169,
  "output_multiplier_scale": 0.5773502691896257,
  "vocab_size": 131072,
  "hidden_size": 6144,
  "intermediate_size": 32768,
  "max_position_embeddings": 8192,
  "num_experts_per_tok": 2,
  "num_local_experts": 8,
  "num_attention_heads": 48,
  "num_hidden_layers": 64,
  "num_key_value_heads": 8,
  "head_dim": 128,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "model_type": "mixtral",
  "torch_dtype": "bfloat16"
 }
 ```
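If it is more convenient to generate the dummy model directory from a script, the following is one possible sketch. The configuration values are copied from the listing above; the helper name `write_dummy_model_dir` and the use of a temporary parent directory are assumptions made for illustration only.

```python
import json
import tempfile
from pathlib import Path

# Values copied from the config.json listing above.
DUMMY_GROK1_CONFIG = {
    "architectures": ["Grok1ModelForCausalLM"],
    "embedding_multiplier_scale": 78.38367176906169,
    "output_multiplier_scale": 0.5773502691896257,
    "vocab_size": 131072,
    "hidden_size": 6144,
    "intermediate_size": 32768,
    "max_position_embeddings": 8192,
    "num_experts_per_tok": 2,
    "num_local_experts": 8,
    "num_attention_heads": 48,
    "num_hidden_layers": 64,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "rms_norm_eps": 1e-05,
    "rope_theta": 10000.0,
    "model_type": "mixtral",
    "torch_dtype": "bfloat16",
}


def write_dummy_model_dir(parent: Path) -> Path:
    """Create <parent>/dummy_grok1/config.json and return the directory (hypothetical helper)."""
    model_dir = parent / "dummy_grok1"
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "config.json").write_text(json.dumps(DUMMY_GROK1_CONFIG, indent=2))
    return model_dir


# Demonstration in a temporary directory; point `parent` at the real
# location expected by "--model-path" when using this for profiling.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = write_dummy_model_dir(Path(tmp))
    loaded = json.loads((model_dir / "config.json").read_text())
    print(model_dir.name, loaded["num_hidden_layers"])
```

Because server.sh passes "--load-format dummy", only the configuration file is needed in this directory.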
 5. Launch the server with the RPD-enabled script ./server.sh in one terminal inside the Docker container.
 #### Common Notes 2
 - Remember to change model-path to the correct path
 - loadTracer.sh is needed for RPD profiling
 - SGLANG_TORCH_PROFILER_DIR is used by the default torch profiler
 - Do not use loadTracer.sh if you are using the torch profiler; simply use python3 -m sglang.launch_server.
 server.sh
 ```bash
 #!/bin/bash
 # export SGLANG_TORCH_PROFILER_DIR=/data/sglang/
 export SGLANG_TORCH_PROFILER_DIR=/sgl-workspace/sglang/profile/
 # Get the current timestamp
 TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
 # Define the log file with a timestamp
 LOGFILE="sglang_server_log_$TIMESTAMP.json"
 # Run the Python command and save the output to the log file
 loadTracer.sh python3 -m sglang.launch_server \
    --model-path /sgl-workspace/sglang/dummy_grok1 \
    --tokenizer-path Xenova/grok-1-tokenizer \
    --load-format dummy \
    --quantization fp8 \
    --tp 8 \
    --port 30000 \
    --disable-radix-cache 2>&1 | tee "$LOGFILE"
 ```
 6. Open another terminal in the same Docker container, and run the RPD-enabled ./client.sh after you see the "The server is fired up and is ready to roll!" message in the server-side terminal.
 #### Common Notes 3
 - Use "curl http://localhost:30000/start_profile" and "curl http://localhost:30000/stop_profile" to control the start and end of profiling. Check sglang/python/sglang/srt/managers/scheduler.py for more details.
 - Do not use the RPD profiler together with the PyTorch profiler, to avoid interference.
 - The rocmProfileData/tools/rpd2tracing.py file is used to generate a JSON file from the RPD file.
 client.sh
 ```bash
 #!/bin/bash
 # Start profiling via API
 curl http://localhost:30000/start_profile -H "Content-Type: application/json"
 # Benchmark serving using sglang with random dataset and tokenizer
 # Define the log file with a timestamp
 TIMESTAMP=$(date +%Y%m%d_%H%M%S)
 LOGFILE="sglang_client_log_$TIMESTAMP.json"
 # Run the benchmark with specified parameters and save logs
 python3 -m sglang.bench_serving \
    --backend sglang \
    --tokenizer Xenova/grok-1-tokenizer \
    --dataset-name random \
    --random-input 1024 \
    --random-output 1024 \
    --num-prompts 120 \
    --request-rate 8 \
    --output-file online.jsonl 2>&1 | tee "$LOGFILE"
 # Stop profiling via API
 curl http://localhost:30000/stop_profile -H "Content-Type: application/json"
 # Convert tracing file to csv & json
 sqlite3 trace.rpd ".mode csv" ".header on" ".output trace.csv" "select * from top;" ".output stdout"
 python3 ./rocmProfileData/tools/rpd2tracing.py trace.rpd trace.json
 ```
 7. Follow the [Perfetto docs](https://perfetto.dev/docs/visualization/large-traces) to visualize large JSON files. Try to adjust the parameters so that the trace.json file is smaller than 9GB.
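The size guideline in step 7 can be checked before loading the trace. This is a minimal sketch with a hypothetical helper name (trace_is_small_enough); it reads "9GB" as 9 GiB, which is an assumption, and the limit can be overridden with a second argument.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if a file is smaller than a byte limit.
# The default limit interprets "9GB" as 9 GiB (an assumption).
SIZE_LIMIT_BYTES=$((9 * 1024 * 1024 * 1024))

trace_is_small_enough() {
    local file="$1"
    local limit="${2:-$SIZE_LIMIT_BYTES}"
    local size
    # Return 2 for a missing file so it is distinguishable from "too large"
    [ -f "$file" ] || return 2
    # wc -c on stdin prints only the byte count; tr strips any padding
    size=$(wc -c < "$file" | tr -d '[:space:]')
    [ "$size" -lt "$limit" ]
}

# Example usage:
# if trace_is_small_enough trace.json; then
#     echo "trace.json is below the size limit"
# else
#     echo "trace.json is too large or missing" >&2
# fi
```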
 ### Profiling SGLang Infer System with PyTorch Profiler
 Follow these steps:
 1. Apply the patch torch_profiler.patch. Note that you can modify "if self.tp_rank == 0" in the patch to allow more ranks to be recorded in profiling.
 torch_profiler.patch
 ```diff
 diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
 index 62d1ff9..6ecd78c 100644
 --- a/python/sglang/srt/managers/scheduler.py
 +++ b/python/sglang/srt/managers/scheduler.py
@@ -240,7 +240,6 @@ class Scheduler:
             )
             self.profiler = torch.profiler.profile(
                 activities=[
 -                    torch.profiler.ProfilerActivity.CPU,
                     torch.profiler.ProfilerActivity.CUDA,
                 ],
                 with_stack=True,
@@ -1033,9 +1032,11 @@ class Scheduler:
         if self.profiler is None:
             raise RuntimeError("Profiler is not enabled.")
         self.profiler.stop()
 -        self.profiler.export_chrome_trace(
 -            self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
 -        )
 +        if self.tp_rank == 0:
 +            with open(f"stats_repro_{int(time.time())}.txt", "w") as f:
 +                print(self.profiler.key_averages(group_by_input_shape=True).table(sort_by="cuda_time_total", row_limit=-1), file=f)
 +                print("Profiling stats done.")
 +
         logger.info("Profiler is done")
 ```
 2. Create the model directory and copy it to the path passed to "--model-path" if you want to use the provided server.sh file.
 3. Modify the provided server.sh by removing "loadTracer.sh" before the python command, then launch ./server.sh in one terminal inside the Docker container.
 4. Proceed as in step 6 of the RPD profiling section, but first remove the last 2 lines of client.sh, which convert the RPD file into CSV and JSON files. Run the modified client.sh for PyTorch profiling.
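The script changes in steps 3 and 4 can be made mechanically rather than by hand. The sketch below uses two hypothetical helpers (strip_load_tracer and drop_last_lines are illustrative names, not part of SGLang). It assumes server.sh and client.sh match the listings above exactly, in particular that the two conversion commands are the final two lines of client.sh with no trailing blank line.

```bash
#!/bin/bash
# Hypothetical helpers for steps 3 and 4; names are illustrative only.

# Print a script with the "loadTracer.sh " wrapper removed from any line
# that starts with it (optionally after indentation).
strip_load_tracer() {
    sed 's/^\([[:space:]]*\)loadTracer\.sh[[:space:]][[:space:]]*/\1/' "$1"
}

# Print a file without its last N lines (a portable alternative to
# GNU "head -n -N").
drop_last_lines() {
    awk -v n="$2" '{ line[NR] = $0 } END { for (i = 1; i <= NR - n; i++) print line[i] }' "$1"
}

# Example usage; the output file names are assumptions:
# strip_load_tracer server.sh > server_torch.sh
# drop_last_lines client.sh 2 > client_torch.sh
# chmod +x server_torch.sh client_torch.sh
```

Writing to new files leaves the RPD versions of both scripts untouched, so either profiler can still be used later.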
 -------
 - [Torch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)