xserv

Gahow Wang d96ee0766c server: sampling-param validation, finish_reason normalization, backpressure

Three related hardening changes for the API surface:

- validate_request rejects NaN/negative temperature, out-of-range top_p,
  and absurd top_k before those values reach the CUDA sampling paths.
  Prevents NaN logits from downstream sampling and matches typical
  OpenAI-compatible server behavior (400 instead of 500).
- normalize_finish_reason maps engine strings to the OpenAI-standard
  subset. Currently only "error" (from tp/pp engine client-stall) needs
  normalization — it collapses to null so SDK clients see a clean stream
  close instead of an unknown finish_reason value. Applied to both
  streaming (SSE) and non-streaming JSON responses.
- Replace the unbounded std::sync::mpsc engine channel with a bounded
  sync_channel(256) and switch submit_to_engine to try_send. A saturated
  engine now returns 503 "engine is busy" instead of letting requests
  pile up in RAM. Also add axum DefaultBodyLimit(4 MiB) so a malicious
  or misbehaving client cannot exhaust memory with an arbitrary JSON POST.
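
The three changes above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `SamplingParams` struct, the `MAX_TOP_K` limit, the accepted `top_p` range of 0.0 to 1.0, and the `SubmitError` enum are all assumptions made for the example. Only the function names `validate_request`, `normalize_finish_reason`, and `submit_to_engine`, and the use of `try_send` on a bounded `sync_channel`, come from the commit message.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical request shape; the real field names are not shown in the commit.
#[derive(Debug, Clone)]
pub struct SamplingParams {
    pub temperature: f32,
    pub top_p: f32,
    pub top_k: u32,
}

/// Assumed ceiling for an "absurd" top_k; the actual limit is not stated.
const MAX_TOP_K: u32 = 1024;

/// Reject bad sampling parameters before they reach the sampling kernels.
/// An Err here would be mapped to HTTP 400 by the handler.
pub fn validate_request(p: &SamplingParams) -> Result<(), String> {
    if p.temperature.is_nan() || p.temperature < 0.0 {
        return Err("temperature must be a non-negative number".into());
    }
    // `contains` is false for NaN, so NaN top_p is rejected as well.
    if !(0.0..=1.0).contains(&p.top_p) {
        return Err("top_p must be within [0, 1]".into());
    }
    if p.top_k > MAX_TOP_K {
        return Err(format!("top_k must be at most {MAX_TOP_K}"));
    }
    Ok(())
}

/// Map the engine's "error" finish reason to None (serialized as null);
/// every other value passes through unchanged.
pub fn normalize_finish_reason(reason: Option<&str>) -> Option<&str> {
    match reason {
        Some("error") => None,
        other => other,
    }
}

/// Hypothetical error type for the submit path.
#[derive(Debug, PartialEq)]
pub enum SubmitError {
    /// Queue is full; the handler would answer 503 "engine is busy".
    Busy,
    /// The engine thread dropped its receiver.
    EngineGone,
}

/// Non-blocking submit: fail fast instead of queueing without bound.
pub fn submit_to_engine<T>(tx: &SyncSender<T>, req: T) -> Result<(), SubmitError> {
    match tx.try_send(req) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(_)) => Err(SubmitError::Busy),
        Err(TrySendError::Disconnected(_)) => Err(SubmitError::EngineGone),
    }
}
```

With a channel created by `std::sync::mpsc::sync_channel(256)`, the 257th unconsumed request would get `SubmitError::Busy` immediately rather than blocking the handler or growing the queue.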

2026-07-01 15:13:24 +08:00

xserv-cuda

style: format Rust workspace

2026-06-18 18:11:58 +08:00

xserv-distributed

style: format Rust workspace

2026-06-18 18:11:58 +08:00

xserv-kernels

cuda: deterministic BF16 gemv + paged attention reductions

2026-07-01 14:16:28 +08:00

xserv-model

sampling: NaN-safe sample() top-k/top-p path