Three related hardening changes for the API surface:
- validate_request rejects NaN/negative temperature, out-of-range top_p,
and absurd top_k before those values reach the CUDA sampling paths.
This prevents NaN logits in downstream sampling and matches typical
OpenAI-compatible server behavior (400 instead of 500).
- normalize_finish_reason maps engine strings to the OpenAI-standard
subset. Currently only "error" (from tp/pp engine client-stall) needs
normalization: it collapses to null so SDK clients see a clean stream
close instead of an unknown finish_reason value. Applied to both
streaming (SSE) and non-streaming JSON responses.
- Replace the unbounded std::sync::mpsc engine channel with a bounded
sync_channel(256) and switch submit_to_engine to try_send. When the
queue is full, the server now returns 503 "engine is busy" instead of
letting requests pile up in RAM. Also add axum DefaultBodyLimit(4 MiB)
so a malicious or misbehaving client cannot exhaust memory with an
arbitrarily large JSON POST.
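
For reviewers, a minimal std-only sketch of the three behaviors described
above. It is not the code in this change: the SamplingParams struct, the
MAX_TOP_K bound, the accepted top_p range of [0, 1], the error messages
other than "engine is busy", and the 500 returned for a disconnected
engine are all assumptions made for illustration.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical subset of the sampling fields carried by a request.
pub struct SamplingParams {
    pub temperature: f32,
    pub top_p: f32,
    pub top_k: u32,
}

/// Assumed upper bound for top_k; the real limit is not stated here.
pub const MAX_TOP_K: u32 = 1024;

/// Returns Err(message) for values that should be answered with a 400.
pub fn validate_request(p: &SamplingParams) -> Result<(), String> {
    if p.temperature.is_nan() || p.temperature < 0.0 {
        return Err("temperature must be a non-negative number".to_string());
    }
    // `contains` is false for NaN, so a NaN top_p is rejected as well.
    if !(0.0..=1.0).contains(&p.top_p) {
        return Err("top_p must be in [0, 1]".to_string());
    }
    if p.top_k > MAX_TOP_K {
        return Err(format!("top_k must be at most {}", MAX_TOP_K));
    }
    Ok(())
}

/// Maps the engine-internal "error" reason to None (serialized as null);
/// every other value passes through unchanged.
pub fn normalize_finish_reason(reason: Option<&str>) -> Option<&str> {
    match reason {
        Some("error") => None,
        other => other,
    }
}

/// Non-blocking submit on a bounded channel. A full queue maps to 503;
/// the 500 for a disconnected receiver is an assumption of this sketch.
pub fn submit_to_engine<T>(
    tx: &SyncSender<T>,
    job: T,
) -> Result<(), (u16, &'static str)> {
    match tx.try_send(job) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(_)) => Err((503, "engine is busy")),
        Err(TrySendError::Disconnected(_)) => Err((500, "engine is gone")),
    }
}
```

With a channel created by std::sync::mpsc::sync_channel(256), the 257th
submit that arrives before the engine drains anything takes the Full
branch and returns immediately, which is what bounds memory use.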