xserv

Files

Gahow Wang 0314b4f3ac server: non-blocking stream send — stop one slow client stalling the batch

All three engines emitted tokens with blocking_send on the single
decode/coordinator OS thread. A streaming client that drains slower than
generation fills its 64-slot channel, and blocking_send then blocks the whole
thread: under continuous batching one slow consumer stalls every other running
sequence (and in the serial TP/PP path it blocks admission of the next request
too), which defeats the purpose of continuous batching.

Fix: switch to try_send. In engine.rs, a Full or Closed result sets a
client_stalled flag, and is_finished() reaps the sequence on the next
iteration. In tp_engine/pp_engine, emit_text now returns a bool, and on a
failed send the decode loop breaks with finish_reason "error". Dropping the
sequence/request also drops its sender, which closes the channel so the client
receive loop ends rather than hanging. A slow client now only loses its own
sequence, never the batch.
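
A minimal sketch of the pattern described above, not the actual engine.rs
code. The Sequence struct, its remaining field, and run_batch are hypothetical
names made up for illustration; only client_stalled and is_finished come from
the commit message. The commit's blocking_send and Full/Closed wording suggests
a tokio-style bounded channel. To stay dependency-free, this sketch substitutes
std::sync::mpsc::sync_channel, whose try_send reports Full or Disconnected.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Hypothetical per-sequence state (illustrative field names).
struct Sequence {
    tx: SyncSender<String>,
    client_stalled: bool,
    remaining: usize,
}

impl Sequence {
    /// Non-blocking emit: never waits for the client to drain its channel.
    fn emit(&mut self, token: String) {
        match self.tx.try_send(token) {
            Ok(()) => self.remaining -= 1,
            // Channel full (slow client) or receiver gone: flag the sequence
            // instead of blocking the decode thread.
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.client_stalled = true;
            }
        }
    }

    fn is_finished(&self) -> bool {
        self.client_stalled || self.remaining == 0
    }
}

/// Toy single-threaded decode loop with one client that never drains during
/// decoding and one that drains every step. Returns (slow_tokens, fast_tokens).
fn run_batch(capacity: usize, steps: usize) -> (Vec<String>, Vec<String>) {
    let (slow_tx, slow_rx) = sync_channel(capacity);
    let (fast_tx, fast_rx) = sync_channel(capacity);
    let mut batch = vec![
        Sequence { tx: slow_tx, client_stalled: false, remaining: steps },
        Sequence { tx: fast_tx, client_stalled: false, remaining: steps },
    ];
    let mut fast_out = Vec::new();
    let mut step = 0;
    while !batch.is_empty() {
        for seq in batch.iter_mut() {
            seq.emit(format!("tok{step}"));
        }
        // The fast client keeps up; the slow client reads nothing yet.
        fast_out.extend(fast_rx.try_iter());
        // Reap finished or stalled sequences. Dropping a Sequence drops its
        // sender, which closes that client's channel.
        batch.retain(|s| !s.is_finished());
        step += 1;
    }
    // The slow client finally reads: it gets the buffered tokens, then the
    // iterator ends because the sender is gone, so it does not hang.
    let slow_out: Vec<String> = slow_rx.iter().collect();
    (slow_out, fast_out)
}
```

Calling run_batch(2, 5) models a 2-slot channel and a 5-token generation. The
slow client receives only the two buffered tokens before its sequence is
reaped, while the fast client still receives all five.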

Verified on dash5: gpt-oss FP8 TP=1 streaming via tp_engine still streams
correctly (SSE chunks, coherent content, no hang); bench-gpt-oss at TP=2 is
unchanged at 5.9 ms TPOT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-07-01 12:37:32 +08:00

src

server: non-blocking stream send — stop one slow client stalling the batch

2026-07-01 12:37:32 +08:00

Cargo.toml

server: Jinja chat template rendering via minijinja

2026-05-31 13:23:18 +08:00