Layer-wise split: each stage loads only its contiguous layer range
[s*L, (s+1)*L); stage 0 keeps embed_tokens, the last stage keeps
norm/lm_head (others get a 1x1 placeholder). Heads are NOT split
(PP is orthogonal to TP). Adds embed/head and forward_layers_prefill/
forward_layers_decode that take and return the [tokens, hidden] hidden
state; per-stage PagedKVCache is indexed by local layer id.
sampling: derive Clone on SamplingParams (carried in the PP command enum).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>