xtrain

Files

Gahow Wang 2c9b58cb3b post-train: M2b — batched KV-cache decode (G-way, token-identical)

The rollout long-pole fix deferred from M2a: decode the G samples of one prompt
in lockstep (one forward per step over the group → G× fewer kernel launches).

- rope_pos(x, positions[]): RoPE with a per-row absolute position (new forward-
  only kernel) — G rows share one decode position. Gate: == full rope for
  [0..n], == rope_at(P) per row for uniform P (bit-identical).
- generate_cached_batch: BatchKVCache [T, G·num_kv, hd] + batched decode_step.
  decode_attention is already batch-agnostic (bh = G·nh); repeat_kv(nh, batch=G)
  broadcasts per group. No finished-mask / ragged prompts yet (perf-only / next).
- Gate (tests/decode_batch.rs): all G greedy rows token-identical to the single-
  sequence decode (8 query / 2 kv heads → exercises repeat_kv batching).
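
The rope_pos gate above can be illustrated with a minimal CPU sketch. It is not the kernel from this commit: the half-split pairing (i, i + hd/2), the base of 10000, the row-major [rows, hd] layout, and both function signatures are assumptions made for illustration only. It shows per-row absolute positions, and why a uniform position P must reproduce the single-position rotation on every row.

```rust
/// Rotate one head vector `row` (even length hd) in place for absolute
/// position `pos`. Assumed convention: element i is paired with i + hd/2,
/// and the frequency for pair i is base^(-2i/hd).
fn rope_at(row: &mut [f32], pos: usize, base: f32) {
    let hd = row.len();
    let half = hd / 2;
    for i in 0..half {
        let freq = base.powf(-(2.0 * i as f32) / hd as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (row[i], row[i + half]);
        row[i] = a * cos - b * sin;
        row[i + half] = a * sin + b * cos;
    }
}

/// Hypothetical stand-in for rope_pos: `x` is row-major [rows, hd] and
/// row r is rotated by its own absolute position positions[r].
fn rope_pos(x: &mut [f32], hd: usize, positions: &[usize], base: f32) {
    assert_eq!(x.len(), positions.len() * hd, "one position per row");
    for (row, &pos) in x.chunks_exact_mut(hd).zip(positions) {
        rope_at(row, pos, base);
    }
}
```

In this sketch the uniform-P case is bit-identical by construction, because each row goes through the same scalar code path as rope_at. A real GPU kernel would need the same property checked explicitly, as the gate in this commit does.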

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 17:18:54 +08:00

adamw_parity_dump.rs

model: add per-head QK-norm (Qwen3-compat) for xserv export

2026-06-15 17:33:19 +08:00

adamw_parity.py

model: add per-head QK-norm (Qwen3-compat) for xserv export

2026-06-15 17:33:19 +08:00

checkpoint_roundtrip.rs

test: AdamW PyTorch parity + checkpoint round-trip + real training

2026-06-15 16:30:06 +08:00

decode_batch.rs

post-train: M2b — batched KV-cache decode (G-way, token-identical)