Phase 6 — Model Loading (xserv-model): - safetensors parser with single/sharded file support - ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming) - Weight loading flow: safetensors → mmap → CPU Tensor → GPU Phase 7 — BPE Tokenizer (xserv-tokenizer): - Full BPE encode/decode from tokenizer.json - GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes) - Pre-tokenization regex, special token handling - Chat template support structure Phase 8 — GPT-2 Complete Inference: - GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f - Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits - QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug) - Greedy sampling from last-position logits - Interactive CLI: xserv-cli <model-dir> [--max-tokens N] Verified: GPT-2 124M generates coherent English text on RTX 5090. "The future of AI is uncertain. The future of AI is uncertain..." "Once upon a time, the world was a place of great beauty..." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
23 lines
399 B
TOML
23 lines
399 B
TOML
[workspace]
|
|
resolver = "2"
|
|
members = [
|
|
"crates/xserv-cuda",
|
|
"crates/xserv-tensor",
|
|
"crates/xserv-kernels",
|
|
"crates/xserv-model",
|
|
"crates/xserv-tokenizer",
|
|
]
|
|
|
|
[workspace.package]
|
|
version = "0.1.0"
|
|
edition = "2024"
|
|
license = "MIT"
|
|
|
|
[workspace.dependencies]
|
|
half = "2"
|
|
smallvec = "1"
|
|
serde = { version = "1", features = ["derive"] }
|
|
serde_json = "1"
|
|
safetensors = "0.5"
|
|
regex = "1"
|