xserv

Files

Gahow Wang 377a04b81f tokenizer: read pre-tokenizer regex from tokenizer.json

Parse the model's `pre_tokenizer` section to extract its Split regex
instead of hardcoding the GPT-2 pattern.  The gpt-oss-20b model uses
a GPT-4-style regex that produces different word boundaries, causing a
1-token prompt mismatch vs HuggingFace (136 → 135 tokens, now aligned).

Unsupported lookahead `(?!\S)` is stripped — it only affects trailing
whitespace edge cases.  Falls back to the old GPT-2/Qwen heuristic if
the model regex fails to compile or is absent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-31 13:22:35 +08:00

src

tokenizer: read pre-tokenizer regex from tokenizer.json

2026-05-31 13:22:35 +08:00

Cargo.toml

phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)

2026-05-21 22:04:00 +08:00