docs: M2a — KV-cache decode engine results (token-identical + length-dependent speedup)

Implementation log (docs/18) + Phase-3 row (evolution.md): the two decode primitives and their gates, the engine design (host-cache baseline), the token-identical centerpiece gate, and the measured throughput baseline showing the cache win is sequence-length-dependent (~1.0x@32, ~1.9x@128, naive OOM@256).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:01:10 +08:00
parent eff26a0898
commit b39e6e7110
2 changed files with 52 additions and 0 deletions
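
For readers of this commit, the `decode_attention` gate described in the message (single query against cached K/V equals the last row of full causal attention) can be illustrated with a minimal sketch. This is plain Python on one attention head, not the repository's kernels; the function bodies, the shapes, and the helper `max_abs_diff` are assumptions made for illustration only.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_causal_attention(Q, K, V):
    # Naive path: recompute every row; row t attends to positions 0..t.
    d = len(Q[0])
    out = []
    for t, q in enumerate(Q):
        w = softmax([dot(q, K[j]) / math.sqrt(d) for j in range(t + 1)])
        out.append([sum(w[j] * V[j][c] for j in range(t + 1))
                    for c in range(len(V[0]))])
    return out

def decode_attention(q, k_cache, v_cache):
    # Cached path: one query against all cached keys/values.
    # No mask is needed because the cache only holds positions <= t.
    d = len(q)
    w = softmax([dot(q, k) / math.sqrt(d) for k in k_cache])
    return [sum(wj * v[c] for wj, v in zip(w, v_cache))
            for c in range(len(v_cache[0]))]

def max_abs_diff(a, b):
    # Hypothetical gate metric: largest elementwise absolute difference.
    return max(abs(x - y) for x, y in zip(a, b))

rng = random.Random(0)
T, d = 6, 8  # illustrative sizes, not taken from the model
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]

full = full_causal_attention(Q, K, V)
# Gate: at every step t, the cached result must match row t of the full pass.
gate = max(max_abs_diff(full[t], decode_attention(Q[t], K[:t + 1], V[:t + 1]))
           for t in range(T))
```

The naive path does work proportional to the whole prefix for every new token, while the cached path issues one query per step against stored keys and values. That is the structural reason the two can be compared row by row.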
--- a/docs/evolution.md
+++ b/docs/evolution.md
@@ -97,6 +97,8 @@ With the **full pretraining stack** covered in Phase 1/2, Phase 3 turns to **post-training infra** (

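 **M1 (SFT task baseline, landed)**: a verifiable arithmetic task with its data generator and grader; 9/9 host-side unit tests pass (masking, SFT-target self-consistency over 2000 samples, parser edge cases, seed determinism). Single-GPU SFT on dash5 from the v12 base (loss 4.68→~0.34, best val 0.386). **Eval on 100 held-out problems: `\boxed{}` format acquisition rate base 0% → SFT 100%; arithmetic accuracy 8%.** SFT buys only the **format** (a clean 0%→100%); arithmetic correctness is a weakness of the base model itself (e.g. `46*80` boxed as 3380), which is exactly the residual that the verifiable reward in M3/M4 is meant to close. One honest note: M1 uses a **naive sampler with no KV cache** (a full forward pass per token), and 100 problems are already slow, which is the motivation for putting the M2 decode engine first.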

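+**M2a (KV-cache incremental decode engine, single sequence, landed)**: two forward-only primitives plus a per-token block forward on raw Tensors, each behind its own isolated gate. `rope_at` (absolute-position RoPE; a new kernel that leaves the training `rope` untouched → zero risk to the training path) matches, position by position, the corresponding row of the full-sequence rope; `decode_attention` (a single query × cached K/V, composed from the existing strided-gemm and a plain softmax, **zero new kernels**) matches the last row of full causal attention (max|Δ| 6e-8). The engine, `generate_greedy_cached`, mirrors `block_forward` at the Tensor level (no autograd tape, since inference needs no gradients) and fetches weights through the **stable ordering of the public `params()`** (zero changes to model visibility). **Core gate = token-identical**: output matches naive full-recompute greedy decoding token for token (a small GQA unit test, plus on v12 1.05B the cached eval is **byte-identical** to naive: format 100/100, correct 8/100). **Throughput baseline (v12, batch 1, F32, measured profile-first) = the cache win depends on sequence length**: max_new 32 is roughly flat (108 vs 111 tok/s; short sequences are launch-overhead bound), 128 gives **~1.9×** (69 vs 133 tok/s), and at 256 naive **OOMs** while cached runs at 129 tok/s. Cached throughput is **near-constant** (O(1) per token + constant GPU memory); naive **decays** (O(t) per token, O(seq²) graph → OOM). ⇒ Short eval prompts are overhead-bound and gain almost nothing from the cache; the real beneficiaries are **long rollouts** (DPO pair construction / GRPO completions). This is the same measure-first lesson as T17 (process-per-GPU is throughput-neutral): the gain is real, but only in the regime that actually hits the bottleneck. M2a's per-layer host round trip is part of why short sequences are overhead-bound; M2b (device-side cache + batched ragged) targets it.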
+
 ## IV. Perf lever ledger (see [known-issues.md](known-issues.md) for details)

 - **Fixed**: KI-1 single-sequence launch-bound (T10) · KI-5 per-op cudaMalloc serialization (T11) · KI-2 bf16/OOM (T12) · KI-3 activation recomputation (T13, unlocks dim1024, used by v8).