docs: Phase T2 — tensor abstraction

Design doc for the minimal tensor layer: DType/shape/Storage/Tensor, host↔device copy, and one elementwise kernel (scale) wired end-to-end. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:12:55 +08:00
parent c1b204296b
commit 8557a289a2
1 changed files with 134 additions and 0 deletions
--- a/docs/01-tensor.md
+++ b/docs/01-tensor.md
@@ -0,0 +1,134 @@
 # Phase: Tensor & Device Buffer — Design Document
 ## Goal
 在 T1 的 `xtrain-cuda`（`GpuBuffer`/`device`/`error`）之上搭最小张量抽象，
 作为后续 GEMM / autograd / transformer 的数据基础。本 Phase 只做四件事：
 1. `DType`（先 F32，可扩）+ shape/strides；
 2. 引用计数的 host/device `Storage`；
 3. `Tensor`：创建 + host↔device 拷贝；
 4. **一个** elementwise CUDA kernel（`scale`，`out=in*alpha`）端到端打通张量 API。
 **明确不做**（留给 T3+）：GEMM、autograd、broadcast、view/transpose、半精度。
 ## Module Layout
 ```
 crates/xtrain-tensor/
 ├── Cargo.toml          # 依赖 xtrain-cuda + half + smallvec
 ├── build.rs            # 检测 nvcc，缺失则发 no_cuda cfg（与 xtrain-cuda 一致）
 └── src/
    ├── lib.rs          # re-exports
    ├── dtype.rs        # DType{F32} + TensorDType trait
    ├── shape.rs        # contiguous_strides / is_contiguous / num_elements
    ├── storage.rs      # Storage(Arc) + Device，CPU↔CUDA 拷贝
    └── tensor.rs       # Tensor：创建 / 设备迁移 / as_slice / scale kernel
 csrc/ops/elementwise.cu  # scale_f32 + launch_scale_f32（由 xtrain-cuda/build.rs 编）
 ```
 ## Key Design Decisions
 ### DType + TensorDType trait（先 F32）
 ```rust
 pub enum DType { F32 }                 // 后续 T7 混合精度再加 F16/BF16
 pub trait TensorDType: Copy + Send + Sync + 'static {
    const DTYPE: DType;
    fn to_f64(self) -> f64;
    fn from_f64(v: f64) -> Self;
 }
 ```
 trait 让 `from_slice<T>` / `as_slice<T>` 有类型安全。镜像 xserv 的结构，
 但只实现 F32 一种——不提前引入用不到的类型。
 ### Storage 引用计数
 ```rust
 #[derive(Clone)]
 pub struct Storage(Arc<StorageInner>);
 enum StorageInner {
    Cpu  { data: Vec<u8> },
    Cuda { buffer: GpuBuffer, device: u32 },
 }
 ```
 - `Arc` 让未来的 view（transpose/slice）能共享底层数据；T2 暂不产生 view，但类型已就位。
 - `to_device(target)`：同设备返回 `Arc` clone（零拷贝）；
  CPU→CUDA 走 `GpuBuffer::alloc + copy_from_host`(H2D)，CUDA→CPU 走 `copy_to_host`(D2H)。
 - 跨 GPU（CUDA→CUDA 不同卡）T2 不支持（`xtrain-cuda` 暂无 D2D），显式 panic 说明边界。
 - `zeros` 在 GPU 上靠 host 端零缓冲 stage 上去（T2 无 device memset，简单优先；后续可加 kernel）。
 ### Strided Tensor（结构就位，T2 只产生 contiguous）
 ```rust
 pub struct Tensor {
    storage: Storage,
    shape: Dims,        // SmallVec<[usize;4]>，≤4D 免堆分配
    strides: Dims,      // 以元素为单位，row-major
    offset: usize,      // 给未来 slice 留的口子
    dtype: DType,
 }
 ```
 - `strides`/`offset` 字段先放着，T2 创建的张量恒 contiguous、offset=0；
  这样 T3+ 加 view 不必改结构体形状。
 - `is_contiguous()` 校验 strides 是否匹配 shape（size-1 维度的 stride 不计）。
 - `as_slice::<T>()` / `data_ptr()` 要求 contiguous；`data_ptr` 按 dtype 字节算偏移，
  供 kernel launch 用。
 ### Elementwise kernel 端到端（scale）
 CUDA 侧（`csrc/ops/elementwise.cu`）：
 ```cuda
 __global__ void scale_f32(const float* in, float* out, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * alpha;
 }
 ```
 FFI 声明放在 `xtrain-cuda/src/ffi.rs`（与既有 build 链路/`no_cuda` 门控同处），
 张量层 `Tensor::scale(alpha)` 调它：
 ```rust
 #[cfg(not(no_cuda))]
 pub fn scale(&self, alpha: f32) -> Self {     // out-of-place，要求 contiguous F32 CUDA 张量
    let out = Tensor::zeros(&self.shape, self.dtype, self.device());
    unsafe { xtrain_cuda::ffi::launch_scale_f32(self.data_ptr() as *const f32,
             out.data_ptr() as *mut f32, alpha, self.numel() as i32, null_mut()); }
    xtrain_cuda::device::synchronize().unwrap();
    out
 }
 ```
 - kernel FFI 留在 `xtrain-cuda`（构建链路与 `no_cuda` cfg 都在那），张量层只调用——
  避免在张量 crate 里再开一套 nvcc 编译。
 - `scale` 用 `#[cfg(not(no_cuda))]` 门控；为此 `xtrain-tensor` 加了一个**只检测 nvcc、
  不编译任何 .cu** 的 `build.rs`，发同名 `no_cuda` cfg（cfg 不跨 crate 传播，必须各自发）。
 ## 验证方法
 GPU 测试用 `#![cfg(not(no_cuda))]` 门控，在 dash5 实跑：
 ```sh
 ssh dash5
 export PATH=/usr/local/cuda/bin:/opt/wjh/.cargo/bin:$PATH
 cd ~/projects/xtrain && cargo test -p xtrain-tensor -- --nocapture
 ```
 - **(a) host↔device 往返拷贝**：CPU 张量 → CUDA → 拷回 CPU，逐元素 `assert_eq` 原样。
 - **(b) elementwise 正确性**：`scale(3.0)` 后拷回，对 `host[i]*3.0` 逐元素相等。
 本地（无 GPU）：`cargo check --workspace --all-targets` + `cargo fmt --all -- --check` 绿；
 GPU 测试编译出局（约定：本地只 check/fmt，链接+测试都在 dash5）。
 ## Takeaways
 1. **cfg 不跨 crate**：`no_cuda` 由各 crate 自己的 `build.rs` 发；张量 crate 要门控 kernel 调用，
   就得加一个轻量 build.rs（只检测、不编译）。
 2. **结构先于功能**：`strides`/`offset` 先放进结构体，T3 加 view 时不动 shape，降低后续改动面。
 3. **边界显式 panic**：跨 GPU 拷贝、非 contiguous as_slice 等 T2 不支持的路径直接 panic 写清原因，
   而不是悄悄给错结果。
 4. **kernel 收口在 xtrain-cuda**：构建链路单点，张量层保持纯 Rust 调用，符合 T1 立的约定。