docs: T20 — Phase-2 systems-depth capstone (reframe README to two phases)
Re-conclude xtrain as TWO phases now that Phase-2 (T14–T18) is merged on main: README.md - Status header: "complete (T1–T13) + scaling v0–v8" → "complete — two phases" (Phase 1 = from-scratch stack T1–T13 + v0–v8 scaling study; Phase 2 = the five deferred systems-stack features T14–T18). - Crate table: note the Phase-2 additions (fused flash-attn + repeat_kv + dropout in autodiff; GQA + dropout in model; grad-accum in train; process-per-GPU launcher in distributed). - Build-journey section retitled Phase 1 + Phase 2; replaced the run-on T14–T18 prose with a structured "## Phase 2" summary (5 features + honest results: flash = mem-not-walltime win, GQA group-sum backward, grad-accum −74% mem, dropout × recompute bit-exact, T17 throughput-neutral falsification). - Engineering lessons: T17 added as the THIRD profile-first falsification; reinforced honest-correctness with the Phase-2 hard gates + md5 b04fc9f9. - Doc index: doc range …14-* → …17-*; KI status line (process-per-GPU CLOSED, KI-4 accepted tradeoff). docs/evolution.md - New "三·五、Phase 2 systems-depth synthesis": ties the 5 features into the per-axis (算法/架构/Infra/数据) narrative + the two integration notes. docs/known-issues.md - KI-4 reframed as a deliberately-accepted modeling tradeoff (保 xserv closed loop; T19 DROPPED), not "open". - New integration notes: (a) DDP tests need --test-threads=1 (parallel deadlock); (b) fresh-train md5 is non-deterministic (atomicAdd reduction order) → the valid determinism gate is export re-determinism, not fresh-train reproduction. - (process-per-GPU item was already CLOSED=measured no-op in T17.) Docs-only; no code touched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
109
README.md
109
README.md
@@ -6,10 +6,14 @@ inference side). A learning project: hand-write the entire training-systems stac
|
||||
gradient checkpointing), then use it to run a multi-version **scaling study** that maps
|
||||
the data-vs-capacity frontier for a tiny model.
|
||||
|
||||
> **Status: complete.** From-scratch full stack (phases T1–T13) + an 8-version scaling
|
||||
> ladder (v0–v8). Trains a Qwen3-compatible LM whose weights load into **xserv** and
|
||||
> generate **token-identical** output. This README is the capstone; per-topic detail
|
||||
> lives in [`docs/`](docs/).
|
||||
> **Status: complete — two phases.**
|
||||
> **Phase 1** = the from-scratch full stack (T1–T13) + an 8-version scaling study (v0–v8):
|
||||
> hand-write the whole training-systems stack, then map the data-vs-capacity frontier.
|
||||
> **Phase 2** = systems-stack depth (T14–T18): hand-write the five deferred training-stack
|
||||
> features — fused flash-attention, real GQA, gradient accumulation, process-per-GPU DDP,
|
||||
> dropout. Trains a Qwen3-compatible LM whose weights load into **xserv** and generate
|
||||
> **token-identical** output — the closed loop held byte-for-byte across both phases. This
|
||||
> README is the capstone; per-topic detail lives in [`docs/`](docs/).
|
||||
|
||||
---
|
||||
|
||||
@@ -22,20 +26,22 @@ borrows, the rest hand-written CUDA + Rust:
|
||||
|---|---|
|
||||
| `xtrain-cuda` | CUDA Runtime FFI, RAII `GpuBuffer`, **caching/pool allocator**, cuBLAS (sgemm + bf16 GemmEx) bindings |
|
||||
| `xtrain-tensor` | tensor (dtype/shape/strides/storage), elementwise + transpose + embedding kernels |
|
||||
| `xtrain-autodiff` | **tape autograd engine** (grad accumulation), per-op backward, finite-diff grad-check, **checkpoint** (recompute) primitive |
|
||||
| `xtrain-model` | tiny **Qwen3-style** transformer (RoPE + RMSNorm + QK-norm + SwiGLU), batched forward |
|
||||
| `xtrain-autodiff` | **tape autograd engine** (grad accumulation), per-op backward, finite-diff grad-check, **checkpoint** (recompute) primitive, **fused flash-attention** (online-softmax) fwd/bwd, **`repeat_kv`** broadcast (GQA), **`dropout`** (counter-based device RNG + mask) |
|
||||
| `xtrain-model` | tiny **Qwen3-style** transformer (RoPE + RMSNorm + QK-norm + SwiGLU), batched forward, **GQA** (`num_kv_heads<num_heads`), residual/MLP **dropout** |
|
||||
| `xtrain-optim` | hand-written **AdamW** (host + GPU kernels) |
|
||||
| `xtrain-train` | training loop, LR schedule, grad clip, checkpoint, BPE corpus + cache, samplers, safetensors export |
|
||||
| `xtrain-distributed` | **NCCL DDP** (thread-per-GPU + torchrun-style process-per-GPU, all-reduce) |
|
||||
| `xtrain-train` | training loop, LR schedule, grad clip, **gradient accumulation**, checkpoint, BPE corpus + cache, samplers, safetensors export |
|
||||
| `xtrain-distributed` | **NCCL DDP** (thread-per-GPU + torchrun-style process-per-GPU launcher / cross-process `ncclUniqueId`, all-reduce) |
|
||||
|
||||
Every op's backward is verified against **finite differences** and against **PyTorch**
|
||||
(forward + per-parameter grads, batch > 1). Trained weights export to HF-safetensors and
|
||||
load into xserv (Qwen3, BF16) producing token-identical greedy output — the closed loop.
|
||||
|
||||
## The build journey — phases T1–T13
|
||||
## The build journey — Phase 1 (T1–T13) + Phase 2 (T14–T18)
|
||||
|
||||
Each phase: design doc + implementation + tests + a scoped commit (see [`docs/`](docs/) and
|
||||
[`docs/evolution.md`](docs/evolution.md) for the per-axis changelog).
|
||||
[`docs/evolution.md`](docs/evolution.md) for the per-axis changelog). **Phase 1 (T1–T13)**
|
||||
hand-built the stack and fixed the four real bottlenecks; **Phase 2 (T14–T18)** went back to
|
||||
hand-write five deferred training-stack features — see the Phase-2 summary below the table.
|
||||
|
||||
| phase | what | result |
|
||||
|---|---|---|
|
||||
@@ -57,22 +63,48 @@ Each phase: design doc + implementation + tests + a scoped commit (see [`docs/`]
|
||||
| **T18** | **dropout** (hand counter-based device RNG + mask, inverted scaling, train/eval switch) | fixed-seed grad-check; **p=0 bit-identical**; recompute-safe |
|
||||
|
||||
The four performance fixes (T10–T13) each removed a real bottleneck — see
|
||||
[`docs/known-issues.md`](docs/known-issues.md). **Phase 2 (systems-stack depth, T14–)**
|
||||
revisits hand-writing deferred training-stack features: T14 = the fused
|
||||
flash-attention kernel ([`docs/13-flash-attention.md`](docs/13-flash-attention.md));
|
||||
T15 = real grouped-query attention ([`docs/14-gqa.md`](docs/14-gqa.md), `num_kv_heads <
|
||||
num_heads` via a `repeat_kv` broadcast op whose backward sums each kv head's query-head
|
||||
group — feeding both SDPA paths unchanged, default MHA bit-identical);
|
||||
T16 = micro-batch gradient accumulation ([`docs/15-grad-accum.md`](docs/15-grad-accum.md)),
|
||||
which decouples the effective batch from activation memory (memory tracks the micro-batch,
|
||||
not N×); T17 = torchrun-style process-per-GPU DDP
|
||||
([`docs/16-process-per-gpu.md`](docs/16-process-per-gpu.md), one process + CUDA context per
|
||||
GPU, launcher-minted `ncclUniqueId` via env injection, reusing the T8 training step
|
||||
unchanged) — which **measured** that, at this scale, separate contexts give no throughput
|
||||
gain over thread-per-GPU (the residual ~5.3×@8 is the NCCL/PCIe communication wall, not
|
||||
single-context serialization as the old KI-5 note speculated); T18 = dropout
|
||||
([`docs/17-dropout.md`](docs/17-dropout.md), hand counter-based device RNG + mask, inverted
|
||||
scaling, train/eval switch).
|
||||
[`docs/known-issues.md`](docs/known-issues.md) — which is where **Phase 1** closed.
|
||||
|
||||
## Phase 2 — systems-stack depth (T14–T18)
|
||||
|
||||
Phase 1 fixed bottlenecks; Phase 2 went back to hand-write the five training-stack features
|
||||
that had been **explicitly deferred** earlier (project's actual goal = learn the whole stack).
|
||||
Each is opt-in, kept the default path **bit-identical**, and held a **hard correctness gate**:
|
||||
|
||||
- **T14 · fused flash-attention** ([`docs/13-flash-attention.md`](docs/13-flash-attention.md)) —
|
||||
a single hand-written kernel: **online (streaming) softmax, tiled over KV, never materializes
|
||||
the `N×N` scores**; flash-style backward recomputes scores + the `D=ΣdO·O` Jacobian
|
||||
simplification for dQ/dK/dV. Opt-in `--flash`, default off. **The win is memory, not
|
||||
wall-clock**: peak activation **−16%@seq1024 / −23%@seq2048** (grows with seq, since the
|
||||
`N×N` never lands), but **~2.3× slower** at head-dim 64 (a hand kernel can't beat cuBLAS
|
||||
tensor-cores on a small head). Gate: flash == composed (loss rel `0.0`, grad `4.4e-5`),
|
||||
PyTorch B>1 `7.9e-6`.
|
||||
- **T15 · real GQA** ([`docs/14-gqa.md`](docs/14-gqa.md)) — `num_kv_heads < num_heads` via a new
|
||||
`repeat_kv` **broadcast op** that copies K/V `group = nh/num_kv` times to feed **both**
|
||||
(composed + flash) SDPA paths **unchanged**; its **backward is a deterministic group-sum**
|
||||
(no atomics) collapsing each kv head's query-head group. Gate: `repeat_kv` grad-check +
|
||||
**group=1 bit-identical to MHA** (regression guard); **xserv closed loop with real
|
||||
`num_key_value_heads`** token-identical.
|
||||
- **T16 · gradient accumulation** ([`docs/15-grad-accum.md`](docs/15-grad-accum.md)) — N
|
||||
micro-steps scaled by `1/N` accumulate on the tape, then one AdamW step; DDP **all-reduces
|
||||
only at the accumulation boundary**. Decouples effective batch from activation memory: same
|
||||
effective batch 64, big-batch **27.7GB (OOM)** → accum 4×16 **7.2GB (−74%)**. Gate: `accum=N`
|
||||
≡ one N× batch (grad `3.8e-5`); `accum=1` bit-identical.
|
||||
- **T18 · dropout** ([`docs/17-dropout.md`](docs/17-dropout.md)) — a **stateless counter-based
|
||||
device RNG** (Philox-style bit-mix) → Bernoulli mask, inverted `1/(1−p)` scaling in train,
|
||||
identity in eval; wired at the two residual sites (attn-out, mlp-out). Stateless RNG is what
|
||||
makes it **compose bit-exactly with T13 activation recompute** — the backward re-run
|
||||
regenerates the *same* mask from `(seed, index)`. Gate: fixed-seed grad-check; **p=0
|
||||
bit-identical**.
|
||||
- **T17 · process-per-GPU** ([`docs/16-process-per-gpu.md`](docs/16-process-per-gpu.md)) — a
|
||||
torchrun-style launcher: one worker process + CUDA context per GPU, the launcher mints one
|
||||
`ncclUniqueId` and **hex-injects it into each child's env** (no shared FS/TCP, no race); the
|
||||
worker reuses the T8 `train_rank` **unchanged**. Built and **correct** (proc vs thread loss
|
||||
`1.5e-7`, cross-rank `1.2e-7`, xserv md5 identical) — but **measured throughput-neutral**:
|
||||
8-GPU thread **491K (5.27×)** vs proc **493K (5.31×)**, `<1%`. This **falsifies** the
|
||||
long-standing KI-5/T11 hypothesis that thread-per-GPU's shared CUDA context caused the
|
||||
residual ~5×@8; with all 8 GPUs at 95–99% util, the residual is the **NCCL all-reduce + PCIe
|
||||
topology wall**, not context serialization. The third profile-first falsification (see below).
|
||||
|
||||
## The scaling study — v0 → v8
|
||||
|
||||
@@ -122,14 +154,19 @@ versions — a fixed-MFU estimate is off by up to ~100× for the early launch-bo
|
||||
|
||||
## Engineering lessons
|
||||
|
||||
- **Profile before optimizing.** Two "known" perf fixes were *falsified by measurement*
|
||||
before being shipped: "bigger batch fixes DDP scaling" (real cause: single-seq
|
||||
launch-bound → T10) and "bucket the all-reduce" (real cause: per-op `cudaMalloc`
|
||||
serialization → T11 caching allocator). Both would have been no-ops; both got reverted +
|
||||
re-diagnosed instead of shipped.
|
||||
- **Honest correctness.** QK-norm was *added* to match xserv's Qwen3 (not faked); every perf
|
||||
change kept a hard correctness gate (recompute grads bit-identical; bf16 keeps the fp32
|
||||
path untouched; the full grad-check / PyTorch / DDP / xserv suite must stay green).
|
||||
- **Profile before optimizing.** *Three* "known" fixes were *falsified by measurement*: (1)
|
||||
"bigger batch fixes DDP scaling" (real cause: single-seq launch-bound → T10); (2) "bucket the
|
||||
all-reduce" (real cause: per-op `cudaMalloc` serialization → T11 caching allocator); and (3)
|
||||
"process-per-GPU would fix the residual ~5×@8" (T17 — built the torchrun-style launcher and
|
||||
measured it **throughput-neutral**: the residual is the NCCL/PCIe communication wall, not
|
||||
shared-context serialization). All three would have been no-ops; each got measured and either
|
||||
reverted or recorded as a deliberate negative result instead of shipped on faith.
|
||||
- **Honest correctness.** QK-norm was *added* to match xserv's Qwen3 (not faked); every change
|
||||
kept a hard correctness gate, and **no tolerance was ever loosened to go green**. Phase 2 held
|
||||
the line: flash == composed SDPA (grads/PyTorch), GQA group=1 bit-identical to MHA, gradient
|
||||
accumulation `accum=1` bit-identical, dropout p=0 bit-identical *and* dropout × recompute
|
||||
bit-exact, the default path unchanged on every feature, and the **xserv closed-loop md5
|
||||
byte-identical (`b04fc9f9`) throughout both phases**.
|
||||
- **The closed loop matters.** Exporting to xserv and checking token-identical greedy output
|
||||
caught real bugs and proved the whole stack end-to-end.
|
||||
|
||||
@@ -156,5 +193,5 @@ cargo test --workspace # autograd grad-checks, PyTorch parity, DDP, e
|
||||
|
||||
- [`docs/evolution.md`](docs/evolution.md) — per-milestone changes across algorithm / architecture / infra / dataset.
|
||||
- [`docs/runs/README.md`](docs/runs/README.md) — the v0–v8 comparison; [`docs/runs/0N-*.md`](docs/runs/) — per-run detail.
|
||||
- [`docs/00-*` … `14-*`](docs/) — per-phase design docs (build chain → tensor → autograd → transformer → training → perf → distributed → export → batched → allocator → bf16 → recompute → flash-attention → GQA).
|
||||
- [`docs/known-issues.md`](docs/known-issues.md) — perf backlog (KI-1/2/3/5 fixed; KI-4 + process-per-GPU open).
|
||||
- [`docs/00-*` … `17-*`](docs/) — per-phase design docs (build chain → tensor → autograd → transformer → training → perf → distributed → export → batched → allocator → bf16 → recompute → flash-attention → GQA → grad-accum → process-per-GPU → dropout).
|
||||
- [`docs/known-issues.md`](docs/known-issues.md) — perf backlog (KI-1/2/3/5 fixed; process-per-GPU CLOSED = measured no-op; KI-4 = accepted modeling tradeoff).
|
||||
|
||||
@@ -64,6 +64,22 @@
|
||||
- ⭐ **容量轴有用,但也只有 ~3%(v8)**:v6/v7 在 dim768 上「吃不动更多数据」,v8 用最干净的 A/B 回答了「是数据见够还是容量不够」——**冻结数据子集、纯把 dim768→dim1024(core 127M→226M,+78%)**,同 ~1 epoch 下 FineWeb val **3.07→2.98(↓0.085)**,且 v8(1.05ep)还低于 v7(1.45ep 更多老数据)的 3.01。⇒ **容量有用,v6/v7 部分是 capacity-limited(不全是数据见够)**;放大容量比「给小模型多喂老数据」更值。**但增益只有 ~3%**,与数据轴单步杠杆同量级。
|
||||
- 🧭 **元结论:单轴单步都已 ~3%/lever = 全面边际递减,要双轴一起 scale(Chinchilla 小尺度复现)**:把三条轴并起来看——数据量轴(v5/v7 同子集多 epoch,饱和,~1.6–5%/步)、数据广度轴(v6 换语料,是一次性换分布红利)、容量轴(v8,有用但 ~3%)——**到 v8,任何单轴的单步杠杆都收敛到 ~3%/lever**。而 v8 容量 +78% 却只配同样的 2.36B token、val 末步仍在降 ⇒ 数据立刻成新瓶颈。⇒ **要继续进步,容量与数据必须匹配地一起 scale,而不是单独猛拨一根轴**——这正是 Chinchilla 在这个 toy 尺度上的复现。
|
||||
|
||||
## 三·五、Phase 2 系统栈深度综合(T14–T18 五条特性按四维收束)
|
||||
|
||||
scaling 科学线(v0–v8)收官后,项目重启回到本职「学训练全栈」,把此前**显式延后**的五条训练栈特性补齐。区别于 Phase 1 的「修真实瓶颈」(T10–T13 每条都治一个 KI),Phase 2 是**补齐标配 + 一次诚实的负结果**。五条按四维落点:
|
||||
|
||||
- **算法**三条 = **flash-attention(T14)** + **梯度累积(T16)** + **dropout(T18)**。
|
||||
- 三条里 **T14/T16 与 Phase 1 的 T13 一起构成可叠加的「显存三杠杆」**:T13 压激活峰值、T14 不物化 N×N attention scores(收益随 seq 增长)、T16 解耦有效 batch 与激活显存(显存随 micro 不随 N×)——三者正交叠加可放大有效 batch / seq。
|
||||
- **T18 dropout 的设计点是 stateless counter-based RNG**:mask 由 `(seed, 元素下标)` 无状态产出,所以**与 T13 激活重计算天然 bit-exact 组合**——反向重算时同 seed 重生同一张 mask,梯度逐位一致。这是两条 Phase-2/Phase-1 特性的正交性被正确性闸门钉死的一个例子。
|
||||
- 诚实账:flash-attention **赢在显存不赢墙钟**(hd=64 小头维手写 kernel ~2.3× 慢于 cuBLAS tensor-core),opt-in 默认关、不回归。
|
||||
- **模型架构**一条 = **真 GQA(T15)**:架构补齐到现代 LLM 标配(MHA/GQA/MQA 一条 `num_kv_heads` 轴)。实现关键 = `repeat_kv` broadcast 算子的**反向组内确定性求和(无 atomic)**,让 K/V 零改动喂进 composed + flash 两条 SDPA;`group=1` 对 MHA **逐位一致**作回归保护,导出真 `num_key_value_heads` 且 xserv 闭环真 GQA。
|
||||
- **Infra**一条 = **process-per-GPU(T17)**,但它是**实测负结果**而非性能提升:落地 torchrun 式独立进程/CUDA context 标准链路(复用 T8 train_rank 零改动),却实测本尺度**吞吐中性**(thread ~5.27× vs proc ~5.31×@8,差<1%,8 卡全 95–99% util),把 KI-5/T11 doc 长挂的「共享单 context 致残留 ~5×@8」猜想**钉死推翻**——残留是 NCCL all-reduce + PCIe 拓扑墙,非 context 串行。**方法论与 Phase 1 的 T11(证伪「分桶 all-reduce」)一脉相承:profile/measure-first。**
|
||||
- **数据集**零条:Phase 2 不动数据轴(KI-4 小词表用户拍板 **drop 以保 xserv gpt2-tokenizer 闭环**,转记为接受的建模权衡,见 known-issues)。
|
||||
|
||||
**Phase 2 的统一闸门 = 诚实正确性,全程未为凑绿放宽容差**:flash==composed(grad/PyTorch)、GQA group=1 == MHA 逐位、accum=1 逐位、dropout p=0 逐位 + dropout×重算 bit-exact、每条特性默认路径不变、**xserv 闭环 md5 `b04fc9f9` 两阶段全程逐位一致**。
|
||||
|
||||
> 📌 两条 integration 发现(非回归,pre-existing,记账):① **DDP 三个测试并行会争 2 卡 deadlock** → 文档/测试用 `--test-threads=1`(或标 serial)跑。② **fresh-train md5 run-to-run 不定**——反向 atomicAdd 归约序非确定 → 有效的确定性闸门是**导出(export)重确定性**(同 ckpt 重导 safetensors md5 逐位一致),**不是** fresh-train 复现。
|
||||
|
||||
## 四、perf 杠杆台账(详见 [known-issues.md](known-issues.md))
|
||||
|
||||
- **已修**:KI-1 单序列 launch-bound(T10)· KI-5 per-op cudaMalloc 串行(T11)· KI-2 bf16/OOM(T12)· KI-3 激活重计算(T13,解锁 dim1024,v8 用上)。
|
||||
|
||||
@@ -156,6 +156,20 @@ _(KI-3 已在 T13 FIXED,见上方 Fixed。本节暂无待启项。)_
|
||||
|
||||
## Modeling notes
|
||||
|
||||
### KI-4 · 大词表 embedding 占比过高
|
||||
### KI-4 · 大词表 embedding 占比过高 — `接受的建模权衡(用户拍板,T19 DROPPED)`
|
||||
- gpt2 `vocab=50257` 在 dim 小时让 embed+lm_head 主导参数:v1 25.7M/34M、v2 38.6M/66.9M;core transformer 才是学习主体。
|
||||
- 后续可考虑更贴合 TinyStories 的小 vocab(会牺牲 xserv gpt2-tokenizer 复用);或在更大 dim 下让 core 自然成为主体(继续 scaling 即可缓解占比)。
|
||||
- **决定(不是 open,是一个被接受的权衡)**:曾计划 T19 训一个更贴语料的小 vocab 来压 embed 占比,**用户拍板 DROP**——保住 **xserv gpt2-tokenizer 闭环**(项目皇冠:导出权重回流 xserv 逐 token 一致)比清理 embedding 占比更值。小 dim 下 embed 占比高=复用 gpt2 大词表的**已知、接受的代价**。
|
||||
- 缓解路径仍在:更大 dim 时 core 自然成为主体(继续 scaling 即可摊薄占比,v8 dim1024 core 226M 已主导);若日后愿意放弃该版闭环再重启小词表(见 [`~/toc/projects/xtrain.md`](../../toc/projects/xtrain.md) T19)。
|
||||
|
||||
---
|
||||
|
||||
## 集成 / 测试注记(pre-existing,非回归,记账)
|
||||
|
||||
### DDP 三测并行会争卡 deadlock → `--test-threads=1`
|
||||
- `xtrain-distributed` 的三个 DDP 测试(thread-per-GPU correctness / scaling、process-per-GPU `ddp_proc`)若被 cargo **并行**跑,会在共享的 2 卡上互相争 GPU/NCCL 资源 **deadlock**(不是数值 bug,是测试调度)。
|
||||
- **跑法**:`cargo test ... -- --test-threads=1`(或把 DDP 测试标 serial)串行跑即全绿。Phase-2 全回归 capture 均在 `--test-threads=1` 下取得。
|
||||
|
||||
### fresh-train md5 run-to-run 不定 → 有效确定性闸门是「导出重确定性」而非「fresh-train 复现」
|
||||
- **现象**:从随机初始化全新训练(fresh-train)产出的 ckpt,其 md5 **run-to-run 不逐位复现**。
|
||||
- **根因**:反向里多处 `atomicAdd` 归约(如跨行 dK/dV、扇出累加)的浮点**加法序非确定**(GPU 原子操作完成序不固定)→ 末位 ULP 抖动逐步累积 → ckpt 字节不同。属本机/本版的已知浮点非确定性,**不是正确性回归**(loss 轨迹仍同量级收敛)。
|
||||
- **因此项目的确定性硬闸门定义为「导出(export)重确定性」**:拿**同一个已固定的 ckpt** 重新导出 HF-safetensors,md5 与 registry **逐位一致**(`b04fc9f9`,两阶段全程)——这条是确定性的、承重的;**不要求** fresh-train 字节复现。
|
||||
|
||||
Reference in New Issue
Block a user