# kvcache-simulator
Discrete-event simulator for cluster-level LLM **prefill** serving with a
two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing.
Replays real production traces against a synthetic cluster so you can
ablate routing strategies and cache sizing without spinning up any GPUs.
Assumes **PD (prefill/decode) disaggregation** — only the prefill path is
modeled.
## Features
- **Architecture-derived roofline compute** — auto-derives FLOPs,
attention coefficients, and weight-streaming costs from model
architecture (MoE, MLA, GQA, DSA, sliding window).
- **HuggingFace config.json auto-parsing** — drop in any HF
`config.json` and the simulator extracts layer counts, attention heads,
MoE expert configs, MLA LoRA ranks, and DSA sparse parameters.
- **Built-in GPU hardware presets** — H100, H800, H20, A100-80GB,
A100-40GB, B200 with tensor-parallel scaling (e.g. `8xb200`).
- **Two-tier KV cache hierarchy** — L0 (GPU HBM) + L1 (CPU DRAM) with
LRU eviction and cross-instance RDMA fetch via a cluster-wide
meta-store.
- **11 routing policies** — from baselines (random, round-robin) to
cache-aware (min\_pd, prefix\_affinity) for systematic ablation.
- **Token-bucket link contention** — PCIe and RDMA bandwidth modeled with
reservation-based token-bucket queues (see the sketch after this list).
- **Oracle analysis** — computes theoretical hit-rate ceilings (infinite
cache, Belady optimal, LRU) for gap analysis.
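
The link-contention model deserves a concrete picture: each PCIe or RDMA
transfer reserves capacity on the shared link and queues behind earlier
reservations. Here is a minimal sketch of that mechanism, with illustrative
names (not the simulator's internal API):

```rust
/// Reservation-based token bucket for a shared link (PCIe or RDMA).
/// Illustrative sketch; the simulator's internal types differ.
struct LinkBucket {
    bandwidth: f64, // bytes per second
    next_free: f64, // simulation time (s) at which the link is next idle
}

impl LinkBucket {
    fn new(bandwidth: f64) -> Self {
        Self { bandwidth, next_free: 0.0 }
    }

    /// Reserve a transfer of `bytes` starting no earlier than `now`.
    /// Later reservations queue behind earlier ones, so concurrent
    /// fetches see each other's contention. Returns (start, finish).
    fn reserve(&mut self, now: f64, bytes: f64) -> (f64, f64) {
        let start = self.next_free.max(now);
        let finish = start + bytes / self.bandwidth;
        self.next_free = finish;
        (start, finish)
    }
}

fn main() {
    // Two 1 GB KV fetches issued at t=0 over a 50 GB/s RDMA link:
    let mut rdma = LinkBucket::new(50.0e9);
    println!("{:?}", rdma.reserve(0.0, 1.0e9)); // (0.0, 0.02)
    println!("{:?}", rdma.reserve(0.0, 1.0e9)); // (0.02, 0.04) -- queued
}
```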
## Build
```bash
cargo build --release
# binary: target/release/kvcache-sim
```
Fetch the upstream trace (consumed as a git submodule):
```bash
git submodule update --init --recursive
```
## Usage
### 1. Run a single simulation
```bash
target/release/kvcache-sim run --config configs/glm5-8xb200-hf.yaml
```
Prints `summary.json` to stdout and writes the full output directory
(see [Outputs](#outputs) below).
### 2. Compare routers on the same trace (ablation)
```bash
target/release/kvcache-sim ablate \
--config configs/glm5-8xb200-hf.yaml \
--routers random,least_loaded,least_tokens,min_pd,prefix_affinity \
--evict-policies lru \
--output-dir runs/glm5_ablation
```
Writes `ablation.json` with one row per `router x evict_policy`.
`ablate` currently supports only `lru` as a valid eviction policy. The
aggregated output keeps the online prefill-time metrics
(`ttft_mean/p50/p95/p99`) and omits `e2e`.
The previous replay-based `belady` approximation has been removed from
the CLI because it was not an exact full-hierarchy Belady algorithm and
could produce misleading comparisons against `lru`.
### 3. Compute theoretical hit-rate ceilings (oracle)
```bash
# Cluster-aggregate capacity (default)
target/release/kvcache-sim oracle \
--config configs/glm5-8xb200-hf.yaml --num-instances 64
# A single instance's HBM budget
target/release/kvcache-sim oracle \
--config configs/glm5-8xb200-hf.yaml --per-instance
# Explicit capacity in blocks
target/release/kvcache-sim oracle \
--config configs/glm5-8xb200-hf.yaml --capacity-blocks 200000
```
Reports three numbers:
- `unlimited.hit_rate` — absolute ceiling (infinite cache)
- `belady_finite.hit_rate` — optimal-eviction ceiling at the given capacity
- `lru_finite.hit_rate` — production LRU at the same capacity
Gap between `lru_finite` and `belady_finite` = headroom from a smarter
eviction policy. Gap between `belady_finite` and `unlimited` = headroom
only reachable by adding capacity.
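
For intuition on `belady_finite`, here is a minimal single-tier sketch of
Belady-optimal hit counting over a flat trace of block hashes. It is
illustrative only, not the oracle's implementation:

```rust
use std::collections::HashMap;

/// Counts hits for an optimal-eviction cache of `capacity` blocks.
/// O(n * capacity) scan; fine for intuition, too slow for big traces.
fn belady_hits(trace: &[u64], capacity: usize) -> usize {
    // For each position, precompute where the same block is next accessed.
    let mut next_use = vec![usize::MAX; trace.len()];
    let mut last_seen: HashMap<u64, usize> = HashMap::new();
    for i in (0..trace.len()).rev() {
        next_use[i] = *last_seen.get(&trace[i]).unwrap_or(&usize::MAX);
        last_seen.insert(trace[i], i);
    }

    let mut cache: HashMap<u64, usize> = HashMap::new(); // block -> next use
    let mut hits = 0;
    for (i, &block) in trace.iter().enumerate() {
        if cache.remove(&block).is_some() {
            hits += 1;
        }
        if next_use[i] == usize::MAX {
            continue; // never reused: the optimal policy never caches it
        }
        cache.insert(block, next_use[i]);
        if cache.len() > capacity {
            // Evict whichever resident block is reused farthest in the
            // future (possibly the block just inserted).
            let victim = *cache.iter().max_by_key(|&(_, nu)| *nu).unwrap().0;
            cache.remove(&victim);
        }
    }
    hits
}
```

LRU at the same capacity uses the identical loop with a recency queue in
place of `next_use`, which is exactly why the two finite numbers are
directly comparable.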
### 4. Validate a config without running
```bash
target/release/kvcache-sim validate --config configs/glm5-8xb200-hf.yaml
```
Parses the YAML, prints derived per-instance block budgets, and dumps
the first 5 trace records so you can sanity-check the path.
## CLI overrides
These flags work on **all** subcommands and override the YAML in place,
so the same config can be reused across sweeps:
| Flag | Overrides |
|--------------------------|-------------------------------------------|
| `--num-instances <N>` | `cluster.num_instances` |
| `--max-requests <N>` | `sim.max_requests` |
| `--trace <PATH>` | `sim.trace_path` |
| `--output-dir <PATH>` | `sim.output_dir` |
| `--seed <N>` | `sim.seed` |
| `--precise-topk <N>` | `cluster.router.precise_probe_topk` |
| `--ttl-seconds <S>` | `cluster.meta_store.ttl_seconds` |
`oracle` additionally takes `--capacity-blocks <N>` / `--per-instance`
and `--out <PATH>`. `ablate` additionally takes `--routers <csv>` and
`--evict-policies <csv>` (currently only `lru`).
## Router modes
Set `cluster.router.mode` in the YAML or list in `--routers`:
| Mode | Aliases | What it does |
|-------------------|------------------|--------------------------------------------------------------------------------------|
| `random` | | Uniform random. Baseline. |
| `round_robin` | `rr` | Deterministic round-robin. Baseline. |
| `least_loaded` | | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind load balance. |
| `least_tokens` | `lt` | `argmin(waiting_tokens)`. Pure load balance by queued compute work. |
| `ttl_aware` | `ttl` | Picks instance with longest prefix in the global TTL meta-store. Cache-only. |
| `precise` | `precise_aware` | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |
| `min_pd` | `minpd`, `pd` | Minimizes `P*D` (prefill tokens x ongoing requests). Cluster-wide RDMA-aware. |
| `cache_load` | `cl` | Filters to least-loaded 1/4 instances, then picks best cache prefix. |
| `cache_score` | `cs` | Exponential scoring: `2^(alpha * queue_len + beta * miss_blocks)`. |
| `estimated_ttft`  | `ettft`, `optimal` | Estimates `drain_time + fetch_time` per instance using architecture-aware compute.   |
| `prefix_affinity` | `affinity`, `pa` | Rendezvous-hashed prefix fingerprinting for deterministic cache locality. |
### Router parameters
These fields in `cluster.router` tune specific routers:
| Field | Default | Used by | Description |
|--------------------------|---------|------------------|------------------------------------------------------|
| `load_alpha` | `1.0` | `least_loaded` | Weight of queue\_len vs kv\_blocks\_used |
| `score_alpha` | `1.0` | `cache_score` | Load weight in `2^(alpha*load + beta*miss)` |
| `score_beta` | `0.1` | `cache_score` | Cache-miss weight in `2^(alpha*load + beta*miss)` |
| `prefix_k` | `8` | `prefix_affinity`| Number of leading blocks for the prefix fingerprint |
| `affinity_fan_out` | `0` | `prefix_affinity`| Top-K affinity candidates (0 = auto: n/8, min 2) |
| `precise_probe_latency_us`| `50.0`  | `precise`        | Simulated per-probe latency (microseconds)           |
| `precise_probe_topk` | `4` | `precise` | Number of instances probed |
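
To make the scoring rules concrete, here is an illustrative sketch of two of
them using the parameters above; the function names and `InstanceState`
shape are assumptions, not the simulator's actual types:

```rust
/// Per-instance state visible to the router (assumed shape).
struct InstanceState {
    kv_blocks_used: u64,
    queue_len: u64,
    miss_blocks: u64, // request blocks absent from this instance's cache
}

/// `least_loaded`: argmin(kv_blocks_used + load_alpha * queue_len)
fn least_loaded(insts: &[InstanceState], load_alpha: f64) -> usize {
    let score = |i: &InstanceState| i.kv_blocks_used as f64 + load_alpha * i.queue_len as f64;
    (0..insts.len())
        .min_by(|&a, &b| score(&insts[a]).total_cmp(&score(&insts[b])))
        .unwrap()
}

/// `cache_score`: argmin 2^(score_alpha * queue_len + score_beta * miss_blocks),
/// reading the exponential formula as "smaller is better".
fn cache_score(insts: &[InstanceState], alpha: f64, beta: f64) -> usize {
    let score = |i: &InstanceState| (alpha * i.queue_len as f64 + beta * i.miss_blocks as f64).exp2();
    (0..insts.len())
        .min_by(|&a, &b| score(&insts[a]).total_cmp(&score(&insts[b])))
        .unwrap()
}
```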
### Router design spectrum
```
Cache-only                          Hybrid                          Load-only
(hot-spot risk)                                                       (cache-blind)
┌───────────┬────────────┬─────────────┬──────────┬────────────┬─────────────┐
ttl_aware   precise      cache_score   min_pd     prefix_      least_        random
            cache_load                            affinity     loaded
            est_ttft                                           least_tokens
```
`prefix_affinity` sits in a unique position: it builds **proactive cache
locality** by consistently routing same-prefix requests to the same
instances (via rendezvous hashing), rather than reactively chasing
existing cache state. This yields the highest L0 hit rates while
maintaining load balance through within-group drain-time-aware selection.
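
A minimal sketch of the rendezvous-hashing idea, assuming the fingerprint is
built from the first `prefix_k` block hashes (hash function and names are
illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Rendezvous (highest-random-weight) hashing sketch: the first `prefix_k`
/// block hashes form a fingerprint, every instance scores it independently,
/// and the top `fan_out` scorers become the candidate set. Same prefix ->
/// same candidates, with no coordination or shared state.
fn affinity_candidates(
    block_hashes: &[u64],
    prefix_k: usize,
    num_instances: usize,
    fan_out: usize,
) -> Vec<usize> {
    // fan_out = 0 means auto: n/8, minimum 2 (per the parameter table).
    let fan_out = if fan_out == 0 { (num_instances / 8).max(2) } else { fan_out };

    let mut h = DefaultHasher::new();
    block_hashes[..prefix_k.min(block_hashes.len())].hash(&mut h);
    let fingerprint = h.finish();

    let mut scored: Vec<(u64, usize)> = (0..num_instances)
        .map(|inst| {
            let mut h = DefaultHasher::new();
            (fingerprint, inst).hash(&mut h);
            (h.finish(), inst) // per-instance weight for this prefix
        })
        .collect();
    scored.sort_unstable_by(|a, b| b.0.cmp(&a.0)); // highest weight first
    scored.into_iter().take(fan_out).map(|(_, inst)| inst).collect()
}
```

The router then picks within this candidate group by estimated drain time,
which is what keeps the scheme load-balanced despite being deterministic.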
## Model configuration
### HuggingFace config.json (recommended)
Point `model.config_json` at any HF `config.json` to auto-extract
architecture:
```yaml
model:
  config_json: ../models/GLM-5/config.json
  dtype_bytes: 2          # required (not in HF schema)
  block_size_tokens: 512  # required (not in HF schema)
```
Auto-detected features:
| Feature | Detection trigger | What it extracts |
|-----------|-------------------------------|----------------------------------------------|
| **MoE** | `n_routed_experts`, `num_local_experts`, or `num_experts` | Expert count, active experts, shared experts, expert FFN width |
| **MLA** | `kv_lora_rank` present | KV/Q LoRA ranks, qk\_rope/nope dims, v\_head\_dim |
| **DSA** | `first_k_dense_replace` present| Dense window, sparse stride, first dense layers |
| **Sliding window** | `sliding_window` present | Window size |
| **GQA** | `num_key_value_heads < num_attention_heads` | KV head count for grouped-query attention |
Explicit YAML fields always override the auto-detected values.
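
The detection triggers above amount to presence checks on the parsed JSON.
An illustrative sketch (not the simulator's actual parser):

```rust
use serde_json::Value;

/// Presence checks mirroring the detection table; field names come from
/// the HF config schema.
fn detect_features(cfg: &Value) -> Vec<&'static str> {
    let mut feats = Vec::new();
    if ["n_routed_experts", "num_local_experts", "num_experts"]
        .iter()
        .any(|&k| cfg.get(k).is_some())
    {
        feats.push("MoE");
    }
    if cfg.get("kv_lora_rank").is_some() {
        feats.push("MLA");
    }
    if cfg.get("first_k_dense_replace").is_some() {
        feats.push("DSA");
    }
    if cfg.get("sliding_window").map_or(false, |v| !v.is_null()) {
        feats.push("sliding window");
    }
    let kv = cfg.get("num_key_value_heads").and_then(Value::as_u64);
    let heads = cfg.get("num_attention_heads").and_then(Value::as_u64);
    if matches!((kv, heads), (Some(kv), Some(h)) if kv < h) {
        feats.push("GQA");
    }
    feats
}
```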
### Inline specification
Alternatively, specify architecture fields directly:
```yaml
model:
  name: qwen2.5-coder-7b
  num_layers: 28
  hidden_size: 3584
  num_attention_heads: 28
  num_kv_heads: 4
  head_dim: 128
  intermediate_size: 18944
  dtype_bytes: 2
  block_size_tokens: 16
```
When `hidden_size` is present, the compute model is auto-derived
(architecture mode). Without it, you must supply legacy manual
coefficients (`flops_per_token_prefill`, `attn_quadratic_coeff`, etc.).
### Bundled model configs
| Model | Path | Architecture |
|-------|------|--------------|
| GLM-5 (744B/40B-active) | `models/GLM-5/config.json` | MoE (256 routed, 8 active, 1 shared) + MLA + DSA |
| GLM-5-FP8 | `models/GLM-5-FP8/config.json` | GLM-5 architecture + upstream FP8 quantization metadata |
| Qwen3-Coder-480B-A35B FP8 | `models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json` | MoE (160 experts, 8 active) + GQA |
## Hardware configuration
### Using presets (recommended)
Set `hardware.type` to a preset name — individual fields can override:
```yaml
hardware:
  type: 8xb200
  hbm_bytes: 500.0e9  # override KV budget (after model weights)
```
Available presets:
| Preset | FLOPS | HBM | Mem BW | PCIe |
|-------------|------------|---------|------------|------|
| `h100` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| `h800` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| `h20` | 148 TFLOPS | 96 GB | 4.0 TB/s | Gen5 |
| `h20-141g` | 148 TFLOPS | 141 GB | 4.8 TB/s | Gen5 |
| `a100-80gb` | 312 TFLOPS | 80 GB | 2.0 TB/s | Gen4 |
| `a100-40gb` | 312 TFLOPS | 40 GB | 1.555 TB/s | Gen4 |
| `b200` | 2.25 PFLOPS| 192 GB | 8.0 TB/s | Gen6 |
Prefix with `2x`, `4x`, or `8x` for tensor-parallel groups (e.g.
`8xh20`). FLOPS, memory bandwidth, and HBM scale linearly; RDMA and
DRAM are set to sensible per-node defaults.
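
The scaling rule amounts to multiplying the per-GPU numbers by the TP
degree. A trivial sketch with assumed field names:

```rust
/// The stated rule for `Nx` presets: FLOPS, memory bandwidth, and HBM
/// scale with the TP degree; per-node links (RDMA, DRAM) keep defaults.
struct GpuPreset {
    gpu_flops: f64,
    gpu_mem_bw: f64,
    hbm_bytes: f64,
}

fn tp_group(base: &GpuPreset, tp: u32) -> GpuPreset {
    let f = tp as f64;
    GpuPreset {
        gpu_flops: base.gpu_flops * f,
        gpu_mem_bw: base.gpu_mem_bw * f,
        hbm_bytes: base.hbm_bytes * f,
    }
}
```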
### Inline specification
```yaml
hardware:
  gpu_flops: 1.80e16
  gpu_mem_bw: 6.40e13
  hbm_bytes: 500.0e9
  dram_bytes: 1.5e12
  pcie_bw: 128.0e9
  pcie_latency_us: 4.0
  rdma_bw: 50.0e9
  rdma_latency_us: 6.0
  max_batch_slots: 256
  prefill_chunk_tokens: 4096
```
## Architecture-aware compute model
The simulator derives a **roofline prefill model** from model
architecture:
```
prefill_time(N tokens) = max(compute_time, memory_time)
compute_time = layers * (N * linear_flops + attn_coeff * N * effective_ctx(N)) / gpu_flops
memory_time = layers * weight_bytes_per_layer / gpu_mem_bw
```
- **MoE**: only active experts contribute to FLOPs and weight streaming
(shared experts always counted)
- **MLA**: compressed KV projections reduce attention FLOPs; KV cache
uses `kv_lora_rank + qk_rope_head_dim` instead of `2 * kv_heads * head_dim`
- **DSA**: `effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride`,
with the first K layers using full dense attention
- **GQA**: fewer KV heads reduce both attention compute and KV cache size
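
Putting the pieces together, a minimal sketch of the roofline formula with
the DSA context reduction (parameter names are illustrative, and the
first-K-dense-layer detail is omitted for brevity):

```rust
/// Roofline prefill sketch following the formulas above. The simulator
/// derives these parameters from the model config; names are assumed.
struct DerivedArch {
    layers: f64,
    linear_flops: f64,           // FLOPs/token in the linear path (active experts only)
    attn_coeff: f64,             // attention coefficient (reduced by MLA/GQA)
    weight_bytes_per_layer: f64, // streamed weight bytes per layer
    dense_window: f64,           // DSA dense window (f64::INFINITY for dense models)
    sparse_stride: f64,          // DSA stride beyond the window
}

impl DerivedArch {
    /// effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride
    fn effective_ctx(&self, n: f64) -> f64 {
        n.min(self.dense_window) + (n - self.dense_window).max(0.0) / self.sparse_stride
    }

    /// prefill_time(N) = max(compute_time, memory_time)
    fn prefill_time(&self, n: f64, gpu_flops: f64, gpu_mem_bw: f64) -> f64 {
        let compute = self.layers
            * (n * self.linear_flops + self.attn_coeff * n * self.effective_ctx(n))
            / gpu_flops;
        let memory = self.layers * self.weight_bytes_per_layer / gpu_mem_bw;
        compute.max(memory)
    }
}
```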
## Bundled config files
| Config | Model | Hardware | Instances | Trace |
|--------|-------|----------|-----------|-------|
| `glm5-8xb200-hf.yaml` | GLM-5 via HF config.json | 8xB200 preset | 32 | GLM coder blk512 |
| `glm5-fp8-8xh20-141g.yaml` | GLM-5-FP8 via ModelScope config.json | 8xH20-141G preset | 128 | GLM coder blk512 |
| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 via HF config.json | 8xB300 preset | 8 | GLM coder blk512 |
| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder via HF | 8xH20 preset | 32 | Qwen coder blk16 |
## Outputs
Each run writes a directory under `sim.output_dir`:
| File | Contents |
|----------------------|----------------------------------------------------------------------------|
| `summary.json` | Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes |
| `per_request.csv` | `req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s` |
| `instances.csv` | `t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample |
| `routing_log.jsonl` | One JSON per request: all router candidates + chosen instance + reason |
For `ablate`: an extra `ablation.json` with one summary per router.
For `oracle`: an `oracle.json` with the three hit-rate analyses.
### Reading results quickly
```bash
# Pretty-print the summary
cat runs/glm5_8xb200_hf/summary.json | jq .
# Compare all routers from an ablation
cat runs/glm5_8xb200_hf/ablation.json | \
jq '.[] | {router, ttft_mean, ttft_p50, hit_rate_l0, miss_rate}'
# Sort by TTFT
cat runs/glm5_8xb200_hf/ablation.json | \
jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0}'
```
## Trace format
The simulator reads the Alibaba
[`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon)
JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`,
`output_length`, and `hash_ids` (block hashes, typically 16 tokens each).
Only the input side is used.
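
A record can be deserialized with a struct like the following sketch (field
types are inferred from the descriptions above):

```rust
use serde::Deserialize;

/// One JSONL trace record, per the schema described above.
#[derive(Debug, Deserialize)]
struct TraceRecord {
    chat_id: String,
    timestamp: f64,     // arrival time
    input_length: u64,  // prompt tokens -- the only side the simulator uses
    output_length: u64, // present in the trace but ignored for prefill
    hash_ids: Vec<u64>, // content hashes of fixed-size prompt blocks
}

fn parse_record(line: &str) -> serde_json::Result<TraceRecord> {
    serde_json::from_str(line)
}
```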
Available traces in the submodule:
| Trace | Requests | Description |
|-------|----------|-------------|
| `qwen_coder_blksz_16.jsonl` | 43k | Qwen Coder serving traffic |
| `qwen_traceA_blksz_16.jsonl` | 43k | Qwen general traffic A |
| `qwen_traceB_blksz_16.jsonl` | 173k | Qwen general traffic B |
| `qwen_thinking_blksz_16.jsonl` | 11k | Qwen reasoning/thinking traffic |
## Testing
```bash
cargo test --release
```
28 tests: 27 unit tests (compute model, HF config parsing, hardware
presets) + 1 integration smoke test that runs routers on a synthetic
shared-prefix trace and asserts the expected hit-rate ordering.