# kvcache-simulator
Discrete-event simulator for cluster-level LLM **prefill** serving with a
two-tier KV cache and routing experiments. The simulator models a
PD-disaggregated deployment: only the **prefill** path is simulated, while
decode is reduced to a small completion tail for TTFT/E2E accounting.
It is intended to answer questions like:
- How much do different KV-aware routers help on the same trace?
- How much HBM/DRAM capacity is enough before routing dominates?
- How do prefix-locality policies behave under bucketed input-length pools?
- What is the gap between online LRU and offline-optimal cache capacity?
## What The Repo Models
- **Architecture-derived prefill cost** from model structure, including MoE,
MLA, GQA, sliding-window attention, and DSA.
- **Two-tier KV hierarchy** with L0 GPU HBM and L1 host DRAM, plus remote
RDMA fetches via a meta-store.
- **Single-pool and bucketed clusters**. Bucketed mode separates the service
into input-length buckets with isolated instance pools and meta-stores.
- **Local instance routing and global bucket routing** with detailed
per-request routing logs.
- **Trace replay with optional input-length filtering** so the same trace can
be sliced into buckets without rewriting the source file.
- **Offline oracle analysis** for unlimited capacity, Belady, and LRU hit-rate
ceilings.
## Highlights
- **HF `config.json` auto-loading**: point `model.config_json` at a model
config and the simulator derives architecture parameters automatically.
- **Hardware presets**: `h100`, `h800`, `h20`, `h20-141g`, `a100-80gb`,
`a100-40gb`, `b200`, and `b300`, plus TP variants such as `8xb200`.
- **18 local router modes** covering baselines, load-based, cache-aware,
affinity, and TTFT-estimating policies.
- **2 global bucket router modes**: `strict_input_length` and `bucket_score`.
- **Detailed outputs**: `summary.json`, `per_request.csv`, `instances.csv`,
`routing_log.jsonl`, plus `ablation.json` / `oracle.json` when applicable.
## Build
```bash
cargo build --release
```
If you want the public Qwen trace submodule as well:
```bash
git submodule update --init --recursive
```
The release binary is:
```bash
target/release/kvcache-sim
```
## Quick Start
Validate a config:
```bash
target/release/kvcache-sim validate --config configs/glm5-8xb200.yaml
```
Run one simulation:
```bash
target/release/kvcache-sim run --config configs/glm5-8xb200.yaml
```
Compare several routers on the same trace:
```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --routers random,least_loaded,cache_score,cache_affinity,estimated_ttft
```
Auto-pick the smallest cluster size that meets a TTFT target, then ablate at
that size:
```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --auto-instances \
  --auto-probe-router cache_score \
  --auto-target-ttft-mean 4.0
```
Run the oracle:
```bash
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200.yaml \
  --per-instance
```
`run` prints `summary.json` to stdout and also writes the full output directory
under `sim.output_dir`.
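Because the summary goes to stdout, it pipes straight into `jq`. A minimal sketch; the field name below (`ttft_mean`) is assumed from the ablation example later in this README, so check your own `summary.json` for the exact schema:
```bash
# Pull one headline metric from the stdout summary.
# `ttft_mean` is an assumed field name; inspect summary.json for the real keys.
target/release/kvcache-sim run --config configs/glm5-8xb200.yaml | jq '.ttft_mean'
```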
## Current Command Boundaries
The repository now supports both legacy single-pool clusters and bucketed
service topologies, but not every CLI path supports both yet.
- `run`: supports `cluster.num_instances` and `cluster.buckets`
- `validate`: supports `cluster.num_instances` and `cluster.buckets`
- `ablate`: currently **single-pool only**
- `ablate --evict-policies`: currently supports **`lru` only**
- `oracle`: currently **single-pool only**
- `--num-instances` override: currently **single-pool only**
- `--auto-instances`: currently **single-pool only**
In practice, bucket-aware experiments are ready in `run`, while `ablate` and
`oracle` still reject `cluster.buckets`.
## Config Model
### Single-Pool Cluster
Use `cluster.num_instances` for the original flat instance pool:
```yaml
cluster:
  num_instances: 32
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
```
### Bucketed Service
Use `cluster.buckets` plus a `global_router` to model explicit input-length
buckets:
```yaml
cluster:
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    load_alpha: 1.5
    prefix_k: 8
  global_router:
    mode: strict_input_length
    length_penalty_weight: 1.0
    load_weight: 1.0
    cache_weight: 1.0
  buckets:
    - name: short
      input_length_min: 0
      input_length_max: 32768
      num_instances: 8
    - name: long
      input_length_min: 32769
      input_length_max: 131072
      num_instances: 4
```
Rules enforced by config validation:
- `cluster.num_instances` and `cluster.buckets` are mutually exclusive
- bucket ranges must not overlap
- every bucket must have `num_instances > 0`
- `input_length_min <= input_length_max`
### CLI Overrides
These flags apply on top of the YAML config:
| Flag | Overrides |
|------|-----------|
| `--num-instances <N>` | `cluster.num_instances` |
| `--max-requests <N>` | `sim.max_requests` |
| `--trace <PATH>` | `sim.trace_path` |
| `--output-dir <PATH>` | `sim.output_dir` |
| `--seed <N>` | `sim.seed` |
| `--precise-topk <N>` | `cluster.router.precise_probe_topk` |
| `--ttl-seconds <S>` | `cluster.meta_store.ttl_seconds` |
| `--input-length-min <N>` | `sim.input_length_min` |
| `--input-length-max <N>` | `sim.input_length_max` |
Subcommand-specific additions:
- `ablate`: `--routers`, `--evict-policies`, `--auto-instances`,
`--auto-target-ttft-mean`, `--auto-candidates`, `--auto-probe-router`,
`--jobs`
- `oracle`: `--capacity-blocks`, `--per-instance`, `--out`
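Overrides compose. For example, slice the trace to the short bucket and
lengthen the meta-store TTL without editing the YAML (all flags are from the
table above; the values are illustrative):
```bash
# Run a 0-32k input-length slice with a longer TTL, into a fresh output dir.
target/release/kvcache-sim run \
  --config configs/glm5-8xb200.yaml \
  --input-length-min 0 \
  --input-length-max 32768 \
  --ttl-seconds 600 \
  --output-dir runs/short_slice_ttl600
```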
## Routing Modes
### Global Bucket Routers
Configured through `cluster.global_router.mode`.
| Mode | What it does |
|------|---------------|
| `strict_input_length` | Routes to the unique bucket whose `[input_length_min, input_length_max]` range contains the request's input length. |
| `bucket_score` | Scores every bucket using weighted length mismatch, aggregate queue load, and predicted cache miss. Can intentionally deviate from the strict length bucket. |
### Local Instance Routers
Configured through `cluster.router.mode`. All of these names are accepted by
`run`, and any of them can be passed to `ablate --routers` on single-pool
configs.
| Mode | Aliases | What it does |
|------|---------|---------------|
| `random` | | Uniform random baseline. |
| `round_robin` | `rr` | Deterministic round-robin baseline. |
| `least_loaded` | | Minimizes `kv_blocks_used + alpha * queue_len`. |
| `least_tokens` | `lt` | Minimizes queued token work. |
| `ttl_aware` | `ttl` | Uses the global TTL meta-store to chase the longest reusable prefix. |
| `precise` | `precise_aware` | Probes top-K least-loaded instances for actual cache contents and charges probe latency. |
| `min_pd` | `minpd`, `pd` | Minimizes `P * D` using ongoing load and prefix reuse. |
| `cache_load` | `cl` | Filters to lightly loaded instances, then chooses the best cache prefix. |
| `cache_affinity` | `caff`, `ca` | Strong cache-first scoring with rendezvous-based sticky homes for prefix families. |
| `cache_affinity_weak_rend` | `caff_weak` | Ablation: weak cache weights plus rendezvous placement. |
| `cache_affinity_strong_only` | `caff_strong` | Ablation: strong cache weights without rendezvous tie-breaking. |
| `cache_score` | `cs` | Exponential score over queue length and miss blocks. |
| `cache_score_strong` | `cs_strong`, `css` | Parity probe with stronger cache weighting than default `cache_score`. |
| `cache_score_ttl` | `csttl`, `cs_ttl` | `cache_score` variant that also uses TTL/meta-store visibility. |
| `estimated_ttft` | `ettft`, `optimal` | First-principles TTFT estimate per instance using compute plus KV movement. |
| `prefix_affinity` | `affinity`, `pa` | Deterministic prefix fingerprinting with affinity fan-out and load-aware selection. |
| `adaptive_affinity` | `aa` | Uses hot-prefix detection: affinity for short hot stems, TTFT optimization otherwise. |
| `lineage_affinity` | `la` | Combines parent stickiness, family homesets, and strong local cache scoring. |
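Because any of these names is accepted by `ablate --routers`, the
`cache_affinity` ablation variants from the table can be compared head-to-head
on a single-pool config:
```bash
# Compare the cache_affinity family against its two ablation variants.
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --routers cache_affinity,cache_affinity_weak_rend,cache_affinity_strong_only
```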
Router tuning knobs in `cluster.router`:
| Field | Default | Used by |
|-------|---------|---------|
| `load_alpha` | `1.0` | `least_loaded`, `ttl_aware`, affinity families |
| `score_alpha` | `1.0` | `cache_score`, `cache_score_ttl` |
| `score_beta` | `0.1` | `cache_score`, `cache_score_ttl` |
| `prefix_k` | `8` | prefix and affinity fingerprinting |
| `affinity_fan_out` | `0` | `prefix_affinity`, `adaptive_affinity`, `lineage_affinity` |
| `precise_probe_latency_us` | `50.0` | `precise` |
| `precise_probe_topk` | `4` | `precise` |
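Of these knobs, only `precise_probe_topk` has a dedicated CLI override
(`--precise-topk`, see the CLI Overrides table), so probe width can be swept
without editing YAML. A sketch, assuming the config sets
`cluster.router.mode: precise`:
```bash
# Sweep the precise router's probe width across three values.
for k in 2 4 8; do
  target/release/kvcache-sim run \
    --config configs/glm5-8xb200.yaml \
    --precise-topk "$k" \
    --output-dir "runs/precise_topk_$k"
done
```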
## Model And Hardware Configuration
### Model Config
Recommended pattern:
```yaml
model:
  config_json: ../models/GLM-5/config.json
  name: glm-5
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512
```
Notes:
- `config_json` is resolved relative to the YAML file
- explicit YAML fields override values loaded from the model config
- `compute_dtype` selects the compute FLOPS tier
- `weight_dtype` controls model-weight bytes separately from KV-cache bytes
- `dtype_bytes` sizes the KV cache
The architecture loader understands:
- MoE expert counts and active experts
- MLA LoRA ranks and attention dimensions
- DSA sparse-attention parameters
- sliding-window attention
- GQA from KV-head count
### Hardware Presets
Recommended pattern:
```yaml
hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256
```
Available preset families:
- `h100`, `h800`, `h20`, `h20-141g`
- `a100-80gb`, `a100-40gb`
- `b200`, `b300`
- TP forms such as `2xh100`, `4xh20`, `8xb200`, `8xb300`
## Bundled Configs
Representative configs in `configs/`:
| Config | Notes |
|--------|-------|
| `glm5-8xb200.yaml` | GLM-5 on `8xb200`, single-pool baseline config. |
| `glm5-fp8-8xh20-141g.yaml` | GLM-5-FP8 on `8xh20-141g`, with a 0-32k input-length filter. |
| `glm5-fp8-8xh20-141g-ca-tuned.yaml` | Same family as above, tuned for `cache_affinity`. |
| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 on `8xb300`. |
| `glm5-nvfp4-fp8compute-8xb300.yaml` | NVFP4 weights with FP8 compute on `8xb300`. |
| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder-480B-A35B on `8xh20`. |
Many of the `glm5-*n*.yaml` configs are bucket/slice-specific experiment
points that use `sim.input_length_min` and `sim.input_length_max`.
## Trace Inputs
This repository currently contains two trace sources:
- `bailian-traces/`
- `glm_coder_blksz_512_040915-040917.jsonl`
- `qwen3_coder_blksz_512_040915-040917.jsonl`
- `qwen-bailian-usagetraces-anon/` submodule
- public 16-token-block Qwen traces such as
`qwen_coder_blksz_16.jsonl` and `qwen_traceB_blksz_16.jsonl`
The simulator expects JSONL records with fields like:
```json
{
"chat_id": 159,
"parent_chat_id": 55,
"timestamp": 61.114,
"input_length": 521,
"output_length": 132,
"type": "text",
"turn": 2,
"hash_ids": [1089, 1090, 1091]
}
```
Only prefill-side behavior is modeled; `output_length` is used only for a
decode tail in completion metrics.
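Because the traces are plain JSONL, `jq` is enough for quick slicing. For
example, count how many records fall in a 0-32k input-length window (field
names as in the record above):
```bash
# Count trace records that the 0-32k slice configs would keep.
jq -c 'select(.input_length <= 32768)' \
  bailian-traces/glm_coder_blksz_512_040915-040917.jsonl | wc -l
```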
## Outputs
Each `run` writes a directory under `sim.output_dir`:
| File | Contents |
|------|----------|
| `summary.json` | Aggregate throughput, TTFT/E2E percentiles, hit rates, RDMA bytes, PCIe bytes. |
| `per_request.csv` | Per-request latency and cache stats, including `bucket`, `instance`, and `length_bucket_match`. |
| `instances.csv` | Periodic per-instance samples with `bucket`, `instance`, `queue_len`, and KV usage. |
| `routing_log.jsonl` | One JSON route decision per request, including `global_mode`, `mode`, `chosen_bucket`, candidate buckets, and candidate instances. |
Additional outputs:
- `ablate`: writes `ablation.json`
- `oracle`: writes `oracle.json`
- `ablate --auto-instances`: writes calibration runs under
`<output_dir>/auto_instances/`
Quick inspection examples:
```bash
jq . runs/glm5_8xb200/summary.json
```
```bash
jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0, miss_rate}' \
  runs/glm5_8xb200/ablation.json
```
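`routing_log.jsonl` is also line-delimited, so the same approach applies. A
sketch that tallies routed buckets across requests (`chosen_bucket` as listed
in the table above; it is only meaningful for bucketed runs):
```bash
# Count route decisions per bucket.
jq -r '.chosen_bucket' runs/glm5_8xb200/routing_log.jsonl | sort | uniq -c
```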
## Oracle Semantics
`oracle` computes three hit-rate references at a chosen cache capacity:
- `unlimited.hit_rate`: absolute ceiling with infinite capacity
- `belady_finite.hit_rate`: offline-optimal eviction at the chosen capacity
- `lru_finite.hit_rate`: LRU at the same capacity
When `sim.input_length_min` / `sim.input_length_max` are set, `oracle` still
feeds the full trace into cache state but only counts requests inside the
selected input-length range. That matches the intended "measure one bucket
inside a mixed workload" interpretation.
The gap from `lru_finite` to `belady_finite` is eviction-policy headroom. The
gap from `belady_finite` to `unlimited` is pure capacity headroom.
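To put numbers on both gaps in one shot, run a single finite-capacity point
and print the three references side by side (the `--capacity-blocks` value is
illustrative, and the JSON paths are assumed from the field names above):
```bash
# One finite-capacity oracle point, then the three hit-rate references.
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200.yaml \
  --capacity-blocks 200000 \
  --out runs/oracle.json
jq '{unlimited: .unlimited.hit_rate,
     belady: .belady_finite.hit_rate,
     lru: .lru_finite.hit_rate}' runs/oracle.json
```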
## Testing
```bash
cargo test --release
```
The test suite covers config parsing, hardware presets, routing behavior,
bucket-aware service semantics, oracle logic, and smoke-style end-to-end runs.