# kvcache-simulator

Discrete-event simulator for cluster-level LLM **prefill** serving with a two-tier KV cache and routing experiments.

The simulator models a PD-disaggregated deployment: only the **prefill** path is simulated, while decode is reduced to a small completion tail for TTFT/E2E accounting.

It is intended for answering questions like:

- How much do different KV-aware routers help on the same trace?
- How much HBM/DRAM capacity is enough before routing dominates?
- How do prefix-locality policies behave under bucketed input-length pools?
- What is the gap between online LRU and offline-optimal cache capacity?

## What The Repo Models

- **Architecture-derived prefill cost** from model structure, including MoE, MLA, GQA, sliding-window attention, and DSA.
- **Two-tier KV hierarchy** with L0 GPU HBM and L1 host DRAM, plus remote RDMA fetches via a meta-store.
- **Single-pool and bucketed clusters**. Bucketed mode separates the service into input-length buckets with isolated instance pools and meta-stores.
- **Local instance routing and global bucket routing** with detailed per-request routing logs.
- **Trace replay with optional input-length filtering** so the same trace can be sliced into buckets without rewriting the source file.
- **Offline oracle analysis** for unlimited-capacity, Belady, and LRU hit-rate ceilings.

## Highlights

- **HF `config.json` auto-loading**: point `model.config_json` at a model config and the simulator derives architecture parameters automatically.
- **Hardware presets**: `h100`, `h800`, `h20`, `h20-141g`, `a100-80gb`, `a100-40gb`, `b200`, and `b300`, plus TP variants such as `8xb200`.
- **18 local router modes** covering baselines, load-based, cache-aware, affinity, and TTFT-estimating policies.
- **2 global bucket router modes**: `strict_input_length` and `bucket_score`.
- **Detailed outputs**: `summary.json`, `per_request.csv`, `instances.csv`, `routing_log.jsonl`, plus `ablation.json` / `oracle.json` when applicable.

## Build

```bash
cargo build --release
```

If you want the public Qwen trace submodule as well:

```bash
git submodule update --init --recursive
```

The release binary is:

```bash
target/release/kvcache-sim
```

## Quick Start

Validate a config:

```bash
target/release/kvcache-sim validate --config configs/glm5-8xb200.yaml
```

Run one simulation:

```bash
target/release/kvcache-sim run --config configs/glm5-8xb200.yaml
```

Compare several routers on the same trace:

```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --routers random,least_loaded,cache_score,cache_affinity,estimated_ttft
```

Auto-pick the smallest cluster size that meets a TTFT target, then ablate at that size:

```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --auto-instances \
  --auto-probe-router cache_score \
  --auto-target-ttft-mean 4.0
```

Run the oracle:

```bash
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200.yaml \
  --per-instance
```

`run` prints `summary.json` to stdout and also writes the full output directory under `sim.output_dir`.
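Because every run parameter is an ordinary CLI flag, sweeps are easy to script. The sketch below is a hypothetical driver, not shipped with this repo; it replays the same config under several seeds using only the `--seed` and `--output-dir` overrides documented under CLI Overrides below:

```rust
// Hypothetical sweep driver, not part of this repo: rerun the same
// config under several seeds, one output directory per run.
use std::process::Command;

fn main() -> std::io::Result<()> {
    for seed in [1u64, 2, 3] {
        let seed_s = seed.to_string();
        let out_dir = format!("runs/glm5_8xb200_seed{seed}");
        let status = Command::new("target/release/kvcache-sim")
            .args([
                "run",
                "--config", "configs/glm5-8xb200.yaml",
                "--seed", seed_s.as_str(),
                "--output-dir", out_dir.as_str(),
            ])
            .status()?;
        assert!(status.success(), "run failed for seed {seed}");
        // `run` writes summary.json (plus the CSV/JSONL outputs) under out_dir.
        println!("finished {out_dir}");
    }
    Ok(())
}
```

Each run leaves its own `summary.json`, so comparing seeds is a matter of reading those files back, for example with the `jq` snippets shown under Outputs.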
## Current Command Boundaries

The repository now supports both legacy single-pool clusters and bucketed service topologies, but not every CLI path supports both yet.

- `run`: supports `cluster.num_instances` and `cluster.buckets`
- `validate`: supports `cluster.num_instances` and `cluster.buckets`
- `ablate`: currently **single-pool only**
- `ablate --evict-policies`: currently supports **`lru` only**
- `oracle`: currently **single-pool only**
- `--num-instances` override: currently **single-pool only**
- `--auto-instances`: currently **single-pool only**

In practice, bucket-aware experiments are ready in `run`, while fixed-placement ablation and oracle analysis still reject `cluster.buckets`.

## Config Model

### Single-Pool Cluster

Use `cluster.num_instances` for the original flat instance pool:

```yaml
cluster:
  num_instances: 32
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
```

### Bucketed Service

Use `cluster.buckets` plus a `global_router` to model explicit input-length buckets:

```yaml
cluster:
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    load_alpha: 1.5
    prefix_k: 8
  global_router:
    mode: strict_input_length
    length_penalty_weight: 1.0
    load_weight: 1.0
    cache_weight: 1.0
  buckets:
    - name: short
      input_length_min: 0
      input_length_max: 32768
      num_instances: 8
    - name: long
      input_length_min: 32769
      input_length_max: 131072
      num_instances: 4
```

Rules enforced by config validation (see the sketch below):

- `cluster.num_instances` and `cluster.buckets` are mutually exclusive
- bucket ranges must not overlap
- every bucket must have `num_instances > 0`
- `input_length_min <= input_length_max`
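The range checks reduce to a few lines. Below is a simplified sketch of those rules, not the repo's actual validator; the `Bucket` struct simply mirrors the YAML fields:

```rust
// Simplified sketch of the bucket rules above, not the repo's actual
// validator. The num_instances / buckets mutual exclusion is a
// cluster-level check and is omitted here.
struct Bucket {
    name: String,
    input_length_min: u64,
    input_length_max: u64,
    num_instances: u32,
}

fn validate_buckets(buckets: &[Bucket]) -> Result<(), String> {
    let mut ranges: Vec<(u64, u64, &str)> = Vec::new();
    for b in buckets {
        if b.num_instances == 0 {
            return Err(format!("bucket '{}' needs num_instances > 0", b.name));
        }
        if b.input_length_min > b.input_length_max {
            return Err(format!("bucket '{}' has min > max", b.name));
        }
        ranges.push((b.input_length_min, b.input_length_max, b.name.as_str()));
    }
    // Sort by range start; inclusive ranges overlap whenever the next
    // range starts at or before the previous range's end.
    ranges.sort_by_key(|r| (r.0, r.1));
    for pair in ranges.windows(2) {
        if pair[1].0 <= pair[0].1 {
            return Err(format!("buckets '{}' and '{}' overlap", pair[0].2, pair[1].2));
        }
    }
    Ok(())
}
```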
### CLI Overrides

These flags apply on top of the YAML config:

| Flag | Overrides |
|------|-----------|
| `--num-instances <n>` | `cluster.num_instances` |
| `--max-requests <n>` | `sim.max_requests` |
| `--trace <path>` | `sim.trace_path` |
| `--output-dir <path>` | `sim.output_dir` |
| `--seed <n>` | `sim.seed` |
| `--precise-topk <k>` | `cluster.router.precise_probe_topk` |
| `--ttl-seconds <secs>` | `cluster.meta_store.ttl_seconds` |
| `--input-length-min <tokens>` | `sim.input_length_min` |
| `--input-length-max <tokens>` | `sim.input_length_max` |

Subcommand-specific additions:

- `ablate`: `--routers`, `--evict-policies`, `--auto-instances`, `--auto-target-ttft-mean`, `--auto-candidates`, `--auto-probe-router`, `--jobs`
- `oracle`: `--capacity-blocks`, `--per-instance`, `--out`

## Routing Modes

### Global Bucket Routers

Configured through `cluster.global_router.mode`.

| Mode | What it does |
|------|--------------|
| `strict_input_length` | Routes to the unique bucket whose `[input_length_min, input_length_max]` range contains the request. |
| `bucket_score` | Scores every bucket using weighted length mismatch, aggregate queue load, and predicted cache miss. Can intentionally deviate from the strict length bucket. |

### Local Instance Routers

Configured through `cluster.router.mode`. All of these names are accepted by `run`, and any of them can be passed to `ablate --routers` on single-pool configs.

| Mode | Aliases | What it does |
|------|---------|--------------|
| `random` | | Uniform random baseline. |
| `round_robin` | `rr` | Deterministic round-robin baseline. |
| `least_loaded` | | Minimizes `kv_blocks_used + alpha * queue_len`. |
| `least_tokens` | `lt` | Minimizes queued token work. |
| `ttl_aware` | `ttl` | Uses the global TTL meta-store to chase the longest reusable prefix. |
| `precise` | `precise_aware` | Probes the top-K least-loaded instances for actual cache contents and charges probe latency. |
| `min_pd` | `minpd`, `pd` | Minimizes `P * D` using ongoing load and prefix reuse. |
| `cache_load` | `cl` | Filters to lightly loaded instances, then chooses the best cache prefix. |
| `cache_affinity` | `caff`, `ca` | Strong cache-first scoring with rendezvous-based sticky homes for prefix families. |
| `cache_affinity_weak_rend` | `caff_weak` | Ablation: weak cache weights plus rendezvous placement. |
| `cache_affinity_strong_only` | `caff_strong` | Ablation: strong cache weights without rendezvous tie-breaking. |
| `cache_score` | `cs` | Exponential score over queue length and miss blocks (see the sketch after the tuning knobs). |
| `cache_score_strong` | `cs_strong`, `css` | Parity probe with stronger cache weighting than the default `cache_score`. |
| `cache_score_ttl` | `csttl`, `cs_ttl` | `cache_score` variant that also uses TTL/meta-store visibility. |
| `estimated_ttft` | `ettft`, `optimal` | First-principles TTFT estimate per instance using compute plus KV movement. |
| `prefix_affinity` | `affinity`, `pa` | Deterministic prefix fingerprinting with affinity fan-out and load-aware selection. |
| `adaptive_affinity` | `aa` | Uses hot-prefix detection: affinity for short hot stems, TTFT optimization otherwise. |
| `lineage_affinity` | `la` | Combines parent stickiness, family homesets, and strong local cache scoring. |

Router tuning knobs in `cluster.router`:

| Field | Default | Used by |
|-------|---------|---------|
| `load_alpha` | `1.0` | `least_loaded`, `ttl_aware`, affinity families |
| `score_alpha` | `1.0` | `cache_score`, `cache_score_ttl` |
| `score_beta` | `0.1` | `cache_score`, `cache_score_ttl` |
| `prefix_k` | `8` | prefix and affinity fingerprinting |
| `affinity_fan_out` | `0` | `prefix_affinity`, `adaptive_affinity`, `lineage_affinity` |
| `precise_probe_latency_us` | `50.0` | `precise` |
| `precise_probe_topk` | `4` | `precise` |
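To make the scoring families concrete, here is a minimal sketch. The `least_loaded` objective matches the table exactly; the `cache_score` form below is only a plausible reading of "exponential score over queue length and miss blocks" and may differ from the repo's actual formula:

```rust
// Illustrative scoring only. `least_loaded` matches the documented
// objective; `cache_score` is a guessed functional form, not the repo's.
struct InstanceState {
    kv_blocks_used: f64,
    queue_len: f64,
    miss_blocks: f64, // blocks the request would have to fetch or recompute
}

// `least_loaded`: pick the instance minimizing kv_blocks_used + alpha * queue_len.
fn least_loaded(instances: &[InstanceState], load_alpha: f64) -> usize {
    instances
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            let ca = a.kv_blocks_used + load_alpha * a.queue_len;
            let cb = b.kv_blocks_used + load_alpha * b.queue_len;
            ca.total_cmp(&cb)
        })
        .map(|(i, _)| i)
        .expect("at least one instance")
}

// `cache_score` (hypothetical form): higher is better, penalizing queue
// depth via score_alpha and cache-miss blocks via score_beta.
fn cache_score(inst: &InstanceState, score_alpha: f64, score_beta: f64) -> f64 {
    (-score_alpha * inst.queue_len - score_beta * inst.miss_blocks).exp()
}
```

Under this reading, the default `score_beta` of 0.1 penalizes each miss block an order of magnitude more gently than each queued request.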
## Model And Hardware Configuration

### Model Config

Recommended pattern:

```yaml
model:
  config_json: ../models/GLM-5/config.json
  name: glm-5
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512
```

Notes:

- `config_json` is resolved relative to the YAML file
- explicit YAML fields override values loaded from the model config
- `compute_dtype` selects the compute FLOPS tier
- `weight_dtype` controls model-weight bytes separately from KV-cache bytes
- `dtype_bytes` sizes the KV cache

The architecture loader understands:

- MoE expert counts and active experts
- MLA LoRA ranks and attention dimensions
- DSA sparse-attention parameters
- sliding-window attention
- GQA from the KV-head count

### Hardware Presets

Recommended pattern:

```yaml
hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256
```

Available preset families:

- `h100`, `h800`, `h20`, `h20-141g`
- `a100-80gb`, `a100-40gb`
- `b200`, `b300`
- TP forms such as `2xh100`, `4xh20`, `8xb200`, `8xb300`

## Bundled Configs

Representative configs in `configs/`:

| Config | Notes |
|--------|-------|
| `glm5-8xb200.yaml` | GLM-5 on `8xb200`, single-pool baseline config. |
| `glm5-fp8-8xh20-141g.yaml` | GLM-5-FP8 on `8xh20-141g`, with a 0-32k input-length filter. |
| `glm5-fp8-8xh20-141g-ca-tuned.yaml` | Same family as above, tuned for `cache_affinity`. |
| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 on `8xb300`. |
| `glm5-nvfp4-fp8compute-8xb300.yaml` | NVFP4 weights with FP8 compute on `8xb300`. |
| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder-480B-A35B on `8xh20`. |

Many of the `glm5-*n*.yaml` configs are bucket/slice-specific experiment points that use `sim.input_length_min` and `sim.input_length_max`.

## Trace Inputs

This repository currently contains two trace sources:

- `bailian-traces/`
  - `glm_coder_blksz_512_040915-040917.jsonl`
  - `qwen3_coder_blksz_512_040915-040917.jsonl`
- `qwen-bailian-usagetraces-anon/` submodule
  - public 16-token-block Qwen traces such as `qwen_coder_blksz_16.jsonl` and `qwen_traceB_blksz_16.jsonl`

The simulator expects JSONL records with fields like:

```json
{
  "chat_id": 159,
  "parent_chat_id": 55,
  "timestamp": 61.114,
  "input_length": 521,
  "output_length": 132,
  "type": "text",
  "turn": 2,
  "hash_ids": [1089, 1090, 1091]
}
```

Only prefill-side behavior is modeled; `output_length` is used only for the decode tail in completion metrics.

## Outputs

Each `run` writes a directory under `sim.output_dir`:

| File | Contents |
|------|----------|
| `summary.json` | Aggregate throughput, TTFT/E2E percentiles, hit rates, RDMA bytes, PCIe bytes. |
| `per_request.csv` | Per-request latency and cache stats, including `bucket`, `instance`, and `length_bucket_match`. |
| `instances.csv` | Periodic per-instance samples with `bucket`, `instance`, `queue_len`, and KV usage. |
| `routing_log.jsonl` | One JSON route decision per request, including `global_mode`, `mode`, `chosen_bucket`, candidate buckets, and candidate instances. |

Additional outputs:

- `ablate`: writes `ablation.json`
- `oracle`: writes `oracle.json`
- `ablate --auto-instances`: writes calibration runs under `auto_instances/` inside the output directory

Quick inspection examples:

```bash
jq . runs/glm5_8xb200/summary.json
```

```bash
jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0, miss_rate}' \
  runs/glm5_8xb200/ablation.json
```

## Oracle Semantics

`oracle` computes three hit-rate references at a chosen cache capacity:

- `unlimited.hit_rate`: absolute ceiling with infinite capacity
- `belady_finite.hit_rate`: offline-optimal eviction at the chosen capacity
- `lru_finite.hit_rate`: LRU at the same capacity

When `sim.input_length_min` / `sim.input_length_max` are set, `oracle` still feeds the full trace into cache state but only counts requests inside the selected input-length range. That matches the intended "measure one bucket inside a mixed workload" interpretation.

The gap from `lru_finite` to `belady_finite` is eviction-policy headroom. The gap from `belady_finite` to `unlimited` is pure capacity headroom. A conceptual sketch of the Belady reference appears after the Testing section.

## Testing

```bash
cargo test --release
```

The test suite covers config parsing, hardware presets, routing behavior, bucket-aware service semantics, oracle logic, and smoke-style end-to-end runs.
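## Appendix: Belady Reference Sketch

For intuition about what `belady_finite.hit_rate` measures, here is a self-contained sketch of the classic offline-optimal (Belady) policy over a flat sequence of block ids, for example the concatenated `hash_ids` of a trace. It assumes unit-size blocks and one shared cache; the repo's oracle additionally supports `--per-instance` and the input-length counting rule described above, so treat this as a conceptual reference only, not the implementation:

```rust
// Conceptual Belady sketch: on a miss with a full cache, evict the
// cached block whose next use is farthest in the future.
use std::collections::{HashMap, HashSet};

fn belady_hit_rate(accesses: &[u64], capacity: usize) -> f64 {
    assert!(capacity > 0 && !accesses.is_empty());

    // next_use[i] = index of the next access to the same block, or usize::MAX.
    let mut next_use = vec![usize::MAX; accesses.len()];
    let mut later: HashMap<u64, usize> = HashMap::new();
    for (i, &block) in accesses.iter().enumerate().rev() {
        next_use[i] = *later.get(&block).unwrap_or(&usize::MAX);
        later.insert(block, i);
    }

    let mut cached: HashSet<u64> = HashSet::new();
    // Next-use position of every cached block, refreshed on each access.
    let mut reuse_at: HashMap<u64, usize> = HashMap::new();
    let mut hits = 0usize;

    for (i, &block) in accesses.iter().enumerate() {
        if cached.contains(&block) {
            hits += 1;
        } else {
            if cached.len() == capacity {
                // usize::MAX means "never used again", so such blocks go first.
                let (&victim, _) = reuse_at
                    .iter()
                    .max_by_key(|&(_, &pos)| pos)
                    .expect("cache is non-empty here");
                cached.remove(&victim);
                reuse_at.remove(&victim);
            }
            cached.insert(block);
        }
        reuse_at.insert(block, next_use[i]);
    }
    hits as f64 / accesses.len() as f64
}

fn main() {
    // Toy trace: with capacity 1, only the back-to-back `7` access hits.
    let trace = [7u64, 7, 3, 7];
    println!("belady hit rate = {}", belady_hit_rate(&trace, 1)); // 0.25
}
```

`oracle` reports the analogous quantity at the configured `--capacity-blocks`; the distance from this curve to `unlimited.hit_rate` is the capacity headroom discussed above.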