# kvcache-simulator Discrete-event simulator for cluster-level LLM **prefill** serving with a two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing. Replays real production traces against a synthetic cluster so you can ablate routing strategies and cache sizing without spinning up any GPUs. Assumes **PD (prefill/decode) disaggregation** — only the prefill path is modeled. ## Build ```bash cargo build --release # binary: target/release/kvcache-sim ``` Fetch the upstream trace (consumed as a git submodule): ```bash git submodule update --init --recursive ``` ## Usage ### 1. Run a single simulation ```bash target/release/kvcache-sim run --config configs/qwen2.5-coder-7b-h800.yaml ``` Prints `summary.json` to stdout and writes the full output directory (see [Outputs](#outputs) below). ### 2. Compare routers on the same trace (ablation) ```bash target/release/kvcache-sim ablate \ --config configs/qwen2.5-coder-7b-h800.yaml \ --num-instances 64 \ --output-dir runs/qwen7b_n64 \ --routers random,least_loaded,ttl_aware,precise ``` Writes one subdirectory per router plus a combined `runs/qwen7b_n64/ablation.json` with side-by-side summaries. ### 3. Compute theoretical hit-rate ceilings (oracle) ```bash # Cluster-aggregate capacity (default) target/release/kvcache-sim oracle \ --config configs/qwen2.5-coder-7b-h800.yaml --num-instances 64 # A single instance's HBM budget target/release/kvcache-sim oracle \ --config configs/qwen2.5-coder-7b-h800.yaml --per-instance # Explicit capacity in 16-token blocks target/release/kvcache-sim oracle \ --config configs/qwen2.5-coder-7b-h800.yaml --capacity-blocks 200000 ``` Reports three numbers: - `unlimited.hit_rate` — absolute ceiling (infinite cache) - `belady_finite.hit_rate` — optimal-eviction ceiling at the given capacity - `lru_finite.hit_rate` — production LRU at the same capacity Gap between `lru_finite` and `belady_finite` = headroom from a smarter eviction policy. Gap between `belady_finite` and `unlimited` = headroom only reachable by adding capacity. ### 4. Validate a config without running ```bash target/release/kvcache-sim validate --config configs/qwen2.5-coder-7b-h800.yaml ``` Parses the YAML, prints derived per-instance block budgets, and dumps the first 5 trace records so you can sanity-check the path. ## CLI overrides These flags work on **all** subcommands and override the YAML in place, so the same config can be reused across sweeps: | Flag | Overrides | |--------------------------|-------------------------------------------| | `--num-instances ` | `cluster.num_instances` | | `--max-requests ` | `sim.max_requests` | | `--trace ` | `sim.trace_path` | | `--output-dir ` | `sim.output_dir` | | `--seed ` | `sim.seed` | | `--precise-topk ` | `cluster.router.precise_probe_topk` | | `--ttl-seconds ` | `cluster.meta_store.ttl_seconds` | `oracle` additionally takes `--capacity-blocks ` / `--per-instance` and `--out `. `ablate` additionally takes `--routers `. ## Router modes Set `cluster.router.mode` in the YAML or list in `--routers`: | Mode | What it does | |----------------|--------------------------------------------------------------------| | `random` | Uniform random. Baseline. | | `round_robin` | Deterministic round-robin. Baseline. | | `least_loaded` | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind. | | `ttl_aware` | Picks instance with longest prefix in the global TTL meta store. | | `precise` | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. | Expected hit-rate ordering: `random ≲ least_loaded ≲ ttl_aware ≲ precise`. ## Outputs Each run writes a directory under `sim.output_dir`: | File | Contents | |----------------------|----------------------------------------------------------------------------| | `summary.json` | Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes | | `per_request.csv` | `req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s` | | `instances.csv` | `t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample | | `routing_log.jsonl` | One JSON per request: all router candidates + chosen instance + reason | For `ablate`: an extra `ablation.json` with one summary per router. For `oracle`: an `oracle.json` with the three hit-rate analyses. ### Reading results quickly ```bash # Pretty-print the summary cat runs/qwen7b/summary.json | jq . # Compare all routers from an ablation cat runs/qwen7b_n64/ablation.json | jq '.[] | {router, ttft_p50, hit_rate_l0, total_rdma_bytes}' # Hit-rate ceilings vs LRU at the same capacity cat runs/qwen7b/oracle.json | jq '{unlimited: .unlimited.hit_rate, belady: .belady_finite.hit_rate, lru: .lru_finite.hit_rate}' ``` ## Config A config is a single YAML file with four sections. A working example lives at [`configs/qwen2.5-coder-7b-h800.yaml`](configs/qwen2.5-coder-7b-h800.yaml); copy and edit for other models/hardware. ```yaml model: # shape + prefill roofline coefficients hardware: # per-instance GPU/PCIe/RDMA capabilities + batch knobs cluster: # num_instances, meta_store TTL, router mode sim: # trace_path, max_requests, output_dir, seed ``` Only prefill-side model coefficients are used; any decode fields in legacy YAMLs are accepted and ignored. ## Trace format The simulator reads the Alibaba [`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon) JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`, `output_length`, and `hash_ids` (16-token block hashes). Only the input side is used. ## Testing ```bash cargo test --release ``` 16 tests: 15 unit + 1 smoke that runs all four routers on a synthetic shared-prefix trace and asserts the expected hit-rate ordering.