Discrete-event simulator for evaluating KV cache-aware routing policies in prefill-disaggregated LLM serving clusters. Models a two-tier KV cache hierarchy (L0 GPU HBM + L1 CPU DRAM) with RDMA/PCIe link contention, architecture-derived roofline compute (MoE, MLA, DSA), and a cluster-wide meta-store for prefix-aware routing decisions. Includes 11 routing policies (random, round_robin, least_loaded, least_tokens, ttl_aware, precise, min_pd, cache_load, cache_score, estimated_ttft, prefix_affinity), HuggingFace config.json auto-parsing, built-in GPU hardware presets (H100/H800/H20/A100/B200), and ablation tooling for systematic policy comparison across real Alibaba serving traces. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# kvcache-simulator

Discrete-event simulator for cluster-level LLM **prefill** serving with a
two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing.
Replays real production traces against a synthetic cluster so you can
ablate routing strategies and cache sizing without spinning up any GPUs.

Assumes **PD (prefill/decode) disaggregation**: only the prefill path is
modeled.

## Build

```bash
cargo build --release
# binary: target/release/kvcache-sim
```

Fetch the upstream trace (consumed as a git submodule):

```bash
git submodule update --init --recursive
```

## Usage

### 1. Run a single simulation

```bash
target/release/kvcache-sim run --config configs/qwen2.5-coder-7b-h800.yaml
```

Prints `summary.json` to stdout and writes the full output directory
(see [Outputs](#outputs) below).

### 2. Compare routers on the same trace (ablation)

```bash
target/release/kvcache-sim ablate \
  --config configs/qwen2.5-coder-7b-h800.yaml \
  --num-instances 64 \
  --output-dir runs/qwen7b_n64 \
  --routers random,least_loaded,ttl_aware,precise
```

Writes one subdirectory per router plus a combined
`runs/qwen7b_n64/ablation.json` with side-by-side summaries.

### 3. Compute theoretical hit-rate ceilings (oracle)

```bash
# Cluster-aggregate capacity (default)
target/release/kvcache-sim oracle \
  --config configs/qwen2.5-coder-7b-h800.yaml --num-instances 64

# A single instance's HBM budget
target/release/kvcache-sim oracle \
  --config configs/qwen2.5-coder-7b-h800.yaml --per-instance

# Explicit capacity in 16-token blocks
target/release/kvcache-sim oracle \
  --config configs/qwen2.5-coder-7b-h800.yaml --capacity-blocks 200000
```

Reports three numbers:

- `unlimited.hit_rate`: absolute ceiling (infinite cache)
- `belady_finite.hit_rate`: optimal-eviction ceiling at the given capacity
- `lru_finite.hit_rate`: production LRU at the same capacity

The gap between `lru_finite` and `belady_finite` is the headroom available
from a smarter eviction policy. The gap between `belady_finite` and
`unlimited` is headroom that is only reachable by adding capacity.

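To make the three ceilings concrete, here is a minimal sketch of how they could be computed over a flat sequence of block hashes. It is not the simulator's implementation: the function names are hypothetical, the trace is treated as a plain access sequence (prefix dependencies between blocks are ignored), and the Belady variant is allowed to bypass the cache when the incoming block is the one reused farthest in the future.

```rust
use std::collections::{HashMap, HashSet};

fn ratio(hits: usize, total: usize) -> f64 {
    if total == 0 { 0.0 } else { hits as f64 / total as f64 }
}

/// Infinite cache: every access after the first to a block is a hit.
fn unlimited_hit_rate(accesses: &[u64]) -> f64 {
    let mut seen = HashSet::new();
    let hits = accesses.iter().filter(|b| !seen.insert(**b)).count();
    ratio(hits, accesses.len())
}

/// LRU at `capacity` blocks (O(capacity) eviction, fine for a sketch).
fn lru_hit_rate(accesses: &[u64], capacity: usize) -> f64 {
    let mut last_use: HashMap<u64, usize> = HashMap::new(); // block -> last access index
    let mut hits = 0;
    for (t, &block) in accesses.iter().enumerate() {
        if last_use.contains_key(&block) {
            hits += 1;
        } else if last_use.len() >= capacity {
            // Evict the block that was touched longest ago.
            let victim = last_use.iter().min_by_key(|entry| *entry.1).map(|(b, _)| *b);
            if let Some(v) = victim {
                last_use.remove(&v);
            }
        }
        if capacity > 0 {
            last_use.insert(block, t);
        }
    }
    ratio(hits, accesses.len())
}

/// Belady (clairvoyant) eviction: drop the block whose next use is farthest away.
fn belady_hit_rate(accesses: &[u64], capacity: usize) -> f64 {
    // next_use[t] = index of the next access to accesses[t], or usize::MAX if none.
    let mut next_use = vec![usize::MAX; accesses.len()];
    let mut upcoming: HashMap<u64, usize> = HashMap::new();
    for (t, &block) in accesses.iter().enumerate().rev() {
        if let Some(&next) = upcoming.get(&block) {
            next_use[t] = next;
        }
        upcoming.insert(block, t);
    }

    let mut cache: HashMap<u64, usize> = HashMap::new(); // block -> its next use
    let mut hits = 0;
    for (t, &block) in accesses.iter().enumerate() {
        if cache.contains_key(&block) {
            hits += 1;
        }
        cache.insert(block, next_use[t]);
        if cache.len() > capacity {
            // The victim may be the block just inserted (cache bypass).
            let victim = cache.iter().max_by_key(|entry| *entry.1).map(|(b, _)| *b);
            if let Some(v) = victim {
                cache.remove(&v);
            }
        }
    }
    ratio(hits, accesses.len())
}
```

On a cyclic pattern such as `[1, 2, 3, 1, 2, 4, 1, 2]` with a capacity of 2 blocks, LRU thrashes while the clairvoyant policy keeps blocks 1 and 2 resident, which is the kind of gap the `lru_finite` versus `belady_finite` comparison is meant to expose.
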
### 4. Validate a config without running

```bash
target/release/kvcache-sim validate --config configs/qwen2.5-coder-7b-h800.yaml
```

Parses the YAML, prints derived per-instance block budgets, and dumps
the first 5 trace records so you can sanity-check the path.

## CLI overrides

These flags work on **all** subcommands and override the YAML in place,
so the same config can be reused across sweeps:

| Flag                     | Overrides                                 |
|--------------------------|-------------------------------------------|
| `--num-instances <N>`    | `cluster.num_instances`                   |
| `--max-requests <N>`     | `sim.max_requests`                        |
| `--trace <PATH>`         | `sim.trace_path`                          |
| `--output-dir <PATH>`    | `sim.output_dir`                          |
| `--seed <N>`             | `sim.seed`                                |
| `--precise-topk <N>`     | `cluster.router.precise_probe_topk`       |
| `--ttl-seconds <S>`      | `cluster.meta_store.ttl_seconds`          |

`oracle` additionally takes `--capacity-blocks <N>` / `--per-instance`
and `--out <PATH>`. `ablate` additionally takes `--routers <csv>`.

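The override semantics in the table can be pictured as a simple merge: a flag that is present replaces the YAML value, and a flag that is absent leaves it alone. The sketch below is illustrative only; `Settings`, `Overrides`, and `apply_overrides` are hypothetical names, the field types are guesses, and only three of the flags are shown.

```rust
/// Hypothetical subset of the values loaded from the YAML config.
#[derive(Debug, Clone, PartialEq)]
struct Settings {
    num_instances: usize, // cluster.num_instances
    seed: u64,            // sim.seed
    ttl_seconds: f64,     // cluster.meta_store.ttl_seconds
}

/// Hypothetical parsed CLI flags; `None` means the flag was not given.
#[derive(Debug, Default)]
struct Overrides {
    num_instances: Option<usize>, // --num-instances
    seed: Option<u64>,            // --seed
    ttl_seconds: Option<f64>,     // --ttl-seconds
}

/// Returns the config with every supplied flag taking precedence over the YAML.
fn apply_overrides(mut base: Settings, cli: &Overrides) -> Settings {
    if let Some(n) = cli.num_instances {
        base.num_instances = n;
    }
    if let Some(s) = cli.seed {
        base.seed = s;
    }
    if let Some(t) = cli.ttl_seconds {
        base.ttl_seconds = t;
    }
    base
}
```

Under this model, a sweep over cluster sizes only varies `--num-instances` while every other value keeps coming from the shared YAML file.
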
## Router modes

Set `cluster.router.mode` in the YAML or list in `--routers`:

| Mode           | What it does                                                       |
|----------------|--------------------------------------------------------------------|
| `random`       | Uniform random. Baseline.                                          |
| `round_robin`  | Deterministic round-robin. Baseline.                               |
| `least_loaded` | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind.            |
| `ttl_aware`    | Picks instance with longest prefix in the global TTL meta store.   |
| `precise`      | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |

Expected hit-rate ordering: `random ≲ least_loaded ≲ ttl_aware ≲ precise`.

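As a rough illustration of the two non-trivial scoring rules in the table, the sketch below implements the `least_loaded` argmin and a `ttl_aware` style longest-prefix lookup. Everything here is an assumption made for the example: the `InstanceLoad` and `MetaStore` types, the `publish` helper, the tie-breaking rule (lowest instance index wins), and the treatment of an entry as live only while `expires_at > now`. The simulator's actual data structures may differ.

```rust
use std::collections::HashMap;

/// Hypothetical load snapshot; field names mirror the `least_loaded` formula.
struct InstanceLoad {
    kv_blocks_used: u64,
    queue_len: u64,
}

/// `least_loaded`: argmin over instances of kv_blocks_used + alpha * queue_len.
fn least_loaded(instances: &[InstanceLoad], alpha: f64) -> Option<usize> {
    instances
        .iter()
        .enumerate()
        .map(|(i, inst)| (i, inst.kv_blocks_used as f64 + alpha * inst.queue_len as f64))
        .min_by(|a, b| a.1.total_cmp(&b.1)) // first minimum wins on ties
        .map(|(i, _)| i)
}

/// Hypothetical global meta store: block hash -> (instance -> expiry time in seconds).
#[derive(Default)]
struct MetaStore {
    entries: HashMap<u64, HashMap<usize, f64>>,
}

impl MetaStore {
    /// Records that `instance` holds `block` until `expires_at`.
    fn publish(&mut self, block: u64, instance: usize, expires_at: f64) {
        self.entries.entry(block).or_default().insert(instance, expires_at);
    }

    /// Number of leading request blocks with an unexpired entry for `instance`.
    fn prefix_len(&self, instance: usize, request: &[u64], now: f64) -> usize {
        request
            .iter()
            .take_while(|block| {
                self.entries
                    .get(*block)
                    .and_then(|owners| owners.get(&instance))
                    .map_or(false, |&expires_at| expires_at > now)
            })
            .count()
    }
}

/// `ttl_aware`: instance with the longest live prefix; lowest index on ties.
fn ttl_aware(store: &MetaStore, num_instances: usize, request: &[u64], now: f64) -> Option<usize> {
    let mut best: Option<(usize, usize)> = None; // (instance, prefix length)
    for i in 0..num_instances {
        let len = store.prefix_len(i, request, now);
        if best.map_or(true, |(_, best_len)| len > best_len) {
            best = Some((i, len));
        }
    }
    best.map(|(i, _)| i)
}
```

Because the meta store only records what instances are believed to hold until the TTL lapses, a `ttl_aware` choice can be stale, which is the gap the `precise` mode closes by probing actual caches at the cost of probe latency.
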
## Outputs

Each run writes a directory under `sim.output_dir`:

| File                 | Contents                                                                   |
|----------------------|----------------------------------------------------------------------------|
| `summary.json`       | Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes |
| `per_request.csv`    | `req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s` |
| `instances.csv`      | `t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample      |
| `routing_log.jsonl`  | One JSON per request: all router candidates + chosen instance + reason     |

For `ablate`: an extra `ablation.json` with one summary per router.
For `oracle`: an `oracle.json` with the three hit-rate analyses.

### Reading results quickly

```bash
# Pretty-print the summary
cat runs/qwen7b/summary.json | jq .

# Compare all routers from an ablation
cat runs/qwen7b_n64/ablation.json | jq '.[] | {router, ttft_p50, hit_rate_l0, total_rdma_bytes}'

# Hit-rate ceilings vs LRU at the same capacity
cat runs/qwen7b/oracle.json | jq '{unlimited: .unlimited.hit_rate, belady: .belady_finite.hit_rate, lru: .lru_finite.hit_rate}'
```

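For analyses that go beyond `summary.json`, the per-request CSV can be post-processed directly. The following is a small sketch of that idea using only the standard library; the helper names are hypothetical, the percentile uses the nearest-rank method (which may not match how the simulator computes its own p50/p95/p99), and the L0 hit rate assumes `l0_hit` and `total_blocks` are per-request block counts.

```rust
/// Extracts one numeric column, by header name, from CSV text with a header row.
/// Returns None if the column is missing or a cell fails to parse.
fn column(csv: &str, name: &str) -> Option<Vec<f64>> {
    let mut lines = csv.lines();
    let idx = lines.next()?.split(',').position(|h| h.trim() == name)?;
    lines
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.split(',').nth(idx)?.trim().parse::<f64>().ok())
        .collect()
}

/// Nearest-rank percentile, with `p` in (0, 100].
fn percentile(values: &[f64], p: f64) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Block-weighted L0 hit rate: sum(l0_hit) / sum(total_blocks).
fn l0_hit_rate(csv: &str) -> Option<f64> {
    let hits: f64 = column(csv, "l0_hit")?.iter().sum();
    let total: f64 = column(csv, "total_blocks")?.iter().sum();
    if total > 0.0 { Some(hits / total) } else { None }
}
```

A typical use would be to read `per_request.csv` with `std::fs::read_to_string`, pull the `ttft` column with `column(&text, "ttft")`, and pass it to `percentile` for whichever quantiles are of interest.
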
## Config

A config is a single YAML file with four sections. A working example
lives at
[`configs/qwen2.5-coder-7b-h800.yaml`](configs/qwen2.5-coder-7b-h800.yaml);
copy and edit for other models/hardware.

```yaml
model:      # shape + prefill roofline coefficients
hardware:   # per-instance GPU/PCIe/RDMA capabilities + batch knobs
cluster:    # num_instances, meta_store TTL, router mode
sim:        # trace_path, max_requests, output_dir, seed
```

Only prefill-side model coefficients are used; any decode fields in
legacy YAMLs are accepted and ignored.

## Trace format

The simulator reads the Alibaba
[`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon)
JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`,
`output_length`, and `hash_ids` (16-token block hashes). Only the
input side is used.

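To show how the `hash_ids` field supports prefix reuse, here is a sketch of an in-memory record and two helpers. The field types, the `TraceRecord` name, and the assumption that a request of `input_length` tokens carries `ceil(input_length / 16)` hashes are illustrative guesses rather than a statement of the schema; JSONL parsing is left out to keep the example dependency-free.

```rust
/// Tokens covered by one block hash, per the trace description.
const BLOCK_TOKENS: u64 = 16;

/// Hypothetical in-memory form of one trace record (types are assumptions).
#[derive(Debug, Clone)]
struct TraceRecord {
    chat_id: i64,
    timestamp: f64,
    input_length: u64,
    output_length: u64, // present in the trace, unused on the prefill path
    hash_ids: Vec<u64>,
}

/// Blocks needed to cover `input_length` tokens, assuming the last block may be partial.
fn expected_blocks(input_length: u64) -> u64 {
    (input_length + BLOCK_TOKENS - 1) / BLOCK_TOKENS
}

/// Number of leading block hashes two requests have in common.
fn shared_prefix_blocks(a: &TraceRecord, b: &TraceRecord) -> usize {
    a.hash_ids
        .iter()
        .zip(&b.hash_ids)
        .take_while(|(x, y)| x == y)
        .count()
}

/// Upper bound on prompt tokens of `next` whose KV could be reused from `prev`.
fn reusable_tokens(prev: &TraceRecord, next: &TraceRecord) -> u64 {
    let shared = shared_prefix_blocks(prev, next) as u64 * BLOCK_TOKENS;
    shared.min(next.input_length)
}
```

Two requests that open with the same hashes share KV for those leading blocks, so the length of the common prefix is the quantity a KV-aware router is trying to land on the same instance.
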
## Testing

```bash
cargo test --release
```

16 tests: 15 unit + 1 smoke that runs all four routers on a synthetic
shared-prefix trace and asserts the expected hit-rate ordering.