kvcache-simulator

Discrete-event simulator for cluster-level LLM prefill serving with a two-tier KV cache and routing experiments. The simulator models a prefill/decode (PD) disaggregated deployment: only the prefill path is simulated, while decode is reduced to a small completion tail for TTFT/E2E accounting.

It is intended for answering questions like:

  • How much do different KV-aware routers help on the same trace?
  • How much HBM/DRAM capacity is enough before routing dominates?
  • How do prefix-locality policies behave under bucketed input-length pools?
  • What is the gap between online LRU and offline-optimal cache capacity?

What The Repo Models

  • Architecture-derived prefill cost from model structure, including MoE, MLA, GQA, sliding-window attention, and DSA.
  • Two-tier KV hierarchy with L0 GPU HBM and L1 host DRAM, plus remote RDMA fetches via a meta-store.
  • Single-pool and bucketed clusters. Bucketed mode separates the service into input-length buckets with isolated instance pools and meta-stores.
  • Local instance routing and global bucket routing with detailed per-request routing logs.
  • Trace replay with optional input-length filtering so the same trace can be sliced into buckets without rewriting the source file.
  • Offline oracle analysis for unlimited capacity, Belady, and LRU hit-rate ceilings.

Highlights

  • HF config.json auto-loading: point model.config_json at a model config and the simulator derives architecture parameters automatically.
  • Hardware presets: h100, h800, h20, h20-141g, a100-80gb, a100-40gb, b200, and b300, plus TP variants such as 8xb200.
  • 18 local router modes covering baselines, load-based, cache-aware, affinity, and TTFT-estimating policies.
  • 2 global bucket router modes: strict_input_length and bucket_score.
  • Detailed outputs: summary.json, per_request.csv, instances.csv, routing_log.jsonl, plus ablation.json / oracle.json when applicable.

Build

cargo build --release

If you want the public Qwen trace submodule as well:

git submodule update --init --recursive

The release binary is:

target/release/kvcache-sim

Quick Start

Validate a config:

target/release/kvcache-sim validate --config configs/glm5-8xb200.yaml

Run one simulation:

target/release/kvcache-sim run --config configs/glm5-8xb200.yaml

Compare several routers on the same trace:

target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --routers random,least_loaded,cache_score,cache_affinity,estimated_ttft

Auto-pick the smallest cluster size that meets a TTFT target, then ablate at that size:

target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --auto-instances \
  --auto-probe-router cache_score \
  --auto-target-ttft-mean 4.0

Run the oracle:

target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200.yaml \
  --per-instance

run prints summary.json to stdout and also writes the full output directory under sim.output_dir.
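Since run emits summary.json on stdout, a quick sanity check can pipe it straight into jq. This sketch assumes stdout carries only the JSON document; the ttft_mean field name is inferred from the ablation example later in this README, not confirmed:

target/release/kvcache-sim run --config configs/glm5-8xb200.yaml | jq '.ttft_mean'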

Current Command Boundaries

The repository now supports both legacy single-pool clusters and bucketed service topologies, but not every CLI path supports both yet.

  • run: supports cluster.num_instances and cluster.buckets
  • validate: supports cluster.num_instances and cluster.buckets
  • ablate: currently single-pool only
  • ablate --evict-policies: currently supports lru only
  • oracle: currently single-pool only
  • --num-instances override: currently single-pool only
  • --auto-instances: currently single-pool only

In practice, bucket-aware experiments are ready in run, while fixed-placement ablation and oracle analysis still reject cluster.buckets.
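Until ablate and oracle learn about cluster.buckets, one workaround is to approximate a single bucket on a single-pool config with the shared input-length filter flags. A sketch, assuming those flags apply to ablate the same way they apply to run:

target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --input-length-min 0 \
  --input-length-max 32768 \
  --routers cache_affinity,cache_score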

Config Model

Single-Pool Cluster

Use cluster.num_instances for the original flat instance pool:

cluster:
  num_instances: 32
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity

Bucketed Service

Use cluster.buckets plus a global_router to model explicit input-length buckets:

cluster:
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    load_alpha: 1.5
    prefix_k: 8
  global_router:
    mode: strict_input_length
    length_penalty_weight: 1.0
    load_weight: 1.0
    cache_weight: 1.0
  buckets:
    - name: short
      input_length_min: 0
      input_length_max: 32768
      num_instances: 8
    - name: long
      input_length_min: 32769
      input_length_max: 131072
      num_instances: 4

Rules enforced by config validation:

  • cluster.num_instances and cluster.buckets are mutually exclusive
  • bucket ranges must not overlap
  • every bucket must have num_instances > 0
  • input_length_min <= input_length_max
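For example, validate should reject the layout below, because the two ranges share the boundary value 32768 (all numbers here are illustrative):

cluster:
  buckets:
    - name: short
      input_length_min: 0
      input_length_max: 32768
      num_instances: 8
    - name: long
      input_length_min: 32768   # overlaps "short" at exactly 32768
      input_length_max: 131072
      num_instances: 4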

CLI Overrides

These flags apply on top of the YAML config:

  • --num-instances <N>: overrides cluster.num_instances
  • --max-requests <N>: overrides sim.max_requests
  • --trace <PATH>: overrides sim.trace_path
  • --output-dir <PATH>: overrides sim.output_dir
  • --seed <N>: overrides sim.seed
  • --precise-topk <N>: overrides cluster.router.precise_probe_topk
  • --ttl-seconds <S>: overrides cluster.meta_store.ttl_seconds
  • --input-length-min <N>: overrides sim.input_length_min
  • --input-length-max <N>: overrides sim.input_length_max

Subcommand-specific additions:

  • ablate: --routers, --evict-policies, --auto-instances, --auto-target-ttft-mean, --auto-candidates, --auto-probe-router, --jobs
  • oracle: --capacity-blocks, --per-instance, --out
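A sketch combining several of the shared overrides on one run (all values illustrative):

target/release/kvcache-sim run \
  --config configs/glm5-8xb200.yaml \
  --num-instances 16 \
  --max-requests 50000 \
  --ttl-seconds 600 \
  --seed 7 \
  --output-dir runs/glm5_16inst_seed7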

Routing Modes

Global Bucket Routers

Configured through cluster.global_router.mode.

  • strict_input_length: Routes to the unique bucket whose [input_length_min, input_length_max] range contains the request's input length.
  • bucket_score: Scores every bucket using weighted length mismatch, aggregate queue load, and predicted cache miss; it can intentionally deviate from the strict length bucket.
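A minimal sketch of switching the bucketed config above to bucket_score; the weights are illustrative starting points, not tuned values:

cluster:
  global_router:
    mode: bucket_score
    length_penalty_weight: 1.0   # weight on length-bucket mismatch
    load_weight: 1.0             # weight on aggregate queue load
    cache_weight: 0.5            # weight on predicted cache miss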

Local Instance Routers

Configured through cluster.router.mode. All of these names are accepted by run, and any of them can be passed to ablate --routers on single-pool configs.

  • random: Uniform random baseline.
  • round_robin (alias: rr): Deterministic round-robin baseline.
  • least_loaded: Minimizes kv_blocks_used + alpha * queue_len.
  • least_tokens (alias: lt): Minimizes queued token work.
  • ttl_aware (alias: ttl): Uses the global TTL meta-store to chase the longest reusable prefix.
  • precise (alias: precise_aware): Probes the top-K least-loaded instances for actual cache contents and charges probe latency.
  • min_pd (aliases: minpd, pd): Minimizes P * D using ongoing load and prefix reuse.
  • cache_load (alias: cl): Filters to lightly loaded instances, then chooses the best cache prefix.
  • cache_affinity (aliases: caff, ca): Strong cache-first scoring with rendezvous-based sticky homes for prefix families.
  • cache_affinity_weak_rend (alias: caff_weak): Ablation: weak cache weights plus rendezvous placement.
  • cache_affinity_strong_only (alias: caff_strong): Ablation: strong cache weights without rendezvous tie-breaking.
  • cache_score (alias: cs): Exponential score over queue length and miss blocks.
  • cache_score_strong (aliases: cs_strong, css): Parity probe with stronger cache weighting than the default cache_score.
  • cache_score_ttl (aliases: csttl, cs_ttl): cache_score variant that also uses TTL/meta-store visibility.
  • estimated_ttft (aliases: ettft, optimal): First-principles TTFT estimate per instance using compute plus KV movement.
  • prefix_affinity (aliases: affinity, pa): Deterministic prefix fingerprinting with affinity fan-out and load-aware selection.
  • adaptive_affinity (alias: aa): Uses hot-prefix detection: affinity for short hot stems, TTFT optimization otherwise.
  • lineage_affinity (alias: la): Combines parent stickiness, family homesets, and strong local cache scoring.

Router tuning knobs in cluster.router:

  • load_alpha (default 1.0): used by least_loaded, ttl_aware, and the affinity families
  • score_alpha (default 1.0): used by cache_score, cache_score_ttl
  • score_beta (default 0.1): used by cache_score, cache_score_ttl
  • prefix_k (default 8): used by prefix and affinity fingerprinting
  • affinity_fan_out (default 0): used by prefix_affinity, adaptive_affinity, lineage_affinity
  • precise_probe_latency_us (default 50.0): used by precise
  • precise_probe_topk (default 4): used by precise
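As a sketch, a precise router configured through these knobs might look like this; the values just spell out the documented defaults:

cluster:
  router:
    mode: precise
    precise_probe_topk: 4            # probe the 4 least-loaded instances
    precise_probe_latency_us: 50.0   # latency charged per probe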

Model And Hardware Configuration

Model Config

Recommended pattern:

model:
  config_json: ../models/GLM-5/config.json
  name: glm-5
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512

Notes:

  • config_json is resolved relative to the YAML file
  • explicit YAML fields override values loaded from the model config
  • compute_dtype selects the compute FLOPS tier
  • weight_dtype controls model-weight bytes separately from KV-cache bytes
  • dtype_bytes sizes the KV cache

The architecture loader understands:

  • MoE expert counts and active experts
  • MLA LoRA ranks and attention dimensions
  • DSA sparse-attention parameters
  • sliding-window attention
  • GQA from KV-head count
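As a rough illustration, a trimmed HF-style config.json carrying such fields might look like the following. The key names follow common Hugging Face conventions (DeepSeek-style MLA/MoE keys, Llama-style GQA keys) and the values roughly mirror DeepSeek-V3's published config; the exact keys this loader reads are not specified here:

{
  "num_hidden_layers": 61,
  "hidden_size": 7168,
  "num_attention_heads": 128,
  "num_key_value_heads": 128,
  "q_lora_rank": 1536,
  "kv_lora_rank": 512,
  "n_routed_experts": 256,
  "num_experts_per_tok": 8
}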

Hardware Presets

Recommended pattern:

hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

Available preset families:

  • h100, h800, h20, h20-141g
  • a100-80gb, a100-40gb
  • b200, b300
  • TP forms such as 2xh100, 4xh20, 8xb200, 8xb300

Bundled Configs

Representative configs in configs/:

  • glm5-8xb200.yaml: GLM-5 on 8xb200; single-pool baseline config.
  • glm5-fp8-8xh20-141g.yaml: GLM-5-FP8 on 8xh20-141g, with a 0-32k input-length filter.
  • glm5-fp8-8xh20-141g-ca-tuned.yaml: same family as above, tuned for cache_affinity.
  • glm5-nvfp4-8xb300.yaml: GLM-5-NVFP4 on 8xb300.
  • glm5-nvfp4-fp8compute-8xb300.yaml: NVFP4 weights with FP8 compute on 8xb300.
  • qwen3-coder-480b-8xh20.yaml: Qwen3-Coder-480B-A35B on 8xh20.

Many of the glm5-<n>.yaml configs are bucket/slice-specific experiment points that use sim.input_length_min and sim.input_length_max.

Trace Inputs

This repository currently contains two trace sources:

  • bailian-traces/
    • glm_coder_blksz_512_040915-040917.jsonl
    • qwen3_coder_blksz_512_040915-040917.jsonl
  • qwen-bailian-usagetraces-anon/ submodule
    • public 16-token-block Qwen traces such as qwen_coder_blksz_16.jsonl and qwen_traceB_blksz_16.jsonl

The simulator expects JSONL records with fields like:

{
  "chat_id": 159,
  "parent_chat_id": 55,
  "timestamp": 61.114,
  "input_length": 521,
  "output_length": 132,
  "type": "text",
  "turn": 2,
  "hash_ids": [1089, 1090, 1091]
}

Only prefill-side behavior is modeled; output_length is used only for a decode tail in completion metrics.
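A quick way to eyeball a trace before replaying it is a jq pass over the JSONL (shown for the bundled GLM trace; note that jq -s slurps the whole file into memory):

jq -s 'map(.input_length) | {requests: length, min_len: min, max_len: max}' \
  bailian-traces/glm_coder_blksz_512_040915-040917.jsonl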

Outputs

Each run writes a directory under sim.output_dir:

  • summary.json: aggregate throughput, TTFT/E2E percentiles, hit rates, RDMA bytes, PCIe bytes.
  • per_request.csv: per-request latency and cache stats, including bucket, instance, and length_bucket_match.
  • instances.csv: periodic per-instance samples with bucket, instance, queue_len, and KV usage.
  • routing_log.jsonl: one JSON route decision per request, including global_mode, mode, chosen_bucket, candidate buckets, and candidate instances.

Additional outputs:

  • ablate: writes ablation.json
  • oracle: writes oracle.json
  • ablate --auto-instances: writes calibration runs under <output_dir>/auto_instances/

Quick inspection examples:

jq . runs/glm5_8xb200/summary.json
jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0, miss_rate}' \
  runs/glm5_8xb200/ablation.json
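routing_log.jsonl from a bucketed run can be sliced the same way, e.g. to see which requests the global router sent to the long bucket (bucket name from the sample config above; the run directory is illustrative):

jq -c 'select(.chosen_bucket == "long") | {global_mode, mode, chosen_bucket}' \
  runs/bucketed_run/routing_log.jsonl | head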

Oracle Semantics

oracle computes three hit-rate references at a chosen cache capacity:

  • unlimited.hit_rate: absolute ceiling with infinite capacity
  • belady_finite.hit_rate: offline-optimal eviction at the chosen capacity
  • lru_finite.hit_rate: LRU at the same capacity

When sim.input_length_min / sim.input_length_max are set, oracle still feeds the full trace into cache state but only counts requests inside the selected input-length range. That matches the intended "measure one bucket inside a mixed workload" interpretation.

The gap from lru_finite to belady_finite is eviction-policy headroom. The gap from belady_finite to unlimited is pure capacity headroom.
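Assuming oracle.json exposes the three references under exactly those key paths, both gaps can be computed in one jq call (run directory as in the inspection examples above):

jq '{eviction_headroom: (.belady_finite.hit_rate - .lru_finite.hit_rate),
     capacity_headroom: (.unlimited.hit_rate - .belady_finite.hit_rate)}' \
  runs/glm5_8xb200/oracle.json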

Testing

cargo test --release

The test suite covers config parsing, hardware presets, routing behavior, bucket-aware service semantics, oracle logic, and smoke-style end-to-end runs.
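cargo's built-in name filter can narrow a run to one of those areas; the filter strings below are guesses at test-name substrings, not verified module names:

cargo test --release oracle    # hypothetical filter: tests with "oracle" in the name
cargo test --release routing   # hypothetical filter: routing-behavior tests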
