Update README with full feature documentation
Cover all 11 routing policies (including new prefix_affinity, cache_load, cache_score, estimated_ttft, least_tokens), HuggingFace config.json auto-parsing, GPU hardware presets, architecture-aware compute model (MoE/MLA/DSA/GQA), router parameter tuning, bundled model configs and config files, and available traces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
264
README.md
@@ -8,6 +8,26 @@ ablate routing strategies and cache sizing without spinning up any GPUs.
|
|||||||
Assumes **PD (prefill/decode) disaggregation** — only the prefill path is
|
Assumes **PD (prefill/decode) disaggregation** — only the prefill path is
|
||||||
modeled.
|
modeled.
|
||||||
|
|
||||||
|
## Features

- **Architecture-derived roofline compute** — auto-derives FLOPs,
  attention coefficients, and weight-streaming costs from model
  architecture (MoE, MLA, GQA, DSA, sliding window).
- **HuggingFace config.json auto-parsing** — drop in any HF
  `config.json` and the simulator extracts layer counts, attention heads,
  MoE expert configs, MLA LoRA ranks, and DSA sparse parameters.
- **Built-in GPU hardware presets** — H100, H800, H20, A100-80GB,
  A100-40GB, B200 with tensor-parallel scaling (e.g. `8xb200`).
- **Two-tier KV cache hierarchy** — L0 (GPU HBM) + L1 (CPU DRAM) with
  LRU eviction and cross-instance RDMA fetch via a cluster-wide
  meta-store.
- **11 routing policies** — from baselines (random, round-robin) to
  cache-aware (min\_pd, prefix\_affinity) for systematic ablation.
- **Token-bucket link contention** — PCIe and RDMA bandwidth modeled with
  reservation-based token-bucket queues.
- **Oracle analysis** — computes theoretical hit-rate ceilings (infinite
  cache, Belady optimal, LRU) for gap analysis.
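
To make the three oracle ceilings concrete, here is a minimal Python sketch that replays a flat sequence of block IDs. It is an illustration only: the simulator's own oracle is not shown in this README, and the function names and the flat-sequence simplification (no prefix structure, no multi-instance placement) are assumptions.

```python
from collections import OrderedDict

INF = float("inf")

def infinite_hit_rate(trace):
    """Hit rate with unbounded capacity: every re-access is a hit."""
    seen, hits = set(), 0
    for block in trace:
        hits += block in seen
        seen.add(block)
    return hits / len(trace)

def lru_hit_rate(trace, capacity):
    """Hit rate of an LRU cache holding `capacity` blocks."""
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = True
    return hits / len(trace)

def belady_hit_rate(trace, capacity):
    """Hit rate of Belady's policy: evict the block reused farthest ahead."""
    # next_use[i] is the index of the next access to trace[i], or INF.
    next_use, last_seen = [INF] * len(trace), {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], INF)
        last_seen[trace[i]] = i
    cache, hits = {}, 0                    # block -> index of its next use
    for i, block in enumerate(trace):
        if block in cache:
            hits += 1
        elif len(cache) >= capacity:
            del cache[max(cache, key=cache.get)]
        cache[block] = next_use[i]
    return hits / len(trace)

trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
print(infinite_hit_rate(trace), belady_hit_rate(trace, 3), lru_hit_rate(trace, 3))
```

On this toy trace the three values come out in the order infinite ≥ Belady ≥ LRU, which is the gap the oracle analysis is meant to expose.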
## Build
|
## Build
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@@ -26,7 +46,7 @@ git submodule update --init --recursive
|
|||||||
### 1. Run a single simulation
|
### 1. Run a single simulation
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
target/release/kvcache-sim run --config configs/qwen2.5-coder-7b-h800.yaml
|
target/release/kvcache-sim run --config configs/glm5-8xb200-hf.yaml
|
||||||
```
|
```
|
||||||
|
|
||||||
Prints `summary.json` to stdout and writes the full output directory
|
Prints `summary.json` to stdout and writes the full output directory
|
||||||
@@ -36,29 +56,28 @@ Prints `summary.json` to stdout and writes the full output directory
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
target/release/kvcache-sim ablate \
|
target/release/kvcache-sim ablate \
|
||||||
--config configs/qwen2.5-coder-7b-h800.yaml \
|
--config configs/glm5-8xb200-hf.yaml \
|
||||||
--num-instances 64 \
|
--routers random,least_loaded,least_tokens,min_pd,prefix_affinity \
|
||||||
--output-dir runs/qwen7b_n64 \
|
--output-dir runs/glm5_ablation
|
||||||
--routers random,least_loaded,ttl_aware,precise
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Writes one subdirectory per router plus a combined
|
Writes one subdirectory per router plus a combined
|
||||||
`runs/qwen7b_n64/ablation.json` with side-by-side summaries.
|
`ablation.json` with side-by-side summaries.
|
||||||
|
|
||||||
### 3. Compute theoretical hit-rate ceilings (oracle)
|
### 3. Compute theoretical hit-rate ceilings (oracle)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Cluster-aggregate capacity (default)
|
# Cluster-aggregate capacity (default)
|
||||||
target/release/kvcache-sim oracle \
|
target/release/kvcache-sim oracle \
|
||||||
--config configs/qwen2.5-coder-7b-h800.yaml --num-instances 64
|
--config configs/glm5-8xb200-hf.yaml --num-instances 64
|
||||||
|
|
||||||
# A single instance's HBM budget
|
# A single instance's HBM budget
|
||||||
target/release/kvcache-sim oracle \
|
target/release/kvcache-sim oracle \
|
||||||
--config configs/qwen2.5-coder-7b-h800.yaml --per-instance
|
--config configs/glm5-8xb200-hf.yaml --per-instance
|
||||||
|
|
||||||
# Explicit capacity in 16-token blocks
|
# Explicit capacity in blocks
|
||||||
target/release/kvcache-sim oracle \
|
target/release/kvcache-sim oracle \
|
||||||
--config configs/qwen2.5-coder-7b-h800.yaml --capacity-blocks 200000
|
--config configs/glm5-8xb200-hf.yaml --capacity-blocks 200000
|
||||||
```
|
```
|
||||||
|
|
||||||
Reports three numbers:
|
Reports three numbers:
|
||||||
@@ -74,7 +93,7 @@ only reachable by adding capacity.
|
|||||||
### 4. Validate a config without running
|
### 4. Validate a config without running
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
target/release/kvcache-sim validate --config configs/qwen2.5-coder-7b-h800.yaml
|
target/release/kvcache-sim validate --config configs/glm5-8xb200-hf.yaml
|
||||||
```
|
```
|
||||||
|
|
||||||
Parses the YAML, prints derived per-instance block budgets, and dumps
|
Parses the YAML, prints derived per-instance block budgets, and dumps
|
||||||
@@ -102,15 +121,179 @@ and `--out <PATH>`. `ablate` additionally takes `--routers <csv>`.
|
|||||||
|
|
||||||
Set `cluster.router.mode` in the YAML or list in `--routers`:
|
Set `cluster.router.mode` in the YAML or list in `--routers`:
|
||||||
|
|
||||||
| Mode | What it does |
|
| Mode | Aliases | What it does |
|
||||||
|----------------|--------------------------------------------------------------------|
|
|-------------------|------------------|--------------------------------------------------------------------------------------|
|
||||||
| `random` | Uniform random. Baseline. |
|
| `random` | | Uniform random. Baseline. |
|
||||||
| `round_robin` | Deterministic round-robin. Baseline. |
|
| `round_robin` | `rr` | Deterministic round-robin. Baseline. |
|
||||||
| `least_loaded` | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind. |
|
| `least_loaded` | | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind load balance. |
|
||||||
| `ttl_aware` | Picks instance with longest prefix in the global TTL meta store. |
|
| `least_tokens` | `lt` | `argmin(waiting_tokens)`. Pure load balance by queued compute work. |
|
||||||
| `precise` | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |
|
| `ttl_aware` | `ttl` | Picks instance with longest prefix in the global TTL meta-store. Cache-only. |
|
||||||
|
| `precise` | `precise_aware` | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |
|
||||||
|
| `min_pd` | `minpd`, `pd` | Minimizes `P*D` (prefill tokens x ongoing requests). Cluster-wide RDMA-aware. |
|
||||||
|
| `cache_load` | `cl` | Filters to least-loaded 1/4 instances, then picks best cache prefix. |
|
||||||
|
| `cache_score` | `cs` | Exponential scoring: `2^(alpha * queue_len + beta * miss_blocks)`. |
|
||||||
|
| `estimated_ttft` | `ettft`,`optimal`| Estimates `drain_time + fetch_time` per instance using architecture-aware compute. |
|
||||||
|
| `prefix_affinity` | `affinity`, `pa` | Rendezvous-hashed prefix fingerprinting for deterministic cache locality. |
|
||||||
|
|
||||||
Expected hit-rate ordering: `random ≲ least_loaded ≲ ttl_aware ≲ precise`.
|
### Router parameters

These fields in `cluster.router` tune specific routers:

| Field                      | Default | Used by           | Description                                          |
|----------------------------|---------|-------------------|------------------------------------------------------|
| `load_alpha`               | `1.0`   | `least_loaded`    | Weight of queue\_len vs kv\_blocks\_used             |
| `score_alpha`              | `1.0`   | `cache_score`     | Load weight in `2^(alpha*load + beta*miss)`          |
| `score_beta`               | `0.1`   | `cache_score`     | Cache-miss weight in `2^(alpha*load + beta*miss)`    |
| `prefix_k`                 | `8`     | `prefix_affinity` | Number of leading blocks for the prefix fingerprint  |
| `affinity_fan_out`         | `0`     | `prefix_affinity` | Top-K affinity candidates (0 = auto: n/8, min 2)     |
| `precise_probe_latency_us` | `50.0`  | `precise`         | Simulated per-probe latency (microseconds)           |
| `precise_probe_topk`       | `4`     | `precise`         | Number of instances probed                           |
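
As a rough illustration of how `load_alpha`, `score_alpha`, and `score_beta` enter the scoring formulas, the Python sketch below evaluates both expressions on three made-up instances. The instance records, the `pick_cache_score` helper, and the choice of the lowest score as the winner are assumptions for illustration, not the simulator's code.

```python
def least_loaded_score(kv_blocks_used, queue_len, load_alpha=1.0):
    """kv_blocks_used + alpha * queue_len, as in the routing table."""
    return kv_blocks_used + load_alpha * queue_len

def cache_score(queue_len, miss_blocks, score_alpha=1.0, score_beta=0.1):
    """2^(alpha * queue_len + beta * miss_blocks)."""
    return 2.0 ** (score_alpha * queue_len + score_beta * miss_blocks)

def pick_cache_score(instances, request_blocks):
    """Return the name of the instance with the lowest score (assumed rule)."""
    def score(inst):
        miss = request_blocks - inst["cached_prefix_blocks"]
        return cache_score(inst["queue_len"], miss)
    return min(instances, key=score)["name"]

# Hypothetical cluster state for a 50-block request.
instances = [
    {"name": "A", "queue_len": 2, "cached_prefix_blocks": 40},  # 2^(2 + 1.0) = 8
    {"name": "B", "queue_len": 0, "cached_prefix_blocks": 0},   # 2^(0 + 5.0) = 32
    {"name": "C", "queue_len": 4, "cached_prefix_blocks": 50},  # 2^(4 + 0.0) = 16
]
print(pick_cache_score(instances, request_blocks=50))
```

With the default weights, ten missing blocks cost as much as one queued request, so instance A (short queue, mostly cached) beats both the idle cold instance and the fully cached busy one.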

### Router design spectrum

```
Cache-only                        Hybrid                             Load-only
(hot-spot risk)                                                   (cache-blind)
┌─────────┬───────────┬───────────┬────────────┬───────────┬───────────┐
ttl_aware precise     cache_score min_pd       prefix_     least_      random
                      cache_load               affinity    loaded
                      est_ttft                             least_tokens
```

`prefix_affinity` sits in a unique position: it builds **proactive cache
locality** by consistently routing same-prefix requests to the same
instances (via rendezvous hashing), rather than reactively chasing
existing cache state. This yields the highest L0 hit rates while
maintaining load balance through within-group drain-time-aware selection.
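
The routing idea described above can be sketched in a few lines of Python. This is a hedged illustration, not the simulator's implementation: the SHA-256 hashing, the `fingerprint`, `rendezvous_rank`, `fan_out`, and `route` names, and the `drain_time` map are all assumptions. Only `prefix_k`, `affinity_fan_out`, and the "n/8, min 2" auto rule come from the parameter table.

```python
import hashlib

def fingerprint(hash_ids, prefix_k=8):
    """Fingerprint a request by its first `prefix_k` block hashes."""
    head = ",".join(str(h) for h in hash_ids[:prefix_k])
    return hashlib.sha256(head.encode()).hexdigest()

def rendezvous_rank(fp, instances):
    """Order instances by a per-(fingerprint, instance) hash, highest first."""
    def weight(inst):
        digest = hashlib.sha256(f"{fp}:{inst}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(instances, key=weight, reverse=True)

def fan_out(n, affinity_fan_out=0):
    """0 means auto: n/8 with a minimum of 2."""
    return affinity_fan_out if affinity_fan_out > 0 else max(2, n // 8)

def route(hash_ids, instances, drain_time, prefix_k=8, affinity_fan_out=0):
    """Pick the candidate with the shortest estimated drain time."""
    fp = fingerprint(hash_ids, prefix_k)
    k = fan_out(len(instances), affinity_fan_out)
    candidates = rendezvous_rank(fp, instances)[:k]
    return min(candidates, key=lambda inst: drain_time[inst])

instances = [f"inst-{i}" for i in range(32)]
drain_time = {name: float(i) for i, name in enumerate(instances)}  # made-up load
req_a = list(range(100, 120))
req_b = list(range(100, 108)) + [999, 998]  # same first 8 blocks, different tail
print(route(req_a, instances, drain_time), route(req_b, instances, drain_time))
```

Because the candidate set depends only on the fingerprint and the instance names, two requests that share their leading blocks land in the same group whatever the current cache state, and removing an instance outside that group leaves the group unchanged.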

## Model configuration

### HuggingFace config.json (recommended)

Point `model.config_json` at any HF `config.json` to auto-extract
architecture:

```yaml
model:
  config_json: ../models/GLM-5/config.json
  dtype_bytes: 2              # required (not in HF schema)
  block_size_tokens: 512      # required (not in HF schema)
```

Auto-detected features:

| Feature            | Detection trigger                                         | What it extracts                                                |
|--------------------|-----------------------------------------------------------|-----------------------------------------------------------------|
| **MoE**            | `n_routed_experts`, `num_local_experts`, or `num_experts` | Expert count, active experts, shared experts, expert FFN width  |
| **MLA**            | `kv_lora_rank` present                                    | KV/Q LoRA ranks, qk\_rope/nope dims, v\_head\_dim               |
| **DSA**            | `first_k_dense_replace` present                           | Dense window, sparse stride, first dense layers                 |
| **Sliding window** | `sliding_window` present                                  | Window size                                                     |
| **GQA**            | `num_key_value_heads < num_attention_heads`               | KV head count for grouped-query attention                       |

Explicit YAML fields always override the auto-detected values.
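
The detection triggers in the table can be read as simple key checks on the parsed JSON. The Python sketch below mirrors them on a made-up config. The `detect_features` helper and the sample values are hypothetical, and treating a `null` `sliding_window` as absent is an assumption, not confirmed behaviour of the simulator.

```python
import json

def detect_features(cfg):
    """Return the architecture features implied by an HF-style config dict."""
    features = []
    if any(k in cfg for k in ("n_routed_experts", "num_local_experts", "num_experts")):
        features.append("MoE")
    if "kv_lora_rank" in cfg:
        features.append("MLA")
    if "first_k_dense_replace" in cfg:
        features.append("DSA")
    if cfg.get("sliding_window") is not None:  # assumption: null counts as absent
        features.append("sliding_window")
    heads = cfg.get("num_attention_heads")
    kv_heads = cfg.get("num_key_value_heads")
    if heads is not None and kv_heads is not None and kv_heads < heads:
        features.append("GQA")
    return features

# Hypothetical config fragment, not copied from any bundled model.
sample = json.loads("""
{
  "num_hidden_layers": 4,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "num_experts": 16,
  "sliding_window": null
}
""")
print(detect_features(sample))
```

In practice the dict would come from `json.load` on the file that `model.config_json` points at.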

### Inline specification

Alternatively, specify architecture fields directly:

```yaml
model:
  name: qwen2.5-coder-7b
  num_layers: 28
  hidden_size: 3584
  num_attention_heads: 28
  num_kv_heads: 4
  head_dim: 128
  intermediate_size: 18944
  dtype_bytes: 2
  block_size_tokens: 16
```

When `hidden_size` is present, the compute model is auto-derived
(architecture mode). Without it, you must supply legacy manual
coefficients (`flops_per_token_prefill`, `attn_quadratic_coeff`, etc.).
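
For a sense of scale, the inline fields above are enough to estimate KV cache size per token and per block. The Python sketch below uses the conventional K-plus-V count of `2 * num_kv_heads * head_dim` elements per layer. Whether the simulator computes block budgets exactly this way is an assumption, and the helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of K and V stored per token across all layers (plain GQA/MHA)."""
    return 2 * num_kv_heads * head_dim * dtype_bytes * num_layers

def kv_bytes_per_block(block_size_tokens, **model):
    """Bytes occupied by one cache block of `block_size_tokens` tokens."""
    return block_size_tokens * kv_bytes_per_token(**model)

# Values from the qwen2.5-coder-7b inline example.
qwen = dict(num_layers=28, num_kv_heads=4, head_dim=128, dtype_bytes=2)
per_token = kv_bytes_per_token(**qwen)
per_block = kv_bytes_per_block(16, **qwen)
print(per_token, per_block)
```

Under these assumptions a token costs 57,344 bytes and a 16-token block costs 917,504 bytes, so dividing an instance's KV budget by the per-block figure gives a rough block count.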

### Bundled model configs

| Model                     | Path                                                    | Architecture                                     |
|---------------------------|---------------------------------------------------------|--------------------------------------------------|
| GLM-5 (744B/40B-active)   | `models/GLM-5/config.json`                              | MoE (256 routed, 8 active, 1 shared) + MLA + DSA |
| Qwen3-Coder-480B-A35B FP8 | `models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json` | MoE (160 experts, 8 active) + GQA                |

## Hardware configuration

### Using presets (recommended)

Set `hardware.type` to a preset name — individual fields can override:

```yaml
hardware:
  type: 8xb200
  hbm_bytes: 500.0e9    # override KV budget (after model weights)
```

Available presets:

| Preset      | FLOPS       | HBM    | Mem BW     | PCIe |
|-------------|-------------|--------|------------|------|
| `h100`      | 989 TFLOPS  | 80 GB  | 3.35 TB/s  | Gen5 |
| `h800`      | 989 TFLOPS  | 80 GB  | 3.35 TB/s  | Gen5 |
| `h20`       | 148 TFLOPS  | 96 GB  | 4.0 TB/s   | Gen5 |
| `a100-80gb` | 312 TFLOPS  | 80 GB  | 2.0 TB/s   | Gen4 |
| `a100-40gb` | 312 TFLOPS  | 40 GB  | 1.555 TB/s | Gen4 |
| `b200`      | 2.25 PFLOPS | 192 GB | 8.0 TB/s   | Gen6 |

Prefix with `2x`, `4x`, or `8x` for tensor-parallel groups (e.g.
`8xh20`). FLOPS, memory bandwidth, and HBM scale linearly; RDMA and
DRAM are set to sensible per-node defaults.
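
The linear scaling rule can be illustrated with a small Python lookup. This is a sketch under stated assumptions: the `PRESETS` dict copies the table above with 1 GB taken as 1e9 bytes, `resolve_preset` is a hypothetical helper, and the per-node RDMA and DRAM defaults are left out because the README does not list them.

```python
import math
import re

# Per-GPU numbers from the preset table (FLOPS, HBM bytes, memory bandwidth in B/s).
PRESETS = {
    "h100":      {"gpu_flops": 989e12,  "hbm_bytes": 80e9,  "gpu_mem_bw": 3.35e12},
    "h800":      {"gpu_flops": 989e12,  "hbm_bytes": 80e9,  "gpu_mem_bw": 3.35e12},
    "h20":       {"gpu_flops": 148e12,  "hbm_bytes": 96e9,  "gpu_mem_bw": 4.0e12},
    "a100-80gb": {"gpu_flops": 312e12,  "hbm_bytes": 80e9,  "gpu_mem_bw": 2.0e12},
    "a100-40gb": {"gpu_flops": 312e12,  "hbm_bytes": 40e9,  "gpu_mem_bw": 1.555e12},
    "b200":      {"gpu_flops": 2.25e15, "hbm_bytes": 192e9, "gpu_mem_bw": 8.0e12},
}

def resolve_preset(name):
    """Expand names such as '8xb200' into linearly scaled hardware fields."""
    match = re.fullmatch(r"(?:(2|4|8)x)?([a-z0-9-]+)", name.lower())
    if match is None or match.group(2) not in PRESETS:
        raise ValueError(f"unknown hardware preset: {name!r}")
    tp = int(match.group(1) or 1)  # tensor-parallel degree, 1 if no prefix
    return {field: value * tp for field, value in PRESETS[match.group(2)].items()}

hw = resolve_preset("8xb200")
print(hw)
```

For `8xb200` this gives 8 × 2.25 PFLOPS = 1.8e16 FLOPS and 8 × 8.0 TB/s = 6.4e13 B/s.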

### Inline specification

```yaml
hardware:
  gpu_flops: 1.80e16
  gpu_mem_bw: 6.40e13
  hbm_bytes: 500.0e9
  dram_bytes: 1.5e12
  pcie_bw: 128.0e9
  pcie_latency_us: 4.0
  rdma_bw: 50.0e9
  rdma_latency_us: 6.0
  max_batch_slots: 256
  prefill_chunk_tokens: 4096
```

## Architecture-aware compute model

The simulator derives a **roofline prefill model** from model
architecture:

```
prefill_time(N tokens) = max(compute_time, memory_time)

compute_time = layers * (N * linear_flops + attn_coeff * N * effective_ctx(N)) / gpu_flops
memory_time  = layers * weight_bytes_per_layer / gpu_mem_bw
```

- **MoE**: only active experts contribute to FLOPs and weight streaming
  (shared experts always counted)
- **MLA**: compressed KV projections reduce attention FLOPs; KV cache
  uses `kv_lora_rank + qk_rope_head_dim` instead of `2 * kv_heads * head_dim`
- **DSA**: `effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride`,
  with the first K layers using full dense attention
- **GQA**: fewer KV heads reduce both attention compute and KV cache size

## Bundled config files

| Config                         | Model                    | Hardware      | Instances | Trace            |
|--------------------------------|--------------------------|---------------|-----------|------------------|
| `glm5-8xb200-hf.yaml`          | GLM-5 via HF config.json | 8xB200 preset | 32        | GLM coder blk512 |
| `glm5-8xb200-blk512.yaml`      | GLM-5 inline             | 8xB200 inline | 64        | GLM coder blk512 |
| `glm5-8xb200.yaml`             | GLM-5 inline             | 8xB200 inline | 8         | GLM coder blk512 |
| `qwen3-coder-480b-8xh20.yaml`  | Qwen3-Coder via HF       | 8xH20 preset  | 32        | Qwen coder blk16 |
| `qwen2.5-coder-7b-h800.yaml`   | Qwen2.5-7B inline        | H800 inline   | 16        | Qwen coder blk16 |
| `qwen2.5-coder-7b-preset.yaml` | Qwen2.5-7B inline        | H800 preset   | 16        | Qwen coder blk16 |
| `qwen2.5-coder-32b-h800.yaml`  | Qwen2.5-32B inline       | H800 inline   | 16        | Qwen coder blk16 |

## Outputs
|
## Outputs
|
||||||
|
|
||||||
@@ -130,39 +313,33 @@ For `oracle`: an `oracle.json` with the three hit-rate analyses.
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Pretty-print the summary
|
# Pretty-print the summary
|
||||||
cat runs/qwen7b/summary.json | jq .
|
cat runs/glm5_8xb200_hf/summary.json | jq .
|
||||||
|
|
||||||
# Compare all routers from an ablation
|
# Compare all routers from an ablation
|
||||||
cat runs/qwen7b_n64/ablation.json | jq '.[] | {router, ttft_p50, hit_rate_l0, total_rdma_bytes}'
|
cat runs/glm5_8xb200_hf/ablation.json | \
|
||||||
|
jq '.[] | {router, ttft_mean, ttft_p50, hit_rate_l0, miss_rate}'
|
||||||
|
|
||||||
# Hit-rate ceilings vs LRU at the same capacity
|
# Sort by TTFT
|
||||||
cat runs/qwen7b/oracle.json | jq '{unlimited: .unlimited.hit_rate, belady: .belady_finite.hit_rate, lru: .lru_finite.hit_rate}'
|
cat runs/glm5_8xb200_hf/ablation.json | \
|
||||||
|
jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0}'
|
||||||
```
|
```
|
||||||
|
|
||||||
## Config
|
|
||||||
|
|
||||||
A config is a single YAML file with four sections. A working example
|
|
||||||
lives at
|
|
||||||
[`configs/qwen2.5-coder-7b-h800.yaml`](configs/qwen2.5-coder-7b-h800.yaml);
|
|
||||||
copy and edit for other models/hardware.
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
model: # shape + prefill roofline coefficients
|
|
||||||
hardware: # per-instance GPU/PCIe/RDMA capabilities + batch knobs
|
|
||||||
cluster: # num_instances, meta_store TTL, router mode
|
|
||||||
sim: # trace_path, max_requests, output_dir, seed
|
|
||||||
```
|
|
||||||
|
|
||||||
Only prefill-side model coefficients are used; any decode fields in
|
|
||||||
legacy YAMLs are accepted and ignored.
|
|
||||||
|
|
||||||
## Trace format
|
## Trace format
|
||||||
|
|
||||||
The simulator reads the Alibaba
|
The simulator reads the Alibaba
|
||||||
[`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon)
|
[`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon)
|
||||||
JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`,
|
JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`,
|
||||||
`output_length`, and `hash_ids` (16-token block hashes). Only the
|
`output_length`, and `hash_ids` (block hashes, typically 16 tokens each).
|
||||||
input side is used.
|
Only the input side is used.
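
To show what a trace record looks like in practice, here is a small Python sketch that parses two JSONL lines and counts how many leading blocks they share. The field names come from the schema above, but the sample values, the value types, and the `shared_prefix_blocks` helper are invented for illustration.

```python
import json

def shared_prefix_blocks(a, b):
    """Number of leading block hashes two requests have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

# Two made-up records in the documented schema (one JSON object per line).
jsonl = """\
{"chat_id": 1, "timestamp": 0.0, "input_length": 48, "output_length": 10, "hash_ids": [11, 12, 13]}
{"chat_id": 2, "timestamp": 1.5, "input_length": 64, "output_length": 20, "hash_ids": [11, 12, 99, 100]}
"""
records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
overlap = shared_prefix_blocks(records[0]["hash_ids"], records[1]["hash_ids"])
print(overlap)
```

A second request that shares leading `hash_ids` with an earlier one can reuse that many cached blocks if it is routed to an instance that still holds them.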
|
||||||
|
|
||||||
|
Available traces in the submodule:

| Trace                          | Requests | Description                     |
|--------------------------------|----------|---------------------------------|
| `qwen_coder_blksz_16.jsonl`    | 43k      | Qwen Coder serving traffic      |
| `qwen_traceA_blksz_16.jsonl`   | 43k      | Qwen general traffic A          |
| `qwen_traceB_blksz_16.jsonl`   | 173k     | Qwen general traffic B          |
| `qwen_thinking_blksz_16.jsonl` | 11k      | Qwen reasoning/thinking traffic |
|
||||||
|
|
||||||
## Testing
|
## Testing
|
||||||
|
|
||||||
@@ -170,5 +347,6 @@ input side is used.
|
|||||||
cargo test --release
|
cargo test --release
|
||||||
```
|
```
|
||||||
|
|
||||||
16 tests: 15 unit + 1 smoke that runs all four routers on a synthetic
|
28 tests: 27 unit tests (compute model, HF config parsing, hardware
|
||||||
|
presets) + 1 integration smoke test that runs routers on a synthetic
|
||||||
shared-prefix trace and asserts the expected hit-rate ordering.
|
shared-prefix trace and asserts the expected hit-rate ordering.
|
||||||
|
|||||||