From 8d41123418869edbaa99be76de89678e3bff69f6 Mon Sep 17 00:00:00 2001 From: Gahow Wang Date: Tue, 14 Apr 2026 01:21:28 +0800 Subject: [PATCH] Update README with full feature documentation Cover all 11 routing policies (including new prefix_affinity, cache_load, cache_score, estimated_ttft, least_tokens), HuggingFace config.json auto-parsing, GPU hardware presets, architecture-aware compute model (MoE/MLA/DSA/GQA), router parameter tuning, bundled model configs and config files, and available traces. Co-Authored-By: Claude Opus 4.6 --- README.md | 266 +++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 222 insertions(+), 44 deletions(-) diff --git a/README.md b/README.md index 1dde7d3..b859653 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,26 @@ ablate routing strategies and cache sizing without spinning up any GPUs. Assumes **PD (prefill/decode) disaggregation** — only the prefill path is modeled. +## Features + +- **Architecture-derived roofline compute** — auto-derives FLOPs, + attention coefficients, and weight-streaming costs from model + architecture (MoE, MLA, GQA, DSA, sliding window). +- **HuggingFace config.json auto-parsing** — drop in any HF + `config.json` and the simulator extracts layer counts, attention heads, + MoE expert configs, MLA LoRA ranks, and DSA sparse parameters. +- **Built-in GPU hardware presets** — H100, H800, H20, A100-80GB, + A100-40GB, B200 with tensor-parallel scaling (e.g. `8xb200`). +- **Two-tier KV cache hierarchy** — L0 (GPU HBM) + L1 (CPU DRAM) with + LRU eviction and cross-instance RDMA fetch via a cluster-wide + meta-store. +- **11 routing policies** — from baselines (random, round-robin) to + cache-aware (min\_pd, prefix\_affinity) for systematic ablation. +- **Token-bucket link contention** — PCIe and RDMA bandwidth modeled with + reservation-based token-bucket queues. +- **Oracle analysis** — computes theoretical hit-rate ceilings (infinite + cache, Belady optimal, LRU) for gap analysis. + ## Build ```bash @@ -26,7 +46,7 @@ git submodule update --init --recursive ### 1. Run a single simulation ```bash -target/release/kvcache-sim run --config configs/qwen2.5-coder-7b-h800.yaml +target/release/kvcache-sim run --config configs/glm5-8xb200-hf.yaml ``` Prints `summary.json` to stdout and writes the full output directory @@ -36,29 +56,28 @@ Prints `summary.json` to stdout and writes the full output directory ```bash target/release/kvcache-sim ablate \ - --config configs/qwen2.5-coder-7b-h800.yaml \ - --num-instances 64 \ - --output-dir runs/qwen7b_n64 \ - --routers random,least_loaded,ttl_aware,precise + --config configs/glm5-8xb200-hf.yaml \ + --routers random,least_loaded,least_tokens,min_pd,prefix_affinity \ + --output-dir runs/glm5_ablation ``` Writes one subdirectory per router plus a combined -`runs/qwen7b_n64/ablation.json` with side-by-side summaries. +`ablation.json` with side-by-side summaries. ### 3. Compute theoretical hit-rate ceilings (oracle) ```bash # Cluster-aggregate capacity (default) target/release/kvcache-sim oracle \ - --config configs/qwen2.5-coder-7b-h800.yaml --num-instances 64 + --config configs/glm5-8xb200-hf.yaml --num-instances 64 # A single instance's HBM budget target/release/kvcache-sim oracle \ - --config configs/qwen2.5-coder-7b-h800.yaml --per-instance + --config configs/glm5-8xb200-hf.yaml --per-instance -# Explicit capacity in 16-token blocks +# Explicit capacity in blocks target/release/kvcache-sim oracle \ - --config configs/qwen2.5-coder-7b-h800.yaml --capacity-blocks 200000 + --config configs/glm5-8xb200-hf.yaml --capacity-blocks 200000 ``` Reports three numbers: @@ -74,7 +93,7 @@ only reachable by adding capacity. ### 4. Validate a config without running ```bash -target/release/kvcache-sim validate --config configs/qwen2.5-coder-7b-h800.yaml +target/release/kvcache-sim validate --config configs/glm5-8xb200-hf.yaml ``` Parses the YAML, prints derived per-instance block budgets, and dumps @@ -102,15 +121,179 @@ and `--out `. `ablate` additionally takes `--routers `. Set `cluster.router.mode` in the YAML or list in `--routers`: -| Mode | What it does | -|----------------|--------------------------------------------------------------------| -| `random` | Uniform random. Baseline. | -| `round_robin` | Deterministic round-robin. Baseline. | -| `least_loaded` | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind. | -| `ttl_aware` | Picks instance with longest prefix in the global TTL meta store. | -| `precise` | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. | +| Mode | Aliases | What it does | +|-------------------|------------------|--------------------------------------------------------------------------------------| +| `random` | | Uniform random. Baseline. | +| `round_robin` | `rr` | Deterministic round-robin. Baseline. | +| `least_loaded` | | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind load balance. | +| `least_tokens` | `lt` | `argmin(waiting_tokens)`. Pure load balance by queued compute work. | +| `ttl_aware` | `ttl` | Picks instance with longest prefix in the global TTL meta-store. Cache-only. | +| `precise` | `precise_aware` | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. | +| `min_pd` | `minpd`, `pd` | Minimizes `P*D` (prefill tokens x ongoing requests). Cluster-wide RDMA-aware. | +| `cache_load` | `cl` | Filters to least-loaded 1/4 instances, then picks best cache prefix. | +| `cache_score` | `cs` | Exponential scoring: `2^(alpha * queue_len + beta * miss_blocks)`. | +| `estimated_ttft` | `ettft`,`optimal`| Estimates `drain_time + fetch_time` per instance using architecture-aware compute. | +| `prefix_affinity` | `affinity`, `pa` | Rendezvous-hashed prefix fingerprinting for deterministic cache locality. | -Expected hit-rate ordering: `random ≲ least_loaded ≲ ttl_aware ≲ precise`. +### Router parameters + +These fields in `cluster.router` tune specific routers: + +| Field | Default | Used by | Description | +|--------------------------|---------|------------------|------------------------------------------------------| +| `load_alpha` | `1.0` | `least_loaded` | Weight of queue\_len vs kv\_blocks\_used | +| `score_alpha` | `1.0` | `cache_score` | Load weight in `2^(alpha*load + beta*miss)` | +| `score_beta` | `0.1` | `cache_score` | Cache-miss weight in `2^(alpha*load + beta*miss)` | +| `prefix_k` | `8` | `prefix_affinity`| Number of leading blocks for the prefix fingerprint | +| `affinity_fan_out` | `0` | `prefix_affinity`| Top-K affinity candidates (0 = auto: n/8, min 2) | +| `precise_probe_latency_us`| `50.0`| `precise` | Simulated per-probe latency (microseconds) | +| `precise_probe_topk` | `4` | `precise` | Number of instances probed | + +### Router design spectrum + +``` + Cache-only Hybrid Load-only + (hot-spot risk) (cache-blind) + ┌─────────┬───────────┬───────────┬────────────┬───────────┬───────────┐ + ttl_aware precise cache_score min_pd prefix_ least_ random + cache_load affinity loaded + est_ttft least_tokens +``` + +`prefix_affinity` sits in a unique position: it builds **proactive cache +locality** by consistently routing same-prefix requests to the same +instances (via rendezvous hashing), rather than reactively chasing +existing cache state. This yields the highest L0 hit rates while +maintaining load balance through within-group drain-time-aware selection. + +## Model configuration + +### HuggingFace config.json (recommended) + +Point `model.config_json` at any HF `config.json` to auto-extract +architecture: + +```yaml +model: + config_json: ../models/GLM-5/config.json + dtype_bytes: 2 # required (not in HF schema) + block_size_tokens: 512 # required (not in HF schema) +``` + +Auto-detected features: + +| Feature | Detection trigger | What it extracts | +|-----------|-------------------------------|----------------------------------------------| +| **MoE** | `n_routed_experts`, `num_local_experts`, or `num_experts` | Expert count, active experts, shared experts, expert FFN width | +| **MLA** | `kv_lora_rank` present | KV/Q LoRA ranks, qk\_rope/nope dims, v\_head\_dim | +| **DSA** | `first_k_dense_replace` present| Dense window, sparse stride, first dense layers | +| **Sliding window** | `sliding_window` present | Window size | +| **GQA** | `num_key_value_heads < num_attention_heads` | KV head count for grouped-query attention | + +Explicit YAML fields always override the auto-detected values. + +### Inline specification + +Alternatively, specify architecture fields directly: + +```yaml +model: + name: qwen2.5-coder-7b + num_layers: 28 + hidden_size: 3584 + num_attention_heads: 28 + num_kv_heads: 4 + head_dim: 128 + intermediate_size: 18944 + dtype_bytes: 2 + block_size_tokens: 16 +``` + +When `hidden_size` is present, the compute model is auto-derived +(architecture mode). Without it, you must supply legacy manual +coefficients (`flops_per_token_prefill`, `attn_quadratic_coeff`, etc.). + +### Bundled model configs + +| Model | Path | Architecture | +|-------|------|--------------| +| GLM-5 (744B/40B-active) | `models/GLM-5/config.json` | MoE (256 routed, 8 active, 1 shared) + MLA + DSA | +| Qwen3-Coder-480B-A35B FP8 | `models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json` | MoE (160 experts, 8 active) + GQA | + +## Hardware configuration + +### Using presets (recommended) + +Set `hardware.type` to a preset name — individual fields can override: + +```yaml +hardware: + type: 8xb200 + hbm_bytes: 500.0e9 # override KV budget (after model weights) +``` + +Available presets: + +| Preset | FLOPS | HBM | Mem BW | PCIe | +|-------------|------------|---------|------------|------| +| `h100` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 | +| `h800` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 | +| `h20` | 148 TFLOPS | 96 GB | 4.0 TB/s | Gen5 | +| `a100-80gb` | 312 TFLOPS | 80 GB | 2.0 TB/s | Gen4 | +| `a100-40gb` | 312 TFLOPS | 40 GB | 1.555 TB/s | Gen4 | +| `b200` | 2.25 PFLOPS| 192 GB | 8.0 TB/s | Gen6 | + +Prefix with `2x`, `4x`, or `8x` for tensor-parallel groups (e.g. +`8xh20`). FLOPS, memory bandwidth, and HBM scale linearly; RDMA and +DRAM are set to sensible per-node defaults. + +### Inline specification + +```yaml +hardware: + gpu_flops: 1.80e16 + gpu_mem_bw: 6.40e13 + hbm_bytes: 500.0e9 + dram_bytes: 1.5e12 + pcie_bw: 128.0e9 + pcie_latency_us: 4.0 + rdma_bw: 50.0e9 + rdma_latency_us: 6.0 + max_batch_slots: 256 + prefill_chunk_tokens: 4096 +``` + +## Architecture-aware compute model + +The simulator derives a **roofline prefill model** from model +architecture: + +``` +prefill_time(N tokens) = max(compute_time, memory_time) + +compute_time = layers * (N * linear_flops + attn_coeff * N * effective_ctx(N)) / gpu_flops +memory_time = layers * weight_bytes_per_layer / gpu_mem_bw +``` + +- **MoE**: only active experts contribute to FLOPs and weight streaming + (shared experts always counted) +- **MLA**: compressed KV projections reduce attention FLOPs; KV cache + uses `kv_lora_rank + qk_rope_head_dim` instead of `2 * kv_heads * head_dim` +- **DSA**: `effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride`, + with the first K layers using full dense attention +- **GQA**: fewer KV heads reduce both attention compute and KV cache size + +## Bundled config files + +| Config | Model | Hardware | Instances | Trace | +|--------|-------|----------|-----------|-------| +| `glm5-8xb200-hf.yaml` | GLM-5 via HF config.json | 8xB200 preset | 32 | GLM coder blk512 | +| `glm5-8xb200-blk512.yaml` | GLM-5 inline | 8xB200 inline | 64 | GLM coder blk512 | +| `glm5-8xb200.yaml` | GLM-5 inline | 8xB200 inline | 8 | GLM coder blk512 | +| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder via HF | 8xH20 preset | 32 | Qwen coder blk16 | +| `qwen2.5-coder-7b-h800.yaml` | Qwen2.5-7B inline | H800 inline | 16 | Qwen coder blk16 | +| `qwen2.5-coder-7b-preset.yaml` | Qwen2.5-7B inline | H800 preset | 16 | Qwen coder blk16 | +| `qwen2.5-coder-32b-h800.yaml` | Qwen2.5-32B inline | H800 inline | 16 | Qwen coder blk16 | ## Outputs @@ -123,46 +306,40 @@ Each run writes a directory under `sim.output_dir`: | `instances.csv` | `t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample | | `routing_log.jsonl` | One JSON per request: all router candidates + chosen instance + reason | -For `ablate`: an extra `ablation.json` with one summary per router. +For `ablate`: an extra `ablation.json` with one summary per router. For `oracle`: an `oracle.json` with the three hit-rate analyses. ### Reading results quickly ```bash # Pretty-print the summary -cat runs/qwen7b/summary.json | jq . +cat runs/glm5_8xb200_hf/summary.json | jq . # Compare all routers from an ablation -cat runs/qwen7b_n64/ablation.json | jq '.[] | {router, ttft_p50, hit_rate_l0, total_rdma_bytes}' +cat runs/glm5_8xb200_hf/ablation.json | \ + jq '.[] | {router, ttft_mean, ttft_p50, hit_rate_l0, miss_rate}' -# Hit-rate ceilings vs LRU at the same capacity -cat runs/qwen7b/oracle.json | jq '{unlimited: .unlimited.hit_rate, belady: .belady_finite.hit_rate, lru: .lru_finite.hit_rate}' +# Sort by TTFT +cat runs/glm5_8xb200_hf/ablation.json | \ + jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0}' ``` -## Config - -A config is a single YAML file with four sections. A working example -lives at -[`configs/qwen2.5-coder-7b-h800.yaml`](configs/qwen2.5-coder-7b-h800.yaml); -copy and edit for other models/hardware. - -```yaml -model: # shape + prefill roofline coefficients -hardware: # per-instance GPU/PCIe/RDMA capabilities + batch knobs -cluster: # num_instances, meta_store TTL, router mode -sim: # trace_path, max_requests, output_dir, seed -``` - -Only prefill-side model coefficients are used; any decode fields in -legacy YAMLs are accepted and ignored. - ## Trace format The simulator reads the Alibaba [`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon) JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`, -`output_length`, and `hash_ids` (16-token block hashes). Only the -input side is used. +`output_length`, and `hash_ids` (block hashes, typically 16 tokens each). +Only the input side is used. + +Available traces in the submodule: + +| Trace | Requests | Description | +|-------|----------|-------------| +| `qwen_coder_blksz_16.jsonl` | 43k | Qwen Coder serving traffic | +| `qwen_traceA_blksz_16.jsonl` | 43k | Qwen general traffic A | +| `qwen_traceB_blksz_16.jsonl` | 173k | Qwen general traffic B | +| `qwen_thinking_blksz_16.jsonl` | 11k | Qwen reasoning/thinking traffic | ## Testing @@ -170,5 +347,6 @@ input side is used. cargo test --release ``` -16 tests: 15 unit + 1 smoke that runs all four routers on a synthetic +28 tests: 27 unit tests (compute model, HF config parsing, hardware +presets) + 1 integration smoke test that runs routers on a synthetic shared-prefix trace and asserts the expected hit-rate ordering.