feat: new router and benchmark setup

2026-04-16 14:23:53 +08:00
parent c86d931d8f
commit 996511f300
35 changed files with 1480 additions and 76 deletions
--- a/README.md
+++ b/README.md
@@ -227,6 +227,7 @@ coefficients (`flops_per_token_prefill`, `attn_quadratic_coeff`, etc.).
 | Model | Path | Architecture |
 |-------|------|--------------|
 | GLM-5 (744B/40B-active) | `models/GLM-5/config.json` | MoE (256 routed, 8 active, 1 shared) + MLA + DSA |
+| GLM-5-FP8 | `models/GLM-5-FP8/config.json` | GLM-5 architecture + upstream FP8 quantization metadata |
 | Qwen3-Coder-480B-A35B FP8 | `models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json` | MoE (160 experts, 8 active) + GQA |

 ## Hardware configuration
@@ -248,6 +249,7 @@ Available presets:
 | `h100`      | 989 TFLOPS | 80 GB   | 3.35 TB/s  | Gen5 |
 | `h800`      | 989 TFLOPS | 80 GB   | 3.35 TB/s  | Gen5 |
 | `h20`       | 148 TFLOPS | 96 GB   | 4.0 TB/s   | Gen5 |
+| `h20-141g`  | 148 TFLOPS | 141 GB  | 4.8 TB/s   | Gen5 |
 | `a100-80gb` | 312 TFLOPS | 80 GB   | 2.0 TB/s   | Gen4 |
 | `a100-40gb` | 312 TFLOPS | 40 GB   | 1.555 TB/s | Gen4 |
 | `b200`      | 2.25 PFLOPS| 192 GB  | 8.0 TB/s   | Gen6 |
@@ -297,6 +299,7 @@ memory_time  = layers * weight_bytes_per_layer / gpu_mem_bw
 | Config | Model | Hardware | Instances | Trace |
 |--------|-------|----------|-----------|-------|
 | `glm5-8xb200-hf.yaml` | GLM-5 via HF config.json | 8xB200 preset | 32 | GLM coder blk512 |
+| `glm5-fp8-8xh20-141g.yaml` | GLM-5-FP8 via ModelScope config.json | 8xH20-141G preset | 128 | GLM coder blk512 |
 | `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 via HF config.json | 8xB300 preset | 8 | GLM coder blk512 |
 | `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder via HF | 8xH20 preset | 32 | Qwen coder blk16 |