obsidian/260306.md at a57afa86b47c58aeca557e7cbcb0d38b81159d78

Files

Gahow Wang a57afa86b4 Initial commit: obsidian to gitea

2026-05-07 15:04:41 +08:00

37 KiB

Raw Blame History

在 dash0 上，测试 vllm 社区版的性能（Trace A，10min），对比 ali 的事实上确实性能更好，但是 Qwen3-235B-A22B 内部 fp8 和直接的 Qwen3-235B-A22B-FP8 模型应该不能直接对比（模型参数大小都不一样）
实现在 dashp 上进行 tuning
- 支持读 h20.config，每次 propose 的配置修改这个 h20.config，重启 dashllm_cmd serving
- 支持等待模型启动，再启动 trace-replayer，支持修改 trace-replayer 的参数
测试 prod 环境的 tuner
- 测试判断 dashllm_cmd serving 的逻辑是否能正常启动/停止
- 判断所有的 tunable flag
- 启动短请求的 tuning 测试（600 reqs）

[INFO] 2026-03-07 14:30:45,385531: {"filename":"_inference_vllm.py","lineno":508,"message":"vllm version: 0.13.0rc2, v5:False, v6:True engine_args: {'model': '/dev/s
hm/dashllm_model_0', 'device': 'cuda', 'dtype': 'bfloat16', 'tensor_parallel_size': 8, 'enforce_eager': False, 'gpu_memory_utilization': 0.8, 'block_size': 256, 'swa
p_space': 4, 'max_num_seqs': 256, 'max_num_batched_tokens': None, 'trust_remote_code': True, 'disable_custom_all_reduce': False, 'skip_tokenizer_init': False, 'disab
le_log_stats': False, 'quantization': None, 'max_model_len': 262144}","time":"2026-03-07 14:30:45.385"}

{"tag": "llm_request", "logged_at": "2026-03-07T06:42:59.251199+08:00", "step": 1, "llm_calls": 1, "messages": [{"role": "system", "content": "You are a system optimization agent for LLM inference tuning. You must choose when to call tools to inspect hardware, profile the model, profile the workload, and run benchmarks. Use the tools to gather evidence before making recommendations. Always output strict JSON with no extra text."}, {"role": "user", "content": "Task: act as the tuning agent for a system benchmark.\nObjective: maximize the goodput_qps_per_gpu\nModel path: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717\nTrace path: qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl\nEngine: trace-replayer\nDefault trace_minutes=10.0, time_scale=2.0, max_requests=None\nConstraints: min_requests=100, max_trace_minutes=10.0, allowed_sample_policies=['head', 'tail', 'random_window', 'random_requests'], allowed_arrival_modes=['trace']\nGoodput SLO: prefill-only mode, enforce only TTFT p95 <= 0.001*L + 1.0 s (pass_rate>=0.95).\n\nHuman baseline config provided; it will be benchmarked automatically.\nBaseline config:\n```json\n{\n  \"engine_args\": {\n    \"block_size\": 64,\n    \"compilation_config\": {\n      \"cudagraph_mode\": \"PIECEWISE\",\n      \"custom_ops\": [\n        \"all\"\n      ],\n      \"max_cudagraph_capture_size\": 2048,\n      \"use_inductor\": false\n    },\n    \"cuda_graph_sizes\": [\n      16,\n      32,\n      64,\n      96,\n      128,\n      160,\n      192,\n      224,\n      256,\n      288,\n      320,\n      352,\n      384,\n      416,\n      448,\n      480,\n      512,\n      544,\n      576,\n      608,\n      640,\n      672,\n      704,\n      736,\n      768,\n      800,\n      832,\n      864,\n      896,\n      928,\n      960,\n      992,\n      1024\n    ],\n    \"disable_hybrid_kv_cache_manager\": true,\n    \"enable_chunked_prefill\": true,\n    \"enable_expert_parallel\": false,\n    \"enable_prefix_caching\": true,\n    \"enforce_eager\": false,\n    \"gpu_memory_utilization\": 0.85,\n    \"hf_overrides\": {\n      \"architectures\": [\n        \"Qwen3MoeForCausalLM\"\n      ],\n      \"model_type\": \"qwen3_moe\"\n    },\n    \"kv_cache_dtype\": \"fp8\",\n    \"kv_transfer_config\": {\n      \"kv_connector\": \"HybridConnector\",\n      \"kv_connector_extra_config\": {\n        \"backend\": \"kvt\"\n      },\n      \"kv_role\": \"kv_producer\"\n    },\n    \"max_model_len\": 262144,\n    \"max_num_batched_tokens\": 8192,\n    \"max_num_seqs\": 64,\n    \"quantization\": \"fp8\",\n    \"swap_space\": 0,\n    \"tensor_parallel_size\": 4\n  },\n  \"envs\": {\n    \"ACCL_IBV_MTU\": \"9000\",\n    \"ACCL_LOW_LATENCY_OPTIMIZE\": \"2\",\n    \"ACCL_RETRANSMIT_TIMEOUT\": \"17\",\n    \"ACCL_TX_DEPTH\": \"1024\",\n    \"ACCL_WRITEBATCH_OPT\": \"2\",\n    \"BLLM_KVTRANS_RDMA_SP\": \"2\",\n    \"DEEPEP_LL_BUFFER_FP8_OPT\": \"1\",\n    \"DEEPEP_LL_COMBINE_USE_FP8\": \"1\",\n    \"DEEPEP_LL_COMBINE_USE_NVL\": \"1\",\n    \"DEEPEP_LL_DISPATCH_USE_NVL\": \"1\",\n    \"GLOO_SOCKET_IFNAME\": \"eth1\",\n    \"NCCL_SOCKET_IFNAME\": \"eth1\",\n    \"NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME\": \"eth1\",\n    \"NVSHMEM_IBGDA_NUM_RC_PER_PE\": \"4\",\n    \"VLLM_ATTENTION_BACKEND\": \"FLASH_ATTN\",\n    \"VLLM_DEEP_GEMM_WARMUP\": \"skip\",\n    \"VLLM_DP_META_USE_CPU_GROUP\": \"0\",\n    \"VLLM_ENABLE_TBO_OPT\": \"0\",\n    \"VLLM_ENABLE_TORCH_COMPILE\": \"1\",\n    \"VLLM_FP8_USE_BLADNN\": \"1\",\n    \"VLLM_FUSED_MOE_CHUNK_SIZE\": \"4096\",\n    \"VLLM_MLA_FP8_ATTENTION\": \"0\",\n    \"VLLM_MOE_BALANCED_GATING\": \"0\",\n    \"VLLM_MOE_EXPERTS_OVERLAP\": \"0\",\n    \"VLLM_MOE_RANDOM_GATING\": \"0\",\n    \"VLLM_MOE_USE_BLADNN\": \"1\",\n    \"VLLM_MOE_USE_DEEPEP\": \"0\",\n    \"VLLM_PD_TRY_CONNECT_TIMEOUT_SECONDS\": \"120\",\n    \"VLLM_QUANTIZATION_LAYER_WISE\": \"1\",\n    \"VLLM_RESPONSE_TIMEOUT\": \"290\",\n    \"VLLM_USE_DEEP_GEMM\": \"0\",\n    \"VLLM_USE_FLASHINFER_SAMPLER\": \"0\",\n    \"VLLM_USE_V1\": \"1\"\n  }\n}\n```\nCurrent baseline values for allowed tunable config paths:\n```json\n{\n  \"engine_args\": {\n    \"block_size\": 64,\n    \"compilation_config\": {\n      \"cudagraph_mode\": \"PIECEWISE\",\n      \"custom_ops\": [\n        \"all\"\n      ],\n      \"max_cudagraph_capture_size\": 2048,\n      \"use_inductor\": false\n    },\n    \"cuda_graph_sizes\": [\n      16,\n      32,\n      64,\n      96,\n      128,\n      160,\n      192,\n      224,\n      256,\n      288,\n      320,\n      352,\n      384,\n      416,\n      448,\n      480,\n      512,\n      544,\n      576,\n      608,\n      640,\n      672,\n      704,\n      736,\n      768,\n      800,\n      832,\n      864,\n      896,\n      928,\n      960,\n      992,\n      1024\n    ],\n    \"enable_chunked_prefill\": true,\n    \"enable_expert_parallel\": false,\n    \"gpu_memory_utilization\": 0.85,\n    \"max_num_batched_tokens\": 8192,\n    \"max_num_seqs\": 64,\n    \"tensor_parallel_size\": 4\n  },\n  \"envs\": {\n    \"VLLM_ATTENTION_BACKEND\": \"FLASH_ATTN\",\n    \"VLLM_FP8_USE_BLADNN\": \"1\",\n    \"VLLM_MOE_USE_BLADNN\": \"1\",\n    \"VLLM_MOE_USE_DEEPEP\": \"0\"\n  }\n}\n```\nSample policy meanings: head=first window, tail=last window, random_window=random contiguous window, random_requests=random subset within window.\ntrace_minutes selects the head window used to build the fixed trace; time_scale rescales timestamps within that window.\n\nSearch space specification (docs/vllm-search-space.md):\n## Global rules (constraints) the agent must enforce\n\n1. **Distributed executor backend auto-selection / validity (affects TP × PP)**\n\n   * `--distributed-executor-backend` supports: `external_launcher`, `mp`, `ray`, `uni`.\n   * If `--distributed-executor-backend` is **not set** (`None`):\n\n     * If `pipeline_parallel_size * tensor_parallel_size <= (#GPUs available)`, vLLM will use **`mp`** (single-host multiprocessing).\n     * Otherwise, **vLLM raises an error** (invalid configuration).\n   * To use **`mp`**, you must also set `--nnodes`.\n   * To use **`ray`**, you must explicitly set `--distributed-executor-backend ray`.\n   * **TPU** only supports **Ray** for distributed inference.\n\n2. **Context Parallel constraints (GQA/MQA)**\n\n   * `tensor_parallel_size % decode_context_parallel_size == 0`.\n   * If `prefill_context_parallel_size > 1` or `decode_context_parallel_size > 1`, require\n     `tensor_parallel_size > num_kv_heads`.\n   * `tp * pp * dp * pcp` and `tp * pp * dp * dcp` must fit within world size.\n\n3. **KV cache block size device constraint**\n\n   * `--block-size` choices: `16, 32`\n   * On **CUDA devices**, only `block_size <= 32` is supported.\n\n4. **MoE sharding relationship (impacts feasibility/performance expectations)**\n\n   * With MoE models, MoE layers are sharded according to: `tensor_parallel_size * data_parallel_size`.\n\n5. **Chunked prefill scheduling rule**\n\n   * If `--enable-chunked-prefill` is enabled, prefill requests **can be chunked** based on the **remaining** `max_num_batched_tokens` budget.\n\n6. **MoE all2all coupling**\n\n   * `--all2all-backend` must be **unset** unless `--enable-expert-parallel` is true.\n\n7. **Human-readable integer parsing (max-num-batched-tokens)**\n\n   * `--max-num-batched-tokens` accepts human-readable integers like `1k`, `2M`, and decimals with multipliers:\n\n     * `1k -> 1000`, `1K -> 1024`, `25.6k -> 25600`\n\n---\n\n## AITuner search space table (copy-paste Markdown)\n\n| Category              | Flag                                       | Type / Domain                                         | Default                                 | What it controls (performance)                                                                                    | Constraints / rules to enforce                                                                                                     |\n| --------------------- | ------------------------------------------ | ----------------------------------------------------- | --------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- |\n| Parallelism           | `--tensor-parallel-size` (`-tp`)           | int                                                   | `1`                                     | Tensor-parallel groups; changes compute/communication balance and per-GPU memory footprint.                       | Affected by global distributed-backend rule (TP×PP vs available GPUs when backend not set).                                        |\n| Parallelism           | `--pipeline-parallel-size` (`-pp`)         | int                                                   | `1`                                     | Pipeline-parallel groups; reduces per-GPU memory but may introduce pipeline bubbles / latency.                    | Affected by global distributed-backend rule (TP×PP vs available GPUs when backend not set).                                        |\n| Parallelism           | `--data-parallel-size` (`-dp`)             | int                                                   | `1`                                     | Data-parallel replicas; increases throughput capacity and changes how MoE sharding scales with TP.                | For MoE models: MoE sharding factor is `tp * dp` (track as a derived relationship).                                                |\n| Context Parallel      | `--decode-context-parallel-size` (`-dcp`)  | int                                                   | `1`                                     | Decode-stage context parallel groups; reuses GPUs in a TP group to change decode parallelism / KV placement.      | **Hard constraint:** `tp % dcp == 0`.                                                                                              |\n| Context Parallel      | `--prefill-context-parallel-size` (`-pcp`) | int                                                   | `1`                                     | Prefill-stage context parallel groups; affects long-prompt prefill scalability and memory behavior.               | No additional hard constraint stated on the `serve` arguments page beyond being a group count.                                     |\n| MoE / Expert Parallel | `--enable-expert-parallel` (`-ep`)         | bool                                                  | `False`                                 | Uses expert parallelism (instead of tensor parallelism) for **MoE layers**, changing communication/compute shape. | Meaningful for MoE models; pair with `--all2all-backend`.                                                                          |\n| MoE / Communication   | `--all2all-backend`                        | enum                                                  | `allgather_reducescatter`               | Selects All2All backend for MoE expert-parallel communication; can strongly affect throughput/latency.            | Must be one of: `allgather_reducescatter`, `deepep_high_throughput`, `deepep_low_latency`, `flashinfer_all2allv`, `naive`, `pplx`. |\n| Overlap / Dual-batch  | `--enable-dbo`                             | bool                                                  | `False`                                 | Enables dual batch overlap in the model executor; can improve GPU utilization and throughput.                     | No additional hard constraint stated.                                                                                              |\n| Overlap / Microbatch  | `--ubatch-size`                            | int                                                   | `0`                                     | Microbatch (uBatch) size; affects overlap granularity vs scheduling overhead.                                     | No additional hard constraint stated.                                                                                              |\n| Overlap / Threshold   | `--dbo-decode-token-threshold`             | int                                                   | `32`                                    | Threshold for decode-only batches: above it uses microbatching, otherwise single batch.                           | Behavioral rule: if request tokens `>` threshold ⇒ microbatching; else single batch.                                               |\n| Overlap / Threshold   | `--dbo-prefill-token-threshold`            | int                                                   | `512`                                   | Threshold for batches with ≥1 prefill: above it uses microbatching, otherwise single batch.                       | Behavioral rule: if request tokens `>` threshold ⇒ microbatching; else single batch.                                               |\n| KV Cache              | `--block-size`                             | enum {`16`,`32`}                                      | `None` (platform-chosen if unspecified) | KV cache block granularity; affects fragmentation, paging overhead, and KV locality.                              | **Hard constraint (CUDA):** `block_size <= 32`. Always choose from allowed enum.                                                   |\n| Batching              | `--max-num-seqs`                           | int or `None`                                         | `None`                                  | Max sequences per iteration; trades throughput vs scheduling overhead and tail latency.                           | No additional hard numeric bounds stated on the `serve` arguments page.                                                            |\n| Batching              | `--max-num-batched-tokens`                 | int or human-readable string (`1k`, `2M`, `25.6k`, …) | `None`                                  | Max tokens processed per iteration; primary knob for throughput/latency tradeoff.                                 | Must parse human-readable forms correctly (`1k=1000`, `1K=1024`, decimals allowed).                                                |\n| Prefill Scheduling    | `--enable-chunked-prefill`                 | bool or `None`                                        | `None`                                  | Enables chunking of prefill requests so they can fit into the per-iteration token budget.                         | Behavioral rule: when enabled, prefill can be chunked based on remaining `max_num_batched_tokens`.                                 |\n\nCurrent state:\nhardware: not profiled\nmodel: not profiled\nworkload: not profiled\nproposed_runs_successful: 0\n\nrecent runs: none\n\nbottleneck signals: none\n\nfailed config ledger: none\n\nAvailable tools:\n- inspect_hardware: Probe the current node and return HardwareProfile. (args: no arguments)\n- profile_model: Load model config and return ModelProfile. (args: model_path: string (optional; defaults to configured model path), model_name: string (optional), model_config: string (optional))\n- profile_workload: Summarize the trace workload and return WorkloadProfile. (args: trace_path: string (optional; defaults to configured trace path), workload_profile_path: string (optional), workload_name: string (optional), trace_minutes: number (optional), time_scale: number (optional), min_input_length: int (optional), max_input_length: int (optional), sample_policy: string (optional), sample_seed: int (optional), max_requests: int (optional), prefill_only: boolean (optional))\n- count_trace_requests: Count trace requests after applying filters. (args: trace_path: string (optional), trace_minutes: number (optional), min_input_length: int (optional), max_input_length: int (optional), sample_policy: string (optional), sample_seed: int (optional), max_requests: int (optional))\n- sample_trace_requests: Sample trace records into a temporary trace file with optional filtering and time scaling. (args: trace_path: string (optional; defaults to configured trace path), trace_minutes: number (optional), time_scale: number (optional), min_input_length: int (optional), max_input_length: int (optional), sample_policy: string (optional), sample_seed: int (optional), max_requests: int (optional), prefill_only: boolean (optional), output_path: string (optional))\n- get_constraints: Return benchmark constraints and allowed override keys. (args: no arguments)\n- run_trace_replayer: Run a trace-replayer benchmark with a VLLMConfig and optional overrides. (args: config: object (partial config patch; engine_args/envs in production mode), run_id: string (optional), kind: seed | iter (optional), benchmark_overrides: object (optional; e.g. trace_minutes, sample_policy, replica_count), use_cache: boolean (optional), proposal: object (required; fields: selection_logic, why_better, expected_improvements))\n\nExploration policy:\n- Start with aggressive exploration: vary major knobs by large factors early (tp/pp/dp/pcp/dcp, block_size, max_num_batched_tokens, max_num_seqs, enable_expert_parallel) to cover a wide search space.\n- If a human baseline exists, start with small deltas from the baseline config before large jumps.\n- Avoid tiny tweaks until you have enough diverse runs; then refine based on results.\n- Keep trace_minutes and max_requests fixed to the configured values; do not override them unless the user explicitly asks.\n- Before stopping, validate top configs on the full configured trace_minutes to confirm they hold up.\n- time_scale is controlled by the tuner's binary search; do not set benchmark_overrides.time_scale.\n- You may call sample_trace_requests to estimate request counts for candidate trace_minutes before running.\n\nOptimization policy:\n- If objective_score_expr is provided, treat it as the primary objective and maximize it.\n- Otherwise optimize for the textual objective.\n\nBottleneck analysis requirement (strict):\n- Every run_trace_replayer proposal.selection_logic MUST follow exactly this structure:\n  Observation:\n  - cite at least 3 numeric metrics from recent bottleneck signals.\n  Inference:\n  - identify the primary bottleneck and why those metrics imply it.\n  Action:\n  - map bottleneck to concrete config knob changes for the next run.\n- Do not propose a config without this O->I->A chain.\n\nBaseline policy:\n- If a human baseline run is present, prioritize configs that improve goodput over the baseline while meeting SLOs.\n\nKnob guidance:\n- prefill_context_parallel_size and decode_context_parallel_size can be traded against tp/pp/dp; keep tp*pp*pcp*dp within the GPU count (per-replica when replica_scaling is enabled) and ensure decode_context_parallel_size divides tensor_parallel_size.\n- If prefill/decode context parallel > 1, require tensor_parallel_size > num_kv_heads.\n- block_size must be one of {16, 32}; never use 8.\n- all2all_backend only applies when enable_expert_parallel is true; set it alongside expert parallel changes.\n- all2all_backend must be one of {allgather_reducescatter, deepep_high_throughput, deepep_low_latency, flashinfer_all2allv, naive, pplx}; use null or omit for the default.\n\n- Provide numeric values for max_num_batched_tokens (no human-readable suffixes).\n\nOutput JSON schema:\n{\n  \"action\": \"tool_calls | final\",\n  \"tool_calls\": [\n    {\"tool\": \"string\", \"args\": {\"key\": \"value\"}}\n  ],\n  \"analysis\": \"string\",\n  \"final\": {\n    \"summary\": \"string\",\n    \"recommended_config\": {\"config\": {\"field\": \"value\"}},\n    \"should_stop\": true\n  }\n}\nAllowed config paths (use only these in tool args.config and final.recommended_config.config):\n- engine_args.tensor_parallel_size\n- engine_args.data_parallel_size\n- engine_args.block_size\n- engine_args.max_num_batched_tokens\n- engine_args.max_num_seqs\n- engine_args.gpu_memory_utilization\n- engine_args.enable_expert_parallel\n- engine_args.enable_chunked_prefill\n- engine_args.cuda_graph_sizes\n- engine_args.compilation_config\n- envs.VLLM_FP8_USE_BLADNN\n- envs.VLLM_MOE_USE_BLADNN\n- envs.VLLM_ATTENTION_BACKEND\n- envs.VLLM_MOE_USE_DEEPEP\nAllowed metrics for expected_improvements.metric:\n- latency_ms_mean\n- latency_ms_p50\n- latency_ms_p90\n- latency_ms_p95\n- latency_ms_p99\n- ttft_ms_mean\n- ttft_ms_p50\n- ttft_ms_p90\n- ttft_ms_p95\n- ttft_ms_p99\n- tpot_ms_mean\n- tpot_ms_p50\n- tpot_ms_p90\n- tpot_ms_p95\n- tpot_ms_p99\n- overall_qps\n- overall_tokens_per_second\n- decode_tokens_throughput\n- total_requests\n- decode_tokens_total\n- goodput_qps\n- goodput_qps_per_gpu\n- goodput_tps\n- goodput_tps_per_gpu\n- goodput_slo_ok\n- tokens_per_second_mean\n- tokens_per_second_p50\n- tokens_per_second_p95\n- tokens_per_second_p99\n- request_throughput\n- notes_count\nBottleneck metrics to cite in Observation (prefer these):\n- diag_engine_prefix_cache_hit_rate_window\n- diag_engine_kv_cache_usage_perc_mean\n- diag_engine_kv_cache_usage_perc_p95\n- diag_engine_num_requests_running_mean\n- diag_engine_num_requests_running_p95\n- diag_engine_num_requests_waiting_mean\n- diag_engine_num_requests_waiting_p95\n- diag_engine_queue_ratio\n- diag_engine_prefill_ratio\n- diag_engine_decode_ratio\n- diag_engine_request_prefill_time_seconds_window_mean_s\n- diag_engine_request_decode_time_seconds_window_mean_s\n- diag_engine_e2e_request_latency_seconds_window_mean_s\n- diag_engine_prompt_tokens_per_s\n- diag_engine_generation_tokens_per_s\n- diag_engine_preemption_rate\nRules:\n- Use action=tool_calls to invoke tools.\n- Use action=final when you are ready to stop.\n- run_trace_replayer requires hardware, model, and workload profiles.\n- run_trace_replayer config may be partial; missing values inherit from the last successful run if available.\n- Only set benchmark_overrides.trace_minutes or benchmark_overrides.max_requests when the user explicitly requests it.\n- Do not set benchmark_overrides.time_scale; it is ignored.\n- trace_minutes in benchmark_overrides sets the per-run trace duration (must be <= max_trace_minutes).\n- Only override benchmark knobs listed in get_constraints.\n- Every run_trace_replayer call MUST include a proposal object with:\n  * selection_logic: why you chose this config.\n  * why_better: what you expect to be better by using it.\n  * expected_improvements: list of {metric, direction, min_delta_pct, rationale}.\n- Avoid repeating exact failed config signatures listed in failed config ledger.\n- Provide JSON only; no code fences."}, {"role": "assistant", "content": "{\n  \"action\": \"tool_calls\",\n  \"tool_calls\": [\n    {\n      \"tool\": \"inspect_hardware\",\n      \"args\": {}\n    },\n    {\n      \"tool\": \"profile_model\",\n      \"args\": {\n        \"model_path\": \"/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717\"\n      }\n    },\n    {\n      \"tool\": \"profile_workload\",\n      \"args\": {\n        \"trace_path\": \"qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl\",\n        \"trace_minutes\": 10.0,\n        \"time_scale\": 2.0,\n        \"prefill_only\": true\n      }\n    }\n  ],\n  \"analysis\": \"To begin optimization, we must first gather foundational information about the hardware, model, and workload. This will inform our initial exploration strategy and ensure that proposed configurations are feasible and targeted.\"\n}"}, {"role": "user", "content": "{\n  \"tool_results\": [\n    {\n      \"ok\": true,\n      \"tool\": \"inspect_hardware\",\n      \"result\": {\n        \"hardware\": {\n          \"gpu_type\": \"NVIDIA H20\",\n          \"num_gpus\": 8,\n          \"hbm_gb\": 95.5771484375,\n          \"nvlink_topology\": \"pcie\",\n          \"cpu_cores\": 160,\n          \"system_memory_gb\": 1482.4867248535156,\n          \"world_size\": 8\n        }\n      }\n    },\n    {\n      \"ok\": true,\n      \"tool\": \"profile_model\",\n      \"result\": {\n        \"model\": {\n          \"model_name\": \"/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717\",\n          \"param_count\": 158999773184,\n          \"hidden_size\": 4096,\n          \"num_layers\": 94,\n          \"num_heads\": 64,\n          \"is_moe\": true,\n          \"num_experts\": 128,\n          \"is_mla\": false,\n          \"num_kv_heads\": 4,\n          \"num_experts_per_tok\": 8,\n          \"vocab_size\": 151936,\n          \"max_position_embeddings\": 262144\n        }\n      }\n    },\n    {\n      \"ok\": true,\n      \"tool\": \"profile_workload\",\n      \"result\": {\n        \"workload\": {\n          \"workload_name\": \"qwen_traceA_blksz_16\",\n          \"qps\": 6.824186356627912,\n          \"avg_prompt_tokens\": 1875.0840254030288,\n          \"p95_prompt_tokens\": 7697.0,\n          \"avg_decode_tokens\": 1.0,\n          \"p95_decode_tokens\": 1.0,\n          \"request_type\": \"trace_replay\",\n          \"source_file\": \"qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl\",\n          \"duration_minutes\": 4.999375,\n          \"time_scale\": \"2.0\",\n          \"p99_total_tokens\": 13081.0\n        }\n      }\n    }\n  ]\n}\n\nUpdated state:\nhardware: 8x NVIDIA H20, HBM 95.6GB, nvlink=pcie\nmodel: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717, params=158999773184, layers=94, heads=64\nworkload: qwen_traceA_blksz_16, qps=6.82, prompt(avg/p95)=1875.1/7697.0, decode(avg/p95)=1.0/1.0\nproposed_runs_successful: 0\nrecent runs: none"}, {"role": "assistant", "content": "{\"action\": \"tool_calls\", \"tool_calls\": [{\"tool\": \"run_trace_replayer\", \"args\": {\"config\": {\"engine_args\": {\"swap_space\": 0, \"gpu_memory_utilization\": 0.85, \"enable_prefix_caching\": true, \"enable_chunked_prefill\": true, \"tensor_parallel_size\": 4, \"max_num_batched_tokens\": 8192, \"enable_expert_parallel\": false, \"disable_hybrid_kv_cache_manager\": true, \"max_model_len\": 262144, \"block_size\": 64, \"max_num_seqs\": 64, \"enforce_eager\": false, \"quantization\": \"fp8\", \"cuda_graph_sizes\": [16, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024], \"compilation_config\": {\"cudagraph_mode\": \"PIECEWISE\", \"use_inductor\": false, \"custom_ops\": [\"all\"], \"max_cudagraph_capture_size\": 2048}, \"kv_transfer_config\": {\"kv_connector\": \"HybridConnector\", \"kv_role\": \"kv_producer\", \"kv_connector_extra_config\": {\"backend\": \"kvt\"}}, \"hf_overrides\": {\"architectures\": [\"Qwen3MoeForCausalLM\"], \"model_type\": \"qwen3_moe\"}, \"kv_cache_dtype\": \"fp8\"}, \"envs\": {\"VLLM_USE_V1\": \"1\", \"VLLM_ENABLE_TORCH_COMPILE\": \"1\", \"VLLM_ATTENTION_BACKEND\": \"FLASH_ATTN\", \"VLLM_QUANTIZATION_LAYER_WISE\": \"1\", \"VLLM_MOE_USE_DEEPEP\": \"0\", \"VLLM_ENABLE_TBO_OPT\": \"0\", \"VLLM_MOE_BALANCED_GATING\": \"0\", \"VLLM_MOE_RANDOM_GATING\": \"0\", \"VLLM_FUSED_MOE_CHUNK_SIZE\": \"4096\", \"VLLM_DP_META_USE_CPU_GROUP\": \"0\", \"VLLM_MLA_FP8_ATTENTION\": \"0\", \"VLLM_MOE_EXPERTS_OVERLAP\": \"0\", \"VLLM_USE_FLASHINFER_SAMPLER\": \"0\", \"VLLM_RESPONSE_TIMEOUT\": \"290\", \"VLLM_FP8_USE_BLADNN\": \"1\", \"VLLM_MOE_USE_BLADNN\": \"1\", \"VLLM_USE_DEEP_GEMM\": \"0\", \"VLLM_PD_TRY_CONNECT_TIMEOUT_SECONDS\": \"120\", \"VLLM_DEEP_GEMM_WARMUP\": \"skip\", \"DEEPEP_LL_COMBINE_USE_FP8\": \"1\", \"DEEPEP_LL_BUFFER_FP8_OPT\": \"1\", \"DEEPEP_LL_DISPATCH_USE_NVL\": \"1\", \"DEEPEP_LL_COMBINE_USE_NVL\": \"1\", \"ACCL_LOW_LATENCY_OPTIMIZE\": \"2\", \"ACCL_WRITEBATCH_OPT\": \"2\", \"ACCL_IBV_MTU\": \"9000\", \"ACCL_TX_DEPTH\": \"1024\", \"ACCL_RETRANSMIT_TIMEOUT\": \"17\", \"NVSHMEM_IBGDA_NUM_RC_PER_PE\": \"4\", \"BLLM_KVTRANS_RDMA_SP\": \"2\", \"NCCL_SOCKET_IFNAME\": \"eth1\", \"NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME\": \"eth1\", \"GLOO_SOCKET_IFNAME\": \"eth1\"}}, \"run_id\": \"baseline\", \"kind\": \"baseline\", \"use_cache\": true, \"proposal\": {\"selection_logic\": \"human_baseline\", \"why_better\": \"expert baseline for comparison\", \"expected_improvements\": [{\"metric\": \"goodput_qps_per_gpu\", \"direction\": \"increase\", \"min_delta_pct\": 0.0, \"rationale\": \"record expert baseline performance\"}]}}}]}"}, {"role": "user", "content": "{\n  \"tool_results\": [\n    {\n      \"ok\": true,\n      \"tool\": \"run_trace_replayer\",\n      \"result\": {\n        \"run_id\": \"baseline-goodput\",\n        \"success\": true,\n        \"config_patch\": {\n          \"engine_args\": {\n            \"swap_space\": 0,\n            \"gpu_memory_utilization\": 0.85,\n            \"enable_prefix_caching\": true,\n            \"enable_chunked_prefill\": true,\n            \"tensor_parallel_size\": 4,\n            \"max_num_batched_tokens\": 8192,\n            \"enable_expert_parallel\": false,\n            \"disable_hybrid_kv_cache_manager\": true,\n            \"max_model_len\": 262144,\n            \"block_size\": 64,\n            \"max_num_seqs\": 64,\n            \"enforce_eager\": false,\n            \"quantization\": \"fp8\",\n            \"cuda_graph_sizes\": [\n              16,\n              32,\n              64,\n              96,\n              128,\n              160,\n              192,\n              224,\n              256,\n              288,\n              320,\n              352,\n              384,\n              416,\n              448,\n              480,\n              512,\n              544,\n              576,\n              608,\n              640,\n              672,\n              704,\n              736,\n              768,\n              800,\n              832,\n              864,\n              896,\n              928,\n              960,\n              992,\n              1024\n            ],\n            \"compilation_config\": {\n              \"cudagraph_mode\": \"PIECEWISE\",\n              \"use_inductor\": false,\n              \"custom_ops\": [\n                \"all\"\n              ],\n              \"max_cudagraph_capture_size\": 2048\n            },\n            \"kv_transfer_config\": {\n              \"kv_connector\": \"HybridConnector\",\n              \"kv_role\": \"kv_producer\",\n              \"kv_connector_extra_config\": {\n                \"backend\": \"kvt\"\n              }\n            },\n            \"hf_overrides\": {\n              \"architectures\": [\n                \"Qwen3MoeForCausalLM\"\n              ],\n              \"model_type\": \"qwen3_moe\"\n            },\n            \"kv_cache_dtype\": \"fp8\"\n          },\n          \"envs\": {\n            \"VLLM_USE_V1\": \"1\",\n            \"VLLM_ENABLE_TORCH_COMPILE\": \"1\",\n            \"VLLM_ATTENTION_BACKEND\": \"FLASH_ATTN\",\n            \"VLLM_QUANTIZATION_LAYER_WISE\": \"1\",\n            \"VLLM_MOE_USE_DEEPEP\": \"0\",\n            \"VLLM_ENABLE_TBO_OPT\": \"0\",\n            \"VLLM_MOE_BALANCED_GATING\": \"0\",\n            \"VLLM_MOE_RANDOM_GATING\": \"0\",\n            \"VLLM_FUSED_MOE_CHUNK_SIZE\": \"4096\",\n            \"VLLM_DP_META_USE_CPU_GROUP\": \"0\",\n            \"VLLM_MLA_FP8_ATTENTION\": \"0\",\n            \"VLLM_MOE_EXPERTS_OVERLAP\": \"0\",\n            \"VLLM_USE_FLASHINFER_SAMPLER\": \"0\",\n            \"VLLM_RESPONSE_TIMEOUT\": \"290\",\n            \"VLLM_FP8_USE_BLADNN\": \"1\",\n            \"VLLM_MOE_USE_BLADNN\": \"1\",\n            \"VLLM_USE_DEEP_GEMM\": \"0\",\n            \"VLLM_PD_TRY_CONNECT_TIMEOUT_SECONDS\": \"120\",\n            \"VLLM_DEEP_GEMM_WARMUP\": \"skip\",\n            \"DEEPEP_LL_COMBINE_USE_FP8\": \"1\",\n            \"DEEPEP_LL_BUFFER_FP8_OPT\": \"1\",\n            \"DEEPEP_LL_DISPATCH_USE_NVL\": \"1\",\n            \"DEEPEP_LL_COMBINE_USE_NVL\": \"1\",\n            \"ACCL_LOW_LATENCY_OPTIMIZE\": \"2\",\n            \"ACCL_WRITEBATCH_OPT\": \"2\",\n            \"ACCL_IBV_MTU\": \"9000\",\n            \"ACCL_TX_DEPTH\": \"1024\",\n            \"ACCL_RETRANSMIT_TIMEOUT\": \"17\",\n            \"NVSHMEM_IBGDA_NUM_RC_PER_PE\": \"4\",\n            \"BLLM_KVTRANS_RDMA_SP\": \"2\",\n            \"NCCL_SOCKET_IFNAME\": \"eth1\",\n            \"NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME\": \"eth1\",\n            \"GLOO_SOCKET_IFNAME\": \"eth1\"\n          }\n        },\n        \"aggregated\": {\n          \"latency_ms_mean\": 613.304,\n          \"latency_ms_p50\": 288.101,\n          \"latency_ms_p90\": 1545.816,\n          \"latency_ms_p95\": 2152.527,\n          \"latency_ms_p99\": 3646.883,\n          \"ttft_ms_mean\": 613.304,\n          \"ttft_ms_p50\": 288.101,\n          \"ttft_ms_p90\": 1545.816,\n          \"ttft_ms_p95\": 2152.527,\n          \"ttft_ms_p99\": 3646.883,\n          \"tpot_ms_mean\": 613.304,\n          \"tpot_ms_p50\": 288.101,\n          \"tpot_ms_p90\": 1545.816,\n          \"tpot_ms_p95\": 2152.527,\n          \"tpot_ms_p99\": 3646.883,\n          \"overall_qps\": 1.999,\n          \"overall_tokens_per_second\": 3751.174962510804,\n          \"decode_tokens_throughput\": 1.999,\n          \"total_requests\": 2047.0,\n          \"decode_tokens_total\": 2047.0,\n          \"tokens_per_second_mean\": 120.0,\n          \"tokens_per_second_p50\": 120.0,\n          \"tokens_per_second_p95\": 120.0,\n          \"tokens_per_second_p99\": 120.0,\n          \"request_throughput\": 1.999,\n          \"notes_count\": 0.0,\n          \"failed_request_count\": 0.0,\n          \"ttft_slo_pass_rate\": 0.9604298974108452,\n          \"goodput_qps\": 1.9992733466683337,\n          \"goodput_qps_per_gpu\": 0.4998183366670834,\n          \"goodput_tps\": 3751.174962510804,\n          \"goodput_tps_per_gpu\": 937.793740627701,\n          \"goodput_slo_ok\": 1.0,\n          \"diag_resource_sample_count\": 0.0,\n          \"diag_engine_sample_count\": 0.0,\n          \"diag_metric_collection_error_count\": 0.0,\n          \"time_scale\": 0.5859375\n        },\n        \"error_message\": null,\n        \"failure_class\": null,\n        \"error_detail\": null,\n        \"proposal_eval\": {\n          \"status\": \"no_baseline\"\n        },\n        \"cache_info\": {\n          \"cache_root\": \".cache\",\n          \"cache_key\": \"667e20a07bd84216c5e8b641e9e69b6a41ce4d9a\",\n          \"cache_dir\": \".cache/667e20a07bd84216c5e8b641e9e69b6a41ce4d9a\",\n          \"index_path\": \".cache/667e20a07bd84216c5e8b641e9e69b6a41ce4d9a/index.json\",\n          \"next_index\": 1,\n          \"entries\": [\n            {\n              \"entry_dir\": \".cache/667e20a07bd84216c5e8b641e9e69b6a41ce4d9a/0\",\n              \"metrics_path\": \".cache/667e20a07bd84216c5e8b641e9e69b6a41ce4d9a/0/metrics.json\",\n              \"metrics_mtime_s\": 1772836973.0017326\n            }\n          ],\n          \"use_cache\": true,\n          \"cached_metrics_count\": 1,\n          \"cache_hit\": true\n        },\n        \"time_scale\": 0.5859375,\n        \"goodput_search\": {\n          \"status\": \"binary_search\",\n          \"min_scale\": 0.5859375,\n          \"max_scale\": 0.59375,\n          \"selected_scale\": 0.5859375,\n          \"ttft_slo\": 60.0,\n          \"tpot_slo\": 0.05,\n          \"slo_percentile\": \"p95\",\n          \"goodput_qps\": 1.9992733466683337,\n          \"goodput_qps_per_gpu\": 0.4998183366670834,\n          \"goodput_slo_ok\": 1.0\n        },\n        \"ignored_benchmark_overrides\": {}\n      }\n    }\n  ]\n}\n\nUpdated state:\nhardware: 8x NVIDIA H20, HBM 95.6GB, nvlink=pcie\nmodel: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717, params=158999773184, layers=94, heads=64\nworkload: qwen_traceA_blksz_16, qps=6.82, prompt(avg/p95)=1875.1/7697.0, decode(avg/p95)=1.0/1.0\nproposed_runs_successful: 0\nrecent runs:\n - baseline-goodput [ok] engine_args.block_size: None -> 64, engine_args.compilation_config.cudagraph_mode: None -> 'PIECEWISE', engine_args.compilation_config.custom_ops: None -> ['all'], engine_args.compilation_config.max_cudagraph_capture_size: None -> 2048, engine_args.compilation_config.use_inductor: None -> False, engine_args.cuda_graph_sizes: None -> [16, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024], engine_args.disable_hybrid_kv_cache_manager: None -> True, engine_args.enable_chunked_prefill: None -> True, engine_args.enable_expert_parallel: None -> False, engine_args.enable_prefix_caching: None -> True, engine_args.enforce_eager: None -> False, engine_args.gpu_memory_utilization: None -> 0.85, ... (45 more) | qps=1.999, tps=3751.174962510804, lat_mean=613.304, p95_lat=2152.527, ttft_mean=613.304, p95_ttft=2152.527, tpot_mean=613.304, p95_tpot=2152.527, goodput=1.9992733466683337, goodput_per_gpu=0.4998183366670834"}]}

37 KiB Raw Blame History Unescape Escape

37 KiB

Raw Blame History