You are an expert Python systems engineer working on an LLM inference auto-tuning project.

The repository layout currently looks like:

llm_autotune/
├── adapter/
│   └── __init__.py
├── config_generator.py
├── config_space/
│   └── __init__.py
├── feasibility/
│   ├── filter_configs.py
│   ├── __init__.py
│   ├── memory_model.py
│   ├── model_meta.py
│   ├── topology_rules.py
│   └── __pycache__/...
├── harness/
│   └── __init__.py
├── inspector/
│   ├── hardware_inspector.py
│   ├── probe.py
│   ├── workload_profiler.py
│   └── __pycache__/...
├── search/
│   ├── heuristic.py
│   └── __init__.py
├── store/
│   ├── json_store.py
│   └── __init__.py
├── util/
│   └── __init__.py
└── scripts/
    ├── run_inspect.py
    ├── run_search.py
    ├── run_vllm_benchmark.py
    └── run_workload_benchmarks.py

Goal

Implement the core infrastructure for an AI-driven LLM inference config auto-tuner, with four concrete capabilities:

  1. A robust profiling & logging tool that can run vLLM benchmarks and record detailed logs for later bottleneck analysis.
  2. A basic config generator that produces VALID vLLM configs for a given hardware + model + workload.
  3. A unified harness to run vidur (simulation) and vLLM (real system) with the same interface, writing detailed profiling logs.
  4. An AI-driven iterative loop that reads historical results, diagnoses bottlenecks, proposes new configs, and calls the harness repeatedly.

General requirements

  • Python 3.12+ with type hints and dataclasses where appropriate.
  • Keep dependencies minimal; assume we can add small, well-known libs if necessary, but prefer the standard library.
  • Avoid ad-hoc global state; pass objects explicitly.
  • Logging:
    • Use the logging module, not print, for internal logs.
    • User-facing scripts may still print concise summaries to stdout.
  • All public functions should have clear docstrings.

Part 0: Core types & JSON store

  1. Create or extend a module harness/types.py (or harness/__init__.py if you prefer) to define the core data classes (a partial sketch follows this list):

    • HardwareProfile

      • High-level fields like: gpu_type, num_gpus, hbm_gb, nvlink_topology (string or simple struct), cpu_cores, system_memory_gb, etc.
      • This should be compatible with what inspector/hardware_inspector.py can produce.
    • ModelProfile

      • Fields like: model_name, param_count, hidden_size, num_layers, num_heads, is_moe, num_experts, is_mla, max_position_embeddings, etc.
      • This should be compatible with feasibility/model_meta.py.
    • WorkloadProfile

      • Fields like: workload_name, qps, avg_prompt_tokens, p95_prompt_tokens, avg_decode_tokens, p95_decode_tokens, request_type, etc.
      • This should be compatible with inspector/workload_profiler.py.
    • VLLMConfig

      • A structured representation of the core vLLM config knobs we care about:
        • tensor_parallel_size, pipeline_parallel_size, expert_parallel_size, data_parallel_size
        • block_size, max_num_batched_tokens, max_num_seqs
        • gpu_memory_utilization
        • scheduling_policy (string)
        • router/admission knobs if applicable
        • any other important vLLM engine args we need.
    • BenchmarkRunConfig

      • Fields:
        • run_id: str
        • engine: Literal["vllm", "vidur"]
        • vllm_config: VLLMConfig
        • workload: WorkloadProfile
        • objective: str
        • extra: dict[str, Any] | None (for future extensions)
    • BenchmarkResult

      • Fields:
        • run_id: str
        • success: bool
        • aggregated_metrics: dict[str, float] # e.g., {"qps": ..., "p95_latency_ms": ..., "ttft_ms": ...}
        • hw: HardwareProfile
        • model: ModelProfile
        • workload: WorkloadProfile
        • vllm_config: VLLMConfig
        • error_message: str | None
        • trace_paths: list[str] # paths to detailed traces if any
        • started_at: datetime
        • finished_at: datetime
    • BottleneckReport

      • A simple summary object that the AI loop could use later, with fields like:
        • primary_bottleneck: Literal["memory", "compute", "communication", "scheduler", "unknown"]
        • secondary_bottlenecks: list[str]
        • notes: str
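
To make the expected shape concrete, here is a partial, non-binding sketch of two of these dataclasses. Field defaults are illustrative only, and names must be reconciled with what inspector/ and feasibility/ actually emit:

from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Literal


@dataclass
class VLLMConfig:
    """Core vLLM engine knobs the tuner searches over."""

    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    expert_parallel_size: int = 1
    data_parallel_size: int = 1
    block_size: int = 16
    max_num_batched_tokens: int = 8192
    max_num_seqs: int = 256
    gpu_memory_utilization: float = 0.90
    scheduling_policy: str = "fcfs"
    # Escape hatch for engine args we have not modeled explicitly.
    extra_engine_args: dict[str, Any] = field(default_factory=dict)


@dataclass
class BottleneckReport:
    """Summary object the AI loop consumes when proposing new configs."""

    primary_bottleneck: Literal[
        "memory", "compute", "communication", "scheduler", "unknown"
    ]
    secondary_bottlenecks: list[str]
    notes: str
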
  2. Extend store/json_store.py to implement a minimal JSON-based store:

    • Provide at least these methods (or similar, well-documented alternatives):

      class JsonStore:
          def __init__(self, root: Path): ...
      
          def create_run_dir(self, run_id: str) -> Path: ...
          def save_run_config(self, cfg: BenchmarkRunConfig) -> None: ...
          def save_run_result(self, result: BenchmarkResult) -> None: ...
          def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None: ...
      
    • The layout on disk should roughly be:

      • root/run_id/config.json
      • root/run_id/metrics.json
      • root/run_id/traces.jsonl (optional)
    • These files should be valid JSON and easy to parse later; a minimal sketch of the store follows.
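
A minimal sketch of the store, assuming dataclasses.asdict-based serialization with datetimes stringified via default=str (an implementation choice, not a requirement):

from __future__ import annotations

import json
from dataclasses import asdict
from pathlib import Path
from typing import Any


class JsonStore:
    """Persist run configs, metrics, and traces under root/<run_id>/."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def create_run_dir(self, run_id: str) -> Path:
        run_dir = self.root / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        return run_dir

    def save_run_config(self, cfg: "BenchmarkRunConfig") -> None:
        run_dir = self.create_run_dir(cfg.run_id)
        # asdict recurses into nested dataclasses; default=str covers datetimes.
        (run_dir / "config.json").write_text(
            json.dumps(asdict(cfg), indent=2, default=str)
        )

    def save_run_result(self, result: "BenchmarkResult") -> None:
        run_dir = self.create_run_dir(result.run_id)
        (run_dir / "metrics.json").write_text(
            json.dumps(asdict(result), indent=2, default=str)
        )

    def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None:
        # traces.jsonl: one JSON object per line, append-only.
        with (self.root / run_id / "traces.jsonl").open("a") as f:
            f.write(json.dumps(record, default=str) + "\n")
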

Part 1: vLLM benchmark runner with profiling

Implement a robust vLLM runner that:

  • Takes a BenchmarkRunConfig, HardwareProfile, and ModelProfile.
  • Runs a benchmark against vLLM.
  • Collects aggregated metrics and (optionally) time-series traces.
  • Persists everything through JsonStore.
  1. In harness/vllm_runner.py implement:

    from .types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
    from store.json_store import JsonStore
    
    def run_vllm_benchmark(
        run_config: BenchmarkRunConfig,
        hw: HardwareProfile,
        model: ModelProfile,
        store: JsonStore,
    ) -> BenchmarkResult:
        """
        Launch a vLLM benchmark according to run_config, collect metrics, and store logs.
    
        This function is allowed to:
        - Use an in-process vLLM Engine OR
        - Start a vLLM HTTP server as a subprocess and send requests to it.
    
        It MUST:
        - Generate a unique run directory via JsonStore.
        - Save config and metrics via JsonStore.
        - Return a BenchmarkResult object populated with aggregated metrics.
        """
    
  2. Integrate existing modules:

  • Use inspector/workload_profiler.py to generate the request stream and WorkloadProfile.
  • Optionally call a helper (in a new module harness/profiler.py) that periodically samples GPU utilization, memory usage, etc., and writes records via JsonStore.append_trace_record (a sketch of such a sampler follows).
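
One possible shape for that helper, assuming nvidia-smi is on PATH (pynvml would work equally well); the class name GpuSampler and its API are illustrative:

from __future__ import annotations

import subprocess
import threading
import time

from store.json_store import JsonStore


class GpuSampler:
    """Periodically record GPU utilization/memory into the run's trace log."""

    def __init__(self, store: JsonStore, run_id: str, interval_s: float = 1.0) -> None:
        self.store = store
        self.run_id = run_id
        self.interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self) -> None:
        while not self._stop.is_set():
            # One CSV row per GPU: "index, utilization.gpu, memory.used".
            out = subprocess.run(
                ["nvidia-smi",
                 "--query-gpu=index,utilization.gpu,memory.used",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True,
            ).stdout
            for line in out.strip().splitlines():
                idx, util, mem = (v.strip() for v in line.split(","))
                self.store.append_trace_record(self.run_id, {
                    "ts": time.time(), "gpu": int(idx),
                    "util_pct": float(util), "mem_used_mib": float(mem),
                })
            self._stop.wait(self.interval_s)

    def __enter__(self) -> "GpuSampler":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> None:
        self._stop.set()
        self._thread.join()

Used as a context manager around the benchmark run, it appends one record per GPU per sampling interval to traces.jsonl.
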
  3. Update scripts/run_vllm_benchmark.py to (a wiring sketch follows this list):
  • Parse CLI arguments (model, workload description, objective, etc.).
  • Instantiate HardwareProfile via inspector/hardware_inspector.py.
  • Instantiate ModelProfile via feasibility/model_meta.py.
  • Instantiate a JsonStore rooted at e.g. ./runs.
  • Build a BenchmarkRunConfig.
  • Call run_vllm_benchmark.
  • Print a concise summary plus the run_id.
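
A wiring sketch for the script. The three profile helpers (inspect_hardware, load_model_meta, profile_workload) are assumed names, not existing APIs; substitute whatever inspector/ and feasibility/ actually expose:

import argparse
import uuid
from pathlib import Path

from config_generator import generate_seed_configs
from feasibility.model_meta import load_model_meta          # assumed helper name
from harness.types import BenchmarkRunConfig
from harness.vllm_runner import run_vllm_benchmark
from inspector.hardware_inspector import inspect_hardware   # assumed helper name
from inspector.workload_profiler import profile_workload    # assumed helper name
from store.json_store import JsonStore


def main() -> None:
    parser = argparse.ArgumentParser(description="Run one vLLM benchmark.")
    parser.add_argument("--model", required=True)
    parser.add_argument("--workload", required=True)
    parser.add_argument("--objective", default="throughput")
    parser.add_argument("--runs-root", type=Path, default=Path("./runs"))
    args = parser.parse_args()

    hw = inspect_hardware()
    model = load_model_meta(args.model)
    workload = profile_workload(args.workload)
    store = JsonStore(args.runs_root)

    # Take the first validated seed config as the one to benchmark.
    cfg = generate_seed_configs(hw, model, workload, args.objective, max_configs=1)[0]
    run_config = BenchmarkRunConfig(run_id=uuid.uuid4().hex[:12], engine="vllm",
                                    vllm_config=cfg, workload=workload,
                                    objective=args.objective)
    result = run_vllm_benchmark(run_config, hw, model, store)
    print(f"run_id={result.run_id} success={result.success} "
          f"metrics={result.aggregated_metrics}")


if __name__ == "__main__":
    main()
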

Part 2: Config generator & validity checking

We already have feasibility/filter_configs.py, memory_model.py, model_meta.py, and topology_rules.py. Now we need to:

  1. Extend feasibility/filter_configs.py to expose a single, well-typed validator:
from harness.types import VLLMConfig, HardwareProfile, ModelProfile, WorkloadProfile

def validate_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile | None = None,
) -> tuple[bool, list[str]]:
    """
    Check whether a VLLMConfig is valid for the given hardware/model/workload.

    Returns:
      (is_valid, reasons_if_invalid)
    """
  • Use memory_model to estimate HBM usage and enforce an upper bound.
  • Use topology_rules to reject obviously bad parallelism combinations.
  • Use simple logical constraints (tp * pp * ep * dp == world_size, etc.); a sketch of these checks follows.
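
A sketch of the structural checks, where estimate_hbm_usage_gb stands in for whatever estimator memory_model actually exposes (its name and signature are assumptions):

from feasibility.memory_model import estimate_hbm_usage_gb  # assumed helper name
from harness.types import HardwareProfile, ModelProfile, VLLMConfig, WorkloadProfile


def validate_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile | None = None,
) -> tuple[bool, list[str]]:
    """Return (is_valid, reasons) for cfg on the given hardware/model."""
    reasons: list[str] = []

    # Parallelism degrees must exactly cover the available GPUs.
    world = (cfg.tensor_parallel_size * cfg.pipeline_parallel_size
             * cfg.expert_parallel_size * cfg.data_parallel_size)
    if world != hw.num_gpus:
        reasons.append(f"tp*pp*ep*dp = {world}, expected world_size {hw.num_gpus}")
    if model.num_heads % cfg.tensor_parallel_size != 0:
        reasons.append("num_heads not divisible by tensor_parallel_size")
    if cfg.expert_parallel_size > 1 and not model.is_moe:
        reasons.append("expert parallelism requested for a non-MoE model")
    if not 0.0 < cfg.gpu_memory_utilization <= 1.0:
        reasons.append("gpu_memory_utilization outside (0, 1]")

    # Memory feasibility: estimated usage must fit the allowed HBM budget.
    est_gb = estimate_hbm_usage_gb(cfg, model)
    budget_gb = hw.hbm_gb * cfg.gpu_memory_utilization
    if est_gb > budget_gb:
        reasons.append(f"estimated HBM {est_gb:.1f} GiB exceeds budget {budget_gb:.1f} GiB")

    return (not reasons, reasons)
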
  2. Implement a minimal "repair" helper in feasibility/filter_configs.py (a sketch follows the signature below):
def repair_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
) -> VLLMConfig:
    """
    Attempt to minimally modify cfg to make it valid.

    Strategies:
    - Adjust gpu_memory_utilization downwards if memory is too tight.
    - Reduce max_num_batched_tokens or max_num_seqs if KV cache is too large.
    - Fix tp/pp/ep/dp product to match world_size.
    - Raise a clear exception if repair is impossible.
    """
  3. Implement config_generator.py as the main entry for generating seed configs:
  • Provide:
def generate_seed_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    max_configs: int = 8,
) -> list[VLLMConfig]:
    """
    Generate a small set of reasonable, VALID seed vLLM configs for the given environment.

    Use:
    - Handcrafted "template families" for dense/MoE/MLA models.
    - Different templates for latency vs throughput oriented objectives.
    - validate_config(...) to filter out invalid configs.
    """
  • Use the validator and repair helper when constructing these configs, as in the sketch below.
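
A sketch of the template-family approach; the knob values in the templates are illustrative placeholders, not tuned recommendations:

from dataclasses import replace

from feasibility.filter_configs import repair_config, validate_config
from harness.types import HardwareProfile, ModelProfile, VLLMConfig, WorkloadProfile


def generate_seed_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    max_configs: int = 8,
) -> list[VLLMConfig]:
    """Build a few validated seed configs from handcrafted templates."""
    base = VLLMConfig(tensor_parallel_size=hw.num_gpus)
    if model.is_moe:
        # MoE family: spread experts across GPUs instead of pure TP.
        base = replace(base, tensor_parallel_size=1,
                       expert_parallel_size=hw.num_gpus)

    if objective == "latency":
        # Smaller batches keep queueing delay and TTFT down.
        variants = [replace(base, max_num_seqs=s, max_num_batched_tokens=t)
                    for s, t in [(32, 2048), (64, 4096)]]
    else:
        # Throughput-oriented: larger batches for better GPU saturation.
        variants = [replace(base, max_num_seqs=s, max_num_batched_tokens=t)
                    for s, t in [(256, 8192), (512, 16384)]]

    seeds: list[VLLMConfig] = []
    for cand in variants:
        ok, _ = validate_config(cand, hw, model, workload)
        if not ok:
            try:
                cand = repair_config(cand, hw, model, workload)
            except ValueError:
                continue  # drop candidates that cannot be repaired
        seeds.append(cand)
    return seeds[:max_configs]
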

Part 3: Unified runner for vidur (simulation) and vLLM (real)

We want a single harness function that can run either vidur or vLLM with the same input types and logging behavior.

  1. In adapter/ add two modules:
  • adapter/vllm_adapter.py
    • Responsible for low-level integration with vLLM (EngineArgs / HTTP service).
    • Should expose simple primitives such as "apply VLLMConfig" and "send requests".
  • adapter/vidur_adapter.py
    • Responsible for integrating with the vidur simulator.
    • Should expose an interface that, given a VLLMConfig and WorkloadProfile, produces metrics comparable to vLLM's.

  Both adapters can be thin for now; they can evolve later.
  2. In harness/runner.py implement:
from harness.types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
from store.json_store import JsonStore

def run_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """
    Dispatch to the appropriate engine runner ("vllm" or "vidur") based on run_config.engine,
    then return a BenchmarkResult.

    The interface and logging semantics should be identical for both engines.
    """
  • Internally call run_vllm_benchmark for "vllm" and run_vidur_simulation for "vidur" (create run_vidur_simulation analogously to run_vllm_benchmark); a dispatch sketch follows.
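
A dispatch sketch; the module path harness/vidur_runner.py for run_vidur_simulation is an assumption:

import logging

from harness.types import BenchmarkResult, BenchmarkRunConfig, HardwareProfile, ModelProfile
from harness.vidur_runner import run_vidur_simulation  # assumed location, to be implemented
from harness.vllm_runner import run_vllm_benchmark
from store.json_store import JsonStore

logger = logging.getLogger(__name__)

# Both runners share the same signature, so dispatch is a simple table lookup.
_RUNNERS = {
    "vllm": run_vllm_benchmark,
    "vidur": run_vidur_simulation,
}


def run_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """Route the run to the engine named in run_config.engine."""
    try:
        runner = _RUNNERS[run_config.engine]
    except KeyError:
        raise ValueError(f"unknown engine: {run_config.engine!r}")
    logger.info("dispatching run %s to engine %s", run_config.run_id, run_config.engine)
    return runner(run_config, hw, model, store)
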
  3. Update or implement scripts/run_workload_benchmarks.py:
  • Accept CLI options like --engine {vllm,vidur}.
  • Loop over a list of configs (seed configs from config_generator) and call run_benchmark for each.
  • Summarize results to stdout.

Part 4: AI-driven iterative search loop

Implement an iterative search loop that calls an external LLM (e.g., OpenAI API) to:

  • Inspect previous configs + metrics.
  • Diagnose bottlenecks.
  • Propose new configs or modifications.
  • Iterate until convergence or a step budget is reached.
  1. In search/ai_loop.py implement:
from harness.types import (
    HardwareProfile,
    ModelProfile,
    WorkloadProfile,
    VLLMConfig,
    BenchmarkRunConfig,
    BenchmarkResult,
)

def ai_search_loop(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    engine: str = "vllm",
    max_steps: int = 20,
) -> list[BenchmarkResult]:
    """
    Core loop:

    - Generate seed configs via config_generator.generate_seed_configs.
    - For each config:
      - Create a BenchmarkRunConfig and run a benchmark via harness.run_benchmark.
      - Store BenchmarkResult objects in a `history` list.

    - For step in range(max_steps):
      - Summarize `history` into a textual prompt for an external LLM.
      - Call an LLM client (you may stub this out, or define a placeholder function) that returns:
        - A bottleneck analysis
        - A small set of new candidate configs (or edits to existing configs)
      - For each candidate:
        - Validate and repair via feasibility.validate_config / repair_config.
        - Run a benchmark and append to `history`.
      - Optionally, ask the LLM (or use simple heuristics) to check for convergence.
    - Return the full history for further analysis.
    """
  • You do NOT need to implement the actual OpenAI API call in detail; you can define a placeholder interface like:
def ask_llm_for_new_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    history: list[BenchmarkResult],
) -> tuple[str, list[VLLMConfig]]:
    """
    Returns (bottleneck_analysis_text, candidate_configs).
    This can be implemented later with a real LLM backend.
    """
  2. Update scripts/run_search.py to:
  • Parse arguments for model, workload, objective, engine, max_steps.
  • Use hardware_inspector / model_meta / workload_profiler to build the profiles.
  • Instantiate JsonStore.
  • Call ai_search_loop.
  • Print the best configuration and its metrics at the end.

Coding style & quality expectations

  • Use clear, descriptive function and variable names.
  • Add type hints to all public functions and dataclasses.
  • Include docstrings that explain the intent, not just restate the name.
  • Use pathlib.Path for filesystem operations.
  • Use logging.getLogger(name) in each module.

Deliverables summary

Implement or update the following modules:

  • harness/types.py (or equivalent)
  • store/json_store.py (extend)
  • harness/vllm_runner.py
  • harness/runner.py
  • adapter/vllm_adapter.py
  • adapter/vidur_adapter.py
  • feasibility/filter_configs.py (extend)
  • config_generator.py (implement seed generation)
  • search/ai_loop.py
  • scripts/run_vllm_benchmark.py (update)
  • scripts/run_workload_benchmarks.py (update)
  • scripts/run_search.py (update)

Focus on clean architecture and clear boundaries between:
  • Config generation & validation (config_generator, feasibility)
  • Execution & profiling (harness, adapter, inspector, store)
  • AI-driven reasoning (search/ai_loop.py)