You are an expert Python systems engineer working on an LLM inference auto-tuning project.

The repository layout currently looks like:

llm_autotune/
├── adapter/
│   └── __init__.py
├── config_generator.py
├── config_space/
│   └── __init__.py
├── feasibility/
│   ├── filter_configs.py
│   ├── __init__.py
│   ├── memory_model.py
│   ├── model_meta.py
│   ├── topology_rules.py
│   └── __pycache__/...
├── harness/
│   └── __init__.py
├── inspector/
│   ├── hardware_inspector.py
│   ├── probe.py
│   ├── workload_profiler.py
│   └── __pycache__/...
├── search/
│   ├── heuristic.py
│   └── __init__.py
├── store/
│   ├── json_store.py
│   └── __init__.py
├── util/
│   └── __init__.py
└── scripts/
    ├── run_inspect.py
    ├── run_search.py
    ├── run_vllm_benchmark.py
    └── run_workload_benchmarks.py

Goal

Implement the core infrastructure for an AI-driven LLM inference config auto-tuner, with four concrete capabilities:

  1. A robust profiling & logging tool that can run vLLM benchmarks and record detailed logs for later bottleneck analysis.
  2. A basic config generator that produces VALID vLLM configs for a given hardware + model + workload.
  3. A unified harness to run vidur (simulation) and vLLM (real system) with the same interface, writing detailed profiling logs.
  4. An AI-driven iterative loop that reads historical results, diagnoses bottlenecks, proposes new configs, and calls the harness repeatedly.

General requirements

  • Python 3.12+ with type hints and dataclasses where appropriate.
  • Keep dependencies minimal; assume we can add small, well-known libs if necessary, but prefer the standard library.
  • Avoid ad-hoc global state; pass objects explicitly.
  • Logging:
    • Use the logging module, not print, for internal logs.
    • User-facing scripts may still print concise summaries to stdout.
  • All public functions should have clear docstrings.

Part 0: Core types & JSON store

  1. Create or extend a module harness/types.py (or harness/__init__.py if you prefer) to define the core data classes (a partial sketch follows this list):

    • HardwareProfile

      • High-level fields like: gpu_type, num_gpus, hbm_gb, nvlink_topology (string or simple struct), cpu_cores, system_memory_gb, etc.
      • This should be compatible with what inspector/hardware_inspector.py can produce.
    • ModelProfile

      • Fields like: model_name, param_count, hidden_size, num_layers, num_heads, is_moe, num_experts, is_mla, max_position_embeddings, etc.
      • This should be compatible with feasibility/model_meta.py.
    • WorkloadProfile

      • Fields like: workload_name, qps, avg_prompt_tokens, p95_prompt_tokens, avg_decode_tokens, p95_decode_tokens, request_type, etc.
      • This should be compatible with inspector/workload_profiler.py.
    • VLLMConfig

      • A structured representation of the core vLLM config knobs we care about:
        • tensor_parallel_size, pipeline_parallel_size, expert_parallel_size, data_parallel_size
        • block_size, max_num_batched_tokens, max_num_seqs
        • gpu_memory_utilization
        • scheduling_policy (string)
        • router/admission knobs if applicable
        • any other important vLLM engine args we need.
    • BenchmarkRunConfig

      • Fields:
        • run_id: str
        • engine: Literal["vllm", "vidur"]
        • vllm_config: VLLMConfig
        • workload: WorkloadProfile
        • objective: str
        • extra: dict[str, Any] | None (for future extensions)
    • BenchmarkResult

      • Fields:
        • run_id: str
        • success: bool
        • aggregated_metrics: dict[str, float] # e.g., {"qps": ..., "p95_latency_ms": ..., "ttft_ms": ...}
        • hw: HardwareProfile
        • model: ModelProfile
        • workload: WorkloadProfile
        • vllm_config: VLLMConfig
        • error_message: str | None
        • trace_paths: list[str] # paths to detailed traces if any
        • started_at: datetime
        • finished_at: datetime
    • BottleneckReport

      • A simple summary object that the AI loop could use later, with fields like:
        • primary_bottleneck: Literal["memory", "compute", "communication", "scheduler", "unknown"]
        • secondary_bottlenecks: list[str]
        • notes: str
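
To make the expected shape concrete, here is a partial, non-binding sketch of two of these dataclasses. Field defaults are illustrative only, and names must be reconciled with what inspector/ and feasibility/ actually emit:

from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Literal


@dataclass
class VLLMConfig:
    """Core vLLM engine knobs the tuner searches over."""

    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    expert_parallel_size: int = 1
    data_parallel_size: int = 1
    block_size: int = 16
    max_num_batched_tokens: int = 8192
    max_num_seqs: int = 256
    gpu_memory_utilization: float = 0.90
    scheduling_policy: str = "fcfs"
    # Escape hatch for engine args we have not modeled explicitly.
    extra_engine_args: dict[str, Any] = field(default_factory=dict)


@dataclass
class BottleneckReport:
    """Summary object the AI loop consumes when proposing new configs."""

    primary_bottleneck: Literal[
        "memory", "compute", "communication", "scheduler", "unknown"
    ]
    secondary_bottlenecks: list[str]
    notes: str
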
  2. Extend store/json_store.py to implement a minimal JSON-based store:

    • Provide at least these methods (or similar, well-documented alternatives):

      class JsonStore:
          def __init__(self, root: Path): ...
      
          def create_run_dir(self, run_id: str) -> Path: ...
          def save_run_config(self, cfg: BenchmarkRunConfig) -> None: ...
          def save_run_result(self, result: BenchmarkResult) -> None: ...
          def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None: ...
      
    • The layout on disk should roughly be:

      • root/run_id/config.json
      • root/run_id/metrics.json
      • root/run_id/traces.jsonl (optional)
    • These files should be valid JSON and easy to parse later; a minimal sketch of the store follows.
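
A minimal sketch of the store, assuming dataclasses.asdict-based serialization with datetimes stringified via default=str (an implementation choice, not a requirement):

from __future__ import annotations

import json
from dataclasses import asdict
from pathlib import Path
from typing import Any


class JsonStore:
    """Persist run configs, metrics, and traces under root/<run_id>/."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def create_run_dir(self, run_id: str) -> Path:
        run_dir = self.root / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        return run_dir

    def save_run_config(self, cfg: "BenchmarkRunConfig") -> None:
        run_dir = self.create_run_dir(cfg.run_id)
        # asdict recurses into nested dataclasses; default=str covers datetimes.
        (run_dir / "config.json").write_text(
            json.dumps(asdict(cfg), indent=2, default=str)
        )

    def save_run_result(self, result: "BenchmarkResult") -> None:
        run_dir = self.create_run_dir(result.run_id)
        (run_dir / "metrics.json").write_text(
            json.dumps(asdict(result), indent=2, default=str)
        )

    def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None:
        # traces.jsonl: one JSON object per line, append-only.
        with (self.root / run_id / "traces.jsonl").open("a") as f:
            f.write(json.dumps(record, default=str) + "\n")
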

Part 1: vLLM benchmark runner with profiling

Implement a robust vLLM runner that:

  • Takes a BenchmarkRunConfig, HardwareProfile, and ModelProfile.
  • Runs a benchmark against vLLM.
  • Collects aggregated metrics and (optionally) time-series traces.
  • Persists everything through JsonStore.
  1. In harness/vllm_runner.py implement:

    from .types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
    from store.json_store import JsonStore
    
    def run_vllm_benchmark(
        run_config: BenchmarkRunConfig,
        hw: HardwareProfile,
        model: ModelProfile,
        store: JsonStore,
    ) -> BenchmarkResult:
        """
        Launch a vLLM benchmark according to run_config, collect metrics, and store logs.
    
        This function is allowed to:
        - Use an in-process vLLM Engine OR
        - Start a vLLM HTTP server as a subprocess and send requests to it.
    
        It MUST:
        - Generate a unique run directory via JsonStore.
        - Save config and metrics via JsonStore.
        - Return a BenchmarkResult object populated with aggregated metrics.
        """
    
  2. Integrate existing modules:

  • Use inspector/workload_profiler.py to generate the request stream and WorkloadProfile.
  • Optionally call a helper (in a new module harness/profiler.py) that periodically samples GPU utilization, memory usage, etc., and writes records via JsonStore.append_trace_record (a sketch of such a sampler follows).
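
One possible shape for that helper, assuming nvidia-smi is on PATH (pynvml would work equally well); the class name GpuSampler and its API are illustrative:

from __future__ import annotations

import subprocess
import threading
import time

from store.json_store import JsonStore


class GpuSampler:
    """Periodically record GPU utilization/memory into the run's trace log."""

    def __init__(self, store: JsonStore, run_id: str, interval_s: float = 1.0) -> None:
        self.store = store
        self.run_id = run_id
        self.interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self) -> None:
        while not self._stop.is_set():
            # One CSV row per GPU: "index, utilization.gpu, memory.used".
            out = subprocess.run(
                ["nvidia-smi",
                 "--query-gpu=index,utilization.gpu,memory.used",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True,
            ).stdout
            for line in out.strip().splitlines():
                idx, util, mem = (v.strip() for v in line.split(","))
                self.store.append_trace_record(self.run_id, {
                    "ts": time.time(), "gpu": int(idx),
                    "util_pct": float(util), "mem_used_mib": float(mem),
                })
            self._stop.wait(self.interval_s)

    def __enter__(self) -> "GpuSampler":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> None:
        self._stop.set()
        self._thread.join()

Used as a context manager around the benchmark run, it appends one record per GPU per sampling interval to traces.jsonl.
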
  3. Update scripts/run_vllm_benchmark.py to (a wiring sketch follows this list):
  • Parse CLI arguments (model, workload description, objective, etc.).
  • Instantiate HardwareProfile via inspector/hardware_inspector.py.
  • Instantiate ModelProfile via feasibility/model_meta.py.
  • Instantiate a JsonStore rooted at e.g. ./runs.
  • Build a BenchmarkRunConfig.
  • Call run_vllm_benchmark.
  • Print a concise summary plus the run_id.
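
A wiring sketch for the script. The three profile helpers (inspect_hardware, load_model_meta, profile_workload) are assumed names, not existing APIs; substitute whatever inspector/ and feasibility/ actually expose:

import argparse
import uuid
from pathlib import Path

from config_generator import generate_seed_configs
from feasibility.model_meta import load_model_meta          # assumed helper name
from harness.types import BenchmarkRunConfig
from harness.vllm_runner import run_vllm_benchmark
from inspector.hardware_inspector import inspect_hardware   # assumed helper name
from inspector.workload_profiler import profile_workload    # assumed helper name
from store.json_store import JsonStore


def main() -> None:
    parser = argparse.ArgumentParser(description="Run one vLLM benchmark.")
    parser.add_argument("--model", required=True)
    parser.add_argument("--workload", required=True)
    parser.add_argument("--objective", default="throughput")
    parser.add_argument("--runs-root", type=Path, default=Path("./runs"))
    args = parser.parse_args()

    hw = inspect_hardware()
    model = load_model_meta(args.model)
    workload = profile_workload(args.workload)
    store = JsonStore(args.runs_root)

    # Take the first validated seed config as the one to benchmark.
    cfg = generate_seed_configs(hw, model, workload, args.objective, max_configs=1)[0]
    run_config = BenchmarkRunConfig(run_id=uuid.uuid4().hex[:12], engine="vllm",
                                    vllm_config=cfg, workload=workload,
                                    objective=args.objective)
    result = run_vllm_benchmark(run_config, hw, model, store)
    print(f"run_id={result.run_id} success={result.success} "
          f"metrics={result.aggregated_metrics}")


if __name__ == "__main__":
    main()
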

Part 2: Config generator & validity checking

We already have feasibility/filter_configs.py, memory_model.py, model_meta.py, and topology_rules.py. Now we need to:

  1. Extend feasibility/filter_configs.py to expose a single, well-typed validator:
from harness.types import VLLMConfig, HardwareProfile, ModelProfile, WorkloadProfile

def validate_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile | None = None,
) -> tuple[bool, list[str]]:
    """
    Check whether a VLLMConfig is valid for the given hardware/model/workload.

    Returns:
      (is_valid, reasons_if_invalid)
    """
  • Use memory_model to estimate HBM usage and enforce an upper bound.
  • Use topology_rules to reject obviously bad parallelism combinations.
  • Use simple logical constraints (tp * pp * ep * dp == world_size, etc.); a sketch of these checks follows.
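
A sketch of the structural checks, where estimate_hbm_usage_gb stands in for whatever estimator memory_model actually exposes (its name and signature are assumptions):

from feasibility.memory_model import estimate_hbm_usage_gb  # assumed helper name
from harness.types import HardwareProfile, ModelProfile, VLLMConfig, WorkloadProfile


def validate_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile | None = None,
) -> tuple[bool, list[str]]:
    """Return (is_valid, reasons) for cfg on the given hardware/model."""
    reasons: list[str] = []

    # Parallelism degrees must exactly cover the available GPUs.
    world = (cfg.tensor_parallel_size * cfg.pipeline_parallel_size
             * cfg.expert_parallel_size * cfg.data_parallel_size)
    if world != hw.num_gpus:
        reasons.append(f"tp*pp*ep*dp = {world}, expected world_size {hw.num_gpus}")
    if model.num_heads % cfg.tensor_parallel_size != 0:
        reasons.append("num_heads not divisible by tensor_parallel_size")
    if cfg.expert_parallel_size > 1 and not model.is_moe:
        reasons.append("expert parallelism requested for a non-MoE model")
    if not 0.0 < cfg.gpu_memory_utilization <= 1.0:
        reasons.append("gpu_memory_utilization outside (0, 1]")

    # Memory feasibility: estimated usage must fit the allowed HBM budget.
    est_gb = estimate_hbm_usage_gb(cfg, model)
    budget_gb = hw.hbm_gb * cfg.gpu_memory_utilization
    if est_gb > budget_gb:
        reasons.append(f"estimated HBM {est_gb:.1f} GiB exceeds budget {budget_gb:.1f} GiB")

    return (not reasons, reasons)
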
  2. Implement a minimal "repair" helper in feasibility/filter_configs.py (a sketch follows the signature below):
def repair_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
) -> VLLMConfig:
    """
    Attempt to minimally modify cfg to make it valid.

    Strategies:
    - Adjust gpu_memory_utilization downwards if memory is too tight.
    - Reduce max_num_batched_tokens or max_num_seqs if KV cache is too large.
    - Fix tp/pp/ep/dp product to match world_size.
    - Raise a clear exception if repair is impossible.
    """
  3. Implement config_generator.py as the main entry for generating seed configs:
  • Provide:
def generate_seed_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    max_configs: int = 8,
) -> list[VLLMConfig]:
    """
    Generate a small set of reasonable, VALID seed vLLM configs for the given environment.

    Use:
    - Handcrafted "template families" for dense/MoE/MLA models.
    - Different templates for latency vs throughput oriented objectives.
    - validate_config(...) to filter out invalid configs.
    """
  • Use the validator and repair helper when constructing these configs, as in the sketch below.
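
A sketch of the template-family approach; the knob values in the templates are illustrative placeholders, not tuned recommendations:

from dataclasses import replace

from feasibility.filter_configs import repair_config, validate_config
from harness.types import HardwareProfile, ModelProfile, VLLMConfig, WorkloadProfile


def generate_seed_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    max_configs: int = 8,
) -> list[VLLMConfig]:
    """Build a few validated seed configs from handcrafted templates."""
    base = VLLMConfig(tensor_parallel_size=hw.num_gpus)
    if model.is_moe:
        # MoE family: spread experts across GPUs instead of pure TP.
        base = replace(base, tensor_parallel_size=1,
                       expert_parallel_size=hw.num_gpus)

    if objective == "latency":
        # Smaller batches keep queueing delay and TTFT down.
        variants = [replace(base, max_num_seqs=s, max_num_batched_tokens=t)
                    for s, t in [(32, 2048), (64, 4096)]]
    else:
        # Throughput-oriented: larger batches for better GPU saturation.
        variants = [replace(base, max_num_seqs=s, max_num_batched_tokens=t)
                    for s, t in [(256, 8192), (512, 16384)]]

    seeds: list[VLLMConfig] = []
    for cand in variants:
        ok, _ = validate_config(cand, hw, model, workload)
        if not ok:
            try:
                cand = repair_config(cand, hw, model, workload)
            except ValueError:
                continue  # drop candidates that cannot be repaired
        seeds.append(cand)
    return seeds[:max_configs]
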

Part 3: Unified runner for vidur (simulation) and vLLM (real)

We want a single harness function that can run either vidur or vLLM with the same input types and logging behavior.

  1. In adapter/ add two modules:
  • adapter/vllm_adapter.py
    • Responsible for low-level integration with vLLM (EngineArgs / HTTP service).
    • Should expose simple primitives such as "apply VLLMConfig" and "send requests".
  • adapter/vidur_adapter.py
    • Responsible for integrating with the vidur simulator.
    • Should expose an interface that, given a VLLMConfig and WorkloadProfile, produces metrics comparable to vLLM's.

  Both adapters can be thin for now; they can evolve later.
  2. In harness/runner.py implement:
from harness.types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
from store.json_store import JsonStore

def run_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """
    Dispatch to the appropriate engine runner ("vllm" or "vidur") based on run_config.engine,
    then return a BenchmarkResult.

    The interface and logging semantics should be identical for both engines.
    """
  • Internally call run_vllm_benchmark for "vllm" and run_vidur_simulation for "vidur" (create run_vidur_simulation analogously to run_vllm_benchmark); a dispatch sketch follows.
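
A dispatch sketch; the module path harness/vidur_runner.py for run_vidur_simulation is an assumption:

import logging

from harness.types import BenchmarkResult, BenchmarkRunConfig, HardwareProfile, ModelProfile
from harness.vidur_runner import run_vidur_simulation  # assumed location, to be implemented
from harness.vllm_runner import run_vllm_benchmark
from store.json_store import JsonStore

logger = logging.getLogger(__name__)

# Both runners share the same signature, so dispatch is a simple table lookup.
_RUNNERS = {
    "vllm": run_vllm_benchmark,
    "vidur": run_vidur_simulation,
}


def run_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """Route the run to the engine named in run_config.engine."""
    try:
        runner = _RUNNERS[run_config.engine]
    except KeyError:
        raise ValueError(f"unknown engine: {run_config.engine!r}")
    logger.info("dispatching run %s to engine %s", run_config.run_id, run_config.engine)
    return runner(run_config, hw, model, store)
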
  3. Update or implement scripts/run_workload_benchmarks.py:
  • Accept CLI options like --engine {vllm,vidur}.
  • Loop over a list of configs (seed configs from config_generator) and call run_benchmark for each.
  • Summarize results to stdout.

Part 4: AI-driven iterative search loop

Implement an iterative search loop that calls an external LLM (e.g., OpenAI API) to:

  • Inspect previous configs + metrics.
  • Diagnose bottlenecks.
  • Propose new configs or modifications.
  • Iterate until convergence or a step budget is reached.
  1. In search/ai_loop.py implement:
from harness.types import (
    HardwareProfile,
    ModelProfile,
    WorkloadProfile,
    VLLMConfig,
    BenchmarkRunConfig,
    BenchmarkResult,
)

def ai_search_loop(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    engine: str = "vllm",
    max_steps: int = 20,
) -> list[BenchmarkResult]:
    """
    Core loop:

    - Generate seed configs via config_generator.generate_seed_configs.
    - For each config:
      - Create a BenchmarkRunConfig and run a benchmark via harness.run_benchmark.
      - Store BenchmarkResult objects in a `history` list.

    - For step in range(max_steps):
      - Summarize `history` into a textual prompt for an external LLM.
      - Call an LLM client (you may stub this out, or define a placeholder function) that returns:
        - A bottleneck analysis
        - A small set of new candidate configs (or edits to existing configs)
      - For each candidate:
        - Validate and repair via feasibility.validate_config / repair_config.
        - Run a benchmark and append to `history`.
      - Optionally, ask the LLM (or use simple heuristics) to check for convergence.
    - Return the full history for further analysis.
    """
  • You do NOT need to implement the actual OpenAI API call in detail; you can define a placeholder interface like:
def ask_llm_for_new_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    history: list[BenchmarkResult],
) -> tuple[str, list[VLLMConfig]]:
    """
    Returns (bottleneck_analysis_text, candidate_configs).
    This can be implemented later with a real LLM backend.
    """
  2. Update scripts/run_search.py to:
  • Parse arguments for model, workload, objective, engine, max_steps.
  • Use hardware_inspector / model_meta / workload_profiler to build the profiles.
  • Instantiate JsonStore.
  • Call ai_search_loop.
  • Print the best configuration and its metrics at the end.

Coding style & quality expectations

  • Use clear, descriptive function and variable names.
  • Add type hints to all public functions and dataclasses.
  • Include docstrings that explain the intent, not just restate the name.
  • Use pathlib.Path for filesystem operations.
  • Use logging.getLogger(name) in each module.

Deliverables summary

Implement or update the following modules:

  • harness/types.py (or equivalent)
  • store/json_store.py (extend)
  • harness/vllm_runner.py
  • harness/runner.py
  • adapter/vllm_adapter.py
  • adapter/vidur_adapter.py
  • feasibility/filter_configs.py (extend)
  • config_generator.py (implement seed generation)
  • search/ai_loop.py
  • scripts/run_vllm_benchmark.py (update)
  • scripts/run_workload_benchmarks.py (update)
  • scripts/run_search.py (update)

Focus on clean architecture and clear boundaries between:
  • Config generation & validation (config_generator, feasibility)
  • Execution & profiling (harness, adapter, inspector, store)
  • AI-driven reasoning (search/ai_loop.py)