You are an expert Python systems engineer working on an LLM inference auto-tuning project. The repository layout currently looks like:

```
llm_autotune/
├── adapter/
│   └── __init__.py
├── config_generator.py
├── config_space/
│   └── __init__.py
├── feasibility/
│   ├── filter_configs.py
│   ├── __init__.py
│   ├── memory_model.py
│   ├── model_meta.py
│   ├── topology_rules.py
│   └── __pycache__/...
├── harness/
│   └── __init__.py
├── inspector/
│   ├── hardware_inspector.py
│   ├── probe.py
│   ├── workload_profiler.py
│   └── __pycache__/...
├── search/
│   ├── heuristic.py
│   └── __init__.py
├── store/
│   ├── json_store.py
│   └── __init__.py
├── util/
│   └── __init__.py
└── scripts/
    ├── run_inspect.py
    ├── run_search.py
    ├── run_vllm_benchmark.py
    └── run_workload_benchmarks.py
```

# Goal

Implement the core infrastructure for an AI-driven LLM inference config auto-tuner, with four concrete capabilities:

1. A robust profiling & logging tool that can run vLLM benchmarks and record detailed logs for later bottleneck analysis.
2. A basic config generator that produces VALID vLLM configs for a given hardware + model + workload.
3. A unified harness to run vidur (simulation) and vLLM (real system) with the same interface, writing detailed profiling logs.
4. An AI-driven iterative loop that reads historical results, diagnoses bottlenecks, proposes new configs, and calls the harness repeatedly.

# General requirements

- Python 3.12+ with type hints and dataclasses where appropriate.
- Keep dependencies minimal; assume we can add small, well-known libs if necessary, but prefer the standard library.
- Avoid ad-hoc global state; pass objects explicitly.
- Logging:
  - Use the `logging` module, not `print`, for internal logs.
  - User-facing scripts may still print concise summaries to stdout.
- All public functions should have clear docstrings.

# Part 0: Core types & JSON store

1. Create or extend a module `harness/types.py` (or `harness/__init__.py` if you prefer) to define the core data classes (see the sketch at the end of this item):
   - `HardwareProfile`
     - High-level fields like: gpu_type, num_gpus, hbm_gb, nvlink_topology (string or simple struct), cpu_cores, system_memory_gb, etc.
     - This should be compatible with what `inspector/hardware_inspector.py` can produce.
   - `ModelProfile`
     - Fields like: model_name, param_count, hidden_size, num_layers, num_heads, is_moe, num_experts, is_mla, max_position_embeddings, etc.
     - This should be compatible with `feasibility/model_meta.py`.
   - `WorkloadProfile`
     - Fields like: workload_name, qps, avg_prompt_tokens, p95_prompt_tokens, avg_decode_tokens, p95_decode_tokens, request_type, etc.
     - This should be compatible with `inspector/workload_profiler.py`.
   - `VLLMConfig`
     - A structured representation of the core vLLM config knobs we care about:
       - tensor_parallel_size, pipeline_parallel_size, expert_parallel_size, data_parallel_size
       - block_size, max_num_batched_tokens, max_num_seqs
       - gpu_memory_utilization
       - scheduling_policy (string)
       - router/admission knobs if applicable
       - any other important vLLM engine args we need.
   - `BenchmarkRunConfig`
     - Fields:
       - run_id: str
       - engine: Literal["vllm", "vidur"]
       - vllm_config: VLLMConfig
       - workload: WorkloadProfile
       - objective: str
       - extra: dict[str, Any] | None (for future extensions)
   - `BenchmarkResult`
     - Fields:
       - run_id: str
       - success: bool
       - aggregated_metrics: dict[str, float]  # e.g., {"qps": ..., "p95_latency_ms": ..., "ttft_ms": ...}
       - hw: HardwareProfile
       - model: ModelProfile
       - workload: WorkloadProfile
       - vllm_config: VLLMConfig
       - error_message: str | None
       - trace_paths: list[str]  # paths to detailed traces if any
       - started_at: datetime
       - finished_at: datetime
   - `BottleneckReport`
     - A simple summary object that the AI loop could use later, with fields like:
       - primary_bottleneck: Literal["memory", "compute", "communication", "scheduler", "unknown"]
       - secondary_bottlenecks: list[str]
       - notes: str
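
   To make the shape of these types concrete, here is a rough, non-binding sketch of three of them. Field names follow the lists above; the defaults, the trimmed field sets, and the `extra_engine_args` catch-all are placeholders to be adapted to what the existing inspector/feasibility modules already produce:

   ```python
   from dataclasses import dataclass, field
   from typing import Any, Literal


   @dataclass
   class WorkloadProfile:
       """Trimmed to a few representative fields; see the full list above."""
       workload_name: str
       qps: float
       avg_prompt_tokens: int
       avg_decode_tokens: int


   @dataclass
   class VLLMConfig:
       """Core vLLM engine knobs the tuner may change (defaults are placeholders)."""
       tensor_parallel_size: int = 1
       pipeline_parallel_size: int = 1
       expert_parallel_size: int = 1
       data_parallel_size: int = 1
       block_size: int = 16
       max_num_batched_tokens: int = 8192
       max_num_seqs: int = 256
       gpu_memory_utilization: float = 0.90
       scheduling_policy: str = "fcfs"
       extra_engine_args: dict[str, Any] = field(default_factory=dict)


   @dataclass
   class BenchmarkRunConfig:
       """Everything needed to reproduce one benchmark run."""
       run_id: str
       engine: Literal["vllm", "vidur"]
       vllm_config: VLLMConfig
       workload: WorkloadProfile
       objective: str
       extra: dict[str, Any] | None = None
   ```

   Because these are plain dataclasses, the store in item 2 can serialize them with `dataclasses.asdict` plus `json.dumps(..., default=str)` and load them back with `json.loads`, which keeps the on-disk files easy to parse.
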
2. Extend `store/json_store.py` to implement a minimal JSON-based store:
   - Provide at least these methods (or similar, well-documented alternatives):

     ```python
     class JsonStore:
         def __init__(self, root: Path): ...
         def create_run_dir(self, run_id: str) -> Path: ...
         def save_run_config(self, cfg: BenchmarkRunConfig) -> None: ...
         def save_run_result(self, result: BenchmarkResult) -> None: ...
         def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None: ...
     ```

   - The layout on disk should roughly be:
     - `root/run_id/config.json`
     - `root/run_id/metrics.json`
     - `root/run_id/traces.jsonl` (optional)
   - These files should be valid JSON and easy to parse later.

# Part 1: vLLM benchmark runner with profiling

Implement a robust vLLM runner that:

- Takes a `BenchmarkRunConfig`, `HardwareProfile`, and `ModelProfile`.
- Runs a benchmark against vLLM.
- Collects aggregated metrics and (optionally) time-series traces.
- Persists everything through `JsonStore`.

1. In `harness/vllm_runner.py` implement:

   ```python
   from .types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
   from store.json_store import JsonStore

   def run_vllm_benchmark(
       run_config: BenchmarkRunConfig,
       hw: HardwareProfile,
       model: ModelProfile,
       store: JsonStore,
   ) -> BenchmarkResult:
       """
       Launch a vLLM benchmark according to run_config, collect metrics, and store logs.

       This function is allowed to:
       - Use an in-process vLLM Engine OR
       - Start a vLLM HTTP server as a subprocess and send requests to it.

       It MUST:
       - Generate a unique run directory via JsonStore.
       - Save config and metrics via JsonStore.
       - Return a BenchmarkResult object populated with aggregated metrics.
       """
   ```

2. Integrate existing modules:
   - Use `inspector/workload_profiler.py` to generate the request stream and WorkloadProfile.
   - Optionally call a helper (in a new module `harness/profiler.py`) that periodically samples GPU utilization, memory usage, etc., and writes records via `JsonStore.append_trace_record` (see the sampler sketch at the end of this part).
3. Update `scripts/run_vllm_benchmark.py` to:
   - Parse CLI arguments (model, workload description, objective, etc.).
   - Instantiate HardwareProfile via `inspector/hardware_inspector.py`.
   - Instantiate ModelProfile via `feasibility/model_meta.py`.
   - Instantiate a JsonStore rooted at, e.g., `./runs`.
   - Build a BenchmarkRunConfig.
   - Call run_vllm_benchmark.
   - Print a concise summary plus the run_id.
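
As one possible shape for the optional `harness/profiler.py` helper mentioned above, the sketch below runs a small background sampler. It assumes `nvidia-smi` is on PATH; the `GpuSampler` name and the record layout are illustrative rather than prescribed, and `store` is the `JsonStore` from Part 0:

```python
import logging
import subprocess
import threading
import time
from typing import Any

from store.json_store import JsonStore

logger = logging.getLogger(__name__)


class GpuSampler:
    """Periodically sample GPU utilization/memory and append JSONL trace records."""

    def __init__(self, store: JsonStore, run_id: str, interval_s: float = 1.0) -> None:
        self._store = store
        self._run_id = run_id
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _sample_once(self) -> list[dict[str, Any]]:
        # nvidia-smi prints one CSV row per GPU: "<util %>, <memory used MiB>".
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        samples: list[dict[str, Any]] = []
        for gpu_index, line in enumerate(out.strip().splitlines()):
            util, mem = (float(x) for x in line.split(","))
            samples.append({"gpu": gpu_index, "util_pct": util, "mem_used_mib": mem})
        return samples

    def _loop(self) -> None:
        while not self._stop.is_set():
            try:
                self._store.append_trace_record(
                    self._run_id, {"ts": time.time(), "gpus": self._sample_once()}
                )
            except (subprocess.CalledProcessError, FileNotFoundError) as exc:
                logger.warning("GPU sampling failed: %s", exc)
            self._stop.wait(self._interval_s)

    def __enter__(self) -> "GpuSampler":
        self._thread.start()
        return self

    def __exit__(self, *exc_info: object) -> None:
        self._stop.set()
        self._thread.join(timeout=5.0)
```

Wrapping the request-sending phase of `run_vllm_benchmark` in `with GpuSampler(store, run_config.run_id):` would then be enough to produce the `traces.jsonl` records described in Part 0.
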
# Part 2: Config generator & validity checking

We already have `feasibility/filter_configs.py`, `memory_model.py`, `model_meta.py`, and `topology_rules.py`. Now we need to:

1. Extend `feasibility/filter_configs.py` to expose a single, well-typed validator:

   ```python
   from harness.types import VLLMConfig, HardwareProfile, ModelProfile, WorkloadProfile

   def validate_config(
       cfg: VLLMConfig,
       hw: HardwareProfile,
       model: ModelProfile,
       workload: WorkloadProfile | None = None,
   ) -> tuple[bool, list[str]]:
       """
       Check whether a VLLMConfig is valid for the given hardware/model/workload.

       Returns:
           (is_valid, reasons_if_invalid)
       """
   ```

   - Use memory_model to estimate HBM usage and enforce an upper bound.
   - Use topology_rules to reject obviously bad parallelism combinations.
   - Use simple logical constraints (tp * pp * ep * dp == world_size, etc.); a structural sketch is included after Part 3.
2. Implement a minimal "repair" helper in `feasibility/filter_configs.py`:

   ```python
   def repair_config(
       cfg: VLLMConfig,
       hw: HardwareProfile,
       model: ModelProfile,
       workload: WorkloadProfile,
   ) -> VLLMConfig:
       """
       Attempt to minimally modify cfg to make it valid.

       Strategies:
       - Adjust gpu_memory_utilization downwards if memory is too tight.
       - Reduce max_num_batched_tokens or max_num_seqs if KV cache is too large.
       - Fix tp/pp/ep/dp product to match world_size.
       - Raise a clear exception if repair is impossible.
       """
   ```

3. Implement `config_generator.py` as the main entry for generating seed configs:
   - Provide:

     ```python
     def generate_seed_configs(
         hw: HardwareProfile,
         model: ModelProfile,
         workload: WorkloadProfile,
         objective: str,
         max_configs: int = 8,
     ) -> list[VLLMConfig]:
         """
         Generate a small set of reasonable, VALID seed vLLM configs for the given environment.

         Use:
         - Handcrafted "template families" for dense/MoE/MLA models.
         - Different templates for latency vs throughput oriented objectives.
         - validate_config(...) to filter out invalid configs.
         """
     ```

   - Use the validator and repair helper when constructing these configs.

# Part 3: Unified runner for vidur (simulation) and vLLM (real)

We want a single harness function that can run either vidur or vLLM with the same input types and logging behavior.

1. In `adapter/` add two modules:
   - `adapter/vllm_adapter.py`
     - Responsible for low-level integration with vLLM (EngineArgs / HTTP service).
     - Should expose simple primitives such as "apply VLLMConfig" and "send requests".
   - `adapter/vidur_adapter.py`
     - Responsible for integrating with the vidur simulator.
     - Should expose an interface that, given a VLLMConfig and WorkloadProfile, produces metrics comparable to vLLM.

   Both adapters can be thin now; they can evolve later.
2. In `harness/runner.py` implement:

   ```python
   from harness.types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
   from store.json_store import JsonStore

   def run_benchmark(
       run_config: BenchmarkRunConfig,
       hw: HardwareProfile,
       model: ModelProfile,
       store: JsonStore,
   ) -> BenchmarkResult:
       """
       Dispatch to the appropriate engine runner ("vllm" or "vidur") based on
       run_config.engine, then return a BenchmarkResult.

       The interface and logging semantics should be identical for both engines.
       """
   ```

   - Internally call run_vllm_benchmark for "vllm" and run_vidur_simulation for "vidur" (you should create run_vidur_simulation similar to run_vllm_benchmark).
3. Update / implement `scripts/run_workload_benchmarks.py`:
   - Accept CLI options like `--engine {vllm,vidur}`.
   - Loop over a list of configs (seed configs from config_generator) and call run_benchmark for each.
   - Summarize results to stdout.
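
To make the "simple logical constraints" bullet in Part 2 concrete (placed here so the lists above stay intact), the structural portion of the validator might look like the sketch below. The helper name is illustrative, the divisibility rules are typical examples rather than a mandated list, the world-size product follows the tp * pp * ep * dp convention stated above, and the memory check is deliberately left to whatever `memory_model` exposes:

```python
from harness.types import HardwareProfile, ModelProfile, VLLMConfig


def check_structural_constraints(
    cfg: VLLMConfig, hw: HardwareProfile, model: ModelProfile
) -> list[str]:
    """Return human-readable reasons why cfg is structurally invalid (empty list = ok)."""
    reasons: list[str] = []

    world_size = (
        cfg.tensor_parallel_size
        * cfg.pipeline_parallel_size
        * cfg.expert_parallel_size
        * cfg.data_parallel_size
    )
    if world_size != hw.num_gpus:
        reasons.append(f"tp*pp*ep*dp = {world_size}, but num_gpus = {hw.num_gpus}")

    # Typical divisibility rules; the real checks should defer to topology_rules.py.
    if model.num_heads % cfg.tensor_parallel_size != 0:
        reasons.append("num_heads is not divisible by tensor_parallel_size")
    if model.num_layers % cfg.pipeline_parallel_size != 0:
        reasons.append("num_layers is not divisible by pipeline_parallel_size")
    if model.is_moe and model.num_experts % cfg.expert_parallel_size != 0:
        reasons.append("num_experts is not divisible by expert_parallel_size")

    if not 0.0 < cfg.gpu_memory_utilization <= 1.0:
        reasons.append("gpu_memory_utilization must be in (0, 1]")

    return reasons
```

`validate_config` would then combine these reasons with the `memory_model` estimate and any `topology_rules` findings and return `(len(reasons) == 0, reasons)`.
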
# Part 4: AI-driven iterative search loop

Implement an iterative search loop that calls an external LLM (e.g., OpenAI API) to:

- Inspect previous configs + metrics.
- Diagnose bottlenecks.
- Propose new configs or modifications.
- Iterate until convergence or a step budget is reached.

1. In `search/ai_loop.py` implement:

   ```python
   from harness.types import (
       HardwareProfile,
       ModelProfile,
       WorkloadProfile,
       VLLMConfig,
       BenchmarkRunConfig,
       BenchmarkResult,
   )

   def ai_search_loop(
       hw: HardwareProfile,
       model: ModelProfile,
       workload: WorkloadProfile,
       objective: str,
       engine: str = "vllm",
       max_steps: int = 20,
   ) -> list[BenchmarkResult]:
       """
       Core loop:
       - Generate seed configs via config_generator.generate_seed_configs.
       - For each config:
         - Create a BenchmarkRunConfig and run a benchmark via harness.run_benchmark.
         - Store BenchmarkResult objects in a `history` list.
       - For step in range(max_steps):
         - Summarize `history` into a textual prompt for an external LLM.
         - Call an LLM client (you may stub this out, or define a placeholder function) that returns:
           - A bottleneck analysis
           - A small set of new candidate configs (or edits to existing configs)
         - For each candidate:
           - Validate and repair via feasibility.validate_config / repair_config.
           - Run a benchmark and append to `history`.
         - Optionally, ask the LLM (or use simple heuristics) to check for convergence.
       - Return the full history for further analysis.
       """
   ```

   - You do NOT need to implement the actual OpenAI API call in detail; you can define a placeholder interface like:

     ```python
     def ask_llm_for_new_configs(
         hw: HardwareProfile,
         model: ModelProfile,
         workload: WorkloadProfile,
         objective: str,
         history: list[BenchmarkResult],
     ) -> tuple[str, list[VLLMConfig]]:
         """
         Returns (bottleneck_analysis_text, candidate_configs).
         This can be implemented later with a real LLM backend.
         """
     ```

2. Update `scripts/run_search.py` to:
   - Parse arguments for model, workload, objective, engine, max_steps.
   - Use hardware_inspector / model_meta / workload_profiler to build the profiles.
   - Instantiate JsonStore.
   - Call ai_search_loop.
   - Print the best configuration and its metrics at the end.

A condensed, non-binding sketch of this control flow is included at the end of this document.

# Coding style & quality expectations

- Use clear, descriptive function and variable names.
- Add type hints to all public functions and dataclasses.
- Include docstrings that explain the intent, not just restate the name.
- Use pathlib.Path for filesystem operations.
- Use logging.getLogger(__name__) in each module.

# Deliverables summary

Implement or update the following modules:

- harness/types.py (or equivalent)
- store/json_store.py (extend)
- harness/vllm_runner.py
- harness/runner.py
- adapter/vllm_adapter.py
- adapter/vidur_adapter.py
- feasibility/filter_configs.py (extend)
- config_generator.py (implement seed generation)
- search/ai_loop.py
- scripts/run_vllm_benchmark.py (update)
- scripts/run_workload_benchmarks.py (update)
- scripts/run_search.py (update)

Focus on clean architecture and clear boundaries between:

- Config generation & validation (config_generator, feasibility)
- Execution & profiling (harness, adapter, inspector, store)
- AI-driven reasoning (search/ai_loop.py)
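
For reference, here is a condensed sketch of the Part 4 control flow, with logging, error handling, and convergence checks omitted. The `_run_one` helper, the `ai_search_loop_sketch` name, and the explicit `store` parameter (not present in the spec's `ai_search_loop` signature above) are illustrative additions; `ask_llm_for_new_configs` is the placeholder interface from Part 4:

```python
import uuid

from config_generator import generate_seed_configs
from feasibility.filter_configs import repair_config, validate_config
from harness.runner import run_benchmark
from harness.types import (
    BenchmarkResult,
    BenchmarkRunConfig,
    HardwareProfile,
    ModelProfile,
    VLLMConfig,
    WorkloadProfile,
)
from search.ai_loop import ask_llm_for_new_configs  # Part 4 placeholder LLM interface
from store.json_store import JsonStore


def _run_one(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    engine: str,
    store: JsonStore,
) -> BenchmarkResult:
    """Wrap one config into a run, execute it, and persist config + result."""
    run_config = BenchmarkRunConfig(
        run_id=uuid.uuid4().hex[:12],
        engine=engine,
        vllm_config=cfg,
        workload=workload,
        objective=objective,
        extra=None,
    )
    store.save_run_config(run_config)
    return run_benchmark(run_config, hw, model, store)


def ai_search_loop_sketch(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    store: JsonStore,
    engine: str = "vllm",
    max_steps: int = 20,
) -> list[BenchmarkResult]:
    history: list[BenchmarkResult] = []
    # Seed phase: benchmark every seed config from the generator.
    for cfg in generate_seed_configs(hw, model, workload, objective):
        history.append(_run_one(cfg, hw, model, workload, objective, engine, store))
    # Refinement phase: ask the LLM for candidates, validate/repair, benchmark, repeat.
    for _ in range(max_steps):
        _analysis, candidates = ask_llm_for_new_configs(hw, model, workload, objective, history)
        for cand in candidates:
            ok, _reasons = validate_config(cand, hw, model, workload)
            if not ok:
                cand = repair_config(cand, hw, model, workload)
            history.append(_run_one(cand, hw, model, workload, objective, engine, store))
    return history
```
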