You are an expert Python systems engineer working on an LLM inference auto-tuning project.
The repository layout currently looks like:
llm_autotune/
├── adapter/
│   └── __init__.py
├── config_generator.py
├── config_space/
│   └── __init__.py
├── feasibility/
│   ├── filter_configs.py
│   ├── __init__.py
│   ├── memory_model.py
│   ├── model_meta.py
│   ├── topology_rules.py
│   └── __pycache__/...
├── harness/
│   └── __init__.py
├── inspector/
│   ├── hardware_inspector.py
│   ├── probe.py
│   ├── workload_profiler.py
│   └── __pycache__/...
├── search/
│   ├── heuristic.py
│   └── __init__.py
├── store/
│   ├── json_store.py
│   └── __init__.py
├── util/
│   └── __init__.py
└── scripts/
    ├── run_inspect.py
    ├── run_search.py
    ├── run_vllm_benchmark.py
    └── run_workload_benchmarks.py
Goal
Implement the core infrastructure for an AI-driven LLM inference config auto-tuner, with four concrete capabilities:
- A robust profiling & logging tool that can run vLLM benchmarks and record detailed logs for later bottleneck analysis.
- A basic config generator that produces VALID vLLM configs for a given hardware + model + workload.
- A unified harness to run vidur (simulation) and vLLM (real system) with the same interface, writing detailed profiling logs.
- An AI-driven iterative loop that reads historical results, diagnoses bottlenecks, proposes new configs, and calls the harness repeatedly.
General requirements
- Python 3.12+ with type hints and dataclasses where appropriate.
- Keep dependencies minimal; assume we can add small, well-known libs if necessary, but prefer the standard library.
- Avoid ad-hoc global state; pass objects explicitly.
- Logging:
  - Use the `logging` module, not `print`, for internal logs (a short example follows this list).
  - User-facing scripts may still print concise summaries to stdout.
- All public functions should have clear docstrings.
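As a small illustration of the logging convention (the handler configuration shown here is one possible choice, not a requirement):

```python
import logging

# Each module creates its own named logger; only entry-point scripts configure handlers.
logger = logging.getLogger(__name__)

def main() -> None:
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    )
    logger.info("internal diagnostic detail")  # internal log, routed through logging
    print("concise user-facing summary")       # allowed only in scripts

if __name__ == "__main__":
    main()
```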
Part 0: Core types & JSON store
- Create or extend a module `harness/types.py` (or `harness/__init__.py` if you prefer) to define the core data classes:
  - `HardwareProfile`
    - High-level fields like: gpu_type, num_gpus, hbm_gb, nvlink_topology (string or simple struct), cpu_cores, system_memory_gb, etc.
    - This should be compatible with what `inspector/hardware_inspector.py` can produce.
  - `ModelProfile`
    - Fields like: model_name, param_count, hidden_size, num_layers, num_heads, is_moe, num_experts, is_mla, max_position_embeddings, etc.
    - This should be compatible with `feasibility/model_meta.py`.
  - `WorkloadProfile`
    - Fields like: workload_name, qps, avg_prompt_tokens, p95_prompt_tokens, avg_decode_tokens, p95_decode_tokens, request_type, etc.
    - This should be compatible with `inspector/workload_profiler.py`.
  - `VLLMConfig`
    - A structured representation of the core vLLM config knobs we care about:
      - tensor_parallel_size, pipeline_parallel_size, expert_parallel_size, data_parallel_size
      - block_size, max_num_batched_tokens, max_num_seqs
      - gpu_memory_utilization
      - scheduling_policy (string)
      - router/admission knobs if applicable
      - any other important vLLM engine args we need.
  - `BenchmarkRunConfig`
    - Fields:
      - run_id: str
      - engine: Literal["vllm", "vidur"]
      - vllm_config: VLLMConfig
      - workload: WorkloadProfile
      - objective: str
      - extra: dict[str, Any] | None (for future extensions)
  - `BenchmarkResult`
    - Fields:
      - run_id: str
      - success: bool
      - aggregated_metrics: dict[str, float]  # e.g., {"qps": ..., "p95_latency_ms": ..., "ttft_ms": ...}
      - hw: HardwareProfile
      - model: ModelProfile
      - workload: WorkloadProfile
      - vllm_config: VLLMConfig
      - error_message: str | None
      - trace_paths: list[str]  # paths to detailed traces if any
      - started_at: datetime
      - finished_at: datetime
  - `BottleneckReport`
    - A simple summary object that the AI loop could use later, with fields like:
      - primary_bottleneck: Literal["memory", "compute", "communication", "scheduler", "unknown"]
      - secondary_bottlenecks: list[str]
      - notes: str
- Extend `store/json_store.py` to implement a minimal JSON-based store:
  - Provide at least these methods (or similar, well-documented alternatives):

    class JsonStore:
        def __init__(self, root: Path): ...
        def create_run_dir(self, run_id: str) -> Path: ...
        def save_run_config(self, cfg: BenchmarkRunConfig) -> None: ...
        def save_run_result(self, result: BenchmarkResult) -> None: ...
        def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None: ...

  - The layout on disk should roughly be:
    - root/run_id/config.json
    - root/run_id/metrics.json
    - root/run_id/traces.jsonl (optional)
  - These files should be valid JSON and easy to parse later.
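To make the intent concrete, here is a minimal sketch of two of the dataclasses and a trimmed JsonStore. Default values are placeholders, some fields (e.g., the workload on BenchmarkRunConfig) are omitted for brevity, and nothing here is final:

```python
# harness/types.py + store/json_store.py (sketch) -- illustrative, not the required layout.
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Literal


@dataclass
class VLLMConfig:
    """Core vLLM knobs the tuner is allowed to change (defaults are placeholders)."""
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    expert_parallel_size: int = 1
    data_parallel_size: int = 1
    block_size: int = 16
    max_num_batched_tokens: int = 8192
    max_num_seqs: int = 256
    gpu_memory_utilization: float = 0.90
    scheduling_policy: str = "fcfs"
    extra_engine_args: dict[str, Any] = field(default_factory=dict)


@dataclass
class BenchmarkRunConfig:
    # workload: WorkloadProfile is omitted from this sketch for brevity.
    run_id: str
    engine: Literal["vllm", "vidur"]
    vllm_config: VLLMConfig
    objective: str
    extra: dict[str, Any] | None = None


class JsonStore:
    """Minimal JSON store: one directory per run under `root`."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def create_run_dir(self, run_id: str) -> Path:
        run_dir = self.root / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        return run_dir

    def save_run_config(self, cfg: BenchmarkRunConfig) -> None:
        path = self.create_run_dir(cfg.run_id) / "config.json"
        path.write_text(json.dumps(asdict(cfg), indent=2, default=str))

    def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None:
        # traces.jsonl: one JSON object per line, cheap to append and to stream later.
        with (self.create_run_dir(run_id) / "traces.jsonl").open("a") as fh:
            fh.write(json.dumps(record, default=str) + "\n")
```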
Part 1: vLLM benchmark runner with profiling
Implement a robust vLLM runner that:
- Takes a `BenchmarkRunConfig`, `HardwareProfile`, and `ModelProfile`.
- Runs a benchmark against vLLM.
- Collects aggregated metrics and (optionally) time-series traces.
- Persists everything through `JsonStore`.
- In `harness/vllm_runner.py` implement:

  from .types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
  from store.json_store import JsonStore

  def run_vllm_benchmark(
      run_config: BenchmarkRunConfig,
      hw: HardwareProfile,
      model: ModelProfile,
      store: JsonStore,
  ) -> BenchmarkResult:
      """
      Launch a vLLM benchmark according to run_config, collect metrics, and store logs.

      This function is allowed to:
        - Use an in-process vLLM Engine, OR
        - Start a vLLM HTTP server as a subprocess and send requests to it.

      It MUST:
        - Generate a unique run directory via JsonStore.
        - Save config and metrics via JsonStore.
        - Return a BenchmarkResult object populated with aggregated metrics.
      """

- Integrate existing modules:
- Use inspector/workload_profiler.py to generate the request stream and WorkloadProfile.
- Optionally call a helper (in a new module harness/profiler.py) that periodically samples GPU utilization, memory usage, etc., and writes records via JsonStore.append_trace_record (a minimal sampling sketch follows at the end of this part).
- Update scripts/run_vllm_benchmark.py to:
- Parse CLI arguments (model, workload description, objective, etc.).
- Instantiate HardwareProfile via inspector/hardware_inspector.py.
- Instantiate ModelProfile via feasibility/model_meta.py.
- Instantiate a JsonStore rooted at e.g. ./runs.
- Build a BenchmarkRunConfig.
- Call run_vllm_benchmark.
- Print a concise summary plus the run_id.
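One possible shape for the optional harness/profiler.py sampler mentioned above, assuming nvidia-smi is available on the PATH; the query fields and 1-second interval are illustrative choices:

```python
# harness/profiler.py (sketch) -- background GPU sampler feeding JsonStore.append_trace_record.
from __future__ import annotations

import subprocess
import threading
import time
from typing import Any, Callable


def _sample_gpus() -> list[dict[str, Any]]:
    """Read per-GPU utilization and memory via nvidia-smi (CSV, no header, no units)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    samples = []
    for line in out.strip().splitlines():
        idx, util, mem_used, mem_total = [x.strip() for x in line.split(",")]
        samples.append({"gpu": int(idx), "util_pct": float(util),
                        "mem_used_mb": float(mem_used), "mem_total_mb": float(mem_total)})
    return samples


class GpuSampler:
    """Samples GPUs every `interval_s` seconds on a daemon thread until stop() is called."""

    def __init__(self, emit: Callable[[dict[str, Any]], None], interval_s: float = 1.0) -> None:
        self._emit = emit            # e.g. lambda rec: store.append_trace_record(run_id, rec)
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self) -> None:
        while not self._stop.is_set():
            self._emit({"ts": time.time(), "gpus": _sample_gpus()})
            self._stop.wait(self._interval_s)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join(timeout=5)
```

run_vllm_benchmark would start the sampler before sending traffic and stop it in a finally block, so traces are flushed even when a run fails.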
Part 2: Config generator & validity checking
We already have feasibility/filter_configs.py, memory_model.py, model_meta.py, and topology_rules.py. Now we need to:
- Extend feasibility/filter_configs.py to expose a single, well-typed validator:
from harness.types import VLLMConfig, HardwareProfile, ModelProfile, WorkloadProfile
def validate_config(
cfg: VLLMConfig,
hw: HardwareProfile,
model: ModelProfile,
workload: WorkloadProfile | None = None,
) -> tuple[bool, list[str]]:
"""
Check whether a VLLMConfig is valid for the given hardware/model/workload.
Returns:
(is_valid, reasons_if_invalid)
"""
- Use memory_model to estimate HBM usage and enforce an upper bound.
- Use topology_rules to reject obviously bad parallelism combinations.
- Use simple logical constraints (tp * pp * ep * dp == world_size, etc.).
- Implement a minimal "repair" helper in feasibility/filter_configs.py:
def repair_config(
cfg: VLLMConfig,
hw: HardwareProfile,
model: ModelProfile,
workload: WorkloadProfile,
) -> VLLMConfig:
"""
Attempt to minimally modify cfg to make it valid.
Strategies:
- Adjust gpu_memory_utilization downwards if memory is too tight.
- Reduce max_num_batched_tokens or max_num_seqs if KV cache is too large.
- Fix tp/pp/ep/dp product to match world_size.
- Raise a clear exception if repair is impossible.
"""
- Implement config_generator.py as the main entry for generating seed configs:
- Provide:
def generate_seed_configs(
hw: HardwareProfile,
model: ModelProfile,
workload: WorkloadProfile,
objective: str,
max_configs: int = 8,
) -> list[VLLMConfig]:
"""
Generate a small set of reasonable, VALID seed vLLM configs for the given environment.
Use:
- Handcrafted "template families" for dense/MoE/MLA models.
- Different templates for latency vs throughput oriented objectives.
- validate_config(...) to filter out invalid configs.
"""
- Use the validator and repair helper when constructing these configs.
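To make the expectations concrete, here is a sketch of the kind of checks validate_config could run; estimate_hbm_gb is a hypothetical stand-in for whatever feasibility/memory_model.py actually exposes, and the specific checks are illustrative:

```python
# feasibility/filter_configs.py (sketch) -- structural + memory checks behind validate_config.
from __future__ import annotations

from harness.types import VLLMConfig, HardwareProfile, ModelProfile, WorkloadProfile


def estimate_hbm_gb(cfg: VLLMConfig, hw: HardwareProfile, model: ModelProfile) -> float:
    """Hypothetical stand-in for the real memory_model estimate (weights + KV cache + overhead)."""
    raise NotImplementedError


def validate_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile | None = None,
) -> tuple[bool, list[str]]:
    """Return (is_valid, reasons_if_invalid) for cfg on the given hardware/model/workload."""
    reasons: list[str] = []

    # 1. Parallelism degrees must multiply out to the available world size.
    world = (cfg.tensor_parallel_size * cfg.pipeline_parallel_size
             * cfg.expert_parallel_size * cfg.data_parallel_size)
    if world != hw.num_gpus:
        reasons.append(f"tp*pp*ep*dp = {world}, but {hw.num_gpus} GPUs are available")

    # 2. Attention heads should be divisible across tensor-parallel ranks.
    if model.num_heads % cfg.tensor_parallel_size != 0:
        reasons.append(f"{model.num_heads} heads not divisible by tp={cfg.tensor_parallel_size}")

    # 3. Expert parallelism only makes sense for MoE models.
    if cfg.expert_parallel_size > 1 and not model.is_moe:
        reasons.append("expert_parallel_size > 1 on a dense model")

    # 4. Estimated per-GPU memory must fit inside the HBM budget.
    needed_gb = estimate_hbm_gb(cfg, hw, model)
    budget_gb = hw.hbm_gb * cfg.gpu_memory_utilization
    if needed_gb > budget_gb:
        reasons.append(f"needs ~{needed_gb:.1f} GiB/GPU but budget is {budget_gb:.1f} GiB")

    return (len(reasons) == 0, reasons)
```

repair_config can reuse the same reason strings to decide which knob to back off first (memory utilization, then batching limits, then the parallelism product).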
Part 3: Unified runner for vidur (simulation) and vLLM (real)
We want a single harness function that can run either vidur or vLLM with the same input types and logging behavior.
- In adapter/ add two modules:
- adapter/vllm_adapter.py
- Responsible for low-level integration with vLLM (EngineArgs / HTTP service).
- Should expose simple primitives such as "apply VLLMConfig" and "send requests".
- adapter/vidur_adapter.py
- Responsible for integrating with the vidur simulator.
- Should expose an interface that, given a VLLMConfig and WorkloadProfile, produces metrics comparable to vLLM.
- Both adapters can be thin for now; they can evolve later.
- In harness/runner.py implement:
from harness.types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
from store.json_store import JsonStore
def run_benchmark(
run_config: BenchmarkRunConfig,
hw: HardwareProfile,
model: ModelProfile,
store: JsonStore,
) -> BenchmarkResult:
"""
Dispatch to the appropriate engine runner ("vllm" or "vidur") based on run_config.engine,
then return a BenchmarkResult.
The interface and logging semantics should be identical for both engines.
"""
- Internally call run_vllm_benchmark for "vllm" and run_vidur_simulation for "vidur" (you should create run_vidur_simulation similar to run_vllm_benchmark).
- Update / implement scripts/run_workload_benchmarks.py:
- Accept CLI options like --engine {vllm,vidur}.
- Loop over a list of configs (seed configs from config_generator) and call run_benchmark for each.
- Summarize results to stdout.
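The dispatch itself can stay very small. A sketch, assuming run_vidur_simulation lives in a (hypothetical) harness/vidur_runner.py module and mirrors run_vllm_benchmark's signature:

```python
# harness/runner.py (sketch) -- single entry point shared by vLLM and vidur runs.
from __future__ import annotations

import logging

from harness.types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
from harness.vllm_runner import run_vllm_benchmark
from harness.vidur_runner import run_vidur_simulation  # assumed module/function name
from store.json_store import JsonStore

logger = logging.getLogger(__name__)

_ENGINE_RUNNERS = {
    "vllm": run_vllm_benchmark,
    "vidur": run_vidur_simulation,
}


def run_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """Dispatch to the engine-specific runner; both share the same types and logging semantics."""
    try:
        runner = _ENGINE_RUNNERS[run_config.engine]
    except KeyError:
        raise ValueError(
            f"Unknown engine {run_config.engine!r}; expected one of {sorted(_ENGINE_RUNNERS)}"
        )
    logger.info("Dispatching run %s to engine %s", run_config.run_id, run_config.engine)
    return runner(run_config, hw, model, store)
```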
Part 4: AI-driven iterative search loop
Implement an iterative search loop that calls an external LLM (e.g., OpenAI API) to:
- Inspect previous configs + metrics.
- Diagnose bottlenecks.
- Propose new configs or modifications.
- Iterate until convergence or a step budget is reached.
- In search/ai_loop.py implement:
from harness.types import (
HardwareProfile,
ModelProfile,
WorkloadProfile,
VLLMConfig,
BenchmarkRunConfig,
BenchmarkResult,
)
def ai_search_loop(
hw: HardwareProfile,
model: ModelProfile,
workload: WorkloadProfile,
objective: str,
engine: str = "vllm",
max_steps: int = 20,
) -> list[BenchmarkResult]:
"""
Core loop:
- Generate seed configs via config_generator.generate_seed_configs.
- For each config:
- Create a BenchmarkRunConfig and run a benchmark via harness.run_benchmark.
- Store BenchmarkResult objects in a `history` list.
- For step in range(max_steps):
- Summarize `history` into a textual prompt for an external LLM.
- Call an LLM client (you may stub this out, or define a placeholder function) that returns:
- A bottleneck analysis
- A small set of new candidate configs (or edits to existing configs)
- For each candidate:
- Validate and repair via feasibility.validate_config / repair_config.
- Run a benchmark and append to `history`.
- Optionally, ask the LLM (or use simple heuristics) to check for convergence.
- Return the full history for further analysis.
"""
- You do NOT need to implement the actual OpenAI API call in detail; you can define a placeholder interface like:
def ask_llm_for_new_configs(
hw: HardwareProfile,
model: ModelProfile,
workload: WorkloadProfile,
objective: str,
history: list[BenchmarkResult],
) -> tuple[str, list[VLLMConfig]]:
"""
Returns (bottleneck_analysis_text, candidate_configs).
This can be implemented later with a real LLM backend.
"""
- Update scripts/run_search.py to:
- Parse arguments for model, workload, objective, engine, max_steps.
- Use hardware_inspector / model_meta / workload_profiler to build the profiles.
- Instantiate JsonStore.
- Call ai_search_loop.
- Print the best configuration and its metrics at the end.
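Both the prompt summary and the placeholder can stay backend-agnostic. A rough sketch, assuming VLLMConfig and BenchmarkResult are dataclasses; the stub's perturbation heuristic and the higher-is-better assumption on the objective metric are purely illustrative:

```python
# search/ai_loop.py (sketch) -- history summary + stubbed LLM proposal step.
from __future__ import annotations

import dataclasses
import json

from harness.types import (
    HardwareProfile, ModelProfile, WorkloadProfile, VLLMConfig, BenchmarkResult,
)


def summarize_history(history: list[BenchmarkResult], objective: str, top_k: int = 5) -> str:
    """Render the best runs so far as compact JSON lines for the LLM prompt."""
    scored = [r for r in history if r.success and objective in r.aggregated_metrics]
    # Assumes higher metric values are better; flip the sort for latency-style objectives.
    scored.sort(key=lambda r: r.aggregated_metrics[objective], reverse=True)
    lines = [
        json.dumps({"config": dataclasses.asdict(r.vllm_config),
                    "metrics": r.aggregated_metrics})
        for r in scored[:top_k]
    ]
    return f"Objective: {objective}\nBest runs so far:\n" + "\n".join(lines)


def ask_llm_for_new_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    history: list[BenchmarkResult],
) -> tuple[str, list[VLLMConfig]]:
    """Placeholder: return (bottleneck_analysis_text, candidate_configs).

    A real backend would send summarize_history(...) plus hardware/model/workload
    context to an LLM and parse structured config proposals out of the reply.
    The stub below just perturbs the best known config so the loop stays runnable.
    """
    successes = [r for r in history if r.success]
    if not successes:
        return ("no successful runs to analyze yet", [])
    best = max(successes, key=lambda r: r.aggregated_metrics.get(objective, float("-inf")))
    candidate = dataclasses.replace(
        best.vllm_config,
        max_num_seqs=max(1, best.vllm_config.max_num_seqs // 2),  # crude "reduce batching" probe
    )
    return ("stub analysis: no LLM backend configured", [candidate])
```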
Coding style & quality expectations
- Use clear, descriptive function and variable names.
- Add type hints to all public functions and dataclasses.
- Include docstrings that explain the intent, not just restate the name.
- Use pathlib.Path for filesystem operations.
- Use logging.getLogger(name) in each module.
Deliverables summary
Implement or update the following modules:
- harness/types.py (or equivalent)
- store/json_store.py (extend)
- harness/vllm_runner.py
- harness/runner.py
- adapter/vllm_adapter.py
- adapter/vidur_adapter.py
- feasibility/filter_configs.py (extend)
- config_generator.py (implement seed generation)
- search/ai_loop.py
- scripts/run_vllm_benchmark.py (update)
- scripts/run_workload_benchmarks.py (update)
- scripts/run_search.py (update)

Focus on clean architecture and clear boundaries between:
- Config generation & validation (config_generator, feasibility)
- Execution & profiling (harness, adapter, inspector, store)
- AI-driven reasoning (search/ai_loop.py)