You are an expert Python systems engineer working on an LLM inference auto-tuning project.
The repository layout currently looks like:
```
llm_autotune/
├── adapter/
│   └── __init__.py
├── config_generator.py
├── config_space/
│   └── __init__.py
├── feasibility/
│   ├── filter_configs.py
│   ├── __init__.py
│   ├── memory_model.py
│   ├── model_meta.py
│   ├── topology_rules.py
│   └── __pycache__/...
├── harness/
│   └── __init__.py
├── inspector/
│   ├── hardware_inspector.py
│   ├── probe.py
│   ├── workload_profiler.py
│   └── __pycache__/...
├── search/
│   ├── heuristic.py
│   └── __init__.py
├── store/
│   ├── json_store.py
│   └── __init__.py
├── util/
│   └── __init__.py
└── scripts/
    ├── run_inspect.py
    ├── run_search.py
    ├── run_vllm_benchmark.py
    └── run_workload_benchmarks.py
```
# Goal
Implement the core infrastructure for an AI-driven LLM inference config auto-tuner, with four concrete capabilities:
1. A robust profiling & logging tool that can run vLLM benchmarks and record detailed logs for later bottleneck analysis.
2. A basic config generator that produces VALID vLLM configs for a given hardware + model + workload.
3. A unified harness to run vidur (simulation) and vLLM (real system) with the same interface, writing detailed profiling logs.
4. An AI-driven iterative loop that reads historical results, diagnoses bottlenecks, proposes new configs, and calls the harness repeatedly.
# General requirements
- Python 3.12+ with type hints and dataclasses where appropriate.
- Keep dependencies minimal; assume we can add small, well-known libs if necessary, but prefer the standard library.
- Avoid ad-hoc global state; pass objects explicitly.
- Logging:
  - Use the `logging` module, not `print`, for internal logs.
  - User-facing scripts may still print concise summaries to stdout.
- All public functions should have clear docstrings.
# Part 0: Core types & JSON store
1. Create or extend a module `harness/types.py` (or `harness/__init__.py` if you prefer) to define the core data classes:
   - `HardwareProfile`
     - High-level fields like: gpu_type, num_gpus, hbm_gb, nvlink_topology (string or simple struct), cpu_cores, system_memory_gb, etc.
     - This should be compatible with what `inspector/hardware_inspector.py` can produce.
   - `ModelProfile`
     - Fields like: model_name, param_count, hidden_size, num_layers, num_heads, is_moe, num_experts, is_mla, max_position_embeddings, etc.
     - This should be compatible with `feasibility/model_meta.py`.
   - `WorkloadProfile`
     - Fields like: workload_name, qps, avg_prompt_tokens, p95_prompt_tokens, avg_decode_tokens, p95_decode_tokens, request_type, etc.
     - This should be compatible with `inspector/workload_profiler.py`.
   - `VLLMConfig`
     - A structured representation of the core vLLM config knobs we care about:
       - tensor_parallel_size, pipeline_parallel_size, expert_parallel_size, data_parallel_size
       - block_size, max_num_batched_tokens, max_num_seqs
       - gpu_memory_utilization
       - scheduling_policy (string)
       - router/admission knobs if applicable
       - any other important vLLM engine args we need.
   - `BenchmarkRunConfig`
     - Fields:
       - run_id: str
       - engine: Literal["vllm", "vidur"]
       - vllm_config: VLLMConfig
       - workload: WorkloadProfile
       - objective: str
       - extra: dict[str, Any] | None (for future extensions)
   - `BenchmarkResult`
     - Fields:
       - run_id: str
       - success: bool
       - aggregated_metrics: dict[str, float]  # e.g., {"qps": ..., "p95_latency_ms": ..., "ttft_ms": ...}
       - hw: HardwareProfile
       - model: ModelProfile
       - workload: WorkloadProfile
       - vllm_config: VLLMConfig
       - error_message: str | None
       - trace_paths: list[str]  # paths to detailed traces if any
       - started_at: datetime
       - finished_at: datetime
   - `BottleneckReport`
     - A simple summary object that the AI loop could use later, with fields like:
       - primary_bottleneck: Literal["memory", "compute", "communication", "scheduler", "unknown"]
       - secondary_bottlenecks: list[str]
       - notes: str
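To make the field lists above concrete, here is a minimal sketch of three of these dataclasses. The field names follow the spec; the specific defaults (e.g. `block_size=16`, `gpu_memory_utilization=0.9`) and the subset of fields shown are illustrative placeholder choices, not requirements:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class HardwareProfile:
    """Static description of the machine a benchmark runs on."""
    gpu_type: str
    num_gpus: int
    hbm_gb: float
    cpu_cores: int
    system_memory_gb: float
    nvlink_topology: str = "unknown"


@dataclass
class VLLMConfig:
    """The vLLM knobs the tuner searches over; defaults are placeholders."""
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    expert_parallel_size: int = 1
    data_parallel_size: int = 1
    block_size: int = 16
    max_num_batched_tokens: int = 8192
    max_num_seqs: int = 256
    gpu_memory_utilization: float = 0.9
    scheduling_policy: str = "fcfs"
    extra_engine_args: dict[str, Any] = field(default_factory=dict)


@dataclass
class BenchmarkResult:
    """Outcome of one run; profile fields from the spec are elided here."""
    run_id: str
    success: bool
    aggregated_metrics: dict[str, float]
    error_message: str | None = None
    trace_paths: list[str] = field(default_factory=list)
    started_at: datetime | None = None
    finished_at: datetime | None = None
```

Because these are plain dataclasses, `dataclasses.asdict` gives the JSON-ready dict the store needs with no extra serialization code.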
2. Extend `store/json_store.py` to implement a minimal JSON-based store:
- Provide at least these methods (or similar, well-documented alternatives):
```python
from pathlib import Path
from typing import Any

class JsonStore:
    def __init__(self, root: Path): ...
    def create_run_dir(self, run_id: str) -> Path: ...
    def save_run_config(self, cfg: BenchmarkRunConfig) -> None: ...
    def save_run_result(self, result: BenchmarkResult) -> None: ...
    def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None: ...
```
- The layout on disk should roughly be:
  - `root/run_id/config.json`
  - `root/run_id/metrics.json`
  - `root/run_id/traces.jsonl` (optional)
- These files should be valid JSON and easy to parse later.
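A minimal implementation of that layout could look like the sketch below. For brevity it takes plain dicts plus an explicit `run_id` instead of the typed `BenchmarkRunConfig`/`BenchmarkResult` objects (which would be converted via `dataclasses.asdict` first), so the signatures differ slightly from the spec:

```python
import json
from pathlib import Path
from typing import Any


class JsonStore:
    """Minimal JSON-on-disk store: one directory per run under `root`."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def create_run_dir(self, run_id: str) -> Path:
        run_dir = self.root / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        return run_dir

    def save_run_config(self, run_id: str, cfg: dict[str, Any]) -> None:
        # Pretty-printed so configs stay human-readable in the run directory.
        (self.create_run_dir(run_id) / "config.json").write_text(json.dumps(cfg, indent=2))

    def save_run_result(self, run_id: str, result: dict[str, Any]) -> None:
        (self.create_run_dir(run_id) / "metrics.json").write_text(json.dumps(result, indent=2))

    def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None:
        # JSON Lines: one record per line, appended as the run progresses,
        # so a crashed run still leaves parseable partial traces behind.
        with (self.create_run_dir(run_id) / "traces.jsonl").open("a") as f:
            f.write(json.dumps(record) + "\n")
```

Appending traces line-by-line (rather than rewriting one JSON array) is what makes the `traces.jsonl` file safe to write from a sampler running concurrently with the benchmark.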
# Part 1: vLLM benchmark runner with profiling
Implement a robust vLLM runner that:
- Takes a `BenchmarkRunConfig`, `HardwareProfile`, and `ModelProfile`.
- Runs a benchmark against vLLM.
- Collects aggregated metrics and (optionally) time-series traces.
- Persists everything through `JsonStore`.
1. In `harness/vllm_runner.py` implement:
```python
from .types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
from store.json_store import JsonStore

def run_vllm_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """
    Launch a vLLM benchmark according to run_config, collect metrics, and store logs.

    This function is allowed to:
    - Use an in-process vLLM engine, OR
    - Start a vLLM HTTP server as a subprocess and send requests to it.

    It MUST:
    - Generate a unique run directory via JsonStore.
    - Save config and metrics via JsonStore.
    - Return a BenchmarkResult object populated with aggregated metrics.
    """
```
2. Integrate existing modules:
- Use `inspector/workload_profiler.py` to generate the request stream and `WorkloadProfile`.
- Optionally call a helper (in a new module `harness/profiler.py`) that periodically samples GPU utilization, memory usage, etc., and writes records via `JsonStore.append_trace_record`.
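One way to structure that sampler in `harness/profiler.py` is a background thread that polls a sample function and hands each record to a sink such as a bound `JsonStore.append_trace_record`. The sketch below is engine-agnostic: the actual GPU query (e.g. shelling out to `nvidia-smi`) is injected as `sample_fn`, and the class/parameter names are placeholder choices:

```python
import threading
import time
from typing import Any, Callable


class TraceSampler:
    """Periodically call sample_fn and pass each timestamped record to sink."""

    def __init__(
        self,
        sample_fn: Callable[[], dict[str, Any]],
        sink: Callable[[dict[str, Any]], None],
        interval_s: float = 1.0,
    ) -> None:
        self.sample_fn = sample_fn
        self.sink = sink
        self.interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            record = self.sample_fn()
            record["ts"] = time.time()
            self.sink(record)
            # Event.wait doubles as an interruptible sleep, so stopping is prompt.
            self._stop.wait(self.interval_s)

    def __enter__(self) -> "TraceSampler":
        self._thread.start()
        return self

    def __exit__(self, *exc: object) -> None:
        self._stop.set()
        self._thread.join()
```

Used as a context manager around the benchmark body (`with TraceSampler(...):`), sampling stops and flushes automatically even if the run raises.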
3. Update scripts/run_vllm_benchmark.py to:
- Parse CLI arguments (model, workload description, objective, etc.).
- Instantiate HardwareProfile via inspector/hardware_inspector.py.
- Instantiate ModelProfile via feasibility/model_meta.py.
- Instantiate a JsonStore rooted at e.g. ./runs.
- Build a BenchmarkRunConfig.
- Call run_vllm_benchmark.
- Print a concise summary plus the run_id.
# Part 2: Config generator & validity checking
We already have feasibility/filter_configs.py, memory_model.py, model_meta.py, and topology_rules.py. Now we need to:
1. Extend feasibility/filter_configs.py to expose a single, well-typed validator:
```python
from harness.types import VLLMConfig, HardwareProfile, ModelProfile, WorkloadProfile

def validate_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile | None = None,
) -> tuple[bool, list[str]]:
    """
    Check whether a VLLMConfig is valid for the given hardware/model/workload.

    Returns:
        (is_valid, reasons_if_invalid)
    """
```
- Use memory_model to estimate HBM usage and enforce an upper bound.
- Use topology_rules to reject obviously bad parallelism combinations.
- Use simple logical constraints (tp * pp * ep * dp == world_size, etc.).
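The purely logical constraints can be sketched independently of the memory model. The checks below (world-size product, head and layer divisibility) are standard parallelism sanity rules; expert parallelism and the HBM bound from `memory_model` are deliberately left out of this sketch:

```python
def validate_parallelism(
    tp: int,
    pp: int,
    dp: int,
    num_gpus: int,
    num_heads: int,
    num_layers: int,
) -> tuple[bool, list[str]]:
    """Check the cheap logical constraints; returns (is_valid, reasons_if_invalid)."""
    reasons: list[str] = []
    # The parallelism degrees must exactly tile the available GPUs.
    if tp * pp * dp != num_gpus:
        reasons.append(f"tp*pp*dp = {tp * pp * dp} != world_size {num_gpus}")
    # Tensor parallelism shards attention heads, so tp must divide them.
    if num_heads % tp != 0:
        reasons.append(f"num_heads {num_heads} not divisible by tp {tp}")
    # Pipeline parallelism partitions layers into equal-ish stages.
    if num_layers % pp != 0:
        reasons.append(f"num_layers {num_layers} not divisible by pp {pp}")
    return (not reasons, reasons)
```

Returning the reason strings (rather than just a bool) is what lets the AI loop later feed concrete rejection causes back into its prompt.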
2. Implement a minimal "repair" helper in feasibility/filter_configs.py:
```python
def repair_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
) -> VLLMConfig:
    """
    Attempt to minimally modify cfg to make it valid.

    Strategies:
    - Adjust gpu_memory_utilization downward if memory is too tight.
    - Reduce max_num_batched_tokens or max_num_seqs if the KV cache is too large.
    - Fix the tp/pp/ep/dp product to match world_size.
    - Raise a clear exception if repair is impossible.
    """
```
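The product-fixing strategy in isolation can be very small. This sketch keeps tp and pp fixed and recomputes dp (treating dp as the most expendable degree is an assumption; the real `repair_config` may prefer a different order):

```python
def fix_parallel_product(tp: int, pp: int, num_gpus: int) -> tuple[int, int, int]:
    """Pick dp so that tp * pp * dp == num_gpus; raise if no integer dp exists."""
    if num_gpus % (tp * pp) != 0:
        # No amount of data parallelism can tile the remaining GPUs evenly.
        raise ValueError(f"tp={tp}, pp={pp} cannot tile {num_gpus} GPUs")
    return tp, pp, num_gpus // (tp * pp)
```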
3. Implement config_generator.py as the main entry for generating seed configs:
- Provide:
```python
def generate_seed_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    max_configs: int = 8,
) -> list[VLLMConfig]:
    """
    Generate a small set of reasonable, VALID seed vLLM configs for the given environment.

    Use:
    - Handcrafted "template families" for dense/MoE/MLA models.
    - Different templates for latency- vs throughput-oriented objectives.
    - validate_config(...) to filter out invalid configs.
    """
```
- Use the validator and repair helper when constructing these configs.
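The template-family idea can be sketched with two objective-oriented presets crossed against candidate tensor-parallel sizes. The specific knob values below are illustrative defaults, not tuned recommendations, and the objective-string matching is a placeholder convention:

```python
from dataclasses import dataclass


@dataclass
class SeedTemplate:
    """One member of a template family; expanded per hardware at generation time."""
    name: str
    max_num_seqs: int
    max_num_batched_tokens: int
    gpu_memory_utilization: float


# Illustrative values only: latency templates favor small batches, throughput large ones.
TEMPLATES = {
    "latency": SeedTemplate("latency", 32, 2048, 0.85),
    "throughput": SeedTemplate("throughput", 256, 16384, 0.92),
}


def expand_templates(objective: str, tp_options: list[int]) -> list[dict]:
    """Cross the objective's template with candidate tensor-parallel sizes."""
    base = TEMPLATES["latency" if "latency" in objective else "throughput"]
    return [
        {
            "tensor_parallel_size": tp,
            "max_num_seqs": base.max_num_seqs,
            "max_num_batched_tokens": base.max_num_batched_tokens,
            "gpu_memory_utilization": base.gpu_memory_utilization,
        }
        for tp in tp_options
    ]
```

Each expanded dict would then be turned into a `VLLMConfig` and passed through `validate_config` / `repair_config` before being emitted as a seed.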
# Part 3: Unified runner for vidur (simulation) and vLLM (real)
We want a single harness function that can run either vidur or vLLM with the same input types and logging behavior.
1. In `adapter/` add two modules:
   - `adapter/vllm_adapter.py`
     - Responsible for low-level integration with vLLM (EngineArgs / HTTP service).
     - Should expose simple primitives such as "apply VLLMConfig" and "send requests".
   - `adapter/vidur_adapter.py`
     - Responsible for integrating with the vidur simulator.
     - Should expose an interface that, given a VLLMConfig and WorkloadProfile, produces metrics comparable to vLLM.
Both adapters can be thin now; they can evolve later.
2. In `harness/runner.py` implement:
```python
from harness.types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
from store.json_store import JsonStore

def run_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """
    Dispatch to the appropriate engine runner ("vllm" or "vidur") based on
    run_config.engine, then return a BenchmarkResult.

    The interface and logging semantics should be identical for both engines.
    """
```
- Internally call `run_vllm_benchmark` for "vllm" and `run_vidur_simulation` for "vidur" (create `run_vidur_simulation` analogous to `run_vllm_benchmark`).
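The dispatch itself reduces to a registry lookup. The sketch below collapses the runner signatures to `(run_config) -> metrics dict` purely for illustration; the real `run_benchmark` would pass through `hw`, `model`, and `store` unchanged:

```python
from typing import Any, Callable

# Simplified runner signature for the sketch: run_config in, metrics out.
Runner = Callable[[dict[str, Any]], dict[str, float]]


def make_dispatcher(runners: dict[str, Runner]) -> Runner:
    """Return a run_benchmark-style function that routes on run_config['engine']."""

    def run_benchmark(run_config: dict[str, Any]) -> dict[str, float]:
        engine = run_config["engine"]
        if engine not in runners:
            # Fail loudly on typos rather than silently defaulting to one engine.
            raise ValueError(f"unknown engine {engine!r}; expected one of {sorted(runners)}")
        return runners[engine](run_config)

    return run_benchmark
```

A registry dict keeps the two engines symmetric and makes adding a third backend a one-line change.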
3. Update / implement scripts/run_workload_benchmarks.py:
- Accept CLI options like --engine {vllm,vidur}.
- Loop over a list of configs (seed configs from config_generator) and call run_benchmark for each.
- Summarize results to stdout.
# Part 4: AI-driven iterative search loop
Implement an iterative search loop that calls an external LLM (e.g., OpenAI API) to:
- Inspect previous configs + metrics.
- Diagnose bottlenecks.
- Propose new configs or modifications.
- Iterate until convergence or a step budget is reached.
1. In search/ai_loop.py implement:
```python
from harness.types import (
    HardwareProfile,
    ModelProfile,
    WorkloadProfile,
    VLLMConfig,
    BenchmarkRunConfig,
    BenchmarkResult,
)

def ai_search_loop(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    engine: str = "vllm",
    max_steps: int = 20,
) -> list[BenchmarkResult]:
    """
    Core loop:
    - Generate seed configs via config_generator.generate_seed_configs.
    - For each config:
      - Create a BenchmarkRunConfig and run a benchmark via harness.run_benchmark.
      - Store BenchmarkResult objects in a `history` list.
    - For step in range(max_steps):
      - Summarize `history` into a textual prompt for an external LLM.
      - Call an LLM client (you may stub this out, or define a placeholder function) that returns:
        - A bottleneck analysis.
        - A small set of new candidate configs (or edits to existing configs).
      - For each candidate:
        - Validate and repair via feasibility.validate_config / repair_config.
        - Run a benchmark and append to `history`.
      - Optionally, ask the LLM (or use simple heuristics) to check for convergence.
    - Return the full history for further analysis.
    """
```
- You do NOT need to implement the actual OpenAI API call in detail; you can define a placeholder interface like:
```python
def ask_llm_for_new_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    history: list[BenchmarkResult],
) -> tuple[str, list[VLLMConfig]]:
    """
    Returns (bottleneck_analysis_text, candidate_configs).

    This can be implemented later with a real LLM backend.
    """
```
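The "summarize `history` into a textual prompt" step can be as simple as rendering each run as one compact line. This sketch takes plain `(config, metrics)` dicts rather than `BenchmarkResult` objects, and the metric names mirror the `aggregated_metrics` examples given earlier:

```python
def summarize_history(history: list[dict]) -> str:
    """Render (config, metrics) pairs as compact lines an LLM can reason over."""
    lines = []
    for i, run in enumerate(history):
        cfg, m = run["config"], run["metrics"]
        # One line per run keeps the prompt short and diff-able across steps;
        # missing metrics render as nan rather than raising.
        lines.append(
            f"run {i}: tp={cfg['tensor_parallel_size']} "
            f"max_num_seqs={cfg['max_num_seqs']} -> "
            f"qps={m.get('qps', float('nan')):.2f} "
            f"p95_latency_ms={m.get('p95_latency_ms', float('nan')):.1f}"
        )
    return "\n".join(lines)
```

Keeping the summary deterministic and compact also makes it cheap to log alongside each step for later debugging of the LLM's proposals.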
2. Update scripts/run_search.py to:
- Parse arguments for model, workload, objective, engine, max_steps.
- Use hardware_inspector / model_meta / workload_profiler to build the profiles.
- Instantiate JsonStore.
- Call ai_search_loop.
- Print the best configuration and its metrics at the end.
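Picking the best configuration at the end reduces to a keyed max/min over the history. The sketch below assumes objectives are phrased as `maximize_<metric>` or `minimize_<metric>` (a placeholder convention, not something the spec mandates) and works on plain dicts:

```python
def best_run(history: list[dict], objective: str) -> dict:
    """Select the successful run that best satisfies the objective string."""
    direction, _, metric = objective.partition("_")
    if direction not in ("maximize", "minimize"):
        raise ValueError(f"objective must start with maximize_/minimize_, got {objective!r}")
    # Failed runs and runs missing the target metric cannot win.
    candidates = [r for r in history if r.get("success") and metric in r["metrics"]]
    if not candidates:
        raise ValueError("no successful runs with the target metric")
    if direction == "maximize":
        return max(candidates, key=lambda r: r["metrics"][metric])
    return min(candidates, key=lambda r: r["metrics"][metric])
```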
# Coding style & quality expectations
- Use clear, descriptive function and variable names.
- Add type hints to all public functions and dataclasses.
- Include docstrings that explain the intent, not just restate the name.
- Use pathlib.Path for filesystem operations.
- Use logging.getLogger(__name__) in each module.
# Deliverables summary
Implement or update the following modules:
- harness/types.py (or equivalent)
- store/json_store.py (extend)
- harness/vllm_runner.py
- harness/runner.py
- adapter/vllm_adapter.py
- adapter/vidur_adapter.py
- feasibility/filter_configs.py (extend)
- config_generator.py (implement seed generation)
- search/ai_loop.py
- scripts/run_vllm_benchmark.py (update)
- scripts/run_workload_benchmarks.py (update)
- scripts/run_search.py (update)
Focus on clean architecture and clear boundaries between:
- Config generation & validation (config_generator, feasibility)
- Execution & profiling (harness, adapter, inspector, store)
- AI-driven reasoning (search/ai_loop.py)