You are an expert Python systems engineer working on an LLM inference auto-tuning project.

The repository layout currently looks like:

```
llm_autotune/
├── adapter/
│   └── __init__.py
├── config_generator.py
├── config_space/
│   └── __init__.py
├── feasibility/
│   ├── filter_configs.py
│   ├── __init__.py
│   ├── memory_model.py
│   ├── model_meta.py
│   ├── topology_rules.py
│   └── __pycache__/...
├── harness/
│   └── __init__.py
├── inspector/
│   ├── hardware_inspector.py
│   ├── probe.py
│   ├── workload_profiler.py
│   └── __pycache__/...
├── search/
│   ├── heuristic.py
│   └── __init__.py
├── store/
│   ├── json_store.py
│   └── __init__.py
├── util/
│   └── __init__.py
└── scripts/
    ├── run_inspect.py
    ├── run_search.py
    ├── run_vllm_benchmark.py
    └── run_workload_benchmarks.py
```

# Goal

Implement the core infrastructure for an AI-driven LLM inference config auto-tuner, with four concrete capabilities:

1. A robust profiling & logging tool that can run vLLM benchmarks and record detailed logs for later bottleneck analysis.
2. A basic config generator that produces VALID vLLM configs for a given hardware + model + workload.
3. A unified harness to run vidur (simulation) and vLLM (real system) with the same interface, writing detailed profiling logs.
4. An AI-driven iterative loop that reads historical results, diagnoses bottlenecks, proposes new configs, and calls the harness repeatedly.

# General requirements

- Python 3.12+ with type hints and dataclasses where appropriate.
- Keep dependencies minimal; assume we can add small, well-known libs if necessary, but prefer the standard library.
- Avoid ad-hoc global state; pass objects explicitly.
- Logging:
  - Use the `logging` module, not `print`, for internal logs.
  - User-facing scripts may still print concise summaries to stdout.
- All public functions should have clear docstrings.

# Part 0: Core types & JSON store

1. Create or extend a module `harness/types.py` (or `harness/__init__.py` if you prefer) to define the core data classes (a short sketch of two of them follows this list):

   - `HardwareProfile`
     - High-level fields like: gpu_type, num_gpus, hbm_gb, nvlink_topology (string or simple struct), cpu_cores, system_memory_gb, etc.
     - This should be compatible with what `inspector/hardware_inspector.py` can produce.
   - `ModelProfile`
     - Fields like: model_name, param_count, hidden_size, num_layers, num_heads, is_moe, num_experts, is_mla, max_position_embeddings, etc.
     - This should be compatible with `feasibility/model_meta.py`.
   - `WorkloadProfile`
     - Fields like: workload_name, qps, avg_prompt_tokens, p95_prompt_tokens, avg_decode_tokens, p95_decode_tokens, request_type, etc.
     - This should be compatible with `inspector/workload_profiler.py`.
   - `VLLMConfig`
     - A structured representation of the core vLLM config knobs we care about:
       - tensor_parallel_size, pipeline_parallel_size, expert_parallel_size, data_parallel_size
       - block_size, max_num_batched_tokens, max_num_seqs
       - gpu_memory_utilization
       - scheduling_policy (string)
       - router/admission knobs if applicable
       - any other important vLLM engine args we need.
   - `BenchmarkRunConfig`
     - Fields:
       - run_id: str
       - engine: Literal["vllm", "vidur"]
       - vllm_config: VLLMConfig
       - workload: WorkloadProfile
       - objective: str
       - extra: dict[str, Any] | None (for future extensions)
   - `BenchmarkResult`
     - Fields:
       - run_id: str
       - success: bool
       - aggregated_metrics: dict[str, float]  # e.g., {"qps": ..., "p95_latency_ms": ..., "ttft_ms": ...}
       - hw: HardwareProfile
       - model: ModelProfile
       - workload: WorkloadProfile
       - vllm_config: VLLMConfig
       - error_message: str | None
       - trace_paths: list[str]  # paths to detailed traces if any
       - started_at: datetime
       - finished_at: datetime
   - `BottleneckReport`
     - A simple summary object that the AI loop could use later, with fields like:
       - primary_bottleneck: Literal["memory", "compute", "communication", "scheduler", "unknown"]
       - secondary_bottlenecks: list[str]
       - notes: str

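As a reference point, here is a minimal sketch of two of these dataclasses. Field names follow the lists above; the defaults are placeholders assumed for illustration, not tuned or finalized values:

```python
# harness/types.py (sketch) -- defaults are illustrative placeholders.
from dataclasses import dataclass, field
from typing import Any, Literal


@dataclass
class VLLMConfig:
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    expert_parallel_size: int = 1
    data_parallel_size: int = 1
    block_size: int = 16
    max_num_batched_tokens: int = 8192
    max_num_seqs: int = 256
    gpu_memory_utilization: float = 0.9
    scheduling_policy: str = "fcfs"
    # Catch-all for "any other important vLLM engine args".
    extra_engine_args: dict[str, Any] = field(default_factory=dict)


@dataclass
class BottleneckReport:
    primary_bottleneck: Literal[
        "memory", "compute", "communication", "scheduler", "unknown"
    ] = "unknown"
    secondary_bottlenecks: list[str] = field(default_factory=list)
    notes: str = ""
```
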
2. Extend `store/json_store.py` to implement a minimal JSON-based store:

   - Provide at least these methods (or similar, well-documented alternatives):

```python
class JsonStore:
    def __init__(self, root: Path): ...

    def create_run_dir(self, run_id: str) -> Path: ...
    def save_run_config(self, cfg: BenchmarkRunConfig) -> None: ...
    def save_run_result(self, result: BenchmarkResult) -> None: ...
    def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None: ...
```

   - The layout on disk should roughly be:
     - `root/run_id/config.json`
     - `root/run_id/metrics.json`
     - `root/run_id/traces.jsonl` (optional)
   - These files should be valid JSON and easy to parse later.

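A stdlib-only sketch of this store is below. It assumes the Part 0 dataclasses exist and uses `json.dumps(..., default=str)` as one simple way to serialize `datetime` fields:

```python
# store/json_store.py (sketch) -- minimal, stdlib-only.
import dataclasses
import json
from pathlib import Path
from typing import Any

from harness.types import BenchmarkResult, BenchmarkRunConfig  # from Part 0


class JsonStore:
    """Persist run configs, metrics, and traces as plain JSON under one root."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def create_run_dir(self, run_id: str) -> Path:
        run_dir = self.root / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        return run_dir

    def save_run_config(self, cfg: BenchmarkRunConfig) -> None:
        path = self.create_run_dir(cfg.run_id) / "config.json"
        path.write_text(json.dumps(dataclasses.asdict(cfg), indent=2, default=str))

    def save_run_result(self, result: BenchmarkResult) -> None:
        path = self.create_run_dir(result.run_id) / "metrics.json"
        path.write_text(json.dumps(dataclasses.asdict(result), indent=2, default=str))

    def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None:
        # JSON Lines: one self-contained record per line, append-only.
        with (self.create_run_dir(run_id) / "traces.jsonl").open("a") as f:
            f.write(json.dumps(record, default=str) + "\n")
```
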
# Part 1: vLLM benchmark runner with profiling

Implement a robust vLLM runner that:

- Takes a `BenchmarkRunConfig`, `HardwareProfile`, and `ModelProfile`.
- Runs a benchmark against vLLM.
- Collects aggregated metrics and (optionally) time-series traces.
- Persists everything through `JsonStore`.

1. In `harness/vllm_runner.py` implement:

```python
from .types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
from store.json_store import JsonStore


def run_vllm_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """
    Launch a vLLM benchmark according to run_config, collect metrics, and store logs.

    This function is allowed to:
    - Use an in-process vLLM engine, OR
    - Start a vLLM HTTP server as a subprocess and send requests to it.

    It MUST:
    - Generate a unique run directory via JsonStore.
    - Save config and metrics via JsonStore.
    - Return a BenchmarkResult object populated with aggregated metrics.
    """
```

2. Integrate existing modules:

   - Use `inspector/workload_profiler.py` to generate the request stream and `WorkloadProfile`.
   - Optionally call a helper (in a new module `harness/profiler.py`) that periodically samples GPU utilization, memory usage, etc., and writes records via `JsonStore.append_trace_record` (see the sampler sketch after this list).

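One way the profiler helper could look, as a sketch: it shells out to `nvidia-smi` (assumed to be on PATH; an NVML binding would work equally well) on a fixed interval and appends one record per GPU:

```python
# harness/profiler.py (sketch) -- periodic GPU sampling via nvidia-smi.
import subprocess
import threading
import time

from store.json_store import JsonStore


def sample_gpus(store: JsonStore, run_id: str, stop: threading.Event,
                interval_s: float = 1.0) -> None:
    """Append one trace record per GPU per interval until `stop` is set."""
    query = ["nvidia-smi", "--query-gpu=index,utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"]
    while not stop.is_set():
        out = subprocess.run(query, capture_output=True, text=True, check=False)
        for line in out.stdout.strip().splitlines():
            idx, util, mem = (x.strip() for x in line.split(","))
            store.append_trace_record(run_id, {
                "ts": time.time(),
                "gpu": int(idx),
                "util_pct": float(util),
                "mem_used_mib": float(mem),
            })
        stop.wait(interval_s)
```

Run this in a daemon thread around the benchmark and set the event once the run finishes, so sampling never outlives the run.
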
3. Update `scripts/run_vllm_benchmark.py` to:

   - Parse CLI arguments (model, workload description, objective, etc.).
   - Instantiate `HardwareProfile` via `inspector/hardware_inspector.py`.
   - Instantiate `ModelProfile` via `feasibility/model_meta.py`.
   - Instantiate a `JsonStore` rooted at e.g. `./runs`.
   - Build a `BenchmarkRunConfig`.
   - Call `run_vllm_benchmark`.
   - Print a concise summary plus the run_id (wiring sketched below).

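A wiring sketch for the script; `inspect_hardware`, `load_model_meta`, and `profile_workload` are hypothetical names standing in for whatever the existing inspector/feasibility modules actually expose:

```python
# scripts/run_vllm_benchmark.py (sketch) -- wiring only.
import argparse
import uuid
from pathlib import Path

from feasibility.model_meta import load_model_meta            # hypothetical name
from harness.types import BenchmarkRunConfig, VLLMConfig
from harness.vllm_runner import run_vllm_benchmark
from inspector.hardware_inspector import inspect_hardware     # hypothetical name
from inspector.workload_profiler import profile_workload      # hypothetical name
from store.json_store import JsonStore


def main() -> None:
    parser = argparse.ArgumentParser(description="Run one vLLM benchmark.")
    parser.add_argument("--model", required=True)
    parser.add_argument("--workload", required=True)
    parser.add_argument("--objective", default="throughput")
    args = parser.parse_args()

    hw = inspect_hardware()
    model = load_model_meta(args.model)
    workload = profile_workload(args.workload)

    run_config = BenchmarkRunConfig(
        run_id=uuid.uuid4().hex[:12],
        engine="vllm",
        vllm_config=VLLMConfig(),  # or a seed from config_generator (Part 2)
        workload=workload,
        objective=args.objective,
        extra=None,
    )
    result = run_vllm_benchmark(run_config, hw, model, JsonStore(Path("./runs")))
    print(f"run_id={result.run_id} success={result.success} "
          f"metrics={result.aggregated_metrics}")


if __name__ == "__main__":
    main()
```
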
# Part 2: Config generator & validity checking

We already have `feasibility/filter_configs.py`, `memory_model.py`, `model_meta.py`, and `topology_rules.py`. Now we need to:

1. Extend `feasibility/filter_configs.py` to expose a single, well-typed validator:

```python
from harness.types import VLLMConfig, HardwareProfile, ModelProfile, WorkloadProfile


def validate_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile | None = None,
) -> tuple[bool, list[str]]:
    """
    Check whether a VLLMConfig is valid for the given hardware/model/workload.

    Returns:
        (is_valid, reasons_if_invalid)
    """
```

   - Use `memory_model` to estimate HBM usage and enforce an upper bound.
   - Use `topology_rules` to reject obviously bad parallelism combinations.
   - Use simple logical constraints (tp * pp * ep * dp == world_size, etc.). A body sketch follows this list.

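A sketch of what the body could look like; `memory_model.estimate_hbm_gb` and `topology_rules.violations` are hypothetical names standing in for whatever those modules actually expose:

```python
# feasibility/filter_configs.py (sketch) -- adapt the helper calls to the
# real interfaces of memory_model and topology_rules.
from feasibility import memory_model, topology_rules
from harness.types import HardwareProfile, ModelProfile, VLLMConfig, WorkloadProfile


def validate_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile | None = None,
) -> tuple[bool, list[str]]:
    """Return (is_valid, reasons_if_invalid) for cfg on this hardware/model."""
    reasons: list[str] = []

    world = (cfg.tensor_parallel_size * cfg.pipeline_parallel_size
             * cfg.expert_parallel_size * cfg.data_parallel_size)
    if world != hw.num_gpus:
        reasons.append(f"tp*pp*ep*dp={world} != num_gpus={hw.num_gpus}")
    if cfg.expert_parallel_size > 1 and not model.is_moe:
        reasons.append("expert parallelism requested for a non-MoE model")
    if not 0.0 < cfg.gpu_memory_utilization <= 0.95:
        reasons.append(f"gpu_memory_utilization={cfg.gpu_memory_utilization} out of range")

    est_hbm_gb = memory_model.estimate_hbm_gb(cfg, model)  # hypothetical helper
    if est_hbm_gb > hw.hbm_gb * cfg.gpu_memory_utilization:
        reasons.append(f"estimated {est_hbm_gb:.1f} GiB HBM exceeds per-GPU budget")
    reasons.extend(topology_rules.violations(cfg, hw))  # hypothetical helper

    return (not reasons, reasons)
```
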
2. Implement a minimal "repair" helper in `feasibility/filter_configs.py`:

```python
def repair_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
) -> VLLMConfig:
    """
    Attempt to minimally modify cfg to make it valid.

    Strategies:
    - Adjust gpu_memory_utilization downwards if memory is too tight.
    - Reduce max_num_batched_tokens or max_num_seqs if the KV cache is too large.
    - Fix the tp/pp/ep/dp product to match world_size.
    - Raise a clear exception if repair is impossible.
    """
```

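A sketch of one possible repair ordering, living in the same module as `validate_config`; the 0.05 step, the loop bound, and the `"HBM"` substring check are illustrative assumptions, and `dataclasses.replace` presumes `VLLMConfig` is a dataclass:

```python
# Sketch: fix parallelism first, then back off memory pressure in bounded steps.
import dataclasses


def repair_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
) -> VLLMConfig:
    """Minimally adjust cfg until validate_config accepts it, or raise."""
    candidate = cfg
    # Fix the parallelism product by absorbing the remainder into dp.
    base = (candidate.tensor_parallel_size * candidate.pipeline_parallel_size
            * candidate.expert_parallel_size)
    if base > 0 and hw.num_gpus % base == 0:
        candidate = dataclasses.replace(candidate, data_parallel_size=hw.num_gpus // base)

    reasons: list[str] = []
    for _ in range(10):  # bounded back-off on memory pressure
        ok, reasons = validate_config(candidate, hw, model, workload)
        if ok:
            return candidate
        if any("HBM" in r for r in reasons):
            candidate = dataclasses.replace(
                candidate,
                gpu_memory_utilization=round(candidate.gpu_memory_utilization - 0.05, 2),
                max_num_batched_tokens=max(1024, candidate.max_num_batched_tokens // 2),
            )
        else:
            break  # a non-memory violation we don't know how to repair
    raise ValueError(f"cannot repair config: {reasons}")
```
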
3. Implement `config_generator.py` as the main entry point for generating seed configs:

   - Provide:

```python
def generate_seed_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    max_configs: int = 8,
) -> list[VLLMConfig]:
    """
    Generate a small set of reasonable, VALID seed vLLM configs for the given environment.

    Use:
    - Handcrafted "template families" for dense/MoE/MLA models.
    - Different templates for latency- vs throughput-oriented objectives.
    - validate_config(...) to filter out invalid configs.
    """
```

   - Use the validator and repair helper when constructing these configs; one illustrative template family is sketched below.

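One illustrative template family as a sketch; it assumes the `VLLMConfig` defaults from Part 0 and that the `objective` string distinguishes latency- from throughput-oriented runs:

```python
# config_generator.py (sketch) -- one tensor-parallel template family.
from feasibility.filter_configs import repair_config, validate_config
from harness.types import HardwareProfile, ModelProfile, VLLMConfig, WorkloadProfile


def generate_seed_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    max_configs: int = 8,
) -> list[VLLMConfig]:
    """Generate a handful of valid seed configs by sweeping tensor parallelism."""
    latency_oriented = objective in ("latency", "ttft")  # assumed objective names
    seeds: list[VLLMConfig] = []
    for tp in (1, 2, 4, 8):
        if hw.num_gpus % tp != 0:
            continue
        cfg = VLLMConfig(
            tensor_parallel_size=tp,
            data_parallel_size=hw.num_gpus // tp,
            max_num_seqs=64 if latency_oriented else 256,
            max_num_batched_tokens=4096 if latency_oriented else 16384,
            # A MoE template family would also set expert_parallel_size here.
        )
        ok, _ = validate_config(cfg, hw, model, workload)
        if not ok:
            try:
                cfg = repair_config(cfg, hw, model, workload)
            except ValueError:
                continue  # drop candidates that cannot be repaired
        seeds.append(cfg)
    return seeds[:max_configs]
```
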
# Part 3: Unified runner for vidur (simulation) and vLLM (real)

We want a single harness function that can run either vidur or vLLM with the same input types and logging behavior.

1. In `adapter/` add two modules:

   - `adapter/vllm_adapter.py`
     - Responsible for low-level integration with vLLM (EngineArgs / HTTP service).
     - Should expose simple primitives such as "apply VLLMConfig" and "send requests".
   - `adapter/vidur_adapter.py`
     - Responsible for integrating with the vidur simulator.
     - Should expose an interface that, given a `VLLMConfig` and `WorkloadProfile`, produces metrics comparable to vLLM.

   Both adapters can be thin now; they can evolve later.

2. In `harness/runner.py` implement:

```python
from harness.types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
from store.json_store import JsonStore


def run_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """
    Dispatch to the appropriate engine runner ("vllm" or "vidur") based on
    run_config.engine, then return a BenchmarkResult.

    The interface and logging semantics should be identical for both engines.
    """
```

   - Internally call `run_vllm_benchmark` for "vllm" and `run_vidur_simulation` for "vidur" (create `run_vidur_simulation` analogous to `run_vllm_benchmark`). A minimal dispatch sketch follows.

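A minimal dispatch sketch; the module name `harness/vidur_runner.py` for the vidur counterpart is an assumption:

```python
# harness/runner.py (sketch) -- simple engine dispatch.
from harness.types import BenchmarkResult, BenchmarkRunConfig, HardwareProfile, ModelProfile
from harness.vidur_runner import run_vidur_simulation  # assumed module name
from harness.vllm_runner import run_vllm_benchmark
from store.json_store import JsonStore


def run_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """Dispatch to the engine named by run_config.engine."""
    if run_config.engine == "vllm":
        return run_vllm_benchmark(run_config, hw, model, store)
    if run_config.engine == "vidur":
        return run_vidur_simulation(run_config, hw, model, store)
    raise ValueError(f"unknown engine: {run_config.engine!r}")
```
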
3. Update / implement `scripts/run_workload_benchmarks.py`:

   - Accept CLI options like `--engine {vllm,vidur}`.
   - Loop over a list of configs (seed configs from `config_generator`) and call `run_benchmark` for each, as sketched below.
   - Summarize results to stdout.

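The core of the script could be as small as this sketch (profile construction and argument parsing mirror `run_vllm_benchmark.py` and are omitted):

```python
# scripts/run_workload_benchmarks.py (sketch) -- the benchmark loop only.
import uuid

from config_generator import generate_seed_configs
from harness.runner import run_benchmark
from harness.types import BenchmarkRunConfig


def run_all(engine, hw, model, workload, objective, store):
    results = []
    for cfg in generate_seed_configs(hw, model, workload, objective):
        run_config = BenchmarkRunConfig(
            run_id=uuid.uuid4().hex[:12],
            engine=engine,
            vllm_config=cfg,
            workload=workload,
            objective=objective,
            extra=None,
        )
        result = run_benchmark(run_config, hw, model, store)
        results.append(result)
        print(f"{result.run_id}: success={result.success} {result.aggregated_metrics}")
    return results
```
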
# Part 4: AI-driven iterative search loop

Implement an iterative search loop that calls an external LLM (e.g., the OpenAI API) to:

- Inspect previous configs + metrics.
- Diagnose bottlenecks.
- Propose new configs or modifications.
- Iterate until convergence or a step budget is reached.

1. In `search/ai_loop.py` implement:

```python
from harness.types import (
    HardwareProfile,
    ModelProfile,
    WorkloadProfile,
    VLLMConfig,
    BenchmarkRunConfig,
    BenchmarkResult,
)


def ai_search_loop(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    engine: str = "vllm",
    max_steps: int = 20,
) -> list[BenchmarkResult]:
    """
    Core loop:

    - Generate seed configs via config_generator.generate_seed_configs.
    - For each config:
        - Create a BenchmarkRunConfig and run a benchmark via harness.run_benchmark.
        - Store BenchmarkResult objects in a `history` list.

    - For step in range(max_steps):
        - Summarize `history` into a textual prompt for an external LLM.
        - Call an LLM client (you may stub this out, or define a placeholder
          function) that returns:
            - a bottleneck analysis,
            - a small set of new candidate configs (or edits to existing configs).
        - For each candidate:
            - Validate and repair via feasibility.validate_config / repair_config.
            - Run a benchmark and append to `history`.
        - Optionally, ask the LLM (or use simple heuristics) to check for convergence.

    - Return the full history for further analysis.
    """
```

   - You do NOT need to implement the actual OpenAI API call in detail; you can define a placeholder interface like:

```python
def ask_llm_for_new_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    history: list[BenchmarkResult],
) -> tuple[str, list[VLLMConfig]]:
    """
    Returns (bottleneck_analysis_text, candidate_configs).

    This can be implemented later with a real LLM backend.
    """
```

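One plausible shape for the "summarize `history` into a prompt" step, a sketch only; the config fields and metric keys shown are examples, not a fixed schema:

```python
# search/ai_loop.py (sketch) -- compact history rendering for the LLM prompt.
import json

from harness.types import BenchmarkResult


def summarize_history(history: list[BenchmarkResult]) -> str:
    """Render (config, metrics) pairs compactly so the LLM can compare runs."""
    rows = [
        {
            "run_id": r.run_id,
            "tp": r.vllm_config.tensor_parallel_size,
            "pp": r.vllm_config.pipeline_parallel_size,
            "max_num_seqs": r.vllm_config.max_num_seqs,
            "metrics": r.aggregated_metrics,  # e.g., {"qps": ..., "p95_latency_ms": ...}
            "error": r.error_message,
        }
        for r in history
    ]
    return json.dumps(rows, indent=2, default=str)
```
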
2. Update `scripts/run_search.py` to:

   - Parse arguments for model, workload, objective, engine, max_steps.
   - Use `hardware_inspector` / `model_meta` / `workload_profiler` to build the profiles.
   - Instantiate a `JsonStore`.
   - Call `ai_search_loop`.
   - Print the best configuration and its metrics at the end (one selection helper is sketched below).

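For the final summary, one small helper; it assumes a single higher-is-better metric such as "qps", while a latency objective would take the minimum instead:

```python
# Picking the run to report (sketch).
from harness.types import BenchmarkResult


def best_result(history: list[BenchmarkResult], metric: str = "qps") -> BenchmarkResult:
    successful = [r for r in history if r.success and metric in r.aggregated_metrics]
    if not successful:
        raise ValueError("no successful runs to report")
    return max(successful, key=lambda r: r.aggregated_metrics[metric])
```
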
# Coding style & quality expectations

- Use clear, descriptive function and variable names.
- Add type hints to all public functions and dataclasses.
- Include docstrings that explain the intent, not just restate the name.
- Use `pathlib.Path` for filesystem operations.
- Use `logging.getLogger(__name__)` in each module.

# Deliverables summary

Implement or update the following modules:

- `harness/types.py` (or equivalent)
- `store/json_store.py` (extend)
- `harness/vllm_runner.py`
- `harness/runner.py`
- `adapter/vllm_adapter.py`
- `adapter/vidur_adapter.py`
- `feasibility/filter_configs.py` (extend)
- `config_generator.py` (implement seed generation)
- `search/ai_loop.py`
- `scripts/run_vllm_benchmark.py` (update)
- `scripts/run_workload_benchmarks.py` (update)
- `scripts/run_search.py` (update)

Focus on clean architecture and clear boundaries between:

- Config generation & validation (`config_generator`, `feasibility`)
- Execution & profiling (`harness`, `adapter`, `inspector`, `store`)
- AI-driven reasoning (`search/ai_loop.py`)