You are an expert Python systems engineer working on an LLM inference auto-tuning project.

The repository layout currently looks like:

```
llm_autotune/
├── adapter/
│   └── __init__.py
├── config_generator.py
├── config_space/
│   └── __init__.py
├── feasibility/
│   ├── filter_configs.py
│   ├── __init__.py
│   ├── memory_model.py
│   ├── model_meta.py
│   ├── topology_rules.py
│   └── __pycache__/...
├── harness/
│   └── __init__.py
├── inspector/
│   ├── hardware_inspector.py
│   ├── probe.py
│   ├── workload_profiler.py
│   └── __pycache__/...
├── search/
│   ├── heuristic.py
│   └── __init__.py
├── store/
│   ├── json_store.py
│   └── __init__.py
├── util/
│   └── __init__.py
└── scripts/
    ├── run_inspect.py
    ├── run_search.py
    ├── run_vllm_benchmark.py
    └── run_workload_benchmarks.py
```

# Goal

Implement the core infrastructure for an AI-driven LLM inference config auto-tuner, with four concrete capabilities:

1. A robust profiling & logging tool that can run vLLM benchmarks and record detailed logs for later bottleneck analysis.
2. A basic config generator that produces VALID vLLM configs for a given hardware + model + workload.
3. A unified harness to run vidur (simulation) and vLLM (real system) with the same interface, writing detailed profiling logs.
4. An AI-driven iterative loop that reads historical results, diagnoses bottlenecks, proposes new configs, and calls the harness repeatedly.

# General requirements

- Python 3.12+ with type hints and dataclasses where appropriate.
- Keep dependencies minimal; assume we can add small, well-known libs if necessary, but prefer the standard library.
- Avoid ad-hoc global state; pass objects explicitly.
- Logging:
  - Use the `logging` module, not `print`, for internal logs.
  - User-facing scripts may still print concise summaries to stdout.
- All public functions should have clear docstrings.

# Part 0: Core types & JSON store

1. Create or extend a module `harness/types.py` (or `harness/__init__.py` if you prefer) to define the core data classes (a short sketch of two of them follows this list):

   - `HardwareProfile`
     - High-level fields like: gpu_type, num_gpus, hbm_gb, nvlink_topology (string or simple struct), cpu_cores, system_memory_gb, etc.
     - This should be compatible with what `inspector/hardware_inspector.py` can produce.
   - `ModelProfile`
     - Fields like: model_name, param_count, hidden_size, num_layers, num_heads, is_moe, num_experts, is_mla, max_position_embeddings, etc.
     - This should be compatible with `feasibility/model_meta.py`.
   - `WorkloadProfile`
     - Fields like: workload_name, qps, avg_prompt_tokens, p95_prompt_tokens, avg_decode_tokens, p95_decode_tokens, request_type, etc.
     - This should be compatible with `inspector/workload_profiler.py`.
   - `VLLMConfig`
     - A structured representation of the core vLLM config knobs we care about:
       - tensor_parallel_size, pipeline_parallel_size, expert_parallel_size, data_parallel_size
       - block_size, max_num_batched_tokens, max_num_seqs
       - gpu_memory_utilization
       - scheduling_policy (string)
       - router/admission knobs if applicable
       - any other important vLLM engine args we need.
   - `BenchmarkRunConfig`
     - Fields:
       - run_id: str
       - engine: Literal["vllm", "vidur"]
       - vllm_config: VLLMConfig
       - workload: WorkloadProfile
       - objective: str
       - extra: dict[str, Any] | None (for future extensions)
   - `BenchmarkResult`
     - Fields:
       - run_id: str
       - success: bool
       - aggregated_metrics: dict[str, float]  # e.g., {"qps": ..., "p95_latency_ms": ..., "ttft_ms": ...}
       - hw: HardwareProfile
       - model: ModelProfile
       - workload: WorkloadProfile
       - vllm_config: VLLMConfig
       - error_message: str | None
       - trace_paths: list[str]  # paths to detailed traces if any
       - started_at: datetime
       - finished_at: datetime
   - `BottleneckReport`
     - A simple summary object that the AI loop could use later, with fields like:
       - primary_bottleneck: Literal["memory", "compute", "communication", "scheduler", "unknown"]
       - secondary_bottlenecks: list[str]
       - notes: str

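As a reference point, here is a minimal sketch of two of these dataclasses. Field names follow the lists above; the defaults are placeholders assumed for illustration, not tuned or finalized values:

```python
# harness/types.py (sketch) -- defaults are illustrative placeholders.
from dataclasses import dataclass, field
from typing import Any, Literal


@dataclass
class VLLMConfig:
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    expert_parallel_size: int = 1
    data_parallel_size: int = 1
    block_size: int = 16
    max_num_batched_tokens: int = 8192
    max_num_seqs: int = 256
    gpu_memory_utilization: float = 0.9
    scheduling_policy: str = "fcfs"
    # Catch-all for "any other important vLLM engine args".
    extra_engine_args: dict[str, Any] = field(default_factory=dict)


@dataclass
class BottleneckReport:
    primary_bottleneck: Literal[
        "memory", "compute", "communication", "scheduler", "unknown"
    ] = "unknown"
    secondary_bottlenecks: list[str] = field(default_factory=list)
    notes: str = ""
```
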
2. Extend `store/json_store.py` to implement a minimal JSON-based store:

   - Provide at least these methods (or similar, well-documented alternatives):

```python
class JsonStore:
    def __init__(self, root: Path): ...

    def create_run_dir(self, run_id: str) -> Path: ...
    def save_run_config(self, cfg: BenchmarkRunConfig) -> None: ...
    def save_run_result(self, result: BenchmarkResult) -> None: ...
    def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None: ...
```

   - The layout on disk should roughly be:
     - `root/run_id/config.json`
     - `root/run_id/metrics.json`
     - `root/run_id/traces.jsonl` (optional)
   - These files should be valid JSON and easy to parse later.

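A stdlib-only sketch of this store is below. It assumes the Part 0 dataclasses exist and uses `json.dumps(..., default=str)` as one simple way to serialize `datetime` fields:

```python
# store/json_store.py (sketch) -- minimal, stdlib-only.
import dataclasses
import json
from pathlib import Path
from typing import Any

from harness.types import BenchmarkResult, BenchmarkRunConfig  # from Part 0


class JsonStore:
    """Persist run configs, metrics, and traces as plain JSON under one root."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def create_run_dir(self, run_id: str) -> Path:
        run_dir = self.root / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        return run_dir

    def save_run_config(self, cfg: BenchmarkRunConfig) -> None:
        path = self.create_run_dir(cfg.run_id) / "config.json"
        path.write_text(json.dumps(dataclasses.asdict(cfg), indent=2, default=str))

    def save_run_result(self, result: BenchmarkResult) -> None:
        path = self.create_run_dir(result.run_id) / "metrics.json"
        path.write_text(json.dumps(dataclasses.asdict(result), indent=2, default=str))

    def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None:
        # JSON Lines: one self-contained record per line, append-only.
        with (self.create_run_dir(run_id) / "traces.jsonl").open("a") as f:
            f.write(json.dumps(record, default=str) + "\n")
```
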
# Part 1: vLLM benchmark runner with profiling

Implement a robust vLLM runner that:

- Takes a `BenchmarkRunConfig`, `HardwareProfile`, and `ModelProfile`.
- Runs a benchmark against vLLM.
- Collects aggregated metrics and (optionally) time-series traces.
- Persists everything through `JsonStore`.

1. In `harness/vllm_runner.py` implement:

```python
from .types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
from store.json_store import JsonStore


def run_vllm_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """
    Launch a vLLM benchmark according to run_config, collect metrics, and store logs.

    This function is allowed to:
    - Use an in-process vLLM engine, OR
    - Start a vLLM HTTP server as a subprocess and send requests to it.

    It MUST:
    - Generate a unique run directory via JsonStore.
    - Save config and metrics via JsonStore.
    - Return a BenchmarkResult object populated with aggregated metrics.
    """
```

2. Integrate existing modules:

   - Use `inspector/workload_profiler.py` to generate the request stream and `WorkloadProfile`.
   - Optionally call a helper (in a new module `harness/profiler.py`) that periodically samples GPU utilization, memory usage, etc., and writes records via `JsonStore.append_trace_record` (see the sampler sketch after this list).

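One way the profiler helper could look, as a sketch: it shells out to `nvidia-smi` (assumed to be on PATH; an NVML binding would work equally well) on a fixed interval and appends one record per GPU:

```python
# harness/profiler.py (sketch) -- periodic GPU sampling via nvidia-smi.
import subprocess
import threading
import time

from store.json_store import JsonStore


def sample_gpus(store: JsonStore, run_id: str, stop: threading.Event,
                interval_s: float = 1.0) -> None:
    """Append one trace record per GPU per interval until `stop` is set."""
    query = ["nvidia-smi", "--query-gpu=index,utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"]
    while not stop.is_set():
        out = subprocess.run(query, capture_output=True, text=True, check=False)
        for line in out.stdout.strip().splitlines():
            idx, util, mem = (x.strip() for x in line.split(","))
            store.append_trace_record(run_id, {
                "ts": time.time(),
                "gpu": int(idx),
                "util_pct": float(util),
                "mem_used_mib": float(mem),
            })
        stop.wait(interval_s)
```

Run this in a daemon thread around the benchmark and set the event once the run finishes, so sampling never outlives the run.
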
3. Update `scripts/run_vllm_benchmark.py` to:

   - Parse CLI arguments (model, workload description, objective, etc.).
   - Instantiate `HardwareProfile` via `inspector/hardware_inspector.py`.
   - Instantiate `ModelProfile` via `feasibility/model_meta.py`.
   - Instantiate a `JsonStore` rooted at e.g. `./runs`.
   - Build a `BenchmarkRunConfig`.
   - Call `run_vllm_benchmark`.
   - Print a concise summary plus the run_id (wiring sketched below).

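A wiring sketch for the script; `inspect_hardware`, `load_model_meta`, and `profile_workload` are hypothetical names standing in for whatever the existing inspector/feasibility modules actually expose:

```python
# scripts/run_vllm_benchmark.py (sketch) -- wiring only.
import argparse
import uuid
from pathlib import Path

from feasibility.model_meta import load_model_meta            # hypothetical name
from harness.types import BenchmarkRunConfig, VLLMConfig
from harness.vllm_runner import run_vllm_benchmark
from inspector.hardware_inspector import inspect_hardware     # hypothetical name
from inspector.workload_profiler import profile_workload      # hypothetical name
from store.json_store import JsonStore


def main() -> None:
    parser = argparse.ArgumentParser(description="Run one vLLM benchmark.")
    parser.add_argument("--model", required=True)
    parser.add_argument("--workload", required=True)
    parser.add_argument("--objective", default="throughput")
    args = parser.parse_args()

    hw = inspect_hardware()
    model = load_model_meta(args.model)
    workload = profile_workload(args.workload)

    run_config = BenchmarkRunConfig(
        run_id=uuid.uuid4().hex[:12],
        engine="vllm",
        vllm_config=VLLMConfig(),  # or a seed from config_generator (Part 2)
        workload=workload,
        objective=args.objective,
        extra=None,
    )
    result = run_vllm_benchmark(run_config, hw, model, JsonStore(Path("./runs")))
    print(f"run_id={result.run_id} success={result.success} "
          f"metrics={result.aggregated_metrics}")


if __name__ == "__main__":
    main()
```
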
# Part 2: Config generator & validity checking

We already have `feasibility/filter_configs.py`, `memory_model.py`, `model_meta.py`, and `topology_rules.py`. Now we need to:

1. Extend `feasibility/filter_configs.py` to expose a single, well-typed validator:

```python
from harness.types import VLLMConfig, HardwareProfile, ModelProfile, WorkloadProfile


def validate_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile | None = None,
) -> tuple[bool, list[str]]:
    """
    Check whether a VLLMConfig is valid for the given hardware/model/workload.

    Returns:
        (is_valid, reasons_if_invalid)
    """
```

   - Use `memory_model` to estimate HBM usage and enforce an upper bound.
   - Use `topology_rules` to reject obviously bad parallelism combinations.
   - Use simple logical constraints (tp * pp * ep * dp == world_size, etc.). A body sketch follows this list.

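A sketch of what the body could look like; `memory_model.estimate_hbm_gb` and `topology_rules.violations` are hypothetical names standing in for whatever those modules actually expose:

```python
# feasibility/filter_configs.py (sketch) -- adapt the helper calls to the
# real interfaces of memory_model and topology_rules.
from feasibility import memory_model, topology_rules
from harness.types import HardwareProfile, ModelProfile, VLLMConfig, WorkloadProfile


def validate_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile | None = None,
) -> tuple[bool, list[str]]:
    """Return (is_valid, reasons_if_invalid) for cfg on this hardware/model."""
    reasons: list[str] = []

    world = (cfg.tensor_parallel_size * cfg.pipeline_parallel_size
             * cfg.expert_parallel_size * cfg.data_parallel_size)
    if world != hw.num_gpus:
        reasons.append(f"tp*pp*ep*dp={world} != num_gpus={hw.num_gpus}")
    if cfg.expert_parallel_size > 1 and not model.is_moe:
        reasons.append("expert parallelism requested for a non-MoE model")
    if not 0.0 < cfg.gpu_memory_utilization <= 0.95:
        reasons.append(f"gpu_memory_utilization={cfg.gpu_memory_utilization} out of range")

    est_hbm_gb = memory_model.estimate_hbm_gb(cfg, model)  # hypothetical helper
    if est_hbm_gb > hw.hbm_gb * cfg.gpu_memory_utilization:
        reasons.append(f"estimated {est_hbm_gb:.1f} GiB HBM exceeds per-GPU budget")
    reasons.extend(topology_rules.violations(cfg, hw))  # hypothetical helper

    return (not reasons, reasons)
```
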
2. Implement a minimal "repair" helper in `feasibility/filter_configs.py`:

```python
def repair_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
) -> VLLMConfig:
    """
    Attempt to minimally modify cfg to make it valid.

    Strategies:
    - Adjust gpu_memory_utilization downwards if memory is too tight.
    - Reduce max_num_batched_tokens or max_num_seqs if the KV cache is too large.
    - Fix the tp/pp/ep/dp product to match world_size.
    - Raise a clear exception if repair is impossible.
    """
```

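A sketch of one possible repair ordering, living in the same module as `validate_config`; the 0.05 step, the loop bound, and the `"HBM"` substring check are illustrative assumptions, and `dataclasses.replace` presumes `VLLMConfig` is a dataclass:

```python
# Sketch: fix parallelism first, then back off memory pressure in bounded steps.
import dataclasses


def repair_config(
    cfg: VLLMConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
) -> VLLMConfig:
    """Minimally adjust cfg until validate_config accepts it, or raise."""
    candidate = cfg
    # Fix the parallelism product by absorbing the remainder into dp.
    base = (candidate.tensor_parallel_size * candidate.pipeline_parallel_size
            * candidate.expert_parallel_size)
    if base > 0 and hw.num_gpus % base == 0:
        candidate = dataclasses.replace(candidate, data_parallel_size=hw.num_gpus // base)

    reasons: list[str] = []
    for _ in range(10):  # bounded back-off on memory pressure
        ok, reasons = validate_config(candidate, hw, model, workload)
        if ok:
            return candidate
        if any("HBM" in r for r in reasons):
            candidate = dataclasses.replace(
                candidate,
                gpu_memory_utilization=round(candidate.gpu_memory_utilization - 0.05, 2),
                max_num_batched_tokens=max(1024, candidate.max_num_batched_tokens // 2),
            )
        else:
            break  # a non-memory violation we don't know how to repair
    raise ValueError(f"cannot repair config: {reasons}")
```
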
3. Implement `config_generator.py` as the main entry point for generating seed configs:

   - Provide:

```python
def generate_seed_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    max_configs: int = 8,
) -> list[VLLMConfig]:
    """
    Generate a small set of reasonable, VALID seed vLLM configs for the given environment.

    Use:
    - Handcrafted "template families" for dense/MoE/MLA models.
    - Different templates for latency- vs throughput-oriented objectives.
    - validate_config(...) to filter out invalid configs.
    """
```

   - Use the validator and repair helper when constructing these configs; one illustrative template family is sketched below.

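One illustrative template family as a sketch; it assumes the `VLLMConfig` defaults from Part 0 and that the `objective` string distinguishes latency- from throughput-oriented runs:

```python
# config_generator.py (sketch) -- one tensor-parallel template family.
from feasibility.filter_configs import repair_config, validate_config
from harness.types import HardwareProfile, ModelProfile, VLLMConfig, WorkloadProfile


def generate_seed_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    max_configs: int = 8,
) -> list[VLLMConfig]:
    """Generate a handful of valid seed configs by sweeping tensor parallelism."""
    latency_oriented = objective in ("latency", "ttft")  # assumed objective names
    seeds: list[VLLMConfig] = []
    for tp in (1, 2, 4, 8):
        if hw.num_gpus % tp != 0:
            continue
        cfg = VLLMConfig(
            tensor_parallel_size=tp,
            data_parallel_size=hw.num_gpus // tp,
            max_num_seqs=64 if latency_oriented else 256,
            max_num_batched_tokens=4096 if latency_oriented else 16384,
            # A MoE template family would also set expert_parallel_size here.
        )
        ok, _ = validate_config(cfg, hw, model, workload)
        if not ok:
            try:
                cfg = repair_config(cfg, hw, model, workload)
            except ValueError:
                continue  # drop candidates that cannot be repaired
        seeds.append(cfg)
    return seeds[:max_configs]
```
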
# Part 3: Unified runner for vidur (simulation) and vLLM (real)

We want a single harness function that can run either vidur or vLLM with the same input types and logging behavior.

1. In `adapter/` add two modules:

   - `adapter/vllm_adapter.py`
     - Responsible for low-level integration with vLLM (EngineArgs / HTTP service).
     - Should expose simple primitives such as "apply VLLMConfig" and "send requests".
   - `adapter/vidur_adapter.py`
     - Responsible for integrating with the vidur simulator.
     - Should expose an interface that, given a `VLLMConfig` and `WorkloadProfile`, produces metrics comparable to vLLM.

   Both adapters can be thin now; they can evolve later.

2. In `harness/runner.py` implement:

```python
from harness.types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
from store.json_store import JsonStore


def run_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """
    Dispatch to the appropriate engine runner ("vllm" or "vidur") based on
    run_config.engine, then return a BenchmarkResult.

    The interface and logging semantics should be identical for both engines.
    """
```

   - Internally call `run_vllm_benchmark` for "vllm" and `run_vidur_simulation` for "vidur" (create `run_vidur_simulation` analogous to `run_vllm_benchmark`). A minimal dispatch sketch follows.

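A minimal dispatch sketch; the module name `harness/vidur_runner.py` for the vidur counterpart is an assumption:

```python
# harness/runner.py (sketch) -- simple engine dispatch.
from harness.types import BenchmarkResult, BenchmarkRunConfig, HardwareProfile, ModelProfile
from harness.vidur_runner import run_vidur_simulation  # assumed module name
from harness.vllm_runner import run_vllm_benchmark
from store.json_store import JsonStore


def run_benchmark(
    run_config: BenchmarkRunConfig,
    hw: HardwareProfile,
    model: ModelProfile,
    store: JsonStore,
) -> BenchmarkResult:
    """Dispatch to the engine named by run_config.engine."""
    if run_config.engine == "vllm":
        return run_vllm_benchmark(run_config, hw, model, store)
    if run_config.engine == "vidur":
        return run_vidur_simulation(run_config, hw, model, store)
    raise ValueError(f"unknown engine: {run_config.engine!r}")
```
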
3. Update / implement `scripts/run_workload_benchmarks.py`:

   - Accept CLI options like `--engine {vllm,vidur}`.
   - Loop over a list of configs (seed configs from `config_generator`) and call `run_benchmark` for each, as sketched below.
   - Summarize results to stdout.

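The core of the script could be as small as this sketch (profile construction and argument parsing mirror `run_vllm_benchmark.py` and are omitted):

```python
# scripts/run_workload_benchmarks.py (sketch) -- the benchmark loop only.
import uuid

from config_generator import generate_seed_configs
from harness.runner import run_benchmark
from harness.types import BenchmarkRunConfig


def run_all(engine, hw, model, workload, objective, store):
    results = []
    for cfg in generate_seed_configs(hw, model, workload, objective):
        run_config = BenchmarkRunConfig(
            run_id=uuid.uuid4().hex[:12],
            engine=engine,
            vllm_config=cfg,
            workload=workload,
            objective=objective,
            extra=None,
        )
        result = run_benchmark(run_config, hw, model, store)
        results.append(result)
        print(f"{result.run_id}: success={result.success} {result.aggregated_metrics}")
    return results
```
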
# Part 4: AI-driven iterative search loop

Implement an iterative search loop that calls an external LLM (e.g., the OpenAI API) to:

- Inspect previous configs + metrics.
- Diagnose bottlenecks.
- Propose new configs or modifications.
- Iterate until convergence or a step budget is reached.

1. In `search/ai_loop.py` implement:

```python
from harness.types import (
    HardwareProfile,
    ModelProfile,
    WorkloadProfile,
    VLLMConfig,
    BenchmarkRunConfig,
    BenchmarkResult,
)


def ai_search_loop(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    engine: str = "vllm",
    max_steps: int = 20,
) -> list[BenchmarkResult]:
    """
    Core loop:

    - Generate seed configs via config_generator.generate_seed_configs.
    - For each config:
        - Create a BenchmarkRunConfig and run a benchmark via harness.run_benchmark.
        - Store BenchmarkResult objects in a `history` list.

    - For step in range(max_steps):
        - Summarize `history` into a textual prompt for an external LLM.
        - Call an LLM client (you may stub this out, or define a placeholder
          function) that returns:
            - a bottleneck analysis,
            - a small set of new candidate configs (or edits to existing configs).
        - For each candidate:
            - Validate and repair via feasibility.validate_config / repair_config.
            - Run a benchmark and append to `history`.
        - Optionally, ask the LLM (or use simple heuristics) to check for convergence.

    - Return the full history for further analysis.
    """
```

   - You do NOT need to implement the actual OpenAI API call in detail; you can define a placeholder interface like:

```python
def ask_llm_for_new_configs(
    hw: HardwareProfile,
    model: ModelProfile,
    workload: WorkloadProfile,
    objective: str,
    history: list[BenchmarkResult],
) -> tuple[str, list[VLLMConfig]]:
    """
    Returns (bottleneck_analysis_text, candidate_configs).

    This can be implemented later with a real LLM backend.
    """
```

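One plausible shape for the "summarize `history` into a prompt" step, a sketch only; the config fields and metric keys shown are examples, not a fixed schema:

```python
# search/ai_loop.py (sketch) -- compact history rendering for the LLM prompt.
import json

from harness.types import BenchmarkResult


def summarize_history(history: list[BenchmarkResult]) -> str:
    """Render (config, metrics) pairs compactly so the LLM can compare runs."""
    rows = [
        {
            "run_id": r.run_id,
            "tp": r.vllm_config.tensor_parallel_size,
            "pp": r.vllm_config.pipeline_parallel_size,
            "max_num_seqs": r.vllm_config.max_num_seqs,
            "metrics": r.aggregated_metrics,  # e.g., {"qps": ..., "p95_latency_ms": ...}
            "error": r.error_message,
        }
        for r in history
    ]
    return json.dumps(rows, indent=2, default=str)
```
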
2. Update `scripts/run_search.py` to:

   - Parse arguments for model, workload, objective, engine, max_steps.
   - Use `hardware_inspector` / `model_meta` / `workload_profiler` to build the profiles.
   - Instantiate a `JsonStore`.
   - Call `ai_search_loop`.
   - Print the best configuration and its metrics at the end (one selection helper is sketched below).

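For the final summary, one small helper; it assumes a single higher-is-better metric such as "qps", while a latency objective would take the minimum instead:

```python
# Picking the run to report (sketch).
from harness.types import BenchmarkResult


def best_result(history: list[BenchmarkResult], metric: str = "qps") -> BenchmarkResult:
    successful = [r for r in history if r.success and metric in r.aggregated_metrics]
    if not successful:
        raise ValueError("no successful runs to report")
    return max(successful, key=lambda r: r.aggregated_metrics[metric])
```
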
# Coding style & quality expectations

- Use clear, descriptive function and variable names.
- Add type hints to all public functions and dataclasses.
- Include docstrings that explain the intent, not just restate the name.
- Use `pathlib.Path` for filesystem operations.
- Use `logging.getLogger(__name__)` in each module.

# Deliverables summary

Implement or update the following modules:

- `harness/types.py` (or equivalent)
- `store/json_store.py` (extend)
- `harness/vllm_runner.py`
- `harness/runner.py`
- `adapter/vllm_adapter.py`
- `adapter/vidur_adapter.py`
- `feasibility/filter_configs.py` (extend)
- `config_generator.py` (implement seed generation)
- `search/ai_loop.py`
- `scripts/run_vllm_benchmark.py` (update)
- `scripts/run_workload_benchmarks.py` (update)
- `scripts/run_search.py` (update)

Focus on clean architecture and clear boundaries between:

- Config generation & validation (`config_generator`, `feasibility`)
- Execution & profiling (`harness`, `adapter`, `inspector`, `store`)
- AI-driven reasoning (`search/ai_loop.py`)