- Replace mutable module constants (HEAVY_THRESHOLD/OVERLOAD_FACTOR/
MAX_OFFLOAD_INFLIGHT/PREFILL_THROUGHPUT/RDMA_OVERHEAD_S/
CACHE_CAPACITY_BLOCKS) with a Settings dataclass + SETTINGS singleton.
__main__ now mutates SETTINGS so CLI overrides survive even when the
module is imported as a library (e.g. by tests/) (D5).
- Add --max-offload-inflight CLI flag (M3) and read it from SETTINGS.
- Add --cache-gate-ratio CLI flag and a real gate before the cost-model
branch: if cache_hit/input_length < ratio, mark cache_gate_REASON and
fall back to colocated. cache_ratio is no longer a write-only field
(B4).
- P candidate selection penalises instances already running offloaded
HEAVY prefills, so back-to-back HEAVY requests don't pile onto the
same P (M2).
- bench.sh forwards --max-offload-inflight / --cache-gate-ratio to the
proxy.
- Tests cover SETTINGS knobs + the heavy_threshold-driven P-offload
penalty.