Use local cache for qwen30b vllm runs

2026-05-02 08:47:16 +08:00
parent 1880e859b5
commit 664aeb49b2
3 changed files with 8 additions and 2 deletions
--- a/docs/qwen30b-community-vllm020/harness-early-stop-ablation-20260502.md
+++ b/docs/qwen30b-community-vllm020/harness-early-stop-ablation-20260502.md
@@ -15,6 +15,8 @@ The comparison is:

 Both specs start from the same base vLLM configuration. The base contains only serving access fields: `host`, `port`, and `served-model-name`. It does not set performance flags such as TP, DP, EP, max model length, prefix cache, chunked prefill, max-num-seqs, max-num-batched-tokens, or gpu-memory-utilization. The first trial therefore measures community vLLM defaults for this model.

+The launch environment sets `HOME=/tmp/wjh` and `XDG_CACHE_HOME=/tmp/wjh/.cache` so vLLM, torch.compile, and FlashInfer build caches land on dash0 local storage instead of CPFS. This is a startup/cache placement choice, not a vLLM performance flag.
+
 ## vLLM Install

 PyPI reports `vllm==0.20.0` as the current community release checked on 2026-05-02. The dash0 runtime venv is on local rootfs rather than CPFS, because installing torch/CUDA wheels into CPFS was I/O-bound: