2 Commits

Author SHA1 Message Date
tim
905d671135 feat(env): MC_TRANSFER_TIMEOUT=1800s default in setup_env + stack
Mooncake C++ batch_transfer_sync defaults to 30s timeout; on
saturated D scheduler threads doing LRU eviction, that fires as a
false positive and the SGLang hair-trigger in conn.py:1270
permanently blacklists the D's mooncake_session_id (E2 forensic in
docs/E1_E2_RESULTS_ZH.md §5c). Bump to 1800s in setup_env.sh and
mirror to subprocess env in stack.py so SGLang workers get it too.
30-min envelope still detects genuinely broken peers eventually.
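
A minimal sketch of the "mirror to subprocess env" step, assuming the variable takes a plain number of seconds as its value. The helper name build_worker_env and the constant are hypothetical and are not taken from stack.py; only the variable name MC_TRANSFER_TIMEOUT and the 1800s default come from this commit.

```python
import os

# Assumption: the value is a plain number of seconds passed as a string.
DEFAULT_MC_TRANSFER_TIMEOUT_S = "1800"


def build_worker_env(base_env=None):
    """Return a copy of the environment with MC_TRANSFER_TIMEOUT filled in.

    Hypothetical helper: copies the parent environment (or base_env) and
    sets the default only when the caller has not already chosen a value,
    so an explicit override still wins.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("MC_TRANSFER_TIMEOUT", DEFAULT_MC_TRANSFER_TIMEOUT_S)
    return env
```

The returned dict can be passed as env= to subprocess.Popen so that workers launched from Python see the same value a shell sourced from setup_env.sh would.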

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:45:09 +08:00
tim
b55371fe69 docs: H200 + driver 570 setup guide + 11 lessons learned
Captures the full debugging journey of getting vendored SGLang 0.5.10
+ mooncake RDMA running on a 4×H200 node with the older driver
570.86.15. Driver 570's actual API is cu12.8 — nvidia-smi's
"CUDA Version: 13.0" header is a forward-compat ceiling, not the
driver's own version — and that single misreading drove most of the
detours. Lessons cover: pip vs vendor sglang divergence, why cu13
switching was a dead end (mooncake is cu12-only by wheel, driver 570
can't run cu13 anyway), why --disable-overlap-schedule alone isn't
enough, why pip nvidia-cuda-nvcc-cu12 doesn't ship the nvcc binary,
and how tvm_ffi's ninja-driven nvcc invocation makes CUDA_HOME the
single hook point that fixes everything.
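
A rough pre-flight check along the lines of the last two lessons, assuming the toolkit layout places the compiler at $CUDA_HOME/bin/nvcc. The function name nvcc_path is hypothetical and is not part of SGLang, mooncake, or tvm_ffi; it only confirms that a candidate CUDA_HOME actually contains an executable nvcc.

```python
import os
from pathlib import Path


def nvcc_path(cuda_home):
    """Return the nvcc binary under cuda_home, or None if it is missing.

    Hypothetical helper: assumes the conventional <CUDA_HOME>/bin/nvcc
    layout and checks that the file exists and is executable.
    """
    candidate = Path(cuda_home) / "bin" / "nvcc"
    if candidate.is_file() and os.access(candidate, os.X_OK):
        return candidate
    return None
```

Calling nvcc_path(os.environ.get("CUDA_HOME", "")) before starting the stack gives an early, readable failure instead of a build error surfacing later from the ninja-driven compile.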

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 00:10:14 +08:00