Paper section: PD-sep scaffold + drop --enforce-eager from launch scripts

Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is
net negative under agentic workloads" paper section: plot scripts for C1
(workload chars), C6 (roofline), C7 (routing-vs-PD-sep lever), the C6/C7
PDFs already rendered, and a README mapping candidate claims to required
figures plus open re-run items.

Removes --enforce-eager from bench.sh and all active launch scripts so
cuda graphs are captured -- the prior methodology suppressed one of
PD-sep's structural advantages (D-node fixed-shape decode). Legacy
scripts under scripts/legacy/ are intentionally untouched as historical
records.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:24:16 +08:00
parent 6a27f75337
commit d71a111099
11 changed files with 576 additions and 9 deletions


@@ -0,0 +1,87 @@
# Paper section: PD separation under agentic workloads
This directory collects everything produced for the "PD-sep is net negative
on agentic workloads" paper section. It is one section of a larger paper,
not the whole paper.
## Layout
```
analysis/pd_sep_paper_section/
├── README.md                     # this file
├── scripts/
│   ├── plot_workload.py          # C1: input/output CDF + KV reuse decomposition
│   ├── plot_roofline.py          # C6: prefill roofline at varying cache reuse
│   └── plot_routing_lever.py     # C7: routing vs PD-sep as design levers
└── figures/
    ├── fig_c6_roofline.pdf       # rendered locally (analytical, no trace needed)
    ├── fig_c7_routing_lever.pdf  # rendered locally (from REPORT.md §3.1)
    └── (fig_c1a_io_cdf.pdf,      # produced on dash0 when trace is available
         fig_c1b_reuse.pdf)
```
## Candidate claims -> figures (status)
| Claim | Figure | Status |
|---|---|---|
| C1: 98% prefill share + 91% intra-session KV reuse | `figures/fig_c1a_io_cdf.pdf`, `figures/fig_c1b_reuse.pdf` | **needs trace on dash0** |
| C2: PD-sep vs Combined headline numbers | (not yet) | **needs re-run without `--enforce-eager` on `traces/w600_r0.0015_st30.jsonl`** |
| C3: decode KV cache memory wall (time-series) | (not yet) | needs step-level vLLM telemetry during PD-sep run |
| C4: TTFT stacked breakdown (prefill / KV pull / decode wait) | (not yet) | needs per-request breakdown.json from PD-sep run |
| C5: cuda-graph ablation (eager vs cudagraph × Combined vs PD-sep) | (not yet) | needs the 2×2 matrix |
| C6: prefill stays compute-bound at 95% reuse | `figures/fig_c6_roofline.pdf` | **rendered** |
| C7: cache-aware routing is a larger lever than PD-sep | `figures/fig_c7_routing_lever.pdf` | **rendered** (legacy data, footer caveat) |
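
To make the C7 row concrete, the two lever sizes can be recomputed from the three legacy numbers per metric. The sketch below copies the values from the `ROWS` table in `plot_routing_lever.py`; the helper names `lever_delta` and `lever_table` are illustrative and do not exist in the repo. It follows the same sign convention as the plot script (positive = improvement, relative % for latencies, absolute percentage points for APC).

```python
# Legacy numbers copied from ROWS in plot_routing_lever.py (REPORT.md §3.1):
# (metric, round-robin, cache-aware, PD-sep + cache-aware, smaller_is_better)
ROWS = [
    ("TTFT p50 (s)", 1.836, 0.731, 1.261, True),
    ("TPOT p90 (s)", 0.086, 0.073, 0.074, True),
    ("APC (%)", 20.8, 44.7, 40.2, False),
]


def lever_delta(before, after, smaller_is_better):
    """Signed change where positive means improvement.

    Latency metrics use relative % change; APC uses absolute percentage points.
    """
    if smaller_is_better:
        return -(after - before) / before * 100.0
    return after - before


def lever_table(rows=ROWS):
    """Map metric -> (routing lever, PD-sep lever)."""
    table = {}
    for metric, rr, ca, pdsep, smaller in rows:
        routing = lever_delta(rr, ca, smaller)    # RR -> cache-aware, Combined
        pd_sep = lever_delta(ca, pdsep, smaller)  # Combined -> PD-sep, cache-aware
        table[metric] = (routing, pd_sep)
    return table


if __name__ == "__main__":
    for metric, (routing, pd_sep) in lever_table().items():
        print(f"{metric:14s} routing={routing:+.1f}  pd_sep={pd_sep:+.1f}")
```

On these legacy numbers the routing lever is positive on all three metrics and the PD-sep lever is negative on all three, which is the qualitative gap the figure is meant to show.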
## In-place edits made for this task
These edits modify existing launch scripts, so they live elsewhere in the
repo rather than in this directory. `--enforce-eager` was removed so that
CUDA graphs can be captured: PD-sep's D-node runs fixed-shape decode, a
particularly clean case for CUDA-graph benefit, and the prior methodology
suppressed it.
| File | Lines | Change |
|---|---|---|
| `scripts/bench.sh` | 150, 161 | drop `--enforce-eager` (elastic + baseline modes) |
| `scripts/launch_pd_mooncake.sh` | 47, 64 | drop `--enforce-eager` (P and D instances) |
| `scripts/launch_pd_separated.sh` | 52, 68 | drop `--enforce-eager` (P and D instances) |
| `scripts/launch_phase1_ps.sh` | 32, 43 | drop `--enforce-eager` (C and PS instances) |
| `scripts/launch_elastic_p2p.sh` | 57 | drop `--enforce-eager` (kv_both instances) |

`scripts/legacy/*.sh` are intentionally left as-is because they record the
configuration of past experiments.

`REPORT.md` and `analysis/pd_separation_analysis.md` still describe the
old `--enforce-eager` setup. Update them once the new runs land.
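
A quick way to confirm that no active launch script still carries the flag is a small scan that skips `scripts/legacy/`. This is a minimal sketch, not part of the commit: the function names are hypothetical, and it assumes all active scripts are `*.sh` files under `scripts/`.

```python
from pathlib import Path

FLAG = "--enforce-eager"


def flagged_lines(text, flag=FLAG):
    """Return 1-based numbers of non-comment lines that contain `flag`."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # ignore shell comments that merely mention the flag
        if flag in line:
            hits.append(lineno)
    return hits


def scan_scripts(root="scripts", skip_dir="legacy", flag=FLAG):
    """Map each active *.sh under `root` to the lines still carrying `flag`."""
    report = {}
    for path in sorted(Path(root).rglob("*.sh")):
        if skip_dir in path.relative_to(root).parts:
            continue  # legacy scripts are kept as historical records
        hits = flagged_lines(path.read_text(), flag)
        if hits:
            report[str(path)] = hits
    return report


if __name__ == "__main__":
    leftovers = scan_scripts()
    for path, lines in leftovers.items():
        print(f"{path}: {FLAG} still present on lines {lines}")
    if not leftovers:
        print(f"no active script contains {FLAG}")
```

Run from the repo root, an empty report means the active scripts are clean. Any hit lists the file and line numbers to revisit.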
## Reproducing the figures
From repo root:
```bash
# C1 (needs sampled trace on dash0)
.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_workload.py \
    --trace traces/w600_r0.0015_st30.jsonl

# C6 (analytical, runs anywhere with matplotlib)
.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_roofline.py

# C7 (hardcoded REPORT.md §3.1 numbers; no inputs)
.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_routing_lever.py
```
All three default `--outdir` to `analysis/pd_sep_paper_section/figures`.
## Caveats / open items
- **C7 uses legacy data**. The footer of `fig_c7_routing_lever.pdf` says so:
PD-sep numbers come from the random-sampled trace + `--enforce-eager`. Re-run
on `traces/w600_r0.0015_st30.jsonl` with cuda-graphs on before paper-grade
citation. The plotting code keeps the source numbers in a single `ROWS`
table (top of `plot_routing_lever.py`) for a one-line swap.
- **C2/C3/C4/C5 figures are not produced** because the experiments have not
  been re-run. The prerequisite is the previously proposed 4h run matrix:
  Combined + RR, Combined + cache-aware, PD-sep 4P+4D, PD-sep 6P+2D, plus the
  eager-vs-cudagraph ablation, ×3 seeds.
- **C6 is analytical**, so it is independent of any re-run. The numbers
match `scripts/compute_roofline.py` (constants are duplicated; if one
changes, the other must change too).
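
For the C6 bullet, the compute-bound versus memory-bound call reduces to comparing an operating point's arithmetic intensity (AI) against the ridge point of the roofline. The sketch below uses the two H20 constants from `plot_roofline.py`; the function names and the sample AI values are illustrative assumptions, not numbers produced by the script.

```python
# H20 constants as defined in plot_roofline.py
PEAK_FLOPS = 148e12  # bf16 peak, FLOP/s
HBM_BW = 4.0e12      # HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / HBM_BW  # FLOP/byte at which the two ceilings meet


def achievable_tflops(ai):
    """Roofline throughput in TFLOPS for arithmetic intensity `ai` (FLOP/byte)."""
    return min(ai * HBM_BW, PEAK_FLOPS) / 1e12


def bound(ai):
    """Classify an operating point the same way plot_roofline.py prints it."""
    return "COMPUTE" if ai > RIDGE else "MEMORY"


if __name__ == "__main__":
    # Hypothetical AI values, chosen only to land on either side of the ridge.
    for ai in (1.0, 10.0, 1000.0):
        print(f"AI={ai:>7.1f}  {achievable_tflops(ai):6.1f} TFLOPS  {bound(ai)}")
```

With these constants the ridge sits at 37 FLOP/byte. Any operating point above it is capped by the 148 TFLOPS compute ceiling, and any point below it scales linearly with HBM bandwidth.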


@@ -0,0 +1,144 @@
"""C6: roofline plot for Qwen3-Coder-30B-A3B on H20.
Reproduces the analytical roofline used in scripts/compute_roofline.py and
plots it as a single PDF: AI vs achievable throughput, with annotated
operating points for prefill at reuse {0, 70, 90, 95}% and decode.
The constants must stay in lockstep with compute_roofline.py. If you change
one, change the other.
"""
import argparse
from pathlib import Path
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
# ---- model constants (mirror scripts/compute_roofline.py) ----
L, D, H_KV, D_HEAD, D_FFN = 48, 2048, 4, 128, 6144
K_EXPERTS = 8
BYTES = 2 # bf16
# ---- H20 ----
PEAK_FLOPS = 148e12
HBM_BW = 4.0e12
RIDGE = PEAK_FLOPS / HBM_BW # ~37


def attn_prefill_flops(seq_len, new_tokens):
    d_kv = H_KV * D_HEAD
    qkv = new_tokens * (D * D * 2 + D * d_kv * 2 * 2)
    attn = new_tokens * seq_len * D * 2 * 2
    out = new_tokens * D * D * 2
    return (qkv + attn + out) * L


def attn_prefill_bytes(seq_len, new_tokens, cached_tokens):
    d_kv = H_KV * D_HEAD
    weight = D * (D + 2 * d_kv + D) * BYTES * L
    cached_kv = cached_tokens * 2 * d_kv * BYTES * L
    act = new_tokens * D * BYTES * 2 * L
    new_kv = new_tokens * 2 * d_kv * BYTES * L
    return weight + cached_kv + act + new_kv


def ffn_flops(n):
    return 3 * n * D * D_FFN * 2 * K_EXPERTS * L


def ffn_bytes(n):
    weight = K_EXPERTS * 3 * D * D_FFN * BYTES * L
    act = n * D * BYTES * 2 * L
    return weight + act


def point(seq_len, reuse):
    cached = int(seq_len * reuse)
    new = max(1, seq_len - cached)
    f = attn_prefill_flops(seq_len, new) + ffn_flops(new)
    b = attn_prefill_bytes(seq_len, new, cached) + ffn_bytes(new)
    return f, b, new


def decode_point(seq_len):
    f = attn_prefill_flops(seq_len, 1) + ffn_flops(1)
    b = attn_prefill_bytes(seq_len, 1, seq_len) + ffn_bytes(1)
    return f, b


def plot(out_path, seq_len=64000):
    fig, ax = plt.subplots(figsize=(6.5, 4.2))
    ai_grid = np.logspace(-1, 5, 400)
    achievable = np.minimum(ai_grid * HBM_BW, PEAK_FLOPS) / 1e12
    ax.plot(ai_grid, achievable, color="#222", lw=1.5, label="H20 roofline")
    ax.axvline(RIDGE, color="#888", ls=":", lw=1)
    ax.text(RIDGE, 420, f"ridge = {RIDGE:.0f}", color="#666",
            fontsize=8, ha="center", va="top",
            bbox=dict(boxstyle="round,pad=0.2", fc="white", ec="none", alpha=0.85))
    ax.axhline(PEAK_FLOPS / 1e12, color="#aaa", ls="--", lw=0.6)
    ax.text(2, PEAK_FLOPS / 1e12 * 1.08, "compute ceiling (148 TFLOPS bf16)",
            fontsize=8, color="#666", ha="left")

    # operating points: use a legend (not annotations with leader lines, since
    # all 4 prefill points sit on the compute ceiling and would overlap).
    reuses = [0.0, 0.7, 0.9, 0.95]
    colors = ["#d62728", "#ff7f0e", "#2ca02c", "#1f77b4"]
    for reuse, color in zip(reuses, colors):
        f, b, new = point(seq_len, reuse)
        ai = f / b
        thpt = min(ai * HBM_BW, PEAK_FLOPS) / 1e12
        ax.scatter([ai], [thpt], color=color, s=80, zorder=5,
                   edgecolor="white", linewidth=1.2,
                   label=f"prefill reuse={int(reuse*100):>2}% "
                         f"(new={new:>6,} tok, AI={ai:>6,.0f})")
    f, b = decode_point(seq_len)
    ai_dec = f / b
    thpt_dec = min(ai_dec * HBM_BW, PEAK_FLOPS) / 1e12
    ax.scatter([ai_dec], [thpt_dec], color="#8c564b", s=80, marker="D",
               zorder=5, edgecolor="white", linewidth=1.2,
               label=f"decode (per-token, seqlen={seq_len:,}, AI={ai_dec:.1f})")
    ax.legend(loc="lower right", fontsize=8.5, framealpha=0.95,
              prop={"family": "monospace", "size": 8})
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlim(0.5, 1e5)
    ax.set_ylim(0.5, 500)
    ax.set_xlabel("Arithmetic intensity (FLOP/byte)")
    ax.set_ylabel("Achievable throughput (TFLOPS)")
    ax.set_title(
        f"Prefill stays compute-bound even at 95% reuse "
        f"(Qwen3-Coder-30B-A3B, H20, seqlen={seq_len:,})",
        fontsize=10,
    )
    ax.grid(True, which="both", alpha=0.25)
    fig.tight_layout()
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    print(f"[C6] wrote {out_path}")
    for reuse in reuses:
        f, b, new = point(seq_len, reuse)
        print(f" reuse={int(reuse*100):>3}% new={new:>6,} AI={f/b:>8.1f} "
              f"bound={'COMPUTE' if f/b > RIDGE else 'MEMORY'}")
    f, b = decode_point(seq_len)
    print(f" decode AI={f/b:>8.1f} bound={'COMPUTE' if f/b > RIDGE else 'MEMORY'}")


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--seq-len", type=int, default=64000)
    ap.add_argument("--outdir", default="analysis/pd_sep_paper_section/figures")
    args = ap.parse_args()
    out = Path(args.outdir)
    out.mkdir(parents=True, exist_ok=True)
    plot(out / "fig_c6_roofline.pdf", seq_len=args.seq_len)


if __name__ == "__main__":
    main()


@@ -0,0 +1,123 @@
"""C7: routing lever vs PD-separation lever.
Side-by-side comparison of the magnitude of two design changes on the same
agentic workload:
(A) Round-robin -> cache-aware routing, both Combined-mode
(B) Combined -> PD-separated, both cache-aware
For each, plot delta TTFT p50 / TPOT p90 / APC. Green = improvement, red =
regression. Numbers come from REPORT.md §3.1 (analysis/pd_separation_analysis.md §3.1).
CAVEAT shown on the figure: these numbers are from the legacy
trace methodology (random sampling, 1 req/GPU). They are not yet reproduced
on the trace-driven 850-req sampling at production concurrency, and the
PD-sep runs were captured with --enforce-eager. The current plot is meant
to show the qualitative gap between the two levers; a re-run is required
for paper-grade quantitative claims.
"""
import argparse
from pathlib import Path
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
# (label, RR baseline, cache-aware baseline, PD-sep w/ cache-aware,
# unit, format, "improve_when_smaller")
ROWS = [
("TTFT p50 (s)", 1.836, 0.731, 1.261, "s", "{:.2f}", True),
("TPOT p90 (s)", 0.086, 0.073, 0.074, "s", "{:.3f}", True),
("APC (%)", 20.8, 44.7, 40.2, "pp", "{:.1f}", False),
]


def pct_delta(before, after, improve_when_smaller):
    """Return signed % change framed so positive = improvement.

    For APC (pp): return absolute pp delta because relative % is misleading.
    """
    diff = after - before
    if improve_when_smaller:
        improvement = -(diff / before) * 100
        return improvement, f"{improvement:+.0f}%"
    pp = diff
    return pp, f"{pp:+.1f}pp"


def plot(out_path):
    fig, axes = plt.subplots(1, 3, figsize=(10, 3.5))
    bar_colors = lambda val: "#2ca02c" if val >= 0 else "#d62728"
    for ax, (metric, rr, ca, pdsep, unit, fmt, smaller_better) in zip(axes, ROWS):
        # lever A: RR -> cache-aware (both combined)
        a_val, a_txt = pct_delta(rr, ca, smaller_better)
        # lever B: combined -> PD-sep (both cache-aware)
        b_val, b_txt = pct_delta(ca, pdsep, smaller_better)
        bars = ax.bar(
            ["RR → cache-aware\n(within Combined)",
             "Combined → PD-Sep\n(both cache-aware)"],
            [a_val, b_val],
            color=[bar_colors(a_val), bar_colors(b_val)],
            edgecolor="black", linewidth=0.6, width=0.55,
        )
        ymax = max(abs(a_val), abs(b_val))
        ax.set_ylim(-ymax * 1.35, ymax * 1.35)
        ax.axhline(0, color="black", lw=0.6)
        for bar, val, txt in zip(bars, [a_val, b_val], [a_txt, b_txt]):
            yoff = ymax * 0.06 if val >= 0 else -ymax * 0.06
            ax.text(bar.get_x() + bar.get_width() / 2,
                    val + yoff,
                    txt,
                    ha="center", va="bottom" if val >= 0 else "top",
                    fontsize=10, fontweight="bold")
        ax.set_title(metric, fontsize=10)
        if smaller_better:
            ax.set_ylabel("Δ (positive = improvement)")
        else:
            ax.set_ylabel("Δ percentage points")
        ax.grid(True, axis="y", alpha=0.25)
        ax.tick_params(axis="x", labelsize=8.5)
        u = "" if unit == "pp" else unit
        ax.set_xlabel(
            f"RR={fmt.format(rr)}{u} · CA={fmt.format(ca)}{u} · PD-Sep={fmt.format(pdsep)}{u}",
            fontsize=8, color="#555", labelpad=8,
        )
    fig.suptitle(
        "Cache-aware routing is a larger lever than PD separation on agentic workload",
        fontsize=11, y=1.02,
    )
    fig.tight_layout(rect=(0, 0.10, 1, 0.96))
    footer = (
        "Source: REPORT.md §3.1 / analysis/pd_separation_analysis.md §3.1. "
        "Legacy random-sampling methodology + --enforce-eager. "
        "Re-run on trace-driven w600_r0.0015_st30 with cuda-graph required before paper-grade citation."
    )
    fig.text(0.5, 0.01, footer, ha="center", fontsize=7.5, color="#666",
             style="italic", wrap=True)
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    print(f"[C7] wrote {out_path}")
    for metric, rr, ca, pdsep, unit, fmt, smaller in ROWS:
        a, a_txt = pct_delta(rr, ca, smaller)
        b, b_txt = pct_delta(ca, pdsep, smaller)
        print(f" {metric:14s} RR→CA: {a_txt:>7s} Combined→PD-Sep: {b_txt:>7s}")


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--outdir", default="analysis/pd_sep_paper_section/figures")
    args = ap.parse_args()
    out = Path(args.outdir)
    out.mkdir(parents=True, exist_ok=True)
    plot(out / "fig_c7_routing_lever.pdf")


if __name__ == "__main__":
    main()


@@ -0,0 +1,217 @@
"""C1: workload characterization figures.
Generates two figures from the sampled trace:
fig_c1a_io_cdf.pdf -- input / output token CDF (two panels)
fig_c1b_reuse.pdf -- KV-block reuse decomposition
Run on dash0 where the trace lives and matplotlib is installed.
Usage:
    .venv/bin/python analysis/pd_sep_paper_section/scripts/plot_workload.py \
        --trace traces/w600_r0.0015_st30.jsonl \
        --outdir analysis/pd_sep_paper_section/figures
"""
import argparse
import json
import sys
from collections import Counter
from pathlib import Path
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
BLOCK_SIZE = 512
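

def load_trace(path):
    # One JSON object per line; skip blank lines, then order by timestamp.
    with open(path) as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    rows.sort(key=lambda r: float(r["timestamp"]))
    return rows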


def percentile_markers(arr, qs=(0.5, 0.9, 0.99)):
    arr = np.asarray(arr)
    return {q: float(np.quantile(arr, q)) for q in qs}


def plot_io_cdf(rows, out_path):
    inputs = np.array([r["input_length"] for r in rows if r["input_length"] > 0])
    outputs = np.array([r["output_length"] for r in rows if r["output_length"] > 0])
    fig, axes = plt.subplots(1, 2, figsize=(8.5, 3.2))
    for ax, data, label, log in [
        (axes[0], inputs, "input tokens (log scale)", True),
        (axes[1], outputs, "output tokens", False),
    ]:
        sorted_d = np.sort(data)
        cdf = np.arange(1, len(sorted_d) + 1) / len(sorted_d)
        ax.plot(sorted_d, cdf, color="#1f77b4", lw=1.6)
        if log:
            ax.set_xscale("log")
        ax.set_xlabel(label)
        ax.set_ylabel("CDF")
        ax.set_ylim(0, 1.02)
        ax.grid(True, alpha=0.3)
        pcts = percentile_markers(data)
        for q, v in pcts.items():
            ax.axvline(v, color="#888", ls=":", lw=0.8)
            ax.annotate(
                f"p{int(q*100)}={int(v):,}",
                xy=(v, q),
                xytext=(4, -8),
                textcoords="offset points",
                fontsize=8,
                color="#444",
            )
    io_ratio = inputs.sum() / max(outputs.sum(), 1)
    fig.suptitle(
        f"Agentic workload I/O: aggregate ratio = {io_ratio:.1f}x "
        f"(N={len(rows)} requests, sampled from GLM-5.1)",
        fontsize=10,
    )
    fig.tight_layout(rect=(0, 0, 1, 0.94))
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    print(f"[C1a] wrote {out_path}")
    print(f" input p50={int(np.quantile(inputs, 0.5)):,} "
          f"p90={int(np.quantile(inputs, 0.9)):,} "
          f"p99={int(np.quantile(inputs, 0.99)):,}")
    print(f" output p50={int(np.quantile(outputs, 0.5)):,} "
          f"p90={int(np.quantile(outputs, 0.9)):,} "
          f"p99={int(np.quantile(outputs, 0.99)):,}")
    print(f" aggregate I/O ratio = {io_ratio:.2f}x")


def reuse_decomposition(rows):
    """Classify every cacheable block as intra-session / cross-session / unique.

    Walk requests in timestamp order. For each block (hash_id) in the request:
      - if first time seen globally -> 'unique-or-future-reuse' (resolved later)
      - if already seen earlier within the same session -> 'intra-session'
      - if already seen in a different session -> 'cross-session'

    After the pass, first-time blocks are split by global refcount: a refcount
    of 1 means 'unique' (never reused); a refcount > 1 means 'first emission
    of a block that is reused later'. The later hits themselves are already
    counted as intra-session or cross-session reuse.

    Token counts use BLOCK_SIZE = 512.
    """
    # Session id resolution mirrors analyze_cache_hit.py.
    chat_to_session = {}
    block_first_session = {}  # hid -> session_id of first emitter
    block_seen_in_session = {}  # hid -> set of session_ids that have seen it
    block_global_count = Counter()
    intra = 0
    cross = 0
    first_time = 0  # token-count of blocks the first time they appear
    for r in rows:
        cid = int(r["chat_id"])
        pid = int(r["parent_chat_id"])
        sid = r.get("session_id",
                    str(cid) if pid < 0 else chat_to_session.get(pid, str(pid)))
        sid = str(sid)
        chat_to_session[cid] = sid
        for hid in r.get("hash_ids", []):
            block_global_count[hid] += 1
            if hid not in block_first_session:
                block_first_session[hid] = sid
                block_seen_in_session[hid] = {sid}
                first_time += BLOCK_SIZE
            else:
                if sid in block_seen_in_session[hid]:
                    intra += BLOCK_SIZE
                else:
                    cross += BLOCK_SIZE
                    block_seen_in_session[hid].add(sid)
    # Of the first-time tokens, those whose block was never reused are 'unique'.
    unique_tokens = 0
    reused_first = 0
    for hid, count in block_global_count.items():
        if count == 1:
            unique_tokens += BLOCK_SIZE
        else:
            reused_first += BLOCK_SIZE  # first emission of a reused block
    # Total tokens (block-rounded) = intra + cross + first_time
    # first_time decomposes into: unique_tokens + reused_first
    # For the reuse story we attribute first_time to 'unique vs the
    # first-emit-of-a-shared-block'. Convention used in the figure:
    #   intra-session reuse = subsequent hits within the same session
    #   cross-session reuse = subsequent hits across sessions
    #   first emission (will-reuse) = block emitted once, reused later
    #   unique (never-reuse) = block emitted exactly once, never hit again
    return {
        "intra_session_reuse_tokens": intra,
        "cross_session_reuse_tokens": cross,
        "first_emission_will_reuse_tokens": reused_first,
        "unique_no_reuse_tokens": unique_tokens,
    }


def plot_reuse(rows, out_path):
    d = reuse_decomposition(rows)
    total = sum(d.values())
    parts = [
        ("intra-session reuse", d["intra_session_reuse_tokens"], "#2ca02c"),
        ("cross-session reuse", d["cross_session_reuse_tokens"], "#1f77b4"),
        ("first emission (reused later)", d["first_emission_will_reuse_tokens"], "#ff7f0e"),
        ("unique (never reused)", d["unique_no_reuse_tokens"], "#d62728"),
    ]
    fig, ax = plt.subplots(figsize=(8.5, 1.9))
    left = 0
    for label, val, color in parts:
        frac = val / total
        ax.barh(0, frac, left=left, color=color, edgecolor="white", height=0.6, label=label)
        if frac > 0.025:
            ax.text(left + frac / 2, 0,
                    f"{label}\n{frac*100:.1f}%",
                    ha="center", va="center", fontsize=8.5, color="white")
        left += frac
    ax.set_xlim(0, 1)
    ax.set_yticks([])
    ax.set_xlabel("share of total cacheable tokens (block-aligned, 512 tok blocks)")
    ax.set_title("Where do prefix cache hits come from? "
                 f"(N={len(rows)} requests, sampled trace)")
    ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.45), ncol=4, fontsize=8, frameon=False)
    for spine in ("top", "right", "left"):
        ax.spines[spine].set_visible(False)
    fig.tight_layout()
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    print(f"[C1b] wrote {out_path}")
    for label, val, _ in parts:
        print(f" {label:40s} {val/total*100:5.1f}% ({val:>12,} tokens)")


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--trace", default="traces/w600_r0.0015_st30.jsonl")
    ap.add_argument("--outdir", default="analysis/pd_sep_paper_section/figures")
    args = ap.parse_args()
    trace = Path(args.trace)
    outdir = Path(args.outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    if not trace.exists():
        sys.exit(f"trace not found: {trace}")
    rows = load_trace(trace)
    print(f"loaded {len(rows)} requests from {trace}")
    plot_io_cdf(rows, outdir / "fig_c1a_io_cdf.pdf")
    plot_reuse(rows, outdir / "fig_c1b_reuse.pdf")


if __name__ == "__main__":
    main()


@@ -147,7 +147,7 @@ launch_instances() {
$VLLM serve "$MODEL" \
--host 0.0.0.0 --port $port \
--tensor-parallel-size 1 \
- --trust-remote-code --enable-prefix-caching --enforce-eager \
+ --trust-remote-code --enable-prefix-caching \
--dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
$vllm_extra_args \
@@ -158,7 +158,7 @@ launch_instances() {
$VLLM serve "$MODEL" \
--host 0.0.0.0 --port $port \
--tensor-parallel-size 1 \
- --trust-remote-code --enable-prefix-caching --enforce-eager \
+ --trust-remote-code --enable-prefix-caching \
--dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
$vllm_extra_args \
> "$logfile" 2>&1 &


@@ -54,7 +54,7 @@ for i in $(seq 0 $((N_INSTANCES - 1))); do
$VLLM serve "$MODEL" \
--host 0.0.0.0 --port $port \
--tensor-parallel-size 1 \
- --trust-remote-code --enable-prefix-caching --enforce-eager \
+ --trust-remote-code --enable-prefix-caching \
--dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
--kv-transfer-config \
'{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \


@@ -44,7 +44,6 @@ $VLLM serve "$MODEL_PATH" \
--tensor-parallel-size 4 \
--trust-remote-code \
--enable-prefix-caching \
- --enforce-eager \
--dtype auto \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
@@ -61,7 +60,6 @@ $VLLM serve "$MODEL_PATH" \
--tensor-parallel-size 4 \
--trust-remote-code \
--enable-prefix-caching \
- --enforce-eager \
--dtype auto \
--gpu-memory-utilization 0.8 \
--kv-transfer-config \


@@ -49,7 +49,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 $VLLM serve "$MODEL_PATH" \
--tensor-parallel-size 4 \
--trust-remote-code \
--enable-prefix-caching \
- --enforce-eager \
--dtype auto \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
@@ -65,7 +64,6 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 $VLLM serve "$MODEL_PATH" \
--tensor-parallel-size 4 \
--trust-remote-code \
--enable-prefix-caching \
- --enforce-eager \
--dtype auto \
--gpu-memory-utilization 0.8 \
--kv-transfer-config \


@@ -29,7 +29,7 @@ for i in $(seq 0 6); do
echo "Starting C instance $i on GPU $i, port $((8000+i)), bootstrap $((8998+i))"
VLLM_MOONCAKE_BOOTSTRAP_PORT=$((8998+i)) MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
.venv/bin/vllm serve "$MODEL" --host 0.0.0.0 --port $((8000+i)) --tensor-parallel-size 1 \
- --trust-remote-code --enable-prefix-caching --enforce-eager \
+ --trust-remote-code --enable-prefix-caching \
--dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
> "$OUTDIR/vllm_c_$i.log" 2>&1 &
@@ -40,7 +40,7 @@ done
echo "=== Launching PS instance on GPU 7, port 8007, bootstrap 9005 ==="
VLLM_MOONCAKE_BOOTSTRAP_PORT=9005 MASTER_PORT=29507 CUDA_VISIBLE_DEVICES=7 \
.venv/bin/vllm serve "$MODEL" --host 0.0.0.0 --port 8007 --tensor-parallel-size 1 \
- --trust-remote-code --enable-prefix-caching --enforce-eager \
+ --trust-remote-code --enable-prefix-caching \
--dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
> "$OUTDIR/vllm_ps_0.log" 2>&1 &