Two bugs caused elastic to concentrate load on cached instances (10x token
imbalance vs 2.7x baseline):
1. _instance_cost queue only counted pending_prefill_tokens, missing
ongoing_decode_tokens entirely — instances with 50 decoding requests
appeared idle to the cost model.
2. Cache hits made overloaded instances look "cheap", creating a positive
feedback loop: more sessions → more cache → lower cost → more routing.
Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks
affinity before the cost model runs, matching linear policy behavior.
Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The replayer and proxy were building multi-turn prompts from trace tokens,
but the model generates different output tokens. Subsequent turns had wrong
prefix tokens, causing cache misses and invalid experimental measurements.
- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids
to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized
prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks),
track failed_recving_block_ids for error recovery
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copies mooncake_connector.py, mooncake_utils.py, scheduler.py from
third_party/vllm to the pip-installed vllm's site-packages. C extensions
stay from the pip package; only Python files are overridden.
Usage: bash scripts/deploy_vllm_patches.sh [HOST]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A: Add /estimate_hit endpoint to bootstrap server for real-time cache
probing. Proxy queries this before committing to PUSH, eliminating
24% zero-match PUSH requests (shadow cache divergence).
C: Add _handle_cached_prefill_offload: C (cache source) does fast
cached prefill → KV to Mooncake → D pulls and decodes.
Replaces broken direct_read PUSH where D waited for RDMA transfer
while occupying KV blocks without doing compute.
Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without affinity, all cached requests route to the same instance
(cache source always has lowest prefill cost), causing 149s queue.
Fix: if the session's last instance has cost <= 2x the global best,
use it (preserves cache locality). Only re-route when the affinity
instance is significantly more expensive (overloaded).
The 2x threshold is intentionally loose — it's not a hardcoded magic
number but a "prefer locality unless clearly worse" heuristic.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause confirmed: NONE_HASH = os.urandom(32) differs between
scheduler and bootstrap server even in the same process (init_none_hash
called separately by each import path). PYTHONHASHSEED makes it
deterministic: NONE_HASH = hash_fn(seed), same across all code paths.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: each vLLM instance has a random NONE_HASH (os.urandom(32))
when PYTHONHASHSEED is not set. All block hashes are chained from
NONE_HASH, so D's hashes never match C's hashes.
Fix: C's bootstrap server now accepts token_ids and does the prefix
cache lookup locally using C's own hash function and block pool.
No cross-instance hash matching needed.
New flow: D sends prompt token_ids → C computes hashes on C's side →
C looks up in C's own BlockPool → returns block_ids.
Also: module-level _shared_block_pool for scheduler→bootstrap bridge,
prompt_token_ids passed through PullReqMeta, test script added.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Old cost model: offload_cost = colocated_cost + RDMA_overhead, so offload
was always 0.1s more expensive. Result: only 19/117 HEAVY offloaded.
New: colocated_cost includes interference penalty when C_s has decode
requests: penalty = prefill_time × min(num_requests, 3) × 0.3.
Offload now wins when C_s has 1+ concurrent request.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
retry on ConnectError to handle kv_both connection instability
With RDMA_overhead=0.1s, offload triggers when C_s has just 700 tokens
pending (0.1s queue), vs 38k tokens (5.4s) with the old 2.0s estimate.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (75%)
because they were cold (cache_ratio=0). But with direct RDMA read,
D reads C's cached blocks via RDMA regardless of cache ratio — the
gate was protecting against the OLD flow (C does prefill + push).
Also fixed cost model: offload_cost now reflects direct read reality:
OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive)
NEW: D_queue + RDMA_read + D_local_prefill(new_tokens)
Offload wins when C_s queue > RDMA_overhead (~2s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace the global session_affinity dict with two namespace-isolated
ones (combined / prefill) so a session_id never indexes the wrong
instance list across mode switches. Keep `session_affinity` as a
read-only alias to the combined dict for any existing tooling.
- Add a startup _verify_vllm_patch() that scans
vllm.v1.core.sched.scheduler.Scheduler for the original
`assert req_id in self.requests` line. If the patch was not
re-applied after a vLLM upgrade we now print a loud warning at
lifespan startup instead of dying mid-experiment on a KV-transfer
abort race.
- Replace mutable module constants (HEAVY_THRESHOLD/OVERLOAD_FACTOR/
MAX_OFFLOAD_INFLIGHT/PREFILL_THROUGHPUT/RDMA_OVERHEAD_S/
CACHE_CAPACITY_BLOCKS) with a Settings dataclass + SETTINGS singleton.
__main__ now mutates SETTINGS so CLI overrides survive even when the
module is imported as a library (e.g. by tests/) (D5).
- Add --max-offload-inflight CLI flag (M3) and read it from SETTINGS.
- Add --cache-gate-ratio CLI flag and a real gate before the cost-model
branch: if cache_hit/input_length < ratio, mark cache_gate_REASON and
fall back to colocated. cache_ratio is no longer a write-only field
(B4).
- P candidate selection penalises instances already running offloaded
HEAVY prefills, so back-to-back HEAVY requests don't pile onto the
same P (M2).
- bench.sh forwards --max-offload-inflight / --cache-gate-ratio to the
proxy.
- Tests cover SETTINGS knobs + the heavy_threshold-driven P-offload
penalty.
Complete implementation of direct RDMA read for KV cache migration:
vLLM Mooncake connector (mooncake_connector.py):
- PullReqMeta: add direct_read flag + block_hashes
- MooncakeConnectorMetadata: add hash_table_updates/removals for
scheduler->worker block hash sync
- MooncakeConnectorScheduler: set_block_pool() to access BlockPool,
build_connector_meta() computes hash table deltas each step,
update_state_after_alloc() captures request block hashes for direct_read
- MooncakeConnectorWorker: _start_direct_read() + _direct_read_single()
implements D-side RDMA read via batch_transfer_sync_read, with
HTTP query/unpin to C's bootstrap server
Bootstrap server (mooncake_utils.py):
- POST /query_blocks: look up block hashes, return block_ids + GPU layout
- POST /unpin_blocks: release pin tracking
- set_worker_kv_info(): register GPU addresses at init
- update_hash_table(): receive scheduler deltas each step
Scheduler (scheduler.py):
- One-line hookup: pass block_pool to connector after KVCacheManager init
Proxy (cache_aware_proxy.py):
- _handle_direct_read_offload: sends request ONLY to D with
direct_read=True + remote_bootstrap_addr. No request to C at all.
- C's scheduler is completely uninvolved (0 GPU time on C)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
M1: cached_blocks was a plain set with a "trim half via list slicing"
eviction. CPython does not guarantee set iteration order, so the trim
discarded an arbitrary half of the entries — completely unlike vLLM's
LRU and a known contributor to the router's cache_hit estimate
diverging from real APC. Replace with an OrderedDict-backed LRU:
move_to_end on hits, popitem(last=False) on overflow. Capacity exposed
as CACHE_CAPACITY_BLOCKS module constant (200000 by default).
M5: streamed responses decrement load counters in their generator's
finally block. If a client disconnects before consuming the body the
generator is never entered and the decrement is lost, causing
ongoing_tokens / num_requests / pending_prefill_tokens to drift
negative under load. Add a 60s background reconcile_loop that clamps
those counters at zero as a safety net. Started in lifespan, cancelled
on shutdown. Does not replace proper vLLM exact-state syncing.
The hardcoded traces/sampled_1000req_seed42.jsonl no longer exists; switch
the default to the current sampled trace file w600_r0.0015_st30.jsonl and
let users override via --trace. Skip Part 4 cleanly when the file is
missing instead of relying on os.path.exists.
D2: run_benchmark.sh and run_experiments.sh still pass --time-scale and
--max-inflight-sessions to the replayer, but those flags were removed when
the project moved to trace-driven dispatch. The scripts cannot run as-is.
D3: ~25 ad-hoc analyze_* / compare_* / profile_* / final_* scripts and a
handful of single-experiment run_*.sh point at /home/admin/cpfs paths,
deleted output directories, or a sampled trace file that no longer exists.
Keep them in scripts/legacy/ for historical reference; the scripts that
remain in scripts/ (analyze_trace, analyze_breakdown, analyze_cache_hit,
analyze_eviction, compare_results, compute_roofline, sample_trace,
analyze_agentic_patterns, simulate_cache_policies, plus launch_*.sh,
gpu_monitor.sh, bench.sh) cover the current workflow.
Adds scripts/legacy/README.md to document the archival policy.
The default MODEL pointed at /home/admin/cpfs/... which never existed on
the public dev machines (other launch_*.sh and TODO.md use $HOME/models),
and the default TRACE pointed at traces/sampled_1000req_seed42.jsonl
which was deleted when the sampler moved to window+thin output. Update
both to the values the rest of the repo already standardized on.
B1: _inst_cumulative_tokens was written by pick_instance but never read
anywhere; delete the variable, global declaration, and per-call increment.
Load is already tracked via inst.ongoing_tokens.
D1: _send_prefill_async + the --fire-and-forget branch were unreachable
in practice (no launch/bench script enabled the flag) and broken even if
exercised: D-decode would fire before P registered the transfer_id,
guaranteeing a Mooncake 502. Collapse _handle_pd_sep to its synchronous
path and drop the CLI flag.
Added EXIT/INT/TERM traps to ensure vLLM, proxy, and gpu_monitor
processes are cleaned up even when bench.sh is killed externally.
Also includes gpu_monitor in cleanup_gpu pattern matching.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM Mooncake patch:
- get_num_new_matched_tokens: support remote_num_tokens parameter for
partial remote prefill (pull N tokens from remote, compute rest locally)
- update_state_after_alloc: only allocate receive blocks for external portion
Proxy _handle_heavy_offload rewrite:
- Step 1: C_s exports ONLY cached blocks (truncated prompt, 0 compute)
- Step 2: D pulls cached blocks + does local prefill for new tokens + decodes
- C_s's blocks auto-freed by Mooncake delay_free after D confirms receipt
This enables true session migration: C_s releases cache, D takes over.
C_s's GPU is freed immediately (no compute), vs old approach where C_s
had to do full prefill (1-15s GPU occupancy).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Old gate: cache_ratio >= 0.3 (static, only 14% of HEAVY triggered)
New gate: offload when offload_cost < colocated_cost, where:
colocated_cost = queue(C_s) + prefill(new_tokens)
offload_cost = queue(P_idle) + prefill(P_tokens) + RDMA_overhead
Key changes:
- P is now least-loaded instance (not session-sticky C_s)
- Gate considers C_s queue depth dynamically
- Crossover: offload wins when C_s queue >= 38k tokens (~5.4s)
- Cold HEAVY requests CAN be offloaded if C_s is busy enough
- P accounting uses P's actual cache hit, not C_s's
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug 1+5: D instance had no accounting during prefill phase (7-11s window).
Router saw D as idle, routing extra traffic that caused KV allocation failures.
Fix: reserve D's ongoing_tokens+num_requests at offload decision time.
Bug 7: No cap on concurrent offloads despite REPORT claiming MAX_OFFLOAD=4.
Fix: add MAX_OFFLOAD_INFLIGHT=4 check before offloading.
Bug 6: Session affinity migrated to D but proxy cache estimator wasn't
updated for D. Future turns scored D as cache-cold.
Fix: call d_inst.record_prefix(token_ids) after successful decode.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Random session sampling destroys cross-session hash block sharing
(52% -> 16%) because sessions sharing system prompts get scattered.
New approach: take a contiguous time window from the trace (preserving
temporal locality of shared-prefix sessions), then thin within the
window to hit target QPS. This preserves both intra-session reuse
(62% of reusable tokens) and cross-session sharing (38%).
Results (block sharing rate):
Old random r=0.002: 16.0% -> Window+thin: 29.7%
Old random r=0.016: 19.5% -> Window+thin: 42.7%
Full trace baseline: 52%
Also corrected the "91% intra-session" claim: actual split is
62% intra / 38% cross (token-level), making cross-session sharing
preservation critical for valid APC benchmarks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The replayer was artificially limiting concurrency with --max-inflight-sessions
(semaphore) and --time-scale (time compression), producing unrealistically low
1 req/GPU load that masked prefill-decode interference.
Replayer changes:
- Remove session_sem and time_scale entirely
- Each request dispatched at its trace timestamp exactly
- Sessions still sequential (turn N+1 waits for turn N completion)
- If turn completes late, next turn fires immediately
Sampler changes:
- Add --sample-ratio for GPU-proportional session sampling
- Keep --target-requests for backwards compat
- No time compression (preserve original arrival pattern)
bench.sh: remove --time-scale and --max-inflight-sessions args
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LMetric was incorrectly sharing session-sticky logic with Linear policy.
Fixed to pure per-request routing: score = P_tokens × BS where
P = pending_prefill + (input - cache_hit), BS = num_requests.
Experiment result (200 req, fresh restart): Linear vs corrected LMetric
show <2% difference on all metrics — LMetric's cache-hit estimation
provides implicit soft affinity that preserves locality without explicit
session stickiness.
Also fix bench.sh missing cd (replayer module not found from non-project
cwd) and rewrite run_lmetric_ab.sh as thin wrapper around bench.sh to
eliminate duplicated launch/cleanup logic that broke under set -euo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU
imbalance (~3.5-4x across all settings). Root cause: imbalance is from
workload skew at session placement (turn 1), not from routing at turn 2+.
H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x),
and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded
requests have bimodal transfer times (0.6s or 18-31s) that negate the
routing benefit.
Updated elastic_hypotheses.md with H7 results and next directions:
higher load experiments where contention amplifies routing differences.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same-condition comparison (both fresh restart, same trace, same params):
Baseline (combined): TTFT=2.383/27.622 TPOT90=0.117 E2E=10.232
Elastic P2P (cap=4): TTFT=1.315/13.179 TPOT90=0.075 E2E=5.708
Delta: -45% / -52% -36% -44%
Key finding: TPOT p90 dropped 36% — confirming heavy prefill DOES
disrupt decode in combined mode, and elastic offload effectively
isolates it. Previous comparisons missed this because baselines
were run under different conditions (stale instances, different time_scale).
GPU util: elastic uses less GPU (15.8% vs 28.7%) but achieves better
latency — higher efficiency through better cache distribution.
APC: elastic has more balanced per-instance APC (36-38% prefix + 30-35%
external) vs baseline's skewed distribution (3.8% - 68.3%).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixed offload decision: removed p>=d gate (was blocking all offloads),
added MAX_OFFLOAD_INFLIGHT=4 cap and p_saturated threshold.
Result (200 req, fresh restart):
Baseline: 99% success, TTFT=1.080/9.410, TPOT90=0.076, E2E=5.306
Elastic: 96% success, TTFT=0.946/15.843, TPOT90=0.077, E2E=5.717
Architectural tradeoff confirmed:
- Median (p50) improves: D instances not disrupted by heavy prefill
- Tail (p90) worsens: offloaded HEAVY requests pay KV transfer cost
- TPOT unchanged: decode isolation is not the bottleneck
To improve p90: need layerwise pipelined KV transfer (overlap with prefill
compute) or smarter offload gating that avoids offloading the very largest
requests (which have the longest prefill time and generate the most KV).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Design: offload HEAVY prefill only when P instance is less loaded than D
AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D
for future KV reuse. External KV correctly registered in prefix cache.
Result (67/200 processed, 75% success):
TTFT p50: 0.551s (-49% vs baseline 1.080s)
TTFT p90: 4.135s (vs baseline 9.410s, -56%)
TPOT p90: 0.074s (same as baseline)
E2E p50: 2.938s (-45% vs baseline 5.306s)
25% error rate from ReadTimeout on very large HEAVY requests queuing on P.
Needs stricter elastic gate or higher timeout. But successful requests
show significant improvement over both baseline and previous P2P.
Also: added external_prefix_cache metrics tracking to replayer summary.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Investigation confirms vLLM Mooncake connector DOES correctly register
externally-received KV blocks in the prefix cache. No bug exists.
Evidence from vLLM logs (per-instance):
inst_1: prefix_cache=14.7%, external_cache=72.1% <- high external hit
inst_4: prefix_cache=52.4%, external_cache=59.0%
The 0.5% aggregate APC from /metrics was a measurement artifact:
inst_0 received 718M query tokens (cold-start prefills) with 0% hit,
diluting the aggregate. D-instances have 20-72% external cache hit.
The /metrics endpoint's prefix_cache_hits_total counter does not include
external hits. The vLLM log's "External prefix cache hit rate" is the
correct metric for Mooncake-transferred KV reuse.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixed race condition in P instance selection (all going to inst_0).
P2P design: HEAVY requests prefill on least-loaded OTHER instance,
KV transfer via Mooncake, decode on session-sticky instance.
Result (200 req, fresh restart, vs baseline):
TTFT p50: 1.080 -> 0.939 (-13%) <- median improves (decode not disrupted)
TTFT p90: 9.410 -> 14.987 (+59%) <- tail worsens (KV transfer on large req)
TPOT p90: 0.076 -> 0.075 (-1%) <- unchanged (not the bottleneck)
E2E p50: 5.306 -> 5.565 (+5%) <- slightly worse overall
The P2P offload helps the common case (WARM/MEDIUM get lower TTFT because
their instance isn't blocked by a heavy prefill) but hurts HEAVY requests
(extra KV transfer latency). This is a median-vs-tail tradeoff.
For SLOs targeting p50: P2P offload helps.
For SLOs targeting p90/p99: baseline combined is better.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Evidence-backed analysis with per-request matched comparison:
1. KV CACHE MEMORY WALL (Evidence 3)
Combined: 12% KV cache per instance (comfortable)
PD-Sep 6P+2D: 48-97% on decode instances (saturation -> 100s waits)
2. KV TRANSFER OVERHEAD (Evidence 4, matched requests)
Mean 1.79s extra TTFT per request, 3.3x slower overall
Small requests (<5k) hit 8.0x ratio (transfer dominates prefill)
Large requests (>50k) hit 1.3x ratio (prefill dominates)
3. SESSION AFFINITY BROKEN (Evidence 5)
Combined: turn N+1 hits same GPU -> 80% multi-turn APC
PD-Sep: turn N+1 prefill on P has NO prior KV (sent to D) -> 0% APC on P
Must re-prefill + re-transfer on every turn
4. GPU UNDERUTILIZATION (Evidence 2)
PD-Sep: 12-17% GPU util (decode is memory-bound, wastes GPU compute)
Combined: 28-54% GPU util (flexible P+D on same GPU)
Root cause: agentic workloads break PD-Sep's assumptions (short input,
no prefix sharing, compute-heavy prefill) with long context, 91%
intra-session KV reuse, and lightweight MoE compute.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TP=2 DP=4 with hybrid routing achieves TTFT p50=0.611s (-43% vs TP=1),
the best TTFT across all tested configurations. But TPOT p90=0.109s
(+51% vs TP=1) due to cross-GPU all-reduce in decode.
Full comparison across 7 configurations shows two Pareto-optimal points:
TP=1 DP=8 hybrid: best TPOT (0.072s), good TTFT (1.064s)
TP=2 DP=4 hybrid: best TTFT (0.611s), acceptable TPOT (0.109s)
The choice depends on SLO:
TTFT-sensitive (interactive) -> TP=2 DP=4
TPOT-sensitive (streaming) -> TP=1 DP=8
All PD-Sep configurations are strictly dominated by one of these two.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implemented --offload mode: HEAVY requests (>20k new tokens) get P on
least-loaded instance, KV via Mooncake RDMA, D on session-sticky instance.
WARM/MEDIUM stay co-located (no KV transfer). All 8 instances run kv_both.
Result (200 req, same instances, fresh restart):
Baseline (no offload): TTFT=1.073 TPOT90=0.074 E2E=5.086
Offload HEAVY: TTFT=1.462 TPOT90=0.077 E2E=6.847
Delta: +36% +4% +35%
Conclusion: even selective KV transfer (only 44% of requests) adds more
overhead than the isolation benefit provides. On single-machine 8 GPU,
PD-combined with hybrid routing is strictly optimal. No form of KV
transfer — full PD-sep, selective offload, or otherwise — improves
over co-located serving for this workload.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Balanced session-sticky routing improves APC from 44.7% to 48.7% (+4pp,
close to simulated 49.2%) but TTFT worsens by 30% and E2E by 23%.
Root cause: session-sticky creates load hotspots — some instances get
multiple heavy concurrent sessions, causing queue delays, despite higher
per-instance APC.
Key finding: APC optimization and latency optimization are in tension.
- Cache affinity (sticky) -> higher APC, worse load balance -> worse latency
- Load-based routing (old) -> lower APC, better load balance -> better latency
The optimal design must balance both dimensions, not optimize one at
the expense of the other.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Routing fix: new sessions placed by cumulative token load (greedy bin
packing) with cache-hit tiebreak. Session affinity for turn 2+.
Replayer now sends X-Session-Id header for proper session tracking.
Agentic workload core patterns (GLM-5.1 trace):
- 91% of reusable KV is intra-session (not cross-session)
- Session-sticky routing is THE critical optimization
- 36% warm requests (1.3k new tokens), 64% cold (17k+)
- After cache: effective prefill/decode ratio drops from 61.5x to 28.7x
- Cross-session sharing (system prompt) is only 4.8% of tokens
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>