obsidian/review/assignments/EuroSys26-Spring-Shadow.md


Unilink: Unleash the Power of Unikernel by Rack Level Inter-connection

CXL, remote fork

unikernels: execute a single application, have smaller image sizes, faster execution speeds, and lower memory footprints

current works:

  • live migration of VMs or processes, distributed shared memory
  • fine-grained microservice and serverless function decompositions
  • page-fault-driven memory pooling
  • transient “Spot” VM instances
  • hardware-level disaggregation via Network

Since there is no actual CXL Switch device available at present, we use a CXL memory expansion card as shared memory

Existing solutions tackle server resource fragmentation through elastic scheduling frameworks that dynamically allocate resources based on workload patterns [22, 41, 42, 44, 47].

Review

overall merits: 3 expertise: 1

Paper summary

This paper presents Unilink, a system that enables cross-server unikernel collaboration through CXL-based memory sharing, aiming to deliver low-latency inter-server communication. At the core of Unilink is a novel abstraction, Uniprocess, which treats each application process as an individual unikernel VM. The system is designed to mitigate resource fragmentation in modern datacenter environments by supporting features such as remote fork, distributed process management, and fast inter-Uniprocess communication across servers. The authors implement Unilink as a full-stack solution involving QEMU-based VM migration, a memory pool leveraging CXL, and forkd for managing process lifecycles. The evaluation demonstrates promising performance gains on representative workloads including Nginx, Redis, and KANN neural network training.
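For reference while reviewing, a minimal sketch of what a shared-memory channel over an ivshmem/CXL-style mapping can look like (illustrative only; this is not Unilink's actual protocol or wire format, and the single-slot mailbox here ignores ring buffering and cross-host cache-coherence details):

```python
# Hedged sketch: a single-slot mailbox in a mapped shared-memory region. The sender writes
# length + payload and flips a flag; the receiver polls the flag, copies the payload, clears it.
import mmap, struct

FLAG = struct.Struct("<I")   # 0 = empty, 1 = full
LEN = struct.Struct("<I")

def send(region, payload: bytes) -> bool:
    if FLAG.unpack_from(region, 0)[0] != 0:
        return False                              # slot busy; caller retries or spins
    LEN.pack_into(region, 4, len(payload))
    region[8:8 + len(payload)] = payload
    FLAG.pack_into(region, 0, 1)                  # publish only after the payload is in place
    return True

def recv(region):
    if FLAG.unpack_from(region, 0)[0] != 1:
        return None
    n = LEN.unpack_from(region, 4)[0]
    data = bytes(region[8:8 + n])
    FLAG.pack_into(region, 0, 0)                  # free the slot for the next message
    return data

# anonymous mapping standing in for the shared ivshmem/CXL segment:
region = mmap.mmap(-1, 4096)
send(region, b"hello from uniprocess A")
print(recv(region))
```

Even this toy version makes the per-byte cost visible: one copy into the region on the sender side, one copy out on the receiver side, plus the flag synchronization.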

Comments for authors

Thank you for your submission to EuroSys26. This work tackles an important and timely challenge by enabling cross-node unikernel execution with support for CXL memory disaggregation. The system is implemented end-to-end, integrating QEMU-based VM migration, a CXL-backed memory pool, a userspace process daemon (forkd), and a shared memory communication channel. The system architecture is thoughtfully structured, with clear separation of concerns among forkd, the hypervisor layer, and libunilink.

That said, there are several areas where the presentation and evaluation could be improved:

  • While the paper compares serverless, VM, and Unilink-based approaches, the rationale for converging on the Unilink solution remains somewhat ad hoc. It would be helpful to articulate the design motivations more clearly: why is the Uniprocess abstraction, in particular, a better fit for solving the stated problems than extending existing container or VM orchestration models?
  • The explanation of how the Uniprocess abstraction helps mitigate memory fragmentation is insufficient. A more concrete analysis—perhaps with metrics or scenarios illustrating better packing or reduced overhead—would clarify its advantage in multi-tenant environments.
  • The paper does not sufficiently address fault tolerance in the face of real-world failures. How does the system behave under node failures, network partitions, or inconsistent state across distributed forkd daemons? A discussion of consistency guarantees and failure recovery mechanisms would strengthen the robustness of the design.
  • The evaluation would benefit from additional microbenchmarks that isolate the overheads of core components—e.g., the scheduling latency of forkd, the per-byte communication cost of Unilink IPC, and the startup overhead of Uniprocess VMs. These metrics are important to fully understand the trade-offs involved in the system.
  • It is impressive that multi-worker Nginx running on Unilink outperforms the Linux baseline. However, the paper attributes some unexpected results to bugs in lwip. Please clarify whether these are known bugs, whether you applied any patches, and whether the fixes are general or specific to your use case.

Overall, this is a promising system with a strong implementation and initial evaluation. I encourage the authors to strengthen the motivation, deepen the evaluation, and elaborate on the failure semantics to improve the clarity and rigor of the work.

LLM review

Paper Summary

The paper presents Unilink, a distributed computing framework that enables cross-node collaboration between Unikernels by leveraging CXL-based memory sharing. It introduces a novel abstraction called Uniprocess, which models each application process as a Unikernel VM. Unilink aims to solve the resource fragmentation problem prevalent in modern data centers, allowing for remote fork, distributed process management, and low-latency inter-Uniprocess communication across servers. The system is implemented on KVM/QEMU with minimal guest modification and provides significant performance benefits in applications like Nginx, Redis, and KANN neural network training.

Strengths

  • Novelty: First system to enable cross-node Unikernel collaboration using CXL memory, a new and underutilized capability.
  • Timeliness: Tackles cloud resource fragmentation, a highly relevant and impactful problem.
  • Technical depth: Introduces a full-stack solution (VM migration via QEMU, CXL-based memory pool, process daemon forkd, shared memory communication) with deep system-level integration.
  • Backward compatibility: Minimal modification to applications and reuse of existing components (e.g., QEMU migration and ivshmem).
  • Good evaluation: Demonstrates improvements in communication latency, throughput, and fork behavior with real-world applications like Nginx and Redis.
  • Well-structured system design: Clear separation of responsibilities between fork daemon, hypervisor, and libunilink.

Opportunities for Improvement

  • Security concerns are acknowledged (e.g., shared memory access control), but not addressed in the current implementation.
  • Limited discussion of failure handling in fork and communication mechanisms—what happens under node crashes, network partitions, or inconsistent metadata across forkd instances?
  • The CXL-based file system design is functional but relatively simplistic (e.g., based on libfuse in user space); it might not scale well under high concurrency.
  • Some parts of the paper suffer from grammar and clarity issues, which occasionally obscure technical explanations.
  • The evaluation lacks microbenchmarks isolating the overhead of the key components like forkd scheduling delay or IPC per-byte cost.

Comments for authors

  • The introduction and motivation are strong, particularly the decomposition into #M1–#M3 and challenges #C1–#C3.
  • The design of forkd is clean and seems extendable. However, a description of fault-tolerant metadata management (e.g., group id/pid recovery or synchronization) would enhance robustness.
  • You may want to elaborate further on how CoW is implemented at the QEMU level—is it CoW only for guest RAM, or are disk states also CoW-protected?
  • In Figure 3, the control/data flows between QEMU and forkd could be more clearly labeled (e.g., which messages use shared memory vs. socket).
  • In the evaluation, it's impressive that multi-worker Nginx on Unilink outperforms the Linux version, but some performance anomalies are attributed to lwip bugs. Please clarify whether those fixes are generic or application-specific.
  • The system's reliance on Unikraft is reasonable, but can Unilink support other unikernels, such as MirageOS or OSv? If so, a brief porting guide or feasibility comment would help.
  • Although MPK and VMFUNC are mentioned as future work, discussing isolation policies or memory protection trade-offs in the current system would improve security discussion.

Questions for authors response

  • How does the system handle crashes or restarts of forkd? Is there a persistent state for process group tracking?
  • Can Unilink support migrating a Uniprocess after it's been forked, i.e., live migration for load balancing?
  • How does Unilink scale in terms of number of Uniprocesses or fork requests per second? Are there bottlenecks in forkd or shared memory contention?
  • Does your CoW mechanism have limitations when forking very large memory snapshots (e.g., >2GB)? Are there any fragmentation issues in CXL memory?
  • Could the routing layer (e.g., message action chains in §7.2) introduce noticeable overhead when scaling up to many communicating nodes?

This paper addresses an important systems problem with a novel, well-engineered, and effective solution. The implementation is deep and grounded, and the evaluation is strong. While there are some areas that would benefit from polishing (clarity, robustness, and deeper analysis in a few places), the core ideas and execution meet the bar for a top-tier systems venue.


REACH: Reinforcement Learning for Adaptive Microservice Rescheduling in the CloudEdge Continuum

The RL formalization sounds natural; why does this work's trained policy outperform prior RL-based work? How do they define the reward?

Review

overall merits: 2 expertise: 1

Paper Summary

This paper presents REACH, a RL-based rescheduling algorithm for adaptive microservice placement in the cloud-edge continuum. The key idea is to dynamically reschedule microservices in real time to minimize end-to-end latency while accounting for resource heterogeneity and network conditions. REACH formulates rescheduling as a Markov Decision Process (MDP), employs Proximal Policy Optimization (PPO) with invalid-action masking to learn effective policies, and leverages a simulation-to-real deployment pipeline to bridge the gap between training and production environments. Experimental evaluations on a Kubernetes-based testbed demonstrate that REACH reduces latency by 7.9–10% compared to strong baselines, maintains performance under node failures and traffic surges, and minimizes rescheduling overhead.

Comments for authors

Thank you for submitting your paper to EuroSys26. This work tackles dynamic, resource-aware microservice rescheduling in cloud-edge systems, where existing methods struggle with adaptability and overhead. REACH mitigates these issues via incremental RL-based rescheduling, with an MDP formulation that accounts for rescheduling costs to avoid full redeployment. The use of PPO with invalid-action masking improves training stability, and the combination of simulation with real-world Kubernetes experiments demonstrates consistent performance gains across multiple microservice architectures and dynamic scenarios such as node failures and autoscaling.

That said, several aspects could be improved or clarified:

  • While the paper mentions rescheduling overhead, it lacks quantitative comparisons with baseline methods. A detailed breakdown of the rescheduling cost relative to latency gains would strengthen the evaluation.
  • The RL formalization seems natural, but why does REACH outperform prior RL-based approaches? More discussion on how the reward structure and policy design differ from existing RL methods for scheduling would be valuable.
  • The penalty term (Penalty_cost) is key to controlling the frequency of rescheduling. However, its tuning and sensitivity are only briefly addressed. How robust is the learned policy to variations in Penalty_cost? Is there a risk of converging to suboptimal policies that avoid necessary rescheduling? (An illustrative reward sketch follows this list.)
  • The runtime performance of the RL agent during rescheduling (e.g., decision latency) is not reported. Measuring this overhead and its impact on the Kubernetes control-plane load would help assess the practicality of deploying REACH in production.
  • The need to profile microservice execution times on heterogeneous nodes may present challenges in large-scale or rapidly changing environments. A discussion on the scalability and automation of this profiling step would be beneficial.
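For context, a common shape for such a reward in rescheduling RL work is sketched below; this is illustrative only, the weights α and β are assumptions, and the paper's exact formulation may differ:

```latex
% illustrative only: a latency term plus a penalty charged whenever a non-idle
% (i.e., rescheduling) action is taken; \alpha and \beta are assumed weights
r_t = -\,\alpha \cdot \mathrm{Latency}(s_t, a_t)
      \;-\; \beta \cdot \mathrm{Penalty}_{\mathrm{cost}} \cdot \mathbb{1}\!\left[a_t \neq \mathrm{idle}\right]
```

A sensitivity sweep over the penalty weight would directly answer the robustness question raised above.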

Overall, this paper presents a promising direction by combining RL with adaptive microservice rescheduling. Addressing the above concerns, particularly around overhead quantification, robustness analysis, and comparisons with prior RL-based work, would be necessary to strengthen the paper.

LLM review

Paper Summary

The paper proposes REACH, a reinforcement learning (RL)-based rescheduling algorithm for adaptive microservice placement in the cloud-edge continuum. The core idea is to dynamically reschedule microservices in real-time using RL to minimize end-to-end latency while considering resource heterogeneity and network conditions. REACH formulates rescheduling as a Markov Decision Process (MDP), employs Proximal Policy Optimization (PPO) to learn policies, and integrates a simulation-to-real deployment pipeline. Experiments on a Kubernetes-based testbed show REACH reduces latency by 7.9–10% compared to baselines, stabilizes performance during node failures and traffic surges, and minimizes rescheduling overhead.

Strengths

  • Problem Significance: Addresses a critical challenge in cloud-edge systems—dynamic, resource-aware microservice rescheduling—where existing methods struggle with adaptability and overhead.
  • Technical Novelty: Proposes incremental rescheduling via RL, explicitly modeling rescheduling costs to avoid full-application redeployment. The use of PPO with invalid-action masking improves training efficiency.
  • Evaluation Rigor: Combines simulation with real-world Kubernetes experiments, demonstrating consistent gains across three microservice architectures (Chain, Aggregator-Parallel, Aggregator-Sequential) under dynamic scenarios (node failures, autoscaling).
  • Practical Impact: The simulation-to-real pipeline and open-source tools (e.g., CEEnv) enhance reproducibility and deployment feasibility.

Opportunities for Improvement

  • Generalization: The evaluation focuses on DAG-based invocation patterns. How would REACH perform with cyclic or dynamic service graphs (e.g., stateful services)?
  • Ablation Studies: Missing ablation on the impact of reward components (e.g., penalty term) and the contribution of simulation pre-training to real-world performance.
  • Scalability: Results are limited to 18 nodes and 36 pods. Larger-scale experiments would strengthen claims about scalability.

Comments for Authors

  1. Simulation-to-Reality Gap: While CEEnv aligns with the real testbed, clarify how discrepancies in resource heterogeneity or network dynamics might affect policy transfer. For example, how does REACH handle edge devices with intermittent connectivity?
  2. Penalty Term Sensitivity: The reward function's penalty term (Penalty_cost) is critical for minimizing rescheduling steps, but its tuning is only briefly discussed. How robust is the policy to variations in Penalty_cost? Does it risk premature convergence to suboptimal "idle" actions?
  3. Service Group Handling: The paper defines service groups as sequentially invoked subsets but does not evaluate scenarios where group dependencies change dynamically. How would REACH adapt to shifting invocation patterns?
  4. Computational Overhead: Quantify the runtime of the RL agent during rescheduling (e.g., decision latency) and its impact on Kubernetes control-plane load. This would contextualize the practicality of real-time deployment.
  5. Multi-Objective Trade-offs: While focusing on latency, the work could briefly discuss how energy consumption or cost metrics might be incorporated into the RL reward framework.

Questions for Authors Response

  1. Non-DAG Invocation Patterns: Does REACH's DAG-based formulation limit its applicability to real-world microservices with cyclic dependencies? If so, how could the framework be extended?
  2. Training Data Requirements: How does the need for profiling microservice execution times on heterogeneous nodes affect deployment in large-scale environments?
  3. Comparison with Baselines: The rescheduling baseline RSDQL underperforms REACH. Is this due to RSDQL's Q-learning design, or would a PPO-based version of RSDQL narrow the gap?
  4. Latency Spike Mitigation: The paper claims rescheduling-induced latency spikes are “transient” but does not quantify their duration or frequency. Could frequent spikes degrade user experience in latency-critical applications?

Overall Recommendation

Score: 3 / 5
The paper presents a compelling solution to dynamic microservice rescheduling with strong experimental validation. However, gaps in generalization, scalability analysis, and ablation studies prevent a stronger recommendation. Addressing these concerns would elevate the work's impact and applicability.


AICraft: Facilitating Advancement for Cloud-native Generative AI Serving Stack

SageServe [20] and Chiron [37] explore SLO-aware autoscaling and predictive methods to manage heterogeneous requests and optimize resource allocation proactively.

Review

overall merit: 1 expertise: 3

Paper summary

This paper presents AICraft, a cloud-native orchestration system for efficient and scalable LLM serving in production environments. AICraft incorporates a suite of advanced features, including adaptive autoscaling, KV cache offloading, prefix-cache-aware routing, heterogeneous hardware scheduling, and multi-LoRA support. The system is motivated and guided by insights from real-world workload traces and aims to bridge the gap between academic research and production-grade deployment. Additionally, the authors develop a benchmarking and workload generation framework designed to enable realistic evaluation of LLM serving systems.
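For reference, a sketch of how prefix-cache-aware routing is commonly realized is given below; it is illustrative only, and the block size, hashing scheme, and function names are assumptions rather than AICraft's actual router:

```python
# Hedged sketch: steer a request to the replica holding the longest cached prefix of its
# prompt, falling back to the least-loaded replica when no prefix blocks are cached.
import hashlib

def block_hashes(token_ids, block_size=16):
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - block_size + 1, block_size):
        h.update(str(token_ids[i:i + block_size]).encode())
        hashes.append(h.hexdigest())      # chained, so each block id encodes its whole prefix
    return hashes

def pick_replica(token_ids, cached_blocks_by_replica, load_by_replica):
    best_replica, best_match = None, 0
    for replica, cached in cached_blocks_by_replica.items():
        matched = 0
        for hb in block_hashes(token_ids):
            if hb not in cached:
                break
            matched += 1
        if matched > best_match:
            best_replica, best_match = replica, matched
    if best_replica is not None:
        return best_replica
    return min(load_by_replica, key=load_by_replica.get)

cached = {"r0": set(), "r1": set(block_hashes(list(range(64))))}
print(pick_replica(list(range(64)), cached, {"r0": 3, "r1": 5}))   # "r1": longest cached prefix
```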

Comments for authors

Thank you for your submission to EuroSys'26. This work tackles an important problem in the systems community: making LLM inference practical and efficient in production-scale, cloud-native environments. The paper ambitiously proposes a unified platform that integrates multiple techniques that have previously been explored in isolation. The proposed system, AICraft, is comprehensive in scope, touching on several relevant challenges such as dynamic autoscaling, KVCache offload, heterogeneous hardware scheduling and workload generation.

That said, I have several concerns regarding the clarity of the technical contributions and the novelty of the system components:

1. Lack of clarity on novelty vs. integration

AICraft appears to combine several well-studied techniques—e.g., autoscaling, KV cache offloading, and prefix-cache-aware scheduling, all of which have been explored in recent works such as BlitzScale (OSDI'25), CachedAttention (ATC'24), and Mooncake (FAST'25). It remains unclear whether AICraft's contributions lie primarily in the novel integration of these components, or if it introduces fundamentally new designs for each component. The current presentation suggests both, but neither direction is fully developed:

  • If the contribution is in integration, then the paper should clearly articulate the challenges in composing these techniques in a cohesive, cloud-native stack. What unique tensions or interactions arise between components? How did the authors resolve them?
  • If the contribution lies in improving individual components, then each such improvement needs to be described in greater technical depth. For instance, claims such as: “The KV cache offloading framework addresses critical bottlenecks in LLM inference through optimized data transfer and memory hierarchy management.”, "The connector employs optimized CUDA kernels to significantly accelerate data movement between GPU and CPU. For memory scalability, its multi-tiered cache manager dynamically balances workloads across storage layers, alleviating GPU memory capacity limits while minimizing latency penalties." are too high-level to evaluate. What is the actual mechanism used? How does it compare against state-of-the-art alternatives (e.g., CachedAttention)? What were the key design trade-offs?

Without this clarity, it is difficult to assess the system's novelty and generalizability.

2. Workload characterization and evaluation mismatch

The paper emphasizes that AICraft is grounded in insights from production traces, yet the evaluation appears to rely predominantly on synthetic workloads. This discrepancy undermines the strength of the evaluation. It is essential to either:

  • Include evaluation results on actual production traces (or anonymized versions).
  • Justify convincingly why synthetic workloads faithfully capture the relevant workload characteristics seen in production.

Currently, the workload generation framework is described, but it's unclear whether it reproduces the temporal, spatial, or statistical characteristics of real workloads.

3. Unclear system insights and contributions

The paper reads more like an engineering report that introduces a large-scale production system, rather than a research paper that distills novel systems insights. In its current form, the paper tries to cover too many features at a high level, rather than diving deeply into the design and implications of one or two novel contributions. As a result, the key takeaway is unclear. I encourage the authors to reframe the paper around the most novel or non-trivial aspects of AICraft and reduce the scope accordingly to allow for deeper discussion and evaluation.

LLM review

Paper Summary

The paper introduces AICraft, a cloud-native orchestration system designed to optimize large language model (LLM) serving in production environments. It addresses key challenges such as dynamic workload variability, heterogeneous hardware utilization, and stateful inference optimization. AICraft integrates features like adaptive autoscaling, KV cache offloading, multi-LoRA support, and heterogeneous resource scheduling, grounded in insights from real-world workload analysis. The system emphasizes practical deployment readiness and serves as a platform for evaluating cloud-native LLM infrastructure innovations.

Strengths

  • Practical Significance: AICraft addresses critical gaps in LLM serving, including cost-efficient scaling, support for multi-LoRA adapters, and KV cache optimization, which are highly relevant for production systems.
  • Real-World Workload Insights: The analysis of production workloads (e.g., variable traffic patterns, token length distributions) provides actionable insights for system design, enhancing the paper's practical impact.
  • Integrated Framework: The unified approach combining dynamic scaling, heterogeneous hardware support, and benchmarking tools offers a holistic solution for LLM serving, filling a gap between academic research and industrial deployment.
  • Evaluation Infrastructure: The proposed workload generation and replay framework enables rigorous testing of LLM-serving systems under realistic conditions, advancing reproducibility and iterative improvement.

Opportunities for Improvement

  • Experimental Depth: The evaluation lacks detailed comparisons with baseline systems (e.g., vLLM, KServe) and ablation studies to isolate the impact of individual components (e.g., KV cache offloading vs. autoscaling).
  • Scalability Discussion: The paper could elaborate on AICraft's performance in large-scale, multi-node deployments and its limitations in handling extreme-scale models (e.g., 100B+ parameters).
  • Reproducibility: Workload datasets and profiling tools are based on internal production systems; releasing anonymized datasets or benchmarks would strengthen reproducibility.
  • Related Work: Recent advancements in LLM serving (e.g., SOSP 2023's "Cachegen" [25], EuroSys 2024's "Splitwise" [36]) should be explicitly compared to position AICraft within the broader research landscape.

Comments for Authors

  • Design Trade-offs: Clarify why certain design choices (e.g., offline GPU profiling for autoscaling) were prioritized over online adaptive methods. How does AICraft handle unexpected workload shifts?
  • KV Cache Offloading: Detail the interaction between the distributed KV cache (e.g., InfiniStore [3]) and inference engines. How does network latency affect performance?
  • Multi-LoRA Evaluation: Include experiments demonstrating the overhead of managing thousands of LoRA adapters (cf. SOSP 2023's "S-LoRA" [46]).
  • Failure Scenarios: Discuss fault tolerance mechanisms (e.g., pod failures, network partitions) and their impact on SLA compliance.

Questions for Authors Response

  1. How does AICraft ensure fairness in multi-tenant scenarios with competing LoRA workloads?
  2. What is the runtime overhead of the evaluation framework (e.g., benchmark client instrumentation)?
  3. Why was a concurrency-based autoscaler (knative-pod-autoscaler) chosen over learning-based approaches (e.g., [11])?
  4. How does AICraft adapt to models with non-uniform memory access patterns (e.g., mixture-of-experts)?
  5. Are there plans to open-source AICraft or its benchmarking tools to foster community adoption?

Overall Recommendation

The paper presents a timely and practical contribution to LLM-serving infrastructure, with strong engineering insights and real-world relevance. However, the experimental section and comparison to prior work need strengthening to fully validate its claims.
Score: 3 / 5 (Weak Accept)


BlockVector: Unlocking the Potential of Dense Tensor Cores for Accelerating SpMM with N:M Sparsity

TC: Tensor Core; DETC: Dense Tensor Cores. Why is the measured speedup on BERT (20–30×) so much larger than on DLMC (2–4×)? The matrix sizes differ. Why is accuracy only evaluated on Long Range Arena (LRA) and not on other benchmarks?

Review

overall merits: 2 expertise: 1

Paper Summary

This paper presents BlockVector, an approach to accelerate SpMM under N:M structured sparsity using Dense Tensor Cores (DETCs). The authors point out inefficiencies in existing SpMM implementations, including metadata overhead and limited DETC utilization. The paper claims two main contributions: (1) a sparse format (Tiled-VW) intended to reduce metadata while supporting flexible N:M sparsity, and (2) several kernel-level optimizations (e.g., bank-conflict-free shared memory access and k-stage pipelining). The reported experiments show performance gains, with up to 57× speedup over current libraries such as Spatha and cuBLAS, primarily in large-scale and high-sparsity settings.

Comments for authors

Thank you for submitting your paper to EuroSys26. While the paper addresses an important problem of accelerating SpMM on GPUs, the current presentation and evaluation leave several questions unanswered. The proposed Tiled-VW format and GPU kernel techniques are described as improvements over existing approaches, but the novelty and generality of these optimizations are not convincingly demonstrated.

Specific concerns include:

  • The accuracy evaluation is limited to Long Range Arena (LRA) and shows unexpectedly high accuracy even with 90% sparsity. Broader evaluation on diverse benchmarks is needed to validate these claims.
  • The performance of Opt2 drops at 90% sparsity compared to 50% and 75% in Figure 17. Could the authors provide an analysis of the architectural or algorithmic factors behind this trend?
  • Symbols such as N, M, K, and V in Section 3.2 are not properly defined before they are used in the metadata memory usage discussion, making this section hard to follow.
  • The presentation of Figure 13 is unclear and makes it difficult to interpret and compare the methods. A redesign is needed to highlight key performance differences.
  • There is no clear comparison of Tiled-VW's metadata efficiency with standard unstructured formats (e.g., CSR) on irregular sparsity patterns. (A back-of-envelope sketch follows this list.)
  • Current optimizations seem targeted at large-scale matrices. Can strategies like batching or hybrid sparse formats be explored to enhance performance for smaller matrices?
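To make the metadata question concrete, a back-of-envelope comparison is sketched below; the bit widths are assumptions and Tiled-VW's actual layout is not reproduced here:

```python
# Hedged sketch: N:M structured sparsity stores, per group of M values, the indices of the
# N kept values, while CSR stores one column index per nonzero plus a row-pointer array.
def nm_metadata_bits(rows, cols, n, m, bits_per_index=2):
    groups = rows * cols // m
    return groups * n * bits_per_index        # e.g. 2 bits locate a value within a group of 4

def csr_metadata_bits(rows, nnz, index_bits=32):
    return nnz * index_bits + (rows + 1) * index_bits   # column indices + row pointers

rows, cols = 4096, 4096
nnz = rows * cols // 2                        # 2:4 sparsity keeps half the values
print(nm_metadata_bits(rows, cols, n=2, m=4)) # structured metadata
print(csr_metadata_bits(rows, nnz))           # unstructured metadata at the same density
```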

Overall, the paper makes a solid design contribution to the field of GPU-accelerated sparse computation, but the evaluation is not comprehensive enough to justify the claimed benefits. Improving clarity, extending evaluation, and addressing the above questions would significantly strengthen its impact.

LLM review

Paper Summary

The paper proposes BlockVector, a novel approach to accelerate Sparse Matrix-Matrix Multiplication (SpMM) with N:M structured sparsity using Dense Tensor Cores (DETCs) on GPUs. The authors identify inefficiencies in existing SpMM implementations, such as metadata overhead and suboptimal utilization of DETCs, and address these via a memory-efficient sparse format (Tiled-VW) and kernel optimizations (e.g., bank-conflict-free shared memory access, pipelined computation). BlockVector demonstrates significant speedups (up to 57×) over state-of-the-art SpMM libraries like Spatha and cuBLAS, particularly for large-scale and high-sparsity scenarios. The work is relevant to applications in deep learning (e.g., LLM inference) where structured sparsity and efficient hardware utilization are critical.

Strengths

  • Novelty: The Tiled-VW format is a unique contribution that reduces metadata overhead while enabling flexible N:M sparsity ratios on DETCs. This contrasts with prior formats (e.g., V:N:M) that either incur high metadata costs or are restricted to specific sparsity patterns.
  • Technical Depth: The analysis of performance gaps in existing SpMM implementations (e.g., metadata overhead, DETC vs. SPTC trade-offs) is rigorous and insightful. The proposed optimizations (e.g., address offsetting, k-stage pipelining) are well-justified and validated.
  • Impact: The demonstrated speedups (e.g., 28× on RTX 3090 Ti) and memory efficiency gains (e.g., 67% reduction in usage) are substantial for LLM inference and other SpMM-heavy applications.
  • Clarity: The paper is well-written, with clear problem formulation, technical explanations, and evaluation results. Figures (e.g., Figure 8, Figure 17) effectively illustrate key concepts.

Opportunities for Improvement

  • Comparisons with Prior Work: The exclusion of Jigsaw from experiments (due to lack of open-source code) weakens the evaluation. Including synthetic benchmarks or discussing functional differences could address this.
  • Hardware-Specific Analysis: More details on how BlockVector adapts to different GPU architectures (e.g., Ampere vs. Turing) would strengthen generalizability claims. For instance, why does VectorSparse perform better on RTX 2080 Ti?
  • Limitations Discussion: The performance drop on small matrices (e.g., DLMC dataset) is acknowledged but not fully addressed. A mitigation strategy (e.g., dynamic tile size selection) could improve practicality.
  • Pipeline Optimization Details: The impact of pipeline stages (k-value selection) on overlap efficiency and thread-level parallelism could benefit from deeper analysis, including trade-offs with shared memory usage.

Comments for Authors

  • Jigsaw Comparison: Clarify whether Jigsaw's non-inclusion affects the validity of comparisons. If testing is infeasible, discuss architectural differences (e.g., SPTC vs. DETC focus) to contextualize BlockVector's advantages.
  • Hardware Adaptability: Expand on how BlockVector leverages DETC-specific features (e.g., MMA instructions, memory hierarchy) and whether it can be ported to other architectures (e.g., AMD GPUs).
  • Small-Matrix Performance: Propose strategies to improve performance on smaller matrices (e.g., batching, hybrid formats) to broaden applicability.
  • Pipeline Overhead: Quantify the shared memory overhead of the k-stage pipeline and its impact on thread-block parallelism. Consider providing guidelines for selecting k in different scenarios.

Questions for Authors Response

  1. How does Tiled-VW's metadata efficiency compare to unstructured sparsity formats (e.g., CSR) when applied to irregular sparsity patterns?
  2. The paper attributes DETC's superiority to better memory bandwidth utilization. Could the authors provide hardware-level metrics (e.g., memory throughput, cache utilization) to validate this claim?
  3. Why were batch sizes 8–32 chosen for the OPT-30B evaluation? How does BlockVector scale with larger batches typical in production LLMs?
  4. In Figure 17, the speedup from Opt2 decreases at 90% sparsity. What architectural or algorithmic factors cause this trend?

Overall Recommendation

The paper presents a technically sound and impactful solution to a critical problem in GPU-accelerated machine learning. While the evaluation could be more comprehensive, the contributions to sparse formats and DETC optimization are significant. Addressing the raised questions and limitations would strengthen the work.

Score: 3 / 5 (Weak Accept)


SlimPack: Minimizing Workload Imbalance in Variable-Length LLM Training with Slice-Level Packing

optimized for variable-length LLM training scenarios

Coarse-grained sample-level packing easily leads to workload imbalance, forward/backward computational asymmetry, and pipeline-parallelism bubbles.

How does it differ from FlexSP, and what are its advantages? After slicing, the KV cache has to be maintained during training; how large is the memory overhead?

Review

overall merits: 2 expertise: 2

Paper Summary

This paper presents SlimPack, a training framework designed to mitigate workload imbalance in LLM training caused by variable-length sequences. SlimPack introduces slice-level packing, which replaces conventional sample-level packing to achieve finer-grained workload distribution across microbatches. To address computational asymmetry between forward and backward passes (e.g., the 2.5× higher backward cost in FlashAttention), it proposes asymmetric partitioning, enabling dynamic balancing between stages. Additionally, SlimPack incorporates a DAG-based simulator that models pipeline bubbles and optimizes hybrid parallel configurations. The authors evaluate SlimPack on four LLM sizes (7B–150B) and three datasets (CommonCrawl, GitHub, Wikipedia) with context lengths up to 256K tokens, reporting speedups of 1.07× to 2.8× compared to Megatron-LM.
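For my own understanding, a minimal sketch of the slice-level packing idea is given below; it is hypothetical, the greedy longest-first heuristic is an assumption, and SlimPack's actual packing and asymmetric partitioning algorithms are not reproduced:

```python
# Hedged sketch: split each variable-length sample into fixed-size slices, then greedily place
# slices into the currently lightest microbatch so per-microbatch token counts stay balanced.
import heapq

def slice_level_pack(sample_lengths, slice_len, num_microbatches):
    slices = []
    for sample_id, length in enumerate(sample_lengths):
        for start in range(0, length, slice_len):
            slices.append((sample_id, min(slice_len, length - start)))
    heap = [(0, mb, []) for mb in range(num_microbatches)]      # (tokens, microbatch id, slices)
    heapq.heapify(heap)
    for s in sorted(slices, key=lambda x: -x[1]):                # longest slices first
        tokens, mb, assigned = heapq.heappop(heap)
        heapq.heappush(heap, (tokens + s[1], mb, assigned + [s]))
    return sorted(heap, key=lambda x: x[1])

print(slice_level_pack([9000, 1200, 300, 7000], slice_len=2048, num_microbatches=2))
```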

Comments for authors

Thank you for submitting your paper to EuroSys26. The work addresses an important bottleneck in variable-length LLM training by proposing slice-level packing and asymmetric partitioning to balance both forward/backward computation and pipeline bubbles. The DAG-based simulator for pipeline modeling is also a promising idea for optimizing hybrid parallelism.

However, the paper would benefit from further clarification and evaluation in the following aspects:

  • While the simulator's memory prediction accuracy (MAPE 1.6%) is impressive, the paper lacks an analysis of runtime overheads due to slice management (e.g., potential KV cache fragmentation or inefficiencies in slice-level autoregressive decoding).
  • FlexSP (ASPLOS'25) has also been evaluated on GitHub, CommonCrawl, and Wikipedia datasets, achieving latency reductions of 1.05×–1.98×. It remains unclear how SlimPack's design compares to or surpasses such state-of-the-art approaches. It would significantly strengthen the paper to include prior works (such as ByteScale, FlexSP, and WLB-LLM as discussed in the Introduction) as additional baselines for a more comprehensive and fair comparison.
  • Section 4.3 should provide more details on how asymmetric partitioning dynamically reorganizes slices, especially how backward repacking interacts with KV cache dependencies and whether this introduces synchronization overhead.
  • An ablation study is needed to quantify the contribution of each component (slice-level packing and asymmetric partitioning) to the overall performance improvements.
  • The paper claims that pipeline imbalance bubbles are “eliminated” via slice-level packing. What is the residual bubble overhead (e.g., percentage of total runtime) in the reported 2.8× speedup scenario? A more detailed breakdown would be helpful.

Overall, SlimPack presents interesting and practical ideas to improve training efficiency for variable-length sequences, but the evaluation and mechanistic insights need to be deepened for a stronger impact.

LLM review

Paper Summary

The paper proposes SlimPack, a training framework for large language models (LLMs) that addresses workload imbalance caused by variable-length sequences. SlimPack introduces slice-level packing to replace traditional sample-level packing, enabling finer-grained workload distribution across microbatches. It also introduces asymmetric partitioning to balance computational asymmetry between forward and backward passes (e.g., FlashAttention's 2.5× higher backward cost). A DAG-based simulator models pipeline imbalance bubbles and optimizes hybrid parallel configurations. The work is significant as LLM training scales to longer contexts and heterogeneous workloads, where existing methods suffer from inefficiencies.

Strengths

  • Relevance: Tackles a critical problem (workload imbalance in variable-length LLM training) with practical implications for modern systems.
  • Novelty: Slice-level packing and asymmetric partitioning are distinct improvements over prior packing strategies (e.g., sample-level methods [13]), addressing both forward/backward asymmetry and pipeline bubbles.
  • Technical Depth: The DAG-based simulator rigorously models pipeline dynamics, including inter-slice dependencies and memory footprints, outperforming prior analytical models [14,36].
  • Evaluation: Comprehensive experiments across four LLM sizes (7B–150B), three datasets (CommonCrawl, GitHub, Wikipedia), and context lengths up to 256K tokens. Speedups of up to 2.8× over Megatron-LM demonstrate practical impact.
  • Clarity: The paper is well-structured, with clear technical explanations (e.g., micropack scheduling in Figure 5) and visualizations (e.g., workload imbalance comparisons in Figures 1, 12, and 13).

Opportunities for Improvement

  • Hardware Generality: Experiments are conducted on NVIDIA Hopper GPUs with NVLink. The approach's applicability to other hardware (e.g., TPUs, RDMA-based clusters) or lower-bandwidth interconnects requires discussion.
  • Memory Overhead: While the simulator's memory predictions are accurate (MAPE 1.6%), the paper lacks analysis of runtime overhead from slice management (e.g., KV cache fragmentation in slice-level autoregressive decoding).
  • Limitations: The paper does not address scenarios where slice-level packing may underperform, such as uniform sequence lengths or extreme sequence sparsity.

Comments for Authors

  • Mechanistic Explanation: Section 4.3 should elaborate on how asymmetric partitioning dynamically regroups slices. For example, how backward repacking interacts with KV cache dependencies.
  • Baselines: Include comparisons to dynamic sequence parallelism [9,31] and context parallelism optimizations [3,18] to isolate SlimPack's unique contributions.
  • Ablation Studies: Add ablation results for individual components (e.g., slice-level packing vs. asymmetric partitioning vs. simulator-guided configurations) to quantify their relative impact.
  • Computational Asymmetry: Expand discussion on asymmetric partitioning (e.g., quantify GEMM vs. FlashAttention costs in Figure 4 with concrete numbers from experiments).

Questions for Authors Response

  1. How does SlimPack interact with attention variants like grouped-query attention (GQA) [1] or sparse attention? Could such optimizations affect slice-level workload balance?
  2. The paper claims pipeline imbalance bubbles are "eliminated" with slice-level packing. What is the residual bubble overhead (e.g., percentage of total runtime) in the 2.8× speedup case?
  3. How sensitive is the simulator to inaccuracies in operator profiling (e.g., FLOPs/watt measurements)? What is the error margin for end-to-end predictions at scale?
  4. The GitHub dataset shows declining TPS/GPU for larger models (Figure 11). Does this reflect inherent computational properties (e.g., attention overhead) or implementation bottlenecks?

Overall Recommendation

Score: 4 / 5
The paper presents a technically strong solution to a timely problem with compelling experimental results. While the evaluation and novelty are robust, addressing comparisons to concurrent work, hardware generality, and mechanistic details would strengthen the contribution. Suitable for acceptance with minor revisions.


Extra:

Novemton: Expressing Tiled Computations with Tensor-Oriented Metaprogramming

Review

Summary

Comments for authors

The figures need substantial polish: in Figure 5, for example, the legend and label text should be larger, and the colors are too low-contrast for readers to differentiate. The paper's organization also needs rework; Section 3, for instance, still devotes a lot of space to features of prior works like Triton and Graphene and to what was inspired by them. There are many typos; for example, "A visualization of the meta-operations listed in Table 1 is provided in Figure 5." should actually refer to Figure 3. The evaluation is insufficient: Section 5.3.7 claims 0.06 token/s higher throughput compared to Triton, but there is no figure to support that. What is the extensibility of Novemton? For example, when running on multiple GPUs, how does Novemton handle communication?

Paper Summary

This paper presents Novemton, a domain-specific language designed to simplify GPU kernel development for deep learning. Unlike existing DSLs like Triton that still expose low-level parallel constructs (e.g., pointer arithmetic), Novemton lets developers write high-level serial-style code using symbolic tensors and meta-operations like tile and flatten. These are compiled into efficient parallel code via a code generator. In addition, it introduces a new arrange-and-apply paradigm that separates tensor layout from computation logic.
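As a reading aid, the tile/flatten meta-operations can be thought of as reshapes over block coordinates; the numpy analogy below is illustrative only and is not Novemton syntax:

```python
# Hedged analogy: "tile" partitions a logical tensor into blocks, "flatten" linearizes the
# block grid so blocks can be enumerated or mapped to program instances.
import numpy as np

A = np.arange(16 * 16).reshape(16, 16)
BM = BN = 8
tiled = A.reshape(16 // BM, BM, 16 // BN, BN).transpose(0, 2, 1, 3)  # a 2x2 grid of 8x8 blocks
flat_blocks = tiled.reshape(-1, BM, BN)                              # 4 blocks, each 8x8
print(flat_blocks.shape)                                             # (4, 8, 8)
```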

Comments for authors

Thank you for submitting your paper. This paper presents Novemton, which attempts to improve the programming interface for compute kernel development by enabling high-level serial code to be transformed into efficient parallel code. The core ideas—symbolic tensors, tensor-oriented metaprogramming, and arrange-apply separation—are potentially valuable for improving kernel modularity and developer productivity.

However, the current version of the paper has substantial shortcomings in terms of clarity, novelty, and evaluation. Below are detailed comments for improvement:

  1. Presentation and Writing Quality
  • The overall paper reads more like a developer note or internal tool documentation than a polished academic systems paper.
  • There are numerous typos and inconsistencies, e.g., Figure 3 is referred to incorrectly as Figure 5 in section 3.1.
  • Several key figures (e.g., Figure 5) are visually weak: small fonts and poor color contrast make them hard to interpret.
  2. System Novelty and Comparison to Prior Work
  • Novemton's innovation seems incremental—primarily repackaging Triton+Graphene-like ideas with a Pythonic arrange-and-apply pattern and some syntactic sugar.
  • The paper lacks a compelling systems insight or new technique that significantly advances the state of the art.
  3. Evaluation Weaknesses
  • The performance evaluation is unconvincing. Section 5.3.7 claims a 0.06 token/s improvement over Triton, but this is not supported by any figure or visualization.
  • The performance comparison to PyTorch and Triton in just 10 operators is insufficient. For example, multi-GPU scalability or end-to-end model training performance is not explored.
  4. Overclaims and Lack of Justification
  • The paper repeatedly claims that Novemton achieves "comparable performance to Triton" but does not justify why this is the case from a compiler or kernel-generation perspective. Is it using Triton as the backend? If so, the performance equivalence is unsurprising.
  • The DSL appears more like a thin frontend wrapper over Triton. If Novemton simply emits Triton code, then the novelty lies purely in syntax/API design—not in compiler, runtime, or systems contributions.

Overall, this submission is currently not ready to be accepted. The incremental contribution, insufficient evaluation, and poor paper construction all hinder its impact. With a clearer focus, stronger empirical validation, and significantly improved writing and presentation, this work could be better positioned for a more domain-specific venue or as a tool paper.

ZeRO-Libra: Towards Zero-Stall Offloading with Importance-Awareness for LLM Fine-Tuning

Fig. 2 is very clear. In the fine-tuning-with-offloading setting, there is a tension between the fast-running GPU and the slow optimizer update on the CPU; accumulating on the CPU reduces the resulting bubble. This work goes further by selecting the important params to keep on the GPU for immediate updates, while the rarely-updated params (Fig. 4 shows they barely change) have their gradients offloaded to the CPU and accumulated into a large batch before updating, which further mitigates the problem. Offloading the optimizer states to the CPU memory introduces substantial communication overhead. The problem with picking top-k: in a distributed setting, finding the global top-k requires a lot of communication; the authors analyze the spatial/temporal locality of the top-k params and find that across steps the top-k set is essentially fixed, which resolves this issue. Why is the accuracy bias noticeably larger on MNLI? Already answered in the paper: S=4 makes the granularity too coarse on models with fewer parameters.
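A minimal sketch of the importance-aware split described above, for my own reference; it is illustrative, and keep_ratio, the norm-based scoring, and the accumulation buffer are assumptions rather than ZeRO-Libra's actual code:

```python
# Hedged sketch: parameters whose current gradient norm is in the top slice stay on the GPU
# and are updated every step; the rest have gradients offloaded and accumulated for a delayed
# CPU update.
import torch

def split_by_importance(named_grads, keep_ratio=0.01):
    scored = sorted(named_grads.items(), key=lambda kv: kv[1].norm().item(), reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    hot = dict(scored[:k])    # updated on the GPU immediately
    cold = dict(scored[k:])   # offloaded to the CPU and accumulated across steps
    return hot, cold

cpu_accum = {}

def offload_and_accumulate(cold_grads):
    for name, g in cold_grads.items():
        cpu_accum[name] = cpu_accum.get(name, 0) + g.detach().to("cpu", non_blocking=True)
    # once enough steps have accumulated, the CPU optimizer applies one large delayed update

# usage: named_grads = {n: p.grad for n, p in model.named_parameters() if p.grad is not None}
```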

Review

overall merits: 3 expertise: 2

Paper Summary

This paper presents ZeRO-Libra, an importance-aware offloading framework eliminating GPU stalls during LLM fine-tuning. The authors observe that existing systems (e.g., ZeRO-Offload) uniformly offload parameters despite severe gradient imbalance (top 1% gradients contribute >90% of total norm). ZeRO-Libra prioritizes critical gradients by retaining them on-GPU for immediate updates while asynchronously offloading less important gradients to CPU. Leveraging spatial/temporal locality in critical gradients, it avoids costly global synchronization. Evaluations demonstrate up to 5× speedup, 2× lower PCIe traffic, and 85% fewer GPU stalls with maintained accuracy.

Comments for authors

Thank you for submitting to EuroSys 26. ZeRO-Libra offers a valuable solution to GPU stalls in LLM fine-tuning.

The spatial/temporal locality insight for critical gradients is compelling, well-supported by Figs. 4–6. Fig. 2 effectively contrasts offloading strategies, highlighting ZeRO-Libra's improvements. Problem motivation is strong, with CPU updates (4,600ms) dwarfing GPU compute (2,000ms). The zero-stall pipeline elegantly overlaps CPU-GPU operations, and convergence analysis (Sec 3.4) quantifies staleness penalties.

However, the accuracy dip in smaller models (OPT-350M) due to fixed S=4 warrants deeper analysis across architectures/datasets. The "preserving accuracy" claim requires qualification given minor drops in Fig. 10. Extending evaluations to real-world scenarios beyond GLUE would strengthen impact. Additionally, does spatial locality (Fig. 5) hold across architectures, including transformers with varied attention mechanisms?

Overall, a strong contribution; minor revisions recommended for acceptance.


Request Dropping is Not Always Bad: Enhancing Goodput for Inference Pipeline with Bi-directional Adaptive Request Dropping

Review

overall merits: 2 expertise: 2

Estimate the E2E time of a request so it can be rejected at the right point, avoiding rejecting too early or too late. The paper uses the term "bi-directional request dropping" at length but never gives a clear definition, which is confusing. Since making better request-dropping decisions relies on the monitor's historical information, how does that history cope with bursts? Conversely, for a stable workload whose future scheduling cost can be predicted from the monitor, we should choose to scale out rather than drop requests.

Paper Summary

This paper presents BARD, a DNN inference system designed to improve goodput by proactively dropping requests in multi-DNN pipelines to meet latency objectives. Existing systems use reactive dropping, which often results in "drop too late" or "drop the wrong set" issues that hurt goodput. BARD proposes bi-directional request dropping and adaptive request priority to decide when and which requests to drop at each pipeline stage. The evaluation shows BARD achieving 21%–176% higher goodput than state-of-the-art systems while lowering drop rates.
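For reference, the kind of early-drop check this implies is sketched below; it is illustrative only, the per-stage estimate fields are assumptions, and BARD's actual E2E estimator under batching and queueing is exactly what the comments below ask the authors to detail:

```python
# Hedged sketch: at each pipeline stage, compare elapsed time plus estimated remaining stage
# latencies (queueing + batched execution) against the request's SLO, and drop proactively if
# the request cannot finish in time.
def should_drop(elapsed_ms, remaining_stages, slo_ms):
    estimated_remaining = sum(s["queue_ms"] + s["exec_ms"] for s in remaining_stages)
    return elapsed_ms + estimated_remaining > slo_ms

stages = [{"queue_ms": 12.0, "exec_ms": 8.5}, {"queue_ms": 30.0, "exec_ms": 15.0}]
print(should_drop(elapsed_ms=60.0, remaining_stages=stages, slo_ms=100.0))  # True -> drop now
```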

Comments for authors

Thank you for submitting your paper to EuroSys'26. The effort to fully model request latency across multi-DNN pipelines is interesting, but the paper has several shortcomings that limit its contribution.

  • The paper does not convincingly argue why proactive dropping is preferable to resource scaling in bursty workloads. The key idea relies on monitored history to decide which requests to drop; if that prediction can be made precise, why not scale resources instead?
  • The core concept of "bi-directional request dropping" is never precisely defined. Readers are left to infer its meaning, making the innovation unclear.
  • The methodology for estimating end-to-end latency is insufficiently explained. How BARD achieves accurate predictions under batching, queueing, and dynamic conditions remains unclear.

Overall, while the paper identifies a promising problem, the lack of clear definitions, missing technical detail, and weak justification against alternatives reduce its impact. Clearer articulation of core concepts, a stronger E2E estimation methodology, and a deeper comparison with scaling and related systems would significantly improve the work.


FastDC: Fast KV Dimensionality Compression for Efficient LLM Inference

overall merits: 2 expertise: 3

Good:

  • Open-sourced FastDC code is a big bonus!

Bad:

  • How does FastDC compare to MLA [1], which also targets KV cache reduction and handles the RoPE problem by adding q^R and k^R to carry rotation information?
  • The current paper organization is hard to follow, especially for a systems conference. I encourage the authors to restructure the paper with a clearer statement of the challenges and the insights that enable compression after RoPE.

I believe this is solid work, with open-sourced code and detailed algorithm analysis, but the presentation is currently a little hard to follow. I think that after reorganization this paper would be easy to accept.

[1] https://arxiv.org/pdf/2405.04434

Review

overall merits: 2 expertise: 3

Paper Summary

This paper presents FastDC, a system for efficient KV cache compression in LLM inference. Unlike prior work (e.g., Palu), which suffers from decompression overhead and limited use of compressed KV with RoPE, FastDC applies post-RoPE compression, supports adaptive rates across heads/layers, and optimizes the attention kernel for workload balance. Experiments show up to 64% faster job completion and nearly 2× throughput while preserving ~99% accuracy. The implementation is open-sourced.
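For my own understanding, a sketch of post-RoPE low-rank key compression is given below; it is illustrative, FastDC's adaptive per-head ranks and kernel optimizations are not reproduced, and the SVD-fitted projection is just one way such a projection could be built:

```python
# Hedged sketch: fit a projection on calibration keys *after* RoPE, cache only the projected
# keys, and score queries directly in the compressed space (no decompression step).
import numpy as np

d, d_c, T = 128, 32, 2048
rng = np.random.default_rng(0)
K_calib = rng.standard_normal((4096, d))          # stand-in for post-RoPE calibration keys
_, _, Vt = np.linalg.svd(K_calib, full_matrices=False)
P = Vt[:d_c].T                                    # d x d_c projection onto top singular directions

K_rope = rng.standard_normal((T, d))              # keys after RoPE at inference time
K_c = K_rope @ P                                  # cache stores T x d_c instead of T x d
q = rng.standard_normal(d)
scores_exact = q @ K_rope.T
scores_approx = (q @ P) @ K_c.T                   # equals q @ (K_c @ P.T).T
print(scores_exact.shape, scores_approx.shape)
```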

Comments for authors

Thank you for submitting your paper to EuroSys'26. This work addresses an important bottleneck in LLM inference by improving KV cache compression. The algorithmic design is well-motivated, and the open-sourced implementation makes the contribution practical and impactful.

However, I have several concerns:

  • The evaluation focuses on Palu but does not sufficiently compare with MLA [1], which also reduces KV cache size and handles RoPE. A direct comparison with MLA is critical to clarify novelty and benefits.
  • The presentation is dense and mathematically heavy, making it harder for a systems audience to follow. A clearer narrative that emphasizes (1) the core challenge, (2) the main insights (e.g., post-RoPE compression, common rotation matrix), and (3) system-level contributions and trade-offs would significantly improve readability and impact.

Overall, this is a strong technical work with promising results, but the current presentation reduces its clarity and positioning. Addressing these issues would greatly strengthen the paper.

[1] https://arxiv.org/pdf/2405.04434

AccountAnalysis: A Distributed and Parallel Framework for Comprehensive Graph Processing in Account-based Blockchains

Review

overall merits: 2 expertise: 1

Paper Summary

This paper presents AccountAnalysis, a distributed framework for analyzing account-based blockchains. It constructs a heterogeneous graph that captures different transaction types and uses parallelized methods for data collection, parsing, and storage. The system supports ecological and similarity analyses, with experiments on Ethereum and Binance Smart Chain showing efficiency improvements over baselines.

Comments for authors

Although this paper was assigned to me on short notice, I have tried to cover some basic opinions in my review.

Strengths:

  • The heterogeneous graph abstraction is a useful way to capture diverse account behaviors.

Weaknesses:

  • The design combines well-known techniques (bulk processing, parallel workers, subgraph merging, graph databases) with incremental optimizations, lacking fundamentally new ideas.
  • The paper reads like an engineering report. It is not clear what new insights or principles advance the state of distributed systems or data management.
  • The writing is descriptive and unfocused, making it hard to see the core contributions.

Overall, the work demonstrates solid engineering but lacks novelty and depth expected at EuroSys. A clearer articulation of contributions and broader evaluation would strengthen the paper.

Automating Conflict-Aware ACL Configurations with Natural Language Intents

Review

overall merits: 2 expertise: 1

Paper Summary

The paper presents Lena, a system that automates ACL configuration from natural language intents. It uses LLMs with network-specific knowledge to generate rules, detect and resolve conflicts, and optimize deployment to reduce redundancy. Key techniques include a mapping table for accurate attribute reasoning, “truly matched flows” for precise conflict detection, and optimization for efficient deployment. Evaluations on campus to large-scale networks show improved accuracy, conflict handling, and deployment efficiency.
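As a toy illustration of the conflict problem (not Lena's "truly matched flows" analysis, which is more general): a later ACL rule is shadowed when its whole match space is already covered by an earlier rule with a different action, so it never truly matches any flow. The rule tuples and prefixes below are hypothetical.

```python
# Hedged toy example of one conflict class (shadowing) between two ACL rules.
import ipaddress

def is_shadowed(earlier, later):
    e_src, e_dst, e_act = earlier
    l_src, l_dst, l_act = later
    covered = ipaddress.ip_network(l_src).subnet_of(ipaddress.ip_network(e_src)) and \
              ipaddress.ip_network(l_dst).subnet_of(ipaddress.ip_network(e_dst))
    return covered and e_act != l_act

acl = [("10.0.0.0/8", "0.0.0.0/0", "deny"),
       ("10.1.2.0/24", "0.0.0.0/0", "permit")]
print(is_shadowed(acl[0], acl[1]))   # True: the permit rule can never take effect
```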

Comments for authors

Although this paper was assigned to me on short notice, I have tried to cover some basic opinions in my review.

Strengths:

  • Conflict detection contributions (e.g., “truly matched flows”) and deployment optimization insights are interesting and potentially impactful.
  • Evaluation covers multiple network scales, including large topologies.

Weaknesses:

  • The system heavily relies on LLMs. Robustness under noisy or adversarial intents is not evaluated, and real-world operator inputs may be more ambiguous than tested.
  • The paper is overly detailed in some areas (prompting strategies, optimization formulations) yet lacks higher-level insights on integration, scalability, and usability.