What is RDMA over Converged Ethernet (RoCE)?
Benjamin Ryzman
on 9 June 2026
Tags: AI, AI Factory, Data center networking, HPC, inference, network
Previous articles walked through RDMA (Remote Direct Memory Access) as a programming model and InfiniBand as the fabric that was built around it. Both led to the same conclusion, even if it was never stated outright: moving data, not compute, becomes the bottleneck once systems scale.
So what happens when you want RDMA, but you’re already running an Ethernet network you’re not keen to replace? That’s usually where RDMA over Converged Ethernet (RoCE) enters the conversation.
At first, it sounds straightforward. Keep the RDMA semantics, keep the verbs model (the low-level RDMA API used to post send/receive and memory operations), just run it over Ethernet. In reality, it works a bit like fitting a racing engine into a standard road car. You can do it, but the rest of the system has to be able to keep up. In this article, we’ll explore what RoCE is, how it works, and when to use it.
What is RoCE?
RDMA over Converged Ethernet (RoCE) runs the RDMA programming model over standard Ethernet networks. Applications use the same verbs interface to read and write directly to remote memory, bypassing the kernel and minimizing CPU involvement. The change is in the transport: instead of a purpose-built InfiniBand fabric, RDMA operations are carried over Ethernet.
RoCE exists in two variants. RoCEv1 operates within a single Layer 2 broadcast domain. RoCEv2 encapsulates RDMA traffic in UDP/IP, which makes it routable across Layer 3 networks. In practice, most deployments standardize on RoCEv2 because it fits leaf–spine designs, network segmentation, and multi-rack scale.
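To make the encapsulation concrete, here is a minimal Python sketch of the RoCEv2 framing constants and the 12-byte InfiniBand Base Transport Header (BTH) that follows the UDP header. The function names, the default partition key, and the header allowance used for MTU selection are illustrative assumptions, not taken from any driver. The opcode in the usage note (0x0A for an RC "RDMA WRITE Only" packet) reflects our reading of the InfiniBand opcode table and should be checked against the specification.

```python
import struct

ROCEV1_ETHERTYPE = 0x8915  # RoCEv1: RDMA directly over Ethernet (Layer 2 only)
ROCEV2_UDP_PORT = 4791     # RoCEv2: UDP destination port assigned to RoCEv2

# Fixed per-packet sizes in bytes (no VLAN tag, no IP options).
IPV4_HEADER = 20
UDP_HEADER = 8
BTH_HEADER = 12    # InfiniBand Base Transport Header
ICRC_TRAILER = 4   # invariant CRC that follows the payload

# RDMA path MTU is one of a few enumerated payload sizes.
RDMA_PATH_MTUS = (256, 512, 1024, 2048, 4096)


def pack_bth(opcode, dest_qp, psn, pkey=0xFFFF, ack_req=False):
    """Pack a 12-byte BTH in network byte order."""
    flags = 0  # SE, MigReq, PadCnt and TVer are left at zero in this sketch
    qp_word = dest_qp & 0xFFFFFF  # upper 8 bits stay zero (reserved bits)
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


def largest_path_mtu(ip_mtu,
                     header_allowance=IPV4_HEADER + UDP_HEADER
                     + BTH_HEADER + ICRC_TRAILER):
    """Largest enumerated RDMA payload size that fits in one IP packet.

    Assumption: real drivers reserve more room for optional headers,
    so treat the default allowance as a lower bound.
    """
    fitting = [m for m in RDMA_PATH_MTUS if m + header_allowance <= ip_mtu]
    if not fitting:
        raise ValueError("IP MTU too small for any RDMA path MTU")
    return max(fitting)
```

For example, `pack_bth(0x0A, dest_qp=0x123, psn=42)` returns 12 bytes, and `largest_path_mtu(1500)` returns 1024. That is why a default 1500-byte Ethernet MTU typically yields a 1024-byte RDMA path MTU, while jumbo frames allow 4096.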
While the programming model does not change, the network behavior does. Ethernet does not guarantee lossless delivery by default, so RoCE relies on additional mechanisms to control congestion and avoid drops. That design choice shifts responsibility from the application to the network. Performance and predictability now depend on how the Ethernet fabric is designed, configured, and operated.
This is the key idea to keep in mind. RDMA is the model. RoCE is one way to implement it on top of Ethernet.
Where RoCE fits
RoCE is most relevant in environments where Ethernet is already the dominant networking model and a separate fabric would increase operational complexity. The cost is not only the additional infrastructure: a second fabric also brings a different networking model, tooling, and operational expertise that teams may not already have. RoCE allows RDMA to be introduced incrementally, without changing how the network is provisioned or managed at a high level.
In practice, this shows up in distributed storage systems, database clusters, and accelerator-driven workloads that are deployed on top of standard Ethernet infrastructure. In these environments, the ability to reuse the existing network is often as important as the performance characteristics themselves. In many cases, this direction reflects a practical compromise: pushing beyond application performance limits while avoiding a full redesign of the network stack from scratch.
Ethernet versus InfiniBand behavior
To understand why RoCE behaves differently in practice, it helps to compare the two worlds it bridges: InfiniBand, where RDMA was designed to run, and Ethernet, where it is being adapted.
InfiniBand enforces lossless communication as part of the fabric. Flow control and congestion management are integrated into the transport, which keeps latency stable under load.
Ethernet follows a different model, where packet loss is expected and recovery is handled by higher layers. That choice is not accidental; it comes from how Ethernet evolved. Early Ethernet was designed as a shared medium, with many hosts contending for the same wire. Simplicity and cost mattered more than strict delivery guarantees, so the network pushed complexity up the stack. If a frame collided or a buffer overflowed, it was cheaper to drop it and let higher layers retry than to coordinate every sender and receiver in real time.
That design scaled well as speeds increased and switching replaced hubs. The network stayed simple and fast, while protocols like TCP took on responsibility for reliability, ordering, and congestion control. It works well for web traffic, storage over IP, and most enterprise workloads, where a lost packet can be retransmitted without much consequence.
RDMA changes that assumption. It expects the network to behave predictably and avoid drops altogether, because retransmissions break latency guarantees and disrupt fine-grained communication patterns. RoCE therefore has to retrofit a set of loss-avoidance mechanisms onto a transport that was never designed to be strictly lossless, so that Ethernet approximates lossless behavior for RDMA traffic.
At this point, it helps to step back and ask why these additional mechanisms are needed in the first place. Ethernet did not suddenly become problematic; the workload changed.
Why RDMA stresses Ethernet
The key challenge with RoCE is that RDMA expects a predictable, near-lossless transport, while Ethernet was originally designed around best-effort delivery. Most of the additional mechanisms around RoCE exist to bridge that gap.
RDMA traffic patterns tend to be highly synchronized and bursty. Distributed training jobs and tightly coupled HPC workloads often trigger collective operations where many nodes exchange data simultaneously. This creates what operators call “incast”: multiple senders targeting the same receiver or the same set of links at once. Switch buffers fill rapidly, and in a traditional Ethernet network that would simply lead to packet drops.
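A back-of-the-envelope calculation shows how quickly incast exhausts a buffer. The sketch below assumes every sender transmits at full line rate toward one egress port of the same speed, and that the whole buffer is available to that port. Both are simplifications, and the function name and example figures are hypothetical.

```python
def incast_overflow_time_us(senders, link_gbps, buffer_bytes):
    """Microseconds until the buffer fills when `senders` hosts at line rate
    converge on a single egress port of the same speed."""
    # The egress port drains one link's worth; the rest accumulates.
    excess_bps = (senders - 1) * link_gbps * 1e9
    if excess_bps <= 0:
        return float("inf")  # a single sender never overruns the port
    return buffer_bytes * 8 / excess_bps * 1e6


# 8 senders at 100 Gbit/s into one port, 16 MiB of buffer (assumed figures)
print(round(incast_overflow_time_us(8, 100, 16 * 2**20), 1))  # → 191.7
```

Under these assumptions the buffer is gone in under 200 microseconds, which is why drop avoidance has to react at hardware timescales rather than through software retransmission.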
TCP tolerates that behavior because retransmission is built into the transport. RDMA does not. A single dropped packet can stall a queue pair and introduce latency spikes that ripple through the entire workload.
Priority flow control (PFC) was introduced to avoid drops by pausing traffic before queues overflow. The problem is that the pause applies to the ingress port and priority class, not only to the congested flow itself. Unrelated traffic can therefore be blocked as well. This phenomenon, known as head-of-line blocking, is one of the reasons RoCE fabrics can become unstable under congestion.
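The head-of-line blocking effect can be reproduced with a toy model. In this sketch, one upstream link carries two flows in the same priority class: a "hot" flow toward a congested egress port and a "victim" flow toward an idle one. The thresholds, the tick-based timing, and all names are illustrative assumptions and not a model of any real switch.

```python
def victim_frames_delivered(ticks, hot_drain_every, xoff=8, xon=4):
    """Frames of the victim flow delivered over `ticks` time steps.

    Each tick the upstream link delivers one hot frame and one victim frame,
    unless the downstream switch has paused that priority. The victim's
    egress drains one frame per tick; the hot egress drains one frame every
    `hot_drain_every` ticks.
    """
    queue_hot = queue_victim = 0
    paused = False
    delivered = 0
    for t in range(ticks):
        if not paused:  # a pause stops the whole priority, not one flow
            queue_hot += 1
            queue_victim += 1
        if queue_victim > 0:
            queue_victim -= 1
            delivered += 1
        if queue_hot > 0 and t % hot_drain_every == 0:
            queue_hot -= 1
        occupancy = queue_hot + queue_victim
        if occupancy >= xoff:
            paused = True    # XOFF: ask the upstream port to stop sending
        elif occupancy <= xon:
            paused = False   # XON: allow it to resume
    return delivered


baseline = victim_frames_delivered(1000, hot_drain_every=1)
congested = victim_frames_delivered(1000, hot_drain_every=4)
print(baseline, congested)
```

In the uncongested run the victim is delivered on every tick. Once the hot egress slows down, the victim loses throughput even though its own egress port is idle, because the pause applies to the shared priority on the upstream link.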
Load balancing creates additional pressure. Ethernet fabrics typically rely on equal-cost multipath (ECMP) hashing to spread flows across multiple paths, but synchronized AI and HPC traffic patterns do not distribute evenly. Large flows can land on the same links while other paths remain underutilized, concentrating congestion inside a subset of the fabric.
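A short sketch illustrates why a handful of large flows rarely spread evenly. It uses CRC32 over a 5-tuple as a stand-in for the switch hash, which is an assumption: real switches use vendor-specific hash functions and field selections. The example addresses and ports are made up.

```python
import zlib
from collections import Counter


def ecmp_path(flow, num_paths):
    """Pick a path index from a 5-tuple, using CRC32 as a stand-in hash."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_paths


def spread(flows, num_paths):
    """Count how many flows land on each path."""
    counts = Counter(ecmp_path(flow, num_paths) for flow in flows)
    return [counts.get(path, 0) for path in range(num_paths)]


def prob_all_distinct(num_flows, num_paths):
    """Probability that uniformly hashed flows all land on different paths."""
    probability = 1.0
    for i in range(num_flows):
        probability *= max(num_paths - i, 0) / num_paths
    return probability


# Eight hypothetical RoCEv2 flows: (src IP, dst IP, protocol, src port, dst port)
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 17, 49152 + i, 4791) for i in range(8)]
print(spread(flows, 8))          # flows per path for this particular hash
print(prob_all_distinct(8, 8))   # chance of a perfectly even spread
```

With an ideal uniform hash, eight flows over eight paths end up perfectly balanced only about 0.24% of the time. With a few long-lived, high-bandwidth flows, some links will almost always carry two or more of them while others sit idle.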
Mechanisms such as explicit congestion notification (ECN), data center bridging (DCB), and later data center quantized congestion notification (DCQCN) were introduced to make this behavior manageable. ECN marks packets before queues overflow so endpoints can slow down gradually. DCB defines how traffic classes are isolated and prioritized across the fabric. DCQCN builds on those signals to regulate sender rates dynamically.
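The interaction between ECN marking and sender rate control can be sketched in a few lines. The marking function below follows the usual threshold-based scheme (no marking below a minimum queue depth, a linear ramp up to a maximum probability, then always mark). The sender follows the rate-decrease and fast-recovery rules as described in the published DCQCN algorithm. It is deliberately simplified: timers, byte counters, and the later increase phases are omitted, and the class name, method names, and parameter values are assumptions.

```python
def ecn_mark_probability(queue_kb, kmin_kb, kmax_kb, pmax):
    """Probability that a switch marks a packet at the given queue depth."""
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb > kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


class DcqcnSender:
    """Simplified DCQCN reaction point (sender side)."""

    def __init__(self, line_rate_gbps, g=1 / 256):
        self.current = line_rate_gbps  # current sending rate
        self.target = line_rate_gbps   # rate to recover toward
        self.alpha = 1.0               # congestion estimate, between 0 and 1
        self.g = g                     # gain used to update alpha

    def on_cnp(self):
        """A congestion notification packet arrived: cut the rate."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self):
        """No CNP during the alpha timer: decay the congestion estimate."""
        self.alpha *= 1 - self.g

    def fast_recovery_step(self):
        """Move halfway back toward the rate used before the last cut."""
        self.current = (self.target + self.current) / 2
```

Starting from a 100 Gbit/s sender with `alpha` at 1.0, a single `on_cnp()` halves the rate to 50, and one `fast_recovery_step()` brings it back to 75. As quiet periods lower `alpha`, later cuts become proportionally smaller.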
RoCE performance thus depends heavily on how the Ethernet network is engineered. A well-tuned fabric can achieve latency and throughput close to InfiniBand. A poorly tuned one can become unpredictable under load. Queue thresholds, ECN marking points, PFC priorities, maximum transmission unit (MTU) values, and network interface card (NIC) parameters all interact, which is why production RoCE deployments often require careful validation and iterative tuning under realistic traffic conditions.
Can RoCE share the same fabric as IP traffic?
The mechanisms described above already push Ethernet close to its limits under synchronized RDMA workloads. When the same fabric also carries general IP traffic, those effects can become harder to control and easier to trigger.
Can RoCEv2 coexist with standard Ethernet and IP traffic without affecting performance? RDMA traffic assumes a controlled, near-lossless environment. IP traffic assumes packet loss and recovery. Bringing both together requires explicit separation within the same network, typically through traffic classes, buffer allocation, and scheduling policy.
Some environments operate a fully converged fabric, where all traffic shares the same infrastructure. This approach reduces hardware footprint but increases sensitivity to configuration errors, as congestion in one class can affect others. Other environments introduce logical separation, using VLANs or QoS classes to isolate RDMA traffic while keeping a shared physical network. The most conservative approach is a dedicated fabric, where RDMA traffic is completely isolated from other workloads.
Operational experience tends to favor some form of isolation. It reduces the number of variables involved when diagnosing performance issues and makes behavior easier to reason about under load.
Evolving congestion control
While RoCE can deliver excellent performance, large deployments also expose some of the weaker aspects of Ethernet under sustained load. Congestion spreads more easily, synchronized traffic patterns create hotspots, and mechanisms such as PFC can stabilize the fabric in one situation while amplifying instability in another.
Large AI training clusters pushed these issues into the spotlight once deployments started scaling into thousands of GPUs. In her OCP Global Summit 2022 keynote, Alexis Bjorlin, then VP of Infrastructure at Meta, explained that communication overhead and tail latency were directly affecting training efficiency: around 30% of AI training time was spent in networking and, for some benchmark models, more than 50%. That drove investment into topology-aware collectives, routing, and congestion management. At that scale, the network becomes part of the distributed compute system rather than simple transport infrastructure.
The industry response has focused on making Ethernet behave more predictably under large-scale RDMA traffic, while reducing dependence on mechanisms such as PFC.
A few notable examples illustrate where things are heading:
- NVIDIA Spectrum-X keeps the RoCE model, but tightly integrates Spectrum switches, BlueField DPUs, adaptive routing, telemetry, and congestion control into a single Ethernet platform optimized for AI clusters.
- The Ultra Ethernet Consortium (UEC), formed in 2023 under the Linux Foundation, takes a more open approach. Its specifications focus on stronger congestion signaling, multipathing, packet spraying, and transports that tolerate out-of-order delivery.
- Google’s Falcon transport, opened through OCP in 2023, moves even further away from traditional RoCE designs. It introduces hardware-assisted retransmission, programmable congestion control, and multipath transport semantics intended to work efficiently on lossy Ethernet fabrics.
- Broadcom’s Scale-Up Ethernet (SUE) framework focuses on tightly coupled accelerator domains using mechanisms such as credit-based flow control, topology-aware routing, and in-network collectives. Recent products such as Tomahawk Ultra and Thor Ultra align closely with the broader UEC direction.
RoCE exposed how difficult it is to make traditional Ethernet behave predictably under synchronized RDMA traffic at large scale. These newer approaches are all attempts to reduce that sensitivity, either by tightening integration around RoCE itself or by moving toward new Ethernet transport models designed specifically for AI and HPC communication patterns.
The Canonical perspective
RoCE depends on alignment across the entire software and hardware stack. Kernel drivers, user space libraries, NIC firmware, and the way workloads are attached to the network all contribute to the final behavior.
On Ubuntu, that alignment starts with upstream RDMA plumbing that operators already recognize. The rdma-core stack provides libibverbs and its providers, which expose a consistent RDMA programming interface regardless of whether the underlying transport is InfiniBand or RoCE. It also includes connection management and tooling such as ibv_devinfo and rping. In the kernel, RDMA drivers such as mlx5 (NVIDIA/Mellanox ConnectX), irdma (Intel E810/Columbiaville, layered on the ice Ethernet driver), and bnxt_re (Broadcom NetXtreme-E) expose RoCEv2 capabilities consistently. Features such as SR-IOV, VF representors, and devlink are available out of the box, which matters when you need to reason about queues, rate limiting, or firmware settings without stepping outside the distribution.
Day-to-day operations tend to rely on a small set of Linux primitives that are easy to overlook but critical for RoCE. ethtool and devlink health expose NIC capabilities and health. tc with mqprio shapes traffic classes. Together, these tools are used to map RoCE traffic to a dedicated priority, enable PFC on that class, and tune queue bandwidth. cgroups and CPU pinning keep data paths predictable by isolating RDMA poller threads or user-space data plane processes on specific cores, which reduces scheduler jitter. Standard Linux networking tools remain part of the workflow for inspecting and configuring the underlying Ethernet interfaces that carry RoCE traffic. When things misbehave, counters exposed through /sys/class/infiniband (used for both InfiniBand and RoCE devices), ethtool -S, and switch telemetry are what let you correlate queue pressure with application latency.
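As an illustration of the counter-based troubleshooting described above, the following sketch snapshots the per-port counters under /sys/class/infiniband and reports which ones moved between two snapshots. It assumes the usual `<device>/ports/<port>/counters` and `hw_counters` layout. The available counter names depend on the driver, so the code does not hard-code any, and the function names are hypothetical.

```python
from pathlib import Path


def read_rdma_counters(root="/sys/class/infiniband"):
    """Return {(device, port, counter_name): value} for readable counters."""
    counters = {}
    for group in ("counters", "hw_counters"):
        for path in Path(root).glob(f"*/ports/*/{group}/*"):
            try:
                value = int(path.read_text().strip())
            except (OSError, ValueError):
                continue  # skip unreadable or non-numeric entries
            # Path layout: <root>/<device>/ports/<port>/<group>/<name>
            device, port = path.parts[-5], path.parts[-3]
            counters[(device, port, path.name)] = value
    return counters


def counter_deltas(before, after):
    """Counters that changed between two snapshots, with their increase."""
    return {
        key: after[key] - before[key]
        for key in after
        if key in before and after[key] != before[key]
    }
```

Taking one snapshot before a test run and one after, then passing both to `counter_deltas`, gives a short list of counters worth lining up against `ethtool -S` output and switch telemetry for the same interval.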
Driver and firmware cohesion is where many deployments struggle. Ubuntu tracks rdma-core and kernel changes together, which reduces surprises when kernels move. Where vendor features matter, NVIDIA DOCA-OFED is available through Ubuntu packaging so platform teams can use advanced capabilities on BlueField or ConnectX without forking the base system. The same applies to Intel and Broadcom stacks, which are validated against LTS kernels rather than treated as one-off add-ons.
This matters even more as Ethernet fabrics continue to evolve for AI and HPC workloads. Canonical works closely with silicon vendors, switch manufacturers, and ecosystem partners to follow developments around technologies such as Spectrum-X, Ultra Ethernet, Falcon, and next-generation RoCE congestion control. The goal is not simply hardware enablement. It is making sure Ubuntu and the surrounding cloud-native stack can expose those capabilities consistently through upstream kernels, drivers, orchestration tooling, and CNCF integrations as the ecosystem evolves.
On the orchestration side, RDMA is rarely something a single process “just uses.” It has to be surfaced cleanly to containers and VMs, which is where SR-IOV VFs, device plugins, and CNI integrations come into play. Canonical keeps close to CNCF projects here, so the platform can consume what vendors and the community already provide rather than inventing a parallel stack. In practice, that means platform teams can bring in components such as NVIDIA’s Network Operator, SR-IOV-focused operators, RDMA device plugins, and Multus-based secondary networking. Ubuntu provides the predictable host layer that allows those components to work together consistently. These pieces take care of discovering devices, carving out VFs, and attaching them to workloads, while Ubuntu underneath keeps the drivers, kernel interfaces, and networking behavior consistent enough that those higher layers behave as expected.
Bare-metal provisioning with MAAS and lifecycle management with Juju fit into the same picture. They let platform teams model NIC firmware, kernel parameters, and network configuration alongside the workload, rather than treating RDMA as a special case configured by hand. In practice, this reduces drift. And with RoCE, drift is often what turns a stable cluster into one with unpredictable tail latency.
Observability ties it all together. RDMA issues rarely show up as hard failures; they surface as latency spread, queue buildup, or uneven throughput. Having a consistent set of kernel drivers, user-space tools, and metrics on Ubuntu makes it possible to trace those symptoms back to specific queues, links, or policies, which is what ultimately keeps a RoCE deployment usable under load.
Designing and operating RoCE in practice
RoCE aligns with Ethernet and integrates naturally into existing environments, which is precisely why it keeps coming up in design discussions. It allows teams to push performance further without introducing a parallel networking stack. At the same time, that flexibility comes with a shift in responsibility. The network no longer guarantees behavior by construction. It has to be engineered, observed, and continuously validated.
Most teams follow a similar path, even if they describe it differently. They start by validating whether RDMA is actually the right tool for their workload. Not every distributed system benefits from it, and some behave better with simpler transport models. From there, the focus moves quickly to the network: topology consistency, traffic isolation, buffer tuning, and congestion control are not optional details. They are part of the application design.
This is also where many projects stall. RoCE works well in controlled environments, then becomes unpredictable when scaled or mixed with other traffic. That gap is rarely about hardware capability. It is about how the system is put together and how much visibility teams have into its behavior.
This is the point where it usually makes sense to step back and look at the full stack rather than individual components. The NIC, the kernel, the switch configuration, and the orchestration layer all interact. Treating them separately often leads to local optimizations and global instability. Treating them as a system tends to produce more consistent results.
From a practical standpoint, the next step is not to choose between RoCE and something else in isolation. It is to validate the workload, test the network behavior under realistic conditions, and understand where variability comes from. That work tends to surface quickly whether the existing Ethernet fabric can support the requirements or whether a more controlled approach is needed.
The right approach depends on context. Some deployments converge successfully on RoCE with careful design. Others decide that a more tightly controlled fabric is the better option. What matters is having the data and the operational understanding to make that decision with confidence.
If you are exploring RDMA over Ethernet today, this is exactly the kind of problem Canonical works on with customers and partners. From enabling RDMA stacks on Ubuntu, to validating performance across different NICs and switches, to integrating those capabilities into production platforms, the focus is on making these systems behave predictably under real workloads rather than synthetic benchmarks.
If you are evaluating RoCE, next-generation Ethernet fabrics, or AI/HPC networking architectures on Ubuntu, get in touch with Canonical to discuss your requirements and deployment goals.