How SONiC Powers the World’s Largest AI Infrastructure

By Guohan Lu (Azure Networking, Microsoft) and Mehak Mahajan (Broadcom Engineering)

AI infrastructure has introduced a class of networking challenges that traditional data center software was never designed to handle. Over the past few years, the SONiC community—spanning silicon vendors, system vendors, cloud operators, and network engineers—has been working together to address these challenges head-on.

This post covers what makes AI networking fundamentally different, the four key capabilities the community has built into SONiC to address it, and what comes next as the industry pushes toward even larger scale.

Why AI Networking Is a Different Problem

Traditional data center workloads—web requests, API calls, database transactions—generate traffic with natural entropy. Thousands of independent processes send packets that follow different paths through the network. Congestion builds gradually and dissipates gradually. That natural variation gives the network room to recover.

AI training workloads are the opposite. Tens of thousands of GPUs finish their compute step simultaneously, pause simultaneously, and then start communicating simultaneously. There is almost no entropy. Every flow follows the same path, hits the same switch, and lands on the same buffer. The result is what the community calls elephant flows—long-lived, synchronous flows that create sharp microbursts rather than gradual congestion.

This breaks the assumptions that traditional congestion control tools were built on. PFC (Priority Flow Control) causes head-of-line blocking that spreads across the fabric. ECN (Explicit Congestion Notification) was designed for gradual congestion, not microsecond-scale bursts. Pull-based telemetry sampled every few seconds misses these events entirely: the burst has come and gone before the counter is read.

There is also a compounding effect unique to AI: a training job is only as fast as its slowest GPU. A single dropped packet, a single congested path, or a single delayed flow is enough to stall every other GPU in the job and inflate the overall completion time. At the scale of hundreds of thousands of accelerators, this is not a rare edge case. It is a constant pressure the network must be designed around.

The network is no longer just the plumbing. In AI infrastructure, the network is the system.

SONiC in Production: The Fairwater Deployment

One of the most demanding tests of these new capabilities has been the Fairwater AI data center, which uses a multi-plane, multi-rail topology designed to support up to 512,000 GPUs in a two-tier fabric. The network is built on 51.2 Tbps switches with 100G SerDes lanes, with each switch fanning out to 512 physical neighbors. Eight planes run in parallel per GPU NIC, forming a fabric that can sustain the full communication demands of large-scale AI training jobs.

Designing the topology was the starting point. Making it operate reliably at this scale required the SONiC community to solve four distinct software problems that existing tools simply could not address.

Four Capabilities Built into SONiC for AI

1. BGP at Scale: 512 Sessions per Switch

When every 800G port is broken into eight 100G lanes, as required to maximize fabric density, a single two-rack-unit switch must manage 512 BGP sessions simultaneously, each carrying 1K routes and 512 next hops. This is an order of magnitude beyond what pizza-box switches have historically been asked to handle.
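The session count follows from simple breakout arithmetic. The sketch below is only a back-of-the-envelope check: it assumes a 51.2 Tbps switch exposed as 800G front-panel ports, and it reads "1K routes" as 1,000 routes per session. Neither the constant names nor that interpretation come from the SONiC code base.

```python
# Back-of-the-envelope session math for one switch (illustrative only).
SWITCH_CAPACITY_GBPS = 51_200   # assumption: 51.2 Tbps of switching capacity
PORT_SPEED_GBPS = 800           # front-panel port speed before breakout
BREAKOUT_LANES = 8              # each 800G port split into eight 100G lanes
ROUTES_PER_SESSION = 1_000      # assumption: "1K routes" means 1,000 per session

physical_ports = SWITCH_CAPACITY_GBPS // PORT_SPEED_GBPS
bgp_sessions = physical_ports * BREAKOUT_LANES          # one session per 100G neighbor
received_paths = bgp_sessions * ROUTES_PER_SESSION      # paths the BGP daemon must hold

print(f"{physical_ports} ports -> {bgp_sessions} sessions -> {received_paths:,} paths")
```

Under these assumptions, the routing daemon has to track roughly half a million received paths on a single fixed-form-factor switch, which is why session scale and convergence time become the limiting factors.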

The SONiC community addressed this by upgrading FRR to version 10 with approximately 20 targeted patches, enabling stable operation at this session count and achieving data-plane convergence in under 100 milliseconds. The work is upstream and available to anyone building at this scale.

2. SRv6 for Source-Based Traffic Spreading

Standard ECMP load balancing relies on entropy in packet headers to spread traffic across available paths. AI flows have very little of that entropy, which means ECMP tends to concentrate traffic rather than distribute it.
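A toy model makes the problem concrete. The sketch below hashes eight long-lived flows onto eight equal-cost paths and compares the result with an even, sender-chosen assignment. The hash function (SHA-256 over the 5-tuple), the addresses, and the function name `ecmp_path` are stand-ins chosen for illustration; a real ASIC uses its own hash.

```python
import hashlib
from collections import Counter

NUM_PATHS = 8

def ecmp_path(five_tuple, num_paths=NUM_PATHS):
    """Pick a path by hashing the flow's 5-tuple (stand-in for an ASIC hash)."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# Eight long-lived flows: (src IP, dst IP, protocol, src port, dst port).
# RoCEv2 traffic is UDP (protocol 17) to destination port 4791.
flows = [("10.0.0.1", f"10.0.1.{i}", 17, 49152 + i, 4791) for i in range(8)]

# Hash-based placement: flows per path, decided by the hash alone.
hashed = Counter(ecmp_path(flow) for flow in flows)

# Sender-chosen placement: the source assigns paths round-robin.
spread = Counter(i % NUM_PATHS for i, _ in enumerate(flows))

print("hashed  :", sorted(hashed.items()))
print("spread  :", sorted(spread.items()))
print("idle paths under hashing:", NUM_PATHS - len(hashed))
```

With eight flows on eight paths, a uniform hash places every flow on its own path only about 0.24% of the time (8!/8^8), so some paths usually carry several elephant flows while others sit idle. When the flows are few and large, that imbalance does not average out.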

SONiC’s SRv6 support enables a different approach: source-based routing, where the NIC encodes the full forwarding path into the IPv6 destination address using SRv6 uSID. Each switch reads the current segment, forwards accordingly, and shifts to the next. This gives the sending endpoint direct control over path selection, allowing traffic to be spread across the fabric in a coordinated way. The forwarding logic runs entirely on existing ASICs, and no new hardware is required.
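The shift-and-forward step can be sketched in a few lines. The example below assumes a 32-bit locator block followed by 16-bit uSIDs, which is a common uSID layout but is not stated in this post. The block `fcbb:bbbb::`, the uSID values, and the helper names are hypothetical.

```python
import ipaddress

BLOCK_BITS = 32   # assumption: 32-bit uSID locator block
USID_BITS = 16    # assumption: 16-bit uSIDs
REST_BITS = 128 - BLOCK_BITS

def build_container(block, usids):
    """Pack a locator block and an ordered uSID list into one IPv6 address."""
    if len(usids) > REST_BITS // USID_BITS:
        raise ValueError("too many uSIDs for one container")
    value = int(ipaddress.IPv6Address(block))
    shift = REST_BITS
    for usid in usids:
        if not 0 < usid < (1 << USID_BITS):
            raise ValueError("uSID must be a non-zero 16-bit value")
        shift -= USID_BITS
        value |= usid << shift
    return ipaddress.IPv6Address(value)

def active_usid(dst):
    """Return the uSID the current switch should act on (first after the block)."""
    return (int(dst) >> (REST_BITS - USID_BITS)) & ((1 << USID_BITS) - 1)

def shift_and_forward(dst):
    """Consume the active uSID: keep the block, shift the rest left, zero-fill."""
    value = int(dst)
    block = (value >> REST_BITS) << REST_BITS
    rest = (value << USID_BITS) & ((1 << REST_BITS) - 1)
    return ipaddress.IPv6Address(block | rest)

# The sender encodes a three-hop path; each hop reads, then shifts.
dst = build_container("fcbb:bbbb::", [0x0100, 0x0200, 0x0300])
hops = []
while active_usid(dst) != 0:          # an all-zero uSID marks the end of the path
    hops.append((str(dst), hex(active_usid(dst))))
    dst = shift_and_forward(dst)

for address, usid in hops:
    print(f"dst={address:<26} active uSID={usid}")
```

Because the whole path lives in the destination address, each switch only needs a lookup on the active uSID plus a fixed-width shift, and the sender can vary the uSID list per flow or per packet to choose different paths.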

3. Packet Trimming for Fast Loss Recovery

Even in networks designed to be lossless, microbursts can cause buffer overflow and packet drops. The traditional recovery mechanism, waiting for the RDMA layer to time out and retransmit, introduces significant latency that compounds across a large job.

Packet trimming, now available in SONiC, changes the response to a drop. When a packet is discarded due to buffer overflow, the switch trims it to a short header and forwards that header at high priority to the destination NIC. The destination recognizes the trimmed packet, sends a NAK to the source, and the source retransmits immediately, well before the RDMA timeout would have fired. On current 512-port hardware, up to 18 ingress ports can generate trimmed packets simultaneously and drain to a single egress port without loss.
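The trim-instead-of-drop behavior can be illustrated with a toy queue model. This is a simplified sketch, not the SONiC or ASIC implementation: the 64-byte trim size, the class and function names, and the choice to leave trimmed headers out of the buffer accounting are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass

HEADER_BYTES = 64  # hypothetical size a packet is trimmed to

@dataclass
class Packet:
    seq: int
    size: int
    trimmed: bool = False

class EgressPort:
    """Toy egress port: a byte-limited data queue plus a high-priority trim queue."""

    def __init__(self, buffer_bytes):
        self.buffer_bytes = buffer_bytes
        self.used = 0
        self.data_queue = deque()
        self.trim_queue = deque()

    def enqueue(self, pkt):
        if self.used + pkt.size <= self.buffer_bytes:
            self.used += pkt.size
            self.data_queue.append(pkt)
        else:
            # Buffer overflow: keep only the header and queue it at high priority.
            # Simplification: trimmed headers are not charged against the buffer.
            self.trim_queue.append(Packet(pkt.seq, HEADER_BYTES, trimmed=True))

    def dequeue(self):
        if self.trim_queue:                 # trimmed headers are served first
            return self.trim_queue.popleft()
        if self.data_queue:
            pkt = self.data_queue.popleft()
            self.used -= pkt.size
            return pkt
        return None

def receiver_naks(delivered):
    """The destination NAKs every sequence number that arrived as a trimmed header."""
    return [pkt.seq for pkt in delivered if pkt.trimmed]

# A burst of four 4 KB packets hits a port with room for only two.
port = EgressPort(buffer_bytes=8192)
for seq in range(4):
    port.enqueue(Packet(seq, 4096))

delivered = []
while (pkt := port.dequeue()) is not None:
    delivered.append(pkt)

naks = receiver_naks(delivered)
print("arrival order:", [(p.seq, "trimmed" if p.trimmed else "full") for p in delivered])
print("NAKs sent for:", naks)
```

In this model the headers for packets 2 and 3 reach the receiver ahead of the full packets still sitting in the queue, so the sender learns about the loss without waiting for a timeout.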

4. High Frequency Streaming Telemetry

Telemetry collected at second or multi-second intervals cannot capture the microsecond-scale dynamics of AI traffic. By the time a microburst is reflected in a polled counter, the event is already over and the opportunity to diagnose or respond to it has passed.

The SONiC community has developed High Frequency Streaming Telemetry (HFST) to address this. Rather than waiting to be polled, the ASIC pushes per-port counters in IPFIX format at millisecond cadence. SONiC’s Counter SyncD receives these messages, matches them against registered IPFIX templates, and stores them in the local counter database. From there, counters can be processed on-switch or exported via OpenTelemetry to an external collector such as Prometheus or InfluxDB. This architecture scales to 512-port switches without the latency penalty of software-driven polling.
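The template-matching step can be sketched with a minimal IPFIX decoder. The message and set header layouts follow the IPFIX protocol specification (RFC 7011), but everything else is an assumption for illustration: the template ID 256, the choice of standard elements (10 = ingressInterface, 1 = octetDeltaCount, 2 = packetDeltaCount), and the function names are hypothetical, and enterprise-specific elements are not handled. It is not the code SONiC runs.

```python
import struct

IPFIX_VERSION = 10
TEMPLATE_SET_ID = 2
MIN_DATA_SET_ID = 256

def register_templates(body, templates):
    """Store each template record as a list of (element_id, field_length)."""
    pos = 0
    while pos + 4 <= len(body):
        template_id, field_count = struct.unpack_from("!HH", body, pos)
        pos += 4
        fields = []
        for _ in range(field_count):
            fields.append(struct.unpack_from("!HH", body, pos))
            pos += 4
        templates[template_id] = fields

def decode_records(body, fields):
    """Slice a data set into fixed-length records using the matching template."""
    record_len = sum(length for _, length in fields)
    for pos in range(0, len(body) - record_len + 1, record_len):
        record, cursor = {}, pos
        for element_id, length in fields:
            record[element_id] = int.from_bytes(body[cursor:cursor + length], "big")
            cursor += length
        yield record

def parse_message(data, templates):
    """Decode one IPFIX message and return its data records."""
    version, length, _time, _seq, _domain = struct.unpack_from("!HHIII", data, 0)
    if version != IPFIX_VERSION:
        raise ValueError("not an IPFIX message")
    records, offset = [], 16
    while offset + 4 <= length:
        set_id, set_len = struct.unpack_from("!HH", data, offset)
        if set_len < 4:
            raise ValueError("malformed set length")
        body = data[offset + 4:offset + set_len]
        if set_id == TEMPLATE_SET_ID:
            register_templates(body, templates)
        elif set_id >= MIN_DATA_SET_ID and set_id in templates:
            records.extend(decode_records(body, templates[set_id]))
        offset += set_len
    return records

# Build a sample message: one template set, then one data set with two records.
template_set = struct.pack("!HHHH", TEMPLATE_SET_ID, 20, 256, 3)
template_set += struct.pack("!HHHHHH", 10, 4, 1, 8, 2, 8)
data_set = struct.pack("!HH", 256, 44)
data_set += struct.pack("!IQQ", 7, 1_500_000, 1_000)   # port 7: bytes, packets
data_set += struct.pack("!IQQ", 8, 0, 0)               # port 8: idle
header = struct.pack("!HHIII", IPFIX_VERSION, 16 + 20 + 44, 0, 0, 1)

templates = {}
records = parse_message(header + template_set + data_set, templates)
for record in records:
    print(record)
```

Once a template is registered, each later data set is decoded by slicing fixed-width fields, with no per-record schema negotiation. That keeps the per-message cost low enough for a push model running at millisecond intervals.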

Availability

All four of these capabilities (BGP scaling, SRv6 source routing, packet trimming, and high frequency streaming telemetry) are available in SONiC as of the 2025.11 release. The High Level Design documents for each are linked below for reference.

What’s Next: Scale-Up Ethernet

The capabilities described above address the scale-out network, the fabric that connects GPUs across a data center. A related challenge is now emerging on the scale-up side: the high-speed interconnect between GPUs within a tightly coupled job.

Scale-up networks have historically been confined to a single rack. As AI jobs grow to involve thousands of tightly coupled accelerators, that constraint is no longer viable. The industry is now designing scale-up networks that span 512, 1,024, and eventually 4,096 endpoints, with requirements for lossless operation, high throughput, and interoperability across GPU vendors.

The OCP community has responded with E-SUN (Ethernet for Scale-Up Networking), a specification developed by approximately ten contributing organizations. E-SUN defines the L2/L3 framing, error recovery, header efficiency, and lossless network requirements for this class of interconnect, with an explicit goal of keeping the network open and interoperable.

The SONiC community has a dedicated Scale-Up Ethernet working group running in parallel with the E-SUN effort, ensuring that the protocols and behaviors defined in E-SUN have a clear implementation path in SONiC. Technologies under active discussion include Link Layer Retry (LLR), Credit-Based Flow Control (CBFC), Adaptive Flow Hashing, and the E-SUN header format. Participation from operators and implementers with real use cases is welcome.

Get Involved

The capabilities in this post exist because contributors across the industry identified problems, proposed designs, and submitted code. If you are running AI workloads, building AI infrastructure, or have networking challenges that SONiC should be able to solve, the community wants to hear from you.