THE LINUX FOUNDATION PROJECTS
2025 SONiC Hackathon Most Innovative Award Spotlight: Data Path Diagnostics on SONiC

December 3, 2025

At the 2025 SONiC Hackathon, innovation meant tackling challenges that are rarely seen, and even harder to diagnose. The Most Innovative Award was presented to Sridhar Talari, Anand Mehra, and Xiaohu Huang from Cisco for their work on a key feature: data path diagnostics on SONiC.

Silent packet corruption and packet loss are rare, but when they occur, they create significant disruption and are extremely difficult to isolate. This Cisco team developed a proactive, intelligent solution designed to detect these errors early, locate them precisely, and help operators keep SONiC environments healthy and predictable.

The Problem

Instances have been observed where packet data corruption and packet loss occur along the forwarding path without being detected by hardware. Undetected corruption may result in packets being dropped at later stages or in incorrect data being delivered to the end user. Although the likelihood of these incidents is very low, they can cause significant disruption when they occur and are very difficult to diagnose. It is therefore essential to implement proactive and accurate detection mechanisms for such corruption.

The Hackathon Solution

This feature periodically injects specially crafted packets into the forwarding path and subsequently receives them. It then compares the content of the transmitted packets with that of the received packets to ensure the integrity of the data path.
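The compare step described above can be illustrated with a minimal sketch. It is not the team's implementation: the probe layout (a 4-byte sequence number followed by the payload), the function names, and the `simulated_data_path` stand-in for the platform-specific hardware loop are all assumptions made for illustration.

```python
import struct
from typing import Optional

# Hypothetical probe layout: 4-byte sequence number followed by the payload.
SEQ = struct.Struct("!I")


def build_probe(seq: int, payload: bytes) -> bytes:
    """Build a probe packet with a sequence number in front of the payload."""
    return SEQ.pack(seq) + payload


def first_mismatch(sent: bytes, received: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(sent, received)):
        if a != b:
            return offset
    if len(sent) != len(received):
        # One packet is a prefix of the other: report where the shorter one ends.
        return min(len(sent), len(received))
    return None


def simulated_data_path(packet: bytes, flip_offset: Optional[int] = None) -> bytes:
    """Stand-in for the real injection and receive path (an assumption).

    Returns the packet unchanged, or with one bit flipped at flip_offset
    to imitate silent corruption.
    """
    if flip_offset is None:
        return packet
    damaged = bytearray(packet)
    damaged[flip_offset] ^= 0x01
    return bytes(damaged)


probe = build_probe(seq=7, payload=bytes(range(16)))
clean = simulated_data_path(probe)
damaged = simulated_data_path(probe, flip_offset=10)

print(first_mismatch(probe, clean))
print(first_mismatch(probe, damaged))
```

A byte offset only says where in the packet the data changed. Mapping a failure to a hardware block, as the feature aims to do, would need platform-specific knowledge that this sketch does not model.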

The feature implementation should satisfy the following requirements:

  • It should cover all the blocks in the data path.
  • It should cover all software paths taken by a packet from the CPU to the ASIC and vice versa.
  • Along with detecting packet corruption and loss, it should be able to pinpoint the exact location at a granular level (for example, corruption caused by lookup, next-hop resolution, encapsulation, or the packet buffer).
  • The feature should be enabled by default and should run in the background.
  • The feature should raise a syslog message when an error is detected.
  • The feature should allow the user to configure the interval, packet size, packet content, and burst size of the packets used for error detection.
  • Additional CPU usage should be minimal.
  • There should be no impact on any existing host path applications.
  • The diagnostic packets should not be forwarded.
  • Detection should be as fast as possible.
  • False alarms should not happen.
  • It should be lightweight and latency sensitive.
  • It should be able to run the test even if all interfaces are down.
  • A CLI should be available to display packets sent, packets received correctly, packets received out of order, packets corrupted, location of corruption, and additional details as part of operational data.
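To make the configurable parameters and operational counters in the list above more concrete, here is a hedged sketch. The class names, field names, and default values are hypothetical and do not reflect the actual YANG model or SAI attributes. For simplicity, a duplicate sequence number is counted as out of order.

```python
from dataclasses import dataclass


@dataclass
class ProbeConfig:
    """User-configurable knobs named in the requirements (defaults are assumed)."""
    interval_s: float = 1.0            # time between bursts
    packet_size: int = 128             # bytes per probe packet
    burst_size: int = 1                # packets per burst
    payload_pattern: bytes = b"\xa5"   # repeated to fill the packet


@dataclass
class ProbeStats:
    """Counters that an operational CLI could display."""
    sent: int = 0
    received_ok: int = 0
    out_of_order: int = 0
    corrupted: int = 0
    highest_seq_seen: int = -1

    def record_sent(self) -> None:
        self.sent += 1

    def record_received(self, seq: int, intact: bool) -> None:
        if not intact:
            self.corrupted += 1
        elif seq <= self.highest_seq_seen:
            # Arrived after a later packet (or is a duplicate).
            self.out_of_order += 1
        else:
            self.highest_seq_seen = seq
            self.received_ok += 1

    @property
    def outstanding(self) -> int:
        """Packets sent but not yet accounted for (possible loss)."""
        return self.sent - self.received_ok - self.out_of_order - self.corrupted


stats = ProbeStats()
for _ in range(5):
    stats.record_sent()

# Packet 4 never returns; packet 1 arrives late; packet 3 arrives damaged.
for seq, intact in [(0, True), (2, True), (1, True), (3, False)]:
    stats.record_received(seq, intact)

print(stats)
print("outstanding:", stats.outstanding)
```

A real implementation would treat an outstanding packet as lost only after a timeout, which helps meet the requirement that false alarms should not happen.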

The method for building a packet that traverses all hardware blocks in the data path and returns to the feature depends on the platform. Therefore, the team chose to add this feature to the SAI code and to expose its configuration and operational controls through the SAI API, ensuring a standardized and hardware-independent format. Users will configure the feature via the SONiC CLI, which in turn calls the SAI API to enable, disable, and configure the feature. Operational CLI commands are planned to display the results. The FlexCounter mechanism available in SONiC will be enhanced to export the statistics from the SAI layer.

Impacts and Benefits

This feature proactively catches data path problems that lead to packet drops, packet corruption, and total hardware failure, and it does so quickly and with minimal intrusion. It saves much of the time and effort otherwise spent debugging the network for these silent errors, and it reduces network downtime. The feature attempts to identify the exact location of the problem, so operators can selectively disable the affected sections of the data path instead of turning it off completely, reducing the scope of impact. In a modular system, this helps operators replace only the impacted parts instead of the entire chassis.

Next Steps

The team is working to upstream the following implementations:

  • YANG model for the configuration and operational data needed for this feature.
  • SAI API for implementing the feature.
  • FlexCounter enhancements to collect and display operational data.

There are instances where packet corruption happens only with traffic bursts at a higher rate, and it would be helpful to detect these scenarios. Such tests cannot be run on production devices because they impact control plane traffic. They can be run during a maintenance window after costing the device out of the network.

To address corruption issues with bursts, the team is working to enhance the data path diagnostics tool to handle traffic at a very high rate (20K packets per second instead of one packet per second or so) in offline mode.

Their goal is to provide a modular framework that is easily extensible to address all current data path errors and any new ones identified in the future.