Technical Deep Dive – Techniques for Handling Data Loss with Aeron

In fintech, the stakes are high and there is no room for error. Data loss, such as a misplaced trade confirmation or an order execution mishap, can have significant repercussions, which is why precision, speed, and reliability are the cornerstones of the financial industry.

At our Aeron MeetUps in May 2024, we tackled these critical issues head-on, showcasing how Aeron harnesses the power of UDP for unparalleled speed without sacrificing the fault tolerance often linked with the less nimble TCP. We also engaged in candid discussions about the obstacles we’ve overcome to ensure Aeron meets the rigorous demands of financial technology.

TCP vs. UDP: The Challenge of Data Loss

TCP can be likened to a two-way telephone call, a continuous stream of communication that ensures data delivery. In contrast, UDP can be compared to sending a postcard – it’s fast, but there’s no guarantee the message will reach its destination. This fundamental difference is crucial when designing systems that require high throughput and low latency. Aeron is one such system: it operates on UDP, which gives developers more control over data transmission.

When using UDP, every individual piece of data, or datagram, can potentially be lost. In a perfect world without data loss, UDP would be the most efficient choice. However, the reality is that we must cope with the possibility of lost datagrams, which introduces the need for a robust loss handling strategy.




Aeron’s Approach to Loss Handling: Reliable Delivery with NAKs and Beyond

Aeron’s protocol is designed to detect gaps in data sequences, allowing the receiver to identify what’s missing and request the specific data. This is where the concept of negative acknowledgment (NAK) comes into play. A NAK is sent when a receiver detects a gap, prompting the sender to retransmit the missing data.
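The sender's side of this exchange can be sketched as follows. This is a minimal illustration, not Aeron's implementation: the class `RetransmittingSender` and its methods are hypothetical names, and it assumes the sender keeps everything it has sent in memory so that any NAK'd range can be resent.

```python
class RetransmittingSender:
    """Keeps sent bytes so a NAK'd range can be resent (illustrative only)."""

    def __init__(self):
        # Everything sent so far, indexed by stream offset.
        self._log = bytearray()

    def send(self, payload: bytes) -> int:
        """Append payload to the stream and return the offset it starts at."""
        offset = len(self._log)
        self._log += payload
        return offset

    def on_nak(self, offset: int, length: int) -> bytes:
        """Return the bytes the receiver reported as missing."""
        if offset < 0 or length < 0 or offset + length > len(self._log):
            raise ValueError("NAK refers to data that was never sent")
        return bytes(self._log[offset:offset + length])
```

A real sender would bound this buffer and put the returned bytes back on the wire; the sketch only shows that a NAK names a specific range and the sender answers with exactly that range.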

But how do we send these NAKs efficiently?

In controlled network environments, data loss typically involves only a few datagrams. However, as we move to larger data centers and cloud environments, larger gaps in data can occur, so the NAK mechanism has to cope with both isolated losses and long runs of missing datagrams.

Aeron’s Optimistic Approach to Data Loss

Aeron’s philosophy is rooted in optimism. Unlike TCP, which cautiously ramps up data transmission to avoid congestion and loss, Aeron starts at full speed, only slowing down if issues arise. This approach is well-suited for the high-capacity, low-latency networks typically found in financial services and gaming industries.

The Evolution for Cloud Environments & Data Loss Challenges

With the migration to cloud environments, Aeron has evolved. The cloud can introduce higher latency and increased data loss, especially across regions. To address this, Aeron has introduced new features that refine its data loss handling. These include initial and retry delays for negative acknowledgments (NAKs), which help prevent the “NAK storms” that can occur when multiple packets are lost. The receiver waits for a short initial delay before sending a NAK, allowing time for the network to ‘settle’ and potentially deliver the missing packets without the need for retransmission; the retry delay then spaces out any repeated NAKs for the same gap. This approach is not only more efficient but can also reduce the additional latency that immediate NAKs can cause.
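The interplay of the initial delay and the retry delay can be sketched as a small timer. This is an illustration under stated assumptions rather than Aeron's code: `DelayedNak`, `initial_delay_ms`, and `retry_delay_ms` are hypothetical names, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelayedNak:
    """Timer for one detected gap (hypothetical helper, not Aeron's API)."""
    initial_delay_ms: int                 # wait before the first NAK
    retry_delay_ms: int                   # wait between repeated NAKs
    next_nak_at_ms: Optional[int] = None  # None means no gap is tracked

    def on_gap_detected(self, now_ms: int) -> None:
        # Start the timer once per gap; do not NAK immediately.
        if self.next_nak_at_ms is None:
            self.next_nak_at_ms = now_ms + self.initial_delay_ms

    def on_gap_filled(self) -> None:
        # Late or retransmitted data arrived, so cancel the pending NAK.
        self.next_nak_at_ms = None

    def poll(self, now_ms: int) -> bool:
        """Return True if a NAK should be sent at now_ms."""
        if self.next_nak_at_ms is None or now_ms < self.next_nak_at_ms:
            return False
        # Schedule a retry in case the NAK or the retransmission is lost.
        self.next_nak_at_ms = now_ms + self.retry_delay_ms
        return True
```

If the missing data turns up during the initial delay, `on_gap_filled` is called and no NAK is ever sent, which is the case the delay is designed to exploit.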

Data Loss Handling – Multicast Scenario in Action

When dealing with UDP, the first step is to understand the nature of data loss and implement mechanisms to detect and recover lost data. Unlike TCP, which inherently handles sequencing and flow control, UDP requires additional logic for these functions. A typical approach involves adding sequence numbers or offsets to detect gaps in the data stream. For example, if a receiver gets packets 1, 2, and 4, it recognizes the loss of packet 3 and can send a NAK to request its retransmission. However, sending NAKs over UDP introduces the risk of losing the NAK itself, necessitating further strategies to ensure reliable communication.
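The sequence-number example above can be written as a few lines of code. This is a generic sketch of gap detection, not Aeron's wire protocol; the function name `missing_sequences` is made up for illustration.

```python
def missing_sequences(received):
    """Return sequence numbers absent between the lowest and highest seen."""
    seen = set(received)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# Packets 1, 2, and 4 arrived, so packet 3 is the one to NAK.
print(missing_sequences([1, 2, 4]))  # prints [3]
```

Note that a gap can only be detected once a later packet arrives; loss at the very end of a stream needs some other signal, such as a periodic message from the sender.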

One critical challenge in a multicast environment is managing NAKs from multiple receivers. If all receivers detect the same gap and send NAKs simultaneously, the burst of NAKs can exacerbate network congestion. A more sophisticated approach uses randomized backoff timers on the receivers: upon detecting a gap, each receiver waits for a random delay before sending its NAK. Ideally, only one receiver will send the NAK, prompting the sender to retransmit the lost packet. Because the retransmission is multicast, all receivers benefit from it, and those whose timers have not yet expired can cancel their pending NAKs, minimizing redundant NAKs and retransmissions.
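A toy simulation makes the effect of randomized backoff visible. It rests on simplifying assumptions that are not taken from Aeron: every receiver notices the gap at the same instant, backoff is drawn uniformly from zero to `max_backoff_ms`, and the retransmission reaches everyone `repair_time_ms` after the earliest NAK is sent. All names are hypothetical.

```python
import random


def simulate_nak_suppression(num_receivers, max_backoff_ms,
                             repair_time_ms, seed=None):
    """Count the NAKs sent for one lost packet under randomized backoff."""
    if num_receivers < 1:
        raise ValueError("need at least one receiver")
    rng = random.Random(seed)
    # Each receiver picks its own random wait before it would send a NAK.
    timers = [rng.uniform(0, max_backoff_ms) for _ in range(num_receivers)]
    # The earliest NAK triggers the retransmission.
    repair_arrives_at = min(timers) + repair_time_ms
    # Receivers whose timer fires before the repair arrives still send a NAK;
    # the rest see the gap filled and cancel.
    return sum(1 for t in timers if t <= repair_arrives_at)
```

With `max_backoff_ms=0` every receiver sends a NAK, which is the storm case. Widening the backoff window relative to the repair time lowers the count, at the cost of a longer average wait before the first NAK goes out.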

Data Loss Handling – Unicast Scenario in Action

In a unicast scenario, where the communication is one-to-one, the approach differs slightly. Since there is only one receiver, a randomized timer is unnecessary. The receiver can send a NAK immediately upon detecting a gap. Aeron’s protocol, for instance, uses offsets and lengths to track gaps, allowing for efficient recovery. When large contiguous blocks of data are lost, the system ensures that a single NAK can cover multiple lost packets, and the sender retransmits the data in a consolidated form, reducing unnecessary overhead.
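Tracking gaps by offset and length, rather than per packet, is what lets one NAK cover a whole run of lost data. The sketch below shows the idea in general terms and is not Aeron's actual data structure; `find_gaps` and `high_water_mark` are illustrative names, and stream positions are assumed to start at zero.

```python
def find_gaps(frames, high_water_mark):
    """Return (offset, length) for each missing byte range.

    frames: iterable of (offset, length) pairs for data received so far.
    high_water_mark: highest stream position known to have been sent.
    """
    gaps = []
    position = 0  # end of the contiguous data seen so far
    for offset, length in sorted(frames):
        if offset > position:
            # Everything between position and this frame is missing.
            gaps.append((position, offset - position))
        position = max(position, offset + length)
    if position < high_water_mark:
        gaps.append((position, high_water_mark - position))
    return gaps


# Three consecutive 1024-byte frames were lost; one (offset, length)
# pair describes all of them, so a single NAK is enough.
print(find_gaps([(0, 1024), (4096, 1024)], 5120))  # prints [(1024, 3072)]
```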

Conclusion: Building Resilient Systems with Aeron

Handling data loss in UDP-based systems requires a thoughtful combination of detection, negative acknowledgment, and retransmission strategies. By understanding and addressing the unique challenges posed by UDP, systems like Aeron can achieve robust and efficient communication, even in the face of inherent unreliability. These techniques keep UDP a viable option for high-performance, low-latency applications, providing the necessary tools to mitigate data loss and maintain system integrity.