Persistent Subscriptions: Managing live and replay transitions in Aeron

Aeron 1.51.0 introduces Persistent Subscriptions, a new Aeron Archive feature, sponsored by Nonco, that manages the transition between historical and live message streams for late-joining consumers. It’s an alternative to ReplayMerge, the existing mechanism for late-joiners, and adds IPC support, automatic slow consumer recovery, and an improved API.


The Late-Joiner Problem

In any system, some consumers of a live stream of messages will fall behind it. This could be due to scheduled downtime or the result of network issues, but the outcome is the same. These consumers need to catch up on the missed messages before re-joining the live stream.

Catching up and consuming from live are fundamentally different operations, with different performance characteristics. Transitioning between them is non-trivial as it must be done without any data loss or duplication and with minimal impact on other consumers of the live stream.

Deployment topology complicates this further. Many systems colocate services on the same host to minimize latency by communicating over IPC, while still supporting distributed services over the network. This means that recovery and live consumption can happen over different transports. A solution must bridge this difference while utilizing the fastest transport available for a service.

None of this should be the consumer’s concern. A consumer’s job is to process a continuous stream of messages, regardless of whether it is a historical stream or a live one. Providing that stream is the infrastructure’s job.

Persistent Subscriptions solve exactly this problem.

Background: What’s ReplayMerge?

To understand what Persistent Subscriptions change, it helps to look at ReplayMerge, the original Aeron solution for late-joining consumers.

ReplayMerge uses a multi-destination subscription (MDS) to handle the transition from replay to live. An MDS is an Aeron subscription that receives data from multiple remote endpoints, which limits it to network transports. ReplayMerge leverages this mechanism to merge the historical and live streams together at the right point.

ReplayMerge was designed when system deployments typically had individual services on their own hardware, so the UDP-only limitation was not an issue for the distributed deployments of the time. Colocated services are more common these days, and for these systems, the lack of IPC support means ReplayMerge is not a solution.

For Aeron specifically, this means developers are prevented from using local archives to decrease replay latency for their consumers, and must go over the network even when a consumer is on the same machine as its publisher.

Persistent Subscriptions: A New Solution

Updating ReplayMerge to support IPC would require major breaking changes to its API, disrupting existing users in production.

Persistent Subscriptions were created to provide IPC support for those who need it without forcing existing users through a painful migration.

Persistent Subscriptions use Aeron and Aeron Archive client libraries; they sidestep the UDP-only restriction of ReplayMerge by utilizing standard Aeron subscriptions instead of an MDS. This provides the following benefits:

Media-Agnostic Support: The feature works across IPC and UDP. This allows services to use the most efficient transport available, whether they are on separate machines or on the same host.
Intuitive API: The configuration is straightforward and follows standard Aeron patterns, allowing developers to lean on pre-existing knowledge to get up and running. Here’s an example:

Code snippet – Example configuration for a PersistentSubscription that spies on a Publication and replays via a local archive

Beyond the additional transport support, Persistent Subscriptions also have a fallback mechanism for consumers that fall behind the live stream.

In this scenario, the typical solution with ReplayMerge is to create a new ReplayMerge instance after the consumer falls behind, moving infrastructure concerns into application logic and complicating the code.

With Persistent Subscriptions, once the consumer falls behind, it will start consuming from replay automatically. It then catches up and rejoins the live stream – same as it did initially. This moves the recovery into the infrastructure level, keeping it out of application code.

This is a net positive for end-users, but moves the stream merging from the MDS to the Persistent Subscription logic. This requires a different merging approach from what was implemented in ReplayMerge.

Transitioning from one message stream to another

Simplified Persistent Subscription architecture showing a Publisher and a PersistentSubscription’s components

A Persistent Subscription uses two subscriptions: a replay subscription for historical data and a live subscription for consuming from the live stream. The live subscription isn’t created until the replay has nearly caught up.

The approach is to replay all missed messages until the replay is close enough to the end of the recording, then switch to consuming from live.

On the surface this is simple, but implementing this approach while ensuring no data is lost or duplicated is complex.

The first problem is choosing when to add the live subscription – the recording follows the live stream, so its end position is always advancing. That means it is impossible to reliably reach the end of a recording before the live subscription is added.

To solve this, a Persistent Subscription defines an acceptable switch window for the replay’s position. Once within that window, the replay subscription’s position is deemed ‘close enough’ to the recording’s end and live can be added.
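The switch-window check can be pictured as a simple comparison of positions. The following is a minimal sketch under stated assumptions, not Aeron's implementation: the class name, the method name, and the 64 KiB window are hypothetical, and positions are treated as plain byte offsets in the stream.

```java
/** Hypothetical sketch: decides when a replay is close enough to the recording's end to add live. */
final class SwitchWindow {
    private final long windowLength; // assumed to be a tunable value, in bytes

    SwitchWindow(final long windowLength) {
        if (windowLength <= 0) {
            throw new IllegalArgumentException("windowLength must be positive");
        }
        this.windowLength = windowLength;
    }

    /**
     * The recording's end keeps advancing, so the replay is never required to reach it exactly.
     * It only has to be within windowLength bytes of the end position sampled at this moment.
     */
    boolean canAddLive(final long replayPosition, final long recordingEndPosition) {
        return recordingEndPosition - replayPosition <= windowLength;
    }

    public static void main(final String[] args) {
        final SwitchWindow window = new SwitchWindow(64 * 1024);
        System.out.println(window.canAddLive(1_000_000, 5_000_000)); // false: 4,000,000 bytes behind
        System.out.println(window.canAddLive(4_950_000, 5_000_000)); // true: 50,000 bytes behind
    }
}
```

A wider window lets live be added sooner, but leaves the replay more ground to cover before it has caught up.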

At this point, live is ahead of replay. If a switch were attempted here, the data between the subscriptions’ differing positions would be lost.

To avoid gaps in the stream, the live subscription isn’t consumed from, so its position doesn’t advance, allowing the replay to fully catch up.

This can cause problems with flow control, however, as the live subscription will be the slowest consumer of the stream. This has different implications depending on the flow control strategy.

When set to min, other consumers will be artificially throttled to the pace of the stationary live subscription.

With max, the longer the live subscription is not consumed from, the more likely it is to fall off the stream. When that happens, the Persistent Subscription stays on replay, defeating the purpose of the feature.

This is mitigated by the switch window as the replay subscription’s position will be fairly close to the live subscription’s after it’s been added. This minimizes the time it spends idle, reducing the flow control impact.
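To make the flow control effect concrete, here is a deliberately simplified model of the two strategies. It is a sketch, not Aeron's flow control code: the class, the method names, the positions, and the receiver window are all hypothetical, and the model only captures the idea that the sender's limit is derived from either the slowest or the fastest receiver.

```java
import java.util.Arrays;

/** Hypothetical model of how a stationary receiver affects the sender under min and max. */
final class FlowControlSketch {
    /** Min strategy: the slowest receiver bounds how far the sender may publish. */
    static long minSenderLimit(final long[] receiverPositions, final long receiverWindow) {
        return Arrays.stream(receiverPositions).min().getAsLong() + receiverWindow;
    }

    /** Max strategy: the fastest receiver bounds the sender, so slower receivers can be left behind. */
    static long maxSenderLimit(final long[] receiverPositions, final long receiverWindow) {
        return Arrays.stream(receiverPositions).max().getAsLong() + receiverWindow;
    }

    public static void main(final String[] args) {
        // One healthy consumer at 10,000 and an idle live subscription parked at 2,000.
        final long[] positions = {10_000, 2_000};
        final long receiverWindow = 4_096;

        // Under min the sender is held near the idle subscription, throttling everyone.
        System.out.println(minSenderLimit(positions, receiverWindow)); // 6096

        // Under max the sender runs ahead, and the idle subscription risks falling off the stream.
        System.out.println(maxSenderLimit(positions, receiverWindow)); // 14096
    }
}
```

In both cases the damage grows with the distance between the idle live subscription and everything else, which is why keeping that distance small matters.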

Once it has caught up, the replay subscription is closed and the live subscription is consumed from. This is all completely opaque to the application using the Persistent Subscription; it will always see one continuous stream of messages.

This means the application doesn’t need to concern itself with any of the infrastructure details of the stream; it is entirely focused on business logic.
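The transition described in this section can be summarised as a small state machine. The sketch below is an illustration under stated assumptions rather than the actual PersistentSubscription code: the state names, method names, and window size are hypothetical, and the live subscription is assumed to join exactly at the recording's end position sampled when it is added.

```java
/** Hypothetical sketch of the replay-to-live transition, including the slow consumer fallback. */
final class TransitionSketch {
    enum State { REPLAY, ATTEMPT_SWITCH, LIVE }

    private final long switchWindow;
    private State state = State.REPLAY;
    private long position;         // last position delivered to the application
    private long liveJoinPosition; // where the idle live subscription is parked

    TransitionSketch(final long startPosition, final long switchWindow) {
        this.position = startPosition;
        this.switchWindow = switchWindow;
    }

    State state() { return state; }

    long position() { return position; }

    /** The replay subscription delivered data up to newPosition. */
    void onReplayed(final long newPosition, final long recordingEndPosition) {
        if (state == State.LIVE) {
            throw new IllegalStateException("replay is closed while on live");
        }
        // While switching, never deliver past the live join point, or data would be duplicated.
        position = state == State.ATTEMPT_SWITCH ? Math.min(newPosition, liveJoinPosition) : newPosition;

        if (state == State.REPLAY && recordingEndPosition - position <= switchWindow) {
            liveJoinPosition = recordingEndPosition; // live is added but not consumed from yet
            state = State.ATTEMPT_SWITCH;
        }
        if (state == State.ATTEMPT_SWITCH && position >= liveJoinPosition) {
            state = State.LIVE; // replay has caught up: close it and start consuming live
        }
    }

    /** The live subscription delivered data up to newPosition. */
    void onLive(final long newPosition) {
        if (state != State.LIVE) {
            throw new IllegalStateException("live is not consumed from before the switch");
        }
        position = newPosition;
    }

    /** The consumer fell behind the live stream: fall back to replay from the current position. */
    void onFellBehind() {
        if (state == State.LIVE) {
            state = State.REPLAY;
        }
    }

    public static void main(final String[] args) {
        final TransitionSketch sketch = new TransitionSketch(0, 1_000);
        sketch.onReplayed(5_000, 20_000);
        System.out.println(sketch.state()); // REPLAY
        sketch.onReplayed(19_500, 20_000);
        System.out.println(sketch.state()); // ATTEMPT_SWITCH
        sketch.onReplayed(20_400, 20_400);
        System.out.println(sketch.state()); // LIVE
        sketch.onFellBehind();
        System.out.println(sketch.state()); // REPLAY
    }
}
```

The application only ever observes the advancing position; which subscription produced the data stays inside the sketch, mirroring how the real feature hides the switch.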

Performance Impact

Since Persistent Subscriptions are intended to be a ReplayMerge alternative, their performance impact had to be comparable. The main trade-off is memory – due to its two subscriptions, a Persistent Subscription creates two images while ReplayMerge only creates one.

In terms of latency, the impact is roughly the same.

The graph below shows the latency profile of two applications running on identical nodes; one using Persistent Subscriptions (green) and the other ReplayMerge (red).

Performance comparison for the C implementations of ReplayMerge and PersistentSubscription

Each is sent 100,000 messages per second with a stall message injected every 20 seconds that pauses processing so that the applications fall behind the live stream and have to recover by replaying from the archive.

For the Persistent Subscription application, this triggers the slow consumer fallback while the ReplayMerge application re-creates the ReplayMerge instance.

The recovery curves are practically identical. In both cases, latency increases due to the pause in processing, then drops sharply once recovery completes. The only visible difference is that ReplayMerge has higher latency spikes after recovery than Persistent Subscriptions (red peaks of ~10-20 ms versus green peaks of 100 μs). This is a side effect of recreating the ReplayMerge instance.

Overall, Persistent Subscriptions have a higher memory footprint than ReplayMerge but add no latency over it.

When to use Persistent Subscriptions

Persistent Subscriptions are the recommended feature for late-joining applications going forward, and are available in Java, C and C++.

If you’re already using ReplayMerge, you may be wondering if you need to migrate to this new feature.

As ReplayMerge is still actively supported, the general recommendation is that if you don’t need any of the features of Persistent Subscriptions and are already using ReplayMerge, there’s no need to migrate. ReplayMerge also has a lighter memory footprint, which may matter for some deployments.

If you’re planning a colocated deployment or looking to decrease latency by replaying from a local archive, Persistent Subscriptions are the right choice. They encapsulate infrastructure concerns, which simplifies your code.

Nadia Aina, Full-stack Java Developer
Adaptive | Aeron
Nadia Aina is a Full-Stack Java Developer at Adaptive, working in the Aeron product team. She joined via Adaptive’s Early Careers Programme and now contributes to Aeron’s open-source releases and community talks.

Further reading

Software release Version 1.51.0 released

Documentation Aeron Archive Overview

Resources Performance benchmark testing guides

Tech Deep Dive Flow Control and Back Pressure in Distributed Systems
