Aeron Metrics

Metrics published by the Prometheus Exporters all include a help section that describes the metric. This page describes in more detail the key metrics that you will probably want to monitor in an Aeron Cluster application, how to interpret their values, and how to tell when they indicate a problem.

The sections on this page match the sections in the sample dashboard.

Cluster Status

Cluster Activity

How recently the cluster has been observed to be active. Cluster nodes update their activity timestamps roughly once per second, so any value in the region of 1 second or less is normal.

Metric: cluster_activity_age_seconds

Driver Activity

How recently the media driver has been observed to be active. The driver updates its activity timestamp roughly once per second, so any value in the region of 1 second or less is normal.

Metric: driver_heartbeat_age_seconds
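
As a rough illustration of how the two activity-age metrics above might be checked, the sketch below classifies a reading as "ok" or "stale". The function name, the sample readings, and the 5-second margin are all assumptions made for this example; they are not Aeron defaults, and you should pick a margin that suits your own alerting.

```python
def activity_status(age_seconds: float, warn_after_seconds: float = 5.0) -> str:
    """Classify an activity-age reading.

    warn_after_seconds is an assumed alerting margin, not an Aeron default.
    Ages of about 1 second or less are normal, so the margin leaves headroom.
    """
    if age_seconds < 0:
        raise ValueError("age cannot be negative")
    if age_seconds <= warn_after_seconds:
        return "ok"
    return "stale"


# Hypothetical sample readings, keyed by metric name.
readings = {
    "cluster_activity_age_seconds": 0.8,
    "driver_heartbeat_age_seconds": 12.0,
}
statuses = {name: activity_status(age) for name, age in readings.items()}
print(statuses)
```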

Consensus Module State

The current state of the Consensus Module component of the Cluster. In normal operation, this will be 1 (ACTIVE), but it will temporarily be 0 (INIT) on startup and 3 (SNAPSHOT) while taking a snapshot.

The full set of values is:

0 (INIT) Starting up and recovering state.
1 (ACTIVE) Active state with ingress and expired timers appended to the log.
2 (SUSPENDED) Suspended processing of ingress and expired timers.
3 (SNAPSHOT) In the process of taking a snapshot.
4 (QUITTING) Quitting the cluster and shutting down as soon as services ack without taking a snapshot.
5 (TERMINATING) In the process of terminating the node.
6 (CLOSED) Terminal state.

These values are defined in the State enum in the ConsensusModule class.

Metric: cluster_consensus_module_state
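
If you consume this metric in your own tooling, it can help to translate the numeric value back into a state name. The sketch below mirrors the table above in a Python enum. The "normal", "temporary" and "other" grouping is one reading of the description in this section, and the function name is made up for the example.

```python
from enum import IntEnum


class ConsensusModuleState(IntEnum):
    INIT = 0
    ACTIVE = 1
    SUSPENDED = 2
    SNAPSHOT = 3
    QUITTING = 4
    TERMINATING = 5
    CLOSED = 6


# Assumed grouping: ACTIVE is the steady state; INIT and SNAPSHOT are
# expected to be temporary. Everything else is reported as "other".
EXPECTED_TEMPORARY = {ConsensusModuleState.INIT, ConsensusModuleState.SNAPSHOT}


def describe_state(sample_value: float) -> str:
    """Turn a sample value such as 1.0 into a readable description."""
    state = ConsensusModuleState(int(sample_value))
    if state is ConsensusModuleState.ACTIVE:
        kind = "normal"
    elif state in EXPECTED_TEMPORARY:
        kind = "temporary"
    else:
        kind = "other"
    return f"{state.value} ({state.name}): {kind}"


print(describe_state(1.0))
```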

Election State

Shows whether an election is taking place and, if so, which stage it is in. In normal operation, this should be 17 (CLOSED), indicating that an election has taken place and completed. All other states are temporary.

The full set of values is:

0 (INIT) Consolidate local state and prepare for new leadership.
1 (CANVASS) Canvass members for current state and to assess if a successful leadership attempt can be mounted.
2 (NOMINATE) Nominate member for new leadership by requesting votes.
3 (CANDIDATE_BALLOT) Await ballot outcome from members on candidacy for leadership.
4 (FOLLOWER_BALLOT) Await ballot outcome after voting for a candidate.
5 (LEADER_LOG_REPLICATION) Wait for followers to replicate any missing log entries to track commit position.
6 (LEADER_REPLAY) Replay local log in preparation for new leadership term.
7 (LEADER_INIT) Initialise state for new leadership term.
8 (LEADER_READY) Publish new leadership term and commit position, while awaiting followers ready.
9 (FOLLOWER_LOG_REPLICATION) Replicate missing log entries from the leader.
10 (FOLLOWER_REPLAY) Replay local log in preparation for following new leader.
11 (FOLLOWER_CATCHUP_INIT) Initialise catch-up in preparation of receiving a replay from the leader to catch up in current term.
12 (FOLLOWER_CATCHUP_AWAIT) Await joining a replay from leader to catch-up.
13 (FOLLOWER_CATCHUP) Catch-up to leader until live log can be added and merged.
14 (FOLLOWER_LOG_INIT) Initialise follower in preparation for joining the live log.
15 (FOLLOWER_LOG_AWAIT) Await joining the live log from the leader.
16 (FOLLOWER_READY) Publish append position to leader to signify ready for new term.
17 (CLOSED) Election is closed after new leader is established.

These values are defined in the ElectionState enum.

Metric: cluster_election_state
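
Because every value other than 17 means an election is under way, a simple check is enough to flag one. The following is a minimal sketch that rebuilds the table above as a Python enum; the function name is hypothetical.

```python
from enum import IntEnum

# Names in the same order as the table above, numbered from 0.
ElectionState = IntEnum(
    "ElectionState",
    [
        "INIT", "CANVASS", "NOMINATE", "CANDIDATE_BALLOT", "FOLLOWER_BALLOT",
        "LEADER_LOG_REPLICATION", "LEADER_REPLAY", "LEADER_INIT", "LEADER_READY",
        "FOLLOWER_LOG_REPLICATION", "FOLLOWER_REPLAY", "FOLLOWER_CATCHUP_INIT",
        "FOLLOWER_CATCHUP_AWAIT", "FOLLOWER_CATCHUP", "FOLLOWER_LOG_INIT",
        "FOLLOWER_LOG_AWAIT", "FOLLOWER_READY", "CLOSED",
    ],
    start=0,
)


def election_status(sample_value: float) -> str:
    """Report whether the sample value indicates an election in progress."""
    state = ElectionState(int(sample_value))
    if state is ElectionState.CLOSED:
        return "no election in progress"
    return f"election in progress: {state.name}"


print(election_status(17))
print(election_status(1))
```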

Node Role

The role this node plays in the cluster. One node should be 2 (Leader), the others should be 0 (Follower). Nodes will temporarily be 1 (Candidate) during an election.

These values are defined in the Role enum in the Cluster class.

Metric: cluster_node_role
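
The expectation of one leader with the rest as followers can be checked across nodes. The sketch below assumes you have already collected the role value for each node into a dictionary; the node names and the function name are invented for the example.

```python
FOLLOWER, CANDIDATE, LEADER = 0, 1, 2


def role_summary(roles_by_node: dict) -> str:
    """Summarise cluster_node_role samples collected from every node."""
    leaders = sorted(n for n, r in roles_by_node.items() if int(r) == LEADER)
    candidates = sorted(n for n, r in roles_by_node.items() if int(r) == CANDIDATE)
    if candidates:
        return "election in progress: candidates " + ", ".join(candidates)
    if len(leaders) == 1:
        return f"ok: leader is {leaders[0]}"
    if not leaders:
        return "problem: no leader"
    return "problem: multiple leaders " + ", ".join(leaders)


# Hypothetical three-node cluster.
print(role_summary({"node0": 0, "node1": 2, "node2": 0}))
```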

Cluster Commit Position

The position in the ingress log that has been replicated to a quorum of nodes. This position should be roughly the same on all nodes. In a cluster that is handling traffic, this position should be increasing.

Metric: cluster_commit_position_total
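
Both expectations above (positions roughly equal across nodes, and increasing under load) can be checked by comparing two scrapes. This is a sketch only: the node names and sample values are made up, and max_spread is an assumed tolerance that you would need to tune for your own traffic.

```python
def commit_position_report(previous: dict, current: dict, max_spread: int = 4096) -> dict:
    """Compare commit positions from two consecutive scrapes.

    previous and current map node name to cluster_commit_position_total.
    max_spread is an assumed tolerance for "roughly the same".
    """
    spread = max(current.values()) - min(current.values())
    stalled = sorted(
        node for node, position in current.items()
        if node in previous and position <= previous[node]
    )
    return {
        "spread": spread,
        "spread_ok": spread <= max_spread,
        "stalled_nodes": stalled,
    }


# Hypothetical samples: node2 has not advanced between scrapes.
previous = {"node0": 1000, "node1": 1000, "node2": 960}
current = {"node0": 2048, "node1": 2048, "node2": 960}
report = commit_position_report(previous, current)
print(report)
```

A stalled node is only a concern when the cluster is handling traffic; on an idle cluster every node will appear in stalled_nodes.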

Number of Valid Snapshots

The number of valid snapshots on the cluster. This metric will increase as new snapshots are taken, and decrease as old snapshots are removed.

This metric is distinct from the related snapshots_taken metric, which reports how many snapshots have been taken since the cluster node started. The snapshots_taken metric will reset if the cluster node is restarted, and does not take snapshot invalidation into account.

Metric: cluster_valid_snapshots

Last Snapshot Time

The timestamp of the most recent valid snapshot on the cluster.

Metric: cluster_last_snapshot_timestamp_seconds
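
A timestamp is usually easier to alert on once it has been converted into an age. The sketch below assumes the metric value is a Unix epoch timestamp in seconds, which the metric name suggests but this page does not state explicitly. The function name is hypothetical.

```python
import time


def snapshot_age_seconds(last_snapshot_timestamp_seconds: float, now_seconds=None) -> float:
    """Return how long ago the most recent valid snapshot was taken.

    Assumes the timestamp is seconds since the Unix epoch.
    """
    if now_seconds is None:
        now_seconds = time.time()
    # Clamp at zero in case of small clock differences between hosts.
    return max(0.0, now_seconds - last_snapshot_timestamp_seconds)


# Hypothetical values: a snapshot taken one hour before "now".
age = snapshot_age_seconds(1_700_000_000, now_seconds=1_700_003_600)
print(f"last snapshot was {age / 3600:.1f} hours ago")
```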

Archive Status

Archive Control Sessions

The number of connections to a Cluster Node’s archive for issuing instructions to the Archive. There should be at least 1 connection, which will be the cluster node itself.

Metric: archive_control_sessions

Archive Recording Sessions

The number of sessions currently recording in a Cluster Node’s archive. There should be at least 1, which will be the cluster node itself.

Metric: archive_recording_sessions

Errors

How many errors have been recorded by the different components. This refers to errors that have been handled by the Aeron error handler and recorded in an Aeron distinct error log. Other application-level errors that were caught without notifying Aeron will not be included in this figure.

A non-zero value for this indicates errors have occurred at some point while the component has been running. The errors command on the Insights CLI will show you what errors occurred and when.

There are metrics for each component with a distinct error log.

Metrics:

Media driver
- driver_errors_total
Archive
- archive_errors_total
Cluster
- cluster_errors_total
Cluster Service
- cluster_service_errors_total

Processing Time

Timing metrics show how much time was spent executing different components in the system. There are two kinds of timing metrics:

- Maximum time: the maximum time spent performing a particular operation, such as executing an Agent’s duty cycle or taking a snapshot.
- Threshold exceeded: how many times the processing time for the operation exceeded the threshold.

The default threshold is one second, but it can be changed for each operation by setting system properties when running the application. For example, set aeron.driver.conductor.cycle.threshold to change the threshold for the driver conductor.

The operations for which processing time is recorded are:

Time the Media Driver conductor agent spent executing a duty cycle.
- driver_max_cycle_time_seconds{agent="conductor"}
- driver_cycle_time_threshold_exceeded_total{agent="conductor"}
Time the Media Driver sender agent spent executing a duty cycle.
- driver_max_cycle_time_seconds{agent="sender"}
- driver_cycle_time_threshold_exceeded_total{agent="sender"}
Time the Media Driver receiver agent spent executing a duty cycle.
- driver_max_cycle_time_seconds{agent="receiver"}
- driver_cycle_time_threshold_exceeded_total{agent="receiver"}
Time the Media Driver spent resolving an address.
- driver_name_resolver_max_time_seconds
- driver_name_resolver_time_threshold_exceeded_total
Time the Archive Conductor agent spent executing a duty cycle.
- archive_max_cycle_time_seconds{agent="archive-conductor"}
- archive_cycle_time_threshold_exceeded_total{agent="archive-conductor"}
Time the Archive Recorder agent spent executing a duty cycle.
- archive_max_cycle_time_seconds{agent="archive-recorder"}
- archive_cycle_time_threshold_exceeded_total{agent="archive-recorder"}
Time the Archive Recorder spent writing to disk.
- archive_recorder_max_write_time_seconds
Time the Archive Replayer agent spent executing a duty cycle.
- archive_max_cycle_time_seconds{agent="archive-replayer"}
- archive_cycle_time_threshold_exceeded_total{agent="archive-replayer"}
Time the Archive Replayer spent reading from disk.
- archive_replayer_max_read_time_seconds
Time the Consensus Module agent spent executing a duty cycle.
- cluster_max_cycle_time_seconds
- cluster_cycle_time_threshold_exceeded_total
Time a Clustered Service agent spent executing a duty cycle.
- cluster_service_max_cycle_time_seconds
- cluster_service_cycle_time_threshold_exceeded_total
Time spent taking a snapshot for the Consensus Module.
- cluster_max_snapshot_duration_seconds
- cluster_snapshot_duration_threshold_exceeded_total
Time spent taking a snapshot for a Clustered Service.
- cluster_service_max_snapshot_duration_seconds
- cluster_service_snapshot_duration_threshold_exceeded_total

As it is possible to run multiple clusters and archives on a single media driver, the cluster and archive metrics are qualified with a clusterId or archiveId. As a cluster can run multiple clustered services, the clustered services metrics are further qualified with a serviceId.

Note that when running the archive in shared threading mode, there will not be separate metrics for the Recorder and Replayer agent duty cycles. Only the Conductor agent exists in this case.

You can also see this information by running the processing-time command on the Insights CLI.
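
If you want to check the threshold counters outside of a dashboard, you can scan the exporter's text output for any "threshold exceeded" counter that is above zero. The sketch below uses a deliberately simplified parser for the Prometheus text format, not a complete one. The sample text, the label spellings (clusterId, archiveId, serviceId) and the values are assumptions made for illustration.

```python
import re

# Hypothetical exporter output; real label names and values may differ.
SAMPLE = '''\
# HELP driver_cycle_time_threshold_exceeded_total example help text
driver_cycle_time_threshold_exceeded_total{agent="sender"} 0
archive_cycle_time_threshold_exceeded_total{agent="archive-recorder",archiveId="1"} 3
cluster_cycle_time_threshold_exceeded_total{clusterId="0"} 0
cluster_service_cycle_time_threshold_exceeded_total{clusterId="0",serviceId="0"} 2
'''

LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        yield name, labels, float(value)


def thresholds_exceeded(text):
    """Return the threshold counters that have a value above zero."""
    return [
        (name, labels, value)
        for name, labels, value in parse_samples(text)
        if name.endswith("_threshold_exceeded_total") and value > 0
    ]


for name, labels, value in thresholds_exceeded(SAMPLE):
    print(name, labels, value)
```

For production use, an established Prometheus client library is a better choice than a hand-written parser.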

Packet Loss

Amount of packet loss recorded by the Media Driver. Any non-zero value indicates packet loss has occurred. If the number is increasing, it indicates packet loss is ongoing. You can see more information about the packet loss by running the loss command on the Insights CLI.

Metrics:

How many times the media driver for a receiver has detected lost packets and sent a NAK to request the sender resend the missing data.
- driver_nak_messages_sent_total
How many times the media driver for a sender has been notified that packet loss has occurred and it should resend data.
- driver_nak_messages_received_total
Number of times the media driver for a receiver has detected lost packets on a stream where reliable transmission has been disabled.
- driver_loss_gap_fills_total
Number of packets resent as a result of NAKs.
- driver_retransmits_sent_total
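
The distinction between past and ongoing packet loss described above can be made by comparing two scrapes of the same counter. This is a minimal sketch; the function name and the "counter reset" case (for example, after a media driver restart) are assumptions added for the example.

```python
def loss_status(previous_total: float, current_total: float) -> str:
    """Classify a loss counter, such as driver_nak_messages_sent_total,
    using its value from two consecutive scrapes."""
    if current_total < previous_total:
        # Assumed cause: the counter restarted from zero.
        return "counter reset"
    if current_total > previous_total:
        return "ongoing"
    if current_total > 0:
        return "historical"
    return "none"


print(loss_status(10, 25))
```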

Traffic Stats

Bytes Sent

The total number of bytes sent by the Media Driver while it has been running. In a system that is handling traffic, this number should be increasing.

Metric: driver_bytes_sent_total

Bytes Received

The total number of bytes received by the Media Driver while it has been running. In a system that is handling traffic, this number should be increasing.

Metric: driver_bytes_received_total

Back Pressure

The number of back pressure events experienced while attempting to send data over the network. There is an aggregate metric showing the total number of back pressure events experienced while the media driver has been running, as well as back pressure metrics for individual streams.

In a healthy system, the number of back pressure events should be zero.

Metrics:

Aggregate metric
- driver_sender_bpe_total
Stream metrics
- driver_sender_bpe_total{streamId}

Potential Problem Indicators

These are other metrics that can indicate potential issues. In a healthy system, these should all be zero.

How many cluster clients have been disconnected due to inactivity.
- cluster_timed_out_clients_total
The number of packets received that could not be parsed. For example, packets that are too small to contain a frame header.
- driver_invalid_packets_total
The number of times a socket send operation resulted in sending less than the expected length of the packet. In these cases the packet is not sent.
- driver_short_sends_total
How many times the Media Driver attempted to clean up an unused resource and was not successful. For example, if deleting an old log buffer fails because it is still open/mapped on the clients.
- driver_free_fails_total
How many times a publication needed to be unblocked. This will happen if a publisher claims space in a publication, but does not complete the publication of the message.
- driver_unblocked_publications_total
How many times a command in the toDriver buffer needed to be unblocked. This will happen if a client claims space in the buffer to send a command to the Media Driver, but does not complete the writing of the instruction (typically the client is closed/crashed).
- driver_unblocked_commands_total
How many times a media driver client was closed due to a timeout.
- driver_client_timeouts_total
How many times an error frame was sent. This indicates that the other side of an interaction attempted something and was rejected. For example, a publisher that attempts to create too many publications for a subscription.
- driver_error_frames_sent_total
How many times an error frame was received.
- driver_error_frames_received_total
How many packets have been received beyond the receiver’s flow control window. This indicates the receiver is falling behind other subscribers in a stream.
- driver_flow_control_over_runs_total
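
Since all of the metrics in this section should be zero in a healthy system, a single check can report any that are not. The sketch below assumes the latest sample values have already been collected into a dictionary keyed by metric name; the sample values and the function name are made up.

```python
# Metric names taken from the list above.
PROBLEM_INDICATORS = (
    "cluster_timed_out_clients_total",
    "driver_invalid_packets_total",
    "driver_short_sends_total",
    "driver_free_fails_total",
    "driver_unblocked_publications_total",
    "driver_unblocked_commands_total",
    "driver_client_timeouts_total",
    "driver_error_frames_sent_total",
    "driver_error_frames_received_total",
    "driver_flow_control_over_runs_total",
)


def nonzero_indicators(samples: dict) -> dict:
    """Return the problem indicators whose latest value is above zero.

    Metrics missing from samples are treated as zero.
    """
    return {
        name: samples[name]
        for name in PROBLEM_INDICATORS
        if samples.get(name, 0) > 0
    }


# Hypothetical scrape results.
samples = {
    "driver_short_sends_total": 0,
    "driver_client_timeouts_total": 2,
    "driver_unblocked_commands_total": 1,
}
print(nonzero_indicators(samples))
```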