Aeron Metrics
Metrics published by the Prometheus
Exporters all include a help section that describes the metric. This
page provides a more detailed description of some of the key metrics
that you will probably want to monitor in an Aeron Cluster application,
how to interpret the values, and how you can identify if the metrics are
highlighting an issue.
The sections on this page match the sections in the sample dashboard.
Cluster Status
Cluster Activity
How recently the cluster has been observed to be active. Cluster nodes update their activity timestamps roughly once per second, so any value in the region of 1 second or less is normal.
Metric: cluster_activity_age_seconds
Driver Activity
How recently the media driver has been observed to be active. The driver updates its activity timestamp roughly once per second, so any value in the region of 1 second or less is normal.
Metric: driver_heartbeat_age_seconds
Consensus Module State
The current state of the Consensus Module component of the Cluster. In
normal operation, this will be 1 (Active), but will temporarily be 0
on startup, and 3 while taking a snapshot.
The full set of values is:
-
0 (INIT) Starting up and recovering state.
-
1 (ACTIVE) Active state with ingress and expired timers appended to the log.
-
2 (SUSPENDED) Suspended processing of ingress and expired timers.
-
3 (SNAPSHOT) In the process of taking a snapshot.
-
4 (QUITTING) Quitting the cluster and shutting down as soon as services ack without taking a snapshot.
-
5 (TERMINATING) In the process of terminating the node.
-
6 (CLOSED) Terminal state.
These values are defined in the State enum in the
ConsensusModule
class.
Metric: cluster_consensus_module_state
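When this metric is consumed programmatically, it can help to translate the numeric value back into the state name. The following is a minimal sketch in Python that uses only the values listed above. The helper names `describe_consensus_module_state` and `is_normal_operation` are hypothetical and are not part of Aeron or the exporter.

```python
# Values of cluster_consensus_module_state, as listed above.
CONSENSUS_MODULE_STATES = {
    0: "INIT",
    1: "ACTIVE",
    2: "SUSPENDED",
    3: "SNAPSHOT",
    4: "QUITTING",
    5: "TERMINATING",
    6: "CLOSED",
}


def describe_consensus_module_state(value):
    """Return the state name for a sampled metric value (hypothetical helper)."""
    return CONSENSUS_MODULE_STATES.get(int(value), "UNKNOWN")


def is_normal_operation(value):
    """True only for 1 (ACTIVE), the value expected in normal operation."""
    return int(value) == 1
```

Prometheus samples are floating point, which is why the helpers convert the value to an integer before the lookup.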
Election State
Shows whether an election is taking place and which stage it is
in. In normal operation, this should be 17 (Closed), indicating that
an election has occurred and has completed. All other states are
temporary.
The full set of values is:
-
0 (INIT) Consolidate local state and prepare for new leadership.
-
1 (CANVASS) Canvass members for current state and to assess if a successful leadership attempt can be mounted.
-
2 (NOMINATE) Nominate member for new leadership by requesting votes.
-
3 (CANDIDATE_BALLOT) Await ballot outcome from members on candidacy for leadership.
-
4 (FOLLOWER_BALLOT) Await ballot outcome after voting for a candidate.
-
5 (LEADER_LOG_REPLICATION) Wait for followers to replicate any missing log entries to track commit position.
-
6 (LEADER_REPLAY) Replay local log in preparation for new leadership term.
-
7 (LEADER_INIT) Initialise state for new leadership term.
-
8 (LEADER_READY) Publish new leadership term and commit position, while awaiting followers ready.
-
9 (FOLLOWER_LOG_REPLICATION) Replicate missing log entries from the leader.
-
10 (FOLLOWER_REPLAY) Replay local log in preparation for following new leader.
-
11 (FOLLOWER_CATCHUP_INIT) Initialise catch-up in preparation of receiving a replay from the leader to catch up in current term.
-
12 (FOLLOWER_CATCHUP_AWAIT) Await joining a replay from leader to catch-up.
-
13 (FOLLOWER_CATCHUP) Catch-up to leader until live log can be added and merged.
-
14 (FOLLOWER_LOG_INIT) Initialise follower in preparation for joining the live log.
-
15 (FOLLOWER_LOG_AWAIT) Await joining the live log from the leader.
-
16 (FOLLOWER_READY) Publish append position to leader to signify ready for new term.
-
17 (CLOSED) Election is closed after new leader is established.
These values are defined in the
ElectionState
enum.
Metric: cluster_election_state
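Because every state other than 17 (CLOSED) is expected to be temporary, one useful check is how long a node has remained outside CLOSED. The sketch below assumes you already have an ordered list of (timestamp, value) samples for one node. The function name `election_open_duration` and the sample layout are illustrative assumptions, not part of Aeron.

```python
# Index in this list equals the cluster_election_state metric value.
ELECTION_STATES = [
    "INIT", "CANVASS", "NOMINATE", "CANDIDATE_BALLOT", "FOLLOWER_BALLOT",
    "LEADER_LOG_REPLICATION", "LEADER_REPLAY", "LEADER_INIT", "LEADER_READY",
    "FOLLOWER_LOG_REPLICATION", "FOLLOWER_REPLAY", "FOLLOWER_CATCHUP_INIT",
    "FOLLOWER_CATCHUP_AWAIT", "FOLLOWER_CATCHUP", "FOLLOWER_LOG_INIT",
    "FOLLOWER_LOG_AWAIT", "FOLLOWER_READY", "CLOSED",
]
CLOSED = 17


def election_open_duration(samples):
    """Seconds the most recent samples have continuously been non-CLOSED.

    samples: list of (timestamp_seconds, state_value), oldest first
    (an assumed layout). Returns 0.0 when the latest sample is CLOSED
    or when there are no samples.
    """
    if not samples or int(samples[-1][1]) == CLOSED:
        return 0.0
    start = samples[-1][0]
    # Walk backwards until the last CLOSED sample is found.
    for timestamp, state in reversed(samples):
        if int(state) == CLOSED:
            break
        start = timestamp
    return samples[-1][0] - start
```

What duration should count as too long depends on your cluster's election timeouts, so the sketch deliberately returns the duration rather than a verdict.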
Node Role
The role this node plays in the cluster. One node should be 2
(Leader) and the others should be 0 (Follower). Nodes will temporarily be
1 (Candidate) during an election.
These values are defined in the Role enum in the
Cluster
class.
Metric: cluster_node_role
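The expectation of exactly one leader can be checked across nodes. This is a minimal sketch, assuming the latest `cluster_node_role` value has already been collected for each node into a dictionary. The function name `check_roles` and the returned strings are hypothetical.

```python
# Values of cluster_node_role, as described above.
FOLLOWER, CANDIDATE, LEADER = 0, 1, 2


def check_roles(roles_by_node):
    """Summarise the roles reported by a set of nodes.

    roles_by_node: dict of node name -> cluster_node_role value
    (an assumed layout).
    """
    values = [int(v) for v in roles_by_node.values()]
    if values.count(CANDIDATE) > 0:
        # Candidates only appear temporarily, during an election.
        return "election in progress"
    leaders = values.count(LEADER)
    if leaders == 1:
        return "ok"
    return "unexpected: %d leaders" % leaders
```

For example, `check_roles({"node0": 2, "node1": 0, "node2": 0})` reports a healthy cluster, while a result showing zero or several leaders is worth investigating alongside the election state.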
Cluster Commit Position
The position in the ingress log that has been replicated to a quorum of nodes. This position should be roughly the same on all nodes. In a cluster that is handling traffic, this position should be increasing.
Metric: cluster_commit_position_total
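To judge whether the position is roughly the same on all nodes, you can compare each node against the highest position reported. The sketch below assumes the latest value for each node is already available in a dictionary. The function name `commit_position_lag` is hypothetical, and what counts as an acceptable lag is left to you.

```python
def commit_position_lag(positions_by_node):
    """Return how far each node is behind the highest commit position.

    positions_by_node: dict of node name -> cluster_commit_position_total
    value (an assumed layout). The most advanced node has a lag of 0.
    """
    highest = max(positions_by_node.values())
    return {node: highest - pos for node, pos in positions_by_node.items()}
```

Samples from different nodes are not scraped at exactly the same instant, so a small non-zero lag is expected on a cluster that is handling traffic.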
Number of Valid Snapshots
The number of valid snapshots on the cluster. This metric will increase as new snapshots are taken, and decrease as old snapshots are removed.
This metric is distinct from the related snapshots_taken metric, which reports how many snapshots have been taken since the cluster node started. The snapshots_taken metric will reset if the cluster node is restarted, and does not take snapshot invalidation into account.
Metric: cluster_valid_snapshots
Archive Status
Errors
How many errors have been recorded by the different components. This refers to errors that have been handled by the Aeron error handler and recorded in an Aeron distinct error log. Other application-level errors that were caught without notifying Aeron will not be included in this figure.
A non-zero value for this indicates errors have occurred at some point
while the component has been running. The errors command on the
Insights CLI will show you what errors occurred and when.
There are metrics for each component with a distinct error log.
Metrics:
-
Media driver
-
driver_errors_total
-
-
Archive
-
archive_errors_total
-
-
Cluster
-
cluster_errors_total
-
-
Cluster Service
-
cluster_service_errors_total
-
Processing Time
Timing metrics show how much time was spent executing different
components in the system. There are two kinds of timing metrics:
-
Maximum time: the maximum time spent performing a particular operation, such as executing an Agent’s duty cycle or taking a snapshot.
-
Threshold exceeded: how many times the processing time for the operation exceeded the threshold.
The default value for the threshold is one second, but it can be changed for each operation by setting system properties when running the application. For example, set aeron.driver.conductor.cycle.threshold to change the threshold for the driver conductor.
The operations for which processing time is recorded are:
-
Time the Media Driver conductor agent spent executing a duty cycle.
-
driver_max_cycle_time_seconds{agent="conductor"} -
driver_cycle_time_threshold_exceeded_total{agent="conductor"}
-
-
Time the Media Driver sender agent spent executing a duty cycle.
-
driver_max_cycle_time_seconds{agent="sender"} -
driver_cycle_time_threshold_exceeded_total{agent="sender"}
-
-
Time the Media Driver receiver agent spent executing a duty cycle.
-
driver_max_cycle_time_seconds{agent="receiver"} -
driver_cycle_time_threshold_exceeded_total{agent="receiver"}
-
-
Time the Media Driver spent resolving an address.
-
driver_name_resolver_max_time_seconds -
driver_name_resolver_time_threshold_exceeded_total
-
-
Time the Archive Conductor agent spent executing a duty cycle.
-
archive_max_cycle_time_seconds{agent="archive-conductor"} -
archive_cycle_time_threshold_exceeded_total{agent="archive-conductor"}
-
-
Time the Archive Recorder agent spent executing a duty cycle.
-
archive_max_cycle_time_seconds{agent="archive-recorder"} -
archive_cycle_time_threshold_exceeded_total{agent="archive-recorder"}
-
-
Time the Archive Recorder spent writing to disk.
-
archive_recorder_max_write_time_seconds
-
-
Time the Archive Replayer agent spent executing a duty cycle.
-
archive_max_cycle_time_seconds{agent="archive-replayer"} -
archive_cycle_time_threshold_exceeded_total{agent="archive-replayer"}
-
-
Time the Archive Replayer spent reading from disk.
-
archive_replayer_max_read_time_seconds
-
-
Time the Consensus Module agent spent executing a duty cycle.
-
cluster_max_cycle_time_seconds -
cluster_cycle_time_threshold_exceeded_total
-
-
Time a Clustered Service agent spent executing a duty cycle.
-
cluster_service_max_cycle_time_seconds -
cluster_service_cycle_time_threshold_exceeded_total
-
-
Time spent taking a snapshot for the Consensus Module.
-
cluster_max_snapshot_duration_seconds -
cluster_snapshot_duration_threshold_exceeded_total
-
-
Time spent taking a snapshot for a Clustered Service.
-
cluster_service_max_snapshot_duration_seconds -
cluster_service_snapshot_duration_threshold_exceeded_total
-
As it is possible to run multiple clusters and archives on a single
media driver, the cluster and archive metrics are qualified with a
clusterId or archiveId. As a cluster can run multiple clustered
services, the clustered services metrics are further qualified with a
serviceId.
Note that when running the archive in shared threading mode, there will not be separate metrics for the Recorder and Replayer agent duty cycles. Only the Conductor agent exists in this case.
You can also see this information by running the processing-time
command on the Insights CLI.
Packet Loss
Amount of packet loss recorded by the Media Driver. Any non-zero value
indicates packet loss has occurred. If the number is increasing, it
indicates packet loss is ongoing. You can see more information about the
packet loss by running the loss command on the Insights CLI.
Metrics:
-
How many times the media driver for a receiver has detected lost packets and sent a NAK to request the sender resend the missing data.
-
driver_nak_messages_sent_total
-
-
How many times the media driver for a sender has been notified that packet loss has occurred and it should resend data.
-
driver_nak_messages_received_total
-
-
Number of times the media driver for a receiver has detected lost packets on a stream where reliable transmission has been disabled.
-
driver_loss_gap_fills_total
-
-
Number of packets resent as a result of NAKs.
-
driver_retransmits_sent_total
-
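The distinction between loss that happened at some point and loss that is still ongoing comes down to comparing two samples of the same counter. This is a minimal sketch under that reading. The function name `classify_loss` and the returned labels are hypothetical, and treating a decrease as a counter reset (for example, after a media driver restart) is an assumption.

```python
def classify_loss(previous, current):
    """Classify packet loss from two successive samples of a loss counter.

    previous, current: values of a counter such as
    driver_nak_messages_sent_total, oldest first.
    """
    if current < previous:
        # Assumption: the counter only goes down if it has been reset.
        return "counter reset"
    if current > previous:
        return "ongoing"
    if current > 0:
        # Non-zero but unchanged: loss occurred earlier and has stopped.
        return "historical"
    return "none"
```

The same comparison applies to any of the counters listed above, so a single helper can be reused for each of them.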
Traffic Stats
Bytes Sent
The total number of bytes sent by the Media Driver while it has been running. In a system that is handling traffic, this number should be increasing.
Metric: driver_bytes_sent_total
Bytes Received
The total number of bytes received by the Media Driver while it has been running. In a system that is handling traffic, this number should be increasing.
Metric: driver_bytes_received_total
Back Pressure
The number of back pressure events experienced while attempting to send data over the network. There is an aggregate metric showing the total number of back pressure events experienced while the media driver has been running, as well as back pressure metrics for individual streams.
In a healthy system, the number of back pressure events should be zero.
Metrics:
-
Aggregate metric
-
driver_sender_bpe_total
-
-
Stream metrics
-
driver_sender_bpe_total{streamId}
-
Potential Problem Indicators
These are other metrics that can indicate potential issues. In a healthy system, these should all be zero.
-
How many cluster clients have been disconnected due to inactivity.
-
cluster_timed_out_clients_total
-
-
The number of packets received that could not be parsed. For example, packets that are too small to contain a frame header.
-
driver_invalid_packets_total
-
-
The number of times a socket send operation resulted in sending less than the expected length of the packet. In these cases the packet is not sent.
-
driver_short_sends_total
-
-
How many times the Media Driver attempted to clean up an unused resource and was not successful. For example, if deleting an old log buffer fails because it is still open/mapped on the clients.
-
driver_free_fails_total
-
-
How many times a publication needed to be unblocked. This will happen if a publisher claims space in a publication, but does not complete the publication of the message.
-
driver_unblocked_publications_total
-
-
How many times a command in the toDriver buffer needed to be unblocked. This will happen if a client claims space in the buffer to send a command to the Media Driver, but does not complete the writing of the instruction (typically the client is closed/crashed).
-
driver_unblocked_commands_total
-
-
How many times a media driver client was closed due to a timeout.
-
driver_client_timeouts_total
-
-
How many times an error frame was sent. This indicates that the other side of an interaction attempted something and was rejected. For example, a publisher that attempts to create too many publications for a subscription.
-
driver_error_frames_sent_total
-
-
How many times an error frame was received.
-
driver_error_frames_received_total
-
-
How many packets have been received beyond the receiver’s flow control window. This indicates the receiver is falling behind other subscribers in a stream.
-
driver_flow_control_over_runs_total
-
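Since all of the metrics in this section should be zero in a healthy system, a simple check is to list whichever ones are not. The sketch below assumes the latest values have been collected into a dictionary keyed by metric name. The function name `non_zero_indicators` and the input layout are hypothetical.

```python
# Metric names taken from the list above.
PROBLEM_INDICATORS = [
    "cluster_timed_out_clients_total",
    "driver_invalid_packets_total",
    "driver_short_sends_total",
    "driver_free_fails_total",
    "driver_unblocked_publications_total",
    "driver_unblocked_commands_total",
    "driver_client_timeouts_total",
    "driver_error_frames_sent_total",
    "driver_error_frames_received_total",
    "driver_flow_control_over_runs_total",
]


def non_zero_indicators(sample):
    """Return the problem indicators in `sample` with a value above zero.

    sample: dict of metric name -> latest value (an assumed layout).
    Metrics missing from the sample are treated as zero.
    """
    return {
        name: sample[name]
        for name in PROBLEM_INDICATORS
        if sample.get(name, 0) > 0
    }
```

An empty result means none of the listed indicators has fired since the corresponding component started.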