Aeron Tooling
Aeron Stat¶
Aeron Stat outputs the key counters from Aeron plus the positions and key counters of all the active and recently active streams.
To use Aeron Stat, you must provide the folder for the Media Driver that you want to inspect, for example if you configured the Media Driver context to be:
final MediaDriver.Context mediaDriverCtx = new MediaDriver.Context()
.aeronDirectoryName("/dev/shm/md");
then the path provided to AeronStat would be as follows:
java -cp aeron-all-*.jar -Daeron.dir=/dev/shm/md io.aeron.samples.AeronStat
Output (looking at a running Archive Replication Client)
17:03:52 - Aeron Stat (CnC v0.2.0), pid 2771, heartbeat age 451ms
======================================================================
0: 60,704 - Bytes sent
1: 122,848 - Bytes received
2: 0 - Failed offers to ReceiverProxy
3: 0 - Failed offers to SenderProxy
4: 0 - Failed offers to DriverConductorProxy
5: 0 - NAKs sent
6: 0 - NAKs received
7: 1,875 - Status Messages sent
8: 941 - Status Messages received
9: 1,865 - Heartbeats sent
10: 3,610 - Heartbeats received
11: 0 - Retransmits sent
12: 0 - Flow control under runs
13: 0 - Flow control over runs
14: 0 - Invalid packets
15: 0 - Errors
16: 0 - Short sends
17: 0 - Failed attempts to free log buffers
18: 0 - Sender flow control limits, i.e. back-pressure events
19: 0 - Unblocked Publications
20: 0 - Unblocked Control Commands
21: 0 - Possible TTL Asymmetry
22: 0 - ControllableIdleStrategy status
23: 0 - Loss gap fills
24: 0 - Client liveness timeouts
25: 0 - Resolution changes: driverName=null hostname=archive-client
26: 150,858,350 - Conductor max cycle time doing its work in ns: SHARED
27: 0 - Conductor work cycle exceeded threshold count: threshold=1000000000ns SHARED
28: 149,104,126 - Sender max cycle time doing its work in ns: SHARED
29: 0 - Sender work cycle exceeded threshold count: threshold=1000000000ns SHARED
30: 149,144,918 - Receiver max cycle time doing its work in ns: SHARED
31: 0 - Receiver work cycle exceeded threshold count: threshold=1000000000ns SHARED
32: 1,838,850 - NameResolver max time in ns
33: 0 - NameResolver exceeded threshold count
36: 1,692,637,432,558 - client-heartbeat: 1
52: 1 - rcv-channel: aeron:udp?term-length=65536|sparse=true|mtu=1408|endpoint=10.1.0.4:0 10.1.0.4:45494
53: 1 - rcv-local-sockaddr: 52 10.1.0.4:45494
54: 1 - snd-channel: aeron:udp?term-length=65536|sparse=true|mtu=1408|endpoint=archive-backup:17000 10.1.0.4:33378
55: 1 - snd-local-sockaddr: 54 10.1.0.4:33378
56: 448 - pub-pos (sampled): 15 -1436025328 10 aeron:udp?term-length=65536|sparse=true|mtu=1408|endpoint=archive-backup:17000
57: 33,216 - pub-lmt: 15 -1436025328 10 aeron:udp?term-length=65536|sparse=true|mtu=1408|endpoint=archive-backup:17000
58: 448 - snd-pos: 15 -1436025328 10 aeron:udp?term-length=65536|sparse=true|mtu=1408|endpoint=archive-backup:17000
59: 32,768 - snd-lmt: 15 -1436025328 10 aeron:udp?term-length=65536|sparse=true|mtu=1408|endpoint=archive-backup:17000
60: 0 - snd-bpe: 15 -1436025328 10 aeron:udp?term-length=65536|sparse=true|mtu=1408|endpoint=archive-backup:17000
61: 608 - sub-pos: 14 1817141198 20 aeron:udp?term-length=65536|sparse=true|mtu=1408|endpoint=10.1.0.4:0 @0
62: 608 - rcv-hwm: 17 1817141198 20 aeron:udp?term-length=65536|sparse=true|mtu=1408|endpoint=10.1.0.4:0
63: 608 - rcv-pos: 17 1817141198 20 aeron:udp?term-length=65536|sparse=true|mtu=1408|endpoint=10.1.0.4:0
64: 1 - rcv-channel: aeron:udp?endpoint=10.1.0.4:0 10.1.0.4:33933
65: 1 - rcv-local-sockaddr: 64 10.1.0.4:33933
66: 6,016 - sub-pos: 19 1817141199 200 aeron:udp?endpoint=10.1.0.4:0 @1280
67: 6,016 - rcv-hwm: 21 1817141199 200 aeron:udp?endpoint=10.1.0.4:0
68: 6,016 - rcv-pos: 21 1817141199 200 aeron:udp?endpoint=10.1.0.4:0
--
Core Counters¶
Row | Description |
---|---|
Top Line | Most important piece of data here is the heartbeat age - this is the elapsed time since the last Media Driver heartbeat in the cnc.dat . If this is a large number (over 1000ms), check if your media driver is still running |
0 | Total bytes sent by the current media driver over UDP, excluding IP headers. If this isn't increasing at the pace expected by your application, something is wrong. |
1 | Total bytes received by the current media driver over UDP, excluding IP headers. If this isn't increasing at the pace expected by your application, something is wrong. |
2 | Failed offers to the Media Driver's Receiver Proxy; this suggests back pressure |
3 | Failed offers to the Media Driver's Sender Proxy; this suggests back pressure |
4 | Failed offers to the Media Driver's Conductor Proxy; this suggests back pressure |
5 | Total number of NAKs sent. This is how many times this Media Driver had to send a NAK in order to request a missing packet |
6 | Total number of NAKs received. This is how many times this Media Driver received a NAK in order to replay a missing packet to a remote Media Driver |
7 | Status Messages sent. This is a running count of the number of status messages sent by this Media Driver for flow control. This should increase over time. |
8 | Status Messages received. This is a running count of the number of status messages received by this Media Driver for flow control. This should increase over time. |
9 | Heartbeats sent. This is how many heartbeats have been sent by this Media Driver to indicate liveness to another Media Driver when there is no data to send. This should increase over time. |
10 | Heartbeats received. This is how many heartbeats have been received by this Media Driver to indicate liveness from another Media Driver when there is no data to send. This should increase over time. |
11 | Retransmits sent. This is how many packet retransmits have been sent by this Media Driver as a result of a NAK message. This will typically stay zero or very low in a healthy network (and with well behaved processes). |
12 | Flow control under runs. This is the count of packets which under-run the current flow control window for Images |
13 | Flow control over runs. This is the count of packets which over-run the current flow control window for Images |
14 | Count of invalid packets received by this Media Driver |
15 | Count of errors observed by this Media Driver. ErrorStat (see below) will provide details. |
16 | Short send count. A short send happens when the Media Driver's Sender agent expects to send a given buffer of data over the network, but the socket did not take all the data from the buffer. Typically, Aeron will recover from this. When this increases beyond a low number, it can be a complex problem to solve with causes ranging from incorrect buffer sizing to network equipment failure. The first place to look is typically the settings for the network buffer sizes: aeron.socket.so_rcvbuf and aeron.socket.so_sndbuf . aeron.rcv.initial.window.length must be less than or equal to aeron.socket.so_rcvbuf . Correct sizing can be an art, and can be especially challenging in a network with a large RTT variance. See also Bandwidth Delay Product. Note: you may need to update maximum socket buffer sizes in your operating system. |
17 | The number of times the Media Driver could not free a log buffer |
18 | Total number of back-pressure events over all streams. See also Back pressure |
19 | Count of times a publication has been unblocked after a client failed to commit() or abort() a tryClaim within timeout (see Publication TryClaim and Log Buffer Unblocking) |
20 | Count of times a command has been unblocked after a client failed to complete an offer within a timeout |
21 | The number of times a channel endpoint detected a possible TTL asymmetry between its config and a connection |
23 | This is the number of times a loss gap has been filled when NAKs have been disabled |
24 | The number of Aeron clients that have timed out without a graceful close (as in Aeron clients of this Media Driver) |
25 | The number of times the endpoints have been re-resolved (i.e. name resolution) resulting in a change |
26 | The maximum time taken in a conductor duty cycle in nanoseconds. Found in Aeron 1.33.0+ |
27 | The number of times the time spent in a conductor duty cycle exceeded a configurable threshold (1s default). Found in Aeron 1.33.0+ |
28 | The maximum time taken in a sender duty cycle in nanoseconds. |
29 | The number of times the time spent in a sender duty cycle exceeded a configurable threshold (1s default). |
30 | The maximum time taken in a receiver duty cycle in nanoseconds. |
31 | The number of times the time spent in a receiver duty cycle exceeded a configurable threshold (1s default). |
32 | The maximum time taken for Name Resolution in nanoseconds. Found in Aeron 1.42.0+ |
33 | The number of times the time spent in Name Resolution exceeded a configurable threshold. Found in Aeron 1.42.0+ |
Variable Counters¶
Row | Description |
---|---|
36 in above example; varies | Epoch millisecond value of the last client heartbeat from the given client. The client in this context is the Aeron Client on the Media Driver. |
52 in above example; varies | Receive channel |
53 in above example; varies | Receive socket address |
54 in above example; varies | Send channel |
55 in above example; varies | Send socket address |
Lines 31 to 45 include position values. For more info on how to understand these, see Understanding Aeron Position. The lines with @
, such as sub-pos
on line 32 refer to the join position for the subscription - in this example, the subscription joined at position 0
.
Note: There is a C version of the Aeron Stat tool. It's compiled and built with the C Media Driver. See C Media Driver.
AeronStat options¶
Arg | Description |
---|---|
-h | Shows the help text |
watch=true or false | If set to true, refreshes every n seconds. If set to false, runs once and exits. Defaults to true. |
delay=seconds | Specifies how often to refresh the output. Delay in seconds between update. Valid only if watch=true (or watch not specified) |
stream={regex} | Filters streams to only those that match the regex. Example: stream=101 |
type={regex} | Filters output type (as in the counter type) to only those that match |
session={regex} | Filters sessions to only those that match |
channel={regex} | Filters channels to only those that match |
identity={regex} | Filters identity to only those that match |
Error Stat¶
Error Stat prints all of the errors raised within the Aeron process. As with AeronStat, you must point ErrorStat at a Media Driver folder.
java -cp aeron-all-*.jar -Daeron.dir=/dev/shm/md io.aeron.samples.ErrorStat
When all is running as expected, error stat will produce output such as:
0 distinct errors observed.
Note: There is a C version of the Error Stat tool. It's compiled and built with the C Media Driver. See C Media Driver.
Stream Stat¶
Stream Stat is found within the Aeron samples folder, and can be launched from the aeron-all jar as below. As with AeronStat, you must point StreamStat at a Media Driver folder.
java -cp aeron-all-*.jar -Daeron.dir=/dev/shm/md io.aeron.samples.StreamStat
Stream stat provides a view on each stream within the media driver, including the publisher and sender view. The view is much like aeron stat, except the view is flattend. To allow it to be shown on a page, the single line 2 is broken into lines 2-10 below.
Command `n Control file /dev/shm/md/cnc.dat
sessionId=-1245628686 streamId=10
channel=aeron:udp?endpoint=localhost:40123 :
pub-pos (sampled):3:320
pub-lmt:3:8388992
snd-pos:3:384
snd-lmt:3:131456
sub-pos:1:384
rcv-hwm:4:384
rcv-pos:4:384
Backlog Stat¶
Backlog Stat is a tool to highlight backlogs in data flows. It functions on both IPC and UDP channels. As with AeronStat, you must point BacklogStat at a Media Driver folder.
java -cp aeron-all-*.jar -Daeron.dir=/dev/shm/md io.aeron.samples.BacklogStat
Sample output:
sessionId=1155221173 streamId=8 channel=aeron:udp?endpoint=10.1.1.1:4000 :
┌─for publisher 77 the last sampled position is 187392 (~0 bytes before back-pressure)
└─sender 77 has to send 0 bytes (2031779 butes remaining in the sender window)
sessionId=-614368527 streamId=9 channel=aeron:udp?endpoint=10.1.1.1:4001 :
┌─for publisher 6333 the last sampled position is 12739208 (~0 bytes before back-pressure)
└─sender 6333 has to send 65373 bytes (2031779 butes remaining in the sender window)
This tool can highlight problems with backlogs of data in given channels. In the example run above, the top session has no backlog, and the lower session has an outstanding backlog of 65373 bytes. Use this information to investigate network, process and/or design problems.
Loss Stat¶
LossStat logs all the data loss events suffered by Aeron. Note that IPC data is not subject to data loss, and will not appear in LossStat. As with AeronStat, you must point LossStat at a Media Driver folder.
java -cp aeron-all-*.jar -Daeron.dir=/dev/shm/md io.aeron.samples.LossStat
An example run:
#OBSERVATION_COUNT,TOTAL_BYTES_LOST,FIRST_OBSERVATION,LAST_OBSERVATION,SESSION_ID,STREAM_ID,CHANNEL,SOURCE
688,4167028,2020-08-16 13:53:39.053+0000,2020-08-16 13:53:41.003+0000,1155221173,8,aeron:udp?endpoint=10.1.1.1:4000;10.1.1.2:60950
This tells us the following about the session 1155221173
on channel aeron:udp?endpoint=10.1.1.1:4000
on stream 8 ⤌⤍ 10.1.1.2:60950
flow:
- there were 688 data loss events
- a total of 4,167,028 bytes were impacted
- the loss first happened at 2020-08-16 16:53:39.053+0000
- the last loss happened at 2020-08-16 16:53:41.003+0000
Armed with this information, you're able to investigate any network or host problems around those times. Note that little amounts of loss are fairly common.
Note: There is a C version of the Loss Stat tool. It's compiled and built with the C Media Driver. See C Media Driver.
Log Inspector¶
Log Inspector is found within the Aeron samples folder, and can be launched from the aeron-all jar as below. You must point the Log Inspector at a LogBuffer file directly.
java -cp aeron-all-*.jar io.aeron.samples.LogInspector <logbuffer file>
Log Inspector allows us to inspect Log Buffer files, including:
- if the log buffer is connected
- how many terms the log buffer has been through (see Log Buffers & Images)
- the state of the 3 terms in the log buffer
- and the data within a term, dumped as hex. This includes details on which session and stream produced the data.
======================================================================
Thu Dec 31 09:46:19 EST 2020 Inspection dump for 3.logbuffer
======================================================================
Is Connected: true
Initial term id: -1822262504
Term Count: 20
Active index: 2
Term length: 67108864
MTU length: 1408
Page Size: 4096
EOS Position: 9223372036854775807
default DATA Header{frame-length=0 version=0 flags=11000000 type=1 term-offset=0 session-id=301746870 stream-id=10 term-id=-1822262504 reserved-value=0}
Index 0 Term Meta Data termOffset=67108928 termId=-1822262486 rawTail=-7826557782030548928 position=1275068416
Index 1 Term Meta Data termOffset=67108928 termId=-1822262485 rawTail=-7826557777735581632 position=1342177280
Index 2 Term Meta Data termOffset=1822720 termId=-1822262484 rawTail=-7826557773505900544 position=1344000000
======================================================================
Index 0 Term Data
0: DATA Header{frame-length=0 version=0 flags=00000000 type=0 term-offset=0 session-id=0 stream-id=0 term-id=0 reserved-value=0}
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
======================================================================
Index 1 Term Data
0: DATA Header{frame-length=0 version=0 flags=00000000 type=0 term-offset=0 session-id=0 stream-id=0 term-id=0 reserved-value=0}
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
======================================================================
Index 2 Term Data
0: DATA Header{frame-length=36 version=0 flags=11000000 type=1 term-offset=0 session-id=301746870 stream-id=10 term-id=-1822262484 reserved-value=0}
02004001
64: DATA Header{frame-length=36 version=0 flags=11000000 type=1 term-offset=64 session-id=301746870 stream-id=10 term-id=-1822262484 reserved-value=0}
03004001
...