Interpreting Prometheus metrics for Linux disk I/O utilization

Brian Candler
Jan 28, 2021

Prometheus is a metrics collection system, and its node_exporter exposes a rich range of system metrics.

In this article I’m going to break down the individual metrics for disk I/O. They provide critical information about how your disks are performing, how busy they are and the I/O latency that your applications are experiencing.

There are a number of Grafana dashboards for node_exporter, but not all of them label the stats correctly. Hence it’s well worth understanding exactly what you’re looking at.

Raw diskstats

On a Linux system, node_exporter reads disk metrics from /proc/diskstats. The format of this file is given in the kernel documentation; the first three columns identify the device, and the subsequent fields contain the data.
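For example, a single line for a device sda might look like this (the numbers here are invented purely for illustration); the first three columns are the major number, minor number and device name, followed by the 17 statistics fields on a current kernel:

   8       0 sda 120000 3000 9600000 45000 80000 7000 6400000 60000 0 30000 105000 0 0 0 0 500 200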

Each Prometheus metric corresponds directly to one of the fields in that file, as can be seen in the source code for NewDiskstatsCollector. So the first step in understanding the Prometheus metrics is to match them up with their kernel source data.

To do this, I’ve taken the kernel documentation and added the corresponding prometheus metric name to each field.

node_disk_reads_completed_total (field 1)
This is the total number of reads completed successfully.
node_disk_reads_merged_total (field 2)
node_disk_writes_merged_total (field 6)
node_disk_discards_merged_total (field 13)
Reads and writes which are adjacent to each other may be merged
for efficiency. Thus two 4K reads may become one 8K read before
it is ultimately handed to the disk, and so it will be counted
(and queued) as only one I/O. This field lets you know how
often this was done.
node_disk_read_bytes_total (field 3)
This is the total number of bytes read successfully.
node_disk_read_time_seconds_total (field 4)
This is the total number of seconds spent by all reads (as
measured from __make_request() to end_that_request_last()).
node_disk_writes_completed_total (field 5)
This is the total number of writes completed successfully.
node_disk_written_bytes_total (field 7)
This is the total number of bytes written successfully.
node_disk_write_time_seconds_total (field 8)
This is the total number of seconds spent by all writes (as
measured from __make_request() to end_that_request_last()).
node_disk_io_now (field 9)
The only field that should go to zero. Incremented as requests
are given to appropriate struct request_queue and decremented as
they finish.
node_disk_io_time_seconds_total (field 10)
Number of seconds spent doing I/Os.
This field increases so long as field 9 is nonzero.
node_disk_io_time_weighted_seconds_total (field 11)
Weighted # of seconds spent doing I/Os.
This field is incremented at each I/O start, I/O completion, I/O
merge, or read of these stats by the number of I/Os in progress
(field 9) times the number of seconds spent doing I/O since the
last update of this field. This can provide an easy measure of
both I/O completion time and the backlog that may be
accumulating.
node_disk_discards_completed_total (field 12)
This is the total number of discards completed successfully.
node_disk_discarded_sectors_total (field 14)
This is the total number of sectors discarded successfully.
node_disk_discard_time_seconds_total (field 15)
This is the total number of seconds spent by all discards (as
measured from __make_request() to end_that_request_last()).
node_disk_flush_requests_total (field 16)
The total number of flush requests completed successfully.
node_disk_flush_requests_time_seconds_total (field 17)
The total number of seconds spent by all flush requests.

node_exporter exposes the raw data returned by the kernel, apart from a couple of scaling factors. (Where the kernel returns a number of sectors, node_exporter multiplies it by the sector size to get a number of bytes; and where the kernel reports a time in milliseconds, node_exporter multiplies it by 0.001 to get seconds. I’ve adjusted the descriptions above to match.)
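To make the scaling concrete, take the invented sda line shown earlier: field 3 is 9,600,000 sectors read and field 4 is 45,000 ms of read time. Since /proc/diskstats counts sectors in 512-byte units, node_exporter would expose roughly:

node_disk_read_bytes_total{device="sda"} 4.9152e+09
node_disk_read_time_seconds_total{device="sda"} 45

because 9,600,000 × 512 = 4,915,200,000 bytes and 45,000 ms × 0.001 = 45 seconds.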

Note: older kernels don’t return all the fields shown. Some systems may only return data for fields 1–11; others only fields 1–15.

Analyzing the data: iostat

We’ve got raw data, in the form of counters and accumulated durations. How do we interpret it, in terms of what’s happening between each point in time that the data is sampled?

To answer that, I’m first going to look at a core Linux utility, iostat (in Ubuntu it’s part of the sysstat package). This interprets the same underlying diskstats, and it’s enlightening to see how it does so.

The typical way to run iostat is like this: iostat 5 -x

This means: sample stats every 5 seconds, and display the results in extended form. Here is some example output:

[Image: sample iostat -x output]

The first set of stats you’ll see is the average over the entire time since the system has booted (which may be over months or years). After that, you get data covering each 5 second period.

Looking at the columns one by one, here are their descriptions from the iostat(1) manpage:

  • r/s: The number (after merges) of read requests completed per second for the device.
  • w/s: The number (after merges) of write requests completed per second for the device.
  • rkB/s: The number of kilobytes¹ read from the device per second.
  • wkB/s: The number of kilobytes¹ written to the device per second.
  • rrqm/s: The number of read requests merged per second that were queued to the device.
  • wrqm/s: The number of write requests merged per second that were queued to the device.
  • %rrqm: The percentage of read requests merged together before being sent to the device.
  • %wrqm: The percentage of write requests merged together before being sent to the device.
  • r_await: The average time (in milliseconds) for read requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
  • w_await: The average time (in milliseconds) for write requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
  • aqu-sz: The average queue length of the requests that were issued to the device.
  • rareq-sz: The average size (in kilobytes) of the read requests that were issued to the device.
  • wareq-sz: The average size (in kilobytes) of the write requests that were issued to the device.
  • svctm: The average service time for I/O requests that were issued to the device. Unreliable, and has been removed from current versions of iostat.
  • %util: Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.

How are these derived from /proc/diskstats, and hence what are the corresponding Prometheus queries? The answers are in the source code, and I’ve done the translations here.

  • r/s: The rate of increase of field 1.
    rate(node_disk_reads_completed_total[*])
    Unit: rate (operations per second)
  • w/s: The rate of increase of field 5:
    rate(node_disk_writes_completed_total[*])
    Unit: rate (operations per second)
  • rkB/s:
    rate(node_disk_read_bytes_total[*])
    Unit: bytes per second
  • wkB/s:
    rate(node_disk_written_bytes_total[*])
    Unit: bytes per second
  • rrqm/s:
    rate(node_disk_reads_merged_total[*])
    Unit: rate (operations per second)
  • wrqm/s:
    rate(node_disk_writes_merged_total[*])
    Unit: rate (operations per second)
  • %rrqm²:
    rate(node_disk_reads_merged_total[*]) / (rate(node_disk_reads_merged_total[*]) + rate(node_disk_reads_completed_total[*]))
    Unit: dimensionless (fraction 0–1)
  • %wrqm²:
    rate(node_disk_writes_merged_total[*]) / (rate(node_disk_writes_merged_total[*]) + rate(node_disk_writes_completed_total[*]))
    Unit: dimensionless (fraction 0–1)
  • r_await²:
    rate(node_disk_read_time_seconds_total[*]) / rate(node_disk_reads_completed_total[*])
    Unit: seconds
  • w_await²:
    rate(node_disk_write_time_seconds_total[*]) / rate(node_disk_writes_completed_total[*])
    Unit: seconds
  • aqu-sz:
    rate(node_disk_io_time_weighted_seconds_total[*])
    Unit: dimensionless (number of queued operations)
  • rareq-sz²:
    rate(node_disk_read_bytes_total[*]) / rate(node_disk_reads_completed_total[*])
    Unit: bytes
  • wareq-sz²:
    rate(node_disk_written_bytes_total[*]) / rate(node_disk_writes_completed_total[*])
    Unit: bytes
  • %util:
    rate(node_disk_io_time_seconds_total[*])
    Unit: dimensionless (fraction 0–1)

I’ve used [*] to duck the question of what time range to use in these queries. That may be the subject of another article, but for now, choose a range which is at least double your sampling interval. That is, if you’re scraping at 1 minute intervals, it must be at least [2m] to be able to calculate rates successfully. In Grafana 7.2 or later, use [$__rate_interval], and make sure you set the correct scrape interval in your Grafana data source.
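As a concrete sketch, here is how the r_await query might be written in a Grafana panel, using the same $node and $job variables that appear in the dashboard queries later in this article:

  rate(node_disk_read_time_seconds_total{instance="$node",job="$job"}[$__rate_interval])
/ rate(node_disk_reads_completed_total{instance="$node",job="$job"}[$__rate_interval])

with the panel unit set to seconds.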

Interpreting the results

Apart from the obvious throughput figures — the number of operations per second and the number of bytes written or read per second — a few metrics are particularly important.

aqu-sz: the average queue size

This indicates the average number of operations waiting to be serviced. Note that some devices, like SSDs, need to have multiple outstanding requests in order to achieve maximum throughput: for example, an SSD with 8 internal controller channels will only achieve full throughput when there are at least 8 outstanding concurrent requests. The same applies to some degree to hard drives, which are able to optimise their head movements across the platter if there are several outstanding requests.

This value is calculated accurately by taking the rate of increase of node_disk_io_time_weighted_seconds_total.

This is distinct from node_disk_io_now, which gives the instantaneous queue depth. Let’s say you are scraping node_exporter every 5 seconds; then node_disk_io_now tells you only how many items were in the queue at the sampling instant. This value can vary massively from millisecond to millisecond, so the sampled value can be very noisy. That is the reason for the node_disk_io_time_weighted_seconds_total metric: the kernel multiplies the queue depth by the amount of time the queue spent at that depth and accumulates the result, so the rate of increase of this metric gives you the average queue depth over the period in question.
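As a worked example with invented numbers: suppose that during one 5-second scrape interval the queue holds 4 requests for 2 seconds and then 1 request for the remaining 3 seconds. The weighted counter advances by 4×2 + 1×3 = 11 “queue-seconds”, so the rate over that window is 11 / 5 = 2.2, i.e. an average queue depth of 2.2, even though a single sample of node_disk_io_now could have caught either 4 or 1.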

r_await / w_await: service times

These are the average times taken to service each read request and write request respectively. If these values become high, then the application will be suffering I/O latency, due to being held in line waiting for other requests.

rareq-sz and wareq-sz: average request sizes

These are useful for understanding I/O patterns. Applications may make a mix of small transfers (e.g. 4KB) or large ones (e.g. 512KB); the average gives you an idea of which dominates. Larger transfers are more efficient, especially for spinning hard drives.

%util: utilization

If this value is below 100%, then the device is completely idle for some of the time. It means there’s definitely spare capacity.

However if this is at 100%, it doesn’t necessarily mean that the device is saturated. As explained before, some devices require multiple outstanding I/O requests to deliver their full throughput. So a device which can handle (say) 8 concurrent requests could still show 100% utilisation when only 1 or 2 concurrent requests are being made to it at any time. In that case, it still has plenty more to give. You’ll need to look at the throughput (kB/s), the number of operations per second, the queue depth and the service times to gauge how heavily it is being used.
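As an invented illustration: an SSD that can handle 8 concurrent requests, showing 100% utilisation with aqu-sz around 1 and await times of a few milliseconds, is merely busy; the same device at 100% utilisation with aqu-sz around 16 and climbing await times is genuinely backlogged.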

Common errors

There are a number of ready-made Grafana dashboards for Prometheus, but many get these stats wrong — either by using the wrong query, or more commonly by mislabelling the graphs or displaying the wrong units.

As an example, I’m going to take the otherwise excellent dashboard 1860, “Node Exporter Full”. I use it all the time, and strongly recommend it.

At the time of writing, one of the panels is labelled “Disk I/Os Weighted”, and shows units in milliseconds:

The underlying query is:

rate(node_disk_io_time_weighted_seconds_total{instance="$node",job="$job"}[5m])

You may recognise this as the “aqu-sz” query. What’s the problem with this?

Firstly, the Y-axis is labelled as “time” and has time units. This is wrong.

The source metric has units of (queue depth * seconds), so its dimension is “seconds”. But then you’re taking a rate of increase in this metric. The units are therefore “seconds per second”, which is dimensionless. The value represents the average queue depth over the sampling interval, which is just a number.

Secondly, the graph is labelled “Disk IOs Weighted”. Now, the underlying metric is not a number of IOs, it’s IO time (weighted by queue depth). Should it say “Disk IO Time Weighted” instead? Not really, because by the time you’ve taken a rate of this, you’ve just got the queue depth. So a better label would simply be “Average queue depth” (or size).

Once these corrections are made, the graph makes much more sense: it shows the average number of outstanding I/O operations over the time interval covered by each data point.

Similarly, there is a graph labelled “Time spent doing I/Os” — again with Y axis labelled in seconds.

The value displayed ranges between 0s and 1s. Its query is:

rate(node_disk_io_time_seconds_total{instance="$node",job="$job"}[5m])

In fact, this is the query for “%util” shown before. Because this is a rate, the value displayed is really the number of seconds every second for which I/O is active. So it would be better displayed as a fraction or percentage, not as a time. Once again, a simple change to the axis legend and units makes it much clearer.
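Alternatively, if you prefer to scale the value explicitly rather than rely on a percentage unit in Grafana, a simple variant of the same query is:

rate(node_disk_io_time_seconds_total{instance="$node",job="$job"}[5m]) * 100

which displays %util as a number between 0 and 100.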

Another graph, labelled “Disk R/W Time”, shows these metrics:

rate(node_disk_read_time_seconds_total{...}[5m])
rate(node_disk_write_time_seconds_total{...}[5m])

Unfortunately, those graphs aren’t very meaningful. Despite being labelled as seconds, again they are actually dimensionless (seconds per second). If your disks are doing 100 read operations per second, and each one takes 20ms, then the value displayed is “2 seconds”, but that just means an average of 2 reads are taking place concurrently.

What would be much more useful is to use the queries for r_await and w_await, which are:

rate(node_disk_read_time_seconds_total{...}[5m]) / rate(node_disk_reads_completed_total{...}[5m])
rate(node_disk_write_time_seconds_total{...}[5m]) / rate(node_disk_writes_completed_total{...}[5m])

The dimension of these queries is “seconds”, and they give the average service time for read and write requests respectively. In this example the graph would show “20ms”, which is much more interesting.

UPDATE: the dashboard author has graciously taken these updates on board. Thank you Ricardo!

Conclusion

Hopefully this article has helped you understand node_exporter's disk metrics, how they relate to the raw kernel data from /proc/diskstats, how to build useful Prometheus queries with them, and how to interpret the results.

[1] To be strictly accurate, this is in kibibytes (1024 bytes). Disk transfers are in whole sectors, which are 512 bytes or 4096 bytes, so this is the conventional unit for disk transfer speeds. However to be awkward, disk storage capacities are generally in power-of-ten units (e.g. 1 megabyte = 1,000,000 bytes), as are network transfer speeds (e.g. 1 megabit per second = 1,000,000 bits per second).
node_exporter happily ducks this question by converting to bytes. How you scale your graphs is up to you; Grafana supports both options, calling them “SI” for powers of ten and “IEC” for powers of two.

[2] In iostat there is logic to force the answer to zero when the denominator is zero, i.e. the device is idle. With the prometheus queries shown, you’ll end up with NaN or infinity.
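If you want the iostat behaviour in Prometheus, one workaround (a sketch, not the only option) is to clamp the denominator away from zero, for example for r_await:

rate(node_disk_read_time_seconds_total[*]) / clamp_min(rate(node_disk_reads_completed_total[*]), 1e-9)

When the device is idle, both rates are zero, so this evaluates to 0 instead of NaN.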
