Interpreting Prometheus metrics for Linux disk I/O utilization

Raw diskstats

On a Linux system, node_exporter reads disk metrics from /proc/diskstats. The format of this file is described in the kernel documentation (Documentation/admin-guide/iostats.rst); the first three columns identify the device (major number, minor number and device name), and the subsequent fields contain the data.

node_disk_reads_completed_total (field 1)
This is the total number of reads completed successfully.
node_disk_reads_merged_total (field 2)
node_disk_writes_merged_total (field 6)
node_disk_discards_merged_total (field 13)
Reads and writes which are adjacent to each other may be merged
for efficiency. Thus two 4K reads may become one 8K read before
it is ultimately handed to the disk, and so it will be counted
(and queued) as only one I/O. This field lets you know how
often this was done.
node_disk_read_bytes_total (field 3)
This is the total number of bytes read successfully.
node_disk_read_time_seconds_total (field 4)
This is the total number of seconds spent by all reads (as
measured from __make_request() to end_that_request_last()).
node_disk_writes_completed_total (field 5)
This is the total number of writes completed successfully.
node_disk_written_bytes_total (field 7)
This is the total number of bytes written successfully.
node_disk_write_time_seconds_total (field 8)
This is the total number of seconds spent by all writes (as
measured from __make_request() to end_that_request_last()).
node_disk_io_now (field 9)
The number of I/Os currently in progress. Unlike the other
metrics listed here, this is a gauge rather than a counter: it is
incremented as requests are given to the appropriate struct
request_queue and decremented as they finish, and it is the only
field that should ever return to zero.
node_disk_io_time_seconds_total (field 10)
Number of seconds spent doing I/Os.
This field increases so long as field 9 is nonzero.
node_disk_io_time_weighted_seconds_total (field 11)
Weighted # of seconds spent doing I/Os.
This field is incremented at each I/O start, I/O completion, I/O
merge, or read of these stats by the number of I/Os in progress
(field 9) times the number of seconds spent doing I/O since the
last update of this field. This can provide an easy measure of
both I/O completion time and the backlog that may be
accumulating.
node_disk_discards_completed_total (field 12)
This is the total number of discards completed successfully.
node_disk_discarded_sectors_total (field 14)
This is the total number of sectors discarded successfully.
node_disk_discard_time_seconds_total (field 15)
This is the total number of seconds spent by all discards (as
measured from __make_request() to end_that_request_last()).
node_disk_flush_requests_total (field 16)
The total number of flush requests completed successfully.
node_disk_flush_requests_time_seconds_total (field 17)
The total number of seconds spent by all flush requests.
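In Prometheus, all of the above arrive as per-device time series distinguished by a device label, and (with the exception of node_disk_io_now) they are monotonically increasing counters that reset when the machine reboots. To sanity-check what node_exporter is exposing for a particular disk, you can query the raw series directly; sda below is just an illustrative device name:

    node_disk_reads_completed_total{device="sda"}
    node_disk_written_bytes_total{device="sda"}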

Analyzing the data: iostat

We’ve got raw data in the form of counters and accumulated durations. How do we interpret it in terms of what happened between one sample and the next?
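In PromQL the basic tool for this is the rate() function, which turns a counter into a per-second average over a chosen time window; this is essentially the same calculation iostat performs between its samples. A minimal example (the 5-minute window is just a common choice, not something the metrics require):

    rate(node_disk_reads_completed_total[5m])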

Sample iostat -x output reports the following columns for each device; the descriptions below come from the iostat(1) manual page.
  • r/s: The number (after merges) of read requests completed per second for the device.
  • w/s: The number (after merges) of write requests completed per second for the device.
  • rkB/s: The number of kilobytes read from the device per second.
  • wkB/s: The number of kilobytes written to the device per second.
  • rrqm/s: The number of read requests merged per second that were queued to the device.
  • wrqm/s: The number of write requests merged per second that were queued to the device.
  • %rrqm: The percentage of read requests merged together before being sent to the device.
  • %wrqm: The percentage of write requests merged together before being sent to the device.
  • r_await: The average time (in milliseconds) for read requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
  • w_await: The average time (in milliseconds) for write requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
  • aqu-sz: The average queue length of the requests that were issued to the device.
  • rareq-sz: The average size (in kilobytes) of the read requests that were issued to the device.
  • wareq-sz: The average size (in kilobytes) of the write requests that were issued to the device.
  • svctm: The average service time for I/O requests that were issued to the device. Unreliable, and has been removed from current versions of iostat.
  • %util: Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.

Each of these columns can be reconstructed from node_exporter's metrics. In the PromQL expressions below, [*] is a placeholder for whichever range you choose (for example [5m]):
  • r/s: The rate of increase of field 1.
    rate(node_disk_reads_completed_total[*])
    Unit: rate (operations per second)
  • w/s: The rate of increase of field 5:
    rate(node_disk_writes_completed_total[*])
    Unit: rate (operations per second)
  • rkB/s:
    rate(node_disk_read_bytes_total[*])
    Unit: bytes per second
  • wkB/s:
    rate(node_disk_written_bytes_total[*])
    Unit: bytes per second
  • rrqm/s:
    rate(node_disk_reads_merged_total[*])
    Unit: rate (operations per second)
  • wrqm/s:
    rate(node_disk_writes_merged_total[*])
    Unit: rate (operations per second)
  • %rrqm:
    rate(node_disk_reads_merged_total[*]) / (rate(node_disk_reads_merged_total[*]) + rate(node_disk_reads_completed_total[*]))
    Unit: dimensionless (fraction 0–1)
  • %wrqm:
    rate(node_disk_writes_merged_total[*]) / (rate(node_disk_writes_merged_total[*]) + rate(node_disk_writes_completed_total[*]))
    Unit: dimensionless (fraction 0–1)
  • r_await:
    rate(node_disk_read_time_seconds_total[*]) / rate(node_disk_reads_completed_total[*])
    Unit: seconds
  • w_await:
    rate(node_disk_write_time_seconds_total[*]) / rate(node_disk_writes_completed_total[*])
    Unit: seconds
  • aqu-sz:
    rate(node_disk_io_time_weighted_seconds_total[*])
    Unit: dimensionless (number of queued operations)
  • rareq-sz:
    rate(node_disk_read_bytes_total[*]) / rate(node_disk_reads_completed_total[*])
    Unit: bytes
  • wareq-sz:
    rate(node_disk_written_bytes_total[*]) / rate(node_disk_writes_completed_total[*])
    Unit: bytes
  • %util:
    rate(node_disk_io_time_seconds_total[*])
    Unit: dimensionless (fraction 0–1)
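To make this concrete, here is what two of these look like once the [*] placeholder is filled in and the result is narrowed to a single device; the 5-minute window and the sda device name are illustrative choices, not requirements:

    # r_await: average time per completed read, in seconds
    rate(node_disk_read_time_seconds_total{device="sda"}[5m])
      / rate(node_disk_reads_completed_total{device="sda"}[5m])

    # rkB/s equivalent, but in bytes per second
    rate(node_disk_read_bytes_total{device="sda"}[5m])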

Interpreting the results

Apart from the obvious throughput figures — the number of operations per second and the number of bytes written or read per second — a few metrics are particularly important.

aqu-sz: the average queue size

This indicates the average number of operations waiting to be serviced. Note that some devices, like SSDs, need to have multiple outstanding requests in order to achieve maximum throughput: for example, an SSD with 8 internal controller channels will only achieve full throughput when there are at least 8 outstanding concurrent requests. The same applies to some degree to hard drives, which are able to optimise their head movements across the platter if there are several outstanding requests.
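As a sketch of how you might watch this in practice, the derived average can be graphed alongside the instantaneous in-flight gauge (field 9); sda is again only a placeholder device name:

    # Average queue size over the window (aqu-sz)
    rate(node_disk_io_time_weighted_seconds_total{device="sda"}[5m])

    # Number of I/Os in progress right now (field 9), for comparison
    node_disk_io_now{device="sda"}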

r_await / w_await: service times

These are the average times taken to service each read request and write request respectively. If these values grow large, applications will see increased I/O latency, because their requests spend time queued behind others before being serviced.
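If a single latency figure is more convenient than separate read and write numbers, one possible variant (a sketch built from the same counters used above) is to combine the two:

    # Average time per completed operation, reads and writes together, in seconds
    ( rate(node_disk_read_time_seconds_total[5m]) + rate(node_disk_write_time_seconds_total[5m]) )
      /
    ( rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m]) )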

rareq-sz and wareq-sz: average request sizes

These are useful for understanding I/O patterns. Applications may issue a mix of small transfers (e.g. 4KB) and large ones (e.g. 512KB); the average gives you an idea of which dominates. Larger transfers are generally more efficient, especially on spinning hard drives.
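Note that the PromQL versions above return bytes rather than iostat's kilobytes; dividing by 1024 (or 1000, depending on which convention you prefer) lines the numbers up, e.g. for reads:

    # Average read request size, in KiB
    rate(node_disk_read_bytes_total[5m]) / rate(node_disk_reads_completed_total[5m]) / 1024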

%util: utilization

If this value is below 100%, the device is completely idle for some of the time, so there is definitely spare capacity. The converse does not hold: as noted above, 100% does not necessarily mean that a device which serves requests in parallel is saturated.
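Also remember that the PromQL expression given earlier yields a fraction between 0 and 1; multiplying by 100 puts it on the same scale as iostat's %util column. A small sketch, with sda again standing in for a real device (clamp_max is optional, but keeps rate()'s extrapolation from pushing the value fractionally above 100):

    # Utilization as a percentage, capped at 100
    clamp_max(100 * rate(node_disk_io_time_seconds_total{device="sda"}[5m]), 100)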

Common errors

There are a number of ready-made Grafana dashboards for Prometheus, but many get these stats wrong — either by using the wrong query, or more commonly by mislabelling the graphs or displaying the wrong units.

Taking the queries such dashboards typically plot, and mapping them back to the definitions above:

rate(node_disk_io_time_weighted_seconds_total{instance="$node",job="$job"}[5m])

This is the average queue size (aqu-sz): a dimensionless count of in-flight requests, not a utilization percentage.

rate(node_disk_io_time_seconds_total{instance="$node",job="$job"}[5m])

This is utilization, expressed as a fraction between 0 and 1 rather than between 0 and 100.

rate(node_disk_read_time_seconds_total{...}[5m])
rate(node_disk_write_time_seconds_total{...}[5m])

These are the seconds spent on reads and writes per second of elapsed time; on their own they are not per-request latency figures.

rate(node_disk_read_time_seconds_total{...}[5m]) / rate(node_disk_reads_completed_total{...}[5m])
rate(node_disk_write_time_seconds_total{...}[5m]) / rate(node_disk_writes_completed_total{...}[5m])

These are the per-operation averages r_await and w_await, measured in seconds rather than the milliseconds iostat reports.

Conclusion

Hopefully this article has helped you understand node_exporter's disk metrics, how they relate to the raw kernel data from /proc/diskstats, how to build useful Prometheus queries with them, and how to interpret the results.
