Check disk-smart

Overview

Scans all HDDs and SSDs in a single run. There is no need to provide warning or critical thresholds, to maintain disk or property databases, or to install additional libraries.

The check scans for devices and first attempts to open each one. If that succeeds, all information for the device is parsed.

The check calls smartctl, which controls the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into most ATA/SATA and SCSI/SAS hard drives and solid-state drives. The purpose of SMART is to monitor the reliability of the drive and predict drive failures. (From the smartctl man page.)
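The device discovery described above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: it assumes discovery via `smartctl --scan`, and the function names and the sample output are hypothetical.

```python
import subprocess


def parse_scan_output(text):
    """Return the device paths listed in `smartctl --scan` output."""
    devices = []
    for line in text.splitlines():
        # Drop the trailing "# ..." comment, then take the first token.
        fields = line.split('#', 1)[0].split()
        if fields and fields[0].startswith('/dev/'):
            devices.append(fields[0])
    return devices


def scan_devices():
    """Run `smartctl --scan` (usually needs root) and return device paths.

    Assumption: smartctl is installed and in PATH. Not called here.
    """
    result = subprocess.run(
        ['smartctl', '--scan'],
        capture_output=True, text=True, check=False,
    )
    return parse_scan_output(result.stdout)


# Illustrative sample, not captured from a real system.
SAMPLE = """\
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
"""

print(parse_scan_output(SAMPLE))
```

Keeping the parsing separate from the subprocess call makes the logic testable on machines without smartctl.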

Hints:

Needs sudo.
Running this check only makes sense on hardware using ATA/SATA and/or SCSI/SAS HDDs and SSDs.
The check tries to identify all disks automatically. Disks without SMART capability can be excluded manually using the --ignore parameter.
Keep in mind that a smartctl run can take up to one or two seconds per disk, depending on its health and (interface/bus) speed.
Don’t forget to run /usr/sbin/update-smart-drivedb from time to time to get the newest drive database (sometimes there are improvements on how to interpret some attributes).
Use --full to also get a warning for notices.
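To illustrate the --ignore hint, here is a minimal sketch of applying a comma-separated ignore list to discovered devices. The helper names are hypothetical, and matching on the device's base name (for example `sda` for `/dev/sda`) is an assumption based on the documented 'sda,sdb' format, not a statement about the plugin's internals.

```python
import os


def parse_ignore(value):
    """Split a comma-separated list such as 'sda,sdb' into names."""
    return [item.strip() for item in value.split(',') if item.strip()]


def filter_devices(devices, ignore):
    """Drop every device whose base name appears in the ignore list."""
    ignored = set(ignore)
    return [dev for dev in devices if os.path.basename(dev) not in ignored]


devices = ['/dev/sda', '/dev/sdd', '/dev/mmcblk0']
remaining = filter_devices(devices, parse_ignore('sdd,sdbx,mmcblk0'))
print(remaining)
```

Names in the ignore list that match no device, such as `sdbx` here, are simply skipped.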

Fact Sheet

Fact	Value
Check Plugin Download	https://github.com/Linuxfabrik/monitoring-plugins/tree/main/check-plugins/disk-smart
Check Interval Recommendation	Every 8 hours
Can be called without parameters	Yes
Compiled for Windows	No

Help

usage: disk-smart [-h] [-V] [--always-ok] [--full] [--ignore IGNORE]
                  [--test TEST]

This check is some kind of user interface for smartctl, which is a tool for
querying and controlling SMART (Self-Monitoring, Analysis, and Reporting
Technology) data in hard disk and solid-state drives. It allows you to inspect
the drive's SMART data to determine its health.

options:
  -h, --help       show this help message and exit
  -V, --version    show program's version number and exit
  --always-ok      Always returns OK.
  --full           If set, also warn on any assumptions (in GSmartControl
                   stated as "notice" messages), otherwise just warn on "real"
                   SMART issues. Default: False
  --ignore IGNORE  A comma-separated list of disks which should be ignored, in
                   the format 'sda,sdb'. Default: []
  --test TEST      For unit tests. Needs "path-to-stdout-file,path-to-stderr-
                   file,expected-retc".

Usage Examples

./disk-smart --ignore sdd,sdbx,mmcblk0 --full

Output:

Checked 6 disks. There are critical errors.
* sda (Crucial/Micron Client SSDs, Crucial_CT525MX300SSD1, SerNo 1a2b3c4d)
* sdb (Crucial/Micron Client SSDs, Crucial_CT525MX300SSD1, SerNo 1a2b3c4d)
* [CRITICAL] sdc (Seagate IronWolf, ST12000VN0007-2GS116, SerNo 1a2b3c4d)
  - The device error log contains records of errors.
  - Error Log: Drive is reporting 2 internal errors. Usually this means uncorrectable data loss and similar severe errors. Check the actual errors for details.
  - Error Log: Error "Uncorrectable error in data".
  - Error Log: Error "Uncorrectable error in data".
  - Attributes: Drive has a non-zero Raw value ("5 Reallocated_Sector_Ct"), but there is no SMART warning yet. This could be an indication of future failures and/or potential data loss in bad sectors.
* sdd (Seagate IronWolf, ST12000VN0007-2GS116, SerNo 1a2b3c4d)
  - The device error log contains records of errors.
* sde (Seagate IronWolf, ST12000VN0007-2GS116, SerNo 1a2b3c4d)
  - The device error log contains records of errors.
* sdf (Seagate IronWolf, ST12000VN0007-2GS116, SerNo 1a2b3c4d)
  - The device error log contains records of errors.

States

CRIT, if SMART reports

any messages in subsection "health"
drive has a failing pre-fail attribute
"Address mark not found" in subsection "error_log"
"Identity not found" in subsection "error_log"
"Track 0 not found" in subsection "error_log"
"Uncorrectable error in data" in subsection "error_log"
SMART status check returned DISK FAILING

WARN, if SMART reports

failing old-age attribute
failing pre-fail attribute in the past
"Command completion timed out" in subsection "error_log"
"End of media" in subsection "error_log"
"Interface CRC error" in subsection "error_log"
Drive is past its estimated lifespan
Drive is reporting surface errors

UNKNOWN, if smartctl is not found, if running smartctl fails, or if SMART is not available or not supported.

If smartctl reports more than one issue, the worst state across all disks is returned.
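The state rules above can be sketched as a small classifier plus a "worst state" reduction. This is a hedged illustration only: the numeric codes follow the usual Nagios plugin convention (0 OK, 1 WARN, 2 CRIT, 3 UNKNOWN), the function names are hypothetical, and the severity ranking (CRIT over WARN over UNKNOWN over OK) is an assumption, since the text does not say how UNKNOWN is ranked.

```python
STATE_OK, STATE_WARN, STATE_CRIT, STATE_UNKNOWN = 0, 1, 2, 3

# Error log messages taken from the CRIT and WARN lists above.
CRIT_ERRORS = {
    'Address mark not found',
    'Identity not found',
    'Track 0 not found',
    'Uncorrectable error in data',
}
WARN_ERRORS = {
    'Command completion timed out',
    'End of media',
    'Interface CRC error',
}

# Assumed ranking: CRIT > WARN > UNKNOWN > OK.
SEVERITY = {STATE_OK: 0, STATE_UNKNOWN: 1, STATE_WARN: 2, STATE_CRIT: 3}


def classify_error(message):
    """Map a single error log message to a state."""
    if message in CRIT_ERRORS:
        return STATE_CRIT
    if message in WARN_ERRORS:
        return STATE_WARN
    return STATE_OK


def worst_state(states):
    """Return the most severe state, or OK for an empty list."""
    return max(states, key=SEVERITY.__getitem__, default=STATE_OK)


per_disk = [
    classify_error('Interface CRC error'),
    classify_error('Uncorrectable error in data'),
]
print(worst_state(per_disk))
```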

Perfdata / Metrics

Temperatures
Remaining or used Lifetimes
Power On Hours
Power Cycle Counts

Troubleshooting

smartctl failed with exit status "Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode."
Run the check with root privileges, for example using sudo.
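The message above corresponds to one bit of smartctl's exit status, which is a bit mask. A minimal sketch of decoding it follows. The descriptions are paraphrased from the smartctl man page, the function name is hypothetical, and this is not necessarily how the plugin handles it.

```python
# Bit number -> meaning, paraphrased from the smartctl man page.
EXIT_BITS = {
    0: 'Command line did not parse.',
    1: 'Device open failed, device did not return an IDENTIFY DEVICE '
       'structure, or device is in a low-power mode.',
    2: 'A SMART or other ATA command failed, or a checksum error occurred.',
    3: 'SMART status check returned DISK FAILING.',
    4: 'Pre-fail attributes at or below threshold were found.',
    5: 'Some attributes were at or below threshold in the past.',
    6: 'The device error log contains records of errors.',
    7: 'The device self-test log contains records of errors.',
}


def decode_exit_status(status):
    """Return the messages for every bit set in a smartctl exit status."""
    return [msg for bit, msg in EXIT_BITS.items() if status & (1 << bit)]


# Example: bits 3 and 6 set (8 + 64 = 72).
for message in decode_exit_status(72):
    print(message)
```

An exit status of 2 (only bit 1 set) yields the "Device open failed" message, which is typically resolved by running the check with root privileges.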

Credits, License

Authors: Linuxfabrik GmbH, Zurich
License: The Unlicense, see LICENSE file.
Credits: GSmartControl: We re-implemented parts of the logic in Python and used its excellent output.