
SMARTd

S.M.A.R.T. is meant to help anticipate failures of storage devices (hard drives, SSDs, etc.). Well… in theory. :P

We use smartd on our two hypervisors to check the state of our SSDs at regular intervals and to alert the ARN admins by mail when a problem is detected.

Installation

sudo apt-get install smartmontools

Configuration

Hwhost-1

On our first physical machine, hwhost-1, a Dell server, we comment out the entire contents of /etc/smartd.conf and add the following lines:

/dev/sda -d sat -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdb -d sat -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdc -d sat -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdd -d sat -H -l error -l selftest -s S/../01/./06 -m root
/dev/sde -d sat -H -l error -l selftest -s S/../01/./06 -m root

Yes, this could be factored into a single line:

DEVICESCAN -d sat -H -l error -l selftest -s S/../01/./06 -m root

That would automatically pick up any new storage device. We do not do it, for consistency with hwhost-2 (see below) and because the smartd man page says:

Most users should comment out DEVICESCAN and explicitly list the devices that they wish to monitor.

We restart smartd and look at what it logged:

sudo systemctl restart smartd

sudo grep smartd /var/log/syslog
Aug 23 13:05:53 hwhost-1 smartd[30370]: smartd 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Aug 23 13:05:53 hwhost-1 smartd[30370]: Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
Aug 23 13:05:53 hwhost-1 smartd[30370]: Opened configuration file /etc/smartd.conf
Aug 23 13:05:53 hwhost-1 smartd[30370]: Configuration file /etc/smartd.conf parsed.
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], opened
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14178P, WWN:5-002538-8a08d4814, FW:EXM01B6Q, 512 GB
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], not found in smartd database.
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14178P.ata.state
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], opened
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14246V, WWN:5-002538-8a08d4858, FW:EXM01B6Q, 512 GB
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], not found in smartd database.
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14246V.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], opened
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14166Y, WWN:5-002538-8a08d4808, FW:EXM01B6Q, 512 GB
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], not found in smartd database.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14166Y.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], opened
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14249F, WWN:5-002538-8a08d485b, FW:EXM01B6Q, 512 GB
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], not found in smartd database.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14249F.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], opened
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14254X, WWN:5-002538-8a08d4860, FW:EXM01B6Q, 512 GB
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], not found in smartd database.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14254X.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Monitoring 5 ATA and 0 SCSI devices
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14178P.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14246V.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14166Y.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14249F.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14254X.ata.state
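
The log is long, and the line that matters for each disk is the one saying it was added to the "monitor" list. As a rough sketch, the filter below pulls out just the device names from such a log. The function name monitored_devices is our own invention for illustration, not something shipped with smartmontools, and the sample lines are copied from the output above.

```shell
#!/bin/sh
# monitored_devices: hypothetical helper, not part of smartmontools.
# Reads smartd log lines on stdin and prints every device that smartd
# reported adding to its "monitor" list, one per line, without duplicates.
monitored_devices() {
    grep 'Adding to "monitor" list' \
        | sed -n 's/.*Device: \([^ ,]*\).*/\1/p' \
        | sort -u
}

# Demonstration on three lines taken from the log above.
monitored_devices <<'EOF'
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], opened
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
EOF
```

On the host itself, the same function can be fed with sudo grep smartd /var/log/syslog. Because syslog also keeps the lines from earlier restarts, the sort -u step is what prevents each disk from being listed several times.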

Hwhost-2

On our second physical machine, hwhost-2, an HP server, we comment out the entire contents of /etc/smartd.conf and add the following lines:

/dev/sda -d cciss,0 -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdb -d cciss,1 -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdc -d cciss,2 -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdd -d cciss,3 -H -l error -l selftest -s S/../01/./06 -m root
/dev/sde -d cciss,4 -H -l error -l selftest -s S/../01/./06 -m root

This time the lines cannot be factored into one, because of the X in -d cciss,X ;)

We restart smartd and look at what it logged:
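
Since each line differs only by the device letter and the cciss index, the block can be generated instead of typed by hand. This is a minimal sketch under the assumption, taken from the lines above, that the disks are numbered from 0 in the same order as the device letters. The function name cciss_lines is hypothetical.

```shell
#!/bin/sh
# cciss_lines: hypothetical helper, not part of smartmontools.
# Prints one smartd.conf line per device letter given as argument,
# numbering the cciss index from 0 in argument order.
# The directives are copied unchanged from the hwhost-2 configuration.
cciss_lines() {
    index=0
    for letter in "$@"; do
        printf '/dev/sd%s -d cciss,%d -H -l error -l selftest -s S/../01/./06 -m root\n' \
            "$letter" "$index"
        index=$((index + 1))
    done
}

# Reproduce the five lines used on hwhost-2.
cciss_lines a b c d e
```

The output is meant to be reviewed and pasted into /etc/smartd.conf by hand. The mapping between letters and indexes is an assumption worth checking against the controller before trusting it.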

sudo systemctl restart smartd

sudo grep smartd /var/log/syslog
Aug 23 13:05:22 hwhost-2 smartd[27096]: smartd 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Aug 23 13:05:22 hwhost-2 smartd[27096]: Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
Aug 23 13:05:22 hwhost-2 smartd[27096]: Opened configuration file /etc/smartd.conf
Aug 23 13:05:22 hwhost-2 smartd[27096]: Configuration file /etc/smartd.conf parsed.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda, type changed from 'sat,auto+cciss' to 'sat'
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], opened
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14194D, WWN:5-002538-8a08d4824, FW:EXM01B6Q, 512 GB
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], not found in smartd database.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], not capable of SMART Health Status check
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14194D.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb, type changed from 'sat,auto+cciss' to 'sat'
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], opened
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14262L, WWN:5-002538-8a08d4868, FW:EXM01B6Q, 512 GB
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], not found in smartd database.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], not capable of SMART Health Status check
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14262L.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc, type changed from 'sat,auto+cciss' to 'sat'
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], opened
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14251J, WWN:5-002538-8a08d485d, FW:EXM01B6Q, 512 GB
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], not found in smartd database.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], not capable of SMART Health Status check
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14251J.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd, type changed from 'sat,auto+cciss' to 'sat'
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], opened
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14253E, WWN:5-002538-8a08d485f, FW:EXM01B6Q, 512 GB
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], not found in smartd database.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], not capable of SMART Health Status check
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14253E.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde, type changed from 'sat,auto+cciss' to 'sat'
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], opened
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14250W, WWN:5-002538-8a08d485c, FW:EXM01B6Q, 512 GB
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], not found in smartd database.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], not capable of SMART Health Status check
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14250W.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Monitoring 5 ATA and 0 SCSI devices
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14194D.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14262L.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14251J.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14253E.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14250W.ata.state

Explanations (what do these smartd.conf files do?)
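
The least readable part of the lines above is probably -s S/../01/./06. As far as the smartd.conf man page describes it, smartd compares this regular expression with a string of the form T/MM/DD/d/HH (test type, month, day of month, weekday from 1 for Monday to 7 for Sunday, hour), so this value asks for a short self-test (S) on the 1st of every month during the 06 hour. The sketch below imitates that comparison with grep to make the pattern easier to reason about. The helpers schedule_matches and now_stamp are hypothetical, and smartd's real matching happens inside the daemon, not through these commands.

```shell
#!/bin/sh
# Sketch only: imitate how a smartd "-s" schedule is compared to the clock.
# schedule_matches and now_stamp are hypothetical helpers for illustration.

# schedule_matches STAMP REGEX
# Succeeds when the extended regex REGEX matches the whole of STAMP.
schedule_matches() {
    printf '%s\n' "$1" | grep -Eq "^($2)\$"
}

# now_stamp: current time as T/MM/DD/d/HH, with T fixed to S (short test).
# %u is the weekday from 1 (Monday) to 7 (Sunday).
now_stamp() {
    date '+S/%m/%d/%u/%H'
}

schedule='S/../01/./06'
if schedule_matches "$(now_stamp)" "$schedule"; then
    echo "a short self-test would be due this hour"
else
    echo "no self-test due this hour"
fi
```

For example, the stamp S/08/01/1/06 (Monday 1 August, 06h) matches the schedule, while S/08/23/2/13 (the Tuesday of the logs above, 13h) does not.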

$ sudo smartctl -a /dev/sda
[…]
=== START OF INFORMATION SECTION ===
Vendor: HP
Product: LOGICAL VOLUME
Revision: 2.74
User Capacity: 512 076 636 160 bytes [512 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Logical Unit id: 0x600508b1001037383941424344450800
Serial number: 50123456789ABCDE
Device type: disk
Local Time is: Tue Aug 23 13:25:55 2016 CEST
SMART support is: Unavailable - device lacks SMART capability.