SMARTd

S.M.A.R.T. makes it possible to anticipate failures of storage media (hard disks, SSDs, etc.). Well… in theory. :P

We use smartd on our two hypervisors to regularly monitor the state of our SSDs and to alert the ARN admins by email when a problem is detected.

Installation

sudo apt-get install smartmontools

Configuration

Hwhost-1

On our first physical machine, hwhost-1, a Dell server, we comment out the entire contents of /etc/smartd.conf and add the following lines:

/dev/sda -d sat -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdb -d sat -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdc -d sat -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdd -d sat -H -l error -l selftest -s S/../01/./06 -m root
/dev/sde -d sat -H -l error -l selftest -s S/../01/./06 -m root

Yes, this could be factored into a single line:

DEVICESCAN -d sat -H -l error -l selftest -s S/../01/./06 -m root

That would automatically pick up any new storage device. We don't do it, for consistency with hwhost-2 (see below) and because the smartd man page says:

Most users should comment out DEVICESCAN and explicitly list the devices that they wish to monitor.

We restart smartd:

sudo systemctl restart smartd

sudo grep smartd /var/log/syslog
Aug 23 13:05:53 hwhost-1 smartd[30370]: smartd 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Aug 23 13:05:53 hwhost-1 smartd[30370]: Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
Aug 23 13:05:53 hwhost-1 smartd[30370]: Opened configuration file /etc/smartd.conf
Aug 23 13:05:53 hwhost-1 smartd[30370]: Configuration file /etc/smartd.conf parsed.
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], opened
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14178P, WWN:5-002538-8a08d4814, FW:EXM01B6Q, 512 GB
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], not found in smartd database.
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14178P.ata.state
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], opened
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14246V, WWN:5-002538-8a08d4858, FW:EXM01B6Q, 512 GB
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], not found in smartd database.
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14246V.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], opened
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14166Y, WWN:5-002538-8a08d4808, FW:EXM01B6Q, 512 GB
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], not found in smartd database.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14166Y.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], opened
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14249F, WWN:5-002538-8a08d485b, FW:EXM01B6Q, 512 GB
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], not found in smartd database.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14249F.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], opened
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14254X, WWN:5-002538-8a08d4860, FW:EXM01B6Q, 512 GB
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], not found in smartd database.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14254X.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Monitoring 5 ATA and 0 SCSI devices
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14178P.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14246V.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14166Y.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14249F.ata.state
Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14254X.ata.state
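
Rather than reading the whole log, one can pull out just the summary line to confirm that all five SSDs are being monitored. The sketch below is only an illustration: the function name last_monitoring_summary is made up for this page, and it assumes the log format shown above.

```shell
# Hypothetical helper (not part of smartmontools): read syslog lines on stdin
# and print the most recent "Monitoring N ATA and M SCSI devices" summary
# written by smartd.
last_monitoring_summary() {
    grep 'smartd\[[0-9]*\]: Monitoring [0-9]* ATA' | tail -n 1
}

# Usage on a hypervisor:
#   sudo cat /var/log/syslog | last_monitoring_summary
```

Only the last match is kept because the log may contain several smartd restarts.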

Hwhost-2

On our second physical machine, hwhost-2, an HP server, we comment out the entire contents of /etc/smartd.conf and add the following lines:

/dev/sda -d cciss,0 -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdb -d cciss,1 -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdc -d cciss,2 -H -l error -l selftest -s S/../01/./06 -m root
/dev/sdd -d cciss,3 -H -l error -l selftest -s S/../01/./06 -m root
/dev/sde -d cciss,4 -H -l error -l selftest -s S/../01/./06 -m root

This time the lines cannot be factored into one, because of the X in -d cciss,X ;)
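
If the number of SSDs changes, the per-disk lines can still be generated instead of typed by hand. This is a minimal sketch, not what we actually ran: gen_cciss_lines is a hypothetical helper, and it assumes that the cciss indexes start at 0 and follow the /dev/sdX letters in order, as in the configuration above.

```shell
# Hypothetical helper: print one smartd.conf line per SSD behind the HP
# controller. $1 is the number of disks (at most 8 in this sketch).
gen_cciss_lines() {
    count="$1"
    i=0
    for letter in a b c d e f g h; do
        [ "$i" -lt "$count" ] || break
        printf '/dev/sd%s -d cciss,%d -H -l error -l selftest -s S/../01/./06 -m root\n' "$letter" "$i"
        i=$((i + 1))
    done
}

# Prints the five lines used on hwhost-2.
gen_cciss_lines 5
```

The output is meant to be reviewed and then pasted into /etc/smartd.conf, not written there automatically.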

We restart smartd:

sudo systemctl restart smartd

sudo grep smartd /var/log/syslog
Aug 23 13:05:22 hwhost-2 smartd[27096]: smartd 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Aug 23 13:05:22 hwhost-2 smartd[27096]: Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
Aug 23 13:05:22 hwhost-2 smartd[27096]: Opened configuration file /etc/smartd.conf
Aug 23 13:05:22 hwhost-2 smartd[27096]: Configuration file /etc/smartd.conf parsed.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda, type changed from 'sat,auto+cciss' to 'sat'
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], opened
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14194D, WWN:5-002538-8a08d4824, FW:EXM01B6Q, 512 GB
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], not found in smartd database.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], not capable of SMART Health Status check
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14194D.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb, type changed from 'sat,auto+cciss' to 'sat'
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], opened
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14262L, WWN:5-002538-8a08d4868, FW:EXM01B6Q, 512 GB
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], not found in smartd database.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], not capable of SMART Health Status check
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14262L.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc, type changed from 'sat,auto+cciss' to 'sat'
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], opened
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14251J, WWN:5-002538-8a08d485d, FW:EXM01B6Q, 512 GB
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], not found in smartd database.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], not capable of SMART Health Status check
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14251J.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd, type changed from 'sat,auto+cciss' to 'sat'
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], opened
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14253E, WWN:5-002538-8a08d485f, FW:EXM01B6Q, 512 GB
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], not found in smartd database.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], not capable of SMART Health Status check
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14253E.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde, type changed from 'sat,auto+cciss' to 'sat'
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], opened
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14250W, WWN:5-002538-8a08d485c, FW:EXM01B6Q, 512 GB
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], not found in smartd database.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], not capable of SMART Health Status check
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], is SMART capable. Adding to "monitor" list.
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14250W.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Monitoring 5 ATA and 0 SCSI devices
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14194D.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14262L.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14251J.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14253E.ata.state
Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14250W.ata.state
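
The log above shows one distinct serial number per cciss index. To cross-check that mapping by hand, outside of smartd, something like the following could be used. It is a sketch under assumptions: list_cciss_serials and the SMARTCTL variable are names invented for this example; only the smartctl options -i and -d cciss,N come from smartmontools.

```shell
# Name or path of the smartctl binary; overridable, e.g. for a dry run.
SMARTCTL="${SMARTCTL:-smartctl}"

# Hypothetical helper: print the serial number reported for each cciss index.
# $1: block device behind the controller, $2: number of physical disks.
list_cciss_serials() {
    dev="$1"
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'cciss,%d: ' "$i"
        "$SMARTCTL" -i -d "cciss,$i" "$dev" | grep -i 'serial number'
        i=$((i + 1))
    done
}

# Usage on hwhost-2 (as root):
#   list_cciss_serials /dev/sda 5
```

Running it with /dev/sdb instead of /dev/sda should give the same serial numbers, since the cciss index, not the device node, selects the SSD.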

Explanations (what do these smartd.conf files do?)

  • smartd monitors the overall health of the disk (-H), that is, the pre-fail attributes (read: the indicators or metrics which signal that the disk is about to die), plus the general error log (-l error), plus the self-test log (-l selftest). If a pre-fail attribute drops below its defined threshold, or if the number of errors in the general error log or in the self-test log has increased, smartd sends a mail to root (-m root). Thanks to our email configuration, the ARN admins receive these alert emails.
  • In addition, smartd schedules (-s) a short test (the "S") on the first day of each month at 6 in the morning. If that test detects something, "-l selftest" makes smartd send a mail to root.
  • "-d" specifies the device type, so that smartd does not use SCSI commands on a SATA disk and vice versa. Honestly, there is little need to specify it: smartd works out the device type on its own in the overwhelming majority of cases (barring a firmware bug or a flaky RAID controller).
    • "sat" means that each of our SSDs sits behind a SCSI to ATA Translation (SAT) layer.
    • "cciss,X" targets one particular disk on an HP P410i RAID controller. Note that the device (/dev/sdX) is irrelevant: the RAID controller always points us to the same SSD as long as the X in -d cciss,X has not changed (to see this, change the device and observe that the SSD serial number reported by "smartctl -a" stays the same). If the cciss type is not specified, the controller intercepts the SMART requests and blocks them. Source. Example of what smartctl shows in that case:
$ sudo smartctl -a /dev/sda
[…]
=== START OF INFORMATION SECTION ===
Vendor: HP
Product: LOGICAL VOLUME
Revision: 2.74
User Capacity: 512 076 636 160 bytes [512 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Logical Unit id: 0x600508b1001037383941424344450800
Serial number: 50123456789ABCDE
Device type: disk
Local Time is: Tue Aug 23 13:25:55 2016 CEST
SMART support is: Unavailable - device lacks SMART capability.
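
Coming back to the -s directive explained above: according to the smartd.conf man page, its argument is a regular expression matched against a date string of the form T/MM/DD/d/HH (test type, month, day of month, day of week from 1 = Monday to 7 = Sunday, hour). The sketch below imitates that matching with grep to show why S/../01/./06 means "short test, the 1st of any month, any weekday, at 06:00". The function matches_schedule is made up for this example, and the whole-string match done with grep -x is an assumption, not a statement about smartd internals.

```shell
# Hypothetical helper: succeed if the date string $2 (T/MM/DD/d/HH)
# matches the schedule regular expression $1.
matches_schedule() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Thursday 1 September, 06:00: the short test is due.
matches_schedule 'S/../01/./06' 'S/09/01/4/06' && echo "due"
# Friday 2 September, 06:00: nothing is scheduled.
matches_schedule 'S/../01/./06' 'S/09/02/5/06' || echo "not due"
```
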
technique/smartd.txt · Last modified: 2016/08/23 13:45 by lg