====== SMARTd ======

[[https://fr.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology|S.M.A.R.T]] makes it possible to anticipate failures of storage devices (hard drives, SSDs, etc.). Well… in theory. :P

**We use smartd on both of our hypervisors to check the state of our SSDs at regular intervals and to warn the ARN admins by email when a problem is detected.**

===== Installation =====

  sudo apt-get install smartmontools

===== Configuration =====

==== Hwhost-1 ====

On our first physical machine, hwhost-1, a Dell server, we comment out the entire contents of /etc/smartd.conf and add the following lines:

  /dev/sda -d sat -H -l error -l selftest -s S/../01/./06 -m root
  /dev/sdb -d sat -H -l error -l selftest -s S/../01/./06 -m root
  /dev/sdc -d sat -H -l error -l selftest -s S/../01/./06 -m root
  /dev/sdd -d sat -H -l error -l selftest -s S/../01/./06 -m root
  /dev/sde -d sat -H -l error -l selftest -s S/../01/./06 -m root

Yes, this could be factored into a single line:

  DEVICESCAN -d sat -H -l error -l selftest -s S/../01/./06 -m root

That would automatically pick up any new storage device. We do not do it, for consistency with hwhost-2 (see below) and because the smartd man page says:

> **Most users should comment out DEVICESCAN and explicitly list the devices that they wish to monitor**.

We start smartd:

  sudo systemctl restart smartd
  sudo grep smartd /var/log/syslog
  Aug 23 13:05:53 hwhost-1 smartd[30370]: smartd 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Opened configuration file /etc/smartd.conf
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Configuration file /etc/smartd.conf parsed.
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], opened
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14178P, WWN:5-002538-8a08d4814, FW:EXM01B6Q, 512 GB
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], not found in smartd database.
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14178P.ata.state
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], opened
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14246V, WWN:5-002538-8a08d4858, FW:EXM01B6Q, 512 GB
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], not found in smartd database.
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
  Aug 23 13:05:53 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14246V.ata.state
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], opened
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14166Y, WWN:5-002538-8a08d4808, FW:EXM01B6Q, 512 GB
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], not found in smartd database.
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], is SMART capable. Adding to "monitor" list.
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14166Y.ata.state
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], opened
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14249F, WWN:5-002538-8a08d485b, FW:EXM01B6Q, 512 GB
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], not found in smartd database.
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], is SMART capable. Adding to "monitor" list.
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14249F.ata.state
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], opened
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14254X, WWN:5-002538-8a08d4860, FW:EXM01B6Q, 512 GB
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], not found in smartd database.
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], is SMART capable. Adding to "monitor" list.
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14254X.ata.state
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Monitoring 5 ATA and 0 SCSI devices
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14178P.ata.state
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14246V.ata.state
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdc [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14166Y.ata.state
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sdd [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14249F.ata.state
  Aug 23 13:05:54 hwhost-1 smartd[30370]: Device: /dev/sde [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14254X.ata.state

==== Hwhost-2 ====

On our second physical machine, hwhost-2, an HP server, we comment out the entire contents of /etc/smartd.conf and add the following lines:

  /dev/sda -d cciss,0 -H -l error -l selftest -s S/../01/./06 -m root
  /dev/sdb -d cciss,1 -H -l error -l selftest -s S/../01/./06 -m root
  /dev/sdc -d cciss,2 -H -l error -l selftest -s S/../01/./06 -m root
  /dev/sdd -d cciss,3 -H -l error -l selftest -s S/../01/./06 -m root
  /dev/sde -d cciss,4 -H -l error -l selftest -s S/../01/./06 -m root

This time the lines cannot be factored into one, because of the X in -d cciss,X ;)

We start smartd:

  sudo systemctl restart smartd
  sudo grep smartd /var/log/syslog
  Aug 23 13:05:22 hwhost-2 smartd[27096]: smartd 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Opened configuration file /etc/smartd.conf
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Configuration file /etc/smartd.conf parsed.
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda, type changed from 'sat,auto+cciss' to 'sat'
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], opened
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14194D, WWN:5-002538-8a08d4824, FW:EXM01B6Q, 512 GB
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], not found in smartd database.
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], not capable of SMART Health Status check
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], is SMART capable. Adding to "monitor" list.
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14194D.ata.state
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb, type changed from 'sat,auto+cciss' to 'sat'
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], opened
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14262L, WWN:5-002538-8a08d4868, FW:EXM01B6Q, 512 GB
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], not found in smartd database.
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], not capable of SMART Health Status check
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], is SMART capable. Adding to "monitor" list.
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14262L.ata.state
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc, type changed from 'sat,auto+cciss' to 'sat'
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], opened
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14251J, WWN:5-002538-8a08d485d, FW:EXM01B6Q, 512 GB
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], not found in smartd database.
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], not capable of SMART Health Status check
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], is SMART capable. Adding to "monitor" list.
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14251J.ata.state
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd, type changed from 'sat,auto+cciss' to 'sat'
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], opened
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14253E, WWN:5-002538-8a08d485f, FW:EXM01B6Q, 512 GB
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], not found in smartd database.
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], not capable of SMART Health Status check
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], is SMART capable. Adding to "monitor" list.
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14253E.ata.state
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde, type changed from 'sat,auto+cciss' to 'sat'
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], opened
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], Samsung SSD 850 PRO 512GB, S/N:S1SXNSAFC14250W, WWN:5-002538-8a08d485c, FW:EXM01B6Q, 512 GB
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], not found in smartd database.
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], not capable of SMART Health Status check
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], is SMART capable. Adding to "monitor" list.
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14250W.ata.state
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Monitoring 5 ATA and 0 SCSI devices
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sda [cciss_disk_00] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14194D.ata.state
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdb [cciss_disk_01] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14262L.ata.state
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdc [cciss_disk_02] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14251J.ata.state
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sdd [cciss_disk_03] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14253E.ata.state
  Aug 23 13:05:22 hwhost-2 smartd[27096]: Device: /dev/sde [cciss_disk_04] [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_PRO_512GB-S1SXNSAFC14250W.ata.state

==== Explanations (what do these smartd.conf files do?) ====

  * smartd monitors the overall health of the disk (-H), that is, the pre-fail attributes (read: the indicators or metrics that signal the disk is going to die soon), plus the general error log (-l error), plus the self-test log (-l selftest). If the pre-fail attributes drop below the defined threshold, or if the number of errors in the general log or in the self-test log has increased, smartd sends an email to root. Thanks to our [[benevoles:technique:emails|email configuration]], the ARN admins receive these alert emails.
  * In addition, smartd schedules (-s) a short test (the "S") on the first day of every month at 6 a.m. If that test detects something, "-l selftest" makes smartd send an email to root.
  * "-d" specifies the disk type, so that smartd does not send SCSI commands to a SATA disk or the other way round. In practice it hardly matters: smartd works out the disk type by itself in the overwhelming majority of cases (barring a firmware bug or a flaky RAID controller).
    * "sat" means that each of our SSDs sits behind a SCSI-to-SATA adapter.
    * "cciss,X" targets one particular disk on an HP P410i RAID controller. Note that the device (/dev/sdX) is irrelevant: the RAID controller always points us at the same SSD as long as the X in -d cciss,X has not changed (to see this for yourself, change the device and check that the SSD serial number reported by "smartctl -a" stays the same). If the cciss type is not specified, the controller intercepts the SMART requests and blocks them. [[http://community.hpe.com/t5/System-Administration/How-to-use-smartctl-with-cciss/td-p/4036978|Source]]. Example output when the type is not specified:

> $ sudo smartctl -a /dev/sda
> [...]
> === START OF INFORMATION SECTION ===
> Vendor: HP
> Product: LOGICAL VOLUME
> Revision: 2.74
> User Capacity: 512 076 636 160 bytes [512 GB]
> Logical block size: 512 bytes
> Rotation Rate: 15000 rpm
> Logical Unit id: 0x600508b1001037383941424344450800
> Serial number: 50123456789ABCDE
> Device type: disk
> Local Time is: Tue Aug 23 13:25:55 2016 CEST
> SMART support is: Unavailable - device lacks SMART capability.
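
To make the "-s S/../01/./06" schedule from the explanations above a bit more concrete, here is a minimal sketch. It assumes, as described in the smartd.conf man page, that smartd compares the regular expression against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour), and that the whole string has to match. The helper name schedule_matches is ours and is not part of smartmontools.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): succeed when a
# "T/MM/DD/d/HH" string matches a smartd "-s" regular expression.
# grep -E uses extended regular expressions; -x requires the whole
# line to match, which is the behaviour we assume for smartd.
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Three sample instants from 2016, written in smartd's T/MM/DD/d/HH form:
#   S/09/01/4/06 = short test, 1 September (a Thursday), 06h
#   S/09/02/5/06 = short test, 2 September (a Friday), 06h
#   S/10/01/6/06 = short test, 1 October (a Saturday), 06h
for stamp in S/09/01/4/06 S/09/02/5/06 S/10/01/6/06; do
    if schedule_matches 'S/../01/./06' "$stamp"; then
        echo "$stamp: short test would be scheduled"
    else
        echo "$stamp: nothing scheduled"
    fi
done
```

The ".." swallows any month, the "." any day of the week, and only day 01 at hour 06 is kept, which is how the expression ends up meaning "the first of every month at 6 a.m.".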
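
Since the cciss lines cannot be factored with DEVICESCAN, the per-disk lines of both hosts can also be generated rather than typed. The following is only a sketch of that idea, not what we actually run: the function name render_smartd_conf and its two arguments are made up for the illustration, and the directives are copied from the configuration shown on this page.

```shell
#!/bin/sh
# Hypothetical generator for the smartd.conf lines shown on this page.
# Usage: render_smartd_conf TYPE COUNT
#   TYPE  : "sat" (hwhost-1) or "cciss" (hwhost-2, becomes cciss,0 cciss,1 ...)
#   COUNT : number of disks (this sketch stops at 8, /dev/sda to /dev/sdh)
render_smartd_conf() {
    dev_type="$1"
    count="$2"
    # Directives copied verbatim from our smartd.conf files.
    directives='-H -l error -l selftest -s S/../01/./06 -m root'
    i=0
    for letter in a b c d e f g h; do
        [ "$i" -ge "$count" ] && break
        case "$dev_type" in
            cciss) d_arg="cciss,$i" ;;   # one index per physical disk
            *)     d_arg="$dev_type" ;;  # e.g. plain "sat"
        esac
        printf '/dev/sd%s -d %s %s\n' "$letter" "$d_arg" "$directives"
        i=$((i + 1))
    done
}

# Print the five lines used on hwhost-2.
render_smartd_conf cciss 5
```

Calling render_smartd_conf sat 5 gives the hwhost-1 variant. The output is meant to be reviewed and pasted into /etc/smartd.conf by hand; the sketch deliberately does not write the file itself.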