smartctl : Terminate command early due to bad response to IEC mode page

gehrke · 17 Nov. 2019

Hi all,

Looks like yet another problem with a hard disk:

Code:

Nov 17 07:50:55 j4 kernel: sd 0:0:0:0: [sda] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Nov 17 07:50:55 j4 kernel: sd 0:0:0:0: [sda] tag#24 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
Nov 17 07:50:55 j4 kernel: sd 0:0:0:0: [sda] tag#25 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Nov 17 07:50:55 j4 kernel: sd 0:0:0:0: [sda] tag#25 CDB: ATA command pass through(16) 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00
Nov 17 07:50:55 j4 smartd[29383]: Device: /dev/sda [SAT], SLEEP mode ignored due to reached limit of skipped checks (10 checks skipped)
Nov 17 07:50:55 j4 smartd[29383]: Device: /dev/sda [SAT], not capable of SMART self-check
Nov 17 07:50:55 j4 smartd[29383]: Device: /dev/sda [SAT], failed to read SMART Attribute Data
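
For context, the two failed CDBs in the kernel lines above can be decoded by hand. The sketch below assumes the standard ATA PASS-THROUGH (16) byte layout (opcode 0x85, feature in byte 4, command in byte 14); the helper name decode_ata16 and the lookup tables are illustrative and only cover the values seen in this log. Read this way, the rejected commands look like smartd's own power-mode check and SMART status query rather than ordinary disk I/O.

```python
# Decode the ATA PASS-THROUGH (16) CDBs quoted in the kernel log.
# Assumption: standard 16-byte layout, opcode 0x85, ATA command in byte 14.

ATA_COMMANDS = {0xE5: "CHECK POWER MODE", 0xB0: "SMART"}
SMART_FEATURES = {0xDA: "SMART RETURN STATUS"}  # only the subcommand seen here


def decode_ata16(cdb_hex: str) -> dict:
    """Return the ATA command fields carried in a 16-byte pass-through CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    command = cdb[14]
    feature = cdb[4]
    info = {
        "command": ATA_COMMANDS.get(command, hex(command)),
        "feature": feature,
        "lba_mid": cdb[10],
        "lba_high": cdb[12],
    }
    if command == 0xB0:
        # For SMART, the feature register selects the subcommand.
        info["subcommand"] = SMART_FEATURES.get(feature, hex(feature))
    return info


first = decode_ata16("85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00")
second = decode_ata16("85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00")
print(first["command"])
print(second["command"], second["subcommand"])
```

The 0x4f/0xc2 values in the second CDB are the fixed signature that SMART commands carry in the LBA mid/high registers.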

Attempt at an analysis:

Code:

# smartctl --all -T permissive /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-957.21.3.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               vices/pc
Product:              i0000:00/0000:00
Revision:             :1f.
Compliance:           SPC-5
User Capacity:        13.236.715.697.634.807.358 bytes [13236 PB]
Logical block size:   1986618213 bytes
Physical block size:  2981265408 bytes
Lowest aligned LBA:   12387
Formatted with type 2 protection
4 protection information intervals per logical block
32 bytes of protection information per logical block
>> Terminate command early due to bad response to IEC mode page

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

Device does not support Self Test logging

Can anything still be done here, or does the disk need to be replaced?
Thanks

Glückauf, gehrke

gehrke · 17 Nov. 2019

Addendum:

Code:

# fdisk -l  /dev/sda
fdisk: /dev/sda kann nicht geöffnet werden: Eingabe-/Ausgabefehler

manzek · 17 Nov. 2019

Hello gehrke,

somehow the tool is returning nonsensical values; on my system the output looks quite different.

Code:

smartctl --all -T permissive /dev/sdc
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.32-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST500DM002-1BC142
Serial Number:    XXXXXXXX
LU WWN Device Id: 5 000c50 03e76a31b
Firmware Version: JC4B
User Capacity:    500.107.862.016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Nov 17 10:21:13 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

I would first try a new cable and a different SATA port, ideally even in a different machine.
If that doesn't help, then it's off to /dev/null
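
One possible reading of the nonsensical values, not confirmed anywhere in this thread: the first smartctl output may show leftover text from a buffer rather than anything the drive sent. The sketch below only rearranges values copied from that output; the variable names are illustrative.

```python
# Fields copied from the first smartctl output in this thread.
vendor = "vices/pc"            # 8 characters, the width of a SCSI INQUIRY vendor field
product = "i0000:00/0000:00"   # 16 characters (product field)
revision = ":1f."              # 4 characters (revision field)

# Joined back together, the three fields read like part of a sysfs device path.
joined = vendor + product + revision
print(joined)

# The reported "Logical block size", viewed as a 32-bit big-endian integer,
# also consists of printable ASCII bytes.
block_size = 1986618213
as_text = block_size.to_bytes(4, "big").decode("ascii")
print(as_text)
```

If this reading is right, the impossible capacity and block sizes say nothing about the disk itself, only that the device was not answering at the time.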

gehrke · 17 Nov. 2019

I pulled the disk and inserted it into a different slot:

Code:

Nov 17 12:36:03 j4 kernel: ata1: exception Emask 0x50 SAct 0x0 SErr 0x4080800 action 0xe frozen
Nov 17 12:36:03 j4 kernel: ata1: irq_stat 0x00000040, connection status changed
Nov 17 12:36:03 j4 kernel: ata1: SError: { HostInt 10B8B DevExch }
Nov 17 12:36:03 j4 kernel: ata1: hard resetting link
Nov 17 12:36:04 j4 kernel: ata1: SATA link down (SStatus 0 SControl 300)
Nov 17 12:36:04 j4 kernel: ata1: EH complete
Nov 17 12:36:04 j4 kernel: ata1.00: detaching (SCSI 0:0:0:0)
Nov 17 12:36:04 j4 kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Nov 17 12:36:04 j4 kernel: sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Nov 17 12:36:04 j4 kernel: sd 0:0:0:0: [sda] Stopping disk
Nov 17 12:36:04 j4 kernel: sd 0:0:0:0: [sda] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Nov 17 12:36:04 j4 systemd[1]: Stopped target Local File Systems.
Nov 17 12:36:04 j4 systemd[1]: Unmounting /boot/efi...
Nov 17 12:36:04 j4 systemd[1]: Unmounted /boot/efi.
Nov 17 12:36:04 j4 systemd[1]: Unmounting /boot...
Nov 17 12:36:04 j4 systemd[1]: Unmounted /boot.
Nov 17 12:36:04 j4 systemd[1]: Stopped File System Check on /dev/disk/by-uuid/151a10d5-9161-4f5a-848d-598639c90998.
[...]
Nov 17 12:38:21 j4 kernel: ata6: irq_stat 0x00000040, connection status changed
Nov 17 12:38:21 j4 kernel: ata6: SError: { CommWake DevExch }
Nov 17 12:38:21 j4 kernel: ata6: hard resetting link
Nov 17 12:38:27 j4 kernel: ata6: link is slow to respond, please be patient (ready=0)
Nov 17 12:38:31 j4 kernel: ata6: COMRESET failed (errno=-16)
Nov 17 12:38:31 j4 kernel: ata6: hard resetting link
Nov 17 12:38:32 j4 kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 17 12:38:32 j4 kernel: ata6.00: ATA-9: WDC WD60EFRX-68L0BN1, 82.00A82, max UDMA/133
Nov 17 12:38:32 j4 kernel: ata6.00: 11721045168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Nov 17 12:38:32 j4 kernel: ata6.00: configured for UDMA/133
Nov 17 12:38:32 j4 kernel: ata6: EH complete
Nov 17 12:38:32 j4 kernel: scsi 5:0:0:0: Direct-Access     ATA      WDC WD60EFRX-68L 0A82 PQ: 0 ANSI: 5
Nov 17 12:38:33 j4 kernel: sd 5:0:0:0: [sda] 11721045168 512-byte logical blocks: (6.00 TB/5.45 TiB)
Nov 17 12:38:33 j4 kernel: sd 5:0:0:0: [sda] 4096-byte physical blocks
Nov 17 12:38:33 j4 kernel: sd 5:0:0:0: [sda] Write Protect is off
Nov 17 12:38:33 j4 kernel: sd 5:0:0:0: [sda] Mode Sense: 00 3a 00 00
Nov 17 12:38:33 j4 kernel: sd 5:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 17 12:38:33 j4 kernel: sd 5:0:0:0: Attached scsi generic sg0 type 0
Nov 17 12:38:33 j4 kernel:  sda: sda1 sda2 sda3
Nov 17 12:38:33 j4 kernel: sd 5:0:0:0: [sda] Attached SCSI disk
Nov 17 12:38:33 j4 systemd[1]: Found device WDC_WD60EFRX-68L0BN1 3.
Nov 17 12:38:33 j4 systemd[1]: Starting File System Check on /dev/disk/by-uuid/151a10d5-9161-4f5a-848d-598639c90998...
Nov 17 12:38:33 j4 systemd-fsck[5901]: /dev/sda3: stelle das Journal wieder her
Nov 17 12:38:33 j4 systemd-fsck[5901]: /dev/sda3: sauber, 82/59136 Dateien, 103363/236288 Blöcke
Nov 17 12:38:33 j4 systemd[1]: Started File System Check on /dev/disk/by-uuid/151a10d5-9161-4f5a-848d-598639c90998.
Nov 17 12:38:33 j4 systemd[1]: Mounting /boot...
Nov 17 12:38:33 j4 kernel: EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null)
Nov 17 12:38:33 j4 systemd[1]: Mounted /boot.
Nov 17 12:38:33 j4 systemd[1]: Mounting /boot/efi...
Nov 17 12:38:33 j4 kernel: FAT-fs (sda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
Nov 17 12:38:33 j4 systemd[1]: Mounted /boot/efi.
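
The SStatus values in the "SATA link down" and "SATA link up" lines above can be decoded into their bit fields. This is a small sketch assuming the usual SATA SStatus layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and that the kernel prints the register in hexadecimal; the function name and the wording of the table entries are illustrative.

```python
# Decode SStatus values as printed by the kernel (hexadecimal).
# Assumed layout: bits 3:0 DET, bits 7:4 SPD, bits 11:8 IPM.

DET = {
    0: "no device detected",
    1: "device detected, no PHY communication",
    3: "device detected, PHY communication established",
    4: "PHY offline",
}
SPD = {0: "no speed negotiated", 1: "Gen1 (1.5 Gbps)", 2: "Gen2 (3.0 Gbps)", 3: "Gen3 (6.0 Gbps)"}
IPM = {0: "not present", 1: "active", 2: "partial", 6: "slumber"}


def decode_sstatus(text: str) -> dict:
    """Split an SStatus value such as '123' into its three fields."""
    value = int(text, 16)
    det, spd, ipm = value & 0xF, (value >> 4) & 0xF, (value >> 8) & 0xF
    return {
        "det": DET.get(det, f"reserved ({det})"),
        "spd": SPD.get(spd, f"reserved ({spd})"),
        "ipm": IPM.get(ipm, f"reserved ({ipm})"),
    }


down = decode_sstatus("0")    # from "SATA link down (SStatus 0 SControl 300)"
up = decode_sstatus("123")    # from "SATA link up 3.0 Gbps (SStatus 123 SControl 300)"
print(down)
print(up)
```

Decoded this way, the second value matches the "link up 3.0 Gbps" message: device present, Gen2 speed, interface active.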

Code:

[root@j4 ~]# fdisk -l  /dev/sda
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sda: 6001.2 GB, 6001175126016 bytes, 11721045168 sectors
Units = Sektoren of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
Disk identifier: AF4D1E0C-AA57-449D-9433-<xxx>


#         Start          End    Size  Type            Name
 1         2048       206847    100M  EFI System      EFI System Partition
 2      2099200  11721045134    5,5T  Linux RAID      Linux RAID
 3       206848      2097151    923M  Microsoft basic

:???:

gehrke · 17 Nov. 2019

Back in its original slot, the disk is not causing any problems for now either. The boot devices get mounted and the logs look clean.

I'll try to add the disk back into the RAID; that should generate load and possibly new problems:

Code:

# mdadm --add /dev/md1 /dev/sda2
mdadm: re-added /dev/sda2

# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md1 : active raid6 sda2[0] sdd2[4] sdc2[3] sdb2[1]
      11718683648 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [_UUU]
      [>....................]  recovery =  0.0% (2483456/5859341824) finish=2597.5min speed=37578K/sec
      bitmap: 11/44 pages [44KB], 65536KB chunk

unused devices: <none>
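
The /proc/mdstat output above can be read mechanically: [4/3] [_UUU] means four member slots with three in sync and slot 0 missing, and the finish= estimate is simply the remaining blocks divided by the current speed. A minimal sketch, assuming the line format shown in this post; the function name summarize and the embedded sample are illustrative, not an mdadm interface.

```python
import re

# Sample copied from the /proc/mdstat output in this post.
MDSTAT_SAMPLE = """\
md1 : active raid6 sda2[0] sdd2[4] sdc2[3] sdb2[1]
      11718683648 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [_UUU]
      [>....................]  recovery =  0.0% (2483456/5859341824) finish=2597.5min speed=37578K/sec
"""


def summarize(mdstat: str) -> dict:
    """Extract member state and a recovery estimate from mdstat text."""
    members = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", mdstat)
    progress = re.search(
        r"recovery\s*=\s*[\d.]+% \((\d+)/(\d+)\).*speed=(\d+)K/sec", mdstat
    )
    summary = {
        "expected": int(members.group(1)),
        "active": int(members.group(2)),
        # "_" marks a slot that is not in sync.
        "missing_slots": [i for i, c in enumerate(members.group(3)) if c == "_"],
    }
    if progress:
        done, total, speed = (int(g) for g in progress.groups())
        # Counters are in 1 KiB blocks and speed in K/sec, so the quotient is seconds.
        summary["eta_minutes"] = (total - done) / speed / 60
    return summary


state = summarize(MDSTAT_SAMPLE)
print(state)
```

The estimate only reflects the speed at that moment. Because mdadm reported a re-add and the array shows a bitmap: line, md may only need to resync regions marked dirty in the write-intent bitmap, in which case the recovery can finish far sooner than finish= suggests.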

gehrke · 17 Nov. 2019

Oh, that was quick:

Code:

Nov 17 13:55:50 j4 kernel: md: md1: recovery done.

Code:

# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md1 : active raid6 sda2[0] sdd2[4] sdc2[3] sdb2[1]
      11718683648 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/44 pages [4KB], 65536KB chunk

unused devices: <none>

josef-wien · 17 Nov. 2019

In my experience the SATA cable is usually to blame (and in that case unplugging and re-plugging only fixes the problem temporarily).

gehrke · 23 Nov. 2019

No further incidents so far.

I find it hard to believe that a cable is the culprit here: the server just sits there without any outside interference or mechanical stress, and the disks are attached to a backplane (is that what it's called?) and make contact through slots. Nobody touched it.
Still, it could of course be the cause anyway, old age or something like that.

The disk is from 2018, by the way.

Since it is part of a RAID-6 and therefore has double redundancy, I can probably afford the luxury of simply watching how it behaves from here.

Many thanks to everyone involved!
