Disk weirdness on pre-production fuller

From PrgmrWiki

So, a disk mysteriously dropped out of fuller's RAID.[1] If I'm counting right, it was the disk on the add-on card. I went to test the disk: SMART is spotless, and it passes the SMART self-tests. Then I started running badblocks, and started getting pcieport errors for the ethernet card, of all things.[2]




So, I replaced the add-on card, and now I see:




[root@fuller ~]# smartctl -a /dev/sdg
smartctl 5.39.1 program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: /6:0:0:0  Version: 
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.




[1]
ata8.00: exception Emask 0x52 SAct 0x0 SErr 0xffffffff action 0xe frozen
ata8: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
ata8.00: failed command: FLUSH CACHE EXT
ata8.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:04:e0:4a:1d/00:00:01:00:00/40 Emask 0x56 (ATA bus error)
ata8.00: status: { DRDY }
ata8: hard resetting link
ata8: failed to resume link (SControl FFFFFFFF)
ata8: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
ata8: hard resetting link
ata8: failed to resume link (SControl FFFFFFFF)
ata8: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
ata8: limiting SATA link speed to 3.0 Gbps
ata8: hard resetting link
ata8: failed to resume link (SControl FFFFFFFF)
ata8: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
ata8.00: disabled
sd 7:0:0:0: rejecting I/O to offline device
md: super_written gets error=-5, uptodate=0
md/raid1:md0: Disk failure on sdg1, disabling device.
md/raid1:md0: Operation continuing on 6 devices.
[2]
[root@fuller ~]# badblocks -w /dev/sdg
pcieport 0000:00:05.0: AER: Uncorrected (Non-Fatal) error received: id=0000
pcieport 0000:00:05.0: AER: Uncorrected (Non-Fatal) error received: id=0000
pcieport 0000:00:05.0: AER: Uncorrected (Non-Fatal) error received: id=0000
pcieport 0000:00:05.0: AER: Uncorrected (Non-Fatal) error received: id=0000
pcieport 0000:00:05.0: AER: Uncorrected (Non-Fatal) error received: id=0000
pcieport 0000:00:05.0: AER: Uncorrected (Non-Fatal) error received: id=0000
pcieport 0000:00:05.0: AER: Uncorrected (Non-Fatal) error received: id=0000
pcieport 0000:00:05.0: AER: Uncorrected (Non-Fatal) error received: id=0000


From lspci:

05:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)


Actually, I'm not at all sure those two numbers are the same device. Pretty sure they are not: the ethernet controller is 05:00.0, and the errors come from 00:05.0.


[root@fuller ~]# lspci |grep "05\.0 "
00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22)
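The mix-up is easy to make because lspci prints addresses as bus:device.function, with an optional domain in front. As a rough sketch (parse_bdf is a made-up helper name, not something from pciutils), splitting the two addresses shows they have nothing in common except the digits:

```shell
# Split a PCI address of the form [domain:]bus:device.function into its parts.
# parse_bdf is a hypothetical helper for illustration, not part of pciutils.
parse_bdf() {
    addr=$1
    # Drop an optional leading 4-digit domain such as "0000:".
    case $addr in
        ????:??:??.?) addr=${addr#*:} ;;
    esac
    bus=${addr%%:*}       # text before the first colon
    rest=${addr#*:}       # "device.function"
    device=${rest%%.*}    # text before the dot
    func=${rest#*.}       # text after the dot
    echo "bus=$bus device=$device function=$func"
}

parse_bdf 05:00.0        # the ethernet controller from lspci
parse_bdf 0000:00:05.0   # the port named in the AER messages
```

The first is device 00 on bus 05, the second is device 05 on bus 00, which is the root port.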


So my error is on PCI Express root port 5 (00:05.0). Yeah, I'm going to blame the add-on SATA card. So, just remove it? Or replace it? That is the question.
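Before blaming the card outright, it would be worth confirming that it really sits behind that root port. A minimal sketch, assuming sysfs is mounted at /sys and that the kernel exposes downstream devices as subdirectories of the bridge's node (devices_behind is a made-up helper name, and its second argument is only there so it can be pointed at a different sysfs root):

```shell
# List the PCI devices directly downstream of a bridge by looking for
# child directories named like PCI addresses under its sysfs node.
# devices_behind is a hypothetical helper for illustration.
devices_behind() {
    bridge=$1
    root=${2:-/sys/bus/pci/devices}
    for child in "$root/$bridge"/????:??:??.?; do
        # If nothing matches, the glob stays literal and this test fails.
        [ -d "$child" ] && basename "$child"
    done
    return 0
}

# The root port from the AER messages:
devices_behind 0000:00:05.0
```

Each address it prints can be looked up with lspci -s to see whether it is the SATA controller. lspci -tv shows the same hierarchy as a tree.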

www.redhat.com/promo/summit/.../fal_prarit_rhsummit2010.pdf