10-03-04 01:46 AM
Folks,
this is the second time in only a couple of weeks that we've had RAID 5
arrays mysteriously go bad on us, reporting bad blocks to the OS
without any discs being flagged as faulty. Yes, I do mean the virtual
disc representing the (redundant) array is reporting bad blocks :-(
I'm really starting to wonder, and I'd like to know if you've had
similar experiences.
The first incident involved a (relative to our IT budget) very expensive
Dell PowerVault 220S with two PERC 2/DC RAID controllers (cluster
configuration). NetWare 6 started crashing with "node castout/fatal
san error", and a bad-block scan showed that the array was indeed
reporting bad blocks. After quite some back and forth trying to save the
data, we had to replace 6 of the 8 drives in the array, drives that the
firmware _failed_ to report as faulty. We had to reinitialize the array and
restore our data from tape. We've also updated the firmware, and the
vendor tells us "it shouldn't happen again with the new firmware".
The second incident is yet to be resolved. We have an EasyRaid II (ca. 4
years old) with 8 discs, also in a RAID 5 configuration (with one hot
spare), that has started to report bad blocks:
-------------------------- snip ------------------------------
Sep 29 18:33:41 koala10 kernel: scsi1: ERROR on channel 0, id 2, lun 0,
CDB: Read (10) 00 28 73 69 f6 00 01 00 00
Sep 29 18:33:41 koala10 kernel: Info fld=0x0, Current sd08:25: sense key
Medium Error
Sep 29 18:33:41 koala10 kernel: I/O error: dev 08:25, sector 678652280
Sep 29 18:33:41 koala10 kernel: scsi1: ERROR on channel 0, id 2, lun 0,
CDB: Read (10) 00 28 73 69 fe 00 00 f8 00
Sep 29 18:33:41 koala10 kernel: Info fld=0x0, Current sd08:25: sense key
Medium Error
Sep 29 18:33:41 koala10 kernel: I/O error: dev 08:25, sector 678652288
Sep 29 18:33:41 koala10 kernel: scsi1: ERROR on channel 0, id 2, lun 0,
CDB: Read (10) 00 28 73 6a 06 00 00 f0 00
Sep 29 18:33:41 koala10 kernel: Info fld=0x0, Current sd08:25: sense key
Medium Error
Sep 29 18:33:41 koala10 kernel: I/O error: dev 08:25, sector 678652296
-------------------------- snip ------------------------------
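For anyone reading the excerpt above, here is a minimal Python sketch that decodes the logged values. It assumes the nine bytes printed after "Read (10)" are the rest of the standard 10-byte CDB (flags, 4-byte big-endian LBA, group, 2-byte transfer length, control) and that "08:25" is major:minor in hexadecimal. The helper names are made up for illustration.

```python
# Hypothetical helpers for decoding the kernel messages quoted above.

def decode_read10(cdb_tail):
    """Return (lba, blocks) from the 9 bytes printed after 'Read (10)'."""
    raw = bytes.fromhex(cdb_tail.replace(" ", ""))
    if len(raw) != 9:
        raise ValueError("expected 9 bytes after the opcode")
    lba = int.from_bytes(raw[1:5], "big")     # logical block address
    blocks = int.from_bytes(raw[6:8], "big")  # transfer length in blocks
    return lba, blocks

def decode_dev(dev):
    """Split 'MM:mm' into (major, minor), assuming both are hexadecimal."""
    major, minor = (int(part, 16) for part in dev.split(":"))
    return major, minor

# (CDB bytes, sector from the matching "I/O error" line), copied from the log.
entries = [
    ("00 28 73 69 f6 00 01 00 00", 678652280),
    ("00 28 73 69 fe 00 00 f8 00", 678652288),
    ("00 28 73 6a 06 00 00 f0 00", 678652296),
]

for tail, sector in entries:
    lba, blocks = decode_read10(tail)
    print(f"LBA {lba}, {blocks} blocks, LBA - sector = {lba - sector}")

print(decode_dev("08:25"))
```

Under those assumptions the three reads start at LBA 678652406, 678652414 and 678652422, with lengths of 256, 248 and 240 blocks, and each LBA is exactly 126 higher than the sector in the matching "I/O error" line. That constant offset would be consistent with the kernel reporting sectors relative to the start of the partition, though that reading is a guess. Major 8 with minor 37 would normally be the fifth partition on the third SCSI disc.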
Over the years a lot of those 8 discs have been replaced after going
bad, and the array always reported them and started rebuilding with the
hot spare.
A similar thing happened right before it started reporting bad blocks:
a drive went bad, the array removed it and started
rebuilding with the hot spare. The array log reported a couple of
remapped blocks during the rebuild, but as I understand it, remaps are
not too uncommon.
So now, after the rebuild, the virtual array disc is reporting bad
blocks, but no drives are being marked as bad by the array and the array
log shows no errors.
We're going to replace the older drives in the array over the next few days
and hope the problem goes away, but I'd really like to know if you have had
similar experiences with RAID 5 firmware failing to report bad discs.
thx a lot,
Nils