there has to be a better way, or we are stuffed!
| efffemm@f-m.fm 2007-05-24, 1:14 am |
| Company I work for has a SAN of about 50 TB.
It is configured as 4 logical disks. So when there is a
failure, that logical disk is out of action while
the RAID rebuilds itself. "Hot swapping" doesn't
help much when 1/4 of the system is paralysed for
hours afterwards. It seems about once a month that
one hard drive shits itself, and has to be replaced,
triggering the fiasco again.
Then system engineer says it would be good idea to
run complete diagnostics. That means taking all offline
for 172800 seconds = gazillions of dollars lost.
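
To put that outage figure in more familiar units, here is a minimal Python sketch of the conversion. The variable names are illustrative and do not come from the post; only the 172800-second figure does.

```python
from datetime import timedelta

# Outage length quoted in the post, in seconds.
outage_seconds = 172_800

# Convert to hours and whole days.
outage = timedelta(seconds=outage_seconds)
outage_hours = outage.total_seconds() / 3600

print(f"{outage_seconds} s = {outage_hours:.0f} hours = {outage.days} days")
```

So the proposed diagnostics window works out to 48 hours, or two full days, with everything offline.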
| |
| ajm163@yahoo.com 2007-05-24, 1:14 pm |
| On May 23, 6:55 pm, efff...@f-m.fm wrote:
> Company I work for has a SAN of about 50 TB.
> It is configured as 4 logical disks. So when there is a
> failure, that logical disk is out of action while
> the RAID rebuilds itself. "Hot swapping" doesn't
> help much when 1/4 of the system is paralysed for
> hours afterwards. It seems about once a month that
> one hard drive shits itself, and has to be replaced,
> triggering the fiasco again.
> Then system engineer says it would be good idea to
> run complete diagnostics. That means taking all offline
> for 172800 seconds = gazillions of dollars lost.
How many physical disks are we talking about? Even if there are a lot,
losing one a month seems like a really high failure rate to me. What
type of drives are they? Also, I don't know who your RAID vendor is,
but with RAID 5 you should be able to continue writing to the LUN even
with a failed disk. The LUN should be critical but still accessible,
at a slower speed (overhead of the rebuild process). I'd talk to my
RAID vendor about the massive failure rate. What you would also need
to do is make smaller LUNs; that way, when a drive fails, you don't take
as much of a hit: maybe 1/15 of your storage is offline instead of 1/4.
Just some suggestions.
AJ
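
The effect of LUN count on the impact of a single failure can be sketched roughly as follows. This assumes equal-sized LUNs and that one drive failure affects exactly one LUN; the function name `impact_of_one_failure` is hypothetical, and only the 50 TB total and the 1/4 versus 1/15 split come from the thread.

```python
from fractions import Fraction

TOTAL_TB = 50  # total SAN capacity mentioned in the original post


def impact_of_one_failure(lun_count, total_tb=TOTAL_TB):
    """Return (share, TB) of storage in the single LUN hit by a drive failure.

    Assumption: all LUNs are the same size and the failure touches one LUN.
    """
    share = Fraction(1, lun_count)
    return share, float(share * total_tb)


# Compare the current layout (4 LUNs) with the suggested one (15 LUNs).
for luns in (4, 15):
    share, tb = impact_of_one_failure(luns)
    print(f"{luns:>2} LUNs: {share} of storage affected, about {tb:.1f} TB")
```

With 4 LUNs a failure touches 12.5 TB; with 15 LUNs it touches roughly 3.3 TB.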
| |
| Nik Simpson 2007-05-24, 7:14 pm |
| ajm163@yahoo.com wrote:
> On May 23, 6:55 pm, efff...@f-m.fm wrote:
>
>
>
> How many physical disks are we talking about? Even if there are a lot,
> losing one a month seems like a really high failure rate to me. What
> type of drives are they? Also, I don't know who your RAID vendor is,
> but with RAID 5 you should be able to continue writing to the LUN even
> with a failed disk. The LUN should be critical but still accessible,
> at a slower speed (overhead of the rebuild process). I'd talk to my
> RAID vendor about the massive failure rate. What you would also need
> to do is make smaller LUNs; that way, when a drive fails, you don't take
> as much of a hit: maybe 1/15 of your storage is offline instead of 1/4.
>
> Just some suggestions.
>
> AJ
>
Have to agree with AJ:
1. Losing a drive shouldn't take the volumes offline; the whole point of
RAID is to prevent that.
2. Failure rates seem very high.
3. If your vendor can't figure it out, it's time to look at a new vendor.
--
Nik Simpson
| |
| Rob Turk 2007-05-24, 7:14 pm |
| "Nik Simpson" <n_simpson@bellsouth.net> wrote in message
news:1Dk5i.14316$KC4.4587@bignews6.bellsouth.net...
> ajm163@yahoo.com wrote:
>
> 2. Failure rates seem very high
>
It all depends on the number of drives involved. 50TB RAID capacity might be
about 60TB native. If that's made up of 146GB disks you'd be talking about
400+ drives. With 3% annual failure rate (which is not unusual) that would
be about 12 per year.
Rob
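
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the 60 TB native capacity, 146 GB drive size, and 3% annual failure rate are the figures from the post, while the decimal unit convention (1 TB = 1000 GB) and the variable names are assumptions.

```python
import math

native_tb = 60   # estimated raw capacity behind 50 TB of usable RAID space
drive_gb = 146   # drive size assumed in the post
afr = 0.03       # 3% annual failure rate, as quoted

# Assumption: decimal units, 1 TB = 1000 GB.
drive_count = math.ceil(native_tb * 1000 / drive_gb)

# Expected failures, treating each drive as failing independently at the AFR.
failures_per_year = drive_count * afr
failures_per_month = failures_per_year / 12

print(f"drives: {drive_count}")
print(f"expected failures per year: {failures_per_year:.1f}")
print(f"expected failures per month: {failures_per_month:.2f}")
```

Under these assumptions the array has 411 drives and about 12.3 expected failures a year, which is roughly one a month.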
| |
| Nik Simpson 2007-05-24, 7:14 pm |
| Rob Turk wrote:
> "Nik Simpson" <n_simpson@bellsouth.net> wrote in message
> news:1Dk5i.14316$KC4.4587@bignews6.bellsouth.net...
>
> It all depends on the number of drives involved. 50TB RAID capacity might be
> about 60TB native. If that's made up of 146GB disks you'd be talking about
> 400+ drives. With 3% annual failure rate (which is not unusual) that would
> be about 12 per year.
>
> Rob
>
>
True, I guess I was just assuming larger drives, but good point. The OP
needs to tell us a little more about the configuration.
--
Nik Simpson
| |
| GraemeDods@gmail.com 2007-05-25, 1:18 am |
| On May 24, 8:55 am, efff...@f-m.fm wrote:
> Company I work for has a SAN of about 50 TB.
> It is configured as 4 logical disks. So when there is a
> failure, that logical disk is out of action while
> the RAID rebuilds itself. "Hot swapping" doesn't
> help much when 1/4 of the system is paralysed for
> hours afterwards. It seems about once a month that
> one hard drive shits itself, and has to be replaced,
> triggering the fiasco again.
> Then system engineer says it would be good idea to
> run complete diagnostics. That means taking all offline
> for 172800 seconds = gazillions of dollars lost.
Either the storage system is configured very badly or it's a very poor
design. A single disk failure should not have such a significant
impact on performance. You should be able to replace the drive and let
the system rebuild it in the background and still allow user/
application access to the logical disks. For that quantity of storage
and for the cost of down-time (given your mention of lost revenue)
this storage should be a highly available enterprise level solution.
If that's what you've paid for, it certainly sounds like that's not
what you've got. Care to elaborate on what systems you're actually
running?
Graeme
| |
|
| efffemm@f-m.fm proclaimed:
> Company I work for has a SAN of about 50 TB.
> It is configured as 4 logical disks. So when there is a
> failure, that logical disk is out of action while
> the RAID rebuilds itself. "Hot swapping" doesn't
> help much when 1/4 of the system is paralysed for
> hours afterwards. It seems about once a month that
> one hard drive shits itself, and has to be replaced,
> triggering the fiasco again.
> Then system engineer says it would be good idea to
> run complete diagnostics. That means taking all offline
> for 172800 seconds = gazillions of dollars lost.
>
One of the better ways is not to create such huge logical disks.
That is common practice; wait till you see the fsck or chkdsk times on 40
terabytes.