Estimating RAID 1 MTBF?

    Estimating RAID 1 MTBF?  
ohaya




 
07-15-04 07:45 AM

Hi,

I was wondering if anyone could tell me how to calculate/estimate the
overall MTBF of a RAID 1 (mirrored) configuration?  I'm just looking for
a simple, "rule-of-thumb" type of calculation, assuming ideal
conditions.

I've been looking around for this, and I've seen a number of different
"takes" on this, and some of them seem to be quite at odds with each
other (and sometimes with themselves), so I thought that I'd post here
in the hopes that someone might be able to help.

Thanks,
Jim








    Re: Estimating RAID 1 MTBF?  
Ron Reaugh




 
07-15-04 07:45 AM


"ohaya" <ohaya@cox.net> wrote in message news:40F5F095.422CFEE5@cox.net...
> Hi,
>
> I was wondering if anyone could tell me how to calculate/estimate the
> overall MTBF of a RAID 1 (mirrored) configuration?  I'm just looking for
> a simple, "rule-of-thumb" type of calculation, assuming ideal
> conditions.
>
> I've been looking around for this, and I've seen a number of different
> "takes" on this, and some of them seem to be quite at odds with each
> other (and sometimes with themselves), so I thought that I'd post here
> in the hopes that someone might be able to help.

The basis will of course start with the MTBF of the HD over its rated
service life.  That figure is not published by HD mfgs.  The MTBF published
is a projection, educated guess plus empirical data from drives in early
life.  The problem after that is any kind of assumptions about the pure
randomness of a failure or whether a failure might be clustered over
time/usage destroys any feasible precise math attempt.

The next issue is what kind of failures to take into consideration.  Are SW,
OS, malice and external physical events like lightning, earthquakes, EMP,
PWS failure,  other HW failure or overheating to be excluded?  Excluding
such then my take is that if you replace a failing or potentially
failing(SMART) member of a RAID 1 set within 8 hours of failure/warning
during the drive's rated service life then it'll be a VERY cold day in hell
before you lose the RAID 1 set IF the HD model/batch does not have a
pathological failure mode that is intensely clustered.

An actual calculation would require information that is not available and
even the mfgs may not know precisely that information until towards the end
of a model's service life if then.

What takes on this have you found?  I'd like to see how anyone would shoot
at this issue.  The point is that with the exclusions noted then a RAID 1
set is VASTLY more reliable than a single HD.  A shot would be at least 5000
times more reliable.
5000 is a rough shot at the number of 8 hour periods in five years.

So suppose you took 10K users of a given model HD, ran them all for the rated
service life, and got 500 failures.  Then take that same 10K group but all
using 2-drive RAID 1: there would be only a 1-in-10 chance of a single
failure in the whole group, or 1 failure if the group were 100K.
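
The arithmetic behind those figures can be reproduced in a few lines. This is only a sketch of the rough estimate above, using the post's own numbers (five-year service life, 8-hour replacement window, 500 failures per 10K drives, and the rounded factor of 5000); the function and variable names are made up for illustration.

```python
# Rough reproduction of the estimate in the post; every figure is the post's own.
HOURS_PER_YEAR = 365 * 24

def replacement_windows(service_years: float, window_hours: float) -> float:
    """How many replacement windows of the given length fit in the service life."""
    return service_years * HOURS_PER_YEAR / window_hours

windows = replacement_windows(5, 8)   # 5475.0, which the post rounds down to ~5000

single_drive_failures = 500           # failures among 10K single-drive users
improvement_factor = 5000             # the post's rounded "at least" factor

# Expected number of RAID 1 set losses in the same 10K group.
expected_raid1_losses = single_drive_failures / improvement_factor

print(windows)                 # 5475.0
print(expected_raid1_losses)   # 0.1, i.e. "1 chance in ten"
```

Scaling the group to 100K multiplies the expected count by ten, which gives the "1 failure" figure.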

The accumulated threat of all the noted exclusions is VASTLY greater than
this so this issue is really a non-issue as RAID 1 is as good as it needs to
be.    Keep a good backup.










    Re: Estimating RAID 1 MTBF?  
ohaya




 
07-15-04 07:45 AM


> The basis will of course start with the MTBF of the HD over its rated
> service life.  That figure is not published by HD mfgs.  The MTBF published
> is a projection, educated guess plus empirical data from drives in early
> life.  The problem after that is any kind of assumptions about the pure
> randomness of a failure or whether a failure might be clustered over
> time/usage destroys any feasible precise math attempt.
>
> The next issue is what kind of failures to take into consideration.  Are SW,
> OS, malice and external physical events like lightning, earthquakes, EMP,
> PWS failure,  other HW failure or overheating to be excluded?  Excluding
> such then my take is that if you replace a failing or potentially
> failing(SMART) member of a RAID 1 set within 8 hours of failure/warning
> during the drive's rated service life then it'll be a VERY cold day in hell
> before you lose the RAID 1 set IF the HD model/batch does not have a
> pathological failure mode that is intensely clustered.
>
> An actual calculation would require information that is not available and
> even the mfgs may not know precisely that information until towards the end
> of a model's service life if then.
>
> What takes on this have you found?  I'd like to see how anyone would shoot
> at this issue.  The point is that with the exclusions noted then a RAID 1
> set is VASTLY more reliable than a single HD.  A shot would be at least 5000
> times more reliable.
> 5000 is a rough shot at the number of 8 hour periods in five years.
>
> So if you took 10K user of a given model HD and ran them all for the rated
> service life and got 500 failures.  Then take that same 10K group but all
> using 2 drive RAID 1, there would only be a 1 chance in ten of a single
> failure in the whole group or 1 failure if the group were 100K.
>
> The accumulated threat of all the noted exclusions is VASTLY greater than
> this so this issue is really a non-issue as RAID 1 is as good as it needs to
> be.    Keep a good backup.


Ron,

Thanks for your response.  I've looked at so many different sources the
last couple of days, my eyes are blurring and my head is aching.

Before I begin, I was really looking for just a kind of "ballpark" kind
of "rule of thumb" for now, with as many assumptions/caveats as needed
to make it simple, i.e., something like assume drives are in their
"life" (the flat part of the Weibull/bathtub curve), ignore software,
etc.

Think of it like this:  I just gave you two SCSI drives, I guarantee you
their MTBF is 1.2 Mhours, which won't vary over the time period that
they'll be in-service, no other hardware will ever fail (i.e., don't
worry about the processor board or raid controller), and it takes ~0
time to repair a failure.

Given something like that, and assuming I RAID1 these two drives, what
kind of MTBF would you expect over time?

- Is it the square of the individual drive MTBF?
See: http://www.phptr.com/articles/article.asp?p=28689
Or: http://tech-report.com/reviews/2001...id/index.x?pg=2 (this
one doesn't make sense if MTTR=0 ==> MTBF=infinity?)
Or: http://www.teradataforum.com/terada...0107_214543.htm (again,
don't know how MTTR=0 would work)

- Is it 150% the individual drive MTBF?
See:
http://www.zzyzx.com/products/whitepapers/pdf/MTBF_and_availability_primer.pdf

- Is it double the individual drive MTBF?  (I don't remember where I saw
this one.)


It's kind of funny, but when I first started looking, I thought that I'd
find something simple.  That was this weekend ...


Jim








    Re: Estimating RAID 1 MTBF?  
Ron Reaugh




 
07-15-04 07:45 AM


"ohaya" <ohaya@cox.net> wrote in message news:40F60365.73BF82D@cox.net...
> 
>
>
> Ron,
>
> Thanks for your response.  I've looked at so many different sources the
> last couple of days, my eyes are blurring and my head is aching.
>
> Before I begin, I was really looking for just a kind of "ballpark" kind
> of "rule of thumb" for now, with as many assumptions/caveats as needed
> to make it simple, i.e., something like assume drives are in their
> "life" (the flat part of the Weibull/bathtub curve), ignore software,
> etc.
>
> Think of it like this:  I just gave you two SCSI drives, I guarantee you
> their MTBF is 1.2 Mhours,

1,200,000 hours ~=  137 years
Now do you think that means that 1/2 fail in 137 years?
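
To put numbers on that question: if one assumes a constant failure rate (the exponential model, which is an assumption and is disputed just below), the MTBF is not the age at which half the drives have failed. A minimal sketch, with illustrative names:

```python
import math

HOURS_PER_YEAR = 24 * 365
mtbf_hours = 1_200_000

# The conversion quoted above.
mtbf_years = mtbf_hours / HOURS_PER_YEAR

def fraction_failed(t_hours: float, mtbf_hours: float) -> float:
    """Fraction of drives failed by time t, ASSUMING a constant failure rate."""
    return 1.0 - math.exp(-t_hours / mtbf_hours)

print(round(mtbf_years))                             # 137
print(fraction_failed(mtbf_hours, mtbf_hours))       # about 0.632, not 0.5
print(fraction_failed(HOURS_PER_YEAR, mtbf_hours))   # about 0.0073 per year
```

Under that model roughly 63% would have failed by 137 years, and the practical reading of the figure is a yearly failure fraction of about 0.7% during the rated service life.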

> which won't vary over the time period that
> they'll be in-service,

That's known to be false.

> no other hardware will ever fail (i.e., don't
> worry about the processor board or raid controller), and it takes ~0
> time to repair a failure.
>
> Given something like that, and assuming I RAID1 these two drives, what
> kind of MTBF would you expect over time?

Zero repair time?

> - Is it the square of the individual drive MTBF?
> See: http://www.phptr.com/articles/article.asp?p=28689

All obvious and based on known inaccurate assumptions

> Or: http://tech-report.com/reviews/2001...id/index.x?pg=2 (this
> one doesn't make sense if MTTR=0 ==> MTBF=infinity?)
> Or: http://www.teradataforum.com/terada...0107_214543.htm (again,
> don't know how MTTR=0 would work)
>
> - Is it 150% the individual drive MTBF?
> See:
>
> http://www.zzyzx.com/products/whitepapers/pdf/MTBF_and_availability_primer.pdf

"Industry standards have determined that redundant components increase the
MTBF by 50%."   No citation supplied.

"It should be noted that in the example above, if the downtime is reduced to
zero, availability changes to 1 or 100% regardless of the MTBF."

> - Is it double the individual drive MTBF?  (I don't remember where I saw
> this one.)
>
>
> It's kind of funny, but when I first started looking, I thought that I'd
> find something simple.  That was this weekend ...

As I said in my prior post, maintained RAID 1 failure (of the cases
included) can be ignored as it's swamped by other failures in the real
world.  It's a great academic exercise with little practical application
here.














    Re: Estimating RAID 1 MTBF?  
Bill Todd




 
07-15-04 07:45 AM


"ohaya" <ohaya@cox.net> wrote in message news:40F60365.73BF82D@cox.net...

...

> Before I begin, I was really looking for just a kind of "ballpark" kind
> of "rule of thumb" for now, with as many assumptions/caveats as needed
> to make it simple, i.e., something like assume drives are in their
> "life" (the flat part of the Weibull/bathtub curve), ignore software,
> etc.

The drives *have* to be in their nominal service life:  once you go beyond
that, you won't get any meaningful numbers (because they have no
significance to the product, and thus the manufacturer won't have performed
any real testing in that life range).

>
> Think of it like this:  I just gave you two SCSI drives, I guarantee you
> their MTBF is 1.2 Mhours, which won't vary over the time period that
> they'll be in-service, no other hardware will ever fail (i.e., don't
> worry about the processor board or raid controller), and it takes ~0
> time to repair a failure.
>
> Given something like that, and assuming I RAID1 these two drives, what
> kind of MTBF would you expect over time?

Infinite.

>
> - Is it the square of the individual drive MTBF?
> See: http://www.phptr.com/articles/article.asp?p=28689

No.  This example applies to something like an unmanned spacecraft, where no
repairs or replacements can be made.  Such a system has no meaningful MTBF
beyond its nominal service life (which will usually be much less than the
MTBF of even a single component, when that component is something as
reliable as a disk drive).

> Or: http://tech-report.com/reviews/2001...id/index.x?pg=2 (this
> one doesn't make sense if MTTR=0 ==> MTBF=infinity?)

That's how it works, and this is the applicable formula to use.  For
completeness, you'd need to factor in the fact that drives have to be
replaced not only when they fail but when they reach the end of their
nominal service life, unless you reserved an extra slot to use to build the
new drive's contents (effectively, temporarily creating a double mirror)
before taking the old drive out.

> Or: http://www.teradataforum.com/terada...0107_214543.htm (again,
> don't know how MTTR=0 would work)

The same way:  though the explanation for RAID-5 MTBF is not in the usual
form, it's equivalent.

>
> - Is it 150% the individual drive MTBF?
> See:
>
> http://www.zzyzx.com/products/white...bility_primer.pdf

No:  the comment you saw there is just some half-assed rule of thumb that
once again assumes no repairs are effected (and is still wrong even under
that assumption, though the later text that explains the value of repair is
qualitatively valid).
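
For reference, the no-repair case can be worked out directly if one assumes independent drives with a constant failure rate. That is a textbook simplification, not something taken from the primer or from this thread, and the function name is illustrative:

```python
def mttf_pair_no_repair(drive_mtbf_hours: float) -> float:
    """Expected time until BOTH drives of a mirror have failed, with no repair.

    ASSUMES independent drives with exponential (constant-rate) lifetimes.
    With two drives running, the first failure arrives after MTBF/2 on
    average; the survivor then lasts a further MTBF on average.
    """
    return drive_mtbf_hours / 2 + drive_mtbf_hours

single = 1_200_000
pair = mttf_pair_no_repair(single)

print(pair)            # 1800000.0 hours
print(pair / single)   # 1.5
```

Whatever one makes of that ratio, it is tiny next to the figures that come out once failed drives are replaced promptly.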

>
> - Is it double the individual drive MTBF?  (I don't remember where I saw
> this one.)

No.

The second paper that you cited has a decent explanation of why the formula
is what it is.  If you'd like a more detailed one, check out Transaction
Processing:  Concepts and Techniques by Jim Gray and Andreas Reuter.

- bill











    Re: Estimating RAID 1 MTBF?  
ohaya




 
07-15-04 07:45 AM



Bill Todd wrote:
>
> "ohaya" <ohaya@cox.net> wrote in message news:40F60365.73BF82D@cox.net...
>
> ...
> 
>
> The drives *have* to be in their nominal service life:  once you go beyond
> that, you won't get any meaningful numbers (because they have no
> significance to the product, and thus the manufacturer won't have performed
> any real testing in that life range).
> 
>
> Infinite.
> 
>
> No.  This example applies to something like an unmanned spacecraft, where no
> repairs or replacements can be made.  Such a system has no meaningful MTBF
> beyond its nominal service life (which will usually be much less than the
> MTBF of even a single component, when that component is something as
> reliable as a disk drive).
> 
>
> That's how it works, and this is the applicable formula to use.  For
> completeness, you'd need to factor in the fact that drives have to be
> replaced not only when they fail but when they reach the end of their
> nominal service life, unless you reserved an extra slot to use to build the
> new drive's contents (effectively, temporarily creating a double mirror)
> before taking the old drive out.
> 
>
> The same way:  though the explanation for RAID-5 MTBF is not in the usual
> form, it's equivalent.
> 
> http://www.zzyzx.com/products/whitepapers/pdf/MTBF_and_availability_primer.pdf
>
> No:  the comment you saw there is just some half-assed rule of thumb that
> once again assumes no repairs are effected (and is still wrong even under
> that assumption, though the later text that explains the value of repair is
> qualitatively valid).
> 
>
> No.
>
> The second paper that you cited has a decent explanation of why the formula
> is what it is.  If you'd like a more detailed one, check out Transaction
> Processing:  Concepts and Techniques by Jim Gray and Andreas Reuter.
>
> - bill


Bill,

Thanks.  This kind of goes along with some other info I've just been
looking at (something like "Product of Reliabilities" on a website).

If the above calculation is in fact a good estimate, and just so that
I'm clear, if:

- I had a RAID1 setup with two SCSI drives that really have an MTBF of
1.2Mhours, and
- The drives are within their "normal" lifetime (i.e., not in infant
mortality or end-of-life), and
- The processor board/hardware was such that it supported a hot swap
such that if one of the drives failed, it could be replaced without
having to halt the system, and
- We estimated (for planning purposes) that, let's say, worst-case, it
took someone 4 hours to detect the failure, get another identical
drive, and replace it (so MTTR ~4 hours).

Then a reasonable ballpark estimate for the "theoretical" MTTF (which is
~MTBF) would be:

(1.2Mhours)(1.2Mhours)
---------------------- = MTTF(RAID1)
     2 x 4 hours


Is that correct?

Wow!!!

Somehow, this seems "counter-intuitive" (sorry) ....

Jim
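
Plugging the numbers from the post above into that formula gives the following. This is a sketch of the idealized estimate only (independent drives, constant failure rate, nothing else ever fails); the function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365

def raid1_mttf_hours(drive_mtbf_hours: float, mttr_hours: float) -> float:
    """Idealized mirrored-pair estimate: MTBF^2 / (2 * MTTR)."""
    return drive_mtbf_hours ** 2 / (2 * mttr_hours)

mttf = raid1_mttf_hours(1_200_000, 4)

print(f"{mttf:.3g} hours")                                  # 1.8e+11 hours
print(f"{mttf / HOURS_PER_YEAR / 1e6:.1f} million years")   # 20.5 million years
```

The result is inversely proportional to MTTR, so halving the repair time doubles the estimate, and an MTTR of exactly 0 makes the division blow up, which is the "infinite" answer given earlier.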








    Re: Estimating RAID 1 MTBF?  
ohaya




 
07-15-04 07:45 AM

 
>
> As I said in my prior post.  Maintained RAID 1 failure(of the cases
> included) can be ignored as it's swamped by other failures in the real
> world.  It's a great academic exercise with little practical application
> here.


Ron,

Thanks again.  I'm starting to understand your 2nd sentence above.

If I'm understanding what you're saying, with a RAID1 setup, with 2
drives with reasonable (i.e., 1.2Mhours) MTBF, from a design standpoint,
you wouldn't be worried about failures of the drives themselves, because
there are other failures/components (e.g., the processor board, etc.)
that would have an MTBF much lower than the raid'ed drives themselves.

Did I get that right?

BTW, re. the "0" MTTR, see my post back to Bill Todd.  I had given 4
hours as an example in that post, but after posting and thinking about
it, given the scenario that I posed, it really seems like the MTTR would
be more like "0" than like 4 hours, since with my scenario, the "system"
never really fails (since the drives are hot-swappable).

Comments?

Jim





[ Post a follow-up to this message ]



    Re: Estimating RAID 1 MTBF?  
Bill Todd




 
07-15-04 07:45 AM


"ohaya" <ohaya@cox.net> wrote in message news:40F6162F.3C6681C6@cox.net...

...

> - I had a RAID1 setup with two SCSI drives that really have an MTBF of
> 1.2Mhours, and
> - The drives are within their "normal" lifetime (i.e., not in infant
> mortality or end-of-life), and
> - The processor board/hardware was such that it supported a hot swap
> such that if one of the drives failed, it could be replaced without
> having to halt the system, and
> - We estimated (for planning purposes) that, let's say, worst-case, it
> took someone 4 hours to detect the failure, get another identical
> drive, and replace it (so MTTR ~4 hours).
>
> Then a reasonable ballpark estimate for the "theoretical" MTTF (which is
> ~MTBF) would be:
>
> (1.2Mhours)(1.2Mhours)
> ---------------------- = MTTF(RAID1)
>     2 x 4 hours
>
>
> Is that correct?
>
> Wow!!!
>
> Somehow, this seems "counter-intuitive" (sorry) ....

Hey, *single* disks are pretty damn reliable in the kind of ideal service
conditions you postulate:  mirrored disks are just (reliable) squared.

A 2,000,000-year RAID-1-pair MTBF sounds great, until you recognize that if
you have 2,000,000 installations, about one of them will fail each year.  If
each site has 100 disk pairs rather than just one, then someone will lose
data every 3+ days (or you'll need only 20,000 sites for about one to lose
data every year).
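
The fleet arithmetic in that paragraph is easy to check. The sketch below takes the 2,000,000-year pair MTBF quoted in the post as given and assumes failures are independent and spread evenly over time; the names are illustrative.

```python
DAYS_PER_YEAR = 365

def expected_failures_per_year(pairs: int, pair_mtbf_years: float) -> float:
    """Expected pair failures per year across a fleet, ASSUMING a constant rate."""
    return pairs / pair_mtbf_years

pair_mtbf_years = 2_000_000   # figure quoted in the post, taken as given

# 2,000,000 installations with one pair each.
one_pair_sites = expected_failures_per_year(2_000_000, pair_mtbf_years)

# The same installations with 100 pairs each.
big_sites = expected_failures_per_year(2_000_000 * 100, pair_mtbf_years)

# 20,000 sites with 100 pairs each.
small_fleet = expected_failures_per_year(20_000 * 100, pair_mtbf_years)

print(one_pair_sites)              # 1.0 per year
print(DAYS_PER_YEAR / big_sites)   # 3.65 days between failures
print(small_fleet)                 # 1.0 per year
```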

That's still really good, but not so far beyond something you'd start
worrying about to be utterly ridiculous - at least if you're a manufacturer
(individual customers still have almost no chance of seeing a failure, but
even a single one that does is still very bad publicity).  Start including
RAID-5 configurations, and system MTBF drops by roughly the square of the
number of drives in a set, which starts getting significant before long
(again, especially from the manufacturer's viewpoint, even if very few
individual customers actually experience data loss:  some of the new
virtualization architectures have RAID-5-like failure characteristics - even
if they're not using parity but mirroring to protect data, they're
distributing it around the disk set in a manner that can cause data loss if
*any* two disks fail - which users should at least be aware of).
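
The "roughly the square of the number of drives" remark matches the usual approximation for a set that is lost when any two of its N drives fail: MTBF^2 / (N * (N - 1) * MTTR). That formula is not spelled out in this thread, so treat the sketch below as an assumption drawn from the standard RAID reliability model (independent drives, constant failure rate); the names and the 1.2 Mhour / 4 hour inputs are carried over from the earlier example.

```python
def raid_set_mttf_hours(n_drives: int, drive_mtbf_hours: float,
                        mttr_hours: float) -> float:
    """Idealized MTTF of a set that is lost when ANY two of its drives fail.

    ASSUMES independent drives with a constant failure rate. For n_drives == 2
    this reduces to the mirrored-pair formula MTBF^2 / (2 * MTTR).
    """
    return drive_mtbf_hours ** 2 / (n_drives * (n_drives - 1) * mttr_hours)

for n in (2, 5, 10, 20):
    print(n, raid_set_mttf_hours(n, 1_200_000, 4))
```

Going from 2 drives to 10 divides the estimate by 45 (90 / 2), which is where wide sets start to matter for a manufacturer with many installations.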

- bill











    Re: Estimating RAID 1 MTBF?  
Bill Todd




 
07-15-04 07:45 AM


"ohaya" <ohaya@cox.net> wrote in message news:40F6199E.2DEAE730@cox.net...

...

> BTW, re. the "0" MTTR, see my post back to Bill Todd.  I had given 4
> hours as an example in that post, but after posting and thinking about
> it, given the scenario that I posed, it really seems like the MTTR would
> be more like "0" than like 4 hours, since with my scenario, the "system"
> never really fails (since the drives are hot-swappable).
>
> Comments?

If you've learned how to repopulate on the order of 100 GB of failed drive
in zero time, especially while not seriously degrading on-going processing
(so don't just assert that you can use anything like the full bandwidth of
its partner to restore it), I suspect that there are many people who would
be very interested in talking with you.
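
A back-of-the-envelope rebuild time shows why MTTR cannot be zero even with hot swap. The 100 GB size is from the post; the 10 MB/s effective rebuild rate is a purely hypothetical figure standing in for "less than the partner's full bandwidth".

```python
def rebuild_hours(capacity_gb: float, rebuild_mb_per_s: float) -> float:
    """Time to re-copy a whole drive at a steady effective rate (1 GB = 1000 MB)."""
    seconds = capacity_gb * 1000 / rebuild_mb_per_s
    return seconds / 3600

# 100 GB drive, HYPOTHETICAL 10 MB/s left over for the rebuild.
print(round(rebuild_hours(100, 10), 2))   # 2.78
```

That is the copy alone; the time to notice the failure and fit a replacement drive comes on top of it.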

- bill











    Re: Estimating RAID 1 MTBF?  
ohaya




 
07-15-04 12:45 PM

 
>
> Hey, *single* disks are pretty damn reliable in the kind of ideal service
> conditions you postulate:  mirrored disks are just (reliable) squared.
>
> A 2,000,000-year RAID-1-pair MTBF sounds great, until you recognize that if
> you have 2,000,000 installations, about one of them will fail each year.  If
> each site has 100 disk pairs rather than just one, then someone will lose
> data every 3+ days (or you'll need only 20,000 sites for about one to lose
> data every year).


Bill,

Thanks for the perspective.

But, so that I'm clear, if the individual drives really have 1.2Mhours
MTBF (and I think the Atlas 15K II spec sheet actually claims
1.4Mhours), then the "squared" MTBF would indicate that RAID 1 pair
would be something like 1+ TRILLION hours MTBF, not 1+ MILLION hours.
Have I misinterpreted something?

Jim












 





All times are GMT.