Is the RAID-5 write penalty really necessary?
| Adam Megacz 2004-11-13, 2:45 am |
|
No really, I checked the FAQ on this.
I understand the reason for the RAID-5 write penalty. What I don't
understand is this: why not just set the block size to (N-1)*blocksize
(where N is the number of drives in the array and blocksize is the
hardware-level block size).
Since you can only write full blocks, there's never any need to read
back the underlying parity -- because you know that if you're writing
to a given slice on any particular drive, you're also certain that
you'll be overwriting the corresponding slice on all the other drives.
So, I guess my question is: why not just make the block size bigger
and never read before writing?
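To make the proposal concrete, here is a minimal sketch in Python. It assumes plain XOR parity and toy 4-byte blocks; the function names are hypothetical and not taken from any real RAID implementation. When all N-1 data blocks of a stripe are in hand, parity can be computed from the new data alone, so nothing has to be read back first.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def full_stripe_write(data_blocks):
    """Return the N blocks to write (N-1 data + 1 parity). No reads needed."""
    parity = xor_blocks(data_blocks)
    return list(data_blocks) + [parity]

# Toy 4-drive array: 3 data blocks of 4 bytes each.
stripe = full_stripe_write([
    b"\x01\x02\x03\x04",
    b"\x10\x20\x30\x40",
    b"\xff\x00\xff\x00",
])

# Any single lost block is the XOR of the surviving ones.
recovered = xor_blocks(stripe[1:])
```

Here `recovered` reconstructs the first data block from the other two data blocks plus parity, which is the property the array relies on after a drive failure.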
I think I saw a paper on something similar dubbed "RAID 3.5", but
haven't seen an implementation of it yet.
- a
--
I wrote my own mail server and it still has a few bugs.
If you send me a message and it bounces, please forward the
bounce message to megacz@gmail.com. Thanks!
| |
| Malcolm Weir 2004-11-14, 2:45 am |
| On Fri, 12 Nov 2004 20:41:05 -0800, Adam Megacz <adam@megacz.com>
wrote:
>No really, I checked the FAQ on this.
>
>I understand the reason for the RAID-5 write penalty. What I don't
>understand is this: why not just set the block size to (N-1)*blocksize
>(where N is the number of drives in the array and blocksize is the
>hardware-level block size).
>
>Since you can only write full blocks, there's never any need to read
>back the underlying parity -- because you know that if you're writing
>to a given slice on any particular drive, you're also certain that
>you'll be overwriting the corresponding slice on all the other drives.
>
>So, I guess my question is: why not just make the block size bigger
>and never read before writing?
Why mess with the block size? You can simply hold off updating a
block to see if the adjacent block is coming along from the host, and
if it is, repeat until the entire stripe needs to be written. Then
generate parity and splat away. It's usually referred to as "full
stripe optimization" or something similar.
One marginally awkward consequence is that it tends to work best if
you have 5, 9 or 17 disks.
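A rough sketch of the hold-and-coalesce idea, in Python. The class name, the block-per-slot layout, and the in-memory dictionary are all assumptions made for illustration; a real controller would also need a timeout and a fallback for stripes that never fill.

```python
class StripeCoalescer:
    """Hold incoming block writes until a whole stripe can be written at once."""

    def __init__(self, n_drives):
        self.data_per_stripe = n_drives - 1   # one block per stripe is parity
        self.pending = {}                     # stripe number -> {slot: data}

    def write(self, lba, data):
        """Buffer one block. Return (stripe, blocks) once the stripe is full."""
        stripe, slot = divmod(lba, self.data_per_stripe)
        slots = self.pending.setdefault(stripe, {})
        slots[slot] = data
        if len(slots) == self.data_per_stripe:
            del self.pending[stripe]
            return stripe, [slots[i] for i in range(self.data_per_stripe)]
        return None                           # still waiting for neighbours

# 5 drives -> 4 data blocks per stripe.
c = StripeCoalescer(5)
results = [c.write(lba, b"blk%d" % lba) for lba in range(4)]
```

The first three calls return `None`; the fourth completes stripe 0 and hands back all four blocks, ready for parity generation. The 5, 9 or 17 disk remark presumably follows from this: those arrays hold 4, 8 or 16 data blocks per stripe, so aligned power-of-two host writes fill whole stripes.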
>I think I saw a paper on something similar dubbed "RAID 3.5", but
>haven't seen an implementation of it yet.
That would be a silly name. RAID 3 stripes below the block level,
RAID 4 & 5 above it.
> - a
Malc.
| |
| Adam Megacz 2004-11-15, 7:45 am |
|
Malcolm Weir <malc@gelt.org> writes:
> Why mess with the block size? You can simply hold off updating a
> block to see if the adjacent block is coming along from the host, and
> if it is, repeat until the entire stripe needs to be written. Then
> generate parity and splat away. It's usually referred to as "full
> stripe optimization" or something similar.
Ah, good point!
Hrm, in light of this, it seems that the "raid 5 write penalty" is
just an artifact of poor implementations!
Do you happen to know offhand if the Linux md driver does this (ie
maintain a kernel-space buffer and treat userspace as the "host")?
- a
--
I wrote my own mail server and it still has a few bugs.
If you send me a message and it bounces, please forward the
bounce message to megacz@gmail.com. Thanks!
| |
| Scott Howard 2004-11-15, 7:45 am |
| Adam Megacz <adam@megacz.com> wrote:
> Hrm, in light of this, it seems that the "raid 5 write penalty" is
> just an artifact of poor implementations!
So every vendor out there has a poor implementation? Wow, you're going
to make yourself rich with this discovery - time to go talk to EMC, HDS,
etc., etc.
Hint: what happens when someone writes 512 bytes to a disk
and then doesn't write anything else around it?
How do you de-stage it?
Hardware RAID-5 arrays get around this by doing exactly what you're
suggesting - holding things in cache in an attempt to turn them into larger
writes. Software RAID-5 (or non battery-backed hardware RAID-5) can't do
this without risking data loss in the event of an outage.
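For the lone 512-byte write case, the array has to fall back to read-modify-write. A minimal Python sketch, assuming XOR parity (the function names are made up for illustration):

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(old_data, old_parity, new_data):
    """Read-modify-write: the caller has already READ old_data and old_parity.

    Returns the two blocks that must then be WRITTEN: new data, new parity.
    That is two reads plus two writes for one small logical write.
    """
    new_parity = xor(xor(old_parity, old_data), new_data)
    return new_data, new_parity

# Toy stripe on a 4-drive array: 3 data blocks + parity.
d = [b"\x0a\x0b", b"\x1c\x1d", b"\x2e\x2f"]
parity = xor(xor(d[0], d[1]), d[2])

# Overwrite only the middle block, without touching the other data drives.
d[1], parity = small_write(d[1], parity, b"\x55\xaa")
```

After the update, `parity` still equals the XOR of all three data blocks, even though only two drives were involved.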
Scott.
| |
| Thor Lancelot Simon 2004-11-15, 7:45 am |
| In article <x1mzxj1loh.fsf@nowhere.com>, Adam Megacz <adam@megacz.com> wrote:
>
>Malcolm Weir <malc@gelt.org> writes:
>
>Ah, good point!
>
>Hrm, in light of this, it seems that the "raid 5 write penalty" is
>just an artifact of poor implementations!
No. If you have more than one write stream at a time, consecutive writes
may not be into the same stripe. To some extent, this problem too can be
defeated by caching; but, obviously, such caching needs to take place in
nonvolatile RAM or you have to leave _all_ the writes pending on the bus
until you commit whichever full stripes you're going to commit.
It's easy to "solve" this problem for a single write stream. Unfortunately,
single-stream performance is not actually indicative of performance for
most real applications in the real world.
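A toy simulation of the point, in Python. Everything here is an assumption for illustration: streams are perfectly sequential, they interleave round-robin, and the cache holds a fixed number of partial stripes and evicts the oldest one when it overflows.

```python
from collections import OrderedDict

def simulate(n_streams, blocks_per_stream, data_per_stripe, cache_stripes):
    """Return (full_stripe_flushes, partial_stripe_flushes)."""
    pending = OrderedDict()          # stripe number -> set of filled slots
    full = partial = 0
    region = 1000 * data_per_stripe  # each stream writes its own aligned region
    for i in range(blocks_per_stream):
        for s in range(n_streams):   # round-robin interleave of the streams
            stripe, slot = divmod(s * region + i, data_per_stripe)
            pending.setdefault(stripe, set()).add(slot)
            if len(pending[stripe]) == data_per_stripe:
                del pending[stripe]  # whole stripe present: no reads needed
                full += 1
            elif len(pending) > cache_stripes:
                pending.popitem(last=False)  # evict oldest: needs a parity update
                partial += 1
    return full, partial + len(pending)      # leftovers are partial flushes too

one_stream = simulate(1, 16, 4, cache_stripes=1)
four_streams_small_cache = simulate(4, 16, 4, cache_stripes=1)
four_streams_big_cache = simulate(4, 16, 4, cache_stripes=4)
```

Under these assumptions a single stream never needs a partial flush, four streams sharing a one-stripe cache never complete a stripe at all, and four streams with room for four stripes are back to full-stripe writes only. The last case is the one that needs the writes held somewhere safe.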
--
Thor Lancelot Simon tls@rek.tjls.com
But as he knew no bad language, he had called him all the names of common
objects that he could think of, and had screamed: "You lamp! You towel! You
plate!" and so on. --Sigmund Freud
| |
| Robert Wessel 2004-11-16, 2:45 am |
| Adam Megacz <adam@megacz.com> wrote in message news:<x1k6sqgy3i.fsf@nowhere.com>...
> No really, I checked the FAQ on this.
>
> I understand the reason for the RAID-5 write penalty. What I don't
> understand is this: why not just set the block size to (N-1)*blocksize
> (where N is the number of drives in the array and blocksize is the
> hardware-level block size).
>
> Since you can only write full blocks, there's never any need to read
> back the underlying parity -- because you know that if you're writing
> to a given slice on any particular drive, you're also certain that
> you'll be overwriting the corresponding slice on all the other drives.
>
> So, I guess my question is: why not just make the block size bigger
> and never read before writing?
>
> I think I saw a paper on something similar dubbed "RAID 3.5", but
> haven't seen an implementation of it yet.
There's the minor problem that for all but the smallest arrays your
solution is worse than the normal RAID-5 write penalty, even
considering only writes.
Consider a simple six disk array. Your solution requires that I issue
six write operations, rather than the two reads and two writes. While
this might be a small win for latency, it certainly is much worse for
throughput. Remember that with more drives to involve the average
time to complete the operations will increase just because you're likely
to have a larger "longest seek" if you have to seek on six rather than
two drives, and in the RRWW/RAID-5 case, the writes don't need seeks,
and tend to be reasonably quick. You also lose the ability to run
multiple writes in parallel.
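The arithmetic for the six-disk example can be written out as a small Python sketch. It counts operations only and ignores seek and rotation times; the function names are hypothetical, and the parallelism figure is a best case that assumes concurrent writes land on disjoint drives.

```python
def write_cost(n_drives, full_stripe):
    """Return (reads, writes, drives_busy) for one logical write."""
    if full_stripe:
        # N-1 data blocks plus parity, every drive involved, nothing read.
        return 0, n_drives, n_drives
    # Read-modify-write: old data + old parity in, new data + new parity out.
    return 2, 2, 2

def max_parallel_writes(n_drives, full_stripe):
    """Best-case number of logical writes that can be in flight at once."""
    _, _, busy = write_cost(n_drives, full_stripe)
    return n_drives // busy

six_full = write_cost(6, full_stripe=True)
six_rmw = write_cost(6, full_stripe=False)
```

For six drives the full-stripe scheme issues six I/Os and ties up all six drives, so only one logical write runs at a time; read-modify-write issues four I/Os on two drives, leaving room for up to three such writes at once.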
Of course real workloads are not just write only, and if you consider
reads, your solution is massively worse. With your one block per
stripe configuration, you've managed to involve every (OK, all but
one) drive in every read operation, thus reducing the random I/O
throughput to something similar to that of a single drive. With the
above mentioned six drive array, I can be doing six separate random
reads at once, so long as they hit different drives.
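A short Python sketch of which drives a single-block read touches under each layout. The rotating parity placement (parity on drive `stripe % n_drives`) is an assumption chosen for illustration; real arrays use a variety of layouts.

```python
def drives_for_read(block, n_drives, one_block_per_stripe):
    """Return the set of drives touched when reading one logical block."""
    data_per_stripe = n_drives - 1
    stripe = block if one_block_per_stripe else block // data_per_stripe
    parity_drive = stripe % n_drives          # assumed rotating parity
    data_drives = [d for d in range(n_drives) if d != parity_drive]
    if one_block_per_stripe:
        return set(data_drives)               # block is spread over N-1 drives
    return {data_drives[block % data_per_stripe]}   # block lives on one drive

# Six-drive array, blocks 0..5.
proposed = [drives_for_read(b, 6, True) for b in range(6)]
conventional = [drives_for_read(b, 6, False) for b in range(6)]
```

With the proposed layout every read occupies five of the six drives, so reads queue behind one another. With the conventional layout each of these six reads lands on a different drive, so all six can proceed at once.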
Now what you've described is quite similar to RAID-2 or RAID-3, and
RAID-3 is quite commonly supported. RAID-2/3 usually has a vastly
larger "block size" than you are proposing, however. But while RAID-3
(or the practically nonexistent RAID-2) provides excellent data rates
for largely sequential I/Os, random I/O performance is at basically
single-disk levels. RAID-3 is commonly used in HPC systems that have
to stream very large datasets at high rates, but it doesn't have much
appeal for most database/commercial server systems or desktops.
If the RAID-5 write penalty is an issue for you, the usual solution is
to go to a RAID-0+1 approach.