Data Storage - Raid level write verification


Author Raid level write verification
teckytim

2005-03-23, 2:45 am

Is it true that RAID levels like 1, 1+0, 3, 4 verify on the fly that
both the data write and the ECC write match, while levels like RAID 5
don't (with few exceptions on the very high end)? Or does this depend
on make & model?

TIA

Bill Todd

2005-03-23, 2:45 am

teckytim wrote:
> Is it true that Raid levels like 1, 1+0, 3, 4 verify on the fly that
> both the data write and the ecc write match while levels like raid 5
> don't (with few exceptions on the very high end)?


No. No RAID level *requires* any kind of read-after-write verification,
though any RAID *implementation* could offer it as an additional feature.

However, I think some RAID-3 implementations verify on the fly that the
parity information matches the stripe being *read* (since that has no
impact on performance, save for the CPU cycles required by the
comparison and the bus cycles consumed by reading the parity). Though I
don't recall that the accepted RAID-3 definition *requires* this.
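
For illustration, the parity-on-read check described above can be sketched in a few lines. This is only a toy model, assuming simple byte-wise XOR parity over equally sized chunks; the names xor_parity and read_stripe_checked are made up for this example and do not come from any RAID product.

```python
from functools import reduce

def xor_parity(chunks):
    """Byte-wise XOR of equally sized chunks (one chunk per data drive)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

def read_stripe_checked(data_chunks, parity_chunk):
    """Return the stripe's data, raising if the stored parity disagrees."""
    if xor_parity(data_chunks) != parity_chunk:
        raise IOError("parity mismatch on read")
    return b"".join(data_chunks)

# Three data chunks and the parity chunk that would be stored alongside them.
chunks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_parity(chunks)
stripe = read_stripe_checked(chunks, parity)
```

Note that a mismatch only says the stripe is inconsistent; by itself it does not say which chunk (data or parity) is the wrong one.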

- bill
teckytim

2005-03-24, 2:48 am


Bill Todd wrote:
> teckytim wrote:
> > Is it true that Raid levels like 1, 1+0, 3, 4 verify on the fly that
> > both the data write and the ecc write match while levels like raid 5
> > don't (with few exceptions on the very high end)?
>
> No. No RAID level *requires* any kind of read-after-write verification,
> though any RAID *implementation* could offer it as an additional feature.
>
> However, I think some RAID-3 implementations verify on the fly that the
> parity information matches the stripe being *read* (since that has no
> impact on performance, save for the CPU cycles required by the
> comparison and the bus cycles consumed by reading the parity). Though I
> don't recall that the accepted RAID-3 definition *requires* this.
>
> - bill



Thanks. I was afraid that was the answer, as so many RAID details are
nonstandard or rather manufacturer-specific.

So if I *require* a RAID implementation that does this, where do I have
to look? Are there PCI card products (SATA & SCSI) or RAID boxes, or is
this only available in high-end non-DAS RAID like EMC, etc.?

Are background media scans sufficient protection against failing/flaky
media so the verify feature discussed above is not necessary?

Thanks again.

Bill Todd

2005-03-24, 2:48 am

teckytim wrote:

....

> So if I *require* a raid implementation that does this, where do I have
> to look? Are there PCI card products (SATA & SCSI), raid boxes, or is
> this only available in the high end non-das raid like emc, etc.


Someone else here might know, but I don't.

>
> Are background media scans sufficient protection against failing/flaky
> media so the verify feature discussed above is not necessary?
>
> Thanks again.
>


I don't think the two are all that closely related. All
read-after-write does is verify that the data written was what you
intended to write: while this does guard against very low-probability
errors like silently-failing null writes or 'wild' writes (though with
the latter you have to worry about what got clobbered as well), it isn't
likely to be any kind of substitute for background 'scrubbing' to catch
deteriorating sectors (which I think are orders of magnitude more likely
than unheralded write failures, but that's just my impression).
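
To make the distinction concrete, here is a toy model of the two mechanisms. It is only a sketch: ToyDisk, write_verified and scrub are invented names, the "drive" is a Python list, and faults are injected by hand rather than modelled on any real failure statistics.

```python
class ToyDisk:
    """In-memory stand-in for a drive; faults are injected by the caller."""

    def __init__(self, sectors):
        self.media = [bytes(4)] * sectors
        self.drop_next_write = False    # simulates a silent null write

    def write(self, lba, data):
        if self.drop_next_write:
            self.drop_next_write = False
            return                      # reports success, stores nothing
        self.media[lba] = data

    def read(self, lba):
        data = self.media[lba]
        if data is None:                # None marks a decayed, unreadable sector
            raise IOError(f"unrecoverable read error at LBA {lba}")
        return data

def write_verified(disk, lba, data):
    """Read-after-write: trust the write only once it reads back intact."""
    disk.write(lba, data)
    if disk.read(lba) != data:
        raise IOError(f"read-after-write mismatch at LBA {lba}")

def scrub(disk):
    """Background scan: read every sector and report the ones that fail."""
    bad = []
    for lba in range(len(disk.media)):
        try:
            disk.read(lba)
        except IOError:
            bad.append(lba)
    return bad

disk = ToyDisk(8)
write_verified(disk, 3, b"abcd")        # normal write, passes
disk.media[5] = None                    # sector 5 decays some time later
bad_sectors = scrub(disk)               # the scan finds it; verify never would
```

In this model a dropped write is caught only by write_verified (the stale sector still reads cleanly during a scrub), while a sector that decays after being written is caught only by scrub.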

Sun claims that its new ZFS file system for Solaris has supplementary
checksum information that guards data from main-memory to disk and back
again - you might find a look there interesting. But that's not
specifically RAID-related.

- bill
teckytim

2005-03-24, 2:48 am


Bill Todd wrote:
> teckytim wrote:
>
> ...
>
> > So if I *require* a raid implementation that does this, where do I have
> > to look? Are there PCI card products (SATA & SCSI), raid boxes, or is
> > this only available in the high end non-das raid like emc, etc.
>
> Someone else here might know, but I don't.
>
> > Are background media scans sufficient protection against failing/flaky
> > media so the verify feature discussed above is not necessary?
>
> I don't think the two are all that closely related. All
> read-after-write does is verify that the data written was what you
> intended to write: while this does guard against very low-probability
> errors like silently-failing null writes or 'wild' writes (though with
> the latter you have to worry about what got clobbered as well), it isn't
> likely to be any kind of substitute for background 'scrubbing' to catch
> deteriorating sectors (which I think are orders of magnitude more likely
> than unheralded write failures, but that's just my impression).


I didn't think they were related, at least not outside of the most
general sense. Read-after-write just seems to me a reasonable extra
failsafe where data integrity/security trumps all else. That
perception could be wrong though.

I have occasionally read about transient write errors in RAID 5
implementations, which the writers/posters believe make RAID 5 less
reliable than other levels. I have also read about some interesting
data protection features in EMC and, I think, NetApp products which I
believe combat these fears.

It seems to me the likelihood of a flaky drive causing problems
increases with array size (number of drives), especially in larger ATA
arrays. A weakening sector which causes a write to fail, but is not
quite weak enough to be marked bad, would cause confusion on a defect
scan. I have also seen a drive or two which was failing by corrupting
data, but still spinning and showing little or nothing in the way of
bad sectors. It's rare, but I've seen it and wouldn't want one such
drive to take a crap all over an array's stripes. Read-after-write in
addition to background defect scanning makes sense to me. I usually
see only the latter. That makes me wonder.

> Sun claims that its new ZFS file system for Solaris has supplementary
> checksum information that guards data from main-memory to disk and back
> again - you might find a look there interesting. But that's not
> specifically RAID-related.
>
> - bill


Very interesting. Will look. Thanks again for the response.

Dave Sheehy

2005-03-25, 5:45 pm

teckytim (technotim@hotmail.com) wrote:
: I have occasionally read about transient write errors in raid 5
: implementations which writers/poster believe make raid 5 less reliable
: than other levels. Also I have read about some interesting data
: protection features in EMC & I think Netapp which I believe combat
: these fears.

A block protection scheme (aka DIF) has recently been standardized by T10.
That protection scheme has been implemented by a few of the silicon
suppliers (including my employer). Look for that scheme to become a
pretty common feature in the next couple of years. It has been a
proprietary feature of a few storage vendors for a number of years already.
The recently announced SGI 4G FC array (OEMed from Engenio) is an example
that has this new standardized feature built into it.

Dave

Bill Todd

2005-03-25, 5:45 pm

Dave Sheehy wrote:
> teckytim (technotim@hotmail.com) wrote:
> : I have occasionally read about transient write errors in raid 5
> : implementations which writers/poster believe make raid 5 less reliable
> : than other levels. Also I have read about some interesting data
> : protection features in EMC & I think Netapp which I believe combat
> : these fears.
>
> A block protection scheme (aka DIF) has recently been standardized by T10.
> That protection scheme has been implemented by a few of the silicon
> suppliers (including my employer). Look for that scheme to become a
> pretty common feature in the next couple of years. It has been a
> proprietary feature of a few storage vendors for a number of years already.
> The recently announced SGI 4G FC array (OEMed from Engenio) is an example
> that has this new standardized feature built into it.


If it is indeed now a standard I suspect that given sufficient effort I
could learn its details. But if you found it convenient to post them
(at least to the degree that one could understand the technology
involved - e.g., is it simply an additional checksum, does it live with
the data or separate from it, etc.), it would save me and other curious
individuals some time.

Thanks,

- bill
Dave Sheehy

2005-03-25, 5:45 pm

Bill Todd (billtodd@metrocast.net) wrote:
: Dave Sheehy wrote:
: > teckytim (technotim@hotmail.com) wrote:
: > : I have occasionally read about transient write errors in raid 5
: > : implementations which writers/poster believe make raid 5 less reliable
: > : than other levels. Also I have read about some interesting data
: > : protection features in EMC & I think Netapp which I believe combat
: > : these fears.
: >
: > A block protection scheme (aka DIF) has recently been standardized by T10.
: > That protection scheme has been implemented by a few of the silicon
: > suppliers (including my employer). Look for that scheme to become a
: > pretty common feature in the next couple of years. It has been a
: > proprietary feature of a few storage vendors for a number of years already.
: > The recently announced SGI 4G FC array (OEMed from Engenio) is an example
: > that has this new standardized feature built into it.

: If it is indeed now a standard I suspect that given sufficient effort I
: could learn its details. But if you found it convenient to post them
: (at least to the degree that one could understand the technology
: involved - e.g., is it simply an additional checksum, does it live with
: the data or separate from it, etc.), it would save me and other curious
: individuals some time.

The details can be found in the SBC-2 or -3 standard at t10.org. Look for
the section on "Protection Information". Also, some new 32-byte extended SCSI
commands are being proposed to support this functionality. There are some
rumblings about adding this to T13 as well, but I'm not familiar with the
status of that.

Briefly, 8 bytes of information are appended to each block. There are 3
fields of information: a 2-byte CRC (of the data), a 4-byte LBA count, and
a 2-byte application tag. Theoretically, the information can be applied end
to end (i.e. generated at the server and sent to and returned from the
array), but that is not a typical deployment (although a few HBA manufacturers
are incorporating the feature). The typical deployment is to generate the
information in the protocol controller on the front end of the array as
it's written to memory (i.e. data cache). It is written to disk by the back
end. The information is validated by both back end and front end when the
data is read by the protocol controller. When performed in this fashion, the
data is protected as it traverses the bus (e.g. PCI and PCI-X only have
simple parity protection), while it resides in memory, and while it resides
on the disk.
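
As a rough illustration of the 8-byte trailer described above, here is a minimal sketch. Several things in it are assumptions rather than facts taken from this post: the CRC polynomial 0x8BB7 (which, as far as I know, is the one used for the T10 guard field), the on-disk field order (guard, application tag, reference tag) with big-endian packing, and the helper names crc16_t10dif, protect and verify, which are hypothetical. Check the "Protection Information" section of SBC before relying on any of it.

```python
import struct

T10_DIF_POLY = 0x8BB7  # assumed guard-field polynomial (CRC-16, no reflection)

def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with initial value 0 and no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def protect(block: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append the 8-byte trailer: 2-byte CRC, 2-byte app tag, 4-byte LBA."""
    guard = crc16_t10dif(block)
    ref_tag = lba & 0xFFFFFFFF          # only the low 32 bits of the LBA fit
    return block + struct.pack(">HHI", guard, app_tag, ref_tag)

def verify(sector: bytes, expected_lba: int) -> bytes:
    """Check CRC and LBA of a protected sector and return the bare data."""
    block, trailer = sector[:-8], sector[-8:]
    guard, _app_tag, ref_tag = struct.unpack(">HHI", trailer)
    if guard != crc16_t10dif(block):
        raise ValueError("guard (CRC) mismatch: data was corrupted")
    if ref_tag != expected_lba & 0xFFFFFFFF:
        raise ValueError("reference tag mismatch: block is from the wrong LBA")
    return block

block = bytes(range(256)) * 2           # one 512-byte block
sector = protect(block, lba=1234)       # 520 bytes on the wire and on disk
```

Each hop (front end, back end, or a host HBA in the end-to-end case) would run the equivalent of verify before passing the data on.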

Dave

Bill Todd

2005-03-25, 5:45 pm

Dave Sheehy wrote:

....

> Briefly, 8 bytes of information are appended to each block. There are 3
> fields of information: a 2-byte CRC (of the data), a 4-byte LBA count, and
> a 2-byte application tag. Theoretically, the information can be applied end
> to end (i.e. generated at the server and sent to and returned from the
> array), but that is not a typical deployment (although a few HBA manufacturers
> are incorporating the feature). The typical deployment is to generate the
> information in the protocol controller on the front end of the array as
> it's written to memory (i.e. data cache). It is written to disk by the back
> end. The information is validated by both back end and front end when the
> data is read by the protocol controller. When performed in this fashion, the
> data is protected as it traverses the bus (e.g. PCI and PCI-X only have
> simple parity protection), while it resides in memory, and while it resides
> on the disk.


Thanks. That's the kind of thing I thought might be useful a decade
ago, though seems a little stingy today - e.g., limiting the LBA address
to 32 bits (common arrays below the level that the host system may be
aware of already exceed this size, though when used only as a sanity
check the low 32 bits of the LBA may be sufficient) and the
application-specific area to 16 (if both fields were longer the
application-specific area could be used, e.g., to hold a file identifier
which would facilitate reconstruction of a file system - I have a vague
recollection that IBM's i-series boxes and their ancestors may have done
this).

It should at least allow a host which cares enough to implement the
functionality the ability to generate the validation information before
the data leaves main memory and check it after it returns. This will
catch otherwise undetected bus errors and anything clobbered by a wild
write, but unfortunately still won't detect that the intended
destination was never updated (or that a silent null write failure
occurred).
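
A small sketch of why that last case slips through. Everything here is a stand-in: the "disk" is a dict, the 16-bit checksum is just the low bits of zlib.crc32 rather than the real guard CRC, and the lost/wild_to switches exist only to inject the two faults.

```python
import zlib

def tag(data, lba):
    # Stand-in for the per-block fields: (16-bit checksum, low 32 bits of LBA).
    return (zlib.crc32(data) & 0xFFFF, lba & 0xFFFFFFFF)

def check(data, stored_tag, lba):
    return stored_tag == tag(data, lba)

disk = {}  # lba -> (data, tag)

def write(lba, data, lost=False, wild_to=None):
    record = (data, tag(data, lba))
    if lost:
        return                          # silent null write: nothing reaches the media
    disk[wild_to if wild_to is not None else lba] = record

# Lost write: the old block is still self-consistent, so the check passes.
write(7, b"old contents")
write(7, b"new contents", lost=True)
stale_data, stale_tag = disk[7]
stale_passes = check(stale_data, stale_tag, 7)

# Wild write: the block meant for LBA 9 lands on LBA 8 and is caught there.
write(8, b"victim")
write(9, b"stray", wild_to=8)
wild_data, wild_tag = disk[8]
wild_detected = not check(wild_data, wild_tag, 8)
```

The stale block at LBA 7 carries a perfectly valid checksum and address for its old contents, which is exactly why the per-block information cannot flag it.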

And the largest single potential market for such a feature could turn
out to be SATA based...

- bill
teckytim

2005-03-26, 2:45 am

Thanks for the follow-up posts Bill & Dave. Very helpful. In addition,
I see a proposal for a "Write Read Verify" feature extension over
at T13.org:
http://www.t13.org/docs2005/e04129r...read_verify.pdf

I am specifically looking for SCSI & SATA DAS & controllers that
utilize advanced protection mechanisms like those mentioned here.
Any product recommendations along those lines?


Thanks again for your time.

Jaroslaw Weglinski

2005-03-26, 7:45 am

> Thanks. That's the kind of thing I thought might be useful a decade
> ago, though seems a little stingy today - e.g., limiting the LBA address
> to 32 bits (common arrays below the level that the host system may be
> aware of already exceed this size, though when used only as a sanity
> check the low 32 bits of the LBA may be sufficient)

I think this is the LBA for a given drive inside the array (and we
still have to wait for drives with 2^32 sectors), because IMHO there is
little sense in linearly addressing sectors across different drives in
a RAID; it is the controller's job to specify the drive and the sector
on that drive for a given portion of data.

regards
Jaroslaw Weglinski
Bill Todd

2005-03-26, 8:45 pm

Jaroslaw Weglinski wrote:
>
> I think, that this is LBA address for given drive inside array (and we
> still have to wait for drives with 2^32 sectors)


That may be how those who currently use it do so, but it's hardly the
only useful way to use it.

> - because IMHO there is
> little sense to address linear sectors from different drives in RAID;
> and this is controller job to specify drive and sector on that drive for
> given data portion


The sense was implicit in my following description of how a
seriously-interested host system could use the fields. A host system
should (at least at the level where the code generating these validation
checks would likely operate) know nothing about the nature of the
apparent SCSI device it's talking to: it's just a device (and in the
case of a hardware array, a 'device' which can already often be larger
than 2 TB).
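
The 2 TB figure is just the 32-bit field multiplied out, assuming 512-byte sectors (the sector size is my assumption here; ref_tag is a made-up helper name):

```python
SECTOR_BYTES = 512                       # assumed sector size
max_sectors = 2 ** 32                    # distinct values of a 32-bit LBA field
limit_bytes = max_sectors * SECTOR_BYTES
limit_tib = limit_bytes / 2 ** 40        # capacity addressable without wrapping

def ref_tag(lba):
    """Low 32 bits of the LBA, which is all a 4-byte field can hold."""
    return lba & 0xFFFFFFFF

# On a larger device, two blocks exactly 2**32 sectors apart share a tag value.
collide = ref_tag(5) == ref_tag(5 + 2 ** 32)
```

So on a device past that size the field still works as a sanity check, but it can no longer tell apart two blocks that are exactly 2^32 sectors from each other.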

Although this would raise issues about dynamically expanding or
contracting the array, since its controller would have to understand how
the host was using the fields to the extent that it could propagate them
to the new block locations rather than create new ones there. Of
course, this may already be an issue for its application-controlled
portion: if the application actually is cognizant of the locations on
the physical disks, then how it uses that portion might differ from how
it would if it were considering the location as a logical block address
in the context of the array.

- bill
Jaroslaw Weglinski

2005-03-28, 7:45 am

>>> Thanks. That's the kind of thing I thought might be useful a decade
> That may be how those who currently use it do so, but it's hardly the
> only useful way to use it.


Yes - for example, HDS uses 520-byte sectors internally, and they
contain a 32-bit sector number.

> The sense was implicit in my following description of how a
> seriously-interested host system could use the fields.

OK - in the case of host<->array communication, 32 bits may be too
short (but it is not used for error detection anyway, because of
address translation on the controller [if we are storing physical
sector numbers on the disks]).


Another way is to store host-level sector numbers on the disks, but
this would need a second field describing the association with a
specific host-visible LUN. (If not, and if, for example, we have many
host-visible disks with few sectors each, there is a chance of not
detecting wild writes: if we write sector X for LUN A in place of
sector X for LUN B.) And in this case we have a problem with the
32-bit address space.
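
A tiny sketch of the LUN A / LUN B case, with made-up names (make_tag, misdirected_write_detected) and tuples standing in for the stored fields:

```python
def make_tag(lun, lba, include_lun):
    """Address information stored with a block, with or without the LUN."""
    return (lun, lba) if include_lun else (lba,)

def misdirected_write_detected(include_lun):
    # Data meant for LUN "A", sector 42, lands on LUN "B", sector 42.
    stored = make_tag("A", 42, include_lun)      # what the stray block carries
    expected = make_tag("B", 42, include_lun)    # what a reader of B expects
    return stored != expected

without_lun = misdirected_write_detected(include_lun=False)
with_lun = misdirected_write_detected(include_lun=True)
```

With only the sector number stored, the stray block looks legitimate; adding a LUN identifier exposes it, at the cost of more bits.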
> A host system
> should (at least at the level where the code generating these validation
> checks would likely operate) know nothing about the nature of the
> apparent SCSI device it's talking to: it's just a device (and in the
> case of a hardware array, a 'device' which can already often be larger
> than 2 TB).
>
> Although this would raise issues about dynamically expanding or
> contracting the array, since its controller would have to understand how
> the host was using the fields to the extent that it could propagate them
> to the new block locations rather than create new ones there. Of
> course, this may already be an issue for its application-controlled
> portion: if the application actually is cognizant of the locations on
> the physical disks, then how it uses that portion might differ from how
> it would if it were considering the location as a logical block address
> in the context of the array.
>
> - bill


I think the host would be mostly interested in the CRC [and app.
field], as you have written, and the sector count would mainly interest
the array controller (to detect wild writes, and restore from RAID in
that case) [of course the controller can use the CRC too (to verify
data, and restore from RAID if needed)]. So we could just as well store
physical sector counts on the disks (in the case of an intelligent RAID
controller) and send rewritten, proper sector numbers to the host.

Another question is whether the CRC covers only the data, or also the
LBA (in the second case, the controller has to recalculate the CRC on
transfer, as the sector number changes).


regards,
Jaroslaw Weglinski
Bill Todd

2005-03-28, 5:52 pm

Jaroslaw Weglinski wrote:

....

> ok - in case of host<->array communication, 32bit may be too short (but
> it is not used for error detection anyway - because of address
> translation on controller [if we are storing physical sector numbers on
> disks])


I'm not quite sure what you meant to say there. It would be nice to
have sufficient host-manageable space to store both a modest CRC (for
end-to-end content-integrity validation) and, say, a file ID/file block
number (both for end-to-end address-integrity validation and to
facilitate rebuilding a file system after some catastrophe). I'd be
willing to give up the latter, but if I let the controller handle the
address validation it can't catch addressing errors between it and the
host memory (though things like packet CRCs and bus parity may catch
most of them, and something as simple as returning the input address as
the operation ACK could allow the host to check what the controller
thinks it wrote or read).

....

> I think that host would be mostly interested in CRC [and app. field] -
> as you have written - and sector count would be mainly interesting for
> the array controller (to verify wild writes - and restore from RAID in
> that case)


While the controller could correct the effects of a
previously-undetected wild write when it encountered the data that it
clobbered (leaving aside the potential addressing errors between host
and controller described above), if the host uses the *intended* target
of that write before the unintended target is noticed neither the host
nor the controller has any indication that the data being read should
have been updated but wasn't.

Gives one a good reason not to be too lazy about background
validation-scan activity, I guess.

> [of course controller can use CRC too (to verify data - and
> restore form RAID if needed)]; so we can as well store physical sector
> counts on disks (in case of intelligent RAID controller) - and send
> rewritten, proper sector numbers to host
>
> another question is - if crc includes only data, or also LBA (in second
> case, controller has to recalculate crc on transfer, as sector number
> changes)


If I didn't have additional space available to use for file ID/file
block number information, I think I'd include them in my CRC (tell the
controller that I was handling the CRC in the host so that it would
leave it alone - which may already be pretty clear on writes by virtue
of seeing that the host is providing a CRC, but is not obvious on
reads). Then the only additional feature I'd need would be the ability
to tell the controller to fetch the 'other' copy of the data if I didn't
like the one I got back, and update the faulty one.
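
Roughly what I have in mind, as a sketch only: zlib.crc32 stands in for whatever CRC the host would really use (it is 32 bits, not 16), the two 64-bit header fields are arbitrary, and host_checksum and pick_good_copy are hypothetical names. Asking a real controller for the "other" copy would need an interface that I am not aware of existing.

```python
import struct
import zlib

def host_checksum(file_id, file_block, data):
    """CRC over (file ID, file block number, data), computed in the host."""
    header = struct.pack(">QQ", file_id, file_block)
    return zlib.crc32(data, zlib.crc32(header))

def pick_good_copy(copies, file_id, file_block, expected):
    """Return the index and contents of the first copy that checks out."""
    for index, data in enumerate(copies):
        if host_checksum(file_id, file_block, data) == expected:
            return index, data
    raise IOError("no copy matches the host checksum")

expected = host_checksum(1, 2, b"payload!")
# The first mirror copy came back damaged; the second one is intact.
index, data = pick_good_copy([b"pay1oad!", b"payload!"], 1, 2, expected)
```

Because the file ID and block number are folded into the checksum, a block that is intact but belongs to some other file or offset fails the check just as corrupted data does. The host would then rewrite the faulty copy from the good one.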

That still wouldn't normally protect against the undetected update
failure above, though.

- bill