Home > Archive > Data Storage > April 2005 > consistency
| lindahlb@hotmail.com 2005-04-13, 5:46 pm |
| For a custom database implementation, I'm generating 128-byte pre-image
log records for sections of a page. I need to pair the log data
with the file offset (4 bytes).
For this problem, assume:
1) 512-byte or larger binary disk sectors (very reasonably portable)
2) meta-data won't need updating (I precommit file extensions)
3) the file is precommitted with invalid addresses
5) data written to the file is aligned on disk sector size
6) OS file cache buffers are bypassed (i.e. O_DIRECT)
To maintain consistency my log records look like:
unsigned long address1;
char data[128];
unsigned long address2;
I surround the log data with the address, giving me 136-byte log
records. I assume that if both addresses match, then I'm guaranteed
that the data in the middle is completely consistent (providing there
is no outside corruption). Is this assumption correct?
I feel it's safe to make this assumption. The reason: a log record will
reside on, at most, 2 disk sectors. If the first part of the log record
is written but not the second part, then address2 won't match. If the
last part of the log record is written but not the first part, then
address1 won't match. The only time a match occurs is if both ends of
the log record are on disk. This means that the entire log record
(including data) made it to disk, giving me the consistency.
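A minimal C sketch of the record layout and the matching-address check described above. It assumes fixed-width `uint32_t` addresses (on many 64-bit platforms `unsigned long` is 8 bytes, which would make the record larger than 136 bytes) and an assumed sentinel value `INVALID_ADDRESS` for precommitted space. The names `log_record`, `record_fill`, and `record_looks_complete` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LOG_DATA_SIZE   128u
#define INVALID_ADDRESS 0xFFFFFFFFu  /* assumed sentinel for precommitted space */

/* 4 + 128 + 4 = 136 bytes; every member is 4-byte aligned, so no padding. */
struct log_record {
    uint32_t address1;
    char     data[LOG_DATA_SIZE];
    uint32_t address2;
};

_Static_assert(sizeof(struct log_record) == 136, "record must be 136 bytes");

/* Fill a record so that both bracketing addresses carry the same file offset. */
void record_fill(struct log_record *r, uint32_t address, const void *pre_image)
{
    assert(address != INVALID_ADDRESS);
    r->address1 = address;
    memcpy(r->data, pre_image, LOG_DATA_SIZE);
    r->address2 = address;
}

/* The check from the post: both ends agree and neither is the sentinel. */
int record_looks_complete(const struct log_record *r)
{
    return r->address1 == r->address2 && r->address1 != INVALID_ADDRESS;
}
```

Note that this check only looks at the two ends of the record; it says nothing directly about the 128 bytes in between.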
So.. again.. is this assumption correct? If not, what will break it? -
as I clearly don't understand file systems as well as I thought I did.
| |
| Maxim S. Shatskih 2005-04-13, 8:45 pm |
| From what I know, any single-sector update is atomic on modern disks. A
multi-sector write is not guaranteed to be atomic. So it is better to keep a
"generation" count at the beginning of each sector, and keep this count
the same for all sectors participating in a multi-sector write (a
database page or such). Before writing, increment the counter in all sectors,
then write them all. If a later read shows different counters in these
sectors, then the record or page was damaged by a disk drive or power failure.
At least this is how NTFS works.
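A rough C sketch of the generation-count idea, working on an in-memory page buffer. It assumes 512-byte sectors and reserves the first 4 bytes of every sector for the counter; this layout and the names `page_stamp` and `page_is_consistent` are illustrative assumptions, not a description of the actual NTFS on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u  /* assumed sector size */

/* Stamp the same generation number into the first 4 bytes of every sector. */
void page_stamp(unsigned char *page, size_t sectors, uint32_t generation)
{
    for (size_t i = 0; i < sectors; i++)
        memcpy(page + i * SECTOR_SIZE, &generation, sizeof generation);
}

/* Return 1 if all sectors carry the same generation, 0 if the page is torn. */
int page_is_consistent(const unsigned char *page, size_t sectors)
{
    uint32_t first, cur;

    assert(sectors > 0);
    memcpy(&first, page, sizeof first);
    for (size_t i = 1; i < sectors; i++) {
        memcpy(&cur, page + i * SECTOR_SIZE, sizeof cur);
        if (cur != first)
            return 0;
    }
    return 1;
}
```

With this layout each sector holds 508 bytes of payload, so the bytes displaced by the counter have to be accounted for in the page format.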
BTW - I don't think O_DIRECT will allow non-sector-aligned IO on the file.
At least its Windows counterpart does not allow this. Windows just locks these
pages to the MDL structure, then scatters this MDL to several child MDLs
according to the file runlist, and sends all of them down the disk stack in
parallel.
Dunno about UNIXen, but I expect them to do the same in such a case.
--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
maxim@storagecraft.com
http://www.storagecraft.com
| |
| Bill Todd 2005-04-14, 2:46 am |
| lindahlb@hotmail.com wrote:
....
> I surround the log data with the address, giving me 136-byte log
> records. I assume that if both addresses match, then I'm guaranteed
> that the data in the middle is completely consistent (providing there
> is no outside corruption). Is this assumption correct?
Technically, no - but it may well be adequate.
>
> I feel it's safe to make this assumption. The reason: a log record will
> reside on, at most, 2 disk sectors. If the first part of the log record
> is written but not the second part, then address2 won't match. If the
> last part of the log record is written but not the first part, then
> address1 won't match. The only time a match occurs is if both ends of
> the log record are on disk. This means that the entire log record
> (including data) made it to disk, giving me the consistency.
>
> So.. again.. is this assumption correct? If not, what will break it? -
> as I clearly don't understand file systems as well as I thought I did.
1. Obviously, should one of the two sectors fail to be written but just
happen to contain the address value you're looking for, you'll have a
garbage log that looks good.
2. Another (*very* low probability) possibility is that you'll get a
partial sector write - not because the disk didn't finish writing the
sector (most modern disks will, even if power fails), but because the
failing power caused RAM or bus errors (undetected by the disk) that
corrupted the end of the sector transfer (I've seen first-person reports
by multiple people of such occurrences, though they appear to be very
rare). If the disk also happened to write the two sectors out of order
(disks sometimes optimize writes by starting with the first of the
target sectors that's writable after the head stabilizes and then finish
up at the end of the disk revolution, though for just two sectors this
should be rather improbable - but they also revector bad sectors, and if
the first of the two sectors was so revectored it could easily wind up
being written out of order), and the later-written (but first of the log
record) sector write was corrupted as described, there could be garbage
within the log record even if both addresses checked out.
Processors are powerful enough these days that including a CRC for the
entire log record shouldn't be unreasonably expensive - and it validates
the entire record (at least to the probability of just happening to
match the CRC in the same way described in point 1 above, but you can
make the CRC as long as you want to minimize that).
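As a hedged illustration of the CRC suggestion, here is a small C sketch that covers the address and data of a record with a 32-bit CRC. It uses a plain bitwise CRC-32 (reflected polynomial 0xEDB88320, the one used by zlib and Ethernet); the record layout and the names `crc_log_record`, `crc_record_seal`, and `crc_record_is_valid` are assumptions made for the example. A table-driven or hardware CRC would be faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32, reflected polynomial 0xEDB88320. */
uint32_t crc32_bytes(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical record layout: one address, the data, then the CRC. */
struct crc_log_record {
    uint32_t address;
    char     data[128];
    uint32_t crc;        /* CRC-32 over address and data */
};

/* Compute and store the CRC once address and data are filled in. */
void crc_record_seal(struct crc_log_record *r)
{
    r->crc = crc32_bytes(r, offsetof(struct crc_log_record, crc));
}

/* Recompute the CRC during recovery and compare with the stored value. */
int crc_record_is_valid(const struct crc_log_record *r)
{
    return r->crc == crc32_bytes(r, offsetof(struct crc_log_record, crc));
}
```

Unlike the matching-address check, this validates every byte of the record, up to the chance of an accidental CRC match.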
Other things to watch out for when logging include stumbling upon unwritten
data during recovery that just happens to look like valid log data -
especially if you're reusing space by treating the log as a ring buffer
so that 'unwritten' space will in fact be space containing obsolete log
records (in which case you should at a minimum ensure that the total
ring buffer size is not an integral multiple of the log record size,
when that size is fixed - but that only works when you have the luxury
of batching up log writes such that each exactly fills an integral
number of sectors). A CRC over the rest of the record helps minimize
this probability as well.
If you write multiple log records at a time, it significantly increases
the probability that a situation like that described in point 2 above
might leave a 'hole' in the middle of a long log write. The normal
mechanisms you use to detect the end of the log should safely stop you
at the start of the hole, but if another quick failure occurred such
that the new portion of the log ended within the sectors that were
written before the *first* failure - and which therefore look like valid
this-pass sectors - you could be in trouble on the second recovery. One
way to guard against this is to clear out the sectors just after the end
of the log during recovery equal to the size of the longest log write
you ever perform.
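A minimal C sketch of that guard, operating on an in-memory image of a ring-buffer log. It assumes 512-byte sectors and that zero-filled sectors never look like valid log data; the function name `log_clear_after_end` and its parameters are hypothetical. In a real system the cleared sectors would have to be written to the log file and flushed before new log writes begin.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512u  /* assumed sector size */

/*
 * Zero the sectors that follow the end of the valid log, wrapping around
 * the ring. log_end_sector is the first sector past the valid log, and
 * max_write_sectors is the size of the longest log write ever performed.
 */
void log_clear_after_end(unsigned char *ring, size_t total_sectors,
                         size_t log_end_sector, size_t max_write_sectors)
{
    assert(total_sectors > 0 && log_end_sector < total_sectors);
    if (max_write_sectors > total_sectors)
        max_write_sectors = total_sectors;

    for (size_t i = 0; i < max_write_sectors; i++) {
        size_t s = (log_end_sector + i) % total_sectors;
        memset(ring + s * SECTOR_SIZE, 0, SECTOR_SIZE);
    }
}
```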
Logs are what make ensuring the integrity (and sometimes availability
as well) of the rest of the system easy. But ensuring the integrity
(and if applicable availability) of the log itself is anything but.
- bill
| |
| lindahlb@hotmail.com 2005-04-14, 2:46 am |
| > 1. Obviously, should one of the two sectors fail to be written but just
> happen to contain the address value you're looking for, you'll have a
> garbage log that looks good.
I guess I wasn't quite clear enough about condition 3. This is
impossible because I split the log file into 'extents' (customizable and
usually large, 1MB+). When a log buffer sync would reach or go beyond the
next extent, that area is filled with null addresses (followed by an
fsync, which also prevents metadata problems if the OS file size isn't
large enough). Essentially, when recycling the log file, log record space
is precommitted with null addresses. Under this assumption we protect
against case 1, correct?
> failing power caused RAM or bus errors (undetected by the disk) that
> corrupted the end of the sector transfer
So in this scenario, given the address '1' and data 'logdata' and
corruption '*', and after an out-of-order write of sector 2 before
sector 1, with a power failure while writing sector 2, causing RAM to
be corrupted in the second half of the sector buffer, you could see
something like this?
sector 1: 1lo**
sector 2: ata1
It sounds like the only way to prevent this would be a CRC, correct?
But it sounds like this is very rare and can be ignored for all but the
most critical of data (the user can specify to the database what is
'critical')?
> From what I know, any single-sector update is atomic on modern disks. As
> about multi sector - it is not guaranteed so. So, it is better to keep some
> "generation" count in the beginning of the sectors, and maintain this count
> always the same for all sectors participating in a multi-sector write (a
> database page or such). Before writing, increment the counter in all sectors,
> then write them all. If the future reads will show different counters in these
> sectors - then the record or page is damaged by disk drive or power failure.
This is a good idea. It should keep outside corruption from going
unnoticed, so the database can be restored from a backup or repaired with
user intervention. However, I think it is unnecessary for
other types of problems.
> I don't think O_DIRECT will allow non-sector-aligned IO on the file.
To be more clear about this, I keep a log record ring buffer that is
512-byte aligned. The buffer is committed to the log file in 512-byte
blocks and if a partial block needs to be written, the rest of it is
filled with null addresses. So we often write out many sectors at once.
However, as I've stated above, outside corruption (RAM/bus) aside,
out-of-order writes are not a problem, because of precommitting null
addresses - so I don't think this introduces any problems.
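A small C sketch of the padding step described above: rounding a partial block up to the next 512-byte boundary and filling the remainder with null addresses. The encoding of a null address (`INVALID_ADDRESS`, here 0xFFFFFFFF) and the helper names are assumptions for illustration only; for O_DIRECT the buffer itself would also need to be suitably aligned in memory, which this sketch does not handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE     512u
#define INVALID_ADDRESS 0xFFFFFFFFu  /* assumed encoding of a null address */

/* Round a byte count up to the next multiple of the sector size. */
size_t round_up_to_sector(size_t len)
{
    return (len + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
}

/*
 * Fill buf[used .. padded) with null-address words so that a whole number
 * of sectors can be written. Returns the padded length in bytes.
 */
size_t pad_with_null_addresses(unsigned char *buf, size_t used, size_t capacity)
{
    size_t padded = round_up_to_sector(used);
    uint32_t null_addr = INVALID_ADDRESS;

    assert(used % sizeof null_addr == 0 && padded <= capacity);
    for (size_t off = used; off < padded; off += sizeof null_addr)
        memcpy(buf + off, &null_addr, sizeof null_addr);
    return padded;
}
```

For example, a single 136-byte record is padded out to 512 bytes, and four records (544 bytes) are padded out to 1024 bytes.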