I get the following error message when doing a select on a table:

ERROR: invalid page header in block 295 of relation “reported_titles”

I found some messages that said this means a block of this table is
corrupt. I found some suspicious lines in the server log just before:

ERROR: could not access status of transaction 3651584
DETAIL: could not open file “/usr/local/pgsql/data/pg_clog/0003”: No
such file or directory
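
For reference, the file name in the DETAIL line follows directly from the transaction number. The sketch below assumes PostgreSQL's default build constants (8192-byte pages, two status bits per transaction, 32 pages per clog segment); none of these values are stated in the thread, and the helper name `clog_segment_name` is made up for illustration.

```python
# Assumed PostgreSQL defaults; build-time constants, not values from the thread.
BLOCK_SIZE = 8192        # bytes per clog page (default BLCKSZ)
XACTS_PER_BYTE = 4       # two status bits per transaction
PAGES_PER_SEGMENT = 32   # clog pages stored in one segment file

# Number of transactions whose status fits in one pg_clog segment file.
XACTS_PER_SEGMENT = BLOCK_SIZE * XACTS_PER_BYTE * PAGES_PER_SEGMENT


def clog_segment_name(xid: int) -> str:
    """Return the pg_clog file name that would hold the status of xid."""
    return format(xid // XACTS_PER_SEGMENT, "04X")


print(clog_segment_name(3651584))  # prints 0003
```

Under those assumptions, transaction 3651584 falls in segment 0003, which matches the file the server failed to open.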

How do I fix this corruption? I have dumped as much of the databases as
I can, including about half of this table. Only this table is corrupt.

What could cause the corruption? We are using custom C code. Could
bugs in this be causing it? Or are hardware problems more likely?


> I get the following error message when doing a select on a table:
> ERROR: invalid page header in block 295 of relation “reported_titles”


> How do I fix this corruption?
You can zap just the failed block by turning on “zero_damaged_pages”;
that will at least allow you to recover the rest of the table. If you
want to try harder, you could look at the damaged page with pg_filedump
(http://sources.redhat.com/rhdb/) or a similar tool and try to intuit
how to fix it manually.

> What could cause the corruption? We are using custom C code. Could
> bugs in this be causing it? Or are hardware problems more likely?

Hmm. A scribble-on-memory kind of bug could cause this, but in my
experience it’s unusual for coding errors to trash the disk buffers —
that’s a relatively small part of your address space, and usually a
memory clobber will crash the backend elsewhere before it hits a disk
buffer. (BTW, one reason we force a database restart after a backend
crash is in hopes of not letting any such clobber make it to disk. The
contents of shared disk buffers are simply thrown away in a restart.)

It would probably be worth your while to look at the damaged page with
pg_filedump before you zap it. The symptoms of hardware misfeasance and
software errors are different enough that you can often tell which
theory to believe by examining the bits.
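
If pg_filedump is not at hand, the raw bytes of the block can also be pulled out of a copy of the relation file. This is only a rough sketch: it assumes the default 8192-byte page size and that the block lives in the first segment file of the relation (true for block 295, since segments default to 1 GB). The function names are hypothetical, and it should only ever be pointed at a backup copy, never at the live data directory.

```python
BLOCK_SIZE = 8192  # assumed default page size (BLCKSZ); not stated in the thread


def block_offset(block_number: int, block_size: int = BLOCK_SIZE) -> int:
    """Byte offset of a block inside its relation segment file."""
    return block_number * block_size


def read_block(path: str, block_number: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Read one whole block from a copy of a relation file."""
    with open(path, "rb") as f:
        f.seek(block_offset(block_number, block_size))
        page = f.read(block_size)
    if len(page) != block_size:
        # A short read means the file ends before this block is complete.
        raise ValueError(f"block {block_number} is not fully present in {path}")
    return page


print(block_offset(295))  # prints 2416640
```

With those assumptions, block 295 starts 2416640 bytes into the file, and `read_block(copy_path, 295)` would return the 8192 bytes to inspect.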


>
> You can zap just the failed block by turning on “zero_damaged_pages”;
> that will at least allow you to recover the rest of the table. If you
> want to try harder, you could look at the damaged page with pg_filedump
> (http://sources.redhat.com/rhdb/) or a similar tool and try to intuit
> how to fix it manually.
>
I zapped the damaged block. It didn’t seem to affect the rows in the
table. My suspicion is that the page only contained deleted rows since
the table had many updates done recently.

> Hmm. A scribble-on-memory kind of bug could cause this, but in my
> experience it’s unusual for coding errors to trash the disk buffers —
> that’s a relatively small part of your address space, and usually a
> memory clobber will crash the backend elsewhere before it hits a disk
> buffer. (BTW, one reason we force a database restart after a backend
> crash is in hopes of not letting any such clobber make it to disk. The
> contents of shared disk buffers are simply thrown away in a restart.)
>
> It would probably be worth your while to look at the damaged page with
> pg_filedump before you zap it. The symptoms of hardware misfeasance and
> software errors are different enough that you can often tell which
> theory to believe by examining the bits.
>

I used pg_filedump on a backup of the database files. The block looks
like it is mostly zero bytes with a few x02 bytes thrown in just to be
confusing.

 

> I used pg_filedump on a backup of the database files. The block looks
> like it is mostly zero bytes with a few x02 bytes thrown in just to be
> confusing.

My interpretation of that would be a hardware glitch. A software
problem would be more likely to look like copying the wrong data
into the block, or possibly zeroing out the block when it shouldn’t
— but the sprinkling of x02’s rules out a misaimed memset().

Time to break out the RAM and disk test programs …
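
The kind of inspection described above (mostly zeros, with a few stray x02 bytes) can be summarized mechanically once the page bytes are in hand. The following is a minimal sketch, not a diagnostic tool: `summarize_page` is a made-up helper, the demo page is synthetic, and the offsets in it are invented purely for illustration.

```python
from collections import Counter


def summarize_page(page: bytes) -> dict:
    """Count zero bytes and record where the non-zero bytes sit in a page."""
    nonzero = [(offset, value) for offset, value in enumerate(page) if value != 0]
    return {
        "size": len(page),
        "zero_bytes": len(page) - len(nonzero),
        "nonzero_values": Counter(value for _, value in nonzero),
        "nonzero_offsets": [offset for offset, _ in nonzero],
    }


# Synthetic demo page: all zeros except three 0x02 bytes at invented offsets.
demo = bytearray(8192)
for offset in (17, 1024, 4099):
    demo[offset] = 0x02

summary = summarize_page(bytes(demo))
print(summary["zero_bytes"], dict(summary["nonzero_values"]))  # prints 8189 {2: 3}
```

A page that a misaimed memset() had zeroed would show no non-zero bytes at all, while data copied from the wrong place would typically show many distinct byte values; a handful of identical stray bytes fits neither pattern, which is the reasoning given above for suspecting hardware.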