Data Storage - Very Large Filesystems


Author Very Large Filesystems
Aknin

2007-04-28, 7:12 am

Following some research I've been doing on the matter across
newsgroups and mailing lists, I'd be glad if people could share
numbers about real-life large filesystems and their experience with
them. I'm slowly coming to the realization that, regardless of
theoretical filesystem capabilities (1TB, 32TB, 256TB or more), people
across the enterprise filesystem arena more or less recommend keeping
practical filesystems to 1TB or less, for manageability and
recoverability.

What's the maximum filesystem size you've used in a production
environment? How did the experience turn out?

Thanks,
-Yaniv

Faeandar

2007-04-29, 7:12 pm

On 28 Apr 2007 02:21:30 -0700, Aknin <the.aknin@gmail.com> wrote:

>Following some research I've been doing on the matter across
>newsgroups and mailing lists, I'd be glad if people could share
>numbers about real life large filesystem and their experience with
>them. I'm slowly coming to a realization that regardless of
>theoretical filesystem capabilities (1TB, 32TB, 256TB or more), more
>or less across the enterprise filesystem arena people are recommending
>to keep practical filesystems up to 1TB in size, for manageability and
>recoverability.
>
>What's the maximum filesystem size you've used in production
>environment? How did the experience come out?
>
>Thanks,
> -Yaniv


The true constraint, as you've pointed out, is recoverability. If you
need to recover an entire file system in any sane amount of time, 16TB
and bigger is out of the question.

I think 3-4TB is fine with today's tape drive speeds but there may be
limitations from your backup software. I recall hearing a limit of
4TB per NDMP stream for NBU.

You could go higher I think if you have a directory structure that
allows for recovery prioritization. If you have a 36TB file system
but you know that these 9 directories are the priority, then you
really only have a recover limit of those 9 directories. The rest can
be done as time permits.
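
The arithmetic behind these limits can be sketched roughly as below. This
is only a back-of-the-envelope illustration: the 100 MB/s sustained rate
per stream and the 4 TB combined size of the priority directories are
assumptions picked for the example, not figures from this thread, and
real restores of millions of small files usually run well below
streaming speed.

```python
# Back-of-the-envelope restore-time estimate.
# ASSUMPTIONS (not from the thread): sustained throughput per stream,
# number of parallel streams, and the size of the priority directories.

TB = 10**12  # decimal terabyte, in bytes


def restore_hours(size_bytes, mb_per_sec_per_stream, streams=1):
    """Hours needed to restore size_bytes at the given sustained rate."""
    bytes_per_sec = mb_per_sec_per_stream * 10**6 * streams
    return size_bytes / bytes_per_sec / 3600


if __name__ == "__main__":
    rate = 100  # MB/s per stream, hypothetical
    for size_tb in (1, 4, 16, 36):
        hours = restore_hours(size_tb * TB, rate)
        print(f"{size_tb:>3} TB at {rate} MB/s, 1 stream: {hours:6.1f} h")

    # Prioritized recovery: only the critical directories gate the return
    # to service; everything else is restored as time permits.
    priority_tb = 4  # hypothetical combined size of the priority directories
    print(f"priority subset ({priority_tb} TB): "
          f"{restore_hours(priority_tb * TB, rate):.1f} h")
```

Plugging in your own measured restore rate and stream count is the whole
point; the function itself is just size divided by throughput.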

~F
Bill Todd

2007-04-29, 7:12 pm

Faeandar wrote:
> On 28 Apr 2007 02:21:30 -0700, Aknin <the.aknin@gmail.com> wrote:
>
>
> The true constraint, as you've pointed out, is recoverability. If you
> need to recover an entire file system in any sane amount of time 16TB
> and bigger is out of the question.


That really depends on how you're recovering it, which in turn depends
on what kind of problem you need to recover it from.

If you're talking about restoring from backup tapes, fine. If you're
talking about recovery from backup disks (plus a few recent
incrementals, whether on disk or on tape, that can be applied directly
to them to recreate the running system), you can probably go
larger. If you're talking about recovery using a synchronous
replication site, no size limit exists at all (though you need a)
snapshot or CDP facilities to ensure that common corruption at both
sites can be quickly backed out and b) *real* confidence in the software
not to have introduced system-level corruption at both sites, though the
latter can in part be addressed by using logical inter-site mirroring
with different software implementations at the two sites).

As the required software matures, CDP in combination with inter-site
synchronous replication (or low-delay asynchronous replication plus
local logging to cover the gap for anything save complete primary site
destruction) should help make backups as obsolete as paper tape:
decreasing hardware costs for such services should make the management
costs of backups (let alone their effect on recovery-time objectives)
increasingly untenable.

- bill
Aknin

2007-04-30, 7:12 am

On Apr 30, 12:48 am, Bill Todd <billt...@metrocast.net> wrote:
> Faeandar wrote:
>
>
>
>
>
> That really depends on how you're recovering it, which in turn depends
> on what kind of problem you need to recover it from.
>
> If you're talking about restoring from backup tapes, fine. If you're
> talking about recovery from backup disks (plus a few recent
> incrementals, whether on disk or on tape, that can be applied directly
> to them to recreate the running system), you can usually probably go
> larger. If you're talking about recovery using a synchronous
> replication site, no size limit exists at all (though you need a)
> snapshot or CDP facilities to ensure that common corruption at both
> sites can be quickly backed out and b) *real* confidence in the software
> not to have introduced system-level corruption at both sites, though the
> latter can in part be addressed by using logical inter-site mirroring
> with different software implementations at the two sites).
>
> As the required software matures, CDP in combination with inter-site
> synchronous replication (or low-delay asynchronous replication plus
> local logging to cover the gap for anything save complete primary site
> destruction) should help make backups as obsolete as paper tape:
> decreasing hardware costs for such services should make the management
> costs of backups (let alone their effect on recovery-time objectives)
> increasingly untenable.
>
> - bill


The system in question is made of millions (sometimes more) of small
files. Corruption in any particular file isn't troublesome, nor even
in hundreds of files. The block device is mirrored and is stored on
expensive SAN arrays that are trusted not to choke and die, and
snapshots can be taken at regular intervals.

As you can probably understand, the number of files times the capacity
(tens of TBs and growing...) makes backups quite irrelevant, and what
we're counting on (maybe unjustly) is the mirroring and the
snapshotting. We trust the system in the sense that it's too stupid to
do something wrong: it works at the file level and is exceedingly
unlikely to corrupt more than a file (or two, or a hundred - but no
more) at a time.

What /is/ worrying to me is silent filesystem corruption that will at
some point jump and bite my arse. Filesystem corruption will cause
prompt snapshot rollback and incremental recovery*, but I'm worried
about rolling back only to discover the filesystem was already
corrupted at the time of the snap. I don't have room for much more
than one or two snaps.

So you see the most complex part of my scenario is the filesystem,
rather than the system, and tape backup is totally impractical even
for sizes much smaller than 4TB.

Does that change your advice?

Ernst S Blofeld

2007-04-30, 7:12 am

Aknin wrote:
> What /is/ worrying to me is silent filesystem corruption that will at
> some point jump and bite my arse.


Which is leading you to suggest that your files are best kept across a
number of independent filesystem 'domains' so as to contain the possible
effects of any corruption. This would seem a reasonable suggestion with
the proviso that the 'domains' are genuinely independent and not sitting
on the same SAN, fileserver etc. etc. You also need to be confident of
detecting the corruption as soon as possible, for the reasons that you
outline.

It seems that the only logical solution is automatic checksumming
coupled with redundancy, in the manner that ZFS does. No doubt this
feature will be found in other filesystems in the future.
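
To make "checksumming coupled with redundancy" concrete, here is a toy
sketch of the idea. It is not ZFS's actual implementation: the
MirroredBlock class, the two in-memory copies, and the choice of SHA-256
are all assumptions made for illustration. The point is only that a
checksum kept apart from the data tells you which mirror copy is wrong,
so the good copy can be used to repair it.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 hex digest of a block of data."""
    return hashlib.sha256(data).hexdigest()


class MirroredBlock:
    """Toy block: two copies plus a checksum kept apart from the data.

    Hypothetical illustration only, not how any real filesystem lays
    out its blocks.
    """

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = checksum(data)

    def read(self) -> bytes:
        """Return the first copy that verifies, healing any bad copy from it."""
        for good in self.copies:
            if checksum(bytes(good)) == self.expected:
                for i, other in enumerate(self.copies):
                    if checksum(bytes(other)) != self.expected:
                        self.copies[i] = bytearray(good)  # repair from good copy
                return bytes(good)
        raise IOError("all copies fail verification; fall back to snapshot or backup")


if __name__ == "__main__":
    block = MirroredBlock(b"small file contents")
    block.copies[0][0] ^= 0xFF  # simulate silent bit rot in one mirror copy
    assert block.read() == b"small file contents"
    assert bytes(block.copies[0]) == b"small file contents"  # copy was healed
```

A plain mirror without the checksum can see that the two copies differ
but cannot tell which one is right; that is the gap the checksum closes.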

ESB
Jan-Frode Myklebust

2007-04-30, 1:13 pm

On 2007-04-30, Ernst S Blofeld <E.Blofeld@new-spectre-base.com> wrote:
>
> It seems that the only logical solution is automatic checksumming
> coupled with redundancy, in the manner that ZFS does.


This blind trust in ZFS amazes me.. ZFS will have bugs, and get corrupted
like any other file system, and then you'll need your backups. Also, when
the automatic checksumming finds a corruption, you'll need your backups.

So the answer is backup (online, nearline, offline, whatever), and spread
your files over many small'ish fs's to reduce the time to recover from a
fs corruption.


-jf
Bill Todd

2007-04-30, 1:13 pm

Jan-Frode Myklebust wrote:
> On 2007-04-30, Ernst S Blofeld <E.Blofeld@new-spectre-base.com> wrote:
>
> This blind trust in ZFS amazes me..


Perhaps as much as blind ignorance like yours amazes me. The main
difference between us being that no one talking about ZFS in any way
suggested trusting it blindly: ESB above suggested a mechanism *like*
ZFS's (in case you're unaware of the fact, other more mature systems
provide features of this nature), and I suggested that ZFS, while by no
means mature, *might* still satisfy the expressed needs..

> ZFS will have bugs, and get corrupted
> like any other file system, and then you'll need your backups. Also, when
> the automatic checksumming finds a corruption, you'll need your backups.


Well, no: the redundancy is used to correct it. And in the unlikely
event that the corruption was system-caused and hence loyally replicated
by lower-level functions, that's what no-overwrite snapshotting is for:
it would take a particularly pathological bug to subvert both the
main-line data and the separate snapshots.

>
> So the answer is backup (online, nearline, offline, whatever)


Since the original poster just told us that this is *not* a suitable
answer, one can only assume that you're listening to him as poorly as
you've apparently listened to others.

- bill
Jan-Frode Myklebust

2007-04-30, 7:13 pm

On 2007-04-30, Bill Todd <billtodd@metrocast.net> wrote:
>
> Perhaps as much as blind ignorance like yours amazes me. The main
> difference between us being that no one talking about ZFS in any way
> suggested trusting it blindly: ESB above suggested a mechanism *like*
> ZFS's (in case you're unaware of the fact, other more mature systems
> provide features of this nature), and I suggested that ZFS, while by no
> means mature, *might* still satisfy the expressed needs..


To quote the OP:

"What /is/ worrying to me is silent filesystem corruption that will at
some point jump and bite my arse. Filesystem corruption will cause
prompt snapshot rollback and incremental recovery*, but I'm worried
about rolling back only to discover the filesystem was already
corrupted at the time of the snap. I don't have room for much more
than one or two snaps."

Is there any other solution than backups, if neither the fs nor the two
snaps can be trusted ? I would argue that making your fs's as small as
possible, to confine the damage, and keeping good backups is the best
option. Why would tape backup be "totally impractical even for sizes
much smaller than 4TB." ?


And then quoting you from another recent thread:

"Though (as I already noted) I don't have any direct experience with it,
my impression is that people are using it in production systems
successfully "

"My impression is that *some* customers have workloads that have found
ZFS to be very stable already, while others push corner cases that are
still uncovering bugs."

So you agree it's a fairly new fs where people are still uncovering bugs,
you have no direct experience with it, and you still think it's the
solution to the OP's worry about file system corruption ?

>
>
> Since the original poster just told us that this is *not* a suitable
> answer, one can only assume that you're listening to him as poorly as
> you've apparently listened to others.


He doesn't say much about why backups would be "totally impractical", so
I'm suggesting the best option (when you have fs corruption, and the 2
snaps aren't good enough) is to spread the files over as many fs's as
possible, to confine the damage and the number of files that need to be
restored from backup.


-jf
Ernst S Blofeld

2007-05-01, 1:13 am

Jan-Frode Myklebust wrote:
> Is there any other solution than backups, if neither the fs nor the two
> snaps can be trusted ? I would argue that making your fs's as small as
> possible, to confine the damage, and keeping good backups is the best
> option. Why would tape backup be "totally impractical even for sizes
> much smaller than 4TB." ?


Who said don't make backups ? ZFS is not a backup solution but a
filesystem with checksumming and redundancy features. I've never heard
anyone seriously suggest that ZFS obviated the need for backups, not in
this thread or anywhere else. Rant about non-issues elsewhere please.

As already pointed out, increasing the number of filesystems does not
increase the protection because you still have all the common modes of
failure (including the software bugs that you are so apparently keen
on). How much better off are a million files on a single filesystem
against the same files on a thousand filesystems if everything else
remains equal? There is no meaningful difference at all.

Moreover backups do not address the OP's point - silent corruption. If
you aren't checking your files how can you have any confidence in your
backups? A backup is as problematic in terms of integrity as the
filesystem it is read from. Backing-up a corrupt file doesn't fix it.

You cannot avoid the need for checksumming to detect errors and
redundancy to fix them. Putting these features directly in your
filesystem is a good idea - integrity is maintained and there is fast
recovery. The fact that there will be teething problems in ZFS or an
equivalent filesystem is not a sound basis for rejecting these features.
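
For filesystems without built-in checksumming, the detection half can be
approximated at the file level. The sketch below is a minimal
illustration under stated assumptions: the function names
(build_manifest, scrub) and the in-memory dict manifest are hypothetical,
and the manifest has to be refreshed whenever a file is legitimately
rewritten, otherwise every update looks like corruption.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def scrub(root, manifest):
    """Return relative paths that are missing or no longer match the manifest."""
    bad = []
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_digest(full) != expected:
            bad.append(rel)
    return sorted(bad)
```

Running scrub against a mounted snapshot before rolling back to it, or
against a restored backup, is one way to gain some confidence that the
copy is not already corrupt. It only detects; repair still needs a
redundant good copy somewhere.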

There will still be backups in the future too.

ESB
Aknin

2007-05-01, 7:14 am

On May 1, 3:15 am, Ernst S Blofeld <E.Blof...@new-spectre-base.com>
wrote:
> Jan-Frode Myklebust wrote:
>
> Who said don't make backups ? ZFS is not a backup solution but a
> filesystem with checksumming and redundancy features. I've never heard
> anyone seriously suggest that ZFS obviated the need for backups, not in
> this thread or anywhere else. Rant about non-issues elsewhere please.
>
> As already pointed out, increasing the number of filesystems does not
> increase the protection because you still have all the common modes of
> failure (including the software bugs that you are so apparently keen
> on). How much better off are a million files on a single filesystem
> against the same files on a thousand filesystems if everything else
> remains equal? There is no meaningful difference at all.
>
> Moreover backups do not address the OP's point - silent corruption. If
> you aren't checking your files how can you have any confidence in your
> backups? A backup is as problematic in terms of integrity as the
> filesystem it is read from. Backing-up a corrupt file doesn't fix it.
>
> You cannot avoid the need for checksumming to detect errors and
> redundancy to fix them. Putting these features directly in your
> filesystem is a good idea - integrity is maintained and there is fast
> recovery. The fact that there will be teething problems in ZFS or an
> equivalent filesystem is not a sound basis for rejecting these features.
>
> There will still be backups in the future too.
>
> ESB


I've cross-posted this question in several places, and practically all
answers switched immediately to backup/restore issues. It seems that
no one puts any kind of trust in filesystems. Even if you have an
expensive mirrored SAN, the system (the software managing the data) is
too stupid to cause corruption (more about that in my previous post),
and small amounts of data /may/ be lost without too much pain, people
here (and on the VxFS ML, and on ZFS-discuss) still recommend either
backing up the filesystem (i.e., copying all its data to something with
a different data structure than the filesystem itself, implicitly
because the FS /will/ get corrupted at some point) or splitting it into
smaller FSs (implicitly because if one of them gets corrupted, we can
contain the damage and restore from backups).

So it seems that 'we' always assume an FS will get corrupted, and that
no amount of sophistication will prevent it, or at least prevent the
corruption from being a total loss. Would anyone here trust a
filesystem (any filesystem, take your pick) enough to make a few (say
3 or 4) 32TB monsters holding the above-mentioned kind of data and
backed solely by snaps? If you feel that it's not safe - what good are
those gigantic-interconnected/grid-multi-TB-super-expensive SANs, if
you can't mkfs more than a few TBs without fear because of filesystem
limitations?

Thanks for your replies, they've been very interesting and useful so
far!

- Yaniv

Al Dykes

2007-05-01, 1:13 pm

In article <1178001817.609578.20510@p77g2000hsh.googlegroups.com>,
Aknin <the.aknin@gmail.com> wrote:
>On May 1, 3:15 am, Ernst S Blofeld <E.Blof...@new-spectre-base.com>
>wrote:
>
>I've cross-posted this question on several places, and practically all
>answers switched immediately to backup/restore issues. It seems that
>no-one puts any kind of trust in filesystems, in the sense that even



Filesystems are not the problem. Hardware is.

I've worked with many thousands of PC disks starting with the first
release of NTFS, almost 15 years ago. I have never seen NTFS
"corrupt" itself. All failures were traced to dying hardware. Sh*t
happens. I have to admit that my experience with RAID is much less.

I'd like to hear of documented cases of such NTFS problems.

In any case, you need a strategy for backup and recovery of your data.
Even if the filesystem is fine, the building can burn down.




--
a d y k e s @ p a n i x . c o m
Don't blame me. I voted for Gore. A Proud signature since 2001
Craig Ruff

2007-05-01, 7:15 pm

In article <1177752090.144496.248950@n59g2000hsh.googlegroups.com>,
Aknin <the.aknin@gmail.com> wrote:
>What's the maximum filesystem size you've used in production
>environment? How did the experience come out?


In the NCAR Mass Storage Service (MSS), a tape archive that is approaching
3 PB in size, we currently have a disk cache of 48 TB on 4 FC->SATA RAIDs.
I have it configured as 24 logical units of just under 2 TB each, each as
a single Irix XFS file system. When a disk in the RAID fails, the controller
can rebuild the RAID group in about 4-6 hours. Files written to the
disk cache (between 112 KB and 1 GB in size) are usually written to tape
within 24 hours. Residency in the cache varies between 30-60 days. We've
not had any problems with XFS.
Sandman

2007-05-04, 1:14 pm

One option is to go with segmented filesystems: www.ibrix.com.

Instead of having one monolithic filesystem, break it up across
several segments. Ibrix still provides a single namespace. Back up the
segments separately, recover them separately.
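
As a generic illustration of why segmenting confines damage, the sketch
below spreads paths over several independent segments behind one
namespace. This is not how Ibrix places files (the thread does not say);
hashing the full path with SHA-256 and the function names segment_for
and affected_paths are assumptions made for the example.

```python
import hashlib


def segment_for(path: str, num_segments: int) -> int:
    """Deterministically pick a segment for a path (stable across runs)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def affected_paths(paths, num_segments, failed_segment):
    """Paths that would need restoring if one segment is lost."""
    return [p for p in paths if segment_for(p, num_segments) == failed_segment]


if __name__ == "__main__":
    # Hypothetical namespace of 100,000 small files over 24 segments.
    paths = [f"/data/dir{i % 100}/file{i}" for i in range(100_000)]
    lost = affected_paths(paths, num_segments=24, failed_segment=3)
    print(f"{len(lost)} of {len(paths)} files need restoring")
```

Losing one segment then means restoring roughly one segment's share of
the files rather than the whole namespace, provided the segments really
are independent and do not share a common failure mode.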


On Apr 28, 2:21 am, Aknin <the.ak...@gmail.com> wrote:
> Following some research I've been doing on the matter across
> newsgroups and mailing lists, I'd be glad if people could share
> numbers about real life large filesystem and their experience with
> them. I'm slowly coming to a realization that regardless of
> theoretical filesystem capabilities (1TB, 32TB, 256TB or more), more
> or less across the enterprise filesystem arena people are recommending
> to keep practical filesystems up to 1TB in size, for manageability and
> recoverability.
>
> What's the maximum filesystem size you've used in production
> environment? How did the experience come out?
>
> Thanks,
> -Yaniv



Kraft Fan!

2007-05-05, 1:15 am

> Filesystems are not the problem. hardware is.
>
> I've worked with many thousands of PC disks starting with the first
> release of NTFS, almost 15 years ago. I have never seen NTFS
> "corrupt" itself. All failures were traced to dying hardware. Sh*t
> happens. I have to admit that my experience with RAID is much less.
>
> I'd like to hear of documented cases of such NTFS problems.
>
> In any case, you need a strategy for backup and recovery of your data.
> Even if the filesystem is fine, the building can burn down.


Here's a well-known NTFS example:
http://support.microsoft.com/kb/229607

I also have another from personal experience. When you
get way out of the norm, you are much more likely to
encounter problems that have nothing to do with your
hardware. I had a case open with MS in which I was told
they had internal documentation suggesting limits that,
while beyond what you'd likely
ever see in 'normal' scenarios, are not out of the realm
of possibility for poorly-designed applications... of
which I inherited one.

Put 9 figures worth of dirs/files on a single NTFS volume
in a heavily write-intensive environment and tell me all
is well. It's very scary, and shit starts to break down,
write failures, etc. I know -- I've been there, and I'm
doing it now until such time things get rewritten. I've
lost it all more than once. Takes weeks to restore. Yes,
the app is broken, but it's what I'm stuck with for now.

Also, ask your vendors (any of them) for documented
studies of heavy IO in that type of environment. None of
them have any because for the most part, they do not test
to those levels. Even MS only tests to 100M dirs/files
for milestone releases (SPs) of Win2K3, and this is the
first release where they went that high.

You want to be safe, you'd better stay below 10M
dirs/files on a single volume. That's realistically the
highest you can go and count on all of your vendors
possibly having tested to (I'm talking file systems,
file-based replication software). You uncover the bugs
their normal stress-testing doesn't. Believe me, in
trying to deal with my mess, I've done a lot of talking to
vendors. Their sales reps tell you everything is 'not a
problem.' Their technical guys get really quiet when you
ask for proof or customer examples you can speak with.



Al Dykes

2007-05-05, 1:15 am

In article <eUQ_h.19133$Pq5.11312@bignews6.bellsouth.net>,
Kraft Fan! <numberoneKraftfan@littlerockarkansasmassageschool.com> wrote:
>
>here's a well-known NTFS example:
>http://support.microsoft.com/kb/229607
>
>I also have another from personal experience. When you
>get way out of the norm, you are much more likely to
>encounter problems that have nothing to do with your
>hardware. I had a case open with MS in which I was told
>they had internal documentation suggesting limits that,



TNX, I knew there would be something and I figured it would
be a case of pushing some scale limit.

--
a d y k e s @ p a n i x . c o m
Don't blame me. I voted for Gore. A Proud signature since 2001
Bill Todd

2007-05-05, 7:14 pm

Kraft Fan! wrote:

....

> here's a well-known NTFS example:
> http://support.microsoft.com/kb/229607


Since that particular bug was fixed just over 8 years ago, something a
bit more recent might be a more convincing argument for not trusting a
reasonably mature file system.

- bill
Lon

2007-05-07, 1:13 am

Aknin proclaimed:

> Following some research I've been doing on the matter across
> newsgroups and mailing lists, I'd be glad if people could share
> numbers about real life large filesystem and their experience with
> them. I'm slowly coming to a realization that regardless of
> theoretical filesystem capabilities (1TB, 32TB, 256TB or more), more
> or less across the enterprise filesystem arena people are recommending
> to keep practical filesystems up to 1TB in size, for manageability and
> recoverability.
>
> What's the maximum filesystem size you've used in production
> environment? How did the experience come out?


I'd guess the biggest problem with very large file systems would be when
you need to run a file system check against them and don't have a few
days to run the check on 100 terabytes or so. Some scale better than
others, particularly if they are practically full. Backups and restores
can be helped by delta-style technology.

Copyright 2003 - 2008 webservertalk.com