04-18-07 06:14 PM
Hello,
I'd like to plan a new storage solution for a system currently in
production.
The system's storage is based on code which writes many files to the
file system, with overall storage needs currently around 40TB and
expected to reach hundreds of TBs. The average file size of the system
is ~100K, which translates to ~500 million files today, and billions
of files in the future. This storage is accessed over NFS by a rack of
40 Linux blades, and is mostly read-only (99% of the activity is
reads). While I realize calling this sub-optimal system design is
probably an understatement, the design of the system is beyond my
control and isn't likely to change in the near future.
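As a rough check of the file counts above, here is a minimal sketch. It assumes decimal units (1 TB = 10^12 bytes, 100K = 100,000 bytes); the function name and the sample capacities are illustrative and not part of the system described.

```python
# Back-of-the-envelope file counts for a given capacity and average file size.
# Assumption: decimal units (1 TB = 10**12 bytes, 100K = 100 * 10**3 bytes).
AVG_FILE_BYTES = 100 * 10**3


def estimated_file_count(total_tb: float, avg_file_bytes: int = AVG_FILE_BYTES) -> int:
    """Return the approximate number of average-sized files in total_tb terabytes."""
    return int(total_tb * 10**12 // avg_file_bytes)


if __name__ == "__main__":
    # 40 TB is the current size; 100 and 300 TB are illustrative growth points.
    for tb in (40, 100, 300):
        millions = estimated_file_count(tb) / 1e6
        print(f"{tb:>4} TB -> ~{millions:,.0f} million files")
```

With these assumptions 40 TB works out to about 400 million files, the same order of magnitude as the ~500 million quoted, and a few hundred TB lands in the billions.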
The system's current storage is based on 4 VxFS filesystems, created
on SVM meta-devices each ~10TB in size. A 2-node Sun Cluster serves
the filesystems, 2 filesystems per node. Each of the filesystems
undergoes growfs as more storage is made available. We're looking for
an alternative solution, in an attempt to improve performance and
ability to recover from disasters (fsck on 2^42 files isn't practical,
and I'm getting pretty worried due to this fact - even the smallest
filesystem inconsistency will leave me lots of useless bits).
Question is - can someone with experience with large filesystems and
many small-files share his stories? Is it practical to base such a
solution on a few (8) large volumes, each with single large filesystem
in it?
Many thanks in advance for any advice,
- Yaniv
Re: Big volumes, small files
04-19-07 06:14 PM
On Apr 18, 9:48 am, the.ak...@gmail.com wrote:
> Hello,
>
> I'd like to plan a new storage solution for a system currently in
> production.
>
> The system's storage is based on code which writes many files to the
> file system, with overall storage needs currently around 40TB and
> expected to reach hundreds of TBs. The average file size of the system
> is ~100K, which translates to ~500 million files today, and billions
> of files in the future. This storage is accessed over NFS by a rack of
> 40 Linux blades, and is mostly read-only (99% of the activity is
> reads). While I realize calling this sub-optimal system design is
> probably an understatement, the design of the system is beyond my
> control and isn't likely to change in the near future.
>
> The system's current storage is based on 4 VxFS filesystems, created
> on SVM meta-devices each ~10TB in size. A 2-node Sun Cluster serves
> the filesystems, 2 filesystems per node. Each of the filesystems
> undergoes growfs as more storage is made available. We're looking for
> an alternative solution, in an attempt to improve performance and
> ability to recover from disasters (fsck on 2^42 files isn't practical,
> and I'm getting pretty worried due to this fact - even the smallest
> filesystem inconsistency will leave me lots of useless bits).
>
> Question is - can someone with experience with large filesystems and
> many small-files share his stories? Is it practical to base such a
> solution on a few (8) large volumes, each with single large filesystem
> in it?
>
> Many thanks in advance for any advice,
> - Yaniv
The best bet would be to go to a NAS appliance, e.g. EMC or NetApp.
These NAS devices can handle this load better than any Veritas
solution. The NSX model from EMC will let you go to 32 TB per file
system per data mover. It also allows for backups via snapshots. Do
not use ATA storage; try to use low-cost Fibre Channel drives. They
have a higher run rate than standard ATA. If you used a DMX3 and an
NSX you would be able to handle 2 to 3 years' worth of growth within a
two-unit environment.
Re: Big volumes, small files
04-19-07 06:14 PM
On 18 Apr 2007 06:48:30 -0700, the.aknin@gmail.com wrote:
>Hello,
>
>I'd like to plan a new storage solution for a system currently in
>production.
>
>The system's storage is based on code which writes many files to the
>file system, with overall storage needs currently around 40TB and
>expected to reach hundreds of TBs. The average file size of the system
>is ~100K, which translates to ~500 million files today, and billions
>of files in the future. This storage is accessed over NFS by a rack of
>40 Linux blades, and is mostly read-only (99% of the activity is
>reads). While I realize calling this sub-optimal system design is
>probably an understatement, the design of the system is beyond my
>control and isn't likely to change in the near future.
>
>The system's current storage is based on 4 VxFS filesystems, created
>on SVM meta-devices each ~10TB in size. A 2-node Sun Cluster serves
>the filesystems, 2 filesystems per node. Each of the filesystems
>undergoes growfs as more storage is made available. We're looking for
>an alternative solution, in an attempt to improve performance and
>ability to recover from disasters (fsck on 2^42 files isn't practical,
>and I'm getting pretty worried due to this fact - even the smallest
>filesystem inconsistency will leave me lots of useless bits).
>
>Question is - can someone with experience with large filesystems and
>many small-files share his stories? Is it practical to base such a
>solution on a few (8) large volumes, each with single large filesystem
>in it?
>
>Many thanks in advance for any advice,
> - Yaniv
For your particular situation there is a universal statement that
applies; you're screwed.
The simple answer is there is nothing out there yet that will handle
lots of small files well, except maybe a RamSAN.
I agree with carmelomcc that a proprietary NAS may be a good fit, but
I disagree that EMC makes a NAS. It's a pile of crap.
Depending on budget, NetApp may do fine. There's also BlueArc and
agami. I don't think clustered storage is going to help you in any
way.
However, if you are mostly reads what is keeping you from using a
boatload of cache?
Backups are going to be painful, no way around it. The best thing I
can think of given your current environment is using cache devices for
client performance enhancement and FlashBackup for backups and
recovery.
FlashBackup will take the entire image as a backup and not spend eons
mapping blocks to files. I'm not exactly sure how it works but people
swear by it.
~F
Re: Big volumes, small files
04-20-07 06:14 AM
Faeandar wrote:
> On 18 Apr 2007 06:48:30 -0700, the.aknin@gmail.com wrote:
>
>
> For your particular situation there is a universal statement that
> applies; you're screwed.
>
> The simple answer is there is nothing out there yet that will handle
> lots of small files well
Though I have no direct experience with it, my impression is that this
may be a workload which ZFS could handle well (I don't know what level
of maturity ZFS has attained by now, but Apple's recent embrace of it
suggests that it may be pretty solid). Its maximum block size goes to
128KB, so many/most files could fit in a single block. It grows
dynamically as required, without the 16TB (?) limit of ext[2|3]fs on
Linux (though other mature Linux file systems like JFS and XFS might be
worth considering - possibly even ReiserFS if V3 is sufficiently
stable). Sun may support ZFS as a cluster file system by now (IIRC
plans were in place to).
Any mature journaling file system with snapshots should address the fsck
and backup issues (one of the nice things about ZFS is that its
background integrity-checking and increased metadata-replication
mechanisms reduce even further the chances that the system will ever get
sufficiently hosed that fsck would be required). If directories must be
large (are all the files in just one?), you'd want a file system with
b-tree or hash-indexed directories (I think everything I mentioned above
qualifies).
Good luck,
- bill
Re: Big volumes, small files
04-20-07 06:14 PM
On Fri, 20 Apr 2007 00:40:13 -0400, Bill Todd <billtodd@metrocast.net>
wrote:
>Faeandar wrote:
>
>Though I have no direct experience with it, my impression is that this
>may be a workload which ZFS could handle well (I don't know what level
>of maturity ZFS has attained by now, but Apple's recent embrace of it
>suggests that it may be pretty solid). Its maximum block size goes to
>128KB, so many/most files could fit in a single block. It grows
>dynamically as required, without the 16TB (?) limit of ext[2|3]fs on
>Linux (though other mature Linux file systems like JFS and XFS might be
>worth considering - possibly even ReiserFS if V3 is sufficiently
>stable). Sun may support ZFS as a cluster file system by now (IIRC
>plans were in place to).
>
>Any mature journaling file system with snapshots should address the fsck
>and backup issues (one of the nice things about ZFS is that its
>background integrity-checking and increased metadata-replication
>mechanisms reduce even further the chances that the system will ever get
>sufficiently hosed that fsck would be required). If directories must be
>large (are all the files in just one?), you'd want a file system with
>b-tree or hash-indexed directories (I think everything I mentioned above
>qualifies).
>
>Good luck,
>
>- bill
ZFS is great in concept and I think they are on the right path,
however it's not yet ready for primetime imo.
The integrated integrity checking is extremely CPU-intensive. It does
not cluster yet, at least not as of 2 weeks ago. Many file systems
grow dynamically, so I would make that a check in ZFS's column. No
practical TB limit is a win if you need to go beyond 16TB in a single
FS.
I'm not sure I see how snapshots or journaling helps with backups. It
still has to map blocks to files, which is the long part of a backup.
I know when NetApp backups occur it takes the snapshot and then tries
to do a dump. If you have millions of files it can be hours before
data is actually transferred; I believe ZFS is no different.
Since the OP's IO pattern is mostly reads the cpu load may not be an
issue but writes suffer a serious penalty if you are not cpu-rich.
I've spoken with people who ran an Oracle db on ZFS and said they had
to move back until they had a T2000 or so.
~F
Re: Big volumes, small files
04-21-07 06:14 AM
Faeandar wrote:
...
> ZFS is great in concept and I think they are on the right path,
> however it's not yet ready for primetime imo.
Though (as I already noted) I don't have any direct experience with it,
my impression is that people are using it in production systems
successfully - so a description of your specific reservations would be
useful.
>
> The integrated integrity checking is extremely CPU-intensive.
I suspect that you're mistaken: IIRC it occurs as part of an
already-existing data copy operation at a very low level in the disk
read/write routines, and at close to memory-streaming speeds (i.e.,
mostly using CPU cycles that are being used anyway just to copy the data).
> It does not cluster yet, at least not as of 2 weeks ago.
It was not clear that this was a requirement in this case - but since
the OP mentioned clustering, I mentioned the soon-to-arrive capability.
> Many file systems grow dynamically so I would make that a check in
> ZFS's column.
I'm not sure they grow dynamically quite as painlessly as ZFS does:
usually, you first have to arrange to expand the underlying disk storage
at the volume-manager level, and then have to incorporate the increase
in volume size into the file system.
> No practical TB limit is a win if you need to go beyond 16TB in a
> single FS.
> I'm not sure I see how snapshots or journaling helps with backups.
I should have added the word 'respectively', I guess: journaling helps
avoid the need for fsck, and snapshots help expedite backups (by
avoiding any need for down-time while making them).
> It still has to map blocks to files, which is the long part of a backup.
> I know when NetApp backups occur it takes the snapshot and then tries
> to do a dump. If you have millions of files it can be hours before
> data is actually transferred, I believe ZFS is no different.
Actually, it is, since it allows block sizes up to 128 KB (vs. 4 KB for
WAFL IIRC, though if WAFL does a good job of defragmenting files the
difference may not be too substantial). With the OP's 100 KB file
sizes, this means that each file can be accessed (backed up) with a
single disk access, yielding a fairly respectable backup bandwidth of
about 6 MB/sec (assuming that such an access takes about 16 ms. for a
7200 rpm drive, including transfer time, and that the associated
directory accesses can be batched during the scan).
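The arithmetic behind that figure can be sketched as follows. The 100 KB file size and 16 ms access time are the numbers from the paragraph above; the function names, the decimal units, and the extrapolation to a full 40 TB backup across several drives are assumptions added for illustration only.

```python
# One disk access per file: throughput = file size / access time.
# Assumptions: decimal units (1 MB = 1000 KB, 1 TB = 10**6 MB), and drives
# that can be read fully in parallel with no other bottleneck.


def per_drive_backup_mb_per_s(file_kb: float = 100.0, access_ms: float = 16.0) -> float:
    """MB/s achieved by one drive when each file costs a single disk access."""
    return (file_kb / 1000.0) / (access_ms / 1000.0)


def backup_days(total_tb: float, drives: int, mb_per_s: float) -> float:
    """Days needed to read total_tb when `drives` drives each deliver mb_per_s."""
    total_mb = total_tb * 10**6
    seconds = total_mb / (mb_per_s * drives)
    return seconds / 86400.0


if __name__ == "__main__":
    rate = per_drive_backup_mb_per_s()
    print(f"per-drive rate: {rate:.2f} MB/s")
    # Drive counts are hypothetical; the thread does not state how many are in use.
    for drives in (1, 12, 48):
        print(f"{drives:>3} drives: {backup_days(40, drives, rate):.1f} days for 40 TB")
```

At roughly 6 MB/sec per drive a single stream would take over two months to read 40 TB, so under these assumptions the aggregate rate depends heavily on how many drives can be scanned in parallel.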
>
> Since the OP's IO pattern is mostly reads the cpu load may not be an
> issue but writes suffer a serious penalty if you are not cpu-rich.
I'm not sure why that would be the case even if the integrity-checking
*were* CPU-intensive, since the overhead to check the integrity on a
read should be just about the same as the overhead to generate the
checksum on a write. True, one must generate it all the way back up to
the system superblock for a write (one reason why I prefer a
log-oriented implementation that can defer and consolidate such
activity), but below the root unless you've got many of the
intermediate-level blocks cached you have to access and validate them on
each read (and with on the order of a billion files, my guess is that
needed directory data will quite frequently not be cached).
> I've spoken with people who ran an Oracle db on ZFS and said they had
> to move back until they had a T2000 or so.
Now in *that* application I suspect that a lot of the intermediate
blocks *are* often cached on reads, which does drive up relative write
overhead substantially (not so much due to integrity-checking per se -
since as I already noted I think that it piggybacks on a copy operation
- as due to the need to write back the entire block-tree path on each
update).
- bill
Re: Big volumes, small files
04-21-07 06:13 PM
On 2007-04-18, the.aknin@gmail.com <the.aknin@gmail.com> wrote:
[saw your question on the GPFS forum, but prefer news..]
> The system's storage is based on code which writes many files to the
> file system, with overall storage needs currently around 40TB and
> expected to reach hundreds of TBs. The average file size of the system
> is ~100K, which translates to ~500 million files today, and billions
> of files in the future. This storage is accessed over NFS by a rack of
> 40 Linux blades, and is mostly read-only (99% of the activity is
> reads). While I realize calling this sub-optimal system design is
> probably an understatement, the design of the system is beyond my
> control and isn't likely to change in the near future.
> Question is - can someone with experience with large filesystems and
> many small-files share his stories? Is it practical to base such a
> solution on a few (8) large volumes, each with single large filesystem
> in it?
My largest GPFS system has 10 TB usable, 700 GB currently used, an average
file size of 70-80KB, and 9M inodes used. Not quite as large as your current
one, but it might have some of the same properties. It's a Maildir-based
mailserver cluster, and is working very well. The only issue we've had is
that writing to directories with tens of thousands of files can be too
expensive. We will probably need to move metadata to separate volumes
to give it more cache than the data volumes to fix this.
If I was to do your huge system with GPFS, I would try to first spread
the files over as many directories as possible, and also across as many
separate file systems as possible. Spreading over many directories,
because I think GPFS is doing directory-level locking for some
operations (adding new files?), and spread over as many file systems
as possible to reduce the fsck time (GPFS does do online fsck, but
not everything can be fixed while online) and make sure that a
catastrophic file system error doesn't take down everything.
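One way to spread files over many directories and several file systems is to derive the location from a hash of the file name. The sketch below is only an illustration: the mount points, the choice of SHA-1, and the two-level directory fan-out are all assumptions, not something GPFS or the OP's application prescribes.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical mount points; eight matches the volume count the OP mentioned.
FILESYSTEMS = [f"/data/fs{i:02d}" for i in range(8)]


def placement(name: str, filesystems=FILESYSTEMS) -> PurePosixPath:
    """Map a file name to <filesystem>/<xx>/<yy>/<name> using a stable hash."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    # The first 8 hex digits pick the file system.
    fs = filesystems[int(digest[:8], 16) % len(filesystems)]
    # The next 4 hex digits give two directory levels of 256 entries each,
    # i.e. 65,536 leaf directories per file system.
    return PurePosixPath(fs) / digest[8:10] / digest[10:12] / name


if __name__ == "__main__":
    for sample in ("msg-000001.dat", "msg-000002.dat"):
        print(placement(sample))
```

With 8 file systems and 65,536 leaf directories each, 500 million files would average roughly 1,000 files per directory, well below the tens of thousands where directory writes became expensive. The mapping is deterministic, so readers can locate a file without a lookup table, but changing the number of file systems would move most files.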
I would also try to avoid NFS if possible. Having the clients mount the
fs's as GPFS clients (tcpip or SAN) will probably be much better, and
will avoid the bottlenecks and SPOFs of NFS.
-jf
Re: Big volumes, small files
04-22-07 12:13 AM
Jan-Frode Myklebust wrote:
> On 2007-04-18, the.aknin@gmail.com <the.aknin@gmail.com> wrote:
>
> [saw your question on the GPFS forum, but prefer news..]
>
>
>
> My largest GPFS system is a 10 TB usable, 700 GB currently used, average
> file size of 70-80KB, 9M inodes used. Not quite as large as your current,
> but it might have some of the same properties.. It's a Maildir-based
> mailserver-cluster, and is working very well. Only issue we'd had is
> that writing to directories with 10's of thousands files can be too
> expensive. We will probably need to move metadata to separate volumes
> to give them more cache than the data-volumes to fix this.
>
> If I was to do your huge system with GPFS, I would try to first spread
> the files over as many directories as possible, and also across as many
> separate file systems as possible. Spreading over many directories,
> because I think GPFS is doing directory-level locking for some
> operations (adding new files?), and spread over as many file systems
> as possible to reduce the fsck time (GPFS does do online fsck, but
> not everything can be fixed while online) and make sure that a
> catastrophic file system error doesn't take down everything.
>
> I would also try to avoid NFS if possible. Having the clients mount the
> fs's as GPFS clients (tcpip or SAN) will probably be much better, and
> will avoid the bottlenecks and SPOFs of NFS.
>
What features does GPFS have that will make it better than NFS? GPFS is
a good filesystem with cluster support, but I don't see anything special
that will help with the large numbers of files the OP is trying to deal
with.
Pete
Re: Big volumes, small files
04-22-07 06:13 AM
In article <1176904109.982831.76900@b75g2000hsg.googlegroups.com>,
<the.aknin@gmail.com> wrote:
>Hello,
>
>Question is - can someone with experience with large filesystems and
>many small-files share his stories? Is it practical to base such a
>solution on a few (8) large volumes, each with single large filesystem
>in it?
You've already got Sun. Why not just migrate to ZFS? ZFS
operations are constant speed regardless of size of file system or
size of directory. Lots of other neat stuff too.
Re: Big volumes, small files
04-22-07 12:12 PM
On Apr 22, 7:59 am, w...@panix.com (the wharf rat) wrote:
> In article <1176904109.982831.76...@b75g2000hsg.googlegroups.com>,
>
> <the.ak...@gmail.com> wrote:
>
>
> You've already got Sun. Why not just migrate to ZFS? ZFS
> operations are constant speed regardless of size of file system or
> size of directory. Lots of other neat stuff too.
We have seen unexplained performance issues with NFS/ZFS. We've tried
several configurations in which we ran the SPEC SFS test against
identical systems, one exporting NFS over ZFS, the other over UFS.
UFS' performance was an order of magnitude better.
While there may be several explanations, we're still pretty worried
about switching such a large installation to a relatively new
filesystem, especially with the performance questions hovering over its
head. We've tried several pieces of performance-tuning advice, including
setting zil_disable, to no avail. UFS beat ZFS every time (in other
tests, not SFS and NFS based, ZFS was superior).
Anyone else with information about ZFS/NFS is very much invited to
share his or her experience.
All times are GMT.