Commodity hardware file server
|
|
| Gianni Mariani 2004-09-04, 5:45 pm |
|
I am trying to see what you can build with off-the-shelf hardware.
I have a dual Opteron (248, 2.2GHz) system with 2GB RAM and 4 drives.
The system has
1. A "WDC WD2500JB" IDE drive connected to the on-board controller of a
Tyan S2882 mobo (hda)
2. A "WDC WD2500JB" IDE drive connected to a Promise (?model?) IDE
controller (hde)
3. 2x "Maxtor 6B300S0" SATA drive connected to the (SI) on-board SATA
controller (sda and sdb)
The system runs Fedora Core 2 (AMD-64).
These tests were all done with "hdparm" (yes, not very scientific). I
probably should find a better test tool, so take these numbers with a
grain of salt.
(The controller on the system board)
/dev/hda:
Timing buffered disk reads: 170 MB in 3.00 seconds = 56.66 MB/sec
(The promise controller)
/dev/hde:
Timing buffered disk reads: 160 MB in 3.02 seconds = 52.99 MB/sec
(A SATA controller)
/dev/sda:
Timing buffered disk reads: 180 MB in 3.00 seconds = 59.97 MB/sec
/dev/sdb:
Timing buffered disk reads: 182 MB in 3.02 seconds = 60.19 MB/sec
Two simultaneous hdparm tests on the SATA drives (sda and sdb)
138 MB in 3.04 seconds = 45.46 MB/sec
140 MB in 3.04 seconds = 46.04 MB/sec
** Note the significant difference in aggregate performance between
individual and simultaneous tests (120MB/sec down to about 90MB/sec, a 25% drop).
Two simultaneous hdparm tests on the hda and sda drives.
184 MB in 3.02 seconds = 61.00 MB/sec
170 MB in 3.02 seconds = 56.28 MB/sec
** Note that these two drives are able to run at full speed - no contention
Two simultaneous hdparm tests on the hde and sda drives
180 MB in 3.01 seconds = 59.79 MB/sec
156 MB in 3.01 seconds = 51.80 MB/sec
** Note again there is no drop in performance
Two simultaneous hdparm tests on the hda and hde drives
172 MB in 3.01 seconds = 57.08 MB/sec
166 MB in 3.07 seconds = 54.10 MB/sec
** Note - no performance drop for the two IDE drives (on different
controllers)
3 Simultaneous tests sda, hda and hde.
182 MB in 3.02 seconds = 60.33 MB/sec
152 MB in 3.02 seconds = 50.31 MB/sec
172 MB in 3.11 seconds = 55.39 MB/sec
** slight drop in performance - aggregate 165MB/sec
3 Simultaneous tests sda, sdb and hda
136 MB in 3.01 seconds = 45.23 MB/sec
138 MB in 3.05 seconds = 45.24 MB/sec
172 MB in 3.09 seconds = 55.69 MB/sec
** Note the drop mimics the sda/sdb test - aggregate 145MB/sec
4 simultaneous disk read tests (all four drives).
146 MB in 3.03 seconds = 48.13 MB/sec
138 MB in 3.04 seconds = 45.34 MB/sec
138 MB in 3.07 seconds = 44.94 MB/sec
172 MB in 3.11 seconds = 55.35 MB/sec
** Note that hde appears to have dropped to 48MB/sec but hda remained
close to peak performance - aggregate 190MB/sec.
Conclusion. It seems (in theory) that I can drive a GigE interface
at full speed for reads if I can read from 3 drives simultaneously. The
question is, can I saturate 2 GigE interfaces with a few minor mods?
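
The aggregates quoted in the notes above can be re-derived from the hdparm lines themselves. The following is only a sketch: the 125 MB/sec GigE figure is an assumption (1 Gbit/sec divided by 8, ignoring protocol overhead), and the names `TIMING`, `RUNS` and `aggregate_mb_per_sec` are made up for illustration.

```python
import re

# Matches hdparm timing lines such as "138 MB in 3.04 seconds = 45.46 MB/sec".
TIMING = re.compile(r"(\d+) MB in\s+([\d.]+) seconds =\s+([\d.]+) MB/sec")

# Assumed ceiling of one GigE link: 1 Gbit/sec / 8, no protocol overhead.
GIGE_MB_PER_SEC = 125.0


def aggregate_mb_per_sec(report: str) -> float:
    """Sum the MB/sec figure of every hdparm timing line found in `report`."""
    return sum(float(m.group(3)) for m in TIMING.finditer(report))


# Simultaneous runs copied from the post above.
RUNS = {
    "sda+sdb": """
        138 MB in 3.04 seconds = 45.46 MB/sec
        140 MB in 3.04 seconds = 46.04 MB/sec""",
    "sda+hda+hde": """
        182 MB in 3.02 seconds = 60.33 MB/sec
        152 MB in 3.02 seconds = 50.31 MB/sec
        172 MB in 3.11 seconds = 55.39 MB/sec""",
    "all four": """
        146 MB in 3.03 seconds = 48.13 MB/sec
        138 MB in 3.04 seconds = 45.34 MB/sec
        138 MB in 3.07 seconds = 44.94 MB/sec
        172 MB in 3.11 seconds = 55.35 MB/sec""",
}

for name, report in RUNS.items():
    total = aggregate_mb_per_sec(report)
    # Express the sequential-read aggregate as a multiple of one GigE link.
    print(f"{name:12s} {total:7.2f} MB/sec = {total / GIGE_MB_PER_SEC:.2f} x GigE")
```

These are sequential-read figures only, so the ratio is an upper bound on what the disks could feed to the network under that assumption.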
| |
| Jesper Monsted 2004-09-05, 2:45 am |
| Gianni Mariani <gi2nospam@mariani.ws> wrote in
news:chcq4o$1qg@dispatch.concentric.net:
> Conclusion. It seems like (in theory) that I can drive a gige
> interface at full speed reading if I can simultaneously read from 3
> drives. The question is, can I saturate 2 gige interfaces with a few
> minor mods ?
Doubtful. In a file server, the IO is rarely sequential and when the IOs
are random, ATA drives drop a lot in performance. I'd be surprised if you
could even drive one GigE.
--
/Jesper Monsted
| |
| Gianni Mariani 2004-09-05, 5:45 pm |
| Jesper Monsted wrote:
> Gianni Mariani <gi2nospam@mariani.ws> wrote in
> news:chcq4o$1qg@dispatch.concentric.net:
>
>
>
>
> Doubtful. In a file server, the IO is rarely sequential and when the IOs
> are random, ATA drives drop a lot in performance. I'd be surprised if you
> could even drive one GigE.
>
But that depends on how much time the drive needs to spend seek()ing.
The files are 1-2 gig in size - the problem requirement is that it can
handle 2000+ clients.
11ms seek * 2000 = 22 seconds spent seeking (worst case in a round-robin
scenario).
If each client reads at 500kb/sec then the client reads 1.1MB in a 22
second period.
Hence, in theory, if I can have 2.2GB of buffer memory, it can support
2000 clients reading at 500kb/sec and each client reading a different
section off a single drive. I can probably fit 8 300GB drives in an
enclosure so that's 2.4TB. That also means that the effective seek time
went from 11ms to 1.4ms (best case).
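
A minimal sketch of the buffer arithmetic above. The 1.1MB figure is reproduced if "500kb/sec" is read as 500 kilobits per second and converted with a 10-bits-per-byte rule of thumb; that reading is an assumption, not something the post states. With 8 bits per byte the same steps give about 1.375MB per client and 2.75GB in total. All names are illustrative.

```python
def buffer_plan(clients=2000, seek_s=0.011, client_kbit_per_s=500,
                bits_per_byte=10, drives=8):
    """Worst-case round-robin sizing: one seek per client per cycle.

    bits_per_byte=10 is an assumed rule of thumb that matches the 1.1MB
    figure in the post; pass 8 for a strict conversion.
    """
    cycle_s = seek_s * clients                       # time spent only seeking
    bytes_per_s = client_kbit_per_s * 1000 / bits_per_byte
    per_client_bytes = bytes_per_s * cycle_s         # data one client drains per cycle
    return {
        "cycle_s": cycle_s,
        "per_client_mb": per_client_bytes / 1e6,
        "total_gb": per_client_bytes * clients / 1e9,
        # Spreading the clients over several drives divides the seek cost.
        "effective_seek_ms": seek_s * 1000 / drives,
    }


for bits in (10, 8):
    plan = buffer_plan(bits_per_byte=bits)
    print(bits, {k: round(v, 3) for k, v in plan.items()})
```

The model counts seek time only; the time spent actually transferring each client's chunk comes on top of the 22 seconds.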
A scsi drive has the problem that even the highest performance drives
just don't have the capacity.
Seagate ST3146807 :
147GB
seek 4.7ms.
Sustained 38-68 MB/sec
8 drives -> 1.2TB
So the trade-off is seek time, capacity (density) and cost/GB.
4.7ms vs 11ms
147GB vs 300GB
$3/GB vs $0.80/GB.
It appears the transfer rates are similar (according to the specs).
The only way to really tell is to test it.
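
The trade-off above can be put into numbers with a short sketch. It uses only the figures quoted in the post (8 drives per enclosure, 300GB at $0.80/GB versus 147GB at $3/GB); the `Drive` class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    name: str
    capacity_gb: int
    seek_ms: float
    usd_per_gb: float

    @property
    def price_usd(self) -> float:
        return self.capacity_gb * self.usd_per_gb


# Figures as quoted in the post above.
ATA = Drive("300GB ATA", 300, 11.0, 0.80)
SCSI = Drive("147GB SCSI", 147, 4.7, 3.00)


def enclosure(drive: Drive, bays: int = 8):
    """Capacity (GB) and cost (USD) of one enclosure filled with `drive`."""
    return bays * drive.capacity_gb, bays * drive.price_usd


def drives_needed(drive: Drive, target_gb: int) -> int:
    """Whole drives required to reach at least `target_gb`."""
    return math.ceil(target_gb / drive.capacity_gb)


for d in (ATA, SCSI):
    gb, usd = enclosure(d)
    print(f"{d.name}: {gb} GB for ${usd:.0f}, seek {d.seek_ms} ms")

# How many SCSI drives would match the capacity of the 8-bay ATA enclosure?
print(drives_needed(SCSI, enclosure(ATA)[0]), "SCSI drives to match ATA capacity")
```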
| |
| Thanatos 2004-09-28, 2:50 am |
| Whoa, Gianni, it's not even *close* to being that simple...
Wider sharing loads do NOT scale linearly. This is because each active
client requires handling, and process scheduling means that this doesn't
scale linearly due to the overheads involved in switching contexts. NFS
loads are one thing, if you are planning to share via CIFS (using SAMBA) the
situation is even worse due to session state handling overheads.
i.e. after a point the performance degradation per additional user itself
increases with each additional user - the response time degradation curve is
exponential.
Have a look at some of the graphs on http://www.spec.org/sfs97r1/results/
and you'll see what I mean.
Your application sounds interesting - large transfers AND high concurrency.
You'll most likely find that your simple Linux box is just plain not going
to cope. At the very least you should add some more RAM - since it's 64 bit
you could probably go to 8GB fairly cheaply, and play with the fs caching...
Realistically, this is probably a job for a NetApp or an EMC Celerra.
"Gianni Mariani" <gi2nospam@mariani.ws> wrote in message
news:chfd5v$1oj@dispatch.concentric.net...
> Jesper Monsted wrote:
>
> But that depends on how much time the drive needs to spend seek()ing.
>
> The files are 1-2 gig in size - the problem requirement is that it can
> handle 2000+ clients.
>
> 11ms seek * 2000 = 22 seconds spent seeking (worst case in a round-robin
> scenario).
>
> If each client reads at 500kb/sec then the client reads 1.1MB in a 22
> second period.
>
> Hence, in theory, if I can have 2.2GB of buffer memory, it can support
> 2000 clients reading at 500kb/sec and each client reading a different
> section off a single drive. I can probably fit 8 300GB drives in an
> enclosure so that's 2.4TB. That also means that the effective seek time
> went from 11ms to 1.4ms (best case).
>
> A scsi drive has the problem that even the highest performance drives
> just don't have the capacity.
>
> Seagate ST3146807 :
> 147GB
> seek 4.7ms.
> Sustained 38-68 MB/sec
>
> 8 drives -> 1.2TB
>
> So the trade-off is seek time, capacity (density) and cost/GB.
>
> 4.7ms vs 11ms
> 147GB vs 300GB
> $3/GB vs $0.80/GB.
>
> It appears the transfer rates are similar (according to the specs).
>
> The only way to really tell is to test it.
>
| |
| Gianni Mariani 2004-10-03, 2:45 am |
| Thanatos wrote:
> Whoa, Gianni, it's not even *close* to being that simple...
>
> Wider sharing loads do NOT scale linearly. This is because each active
> client requires handling, and process scheduling means that this doesn't
> scale linearly due to the overheads involved in switching contexts. NFS
> loads are one thing, if you are planning to share via CIFS (using SAMBA) the
> situation is even worse due to session state handling overheads.
Yep.
There is a proprietary server where nthreads=ncpu. In theory there is
no (significant) context switching.
The cache effects may be mitigated, the epoll and sendfile APIs can be
used, and there are no copies of data (at least in the application).
>
> i.e. after a point the performance degradation per additional user itself
> increases with each additional user - the response time degradation curve is
> exponential.
That's a characteristic of the thread/process per connection model.
In the single thread per cpu model you get an interesting performance
increase due to being able to do more work per connection as the system
is loaded and hence automatically getting advantages from increased
cache affinity. In general, the performance does eventually degrade,
but nowhere near as bad as in the thread per connection model.
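
For illustration, here is a rough sketch of the event-driven model described above: one thread multiplexes all connections and pushes file data with sendfile, so the application never copies it. This is not the proprietary server mentioned in the post. It assumes Linux (where Python's `selectors.DefaultSelector` is epoll-backed and `os.sendfile` accepts a socket as the output descriptor), and `serve_once`, `Stream` and `CHUNK` are made-up names.

```python
import os
import selectors
import socket

CHUNK = 1 << 20  # max bytes handed to sendfile per readiness event (assumed)


class Stream:
    """Per-connection state: source file descriptor, offset, bytes left."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)
        self.offset = 0
        self.remaining = os.fstat(self.fd).st_size


def serve_once(listener, path, max_clients):
    """Send `path` to `max_clients` connections from one thread, then return."""
    sel = selectors.DefaultSelector()  # epoll on Linux
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ, None)
    accepted = finished = 0
    while finished < max_clients:
        for key, _events in sel.select():
            if key.data is None:  # the listening socket: a client is waiting
                try:
                    conn, _addr = listener.accept()
                except BlockingIOError:
                    continue
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_WRITE, Stream(path))
                accepted += 1
                if accepted == max_clients:
                    sel.unregister(listener)
                continue
            conn, st = key.fileobj, key.data
            try:
                # The kernel moves the bytes from the page cache to the socket.
                sent = os.sendfile(conn.fileno(), st.fd, st.offset,
                                   min(CHUNK, st.remaining))
            except BlockingIOError:
                continue  # socket buffer full; wait for the next event
            except (BrokenPipeError, ConnectionResetError):
                sent = 0  # peer went away; fall through to cleanup
            st.offset += sent
            st.remaining -= sent
            if sent == 0 or st.remaining == 0:
                sel.unregister(conn)
                conn.close()
                os.close(st.fd)
                finished += 1
    sel.close()
```

A real server would run one such loop per CPU and would not stop after a fixed number of clients; the `max_clients` limit only keeps the sketch easy to exercise.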
See http://www.kegel.com/c10k.html for a discussion on how to manage
large numbers of simultaneous connections efficiently. The real
question is whether you can sustain 200mb/sec from N hard drives to the NICs
across 2000 simultaneous connections each reading at 1Mb/sec, or even
get into this ballpark. Or more accurately, just what can you do with
a box like the one I described earlier. About 2 years ago I was able to
saturate a GigE NIC with an AMD Athlon 1.2GHz. I suspect (need to test)
I can saturate 2 GigE NICs with this box, but disk I/O was not something
that factored into that test (streaming the same file). In this case
each connection could be a different file. There are algorithms that
can minimize the time spent seeking (basically very large read-ahead).
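
Why a very large read-ahead helps can be seen from a simple model: if every chunk read costs one seek plus the transfer time, the share of time lost to seeking shrinks as the chunk grows. The sketch below assumes the 11ms seek quoted earlier in the thread and a 55MB/sec sequential rate (roughly what the hdparm runs showed); both defaults and the function names are illustrative only.

```python
def effective_mb_per_sec(chunk_mb, seek_s=0.011, media_mb_per_sec=55.0):
    """Throughput of one drive when each chunk read is preceded by one seek."""
    return chunk_mb / (seek_s + chunk_mb / media_mb_per_sec)


def chunk_for_efficiency(eff, seek_s=0.011, media_mb_per_sec=55.0):
    """Chunk size (MB) at which the drive reaches `eff` of its sequential rate.

    Solves chunk / (seek + chunk / rate) = eff * rate for chunk.
    """
    if not 0 < eff < 1:
        raise ValueError("eff must be strictly between 0 and 1")
    return eff * seek_s * media_mb_per_sec / (1 - eff)


for chunk in (0.064, 0.256, 1.0, 4.0, 16.0):
    print(f"{chunk:7.3f} MB chunks -> {effective_mb_per_sec(chunk):5.1f} MB/sec")

print(f"90% of sequential rate needs ~{chunk_for_efficiency(0.9):.2f} MB per read")
```

The cost is memory: each active connection needs a buffer of roughly one chunk, which is where the buffer sizing discussed earlier in the thread comes back in.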
>
> Have a look at some of the graphs on http://www.spec.org/sfs97r1/results/
> and you'll see what I mean.
>
> Your application sounds interesting - large transfers AND high concurrency.
> You'll most likely find that your simple Linux box is just plain not going
> to cope. At the very least you should add some more RAM - since it's 64 bit
> you could probably go to 8GB fairly cheaply, and play with the fs caching...
Yep - max out the memory and load as much into memory as possible (either
by mapping files to memory or by tweaking the fs cache or even tweaking the
read-ahead code in the kernel).
xfs does have real-time functionality and this may also be something to
investigate.
>
> Realistically, this is probably a job for a NetApp or an EMC Celerra.
I might need 300 of these boxes one day ...
| |
| Maxim S. Shatskih 2004-10-03, 7:45 am |
| > > scale linearly due to the overheads involved in switching contexts. NFS
NFS also has session state, it just keeps it in "lockd" process.
--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
maxim@storagecraft.com
http://www.storagecraft.com
| |
| Toomas Soome 2004-10-03, 5:46 pm |
| Maxim S. Shatskih <maxim@storagecraft.com> wrote:
>
> NFS also has session state, it just keeps it in "lockd" process.
>
Only if you have set a lock, and even then it's state for the lock, not for a session.
NFS file operations are stateless; only mount and lock are stateful.
toomas
--
Si jeunesse savait, si vieillesse pouvait.
[If youth but knew, if old age but could.]
-- Henri Estienne