Performance problem (I/O)
| Fredrik 2004-03-15, 8:34 am |
| Hello!
We have an AdvFS file system with ~8 million small files, and it is
still growing.
It is all on a SAN (HSG80 with striped disks).
The weekly NetWorker full backup takes about 20 hours, and it seems
that the OS is waiting for I/O.
The file system is also LSM-mirrored between two different SANs.
A tar command takes about the same time, so it does not seem to be a
NetWorker problem.
Output from collect:
# DISK Statistics
#DSK NAME   B/T/L  R/S  RKB/S  W/S  WKB/S   AVS    AVW  ACTQ   WTQ   %BSY
   0 dsk19  -      136   1091  176   1467  5.27  30.76  1.65  3.87  76.69
   1 dsk7   -      136   1097  174   1457  4.94  25.20  1.54  3.14  79.09
Does anybody have any tips or clues on how to speed it up, or what the
problem is?
| |
| Bob Harris 2004-03-15, 9:34 pm |
| In article <ff81d3b0.0403150527.2b6004aa@posting.google.com>,
paltskallen@hotmail.com (Fredrik) wrote:
> Hello!
>
> We have an AdvFS file system with ~8 million small files, and it is
> still growing.
> It is all on a SAN (HSG80 with striped disks).
>
> The weekly NetWorker full backup takes about 20 hours, and it seems
> that the OS is waiting for I/O.
> The file system is also LSM-mirrored between two different SANs.
> A tar command takes about the same time, so it does not seem to be a
> NetWorker problem.
>
> Output from collect:
> # DISK Statistics
> #DSK NAME   B/T/L  R/S  RKB/S  W/S  WKB/S   AVS    AVW  ACTQ   WTQ   %BSY
>    0 dsk19  -      136   1091  176   1467  5.27  30.76  1.65  3.87  76.69
>    1 dsk7   -      136   1097  174   1457  4.94  25.20  1.54  3.14  79.09
>
> Does anybody have any tips or clues on how to speed it up, or what the
> problem is?
That seems about right. 8 million files most likely means a minimum of
about 8 million seeks, with each seek averaging about 8 milliseconds.
Actually, you are possibly getting better than 8 milliseconds per
seek, but figure that there are more than 8 million seeks. There are
seeks to look up the filename (most likely a lot of this is cached) and
a seek to get the metadata for the file being opened (though in some
cases the metadata page from a previous file open is cached). There is
a seek from the metadata to the user data when you read the small file.
If the small file is fragmented, it might be in more than one 8K page,
or in an 8K page plus a less-than-8K frag in the frag file (you might
get some caching if another frag shared the same page and it is still
in the cache). There is a seek for each non-contiguous extent. And
because you read the file, its access time changed, so there is a seek
back to the metadata to update the time stamp. And because that is a
metadata modification, the change is transaction logged, so there will
be a seek to the log (this is a lazy update, so multiple updates might
be consolidated into a single write to disk).
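The arithmetic behind "that seems about right" can be checked with a small sketch. This is a back-of-the-envelope estimate only: the function name seek_hours is made up for illustration, and the seeks-per-file count is an assumption, since the real number depends on caching and fragmentation as described above.

```shell
#!/bin/sh
# Rough estimate of the hours a backup spends seeking.
#   $1 = number of files
#   $2 = assumed seeks per file (hypothetical; depends on caching)
#   $3 = assumed average seek time in milliseconds
seek_hours() {
    awk -v files="$1" -v seeks="$2" -v ms="$3" 'BEGIN {
        # milliseconds -> seconds -> hours
        printf "%.1f\n", files * seeks * ms / 1000 / 3600
    }'
}

seek_hours 8000000 1 8   # one seek per file: prints 17.8
seek_hours 8000000 2 8   # e.g. metadata plus data: prints 35.6
```

With one 8 ms seek per file the estimate already lands close to the reported 20 hours, which suggests the backup is seek-bound rather than bandwidth-bound.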
If you don't care about access times, you could use the noatimes mount
option.
Place the log file on its own volume (switchlog). Create a volume a bit
larger than the log file, and then addvol and switchlog.
If your files are typically greater than 8K but less than 100K to 150K,
and are not typically an even multiple of 8K, and you are not
concerned about the extra disk space, you could disable the frag file so
that the minimum file size is 8K and files grow in multiples of 8K (see
man chfsets on more recent versions of AdvFS; otherwise it is the
"patch the kernel with dbx" method. Either way it only affects new
files or files that are extended; existing files remain frag'ed).
NOTE: This can increase the amount of storage a lot. For example, if
you have mostly 1K files you would increase your storage needs 7-fold
by disabling frags. But if your files are more like 100K, then maybe
only 10% more space.
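To make the space trade-off concrete, here is a minimal sketch of the per-file overhead. It assumes sizes in whole kilobytes, that a fragged file stores its tail in a frag rounded to 1K, and that an unfragged file is rounded up to the next 8K; the function name frag_overhead_pct is hypothetical, and the sketch ignores the upper size limit beyond which AdvFS does not frag a file.

```shell
#!/bin/sh
# Percent of extra storage a file needs when frags are disabled.
#   $1 = file size in whole kilobytes (assumption for simplicity)
frag_overhead_pct() {
    awk -v size="$1" 'BEGIN {
        tail = size % 8                 # K left over after full 8K pages
        with_frags = size               # full pages plus a 1K-granular frag
        if (tail == 0)
            without = size              # already a multiple of 8K
        else
            without = size - tail + 8   # tail rounded up to a full 8K page
        printf "%d\n", (without - with_frags) * 100 / with_frags
    }'
}

frag_overhead_pct 1     # 1K file grows to 8K: prints 700
frag_overhead_pct 100   # 100K file grows to 104K: prints 4
```

The 1K case matches the 7-fold increase mentioned above, while files around 100K lose only a few percent, which is in line with the rough 10% figure.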
I'm sure someone else might have ideas.
Bob Harris
| |
| Fredrik 2004-03-18, 9:48 am |
| Thank you for the information; all those seeks must take a long time.
But I created a new domain with exactly the same configuration and
copied all the files to that domain (with tar) to run some more
performance tests. Tar'ing a directory on the old domain takes
about 10 minutes, but on the new domain it takes only ~1 minute!
defragment on the old domain says:
defragment -nv store_domain
defragment: Gathering data for domain 'store_domain'
Current domain data:
  Extents:                   114860
  Files w/extents:            90821
  Avg exts per file w/exts:    1.26
  Aggregate I/O perf:           93%
  Free space fragments:       44246
                <100K    <1M   <10M   >10M
  Free space:      5%    66%    22%     7%
  Fragments:    12065  29918   2245     18
This doesn't look all that bad, does it?
Can anyone explain the huge difference in performance?
| |
| Bob Harris 2004-03-18, 10:35 pm |
| In article <ff81d3b0.0403180401.73357d13@posting.google.com>,
paltskallen@hotmail.com (Fredrik) wrote:
> Thank you for the information; all those seeks must take a long time.
> But I created a new domain with exactly the same configuration and
> copied all the files to that domain (with tar) to run some more
> performance tests. Tar'ing a directory on the old domain takes
> about 10 minutes, but on the new domain it takes only ~1 minute!
When you restored the file system using tar, you placed all the files in
the exact order that you would then back them up. You positioned the
metadata for each file next to the file that would precede it. As a
result you maximized your backup performance.
But it is unlikely that you will actually create 8 million files in the
exact same order that they will be backed up. Generally, files get
created and deleted in different directories in a random order. This
means reusing metadata from some other deleted file. It means
reusing storage for the files from a deleted file.
Bob Harris
| |
| Peter da Silva 2004-03-19, 6:37 pm |
| In article <harris-F51A96.21182918032004@cacnews.cac.cpqcorp.net>,
Bob Harris <harris@zk3.dec.com> wrote:
> When you restored the file system using tar, you placed all the files in
> the exact order that you would then back them up. You positioned the
> metadata for each file next to the file that would precede it. As a
> result you maximized your backup performance.
AdvFS doesn't use an analog of the UFS cylinder group technique to keep
metadata near the file data?
That would explain some of the performance problems we see in an app that
writes a lot of small files.
--
I've seen things you people can't imagine. Chimneysweeps on fire over the roofs
of London. I've watched kite-strings glitter in the sun at Hyde Park Gate. All
these things will be lost in time, like chalk-paintings in the rain. `-_-'
Time for your nap. | Peter da Silva | Har du kramat din varg, idag? 'U`
| |
| Bob Harris 2004-03-19, 8:35 pm |
| In article <c3fijl$os9$1@jeeves.eng.abbnm.com>,
peter@abbnm.com (Peter da Silva) wrote:
> In article <harris-F51A96.21182918032004@cacnews.cac.cpqcorp.net>,
> Bob Harris <harris@zk3.dec.com> wrote:
>
> AdvFS doesn't use an analog of the UFS cylinder group technique to keep
> metadata near the file data?
No.
> That would explain some of the performance problems we see in an app that
> writes a lot of small files.
Not having cylinder groups may not have anything to do with small-file
write performance. It could be due to the way the last 8K allocation
of a small file is managed. For space efficiency, files under 150K tend
to have the last 8K of the file stored in a frag of from 1K to 7K in
length. But while the file is being written, a full 8K is allocated.
When the file is closed, the size of the file is checked, and if a 10%
saving in space can be obtained by turning the last 8K into a frag,
then a frag is allocated, the end of the file is copied to the frag,
and the original 8K is deallocated.
All of this results in additional disk I/O for small files when they are
closed.
In the newer versions of Tru64 UNIX, there is a chfsets option to
disable this and make all files created from that point forward
allocate storage in multiples of 8K. For older versions there is a
global variable that can be patched in the kernel to disable frag'ing
of files.
In the 8 million file case, if the small files are evenly distributed
between 1K and any size that is 8K or greater, then turning off file
frag'ing would increase the storage usage for that file system by about
32 gigabytes. At one time I would have choked on such a number.
Today, I have more storage than that on my laptop. I will not attempt
to place a value on this for you or your company, as laptop storage is
not the same as SCSI, RAID, or SAN storage, which tends to come in
smaller sizes and cost more. But it is still a much lower cost than
when I was the system manager for a VAX-11/780 :-) Times change.
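The 32 gigabyte figure can be reproduced with a short sketch. It assumes every file has a tail that would otherwise sit in a frag, that frag sizes are spread evenly over 1K to 7K, and that gigabytes are counted in decimal (1,000,000 K); the function name estimate_extra_gb is hypothetical.

```shell
#!/bin/sh
# Rough extra storage, in decimal gigabytes, when frags are disabled.
#   $1 = number of files, each assumed to end in a 1K..7K frag
estimate_extra_gb() {
    awk -v files="$1" 'BEGIN {
        sum = 0
        for (k = 1; k <= 7; k++)
            sum += 8 - k            # K wasted when a k-K frag becomes 8K
        avg_kb = sum / 7            # averages out to 4K per file
        printf "%.1f\n", files * avg_kb / 1000000
    }'
}

estimate_extra_gb 8000000   # prints 32.0
```

An average of 4K wasted per file across 8 million files gives the roughly 32 gigabytes quoted above; a real file system will differ, because not every file ends in a frag.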
Bob Harris
| |
| Bob Harris 2004-03-19, 8:35 pm |
| In article <c3g297$19vl$1@jeeves.eng.abbnm.com>,
peter@abbnm.com (Peter da Silva) wrote:
> In article <harris-DCF0F6.18530419032004@cacnews.cac.cpqcorp.net>,
> Bob Harris <harris@zk3.dec.com> wrote:
>
> UFS does that kind of thing as well, and for the case of small files (<8K)
> the write should be completed and the close performed before anything hits
> the disk... I wouldn't expect actual disk writes for the 8k blocks.
If UFS serves your needs for small files, then by all means use UFS.
Both file systems are fully supported.
>
> Huh, I'll see what that does. You know the variable name?
AdvfsDoFrag
Change it to zero using dbx.
This will _ONLY_ affect files created or extended from this point
forward. It will not affect existing files.
Bob Harris
| |
| Peter da Silva 2004-03-20, 8:32 pm |
| In article <harris-214E79.19555619032004@cacnews.cac.cpqcorp.net>,
Bob Harris <harris@zk3.dec.com> wrote:
> In article <c3g297$19vl$1@jeeves.eng.abbnm.com>,
> peter@abbnm.com (Peter da Silva) wrote:
> If UFS serves your needs for small files, then by all means use UFS.
> Both file systems are fully supported.
Um, yeh, can you say whether AdvFS actually does these redundant writes
or not, though?
--
I've seen things you people can't imagine. Chimneysweeps on fire over the roofs
of London. I've watched kite-strings glitter in the sun at Hyde Park Gate. All
these things will be lost in time, like chalk-paintings in the rain. `-_-'
Time for your nap. | Peter da Silva | Har du kramat din varg, idag? 'U`
| |
| Bob Harris 2004-03-21, 12:33 am |
| In article <c3ikhh$j9g$1@jeeves.eng.abbnm.com>,
peter@abbnm.com (Peter da Silva) wrote:
> In article <harris-214E79.19555619032004@cacnews.cac.cpqcorp.net>,
> Bob Harris <harris@zk3.dec.com> wrote:
>
>
> Um, yeh, can you say whether AdvFS actually does these redundant writes
> or not, though?
I'm not sure I understand the question. I think, based on what you have
extracted above, you are wondering if AdvFS defers the 8K allocation
until after the frag has been created, and just what I/O really occurs.
Is this correct?
Anyway, AdvFS will allocate the file's last 8K page before it knows
whether the file is going to be closed and the last 8K turned into a
frag. The allocation involves modifying the SBM (Storage Bit Map) and
inserting the storage extent information into the file's Mcell extent
map (the allocation might have been part of a larger allocation for the
file), and a transaction will be written to the log. The SBM and Mcell
updates are done as lazy writes, so multiple updates to the same pages
may occur before they are flushed. The log transaction will get written
to disk, again in a somewhat lazy mode, but not nearly as lazy as the
SBM or Mcell writes.
The data for the last page may or may not get written to disk. Depends
on how long between the last write and the closing of the file. My
guess is that creating a small file is done, create, write, close in
short order, so I would guess that the data in the last 8K allocation of
the file is still sitting in the cache.
The _EXACT_ order that some of the following happens may be wrong, but
what happens is essentially correct.
When the close happens and the decision is made to frag the last 8K of
the file, a transaction is started. A frag of the correct size is
allocated from the frag file (ls -l /mount_point/.tags/1). If the frag
file does not have a free frag of the correct size, then more storage is
allocated to the frag file (SBM updated, frag file's Mcell updated,
information added to the transaction). Allocating the frag involves
updating free pointers in the frag file and adding information to the
transaction. The last 8K of the file is removed from the file's Mcell,
and the pointer to the allocated frag is stored in the file's Mcell
(all of this adds more information to the transaction). The data from
the last 8K is copied to the allocated frag. Because the frag file is
considered metadata, all of that data is also added to the transaction.
The 8K is freed by clearing its bit in the SBM (and more information is
added to the transaction). The file's Mcell, the SBM page, and the frag
in the frag file will all eventually get flushed to disk in a lazy
fashion, and if there are other modifications to the affected pages
before that happens, write I/Os will be saved. The transaction will be
flushed to the log; it may not happen right away, and some additional
transactions may go along with it, but it will go in short order.
So not every action results in its own I/O. However, there are factors
that can detract from the idea of maximum caching and lazy updates of
the metadata. In the case of a well-aged file system where files have
been created, deleted, created again, etc., the available free space
may not be contiguous. The free Mcells for new files may not be next to
each other. The free frags for fragging files may not be next to each
other. So the opportunity to have multiple updates to the same page of
metadata storage for multiple file creates may not occur. So while the
I/O to flush a modified metadata page may not happen right away, it may
not get any company during the flush either. So it is still going to be
real I/O that will affect I/O bandwidth.
Possible real I/Os: the log transaction flush (the best bet for always
having company from other operations); the flush of the SBM (better
than average odds that multiple allocations will happen on the same
pages before a flush, and at least the allocation of the last 8K and
the freeing of it may occur before the flush happens); the flush of the
Mcell (on a well-aged file system, this is more than likely to not have
company when it is flushed); and the flush of the frag (on a well-aged
file system, this is unlikely to have company when it is flushed).
And I didn't discuss reads to bring metadata into memory. On a
well-aged file system the available Mcells and frags in the frag file
may not already be in memory, so reads may occur to get them.
So on a well-aged file system, say we average 1 seek/read and 1
seek/write for metadata, and 1 seek/write because the file data has to
be written to disk eventually. So frag'ing the file on close might
involve 3 I/Os (though the data write had to occur eventually anyway).
And as I've detailed above, it could be worse if there is very little
sharing of the I/O by multiple operations. And this does not include
the operations needed to create the file in the first place and
allocate the initial storage.
But because of the dynamic nature of the system, and not knowing how
much sharing can occur within the same I/O, I cannot tell you exactly
how much I/O will be needed to frag a file.
Bob Harris
| |
| Bob Harris 2004-03-22, 1:35 pm |
| In article <c3n2v0$1etc$2@jeeves.eng.abbnm.com>,
peter@abbnm.com (Peter da Silva) wrote:
> In article <harris-779E4E.23395720032004@cacnews.cac.cpqcorp.net>,
> Bob Harris <harris@zk3.dec.com> wrote:
>
> Yes, and thanks for the detailed and interesting explanation. I think I
> followed it all, though of course I might be completely misinterpreting
> something so if I go ahead and turn off fragging and it makes things
> worse I won't blame you. 
The most important thing to know about turning off small file frag'ing
is that it can consume a lot more disk space. If you have the disk
space, then it doesn't matter.
Also remember that turning off small file frag'ing only affects new
files or existing files that are extended. All other small files with
frag will keep their frags.
Attached is a script that may give you an estimate of how much extra
space will be needed if all the small files on the file system were
created without frags. It is brand new code, so there could be bugs in
it.
Bob Harris
#!/bin/ksh
#
# estimate_extra_storage_needed_if_no_small_file_frags.ksh /mount_point
#
# Estimates how much extra storage would be needed if every small file
# on the file system had been created without a frag.
#
find "$1" -xdev -type f -ls 2>/dev/null |\
awk '
BEGIN {
    ONE_K = 1024
    SEVEN_K = (ONE_K * 7)
    EIGHT_K = (ONE_K * 8)
    FRAG_PERCENT = 5
    # Files at or above this size are skipped by this estimate.
    SIZE_LIMIT = ((EIGHT_K - 1) * 100) / FRAG_PERCENT
}
# The size is normally field 7 of "find -ls" output; fall back to
# field 6 when field 7 contains no digits.
$7 !~ /[0-9]+/ { size = $6 }
$7 ~ /[0-9]+/ { size = $7 }
{
    fragsize = size % EIGHT_K
    if (fragsize == 0 || fragsize > SEVEN_K || size >= SIZE_LIMIT) {
        next    # not a candidate
    }
    # Frag size in K, rounded up to the next whole K.
    frag_storage = int((fragsize + (ONE_K-1)) / ONE_K)
    # Extra K needed if the frag became a full 8K page.
    increased_storage = 8 - frag_storage
    total_storage_increase += increased_storage
}
END {
    print "Disabling small file fragging will need approx"
    if ( total_storage_increase < (10*1024) ) {
        print total_storage_increase, "Kilobytes more storage"
    }
    else if ( total_storage_increase < (10*1024*1024) ) {
        total_storage_increase /= 1024
        print total_storage_increase, "Megabytes more storage"
    }
    else {
        total_storage_increase /= (1024*1024)
        print total_storage_increase, "Gigabytes more storage"
    }
}
'
exit