File modification monitoring of many, many files on Solaris
10-28-06 12:15 AM
I have a data collection system running on Solaris 9 that is
constantly writing data into a very large number of files, hundreds of
thousands of them. Each file contains one day of data. The current "day
file" has data appended to it throughout the day (a 512-byte write
every 15 seconds might be average); older files are normally, but not
certainly, idle. There is a fixed structure but no fixed file names,
and new data streams can arrive at any time, creating new directories
and files. A handful of different software packages write the data, and
the overall design is not easily changed.

To export this data with as little latency as possible, I have written
software that recursively scans a specified directory tree and monitors
the modification time and size of each file found. Given how the data
is written, the scanner distinguishes three file types: active, idle
and quiet. Active files have been modified recently and are scanned
every pass; idle files were last modified hours ago and are scanned
every X passes; quiet files were last modified days ago and are never
scanned again after the initial discovery.

The problem is that this scanning is pretty resource intensive and not
all that fast with nearly a million files. I've optimized about as far
as I am able (the active/idle/quiet file types, the file list in a
binary tree, etc.). The calls to stat(2) are what now take all the
time.

What I'd like, of course, is some service that tells me when any of
those files change so that I don't have to actively stat all of them.
Something like what inotify does, but Solaris doesn't seem to have it,
and I don't think Linux's inotify was really designed for monitoring so
many files. There is some effort to add file change events to the event
framework for OpenSolaris, but that's probably a ways off and, like
inotify, probably will not be designed for what I need.

I've thought of other ideas, like creating an "opaque" file system that
I could mount over any file system I want and have it notify me of
changes via a device or something. That would be ideal.

I would appreciate any thoughts on the feasibility of such a
development and how to get started, or any other harebrained ideas that
might lead me in a new direction.

cheers,
Chad

Re: File modification monitoring of many, many files on Solaris
10-31-06 12:31 AM
On Fri, 27 Oct 2006 11:27:00 -0700, Chad.Public wrote:
> [...]

IMO your application performs several tasks:

0) detecting new files (+directories?) in the tree
1) (re-)scanning + stat()ing files
2) processing/sending "hot" files.

This could all be put into a queue/worklist that your program
uses and maintains. A tree or list of a few hundred thousand
items can be kept in memory, probably using some LRU scheme.

0) is pretty straightforward: creating a new file updates the
mtime of its directory, so you only need to stat() the
directories to detect new files (a changed directory then needs
a dirscan, but that has to be done anyway).

2) is the simplest: just do the read/copy, update the 'last change'
field in the worklist item, and reschedule it as 'hot'.

1) is the most difficult. You could walk the LRU chain, starting from
the 'hot' end, stat() the files, and change state accordingly.
Since you want the 'hot' files to be examined more frequently,
you'll probably need two (or more) cursors into the LRU.
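
To make task 0) concrete, here is a minimal sketch in Python (the real scanner is presumably C; the idea carries over to stat(2) and readdir(3C)). The function name find_new_files and the two containers are made up for illustration. It stat()s only directories and rescans a directory only when its mtime differs from the one recorded on the previous pass. It assumes the file system's timestamp granularity is fine enough that a create right after a scan still changes the mtime.

```python
import os

def find_new_files(dir_mtimes, known_files):
    """Return paths of files that appeared since the previous pass.

    dir_mtimes:  dict, directory path -> mtime (ns) seen at its last scan,
                 or None for a directory that has never been scanned
    known_files: set of file paths already in the worklist
    """
    new_files = []
    pending = list(dir_mtimes)
    while pending:
        dirpath = pending.pop()
        try:
            mtime = os.stat(dirpath).st_mtime_ns
        except FileNotFoundError:
            dir_mtimes.pop(dirpath, None)      # directory was removed
            continue
        if dir_mtimes.get(dirpath) == mtime:
            continue                           # no entry added or removed
        dir_mtimes[dirpath] = mtime
        with os.scandir(dirpath) as entries:   # the unavoidable dirscan
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    if entry.path not in dir_mtimes:
                        dir_mtimes[entry.path] = None
                        pending.append(entry.path)
                elif entry.path not in known_files:
                    known_files.add(entry.path)
                    new_files.append(entry.path)
    return new_files
```

Seeding dir_mtimes with {root: None} makes the first call walk the whole tree; later calls cost one stat() per directory plus a rescan of the directories that changed.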

Probably the easiest way to accomplish all this is to have separate
queues/lists for the separate tasks. Moving nodes between queues
is no harder (WRT locking, etc.) than moving them inside one big queue.
You could even give the (1) stat() tasks separate queues for separate
sampling intervals.
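
A rough sketch of the "separate queues for separate sampling intervals" idea, again in Python. The tier names, age limits and intervals in TIERS are invented numbers, and run_pass and tier_for are hypothetical helpers, not anything from an existing tool. Each queue holds (path, last seen mtime) pairs; a queue is only walked on passes where it is due, and every file is re-queued according to its current age.

```python
import os
from collections import deque

# Hypothetical tiers: (name, maximum age in seconds, scan every N passes).
# An interval of 0 means "never rescan", like the quiet files in the question.
TIERS = [("hot", 3600, 1), ("warm", 86400, 20), ("cold", float("inf"), 0)]

def tier_for(age_seconds):
    """Pick the first tier whose age limit covers this file."""
    for name, max_age, _ in TIERS:
        if age_seconds <= max_age:
            return name
    return TIERS[-1][0]

def run_pass(queues, pass_no, now, stat_mtime=lambda p: os.stat(p).st_mtime):
    """stat() the files in every queue that is due; return the changed paths."""
    # Snapshot the queue lengths first, so a file moved into another due
    # queue during this pass is not examined twice.
    due = [(name, len(queues[name])) for name, _, every in TIERS
           if every and pass_no % every == 0]
    changed = []
    for name, count in due:
        queue = queues[name]
        for _ in range(count):
            path, last_mtime = queue.popleft()
            try:
                mtime = stat_mtime(path)
            except OSError:
                continue              # file vanished: drop it from the worklist
            if mtime != last_mtime:
                changed.append(path)  # hand over to the export task
            queues[tier_for(now - mtime)].append((path, mtime))
    return changed
```

Moving a node between tiers is just a popleft() on one deque and an append() on another, which is the point made above: it costs no more than reordering one big list.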

You'll need to make some provisions WRT persistence between runs
( := after crashes ;-)
The most straightforward way would be to put the {mtime,pathname}
tuples in a big ASCII file, one record per line. Read it in, sort it
(or select/skip), put the records in the queue(s), maybe rename/rewrite
the workfile, and start doing the 3 tasks above.
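
One possible shape for that workfile, sketched in Python. The tab separator, the ".tmp" suffix and the function names are my assumptions, and the format only works if path names contain no newlines. Writing to a temporary file and renaming it over the old one means a crash in mid-write leaves the previous workfile intact.

```python
import os

def save_worklist(entries, path):
    """Write (mtime, pathname) records, one per line, then rename into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        for mtime, name in entries:
            f.write(f"{mtime}\t{name}\n")
        f.flush()
        os.fsync(f.fileno())      # make sure the data is on disk first
    os.replace(tmp, path)         # atomic rename on POSIX file systems

def load_worklist(path):
    """Return the records sorted newest first; no file means a first run."""
    entries = []
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                mtime, sep, name = line.rstrip("\n").partition("\t")
                if sep and mtime.isdigit():   # skip malformed lines
                    entries.append((int(mtime), name))
    except FileNotFoundError:
        pass
    entries.sort(reverse=True)
    return entries
```

After a restart, the sorted list can be cut at whatever age limits the queues use, so the newest records go straight into the 'hot' queue.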

If I understand your posting correctly, you are not terribly worried
about reading partially written/unflushed files. I suppose you have
addressed that problem elsewhere.

BTW: most of this can probably be done using "find dirname -mtime xxx",
but that will probably be too slow (because it will stat() every
file).

You could even consider a hack involving copies of the directory tree
with hard links in them, pointing to the original files. Since the
directory entries point to the same inode, you can just as easily stat
the files from the shadow directory, and move the linked entries to
another 'ghost' copy of the dirtree once they become too cold to
monitor frequently (this is a kind of spool directory system).
I admit: this would be an ugly hack.
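
For what it's worth, the link juggling could look roughly like this in Python. The helper names and the layout of the shadow trees are hypothetical. It assumes the shadow trees live on the same file system as the data (hard links cannot cross file systems) and that the writers append in place; if a writer unlinks and recreates a file, the shadow link keeps pointing at the old inode.

```python
import os

def shadow_path(shadow_root, data_root, path):
    """Map a path under data_root to the same relative path under shadow_root."""
    return os.path.join(shadow_root, os.path.relpath(path, data_root))

def add_link(shadow_root, data_root, path):
    """Hard-link a data file into the shadow tree and return the link's path."""
    link = shadow_path(shadow_root, data_root, path)
    os.makedirs(os.path.dirname(link), exist_ok=True)
    if not os.path.lexists(link):
        os.link(path, link)       # second directory entry, same inode
    return link

def move_link(link, from_root, to_root):
    """Move a shadow entry to another shadow tree, e.g. from hot to cold."""
    target = shadow_path(to_root, from_root, link)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    os.rename(link, target)       # the data file itself is not touched
    return target
```

The scanner then walks only the small 'hot' shadow tree on every pass; a stat() on a link there returns the size and mtime of the original file.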
HTH,
AvK