file compare script
| mvshk@hotmail.com 2007-08-22, 1:22 am |
| Good evening,
I'm struggling to become proficient at scripting, so I apologize in
advance if this is a glaringly easy question.
I would like to create a script that compares files with the same
prefix (which is the hostname) and uses the oldest as the first and
the most recent as the second. I want to ignore or delete any in
between. Before I use diff/comm, I will grep out the numeric data
(four digit numbers), then show numbers that are only in the most
recent file piped to a text file with the hostname as part of the
name. The input files have the following format:
hostname_yyyy-mm-dd-hh-ss.txt
I would like to process all the files for a particular host in the
directory that have two files only. Singles would pop an error
message. I have no problem with the grep and compare; what I need
help with is grouping of files and determination of oldest and
newest. I was thinking of a ls with a cut to sort the files, but am
stuck on how to store dates and whole filenames. I have pored over my
scripting and unix books, plus Google with no breakthrough. I plan to
run this in a ksh. Thanks for any help, nw.
| |
| Ed Morton 2007-08-22, 1:22 am |
OK, one step at a time. This will give you the oldest and newest files
with a given prefix "prefix", assuming no newlines in the file names:
    newest=`ls -t prefix* | head -n 1`
    oldest=`ls -t prefix* | tail -n 1`
This will tell you if they're the same file (i.e. only one exists):
    if [ "$newest" = "$oldest" ]
    then
        echo "Oh no, it's a disaster..."
        exit 1
    fi
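One caveat on the commands above: ls -t orders by modification time, which can differ from the timestamp in the name if files were copied or touched later. A minimal sketch of an alternative, assuming every name follows hostname_yyyy-mm-dd-hh-ss.txt with zero-padded fields, so that a plain text sort is also chronological. The host name web01 and the scratch directory are made up for illustration.

```shell
# Sketch: pick oldest/newest from the timestamp in the file name, not the mtime.
# Assumes names look like hostname_yyyy-mm-dd-hh-ss.txt and contain no newlines.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Hypothetical sample files for one host, created out of chronological order.
touch web01_2007-08-20-10-15.txt
touch web01_2007-08-01-09-30.txt
touch web01_2007-08-10-23-59.txt

prefix=web01

# Zero-padded date fields sort chronologically when sorted as plain text.
oldest=$(ls "${prefix}_"* | sort | head -n 1)
newest=$(ls "${prefix}_"* | sort | tail -n 1)

echo "oldest: $oldest"
echo "newest: $newest"
```

The "${prefix}_" form of the glob also keeps a host called web01 from picking up files that belong to web010.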
From there it gets vague what you want but I THINK what you want is to
just see what 4-digit numbers appear in the second file but not the
first. I also think that those 4 digit numbers appear on lines by
themselves since you say you can "grep them out". If so, you can do this:
    awk '$0 !~ /^[0-9][0-9][0-9][0-9]$/ {next}
         NR==FNR {file1[$0]; next}
         !($0 in file1)' "$oldest" "$newest" > outputfile
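To see what that awk command does, here is a small sketch run on made-up data. The file names old.txt and new.txt and their contents are assumptions for illustration only; the awk program itself is the one given above.

```shell
# Sketch: the two-file awk idiom on hypothetical sample data.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

printf '%s\n' 1001 1002 'not a number' 1003 > old.txt
printf '%s\n' 1002 1004 12345 1003 1005 > new.txt

# Rule 1 skips any line that is not exactly four digits.
# Rule 2 applies only while reading the first file (NR==FNR) and
#        remembers each number as a key of the array file1.
# Rule 3 runs for the second file and prints numbers not seen in the first.
result=$(awk '$0 !~ /^[0-9][0-9][0-9][0-9]$/ {next}
              NR==FNR {file1[$0]; next}
              !($0 in file1)' old.txt new.txt)

echo "$result"
```

With this sample data the output should be 1004 and 1005, one per line: 1002 and 1003 are in both files, and 12345 is dropped because it is not exactly four digits.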
So, let us know if the above is on the right track and, if so, what else
you're looking for and provide some sample input and expected output and
we'll take it from there.
Ed.
| |
| mvshk@hotmail.com 2007-08-22, 7:26 am |
|
Ed,
Wow, that was the breakthrough I was looking for! Thanks. Here's my
cut at making it work:
# It's safe to assume that only the correct files will be in the folder, really!
# Generate hostlist without duplicates
ls *.txt | cut -d "_" -f1 | sort -u > comparehost.txt
# Start processing loop
for prefix in `cat comparehost.txt` ; do
    # Count this host's files; -lt compares numbers ("<" would be a redirection)
    if [ `ls "${prefix}_"* | wc -l` -lt 2 ]; then
        echo " "
        echo "Only found one file to process, skipping"
        continue
    fi
    oldest=`ls -t "${prefix}_"* | tail -n 1`
    newest=`ls -t "${prefix}_"* | head -n 1`
    # comm expects sorted input, so sort the extracted numbers
    grep Summ "$oldest" | grep -v Fail | cut -d ";" -f3 | sort > old.$$
    grep Summ "$newest" | grep -v Fail | cut -d ";" -f3 | sort > new.$$
    # Compare the two files to get the delta
    comm -23 old.$$ new.$$ > "${prefix}_compare.txt"
    rm -f old.$$ new.$$
done
I decided to add the file count check before processing.
Thanks again,
nw
| |
| Ed Morton 2007-08-22, 1:23 pm |
You can simplify that a bit and get rid of all the tmp files:
    ls *.txt | cut -d "_" -f1 | sort -u |
    while IFS= read -r prefix; do
        oldest=`ls -t "${prefix}_"* | tail -n 1`
        newest=`ls -t "${prefix}_"* | head -n 1`
        if [ "$oldest" = "$newest" ]; then
            echo "\nOnly found one file to process, skipping"
            continue
        fi
        awk -F\; '!/Summ/ || /Fail/ { next }
            NR == FNR { file1[$3]; next }
            !($3 in file1)
        ' "$oldest" "$newest" > "${prefix}_compare.txt"
    done
Note that you had "comm -23 "$oldest" "$newest"" which would find the
lines that only appear in the "$oldest" file, but you'd said you wanted
to find those that only appear in "$newest" so I switched them around in
the awk script above to match what you said rather than what you coded.
If you really want to find the lines that only appear in "$oldest", then
just change the order of the 2 files on the awk command line.
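For anyone who prefers comm, the difference between the two directions can be checked on a small made-up example. The file names and numbers below are assumptions for illustration, and both inputs are already sorted, which comm requires.

```shell
# Sketch: comm -23 versus comm -13 on hypothetical, already-sorted input.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

printf '%s\n' 1001 1002 1003 > old.txt
printf '%s\n' 1002 1003 1004 > new.txt

# -23 suppresses columns 2 and 3: lines found only in the first file.
only_old=$(comm -23 old.txt new.txt)

# -13 suppresses columns 1 and 3: lines found only in the second file.
only_new=$(comm -13 old.txt new.txt)

echo "only in old: $only_old"
echo "only in new: $only_new"
```

So with the oldest file given first, comm -13 is the variant that reports numbers appearing only in the newest file.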
Note also that the above will create "${prefix}_compare.txt" files in
the same directory as the original *.txt files - you probably don't want
that so change the suffix or write the output to a different directory.
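A minimal sketch of the "different directory" option, run against made-up data. The directory name compare_out, the .cmp suffix, the host names and the semicolon-separated sample lines are all assumptions for illustration; the real input format was not posted. touch -t sets the modification times so that ls -t has something reliable to sort on.

```shell
# Sketch: same loop, but results go to a separate directory so they are
# never picked up by the *.txt glob on a later run.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

outdir=compare_out            # hypothetical output directory
mkdir -p "$outdir"

# Hypothetical input: field 3 holds the four-digit number.
printf '%s\n' 'Summ;ok;1001' 'Summ;ok;1002' > hostA_2007-08-01-09-30.txt
printf '%s\n' 'Summ;ok;1002' 'Summ;ok;1003' 'Summ;Fail;1009' 'Other;ok;1010' \
    > hostA_2007-08-20-10-15.txt
printf '%s\n' 'Summ;ok;2001' > hostB_2007-08-05-12-00.txt

# Make the modification times match the names so ls -t orders them correctly.
touch -t 200708010930 hostA_2007-08-01-09-30.txt
touch -t 200708201015 hostA_2007-08-20-10-15.txt
touch -t 200708051200 hostB_2007-08-05-12-00.txt

ls *.txt | cut -d "_" -f1 | sort -u |
while IFS= read -r prefix; do
    oldest=$(ls -t "${prefix}_"*.txt | tail -n 1)
    newest=$(ls -t "${prefix}_"*.txt | head -n 1)
    if [ "$oldest" = "$newest" ]; then
        echo "Only found one file for $prefix, skipping" >&2
        continue
    fi
    # Keep Summ lines without Fail; print those whose field 3 is new.
    awk -F\; '!/Summ/ || /Fail/ { next }
              NR == FNR { file1[$3]; next }
              !($3 in file1)' "$oldest" "$newest" > "$outdir/${prefix}.cmp"
done

cat "$outdir/hostA.cmp"
```

With this sample data, hostA.cmp should hold the single line Summ;ok;1003, and hostB is skipped because it has only one file.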
Ed.