cumulative distribution
| juliag 2005-10-24, 3:45 pm |
| With
awk -v num="$nombre" '{
acc += $1
printf "%6.2f%%\t\t %6.2f%%\t %3d\n", ($1/ num) * 100, (acc/ num) *100,
$2}' stat.txt >> stat1.txt
I can get an amazing table giving frequencies and cumulative frequencies
for each quantity (here they go from 1 to 558):
28.09% 28.09% 1
10.96% 39.04% 2
8.43% 47.47% 3
3.65% 51.12% 4
4.07% 55.20% 5
.....................................
0.56% 74.58% 17
1.40% 75.98% 18
1.12% 77.11% 19
.....................................
0.14% 99.86% 545
0.14% 100.00% 558
Looking at that table I discover that it gives the median (the number
ranked at 50%). In my table it is 3. It can also give the first
quartile (25%) which is 1 and the third quartile (75%) which here is
17.
How can I automate the identification of those special values and send
them to a file named header.txt? Can that also be done with awk?
| |
| Ed Morton 2005-10-24, 3:45 pm |
| juliag wrote:
> With
>
> awk -v num="$nombre" '{
> acc += $1
> printf "%6.2f%%\t\t %6.2f%%\t %3d\n", ($1/ num) * 100, (acc/ num) *100,
> $2}' stat.txt >> stat1.txt
>
> I can get amazing table giving frequencies and cumulative frequencies
> for each quantities (here they go from 1 to 558):
>
> 28.09% 28.09% 1
> 10.96% 39.04% 2
> 8.43% 47.47% 3
> 3.65% 51.12% 4
> 4.07% 55.20% 5
> ....................................
> 0.56% 74.58% 17
> 1.40% 75.98% 18
> 1.12% 77.11% 19
> ....................................
> 0.14% 99.86% 545
> 0.14% 100.00% 558
>
> Looking at that table I discover that it gives the median (the number
> ranked at 50%). In my table it is 3. It can also give the first
> quartile (25%) which is 1 and the third quartile (75%) which here is
> 17.
>
> How can I automatize the identification those special values and send
> them to a file named header.txt? Will it be with also with awk?
>
Try this (untested):
awk -v num="$nombre" '{
acc += $1
freq = ($1 / num) * 100
cumFreq = (acc / num) * 100
printf "%6.2f%%\t\t %6.2f%%\t %3d\n", freq, cumFreq, $2}
}
cumFreq <= 25 { firstQ = $2 }
cumFreq <= 50 { median = $2 }
cumFreq <= 75 { thirdQ = $2 }
END{ printf "%3.2f%% %3.2f%% %3.2f%%\n", firstQ, median, thirdQ > "header.txt" }' stat.txt > stat1.txt
Regards,
Ed.
| |
| Chris F.A. Johnson 2005-10-24, 3:45 pm |
| On 2005-10-22, juliag wrote:
> With
>
> awk -v num="$nombre" '{
> acc += $1
> printf "%6.2f%%\t\t %6.2f%%\t %3d\n", ($1/ num) * 100, (acc/ num) *100,
> $2}' stat.txt >> stat1.txt
>
> I can get amazing table giving frequencies and cumulative frequencies
> for each quantities (here they go from 1 to 558):
>
> 28.09% 28.09% 1
> 10.96% 39.04% 2
> 8.43% 47.47% 3
> 3.65% 51.12% 4
> 4.07% 55.20% 5
> ....................................
> 0.56% 74.58% 17
> 1.40% 75.98% 18
> 1.12% 77.11% 19
> ....................................
> 0.14% 99.86% 545
> 0.14% 100.00% 558
>
> Looking at that table I discover that it gives the median (the number
> ranked at 50%). In my table it is 3. It can also give the first
> quartile (25%) which is 1 and the third quartile (75%) which here is
> 17.
>
> How can I automatize the identification those special values and send
> them to a file named header.txt? Will it be with also with awk?
This script, from Chapter 5 of my book, finds the median of a set of
numbers. It is easily modified to fit into your script, and
easily extended to find the quartile boundaries.
## Sort the list obtained from one or more files or the standard input
sort -n ${1+"$@"} |
awk '{x[NR] = $1} ## Store all the values in an array
END {
## Find the middle number
num = int( (NR + 1) / 2 )
## If there are an odd number of values
## use the middle number
if ( NR % 2 == 1 ) print x[num]
## otherwise average the two middle numbers
else print (x[num] + x[num + 1]) / 2
}'
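As a rough illustration of the extension to quartiles mentioned above, the sketch below reuses the same sort-then-index idea. It is not taken from the book: the helper med(), the sample values, and the convention of taking each quartile as the median of the lower or upper half (middle value excluded when the count is odd) are all assumptions, and other quartile conventions give slightly different answers.

```shell
#!/bin/sh
# Hypothetical sample values, one per line; replace printf with real input.
quartiles=$(printf '%s\n' 7 1 3 9 5 11 13 15 | sort -n |
awk '
# Median of the sorted values x[lo..hi] (assumed helper, not from the book)
function med(lo, hi,    n, mid) {
    n = hi - lo + 1
    mid = lo + int((n - 1) / 2)
    if (n % 2 == 1) return x[mid]
    return (x[mid] + x[mid + 1]) / 2
}
{ x[NR] = $1 }                        ## Store all the values in an array
END {
    half = int(NR / 2)                ## Size of each half, middle value excluded
    ## First quartile, median, third quartile
    print med(1, half), med(1, NR), med(NR - half + 1, NR)
}')
printf '%s\n' "$quartiles"
```

For the eight sample values this prints 4 8 12.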
--
Chris F.A. Johnson <http://cfaj.freeshell.org>
==================================================================
Shell Scripting Recipes: A Problem-Solution Approach, 2005, Apress
<http://www.torfree.net/~chris/books/cfaj/ssr.html>
| |
| juliag 2005-10-24, 3:45 pm |
| Almost at first try!
There is one '}' too many in the middle. Also, since the quantities are
integers, I slightly changed the format to:
awk -v num="$nombre" '{
acc += $1
freq = ($1 / num) * 100
cumFreq = (acc / num) * 100
printf "%6.2f%%\t\t %6.2f%%\t %3d\n", freq, cumFreq, $2}
cumFreq <= 25 { firstQ = $2 }
cumFreq <= 50 { median = $2 }
cumFreq <= 75 { thirdQ = $2 }
END{ printf "%3d %3d %3d\n", firstQ, median, thirdQ > "header.txt" }' stat.txt > stat1.txt
but the resulting header.txt is:
0 3 17
It should be:
1 3 17
| |
| Janis Papanagnou 2005-10-24, 3:45 pm |
| juliag wrote:
> Almost at first try!
>
> There is a '}' too much in the middle. Also since the quantities are
> integers, I slightly changed the format for:
>
> awk -v num="$nombre" '{
> acc += $1
> freq = ($1 / num) * 100
> cumFreq = (acc / num) * 100
> printf "%6.2f%%\t\t %6.2f%%\t %3d\n", freq, cumFreq, $2}
>
> cumFreq <= 25 { firstQ = $2 }
> cumFreq <= 50 { median = $2 }
> cumFreq <= 75 { thirdQ = $2 }
> END{ printf "%3d %3d %3d\n", firstQ, median, thirdQ > "header.txt" }'
> stat.txt > stat1.txt
>
> but the resulting header.txt is:
>
> 0 3 17
>
> It should be:
> 1 3 17
>
Why not 1 4 17 - what's the logic behind 1 3 17?
Are you looking for the values <= limit (as Ed assumed; result 0 3 17)
or are you looking for the values, where abs(limit-value) is at minimum
(result 1 4 17)?
Janis
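The two readings described above can be compared side by side. The following is only a sketch: the seven-line sample table (count, then value, totalling 100 observations) is made up, and the variable names and the abs() helper are assumptions, not part of the scripts posted in this thread.

```shell
#!/bin/sh
# Hypothetical sample: "count value" per line, 100 observations in total.
result=$(awk -v num=100 '
function abs(x) { return x < 0 ? -x : x }
{
    acc += $1
    cumFreq = (acc / num) * 100
}
# Reading 1: last value whose cumulative frequency is <= the limit
cumFreq <= 25 { lowQ1 = $2 }
cumFreq <= 50 { lowMed = $2 }
cumFreq <= 75 { lowQ3 = $2 }
# Reading 2: value whose cumulative frequency is closest to the limit
NR == 1 || abs(cumFreq - 25) < bestQ1  { bestQ1 = abs(cumFreq - 25);  nearQ1 = $2 }
NR == 1 || abs(cumFreq - 50) < bestMed { bestMed = abs(cumFreq - 50); nearMed = $2 }
NR == 1 || abs(cumFreq - 75) < bestQ3  { bestQ3 = abs(cumFreq - 75);  nearQ3 = $2 }
END {
    printf "at-or-below: %d %d %d\n", lowQ1, lowMed, lowQ3
    printf "closest:     %d %d %d\n", nearQ1, nearMed, nearQ3
}' <<'EOF'
28 1
11 2
8 3
4 4
23 5
3 6
23 7
EOF
)
printf '%s\n' "$result"
```

On this sample the first reading gives 0 3 5 (no row is at or below 25%, so the first quartile stays unset and prints as 0) and the second gives 1 4 5, the same pattern as 0 3 17 versus 1 4 17 in the question above.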
| |
| Ed Morton 2005-10-24, 3:45 pm |
| juliag wrote:
> Almost at first try!
>
> There is a '}' too much in the middle. Also since the quantities are
> integers, I slightly changed the format for:
>
> awk -v num="$nombre" '{
> acc += $1
> freq = ($1 / num) * 100
> cumFreq = (acc / num) * 100
> printf "%6.2f%%\t\t %6.2f%%\t %3d\n", freq, cumFreq, $2}
>
> cumFreq <= 25 { firstQ = $2 }
> cumFreq <= 50 { median = $2 }
> cumFreq <= 75 { thirdQ = $2 }
> END{ printf "%3d %3d %3d\n", firstQ, median, thirdQ > "header.txt" }'
> stat.txt > stat1.txt
>
> but the resulting header.txt is:
>
> 0 3 17
>
> It should be:
> 1 3 17
>
Then just change the calculation to something like this to find the
number closest to the quarter rather than the closest under the quarter
as I originally had (this is just the first thing that springs to mind,
there may be better ways but hopefully you can figure it out given this
example):
awk -v num="$nombre" '{
acc += $1
freq = ($1 / num) * 100
cumFreq = (acc / num) * 100
printf "%6.2f%%\t\t %6.2f%%\t %3d\n", freq, cumFreq, $2}
# squared distances stand in for abs(); NR == 1 seeds the first row
NR == 1 || (cumFreq - 25)^2 < (firstQcf - 25)^2 {firstQcf=cumFreq; firstQ=$2}
NR == 1 || (cumFreq - 50)^2 < (mediancf - 50)^2 {mediancf=cumFreq; median=$2}
NR == 1 || (cumFreq - 75)^2 < (thirdQcf - 75)^2 {thirdQcf=cumFreq; thirdQ=$2}
END{ printf "%3d %3d %3d\n", firstQ, median, thirdQ > "header.txt" }' stat.txt > stat1.txt
Regards,
Ed.
| |
| juliag 2005-10-24, 3:45 pm |
| because 25% falls in 1, and 50% falls in 3
| |
| juliag 2005-10-24, 3:45 pm |
| Thank you a lot Chris!
I am going to put it in my stat tool box.
| |
| Chris F.A. Johnson 2005-10-24, 3:45 pm |
| On 2005-10-22, juliag wrote:
> Thank you a lot Chris!
For what?
This is Usenet, not a web forum (though it is also bastardized on
several web sites). You cannot know whether the reader can see or
has seen the previous posts, or, if they have been seen, whether
the reader remembers what they were about.
When using groups.google.com to reply to a Usenet article (better
to use a real newsreader), click on "show options" at the top of
the article, then click on the "Reply" at the bottom of the
article headers (not at the bottom of the article). This will
quote the previous message in the accepted manner. Trim the parts
of it that are not relevant to your follow-up.
> I am going to put it in my stat tool box.
| |
| Janis Papanagnou 2005-10-24, 3:45 pm |
| Please quote context if you post to Usenet.
juliag wrote:
> because 25% falls in 1, and 50% falls in 3
>
Well, you'll know what you need.
| |
| juliag 2005-10-24, 3:45 pm |
|
Janis Papanagnou wrote:
> Please quote context if you post to Usenet.
>
> juliag wrote:
>
>
> Well, you'll know what you need.
You are completely right! I see the light now!
It needs to be 1 4 18, the approach from Ed identifies the value
preceding the correct value.
I am one giant step closer.
Thank you!
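One possible way to get the first value whose cumulative frequency reaches each limit, which is the 1 4 18 reading, is to record a value only the first time its limit is reached. This is a sketch under assumptions: the sample table is invented (count, then value, totalling 100 observations), and the gotQ1, gotMed and gotQ3 flags are hypothetical names that do not appear in Ed's script.

```shell
#!/bin/sh
# Hypothetical sample: "count value" per line, 100 observations in total.
header=$(awk -v num=100 '
{
    acc += $1
    cumFreq = (acc / num) * 100
}
# Keep only the first value whose cumulative frequency reaches each limit
!gotQ1  && cumFreq >= 25 { firstQ = $2; gotQ1 = 1 }
!gotMed && cumFreq >= 50 { median = $2; gotMed = 1 }
!gotQ3  && cumFreq >= 75 { thirdQ = $2; gotQ3 = 1 }
END { printf "%3d %3d %3d\n", firstQ, median, thirdQ }' <<'EOF'
28 1
11 2
8 3
4 4
23 5
3 6
23 7
EOF
)
printf '%s\n' "$header"
```

On the sample this prints "  1   4   6". Redirecting the END printf to "header.txt", as in the earlier scripts, would write the file instead. Comparing counts directly (for example acc * 4 >= num for the first quartile) would avoid any floating-point rounding at an exact boundary.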
| |
|
| On 22 Oct 2005 17:08:51 -0700, "juliag" <bertrille@bigfoot.com> wrote:
>
>Janis Papanagnou wrote:
>
>You are completely right! I see the light now!
>It needs to be 1 4 18, the approach from Ed identifies the value
>preceding the correct value.
>
>I am one giant step closer.
Add code (to Ed's) to catch and test the next line after each comparison,
perhaps something like:
cumFreq <= 25 { firstQ = $2; firstQcf = cumFreq }
!have_firstQ && cumFreq >= 25 {
have_firstQ++; if (25 - firstQcf > cumFreq - 25) firstQ = $2 }
....
Grant.
>
>Thank you!
| |
| Ed Morton 2005-10-24, 3:45 pm |
|
Chris F.A. Johnson wrote:
> On 2005-10-22, juliag wrote:
>
>
>
> For what?
>
> This is Usenet, not a web forum (though it is also bastardized on
> several web sites). You cannot know whether the reader can see or
> has seen the previous posts, or, if they have been seen, whether
> the reader remembers what they were about.
>
> When using groups.google.com to reply to a Usenet article (better
> to use a real newsreader), click on "show options" at the top of
> the article, then click on the "Reply" at the bottom of the
> article headers (not at the bottom of the article). This will
> quote the previous message in the accepted manner. Trim the parts
> of it that are not relevant to your follow-up.
Chris - could you put that blurb on your web site, then we could just
point people to it.
Ed.
| |