Web Servers General Talk - How to wget download all PDF files larger than 100 Kbytes

Orak Listalavostok

2004-05-06, 3:33 pm

How do I get GNU web get (wget) to download all the PDFs
(potentially thousands) on a stated web page but ignore
any PDF smaller than a given size?

I read the fine manual (wget --help), soon arriving at:
% wget -prA.pdf http://foo.bar.com

Which means (roughly): Copy all the PDF files (A.pdf) linked from
the specified web page, recursively (r) to the default 5 levels,
along with any page requisites (p).

But how do I eliminate the copying of files smaller than
a certain size; that is, how do I tell wget to ignore PDF
files of (say) 100 Kbytes or smaller?

Orak
Alan Connor

2004-05-06, 4:33 pm

On 6 May 2004 12:34:47 -0700, Orak Listalavostok <oraklistal@yahoo.com> wrote:
>
>
> How do I get GNU web get (wget) to download all the PDFs
> (potentially thousands) on a stated web page but ignore
> any PDF smaller than a given size?
>
> I read the fine manual (wget --help), soon arriving at:
> % wget -prA.pdf http://foo.bar.com
>
> Which means (roughly): Copy all the PDF files (A.pdf) linked from
> the specified web page, recursively (r) to the default 5 levels,
> along with any page requisites (p).
>
> But how do I eliminate the copying of files smaller than
> a certain size; that is, how do I tell wget to ignore PDF
> files of (say) 100 Kbytes or smaller?
>
> Orak


There doesn't seem to be anything wget can do unaided. If you were to download
the web page with the PDF links and extract them into a file, you could do this
for each one:

$ wget --spider http://home.earthlink.net/~alanconnor/elrav1/er1.tar.gz
--13:06:51-- http://home.earthlink.net/%7Ealanco...rav1/er1.tar.gz
=> `er1.tar.gz'
Resolving home.earthlink.net... done.
Connecting to home.earthlink.net[207.217.98.29]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21,736 [application/x-tar]
200 OK

Notice the "Length: ..." header? Feed the list to wget with the --spider and
-i file options, parse out the URLs with the size you want from the output, and
feed THAT list to wget.

Ask in comp.unix.shell for help writing the script you'll need.

Not really hard with sed and/or awk.

Perhaps there is another web-tool that will do the job, but I'm not aware of it.
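
As a rough illustration of the "parse the Length" step, here is a minimal sketch. It assumes the log format shown above (a "Length:" line whose number may contain thousands separators); spider_sizes is a hypothetical name, and other wget versions may format the line differently.

```shell
#!/bin/sh
# Hypothetical helper: read a wget --spider log on stdin and print the
# numeric value of each "Length:" line, with thousands separators removed.
# Lines such as "Length: unspecified" are skipped.
spider_sizes() {
    sed -n 's/^Length: *\([0-9][0-9,]*\).*/\1/p' | tr -d ','
}

# Example using two lines from the log excerpt above:
printf '%s\n' \
    'HTTP request sent, awaiting response... 200 OK' \
    'Length: 21,736 [application/x-tar]' | spider_sizes
```

Note that wget writes its log to standard error, so the log is easier to capture with the -o logfile option than with a plain ">" redirect.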

AC

--
Pass-List -----> Block-List ----> Challenge-Response
The key to taking control of your mailbox. Design Parameters:
http://tinyurl.com/2t5kp || http://tinyurl.com/3c3ag
Challenge-Response links -- http://tinyurl.com/yrfjb
William Park

2004-05-06, 5:33 pm

In <comp.infosystems.www.servers.misc> Orak Listalavostok <oraklistal@yahoo.com> wrote:
> How do I get GNU web get (wget) to download all the PDFs
> (potentially thousands) on a stated web page but ignore
> any PDF smaller than a given size?
>
> I read the fine manual (wget --help), soon arriving at:
> % wget -prA.pdf http://foo.bar.com
>
> Which means (roughly): Copy all the PDF files (A.pdf) linked from
> the specified web page, recursively (r) to the default 5 levels,
> along with any page requisites (p).
>
> But how do I eliminate the copying of files smaller than
> a certain size; that is, how do I tell wget to ignore PDF
> files of (say) 100 Kbytes or smaller?
>
> Orak


You need to get 'Content-Length:' header. There are many ways to get
this info, eg.
lynx -head ...
echo -ne 'HEAD ... \r\n\r\n' | nc ...
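
Whichever tool fetches the headers, pulling the number out could look something like the sketch below. It assumes raw response headers arrive on standard input; content_length is a hypothetical name. Header names are case-insensitive and lines may end in CRLF, so both are normalised before matching.

```shell
#!/bin/sh
# Hypothetical helper: read raw HTTP response headers on stdin and print
# the value of the first Content-Length header, if there is one.
content_length() {
    tr -d '\r' |
        awk -F': *' 'tolower($1) == "content-length" { print $2; exit }'
}

# Example with a hand-written response (not real server output):
printf 'HTTP/1.0 200 OK\r\nContent-Type: application/pdf\r\ncontent-length: 123456\r\n\r\n' |
    content_length
```

Servers are not obliged to send Content-Length, so the function may print nothing; a caller should treat an empty result as "size unknown".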

--
William Park, Open Geometry Consulting, <opengeometry@yahoo.ca>
Linux solution/training/migration, Thin-client
moma

2004-05-08, 10:33 am

Orak Listalavostok wrote:
> How do I get GNU web get (wget) to download all the PDFs
> (potentially thousands) on a stated web page but ignore
> any PDF smaller than a given size?
>
> I read the fine manual (wget --help), soon arriving at:
> % wget -prA.pdf http://foo.bar.com
>
> Which means (roughly): Copy all the PDF files (A.pdf) linked from
> the specified web page, recursively (r) to the default 5 levels,
> along with any page requisites (p).
>
> But how do I eliminate the copying of files smaller than
> a certain size; that is, how do I tell wget to ignore PDF
> files of (say) 100 Kbytes or smaller?
>
> Orak


pavuk has
-minsize $nr option

http://www.granneman.com/techinfo/t...bcontent/pavuk/

# apt-get install pavuk



// moma
http://www.futuredesktop.org
Orak Listalavostok

2004-05-08, 10:33 am

William Park <opengeometry@yahoo.ca> wrote in message news:<2fvn8hF30l0cU1@uni-berlin.de>...
> In <comp.infosystems.www.servers.misc> Orak Listalavostok <oraklistal@yahoo.com> wrote:
>
> You need to get 'Content-Length:' header. There are many ways to get
> this info, eg.
> lynx -head ...
> echo -ne 'HEAD ... \r\n\r\n' | nc ...


Darn. If wget can't download all files above a certain size point
natively, I probably don't have the skill set to code it myself.

Just so I understand the suggested algorithm though:
a. Run wget as '--spider' on a URL to obtain the URLs contained in the URL:
% wget -pkrA.pdf --spider http://foo.bar.com > list_of_urls_in_a_url

(I added the convert (k) option, so that the resulting list of
downloadable PDFs will only contain complete urls ... assuming the
convert (k) option converts these relative URLs into complete URLs.)

b. The "list_of_urls_in_a_url" should now only contain complete URLs:
http://foo.bar.com/file1.pdf
http://foo.bar.com/file2.pdf
http://foo.bar.com/file3.pdf
...

c. Then run a shell program of some sort to find the SIZE of each file:
HEAD /file1.pdf HTTP/1.0
piping it to | {awk/sed/grep for the SIZE of each file somehow}

d. With the awk/sed/grep, strip out all lines of a lesser size than LIMIT.

e. Now run GNU wget to obtain all PDFs above a certain size LIMIT:
% wget -i list_of_urls_of_files_over_a_certain_size_in_a_url

Is that the proposed algorithm?
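
Steps d and e above might be sketched as follows. This assumes step c has already produced one "URL SIZE" pair per line (a hypothetical intermediate format, not something wget emits itself); urls_over_limit and LIMIT are illustrative names.

```shell
#!/bin/sh
# Sketch of steps d and e. Input on stdin: one "URL SIZE" pair per line,
# SIZE in bytes. Output: the URLs whose size is strictly greater than the
# limit. Lines with a missing or non-numeric size are dropped.
LIMIT=102400   # 100 Kbytes, taking 1 Kbyte = 1024 bytes (an assumption)

urls_over_limit() {
    awk -v limit="$1" '$2 ~ /^[0-9]+$/ && $2 + 0 > limit + 0 { print $1 }'
}

# Example with made-up sizes:
printf '%s\n' \
    'http://foo.bar.com/file1.pdf 2048' \
    'http://foo.bar.com/file2.pdf 250000' \
    'http://foo.bar.com/file3.pdf' | urls_over_limit "$LIMIT"
```

The resulting list can then be saved to a file and handed to wget with the -i option, as in step e.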

Orak
Alan Connor

2004-05-08, 10:33 am

On 7 May 2004 16:48:20 -0700, Orak Listalavostok <oraklistal@yahoo.com> wrote:
>
>
> William Park <opengeometry@yahoo.ca> wrote in message news:<2fvn8hF30l0cU1@uni-berlin.de>...
>
> Darn. If wget can't download all files above a certain size point
> natively, I probably don't have the skill set to code it myself.
>
> Just so I understand the suggested algorithm though:
> a. Run wget as '--spider' on a URL to obtain the URLs contained in the URL:
> % wget -pkrA.pdf --spider http://foo.bar.com > list_of_urls_in_a_url
>
> (I added the convert (k) option, so that the resulting list of
> downloadable PDFs will only contain complete urls ... assuming the
> convert (k) option converts these relative URLs into complete URLs.)
>
> b. The "list_of_urls_in_a_url" should now only contain complete URLs:
> http://foo.bar.com/file1.pdf
> http://foo.bar.com/file2.pdf
> http://foo.bar.com/file3.pdf
> ...
>
> c. Then run a shell program of some sort to find the SIZE of each file:
> HEAD /file1.pdf HTTP/1.0
> piping it to | {awk/sed/grep for the SIZE of each file somehow}
>
> d. With the awk/sed/grep, strip out all lines of a lesser size than LIMIT.
>
> e. Now run GNU wget to obtain all PDFs above a certain size LIMIT:
> % wget -i list_of_urls_of_files_over_a_certain_size_in_a_url
>
> Is that the proposed algorithm?
>
> Orak


Yeh. Pretty good. Send the output of "b.", the file, through grep first:

grep '^Length:\|^Connecting to' list_of_files... > file2

# Now we are down to two lines per entry:
#
# Connecting to home.earthlink.net[207.217.98.29]:80... connected.
# Length: 21,736 [application/x-tar]

# Now let's get both of those lines on one line:

sed '$!N;s/\n/ /' file2 > file3

limit=102400 # "whatever" size you want, in bytes; here 100 Kbytes

while read -r line; do

# extract the size number from each line as it is read, removing any commas
size=`echo "$line" | sed -n -e 's/,//g' -e 's/.*Length: *\([0-9][0-9]*\).*/\1/p'`

# if the size is larger than the limit, then strip out the IP address
# and send it to the file used in "e."; otherwise move to the next line
if [ -n "$size" ] && [ "$size" -gt "$limit" ]; then
echo "$line" | sed -n 's/.*\[\([0-9][0-9.]*\)\].*/\1/p' >> file_final
fi

done < file3

That will give a list of IP addresses to feed to wget, bypassing any need for
DNS lookup. (I 'think' wget will accept those...)

I haven't tested the above, strictly speaking, but it should work.

Sure get you started, if nothing else :-)
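
One caveat: an IP address identifies the server but not the file, so the final list probably needs the full URL instead. A possible variant is sketched below. It assumes each request in the log begins with a line of the form "--<time>--  URL", as in the wget output earlier in this thread; log_to_pairs is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read a wget --spider log on stdin and print
# "URL SIZE" for every request that reported a numeric Length.
log_to_pairs() {
    awk '
        /^--/ { url = $NF }                 # remember the most recent URL
        /^Length:/ {
            size = $2
            gsub(/,/, "", size)             # 21,736 -> 21736
            if (size ~ /^[0-9]+$/ && url != "") print url, size
        }'
}

# Example with a made-up log entry in the same layout:
printf '%s\n' \
    '--13:06:51--  http://foo.bar.com/file1.pdf' \
    'Connecting to foo.bar.com[192.0.2.1]:80... connected.' \
    'Length: 21,736 [application/pdf]' | log_to_pairs
```

Because the URL is carried along with the size, entries that lack a Length line are simply skipped rather than throwing the line pairing out of step.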

AC


William Park

2004-05-10, 5:33 pm

In <comp.infosystems.www.servers.misc> Orak Listalavostok <oraklistal@yahoo.com> wrote:
>
> Darn. If wget can't download all files above a certain size point
> natively, I probably don't have the skill set to code it myself.
>
> Just so I understand the suggested algorithm though:
> a. Run wget as '--spider' on a URL to obtain the URLs contained in the URL:
> % wget -pkrA.pdf --spider http://foo.bar.com > list_of_urls_in_a_url
>
> (I added the convert (k) option, so that the resulting list of
> downloadable PDFs will only contain complete urls ... assuming the
> convert (k) option converts these relative URLs into complete URLs.)
>
> b. The "list_of_urls_in_a_url" should now only contain complete URLs:
> http://foo.bar.com/file1.pdf
> http://foo.bar.com/file2.pdf
> http://foo.bar.com/file3.pdf
> ...
>
> c. Then run a shell program of some sort to find the SIZE of each file:
> HEAD /file1.pdf HTTP/1.0
> piping it to | {awk/sed/grep for the SIZE of each file somehow}
>
> d. With the awk/sed/grep, strip out all lines of a lesser size than LIMIT.
>
> e. Now run GNU wget to obtain all PDFs above a certain size LIMIT:
> % wget -i list_of_urls_of_files_over_a_certain_size_in_a_url
>
> Is that the proposed algorithm?


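Looks right. :-) You can do this one-by-one,

lynx -dump http://www.abc.com/ > x
vi x # step b: edit x down to one PDF URL per line
while read -r i; do
size=`lynx -dump -head "$i" | tr -d '\r' | awk '
BEGIN { IGNORECASE=1 } # gawk only
/^Content-Length:/ {print $2}' `
[[ $size -gt 100000 ]] && wget "$i"
done < x

or all at once,

lynx -dump http://www.abc.com/ > x
vi x # step b: edit x down to one PDF URL per line
while read -r i; do
# save the headers under the URL's basename, e.g. file1.pdf
lynx -dump -head "$i" | tr -d '\r' > "$(basename "$i")"
done < x
grep -i '^Content-Length:' *.pdf > y
rm *.pdf
while IFS=: read -r i header size; do
# y holds lines like "file1.pdf:Content-Length: 12345"; look the URL up in x
[[ $size -gt 100000 ]] && grep -F "/$i" x | wget -i -
done < y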

--
William Park, Open Geometry Consulting, <opengeometry@yahoo.ca>
Linux solution/training/migration, Thin-client