mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2
Web Server forum
Back To The Forum Home!Search!Private Messaging System

Web Server Talk Web Server Talk > Web Servers reviews > Apache Server configuration support > Apache Mod-Python > mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2




Pages (2): [1] 2 »   Last Thread   Next Thread Next
  Show Printable Version Email this Page Subscribe to this Thread      Post New Thread    Post A Reply      

    mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2  
Alexis Marrero


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-07-05 01:47 AM

All,

The current 3.1 mod_python implementation of
mod_python.util.StorageField.read_to_boudary reads as follows:

203      def read_to_boundary(self, req, boundary, file):
204          delim = ""
205          line = req.readline()
206          sline = line.strip()
207          last_bound = boundary + "--"
208          while line and sline != boundary and sline !=
last_bound:
209              odelim = delim
210              if line[-2:] == "\r\n":
211                  delim = "\r\n"
212                  line = line[:-2]
213              elif line[-1:] == "\n":
214                  delim = "\n"
215                  line = line[:-1]
216              file.write(odelim + line)
217              line = req.readline()
218              sline = line.strip()

As we have discussed previously:
http://www.modpython.org/pipermail/...rch/017754.html
http://www.modpython.org/pipermail/...rch/017756.html
http://www.modpython.org/pipermail/...ber/019460.html

This triggered couple of changes in mod_python 3.2 Beta which reads
as follows:
33  # Fixes memory error when upload large files such as 700+MB
ISOs.
34  readBlockSize = 65368
35
...
225     def read_to_boundary(self, req, boundary, file):
...
234         delim = ''
235         lastCharCarried = False
236         last_bound = boundary + '--'
237         roughBoundaryLength = len(last_bound) + 128
238         line = req.readline(readBlockSize)
239         lineLength = len(line)
240         if lineLength < roughBoundaryLength:
241             sline = line.strip()
242         else:
243             sline = ''
244         while lineLength > 0 and sline != boundary and sline !
= last_bound:
245             if not lastCharCarried:
246                 file.write(delim)
247                 delim = ''
248             else:
249                 lastCharCarried = False
250             cutLength = 0
251             if lineLength == readBlockSize:
252                 if line[-1:] == '\r':
253                     delim = '\r'
254                     cutLength = -1
255                     lastCharCarried = True
256             if line[-2:] == '\r\n':
257                 delim += '\r\n'
258                 cutLength = -2
259             elif line[-1:] == '\n':
260                 delim += '\n'
261                 cutLength = -1
262             if cutLength != 0:
263                 file.write(line[:cutLength])
264             else:
265                 file.write(line)
266             line = req.readline(readBlockSize)
267             lineLength = len(line)
268             if lineLength < roughBoundaryLength:
269                 sline = line.strip()
270             else:
271                 sline = ''

This function has a mysterious bug in it... For some files which I
could disclose (one of them been the PDF file for Apple's Pages User
Manual in Italian) the uploaded file in the server ends up with the
same length but different sha512 (the only digest that I'm using).
The problem is a '\r' in the middle of a chunk of data that is much
larger than readBlockSize.

Anyhow, I wrote a new function, which I believe is much simpler, and
test it with thousands and thousands of different files and so far it
seems to work fine.  It reads as follows:

def read_to_boundary(self, req, boundary, file):
''' read from the request object line by line with a maximum size,
until the new line starts with boundary
'''
previous_delimiter = ''
while 1:
line = req.readline(1<<16)
if line.startswith(boundary):
break

if line.endswith('\r\n'):
file.write(previous_delimiter + line[:-2])
previous_delimiter = '\r\n'

elif line.endswith('\r') or line.endswith('\n'):
file.write(previous_delimiter + line[:-1])
previous_delimiter = line[-1:]

else:
file.write(previous_delimiter + line)
previous_delimiter = ''

Let me know any comments on it and if you test it and fails please
also let me know. I don't have subversion account neither I don't
know how to use it thus this email.

/amn

 ________________________________________
_______
Mod_python mailing list
Mod_python@modpython.org
http://mailman.modpython.org/mailma...info/mod_python








[ Post a follow-up to this message ]



    Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2  
Alexis Marrero


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-07-05 01:47 AM

I don't have a function that creates the files but the I can point
you to a file that has the problem, ironically is "Unix Haters
Handbook"  Well, at least is not the Python HH....

http://research.microsoft.com/~daniel/uhh-download.html

It's MD5 is 9e8c42be55aac825e7a34d448044d0fe. I don't know what it
ends up been after upload with read_to_boundary().

When you use the function to copy the file you will see that the
digest will be e45979254297b0ece9c182a789d7966e.

I have other 5 files out of 78546 files that I'm testing it against
that have the same issues, coincidentally there are all PDF files.
Here is the script that I was testing it with.

def read_to_boundary(self, req, boundary, file):
''' read from the request object line by line with a maximum size,
until the new line starts with boundary
'''
previous_delimiter = ''
while 1:
line = req.readline(1<<16)
if line.startswith(boundary):
break

if line.endswith('\r\n'):
file.write(previous_delimiter + line[:-2])
previous_delimiter = '\r\n'

elif line.endswith('\r') or line.endswith('\n'):
file.write(previous_delimiter + line[:-1])
previous_delimiter = line[-1:]

else:
file.write(previous_delimiter + line)
previous_delimiter = ''

#f = file('Perl Bookshelf [4th Ed]/mre/final/ch06.pdf.new', 'a+')
#f = file('Pages User Guide.app/Contents/Resources/Italian.lproj/
Pages Manuale Utente.pdf', 'a+')
f = file('ugh.pdf.new', 'a+')
f.write('\r\n--myboundary--\r\n')
f.seek(0)
o = file('test.bin', 'wb')
read_to_boundary(None, f, '--myboundary', o)
o.close()

/amn


On Nov 6, 2005, at 11:58 AM, Jim Gallacher wrote:

> Alexis,
>
> I wanted to add that I'm testing your code.
>
> Alexis Marrero wrote:
> 
>
> You don't need an account to use subversion anonymously. Just
> install subversion and grab a mod_python working copy.
>
> $ svn co http://svn.apache.org/repos/asf/httpd/mod_python/trunk trunk
>
> This will checkout a working copy into a new directory called
> "trunk" on  your machine. All of the following commands assume you
> are working in trunk/.
>
> Make your changes in your working copy, and then create a diff with:
>
> $ svn diff lib/python/mod_python/util.py > your-patch.diff
>
> The other commands which you'll find immediately useful are:
>
> svn update    - update your working copy from the repository
> svn status    - shows status of changes in your working copy
> svn -u status    - shows status of your copy against the repository
>
> I've found "Version Control with Subverion" is an excellent
> resource and is available online.
> http://svnbook.red-bean.com/en/1.1/index.html
>
> Jim
>







[ Post a follow-up to this message ]



    Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2  
Nicolas Lehuen


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-07-05 01:47 AM

Hi guys,

In the pure "if it ain't tested, it ain't fixed" fashion, I've added a unit
test for file upload to the test suite. It uploads a randomly generated 1 MB
file to the server, and check that the MD5 digest returned by the server is
correct. I could not reproduce Alexis' bug report this way, however. But I
then I added a test with the UNIX-HATERS handbook file ugh.pdf, and bang,
here comes the bug.

I've checked in both unit tests into subversion, so that you can play with
them. I'm now going to test Alexis' fix.

Regards,
Nicolas

2005/11/6, Alexis Marrero <amarrero@mitre.org>:
>
> I don't have a function that creates the files but the I can point
> you to a file that has the problem, ironically is "Unix Haters
> Handbook"  Well, at least is not the Python HH....
>
> http://research.microsoft.com/~daniel/uhh-download.html
>
> It's MD5 is 9e8c42be55aac825e7a34d448044d0fe. I don't know what it
> ends up been after upload with read_to_boundary().
>
> When you use the function to copy the file you will see that the
> digest will be e45979254297b0ece9c182a789d7966e.
>
> I have other 5 files out of 78546 files that I'm testing it against
> that have the same issues, coincidentally there are all PDF files.
> Here is the script that I was testing it with.
>
> def read_to_boundary(self, req, boundary, file):
> ''' read from the request object line by line with a maximum size,
> until the new line starts with boundary
> '''
> previous_delimiter = ''
> while 1:
> line = req.readline(1<<16)
> if line.startswith(boundary):
> break
>
> if line.endswith('\r\n'):
> file.write(previous_delimiter + line[:-2])
> previous_delimiter = '\r\n'
>
> elif line.endswith('\r') or line.endswith('\n'):
> file.write(previous_delimiter + line[:-1])
> previous_delimiter = line[-1:]
>
> else:
> file.write(previous_delimiter + line)
> previous_delimiter = ''
>
> #f = file('Perl Bookshelf [4th Ed]/mre/final/ch06.pdf.new', 'a+')
> #f = file('Pages User Guide.app/Contents/Resources/Italian.lproj/
> Pages Manuale Utente.pdf', 'a+')
> f = file('ugh.pdf.new', 'a+')
> f.write('\r\n--myboundary--\r\n')
> f.seek(0)
> o = file('test.bin', 'wb')
> read_to_boundary(None, f, '--myboundary', o)
> o.close()
>
> /amn
>
>
> On Nov 6, 2005, at 11:58 AM, Jim Gallacher wrote:
> 
>
>






[ Post a follow-up to this message ]



    Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2  
Nicolas Lehuen


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-07-05 01:47 AM

OK, it looks like Alexis' fix solves the problem with ugh.pdf without
breaking the other unit tests. So I think we can safely integrate his patch.
.
Shall I do it ?

Regards,
Nicolas

2005/11/6, Nicolas Lehuen <nicolas.lehuen@gmail.com>:
>
> Hi guys,
>
> In the pure "if it ain't tested, it ain't fixed" fashion, I've added a
> unit test for file upload to the test suite. It uploads a randomly generat
ed
> 1 MB file to the server, and check that the MD5 digest returned by the
> server is correct. I could not reproduce Alexis' bug report this way,
> however. But I then I added a test with the UNIX-HATERS handbook file
> ugh.pdf, and bang, here comes the bug.
>
> I've checked in both unit tests into subversion, so that you can play with
> them. I'm now going to test Alexis' fix.
>
> Regards,
> Nicolas
>
> 2005/11/6, Alexis Marrero <amarrero@mitre.org>: 
>






[ Post a follow-up to this message ]



    Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2  
Alexis Marrero


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-07-05 01:47 AM

Nicolas,

Not that I'm the one to give permission whether to integrate things
or not, but just to let you know I don't even have svn installed so I
won't do it. At least not for a while...

BTW, if there are some cherrypy developers in this mailing list, the
CherryPy function that handles file uploads DOES also has the same
issue.

I'm not subscribe to CherryPy dev group thus the cross posting.

From http://svn.cherrypy.org/trunk/cherrypy/_cpcgifs.py

24      def read_lines_to_outerboundary(self):
25          """Internal: read lines until outerboundary."""
26          next = "--" + self.outerboundary
27          last = next + "--"
28          delim = ""
29          last_line_lfend = True
30          while 1:
31              line = self.fp.readline(1<<16)
32              if not line:
33                  self.done = -1
34                  break
35              if line[:2] == "--" and last_line_lfend:
36                  strippedline = line.strip()
37                  if strippedline == next:
38                      break
39                  if strippedline == last:
40                      self.done = 1
41                      break
42              odelim = delim
43              if line[-2:] == "\r\n":
44                  delim = "\r\n"
45                  line = line[:-2]
46                  last_line_lfend = True
47              elif line[-1] == "\n":
48                  delim = "\n"
49                  line = line[:-1]
50                  last_line_lfend = True
51              else:
52                  delim = ""
53                  last_line_lfend = False
54              self.__write(odelim + line)

/amn

On Nov 6, 2005, at 4:20 PM, Nicolas Lehuen wrote:

> OK, it looks like Alexis' fix solves the problem with ugh.pdf
> without breaking the other unit tests. So I think we can safely
> integrate his patch. Shall I do it ?
>
> Regards,
> Nicolas
>
> 2005/11/6, Nicolas Lehuen <nicolas.lehuen@gmail.com>:
> Hi guys,
>
> In the pure "if it ain't tested, it ain't fixed" fashion, I've
> added a unit test for file upload to the test suite. It uploads a
> randomly generated 1 MB file to the server, and check that the MD5
> digest returned by the server is correct. I could not reproduce
> Alexis' bug report this way, however. But I then I added a test
> with the UNIX-HATERS handbook file ugh.pdf, and bang, here comes
> the bug.
>
> I've checked in both unit tests into subversion, so that you can
> play with them. I'm now going to test Alexis' fix.
>
> Regards,
> Nicolas
>
> 2005/11/6, Alexis Marrero <amarrero@mitre.org>:
> I don't have a function that creates the files but the I can point
> you to a file that has the problem, ironically is "Unix Haters
> Handbook"  Well, at least is not the Python HH....
>
> http://research.microsoft.com/~daniel/uhh-download.html
>
> It's MD5 is 9e8c42be55aac825e7a34d448044d0fe. I don't know what it
> ends up been after upload with read_to_boundary().
>
> When you use the function to copy the file you will see that the
> digest will be e45979254297b0ece9c182a789d7966e.
>
> I have other 5 files out of 78546 files that I'm testing it against
> that have the same issues, coincidentally there are all PDF files.
> Here is the script that I was testing it with.
>
> def read_to_boundary(self, req, boundary, file):
>      ''' read from the request object line by line with a maximum
> size,
>          until the new line starts with boundary
>      '''
>      previous_delimiter = ''
>      while 1:
>          line = req.readline(1<<16)
>          if line.startswith(boundary):
>              break
>
>          if line.endswith('\r\n'):
>              file.write(previous_delimiter + line[:-2])
>              previous_delimiter = '\r\n'
>
>          elif line.endswith('\r') or line.endswith('\n'):
>              file.write(previous_delimiter + line[:-1])
>              previous_delimiter = line[-1:]
>
>          else:
>              file.write(previous_delimiter + line)
>              previous_delimiter = ''
>
> #f = file('Perl Bookshelf [4th Ed]/mre/final/ch06.pdf.new', 'a+')
> #f = file('Pages User Guide.app/Contents/Resources/Italian.lproj/
> Pages Manuale Utente.pdf', 'a+')
> f = file('ugh.pdf.new', 'a+')
> f.write('\r\n--myboundary--\r\n')
> f.seek(0)
> o = file('test.bin', 'wb')
> read_to_boundary(None, f, '--myboundary', o)
> o.close()
>
> /amn
>
>
> On Nov 6, 2005, at 11:58 AM, Jim Gallacher wrote:
> 
> trunk 
>
>
>







[ Post a follow-up to this message ]



    Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2  
Nicolas Lehuen


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-08-05 10:56 PM

Well, I've re-read the previous code and it looks like it does almost the
same thing except it is bugged . CherryPy's implementation is almost the
same except it ought to work.

Jim, I've integrated your tricky file into the unit test. Alexis' version
passes all tests, whereas the current version fails on your file and ugh.pdf
.
So I guess we can call it a day and integrate Alexis' fix.

One thing which is different is the use of 1<<16 = 65536 as read block size
instead of 65368. It was Barry Pearce who contributed the latest version of
read_to_boundary, he must have known why this block size was good (or was it
his lucky number ?). Anyway, I've changed Alexis' code to use the
readBlockSize variable and changed its value to 1<<16. I've done tests with
both 1<<16 and 65638 on the server side and on the client side (changing the
tricky file generation), and the new code works.

What we should have done, while we were at spending a lot of time on a
trivial string search problem, is at least to implement a fast Boyer-Moore
search algorithm, and drop the use of readline. But I guess this will be for
3.4 ;).

And yes, this 3 spaces indent is dreadful.

Regards,
Nicolas



2005/11/7, Jim Gallacher <jpg@jgassociates.ca >:
>
> Gregory (Grisha) Trubetskoy wrote: 
>
> As much as it pains me to say it, but yes, this is a must fixm so it's
> on to 3.2.5b.
>
> I think we need to do some more extensive testing on Alexis's fix before
> we roll 3.2.5b. His read_to_boundary is much simpler than the current
> one. This makes me wonder if there is some magic happening in the
> current version which is solving some weird problems, or is his code
> just that much better?
>
> I'm feeling a little dull right now so all the code just blurs together.
> It also doesn't help that util.py uses *3 spaces* for the indent. Yikes.
> How the heck did that happen? 
>
> I'll take another look tomorrow.
>
> Jim
>






[ Post a follow-up to this message ]



    Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2  
Alexis Marrero


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-08-05 10:56 PM

Ok. Now I'm confused.

What I was trying to say is that I created a file with this function:

def generate_split_file(offset=-1,
readBlockSize=65368,
fname='testfile'):
f = open(fname, 'w')
f.write('a'*50)
f.write('\r\n')
block_size =  readBlockSize + offset
f.write('b'*block_size)
f.close()

Then I uploaded 'testfile' using the following
StorageField.read_to_boundary() method:

def read_to_boundary(self, req, boundary, file):
''' read from the request object line by line with a maximum size,
until the new line starts with boundary
'''
previous_delimiter = ''
while 1:
line = req.readline(1<<16)
if line.startswith(boundary):
break

if line.endswith('\r\n'):
file.write(previous_delimiter + line[:-2])
previous_delimiter = '\r\n'

elif line.endswith('\r') or line.endswith('\n'):
file.write(previous_delimiter + line[:-1])
previous_delimiter = line[-1:]

else:
file.write(previous_delimiter + line)
previous_delimiter = ''

And the md5 in the client is the same one as in the server.  Do you
have different results? Let me know.

Regards,
/amn

On Nov 7, 2005, at 2:11 PM, Jim Gallacher wrote:

> Jim Gallacher wrote:
> 
>
> Just to clarify, is this for your new read_to_boundary or the one
> in 3.2? If it's for yours then the MD5 sum *should* be the same,
> since that's what you fixed. 
>
> 
>
>
>
>







[ Post a follow-up to this message ]



    Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2  
Alexis Marrero


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-08-05 10:56 PM

Sorry for all this emails, but my system depends 100% on mod_python
specially file uploading. 

On Nov 7, 2005, at 2:04 PM, Jim Gallacher wrote:

> Alexis Marrero wrote:
> 
>
> Did you call it with the same block size as you are using in your
> code? The '\r' character must appear in the file right at the
> readBlockSize boundary.
YES I ran it with generate_file(offset=-1, readBlockSize=1<<16,
fname='testfile')
and the MD5 of the input and output files match.

>
> ie.
>
> generate_file(offset=-1, readBlockSize=1<<16, fname='testfile')
>
> Jim
>
> PS. Please make sure you cc. python-dev when replying. It's best to
> keep discussion on the list.
>
>
>







[ Post a follow-up to this message ]



    Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2  
Alexis Marrero


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-08-05 10:56 PM

New version of read_to_boundary(...)

readBlockSize = 1 << 16
def read_to_boundary(self, req, boundary, file):
previous_delimiter = ''
while 1:
line = req.readline(readBlockSize)
if line.strip().startswith(boundary):
break

if line.endswith('\r\n'):
file.write(previous_delimiter + line[:-2])
previous_delimiter = '\r\n'

elif line.endswith('\r'):
file.write(previous_delimiter + line[:-1])
previous_delimiter = '\r'

elif line.endswith('\n'):
if len(line[:-1]) > 0:
file.write(previous_delimiter + line[:-1])
previous_delimiter = '\n'

else:
previous_delimiter += '\n'

else:
file.write(previous_delimiter + line)
previous_delimiter = ''

This new functions passes the test for Jim's filetest generator.

[core:~] amarrero% Python filegenerator.py ; md5 testfile ; cp
testfile testfile.new ; Python test.py ; md5 test.bin
MD5 (testfile) = d481990a0f0bbd8acf847cd732714555
MD5 (outputtestfile.bin) = d481990a0f0bbd8acf847cd732714555

I'm running a test with my set of files (60000+) to see any other new
issues.



On Nov 7, 2005, at 6:35 PM, Jim Gallacher wrote:

> Alexis Marrero wrote:
> 
>
> So am I!
>
> I've created a test harness so we can bypass mod_python completely.
> It includes a slightly modified version of read_to_boundary which
> adds a new parameter, readBlockSize.
>
> In the output from the test harness, your version is 'new' and the
> current version is 'cur'. Run it and see what happens.
>
> Jim
>
> $ ./upload_test_harness
>
>  ========================================

> generate_embedded_cr_file
> ----------------------------------------
> test offset -1 chunk []
> ----------------------------------------
> src 5a63347d1106afdfa264b2a61f81ae82
> cur 5a63347d1106afdfa264b2a61f81ae82 PASS
> new 5a63347d1106afdfa264b2a61f81ae82 PASS
>
> ----------------------------------------
> test offset -1 chunk ['CR']
> ----------------------------------------
> src 82204e52343d5b25c2e783cd59499973
> cur e4af2eee73029642a114697ba59217b3 FAIL
> new 82204e52343d5b25c2e783cd59499973 PASS
>
>  ========================================

> generate_split_boundary_file
> ----------------------------------------
> test offset -1 chunk []
> ----------------------------------------
> src d481990a0f0bbd8acf847cd732714555
> cur d481990a0f0bbd8acf847cd732714555 PASS
> new 8fa5ac9f913d778575ea871506c392a9 FAIL
>
> ----------------------------------------
> test offset -1 chunk ['CR']
> ----------------------------------------
> src 8fa5ac9f913d778575ea871506c392a9
> cur d481990a0f0bbd8acf847cd732714555 FAIL
> new 8fa5ac9f913d778575ea871506c392a9 PASS
>
>
> 
>
> #!/usr/bin/env python
>
> from mkfile import generate_split_file, generate_file
> import sys
> from StringIO import StringIO
> import md5
>
> def read_to_boundary_current(self, req, boundary, file,
> readBlockSize):
>     ''' currrent version '''
>     #
>     # Although technically possible for the boundary to be split by
> the read, this will
>     # not happen because the readBlockSize is set quite high - far
> longer than any boundary line
>     # will ever contain.
>     #
>     # lastCharCarried is used to detect the situation where the \r
> \n is split across the end of
>     # a read block.
>     #
>     delim = ''
>     lastCharCarried = False
>     last_bound = boundary + '--'
>     roughBoundaryLength = len(last_bound) + 128
>     line = req.readline(readBlockSize)
>     lineLength = len(line)
>     if lineLength < roughBoundaryLength:
>         sline = line.strip()
>     else:
>         sline = ''
>     while lineLength > 0 and sline != boundary and sline !=
> last_bound:
>         if not lastCharCarried:
>             file.write(delim)
>             delim = ''
>         else:
>             lastCharCarried = False
>         cutLength = 0
>
>         if lineLength == readBlockSize:
>             if line[-1:] == '\r':
>                 delim = '\r'
>                 cutLength = -1
>                 lastCharCarried = True
>
>         if line[-2:] == '\r\n':
>             delim += '\r\n'
>             cutLength = -2
>         elif line[-1:] == '\n':
>             delim += '\n'
>             cutLength = -1
>         if cutLength != 0:
>             file.write(line[:cutLength])
>         else:
>             file.write(line)
>
>         line = req.readline(readBlockSize)
>         lineLength = len(line)
>         if lineLength < roughBoundaryLength:
>             sline = line.strip()
>         else:
>             sline = ''
>
> def read_to_boundary_new(self, req, boundary, file, readBlockSize):
>     ''' Alexis' version
>         read from the request object line by line with a maximum size,
>         until the new line starts with boundary
>     '''
>     previous_delimiter = ''
>     while 1:
>         line = req.readline(readBlockSize)
>         if line.startswith(boundary):
>             break
>
>         if line.endswith('\r\n'):
>             file.write(previous_delimiter + line[:-2])
>             previous_delimiter = '\r\n'
>
>         elif line.endswith('\r') or line.endswith('\n'):
>             file.write(previous_delimiter + line[:-1])
>             previous_delimiter = line[-1:]
>
>         else:
>             file.write(previous_delimiter + line)
>             previous_delimiter = ''
>
> def get_checksum(fname):
>     data = open(fname).read()
>     m = md5.new()
>     m.update(data)
>     return m.hexdigest()
>
> def generate_embedded_cr_file(offset=-1, readBlockSize=65368,
> chunk='\r', fname='testfile'):
>     """ Generate a file which causes the error with file upload
>         The default offset of -1 should generate a file which will
>         be corrupted by the file upload.
>     """
>
>     f = open(fname, 'w')
>     f.write('a'*50)
>     f.write('\r\n')
>
>     block_size =  readBlockSize + offset
>     f.write('b'*block_size)
>     f.write(chunk)
>     f.write('ccc')
>
>     f.write('d'*50)
>     f.write('\r\n')
>
>     f.close()
>
> def generate_split_boundary_file(offset=-1, readBlockSize=65368,
> chunk='\r', fname='testfile'):
>     """ this function generates a file with a boundary string '\r
> \n--myboundary'
>         starting at readBlockSize - offset
>     """
>     f = open(fname, 'w')
>     f.write('a'*50)
>     f.write('\r\n')
>
>     block_size =  readBlockSize + offset
>     f.write('b'*block_size)
>     f.write(chunk)
>
>     f.close()
>
> def main(file_generator, offset, chunk, block_size=1<<16):
>     fname_in = 'testfile.in'
>     fname_out = 'testfile.out'
>     file_generator(offset=offset, readBlockSize=block_size,
> chunk=chunk, fname=fname_in)
>     orig_checksum = get_checksum(fname_in)
>
>     req = StringIO()
>     req.write(open(fname_in).read())
>     req.write('\r\n--myboundary\r\n')
>
>     src_cs = get_checksum(fname_in)
>     print 'src', src_cs
>
>     fname_out = '%s.cur' % fname_out
>     o = file(fname_out, 'wb')
>     req.seek(0)
>     read_to_boundary_current(None, req, '--myboundary', o, block_size)
>     o.close()
>     cs = get_checksum(fname_out)
>     print 'cur', cs,
>     if cs != src_cs:
>         print 'FAIL'
>     else:
>         print 'PASS'
>
>     fname_out = '%s.alexis' % fname_out
>     o = file(fname_out, 'wb')
>     req.seek(0)
>     read_to_boundary_new(None, req, '--myboundary', o, block_size)
>     o.close()
>     cs = get_checksum(fname_out)
>     print 'new', cs,
>     if cs != src_cs:
>         print 'FAIL'
>     else:
>         print 'PASS'
>
> def cname(ch):
>     if ch == '\r':
>         return 'CR'
>     elif ch == '\n':
>         return 'LF'
>     elif ch == '':
>         return 'None'
>     else:
>         return ord(ch)
>
> if __name__ == '__main__':
>
>     #test_chunks =  ['', '\r', '\n', '\r\n']
>
>     # only test the chunks that are currently a problem
>     test_chunks =  ['', '\r',]
>
>     test_cases =
> & #123;'generate_embedded_cr_file':generat
e_embedded_cr_file,
> 'generate_split_boundary_file': generate_split_boundary_file, }
>     for name,file_gen_obj in test_cases.items():
>         print '='*40
>         print name
>         for chunk in test_chunks:
>             for i in range(-1, 0):
>                 print '-'*40
>                 print 'test offset', i, 'chunk',[ cname(c) for c in
> chunk ]
>                 print '-'*40
>                 main(file_gen_obj, i, chunk)
>                 print
>             print
>







[ Post a follow-up to this message ]



    Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2  
Alexis Marrero


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-08-05 10:56 PM

Thanks for that improvement, don't like its complexity though.  I'm
testing "mikes" version with my set of files I will all let you know
how it goes.

BTW, the line that reads "last_bound = boundary + '--'" so we save 4
CPU cycles there 

The next test that I will run this against will be with an obscene
amount of data for which this improvement helps a lot!

/amn
On Nov 8, 2005, at 4:47 AM, Mike Looijmans wrote:

> Here's one that passes all the tests, and is 2x as fast as the
> 'current' and 'new' implementations on random binary data. I
> haven't been able to generate data where the 'mike' version is slower:
>
> def read_to_boundary(self, req, boundary, file, readBlockSize=65536):
>     prevline = ""
>     last_bound = boundary + '--'
>     carry = None
>     while 1:
>         line = req.readline(readBlockSize)
>         if not line or line.startswith(boundary):
>             if prevline.endswith('\r\n'):
>                 if carry is not None:
>                     file.write(carry)
>                 file.write(prevline[:-2])
>                 break
>             elif (carry == '\r') and (prevline[-1] == '\n'):
>                 file.write(prevline[:-1])
>                 break
>             # If we get here, it's not really a boundary!
>         if carry is not None:
>             file.write(carry)
>             carry = None
>         if prevline[-1:] == '\r':
>             file.write(prevline[:-1])
>             carry = '\r'
>         else:
>             file.write(prevline)
>         prevline = line
>
> I've attached a modified upload_test_harness.py that includes the
> new and current, also the 'org' version (as in 3.1 release) and the
> 'mike' version.
>
> In addition, I added some profiling calls to show the impact of the
> extra 'endswith' and slices.
>
> --
> Mike Looijmans
> Philips Natlab / Topic Automation
> #!/usr/bin/env python
>
> import sys
> from cStringIO import StringIO
> import md5
>
> ##def generate_split_file(offset=-1,
> ##                          readBlockSize=65368,
> ##                          fname='testfile'):
> ##    f = open(fname, 'wb')
> ##    f.write('a'*50)
> ##    f.write('\r\n')
> ##    block_size =  readBlockSize + offset
> ##    f.write('b'*block_size)
> ##    f.close()
>
> def read_to_boundary_current(self, req, boundary, file,
> readBlockSize):
>     ''' currrent version '''
>     #
>     # Although technically possible for the boundary to be split by
> the read, this will
>     # not happen because the readBlockSize is set quite high - far
> longer than any boundary line
>     # will ever contain.
>     #
>     # lastCharCarried is used to detect the situation where the \r
> \n is split across the end of
>     # a read block.
>     #
>     delim = ''
>     lastCharCarried = False
>     last_bound = boundary + '--'
>     roughBoundaryLength = len(last_bound) + 128
>     line = req.readline(readBlockSize)
>     lineLength = len(line)
>     if lineLength < roughBoundaryLength:
>         sline = line.strip()
>     else:
>         sline = ''
>     while lineLength > 0 and sline != boundary and sline !=
> last_bound:
>         if not lastCharCarried:
>             file.write(delim)
>             delim = ''
>         else:
>             lastCharCarried = False
>         cutLength = 0
>
>         if lineLength == readBlockSize:
>             if line[-1:] == '\r':
>                 delim = '\r'
>                 cutLength = -1
>                 lastCharCarried = True
>
>         if line[-2:] == '\r\n':
>             delim += '\r\n'
>             cutLength = -2
>         elif line[-1:] == '\n':
>             delim += '\n'
>             cutLength = -1
>         if cutLength != 0:
>             file.write(line[:cutLength])
>         else:
>             file.write(line)
>
>         line = req.readline(readBlockSize)
>         lineLength = len(line)
>         if lineLength < roughBoundaryLength:
>             sline = line.strip()
>         else:
>             sline = ''
>
> def read_to_boundary_new(self, req, boundary, file, readBlockSize):
>     ''' Alexis' version
>         read from the request object line by line with a maximum size,
>         until the new line starts with boundary
>     '''
>     previous_delimiter = ''
>     while 1:
>         line = req.readline(readBlockSize)
>         if line.startswith(boundary):
>             break
>
>         if line.endswith('\r\n'):
>             file.write(previous_delimiter + line[:-2])
>             previous_delimiter = '\r\n'
>
>         elif line.endswith('\r') or line.endswith('\n'):
>             file.write(previous_delimiter + line[:-1])
>             previous_delimiter = line[-1:]
>
>         else:
>             file.write(previous_delimiter + line)
>             previous_delimiter = ''
>
> def read_to_boundary_org(self, req, boundary, file, readBlockSize):
>     delim = ""
>     line = req.readline(readBlockSize)
>     while line and not line.startswith(boundary):
>         odelim = delim
>         if line[-2:] == "\r\n":
>             delim = "\r\n"
>             line = line[:-2]
>         elif line[-1:] == "\n":
>             delim = "\n"
>             line = line[:-1]
>         else:
>             delim = ""
>         file.write(odelim + line)
>         line = req.readline(readBlockSize)
>
> def read_to_boundary_mike(self, req, boundary, file,
> readBlockSize=65536):
>     prevline = ""
>     last_bound = boundary + '--'
>     carry = None
>     while 1:
>         line = req.readline(readBlockSize)
>         if not line or line.startswith(boundary):
>             if prevline.endswith('\r\n'):
>                 if carry is not None:
>                     file.write(carry)
>                 file.write(prevline[:-2])
>                 break
>             elif (carry == '\r') and (prevline[-1] == '\n'):
>                 file.write(prevline[:-1])
>                 break
>             # If we get here, it's not really a boundary!
>         if carry is not None:
>             file.write(carry)
>             carry = None
>         if prevline[-1:] == '\r':
>             file.write(prevline[:-1])
>             carry = '\r'
>         else:
>             file.write(prevline)
>         prevline = line
>
> def get_checksum(fname):
>     data = open(fname, 'rb').read()
>     m = md5.new()
>     m.update(data)
>     return m.hexdigest()
>
> def generate_embedded_cr_file(offset=-1, readBlockSize=65368,
> chunk='\r', fname='testfile'):
>     """ Generate a file which causes the error with file upload
>         The default offset of -1 should generate a file which will
>         be corrupted by the file upload.
>     """
>
>     f = open(fname, 'wb')
>     f.write('a'*50)
>     f.write('\r\n')
>
>     block_size =  readBlockSize + offset
>     f.write('b'*block_size)
>     f.write(chunk)
>     f.write('ccc')
>
>     f.write('d'*50)
>     f.write('\r\n')
>
>     f.close()
>
> def generate_split_boundary_file(offset=-1, readBlockSize=65368,
> chunk='\r', fname='testfile'):
>     """ this function generates a file with a boundary string '\r
> \n--myboundary'
>         starting at readBlockSize - offset
>     """
>     f = open(fname, 'wb')
>     f.write('a'*50)
>     f.write('\r\n')
>
>     block_size =  readBlockSize + offset
>     f.write('b'*block_size)
>     f.write(chunk)
>
>     f.close()
>
> read_boundaries = [read_to_boundary_current, read_to_boundary_new,
> read_to_boundary_org, read_to_boundary_mike]
>
> def main(file_generator, offset, chunk, block_size=1<<16):
>     fname_in = 'testfile.in'
>     fname_out_base = 'testfile.out'
>     file_generator(offset=offset, readBlockSize=block_size,
> chunk=chunk, fname=fname_in)
>
>     orig_checksum = get_checksum(fname_in)
>
>     req = StringIO()
>     req.write(open(fname_in, 'rb').read())
>     req.write('\r\n--myboundary\r\n')
>
>     src_cs = get_checksum(fname_in)
>     print '     src', src_cs
>
>     for rtb in read_boundaries:
>         name = rtb.__name__.split('_')[-1]
>         fname_out = fname_out_base + name
>         o = file(fname_out, 'wb')
>         req.seek(0)
>         rtb(None, req, '--myboundary', o, block_size)
>         size = o.tell()
>         o.close()
>         cs = get_checksum(fname_out)
>         print "%8s %s %6d" % (name, cs, size),
>         if cs != src_cs:
>             print 'FAIL'
>         else:
>             print 'PASS'
>
>
> def cname(ch):
>     if ch == '\r':
>         return 'CR'
>     elif ch == '\n':
>         return 'LF'
>     elif ch == '':
>         return 'None'
>     else:
>         return ord(ch)
>
> class DevNull:
>     def write(self, data):
>         pass
>
> if __name__ == '__main__':
>
>     #test_chunks =  ['', '\r', '\n', '\r\n']
>
>     # only test the chunks that are currently a problem
>     test_chunks =  ['', '\r',]
>
>     test_cases = (generate_embedded_cr_file,
> generate_split_boundary_file)
>     for file_gen_obj in test_cases:
>         print '='*40
>         print file_gen_obj.__name__
>         for chunk in test_chunks:
>             for i in range(-1, 0):
>                 print '-'*40
>                 print 'test offset', i, 'chunk',[ cname(c) for c in
> chunk ]
>                 print '-'*40
>                 main(file_gen_obj, i, chunk)
>                 print
>             print
> ##    sys.exit(0)
>
>     # Run profiling benchmarks with the four files
>     # (on my system, the _mike and _org versions are typically almost
>     # twice as fast as the _new and _current)
>     import profile, random
>     req = StringIO()
>     blocksize = 102400
>     block=[]
>     r = random.Random()
>     for i in xrange(blocksize):
>         block.append(chr(r.randint(0,255)))
>     block = "".join(block)
>     for c in range(64):
>       req.write(block)
>     req.write('\r\n--myboundary\r\n')
>     print "Benchmark bytes:", req.tell()
>     out = DevNull()
>     for rtb in read_boundaries:
>         req.seek(0)
>         print rtb.__name__
>         profile.run("%s(None, req, '--myboundary', out, 65536)" %
> rtb.__name__)
>







[ Post a follow-up to this message ]



    Sponsored Links  




 





   All times are GMT. The time now is 03:09 PM.      Post New Thread    Post A Reply      
Pages (2): [1] 2 »   Last Thread   Next Thread Next


Most Popular forums 

Forum Jump:
Rate This Thread:

Forum Rules:
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is OFF
vB code is ON
Smilies are ON
[IMG] code is OFF
 
Medical and Health forum | Computer Games Reviews | Graphics design forum