|
Home > Archive > Apache Mod-Python > November 2005 > mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2
|
|
| Alexis Marrero 2005-11-06, 8:47 pm |
| All,
The current 3.1 mod_python implementation of
mod_python.util.StorageField.read_to_boudary reads as follows:
203 def read_to_boundary(self, req, boundary, file):
204 delim = ""
205 line = req.readline()
206 sline = line.strip()
207 last_bound = boundary + "--"
208 while line and sline != boundary and sline !=
last_bound:
209 odelim = delim
210 if line[-2:] == "\r\n":
211 delim = "\r\n"
212 line = line[:-2]
213 elif line[-1:] == "\n":
214 delim = "\n"
215 line = line[:-1]
216 file.write(odelim + line)
217 line = req.readline()
218 sline = line.strip()
As we have discussed previously:
http://www.modpython.org/pipermail/...rch/017754.html
http://www.modpython.org/pipermail/...rch/017756.html
http://www.modpython.org/pipermail/...ber/019460.html
This triggered couple of changes in mod_python 3.2 Beta which reads
as follows:
33 # Fixes memory error when upload large files such as 700+MB
ISOs.
34 readBlockSize = 65368
35
....
225 def read_to_boundary(self, req, boundary, file):
....
234 delim = ''
235 lastCharCarried = False
236 last_bound = boundary + '--'
237 roughBoundaryLength = len(last_bound) + 128
238 line = req.readline(readBlockSize)
239 lineLength = len(line)
240 if lineLength < roughBoundaryLength:
241 sline = line.strip()
242 else:
243 sline = ''
244 while lineLength > 0 and sline != boundary and sline !
= last_bound:
245 if not lastCharCarried:
246 file.write(delim)
247 delim = ''
248 else:
249 lastCharCarried = False
250 cutLength = 0
251 if lineLength == readBlockSize:
252 if line[-1:] == '\r':
253 delim = '\r'
254 cutLength = -1
255 lastCharCarried = True
256 if line[-2:] == '\r\n':
257 delim += '\r\n'
258 cutLength = -2
259 elif line[-1:] == '\n':
260 delim += '\n'
261 cutLength = -1
262 if cutLength != 0:
263 file.write(line[:cutLength])
264 else:
265 file.write(line)
266 line = req.readline(readBlockSize)
267 lineLength = len(line)
268 if lineLength < roughBoundaryLength:
269 sline = line.strip()
270 else:
271 sline = ''
This function has a mysterious bug in it... For some files which I
could disclose (one of them been the PDF file for Apple's Pages User
Manual in Italian) the uploaded file in the server ends up with the
same length but different sha512 (the only digest that I'm using).
The problem is a '\r' in the middle of a chunk of data that is much
larger than readBlockSize.
Anyhow, I wrote a new function, which I believe is much simpler, and
test it with thousands and thousands of different files and so far it
seems to work fine. It reads as follows:
def read_to_boundary(self, req, boundary, file):
''' read from the request object line by line with a maximum size,
until the new line starts with boundary
'''
previous_delimiter = ''
while 1:
line = req.readline(1<<16)
if line.startswith(boundary):
break
if line.endswith('\r\n'):
file.write(previous_delimiter + line[:-2])
previous_delimiter = '\r\n'
elif line.endswith('\r') or line.endswith('\n'):
file.write(previous_delimiter + line[:-1])
previous_delimiter = line[-1:]
else:
file.write(previous_delimiter + line)
previous_delimiter = ''
Let me know any comments on it and if you test it and fails please
also let me know. I don't have subversion account neither I don't
know how to use it thus this email.
/amn
________________________________________
_______
Mod_python mailing list
Mod_python@modpython.org
http://mailman.modpython.org/mailma...info/mod_python
| |
| Alexis Marrero 2005-11-06, 8:47 pm |
| I don't have a function that creates the files but the I can point
you to a file that has the problem, ironically is "Unix Haters
Handbook" Well, at least is not the Python HH....
http://research.microsoft.com/~daniel/uhh-download.html
It's MD5 is 9e8c42be55aac825e7a34d448044d0fe. I don't know what it
ends up been after upload with read_to_boundary().
When you use the function to copy the file you will see that the
digest will be e45979254297b0ece9c182a789d7966e.
I have other 5 files out of 78546 files that I'm testing it against
that have the same issues, coincidentally there are all PDF files.
Here is the script that I was testing it with.
def read_to_boundary(self, req, boundary, file):
''' read from the request object line by line with a maximum size,
until the new line starts with boundary
'''
previous_delimiter = ''
while 1:
line = req.readline(1<<16)
if line.startswith(boundary):
break
if line.endswith('\r\n'):
file.write(previous_delimiter + line[:-2])
previous_delimiter = '\r\n'
elif line.endswith('\r') or line.endswith('\n'):
file.write(previous_delimiter + line[:-1])
previous_delimiter = line[-1:]
else:
file.write(previous_delimiter + line)
previous_delimiter = ''
#f = file('Perl Bookshelf [4th Ed]/mre/final/ch06.pdf.new', 'a+')
#f = file('Pages User Guide.app/Contents/Resources/Italian.lproj/
Pages Manuale Utente.pdf', 'a+')
f = file('ugh.pdf.new', 'a+')
f.write('\r\n--myboundary--\r\n')
f.seek(0)
o = file('test.bin', 'wb')
read_to_boundary(None, f, '--myboundary', o)
o.close()
/amn
On Nov 6, 2005, at 11:58 AM, Jim Gallacher wrote:
> Alexis,
>
> I wanted to add that I'm testing your code.
>
> Alexis Marrero wrote:
>
>
> You don't need an account to use subversion anonymously. Just
> install subversion and grab a mod_python working copy.
>
> $ svn co http://svn.apache.org/repos/asf/httpd/mod_python/trunk trunk
>
> This will checkout a working copy into a new directory called
> "trunk" on your machine. All of the following commands assume you
> are working in trunk/.
>
> Make your changes in your working copy, and then create a diff with:
>
> $ svn diff lib/python/mod_python/util.py > your-patch.diff
>
> The other commands which you'll find immediately useful are:
>
> svn update - update your working copy from the repository
> svn status - shows status of changes in your working copy
> svn -u status - shows status of your copy against the repository
>
> I've found "Version Control with Subverion" is an excellent
> resource and is available online.
> http://svnbook.red-bean.com/en/1.1/index.html
>
> Jim
>
| |
| Nicolas Lehuen 2005-11-06, 8:47 pm |
| Hi guys,
In the pure "if it ain't tested, it ain't fixed" fashion, I've added a unit
test for file upload to the test suite. It uploads a randomly generated 1 MB
file to the server, and check that the MD5 digest returned by the server is
correct. I could not reproduce Alexis' bug report this way, however. But I
then I added a test with the UNIX-HATERS handbook file ugh.pdf, and bang,
here comes the bug.
I've checked in both unit tests into subversion, so that you can play with
them. I'm now going to test Alexis' fix.
Regards,
Nicolas
2005/11/6, Alexis Marrero <amarrero@mitre.org>:
>
> I don't have a function that creates the files but the I can point
> you to a file that has the problem, ironically is "Unix Haters
> Handbook" Well, at least is not the Python HH....
>
> http://research.microsoft.com/~daniel/uhh-download.html
>
> It's MD5 is 9e8c42be55aac825e7a34d448044d0fe. I don't know what it
> ends up been after upload with read_to_boundary().
>
> When you use the function to copy the file you will see that the
> digest will be e45979254297b0ece9c182a789d7966e.
>
> I have other 5 files out of 78546 files that I'm testing it against
> that have the same issues, coincidentally there are all PDF files.
> Here is the script that I was testing it with.
>
> def read_to_boundary(self, req, boundary, file):
> ''' read from the request object line by line with a maximum size,
> until the new line starts with boundary
> '''
> previous_delimiter = ''
> while 1:
> line = req.readline(1<<16)
> if line.startswith(boundary):
> break
>
> if line.endswith('\r\n'):
> file.write(previous_delimiter + line[:-2])
> previous_delimiter = '\r\n'
>
> elif line.endswith('\r') or line.endswith('\n'):
> file.write(previous_delimiter + line[:-1])
> previous_delimiter = line[-1:]
>
> else:
> file.write(previous_delimiter + line)
> previous_delimiter = ''
>
> #f = file('Perl Bookshelf [4th Ed]/mre/final/ch06.pdf.new', 'a+')
> #f = file('Pages User Guide.app/Contents/Resources/Italian.lproj/
> Pages Manuale Utente.pdf', 'a+')
> f = file('ugh.pdf.new', 'a+')
> f.write('\r\n--myboundary--\r\n')
> f.seek(0)
> o = file('test.bin', 'wb')
> read_to_boundary(None, f, '--myboundary', o)
> o.close()
>
> /amn
>
>
> On Nov 6, 2005, at 11:58 AM, Jim Gallacher wrote:
>
>
>
| |
| Nicolas Lehuen 2005-11-06, 8:47 pm |
| OK, it looks like Alexis' fix solves the problem with ugh.pdf without
breaking the other unit tests. So I think we can safely integrate his patch..
Shall I do it ?
Regards,
Nicolas
2005/11/6, Nicolas Lehuen <nicolas.lehuen@gmail.com>:
>
> Hi guys,
>
> In the pure "if it ain't tested, it ain't fixed" fashion, I've added a
> unit test for file upload to the test suite. It uploads a randomly generated
> 1 MB file to the server, and check that the MD5 digest returned by the
> server is correct. I could not reproduce Alexis' bug report this way,
> however. But I then I added a test with the UNIX-HATERS handbook file
> ugh.pdf, and bang, here comes the bug.
>
> I've checked in both unit tests into subversion, so that you can play with
> them. I'm now going to test Alexis' fix.
>
> Regards,
> Nicolas
>
> 2005/11/6, Alexis Marrero <amarrero@mitre.org>:
>
| |
| Alexis Marrero 2005-11-06, 8:47 pm |
| Nicolas,
Not that I'm the one to give permission whether to integrate things
or not, but just to let you know I don't even have svn installed so I
won't do it. At least not for a while...
BTW, if there are some cherrypy developers in this mailing list, the
CherryPy function that handles file uploads DOES also has the same
issue.
I'm not subscribe to CherryPy dev group thus the cross posting.
From http://svn.cherrypy.org/trunk/cherrypy/_cpcgifs.py
24 def read_lines_to_outerboundary(self):
25 """Internal: read lines until outerboundary."""
26 next = "--" + self.outerboundary
27 last = next + "--"
28 delim = ""
29 last_line_lfend = True
30 while 1:
31 line = self.fp.readline(1<<16)
32 if not line:
33 self.done = -1
34 break
35 if line[:2] == "--" and last_line_lfend:
36 strippedline = line.strip()
37 if strippedline == next:
38 break
39 if strippedline == last:
40 self.done = 1
41 break
42 odelim = delim
43 if line[-2:] == "\r\n":
44 delim = "\r\n"
45 line = line[:-2]
46 last_line_lfend = True
47 elif line[-1] == "\n":
48 delim = "\n"
49 line = line[:-1]
50 last_line_lfend = True
51 else:
52 delim = ""
53 last_line_lfend = False
54 self.__write(odelim + line)
/amn
On Nov 6, 2005, at 4:20 PM, Nicolas Lehuen wrote:
> OK, it looks like Alexis' fix solves the problem with ugh.pdf
> without breaking the other unit tests. So I think we can safely
> integrate his patch. Shall I do it ?
>
> Regards,
> Nicolas
>
> 2005/11/6, Nicolas Lehuen <nicolas.lehuen@gmail.com>:
> Hi guys,
>
> In the pure "if it ain't tested, it ain't fixed" fashion, I've
> added a unit test for file upload to the test suite. It uploads a
> randomly generated 1 MB file to the server, and check that the MD5
> digest returned by the server is correct. I could not reproduce
> Alexis' bug report this way, however. But I then I added a test
> with the UNIX-HATERS handbook file ugh.pdf, and bang, here comes
> the bug.
>
> I've checked in both unit tests into subversion, so that you can
> play with them. I'm now going to test Alexis' fix.
>
> Regards,
> Nicolas
>
> 2005/11/6, Alexis Marrero <amarrero@mitre.org>:
> I don't have a function that creates the files but the I can point
> you to a file that has the problem, ironically is "Unix Haters
> Handbook" Well, at least is not the Python HH....
>
> http://research.microsoft.com/~daniel/uhh-download.html
>
> It's MD5 is 9e8c42be55aac825e7a34d448044d0fe. I don't know what it
> ends up been after upload with read_to_boundary().
>
> When you use the function to copy the file you will see that the
> digest will be e45979254297b0ece9c182a789d7966e.
>
> I have other 5 files out of 78546 files that I'm testing it against
> that have the same issues, coincidentally there are all PDF files.
> Here is the script that I was testing it with.
>
> def read_to_boundary(self, req, boundary, file):
> ''' read from the request object line by line with a maximum
> size,
> until the new line starts with boundary
> '''
> previous_delimiter = ''
> while 1:
> line = req.readline(1<<16)
> if line.startswith(boundary):
> break
>
> if line.endswith('\r\n'):
> file.write(previous_delimiter + line[:-2])
> previous_delimiter = '\r\n'
>
> elif line.endswith('\r') or line.endswith('\n'):
> file.write(previous_delimiter + line[:-1])
> previous_delimiter = line[-1:]
>
> else:
> file.write(previous_delimiter + line)
> previous_delimiter = ''
>
> #f = file('Perl Bookshelf [4th Ed]/mre/final/ch06.pdf.new', 'a+')
> #f = file('Pages User Guide.app/Contents/Resources/Italian.lproj/
> Pages Manuale Utente.pdf', 'a+')
> f = file('ugh.pdf.new', 'a+')
> f.write('\r\n--myboundary--\r\n')
> f.seek(0)
> o = file('test.bin', 'wb')
> read_to_boundary(None, f, '--myboundary', o)
> o.close()
>
> /amn
>
>
> On Nov 6, 2005, at 11:58 AM, Jim Gallacher wrote:
>
> trunk
>
>
>
| |
| Nicolas Lehuen 2005-11-08, 5:56 pm |
| Well, I've re-read the previous code and it looks like it does almost the
same thing except it is bugged . CherryPy's implementation is almost the
same except it ought to work.
Jim, I've integrated your tricky file into the unit test. Alexis' version
passes all tests, whereas the current version fails on your file and ugh.pdf.
So I guess we can call it a day and integrate Alexis' fix.
One thing which is different is the use of 1<<16 = 65536 as read block size
instead of 65368. It was Barry Pearce who contributed the latest version of
read_to_boundary, he must have known why this block size was good (or was it
his lucky number ?). Anyway, I've changed Alexis' code to use the
readBlockSize variable and changed its value to 1<<16. I've done tests with
both 1<<16 and 65638 on the server side and on the client side (changing the
tricky file generation), and the new code works.
What we should have done, while we were at spending a lot of time on a
trivial string search problem, is at least to implement a fast Boyer-Moore
search algorithm, and drop the use of readline. But I guess this will be for
3.4 ;).
And yes, this 3 spaces indent is dreadful.
Regards,
Nicolas
2005/11/7, Jim Gallacher <jpg@jgassociates.ca >:
>
> Gregory (Grisha) Trubetskoy wrote:
>
> As much as it pains me to say it, but yes, this is a must fixm so it's
> on to 3.2.5b.
>
> I think we need to do some more extensive testing on Alexis's fix before
> we roll 3.2.5b. His read_to_boundary is much simpler than the current
> one. This makes me wonder if there is some magic happening in the
> current version which is solving some weird problems, or is his code
> just that much better?
>
> I'm feeling a little dull right now so all the code just blurs together.
> It also doesn't help that util.py uses *3 spaces* for the indent. Yikes.
> How the heck did that happen? 
>
> I'll take another look tomorrow.
>
> Jim
>
| |
| Alexis Marrero 2005-11-08, 5:56 pm |
| Ok. Now I'm confused.
What I was trying to say is that I created a file with this function:
def generate_split_file(offset=-1,
readBlockSize=65368,
fname='testfile'):
f = open(fname, 'w')
f.write('a'*50)
f.write('\r\n')
block_size = readBlockSize + offset
f.write('b'*block_size)
f.close()
Then I uploaded 'testfile' using the following
StorageField.read_to_boundary() method:
def read_to_boundary(self, req, boundary, file):
''' read from the request object line by line with a maximum size,
until the new line starts with boundary
'''
previous_delimiter = ''
while 1:
line = req.readline(1<<16)
if line.startswith(boundary):
break
if line.endswith('\r\n'):
file.write(previous_delimiter + line[:-2])
previous_delimiter = '\r\n'
elif line.endswith('\r') or line.endswith('\n'):
file.write(previous_delimiter + line[:-1])
previous_delimiter = line[-1:]
else:
file.write(previous_delimiter + line)
previous_delimiter = ''
And the md5 in the client is the same one as in the server. Do you
have different results? Let me know.
Regards,
/amn
On Nov 7, 2005, at 2:11 PM, Jim Gallacher wrote:
> Jim Gallacher wrote:
>
>
> Just to clarify, is this for your new read_to_boundary or the one
> in 3.2? If it's for yours then the MD5 sum *should* be the same,
> since that's what you fixed. 
>
>
>
>
>
>
| |
| Alexis Marrero 2005-11-08, 5:56 pm |
| Sorry for all this emails, but my system depends 100% on mod_python
specially file uploading. 
On Nov 7, 2005, at 2:04 PM, Jim Gallacher wrote:
> Alexis Marrero wrote:
>
>
> Did you call it with the same block size as you are using in your
> code? The '\r' character must appear in the file right at the
> readBlockSize boundary.
YES I ran it with generate_file(offset=-1, readBlockSize=1<<16,
fname='testfile')
and the MD5 of the input and output files match.
>
> ie.
>
> generate_file(offset=-1, readBlockSize=1<<16, fname='testfile')
>
> Jim
>
> PS. Please make sure you cc. python-dev when replying. It's best to
> keep discussion on the list.
>
>
>
| |
| Alexis Marrero 2005-11-08, 5:56 pm |
| New version of read_to_boundary(...)
readBlockSize = 1 << 16
def read_to_boundary(self, req, boundary, file):
previous_delimiter = ''
while 1:
line = req.readline(readBlockSize)
if line.strip().startswith(boundary):
break
if line.endswith('\r\n'):
file.write(previous_delimiter + line[:-2])
previous_delimiter = '\r\n'
elif line.endswith('\r'):
file.write(previous_delimiter + line[:-1])
previous_delimiter = '\r'
elif line.endswith('\n'):
if len(line[:-1]) > 0:
file.write(previous_delimiter + line[:-1])
previous_delimiter = '\n'
else:
previous_delimiter += '\n'
else:
file.write(previous_delimiter + line)
previous_delimiter = ''
This new functions passes the test for Jim's filetest generator.
[core:~] amarrero% Python filegenerator.py ; md5 testfile ; cp
testfile testfile.new ; Python test.py ; md5 test.bin
MD5 (testfile) = d481990a0f0bbd8acf847cd732714555
MD5 (outputtestfile.bin) = d481990a0f0bbd8acf847cd732714555
I'm running a test with my set of files (60000+) to see any other new
issues.
On Nov 7, 2005, at 6:35 PM, Jim Gallacher wrote:
> Alexis Marrero wrote:
>
>
> So am I!
>
> I've created a test harness so we can bypass mod_python completely.
> It includes a slightly modified version of read_to_boundary which
> adds a new parameter, readBlockSize.
>
> In the output from the test harness, your version is 'new' and the
> current version is 'cur'. Run it and see what happens.
>
> Jim
>
> $ ./upload_test_harness
>
> ========================================
> generate_embedded_cr_file
> ----------------------------------------
> test offset -1 chunk []
> ----------------------------------------
> src 5a63347d1106afdfa264b2a61f81ae82
> cur 5a63347d1106afdfa264b2a61f81ae82 PASS
> new 5a63347d1106afdfa264b2a61f81ae82 PASS
>
> ----------------------------------------
> test offset -1 chunk ['CR']
> ----------------------------------------
> src 82204e52343d5b25c2e783cd59499973
> cur e4af2eee73029642a114697ba59217b3 FAIL
> new 82204e52343d5b25c2e783cd59499973 PASS
>
> ========================================
> generate_split_boundary_file
> ----------------------------------------
> test offset -1 chunk []
> ----------------------------------------
> src d481990a0f0bbd8acf847cd732714555
> cur d481990a0f0bbd8acf847cd732714555 PASS
> new 8fa5ac9f913d778575ea871506c392a9 FAIL
>
> ----------------------------------------
> test offset -1 chunk ['CR']
> ----------------------------------------
> src 8fa5ac9f913d778575ea871506c392a9
> cur d481990a0f0bbd8acf847cd732714555 FAIL
> new 8fa5ac9f913d778575ea871506c392a9 PASS
>
>
>
>
> #!/usr/bin/env python
>
> from mkfile import generate_split_file, generate_file
> import sys
> from StringIO import StringIO
> import md5
>
> def read_to_boundary_current(self, req, boundary, file,
> readBlockSize):
> ''' currrent version '''
> #
> # Although technically possible for the boundary to be split by
> the read, this will
> # not happen because the readBlockSize is set quite high - far
> longer than any boundary line
> # will ever contain.
> #
> # lastCharCarried is used to detect the situation where the \r
> \n is split across the end of
> # a read block.
> #
> delim = ''
> lastCharCarried = False
> last_bound = boundary + '--'
> roughBoundaryLength = len(last_bound) + 128
> line = req.readline(readBlockSize)
> lineLength = len(line)
> if lineLength < roughBoundaryLength:
> sline = line.strip()
> else:
> sline = ''
> while lineLength > 0 and sline != boundary and sline !=
> last_bound:
> if not lastCharCarried:
> file.write(delim)
> delim = ''
> else:
> lastCharCarried = False
> cutLength = 0
>
> if lineLength == readBlockSize:
> if line[-1:] == '\r':
> delim = '\r'
> cutLength = -1
> lastCharCarried = True
>
> if line[-2:] == '\r\n':
> delim += '\r\n'
> cutLength = -2
> elif line[-1:] == '\n':
> delim += '\n'
> cutLength = -1
> if cutLength != 0:
> file.write(line[:cutLength])
> else:
> file.write(line)
>
> line = req.readline(readBlockSize)
> lineLength = len(line)
> if lineLength < roughBoundaryLength:
> sline = line.strip()
> else:
> sline = ''
>
> def read_to_boundary_new(self, req, boundary, file, readBlockSize):
> ''' Alexis' version
> read from the request object line by line with a maximum size,
> until the new line starts with boundary
> '''
> previous_delimiter = ''
> while 1:
> line = req.readline(readBlockSize)
> if line.startswith(boundary):
> break
>
> if line.endswith('\r\n'):
> file.write(previous_delimiter + line[:-2])
> previous_delimiter = '\r\n'
>
> elif line.endswith('\r') or line.endswith('\n'):
> file.write(previous_delimiter + line[:-1])
> previous_delimiter = line[-1:]
>
> else:
> file.write(previous_delimiter + line)
> previous_delimiter = ''
>
> def get_checksum(fname):
> data = open(fname).read()
> m = md5.new()
> m.update(data)
> return m.hexdigest()
>
> def generate_embedded_cr_file(offset=-1, readBlockSize=65368,
> chunk='\r', fname='testfile'):
> """ Generate a file which causes the error with file upload
> The default offset of -1 should generate a file which will
> be corrupted by the file upload.
> """
>
> f = open(fname, 'w')
> f.write('a'*50)
> f.write('\r\n')
>
> block_size = readBlockSize + offset
> f.write('b'*block_size)
> f.write(chunk)
> f.write('ccc')
>
> f.write('d'*50)
> f.write('\r\n')
>
> f.close()
>
> def generate_split_boundary_file(offset=-1, readBlockSize=65368,
> chunk='\r', fname='testfile'):
> """ this function generates a file with a boundary string '\r
> \n--myboundary'
> starting at readBlockSize - offset
> """
> f = open(fname, 'w')
> f.write('a'*50)
> f.write('\r\n')
>
> block_size = readBlockSize + offset
> f.write('b'*block_size)
> f.write(chunk)
>
> f.close()
>
> def main(file_generator, offset, chunk, block_size=1<<16):
> fname_in = 'testfile.in'
> fname_out = 'testfile.out'
> file_generator(offset=offset, readBlockSize=block_size,
> chunk=chunk, fname=fname_in)
> orig_checksum = get_checksum(fname_in)
>
> req = StringIO()
> req.write(open(fname_in).read())
> req.write('\r\n--myboundary\r\n')
>
> src_cs = get_checksum(fname_in)
> print 'src', src_cs
>
> fname_out = '%s.cur' % fname_out
> o = file(fname_out, 'wb')
> req.seek(0)
> read_to_boundary_current(None, req, '--myboundary', o, block_size)
> o.close()
> cs = get_checksum(fname_out)
> print 'cur', cs,
> if cs != src_cs:
> print 'FAIL'
> else:
> print 'PASS'
>
> fname_out = '%s.alexis' % fname_out
> o = file(fname_out, 'wb')
> req.seek(0)
> read_to_boundary_new(None, req, '--myboundary', o, block_size)
> o.close()
> cs = get_checksum(fname_out)
> print 'new', cs,
> if cs != src_cs:
> print 'FAIL'
> else:
> print 'PASS'
>
> def cname(ch):
> if ch == '\r':
> return 'CR'
> elif ch == '\n':
> return 'LF'
> elif ch == '':
> return 'None'
> else:
> return ord(ch)
>
> if __name__ == '__main__':
>
> #test_chunks = ['', '\r', '\n', '\r\n']
>
> # only test the chunks that are currently a problem
> test_chunks = ['', '\r',]
>
> test_cases =
> & #123;'generate_embedded_cr_file':generat
e_embedded_cr_file,
> 'generate_split_boundary_file': generate_split_boundary_file, }
> for name,file_gen_obj in test_cases.items():
> print '='*40
> print name
> for chunk in test_chunks:
> for i in range(-1, 0):
> print '-'*40
> print 'test offset', i, 'chunk',[ cname(c) for c in
> chunk ]
> print '-'*40
> main(file_gen_obj, i, chunk)
> print
> print
>
| |
| Alexis Marrero 2005-11-08, 5:56 pm |
| Thanks for that improvement, don't like its complexity though. I'm
testing "mikes" version with my set of files I will all let you know
how it goes.
BTW, the line that reads "last_bound = boundary + '--'" so we save 4
CPU cycles there 
The next test that I will run this against will be with an obscene
amount of data for which this improvement helps a lot!
/amn
On Nov 8, 2005, at 4:47 AM, Mike Looijmans wrote:
> Here's one that passes all the tests, and is 2x as fast as the
> 'current' and 'new' implementations on random binary data. I
> haven't been able to generate data where the 'mike' version is slower:
>
> def read_to_boundary(self, req, boundary, file, readBlockSize=65536):
> prevline = ""
> last_bound = boundary + '--'
> carry = None
> while 1:
> line = req.readline(readBlockSize)
> if not line or line.startswith(boundary):
> if prevline.endswith('\r\n'):
> if carry is not None:
> file.write(carry)
> file.write(prevline[:-2])
> break
> elif (carry == '\r') and (prevline[-1] == '\n'):
> file.write(prevline[:-1])
> break
> # If we get here, it's not really a boundary!
> if carry is not None:
> file.write(carry)
> carry = None
> if prevline[-1:] == '\r':
> file.write(prevline[:-1])
> carry = '\r'
> else:
> file.write(prevline)
> prevline = line
>
> I've attached a modified upload_test_harness.py that includes the
> new and current, also the 'org' version (as in 3.1 release) and the
> 'mike' version.
>
> In addition, I added some profiling calls to show the impact of the
> extra 'endswith' and slices.
>
> --
> Mike Looijmans
> Philips Natlab / Topic Automation
> #!/usr/bin/env python
>
> import sys
> from cStringIO import StringIO
> import md5
>
> ##def generate_split_file(offset=-1,
> ## readBlockSize=65368,
> ## fname='testfile'):
> ## f = open(fname, 'wb')
> ## f.write('a'*50)
> ## f.write('\r\n')
> ## block_size = readBlockSize + offset
> ## f.write('b'*block_size)
> ## f.close()
>
> def read_to_boundary_current(self, req, boundary, file,
> readBlockSize):
> ''' currrent version '''
> #
> # Although technically possible for the boundary to be split by
> the read, this will
> # not happen because the readBlockSize is set quite high - far
> longer than any boundary line
> # will ever contain.
> #
> # lastCharCarried is used to detect the situation where the \r
> \n is split across the end of
> # a read block.
> #
> delim = ''
> lastCharCarried = False
> last_bound = boundary + '--'
> roughBoundaryLength = len(last_bound) + 128
> line = req.readline(readBlockSize)
> lineLength = len(line)
> if lineLength < roughBoundaryLength:
> sline = line.strip()
> else:
> sline = ''
> while lineLength > 0 and sline != boundary and sline !=
> last_bound:
> if not lastCharCarried:
> file.write(delim)
> delim = ''
> else:
> lastCharCarried = False
> cutLength = 0
>
> if lineLength == readBlockSize:
> if line[-1:] == '\r':
> delim = '\r'
> cutLength = -1
> lastCharCarried = True
>
> if line[-2:] == '\r\n':
> delim += '\r\n'
> cutLength = -2
> elif line[-1:] == '\n':
> delim += '\n'
> cutLength = -1
> if cutLength != 0:
> file.write(line[:cutLength])
> else:
> file.write(line)
>
> line = req.readline(readBlockSize)
> lineLength = len(line)
> if lineLength < roughBoundaryLength:
> sline = line.strip()
> else:
> sline = ''
>
> def read_to_boundary_new(self, req, boundary, file, readBlockSize):
> ''' Alexis' version
> read from the request object line by line with a maximum size,
> until the new line starts with boundary
> '''
> previous_delimiter = ''
> while 1:
> line = req.readline(readBlockSize)
> if line.startswith(boundary):
> break
>
> if line.endswith('\r\n'):
> file.write(previous_delimiter + line[:-2])
> previous_delimiter = '\r\n'
>
> elif line.endswith('\r') or line.endswith('\n'):
> file.write(previous_delimiter + line[:-1])
> previous_delimiter = line[-1:]
>
> else:
> file.write(previous_delimiter + line)
> previous_delimiter = ''
>
> def read_to_boundary_org(self, req, boundary, file, readBlockSize):
> delim = ""
> line = req.readline(readBlockSize)
> while line and not line.startswith(boundary):
> odelim = delim
> if line[-2:] == "\r\n":
> delim = "\r\n"
> line = line[:-2]
> elif line[-1:] == "\n":
> delim = "\n"
> line = line[:-1]
> else:
> delim = ""
> file.write(odelim + line)
> line = req.readline(readBlockSize)
>
> def read_to_boundary_mike(self, req, boundary, file,
> readBlockSize=65536):
> prevline = ""
> last_bound = boundary + '--'
> carry = None
> while 1:
> line = req.readline(readBlockSize)
> if not line or line.startswith(boundary):
> if prevline.endswith('\r\n'):
> if carry is not None:
> file.write(carry)
> file.write(prevline[:-2])
> break
> elif (carry == '\r') and (prevline[-1] == '\n'):
> file.write(prevline[:-1])
> break
> # If we get here, it's not really a boundary!
> if carry is not None:
> file.write(carry)
> carry = None
> if prevline[-1:] == '\r':
> file.write(prevline[:-1])
> carry = '\r'
> else:
> file.write(prevline)
> prevline = line
>
> def get_checksum(fname):
> data = open(fname, 'rb').read()
> m = md5.new()
> m.update(data)
> return m.hexdigest()
>
> def generate_embedded_cr_file(offset=-1, readBlockSize=65368,
> chunk='\r', fname='testfile'):
> """ Generate a file which causes the error with file upload
> The default offset of -1 should generate a file which will
> be corrupted by the file upload.
> """
>
> f = open(fname, 'wb')
> f.write('a'*50)
> f.write('\r\n')
>
> block_size = readBlockSize + offset
> f.write('b'*block_size)
> f.write(chunk)
> f.write('ccc')
>
> f.write('d'*50)
> f.write('\r\n')
>
> f.close()
>
> def generate_split_boundary_file(offset=-1, readBlockSize=65368,
> chunk='\r', fname='testfile'):
> """ this function generates a file with a boundary string '\r
> \n--myboundary'
> starting at readBlockSize - offset
> """
> f = open(fname, 'wb')
> f.write('a'*50)
> f.write('\r\n')
>
> block_size = readBlockSize + offset
> f.write('b'*block_size)
> f.write(chunk)
>
> f.close()
>
> read_boundaries = [read_to_boundary_current, read_to_boundary_new,
> read_to_boundary_org, read_to_boundary_mike]
>
> def main(file_generator, offset, chunk, block_size=1<<16):
> fname_in = 'testfile.in'
> fname_out_base = 'testfile.out'
> file_generator(offset=offset, readBlockSize=block_size,
> chunk=chunk, fname=fname_in)
>
> orig_checksum = get_checksum(fname_in)
>
> req = StringIO()
> req.write(open(fname_in, 'rb').read())
> req.write('\r\n--myboundary\r\n')
>
> src_cs = get_checksum(fname_in)
> print ' src', src_cs
>
> for rtb in read_boundaries:
> name = rtb.__name__.split('_')[-1]
> fname_out = fname_out_base + name
> o = file(fname_out, 'wb')
> req.seek(0)
> rtb(None, req, '--myboundary', o, block_size)
> size = o.tell()
> o.close()
> cs = get_checksum(fname_out)
> print "%8s %s %6d" % (name, cs, size),
> if cs != src_cs:
> print 'FAIL'
> else:
> print 'PASS'
>
>
> def cname(ch):
> if ch == '\r':
> return 'CR'
> elif ch == '\n':
> return 'LF'
> elif ch == '':
> return 'None'
> else:
> return ord(ch)
>
> class DevNull:
> def write(self, data):
> pass
>
> if __name__ == '__main__':
>
> #test_chunks = ['', '\r', '\n', '\r\n']
>
> # only test the chunks that are currently a problem
> test_chunks = ['', '\r',]
>
> test_cases = (generate_embedded_cr_file,
> generate_split_boundary_file)
> for file_gen_obj in test_cases:
> print '='*40
> print file_gen_obj.__name__
> for chunk in test_chunks:
> for i in range(-1, 0):
> print '-'*40
> print 'test offset', i, 'chunk',[ cname(c) for c in
> chunk ]
> print '-'*40
> main(file_gen_obj, i, chunk)
> print
> print
> ## sys.exit(0)
>
> # Run profiling benchmarks with the four files
> # (on my system, the _mike and _org versions are typically almost
> # twice as fast as the _new and _current)
> import profile, random
> req = StringIO()
> blocksize = 102400
> block=[]
> r = random.Random()
> for i in xrange(blocksize):
> block.append(chr(r.randint(0,255)))
> block = "".join(block)
> for c in range(64):
> req.write(block)
> req.write('\r\n--myboundary\r\n')
> print "Benchmark bytes:", req.tell()
> out = DevNull()
> for rtb in read_boundaries:
> req.seek(0)
> print rtb.__name__
> profile.run("%s(None, req, '--myboundary', out, 65536)" %
> rtb.__name__)
>
| |
| Alexis Marrero 2005-11-08, 5:56 pm |
| Inspired by Mike's changes I made some changes the "new" version to
improve performance while keeping readability:
def read_to_boundary_new(self, req, boundary, file, readBlockSize):
previous_delimiter = ''
bound_length = len(boundary)
while 1:
line = req.readline(readBlockSize)
if line[:bound_length] == boundary:
break
if line[-2:] == '\r\n':
file.write(previous_delimiter + line[:-2])
previous_delimiter = '\r\n'
elif line[-1:] == '\r':
file.write(previous_delimiter + line[:-1])
previous_delimiter = '\r'
elif line[-1:] == '\n':
if len(line[:-1]) > 0:
file.write(previous_delimiter + line[:-1])
previous_delimiter = '\n'
else:
previous_delimiter += '\n'
else:
file.write(previous_delimiter + line)
previous_delimiter = ''
Results are very comparable to read_to_boundary_mike
/amn
On Nov 8, 2005, at 9:07 AM, Jim Gallacher wrote:
> Mike Looijmans wrote:
>
>
> Nice changes, Mike.
>
> I started to get confused by the names of the various
> read_to_boundary_* functions, so I've made a slight modification to
> your code and renamed these functions.
>
> Note also that there are some failures for the split_boundary file
> using offset -2 chunk [CR], so I've change the default offset to
> include this condition.
>
> I've also changed the generate_*file functions to embed an
> additional '\r' in a short string. Just being paranoid I guess.
>
> Diff is attached.
>
> If it's useful I can stick the upload_test_harness code in the
> mod_python svn repository.
>
> Jim
>
> --- upload_test_harness_mike.py.orig 2005-11-08
> 08:02:13.000000000 -0500
> +++ upload_test_harness_mike.py 2005-11-08 08:54:15.000000000 -0500
> @@ -14,8 +14,8 @@
> ## f.write('b'*block_size)
> ## f.close()
>
> -def read_to_boundary_current(self, req, boundary, file,
> readBlockSize):
> - ''' currrent version '''
> +def read_to_boundary_324(self, req, boundary, file, readBlockSize):
> + ''' currrent version in 3.2.4b'''
> #
> # Although technically possible for the boundary to be split
> by the read, this will
> # not happen because the readBlockSize is set quite high - far
> longer than any boundary line
> @@ -66,7 +66,7 @@
> else:
> sline = ''
>
> -def read_to_boundary_new(self, req, boundary, file, readBlockSize):
> +def read_to_boundary_alexis(self, req, boundary, file,
> readBlockSize):
> ''' Alexis' version
> read from the request object line by line with a maximum
> size,
> until the new line starts with boundary
> @@ -89,7 +89,7 @@
> file.write(previous_delimiter + line)
> previous_delimiter = ''
>
> -def read_to_boundary_org(self, req, boundary, file, readBlockSize):
> +def read_to_boundary_314(self, req, boundary, file, readBlockSize):
> delim = ""
> line = req.readline(readBlockSize)
> while line and not line.startswith(boundary):
> @@ -145,6 +145,8 @@
>
> f = open(fname, 'wb')
> f.write('a'*50)
> + f.write('\r')
> + f.write('a'*50)
> f.write('\r\n')
>
> block_size = readBlockSize + offset
> @@ -163,6 +165,8 @@
> """
> f = open(fname, 'wb')
> f.write('a'*50)
> + f.write('\r')
> + f.write('a'*50)
> f.write('\r\n')
>
> block_size = readBlockSize + offset
> @@ -171,7 +175,7 @@
>
> f.close()
>
> -read_boundaries = [read_to_boundary_current, read_to_boundary_new,
> read_to_boundary_org, read_to_boundary_mike]
> +read_boundaries = [read_to_boundary_324, read_to_boundary_alexis,
> read_to_boundary_314, read_to_boundary_mike]
>
> def main(file_generator, offset, chunk, block_size=1<<16):
> fname_in = 'testfile.in'
> @@ -218,6 +222,9 @@
> pass
>
> if __name__ == '__main__':
> + # offsets which show problems are in [-2, -1]
> + first_offset = -2
> + last_offset = 0
>
> #test_chunks = ['', '\r', '\n', '\r\n']
>
> @@ -229,7 +236,7 @@
> print '='*40
> print file_gen_obj.__name__
> for chunk in test_chunks:
> - for i in range(-1, 0):
> + for i in range(first_offset, last_offset):
> print '-'*40
> print 'test offset', i, 'chunk',[ cname(c) for c
> in chunk ]
> print '-'*40
>
| |
| Nicolas Lehuen 2005-11-12, 7:45 am |
| Guys, next time please use the JIRA bug tracking application available at :
http://nagoya.apache.org/jira/browse/MODPYTHON
Especially this bug report :
http://issues.apache.org/jira/browse/MODPYTHON-40
I'm currently re-reading the whole thread and trying to make sense of
all the test files and patches that were submitted, and it's quite
unpleasant, especially since I'm using Google mail which tries to be
too clever about attachments.
Using JIRA, we have a permanent tracking of what's happening AND we
can store attachments in a nice way.
Regards,
Nicolas
2005/11/8, Alexis Marrero <amarrero@mitre.org>:
> Inspired by Mike's changes I made some changes the "new" version to
> improve performance while keeping readability:
>
> def read_to_boundary_new(self, req, boundary, file, readBlockSize):
> previous_delimiter =3D ''
> bound_length =3D len(boundary)
> while 1:
> line =3D req.readline(readBlockSize)
> if line[:bound_length] =3D=3D boundary:
> break
>
> if line[-2:] =3D=3D '\r\n':
> file.write(previous_delimiter + line[:-2])
> previous_delimiter =3D '\r\n'
>
> elif line[-1:] =3D=3D '\r':
> file.write(previous_delimiter + line[:-1])
> previous_delimiter =3D '\r'
>
> elif line[-1:] =3D=3D '\n':
> if len(line[:-1]) > 0:
> file.write(previous_delimiter + line[:-1])
> previous_delimiter =3D '\n'
>
> else:
> previous_delimiter +=3D '\n'
>
> else:
> file.write(previous_delimiter + line)
> previous_delimiter =3D ''
>
> Results are very comparable to read_to_boundary_mike
>
> /amn
> On Nov 8, 2005, at 9:07 AM, Jim Gallacher wrote:
>
>
>
| |
| Nicolas Lehuen 2005-11-12, 5:46 pm |
| OK, this time I think the file upload problem is solved for good. I've
checked-in Alexis's code, with comments. Then I've done a quick
rewrite of the multipart/form-data parser found in
FieldStorage.__init__ and read_to_boundary so that it uses a regexp
for the boundary checks, with the hope that it simplify the code a
little bit (and remove thos nasty strip() calls). I've re-ran all
tests and everything seems OK. Now, Alexis, if you can run your test
with the 60,000+ files on this version of util.py, that would be great
:
http://svn.apache.org/viewcvs.cgi/h...python/mod_pyt=
hon/util.py?rev=3D332779&view=3Dmarkup
Ironically, the JIRA server seems non responsive right now, so I can't
comment or close MODPYTHON-40...
Regards,
Nicolas
2005/11/12, Nicolas Lehuen <nicolas.lehuen@gmail.com>:
> Guys, next time please use the JIRA bug tracking application available at=
:
>
> http://nagoya.apache.org/jira/browse/MODPYTHON
>
> Especially this bug report :
>
> http://issues.apache.org/jira/browse/MODPYTHON-40
>
> I'm currently re-reading the whole thread and trying to make sense of
> all the test files and patches that were submitted, and it's quite
> unpleasant, especially since I'm using Google mail which tries to be
> too clever about attachments.
>
> Using JIRA, we have a permanent tracking of what's happening AND we
> can store attachments in a nice way.
>
> Regards,
> Nicolas
>
> 2005/11/8, Alexis Marrero <amarrero@mitre.org>:
0[vbcol=seagreen]
>
|
|
|
|
|