Backup Software - Need a Binary diff / compressor for win32

This is Interesting: Free IT Magazines  
Home > Archive > Backup Software > June 2006 > Need a Binary diff / compressor for win32





You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

Author Need a Binary diff / compressor for win32
atamido@gmail.com

2006-05-24, 7:12 pm

I have some database-like files that I need to do regular backups of,
to an offsite (online) location. These aren't real database files, so
SQL type differential backups aren't an option. The files are large
(2-8GB each), and the upload speed is slow (2mbps), so getting small
sizes will be a necessity. Between two backups, at least 95% of the
data will be the same, although it may have moved around in the file so
offsets of identical data will be completely different.

I can stagger uploading full copies of the different files to spread
out the uploading. I can also keep some backups of the files locally.

My Plan:
1. Compress a given file using the maximum level of some compressor.
(Probably 7-Zip, although I'm not married to it.)
2. Upload the compressed version to the offsite location and keep the
original local to the system.
3. For the next backup, create a diff between the original and the new
version. (I need a diff program that can be used to reconstruct one
file from the other.)
4. Compress the diff and upload it. Delete the local original copy and
keep the current backup locally.
5. Repeat steps 3-4 until a current file can be uploaded in it's
entirety.

Questions:
1. Is this a workable plan, or is it flawed greatly in some way? Are
there improvements to be made in it?
2. Is there some sort of binary-diff/compression program out there that
would meet my needs fully? (and cheaply)
3. If not 2, then is there a really good binary diff program I could
use to create diffs before compressing?
4. Is there another group more appropriate for my problem?


Atamido

H.D.

2006-05-25, 7:12 am

On 24 May 2006 15:04:51 -0700, atamido@gmail.com wrote:

>I have some database-like files that I need to do regular backups of,
>to an offsite (online) location.


---
>My Plan:
>1. Compress a given file using the maximum level of some compressor.
>(Probably 7-Zip, although I'm not married to it.)
>2. Upload the compressed version to the offsite location and keep the
>original local to the system.
>3. For the next backup, create a diff between the original and the new
>version. (I need a diff program that can be used to reconstruct one
>file from the other.)
>4. Compress the diff and upload it. Delete the local original copy and
>keep the current backup locally.
>5. Repeat steps 3-4 until a current file can be uploaded in it's
>entirety.
>
>Questions:
>1. Is this a workable plan, or is it flawed greatly in some way? Are
>there improvements to be made in it?
>2. Is there some sort of binary-diff/compression program out there that
>would meet my needs fully? (and cheaply)
>3. If not 2, then is there a really good binary diff program I could
>use to create diffs before compressing?
>4. Is there another group more appropriate for my problem?


I have almost the same workflow you want to use.
I use WinRAR in combination with OCB or One Click Backup. (see
www.acritum.com)
Tweak a bit with the switches and up you are ;-)

atamido@gmail.com

2006-05-29, 5:04 pm

H.D. wrote:
> On 24 May 2006 15:04:51 -0700, atamido@gmail.com wrote:
>
> I have almost the same workflow you want to use.
> I use WinRAR in combination with OCB or One Click Backup. (see
> www.acritum.com)
> Tweak a bit with the switches and up you are ;-)


The software looks very useful, thanks! Do you know if it will do
binary diffs for large files?


Atamido

Claudio Grondi

2006-05-29, 5:04 pm

atamido@gmail.com wrote:
> H.D. wrote:
>
> The software looks very useful, thanks! Do you know if it will do
> binary diffs for large files?
>
>
> Atamido
>

I looked into it some time ago and the only promising approach seemed to
be usage of a commercial package called Caché by Intersystems:
http://www.intersystems.com/cache/
pretending to be able to handle file sizes you are mentioning (far
beyond what can be stored in RAM) in case of differential backups on
usual PC hardware.

I haven't installed and tried it yet, so I would be glad to see if it
keeps what it promises and/or to get knowledge about any other solution
to the problem you mention.

If I have got it right WinRAR and 7zip can use very large dictionaries,
but can't handle file sizes not fitting into RAM or perform a
differential compression and there is no Open Source software able to do
what you need.

In my eyes the problem you have is still an unsolved general problem of
performing an effective compression of very large amount of data (> 4
Gigabyte) taking advantage from repetitions randomly spread over the
data using standard personal computer hardware.

I would be glad to be proven wrong and therefore eager awaiting expert
response on this subject.

Claudio
Jasen Betts

2006-05-29, 5:04 pm

On 2006-05-25, atamido@gmail.com <atamido@gmail.com> wrote:
> H.D. wrote:
>


> The software looks very useful, thanks! Do you know if it will do
> binary diffs for large files?


you won't find software to do that (finding where data has been moved to in
huge files given only before and after data) it'd take weeks to run.
(given a cluster of 1000 or so pcs you might be able to reduce it to an
overnight task (per file))

look into rsync, or a streaming backup system (which records a log of the
changes made as they are made)



--

Bye.
Jasen
atamido@gmail.com

2006-05-29, 5:04 pm

Claudio Grondi wrote:
> If I have got it right WinRAR and 7zip can use very large dictionaries,
> but can't handle file sizes not fitting into RAM or perform a
> differential compression and there is no Open Source software able to do
> what you need.
>
> In my eyes the problem you have is still an unsolved general problem of
> performing an effective compression of very large amount of data (> 4
> Gigabyte) taking advantage from repetitions randomly spread over the
> data using standard personal computer hardware.
>
> I would be glad to be proven wrong and therefore eager awaiting expert
> response on this subject.


While not what I'm looking for, and not exactly what you're asking
about, I did come across cromfs. However, this is a compressed file
system for Linux, not a general archiver. From their compression page:


http://bisqwit.iki.fi/source/cromfs.html#compression
"An explanation why mkcromfs beats 7-zip in the NES ROM packing test:

7-zip packs all the files together as one stream. The maximum
dictionary size is 256 MB. (Note: The default for "maximum compression"
is 32 MB.) When 256 MB of data has been packed and more data comes in,
similarities between the first megabytes of data and the latest data
are not utilized. For example, Mega Man and Rockman are two almost
identical versions of the same image, but because there's more than 400
MB of files in between of those when they are processed in alphabetical
order, 7-zip does not see that they are similar, and will compress each
one separately.
7-zip's chances could be improved by sorting the files so that it
will process similar images sequentially. It already attempts to
accomplish this by sorting the files by filename extension and
filename, but it is not always the optimal way, as shown here.

mkcromfs however keeps track of all blocks it has encoded, and will
remember similarities no matter how long ago they were added to the
archive. This is why it outperforms 7-zip in this case, even when it
only used 16 MB fblocks."


Perhaps if someone found a way to use this file system's algorithms in
a general compression utility...


Atamido

atamido@gmail.com

2006-05-29, 5:04 pm

Jasen Betts wrote:
> On 2006-05-25, atamido@gmail.com <atamido@gmail.com> wrote:
>
> you won't find software to do that (finding where data has been moved to in
> huge files given only before and after data) it'd take weeks to run.
> (given a cluster of 1000 or so pcs you might be able to reduce it to an
> overnight task (per file))


The computational requirements are largely dependant on how large you
expect your similar blocks to be. For instance, in my case I expect
similar fields data blocks to be at least 4KB in size (although it may
actually be 1MB or more). Using that assumption, I can drasticaly
reduce my data compare size. For example, having two files, File A
(the original) and File B (the new version):

1. Take a 16 byte (128 bit) sample every 4KB in the file.
2. Do a quick check to make sure there are no identical samples and
take new samples where appropriate.
3. Create an index of markers for ever 4KB in File B.
4. Take a 4KB chunk and do a search in it for each of the 16 byte
samples.
5. For each of the matching samples, take larger segments from around
the matching samples to compare between File A and File B.
6. For the sample with the largest matching segment, update your index
markers to match the size of the segment.
7. Repeat for the entire file.
8. Save the index of matching segments, and save all of the unmatched
data.

Using the saved data, you should be able to reconstruct either the
original or the new copy from the other. So with the diff(s), and
either the original file or latest file, you could recreate any copy in
the sequence. And the (hopefully) much smaller saved data would
probably mostly compress in a ratio similar to the original file.

Using an example file size of 8GB, there would be 2,000,000 4KB samples
and 2,000,000 markers. The samples would take up 32MB of space, which
would easily fit into RAM. The 4KB chunk would easily fit into the L2
cache (or L1 of most CPUs). You could easily fit 50,000 samples into
most L2 caches. Then initially you are mostly/basically just doing
string searches 50,000 times per cache miss (~40 times) and then doing
the more complicated extended search for matching samples, all per 4KB
chunk (2 million times).

With a decent L2 cache size and fast enough CPU/memory this really
shouldn't take all that long. And someone familiar with search
optimizations would make it go even faster. Of course, clustering this
would be relatively easy with with a linear speed growth, so 10 PCs
would just make it go at least 9 times faster.

But then again, I'm not a real programmer, so I may be way off base.
It would be nice though if someone made such a program, even if just to
prove me wrong. :P

> look into rsync, or a streaming backup system (which records a log of the
> changes made as they are made)


Sadly, that isn't an option with these files.


Atamido

Claudio Grondi

2006-05-29, 5:04 pm

atamido@gmail.com wrote:
> Jasen Betts wrote:
>
>
>
> The computational requirements are largely dependant on how large you
> expect your similar blocks to be. For instance, in my case I expect
> similar fields data blocks to be at least 4KB in size (although it may
> actually be 1MB or more). Using that assumption, I can drasticaly
> reduce my data compare size. For example, having two files, File A
> (the original) and File B (the new version):
>
> 1. Take a 16 byte (128 bit) sample every 4KB in the file.

Your idea is to split a file into 4KB chunks, what you hope will help
you to do the job.
The problem with this approach is, that it assumes identical chunks
aligned at 4KB boundaries. I am quite sure, that this can't be assumed
for most files (including database files), so this approach won't work
because of misalignment of identical chunks.
In general case I see only the chance of moving e.g. a 4KB "window" byte
by byte over the entire file, so that for large files you will get
nearly as much chunks to consider as bytes in the file.
You will need for your chunk dictionary (which should allow multiple
entries for same index value) at least four times the size of the file
(for e.g. CRC32 of the chunks as hash) + the size of the information
about the chunk position in that file.
As pointed out lately in another thread the problem with databases and
dictionaries is the time of insertion of new records which increases
with the amount of already inserted records and can't be handled in
practice when the number of records goes far beyond millions of them
(and you will need 8 thousand millions of records for a 8 GByte file)
not mentioning here that the size of the database file storing the
indexed records will probably exceed the storage capability of existing
hard drives.

> 2. Do a quick check to make sure there are no identical samples and
> take new samples where appropriate.
> 3. Create an index of markers for ever 4KB in File B.
> 4. Take a 4KB chunk and do a search in it for each of the 16 byte
> samples.
> 5. For each of the matching samples, take larger segments from around
> the matching samples to compare between File A and File B.
> 6. For the sample with the largest matching segment, update your index
> markers to match the size of the segment.
> 7. Repeat for the entire file.
> 8. Save the index of matching segments, and save all of the unmatched
> data.
>
> Using the saved data, you should be able to reconstruct either the
> original or the new copy from the other. So with the diff(s), and
> either the original file or latest file, you could recreate any copy in
> the sequence. And the (hopefully) much smaller saved data would
> probably mostly compress in a ratio similar to the original file.
>
> Using an example file size of 8GB, there would be 2,000,000 4KB samples
> and 2,000,000 markers. The samples would take up 32MB of space, which
> would easily fit into RAM. The 4KB chunk would easily fit into the L2
> cache (or L1 of most CPUs). You could easily fit 50,000 samples into
> most L2 caches. Then initially you are mostly/basically just doing
> string searches 50,000 times per cache miss (~40 times) and then doing
> the more complicated extended search for matching samples, all per 4KB
> chunk (2 million times).
>
> With a decent L2 cache size and fast enough CPU/memory this really
> shouldn't take all that long. And someone familiar with search
> optimizations would make it go even faster. Of course, clustering this
> would be relatively easy with with a linear speed growth, so 10 PCs
> would just make it go at least 9 times faster.
>
> But then again, I'm not a real programmer, so I may be way off base.
> It would be nice though if someone made such a program, even if just to
> prove me wrong. :P

see above for my comments on step 1. of the proposed procedure.

Claudio
>
>
>
>
> Sadly, that isn't an option with these files.
>
>
> Atamido
>

atamido@gmail.com

2006-05-29, 5:04 pm

Claudio Grondi wrote:
> atamido@gmail.com wrote:
> Your idea is to split a file into 4KB chunks, what you hope will help
> you to do the job.


I think I may have worded this poorly. It should also say File A for
step 1. My idea would be to take a small sample from (not necessarily)
regular intervals throughout the file. 16 bytes out of every 4KB is
about 1/256th the data, or a lot less data to move around.

If you were to take a 4KB sample, when you expect 4KB to be the minimum
matching size, then comparing then you run a high risk of not getting a
match because only (for instance) the first half of the sample matches
part of File B while the second half matches a completely different
part of File B.

So taking small samples throughout the file at approximately intervals
expected to be the minimum match length gives you two advantages, 1) it
makes the data much smaller and 2) increases the chance of a match for
the full sample from File A to some part of some segment in File B.

> The problem with this approach is, that it assumes identical chunks
> aligned at 4KB boundaries. I am quite sure, that this can't be assumed
> for most files (including database files), so this approach won't work
> because of misalignment of identical chunks.


Of course the chances of the data being aligned the same in two files
is essentially nil. that is why you search the entire 4KB chunk of
File B with the each sample to try and find a match SOMEWHERE in it.
That is step 4.

Once you've found the samples from File A that have a match somewhere
in a given chunk from File B, then you find out how much of the data in
File A around the sample from File A matches the data from around the
match that was found in File B.

> In general case I see only the chance of moving e.g. a 4KB "window" byte
> by byte over the entire file, so that for large files you will get
> nearly as much chunks to consider as bytes in the file.
> You will need for your chunk dictionary (which should allow multiple
> entries for same index value) at least four times the size of the file
> (for e.g. CRC32 of the chunks as hash) + the size of the information
> about the chunk position in that file.


Which is why you wouldn't do that.

> As pointed out lately in another thread the problem with databases and
> dictionaries is the time of insertion of new records which increases
> with the amount of already inserted records and can't be handled in
> practice when the number of records goes far beyond millions of them
> (and you will need 8 thousand millions of records for a 8 GByte file)
> not mentioning here that the size of the database file storing the
> indexed records will probably exceed the storage capability of existing
> hard drives.


I honestly know nothing about that. Although, in my case, these
database-like files that I have are less than 100,000 "records". Nor
am I suggesting a dictionary, unless you consider having the entire
original file with pointers to it a dictionary.


Perhaps this would be easier with an explanation?

Lets say these are two files:
File A
SDKLFJIO409DSKLVN089904NDFNVOI089WFRNLIN
24R098J2
File B
JKFNVOI089WFRNLINNVKJASF8O4FHLKSDKLFJIO4
JNKLFJIO

I take two letters from the middle of every 12 characters in File A
JI
08
I0
4R

Then I try to match the first each of these character sets in the first
4 bytes of File B.
JKFNVOI089WF

"08" and "I0" both have a match in that chunk. Now you can do a more
complex compare starting from just two points instead of four.
Comparing to File A, these bytes match for each of the chunks:
Match around "08":
089
Match around "I0":
FNVOI089WFRNLIN

The match around "I0" is longer, so it is used. If you stopped here,
the Diff file will look somthing like this:

JK
Pointer saying the next 15 characters are the same as in File A
starting at character 25.
VKJASFH8O4FHLKSDKLFJIO4JNKLFJIO

So, it should be more clear that I am not saying to make thousands of
small files or to do the searches on KB aligned boundries.

This is not bad when you consider that a pointer might take 48 bytes,
but the chunks of data I'm looking for are at least 4KB in size.

Remember, this is just a way to do binary diffs, not a real form of
compression. as it requires the presence of the entire File A or File B
to reconstruct the other.


Atamido

Jasen Betts

2006-05-29, 5:04 pm

On 2006-05-26, atamido@gmail.com <atamido@gmail.com> wrote:
> Jasen Betts wrote:
>
> The computational requirements are largely dependant on how large you
> expect your similar blocks to be. For instance, in my case I expect
> similar fields data blocks to be at least 4KB in size (although it may
> actually be 1MB or more). Using that assumption, I can drasticaly
> reduce my data compare size. For example, having two files, File A
> (the original) and File B (the new version):


> 1. Take a 16 byte (128 bit) sample every 4KB in the file.


> 2. Do a quick check to make sure there are no identical samples and
> take new samples where appropriate.


There will be many. 16 bytes is "only three words" of ascii text, and only 8
characters in UTF 16. assuming there's a better way to represent each 4K
block in 16 bytes it could work..

> 3. Create an index of markers for ever 4KB in File B.
> 4. Take a 4KB chunk and do a search in it for each of the 16 byte
> samples.
> 5. For each of the matching samples, take larger segments from around
> the matching samples to compare between File A and File B.
> 6. For the sample with the largest matching segment, update your index
> markers to match the size of the segment.
> 7. Repeat for the entire file.
> 8. Save the index of matching segments, and save all of the unmatched
> data.


neat. that'd be much faster than a byte-by-bytes search for a match.
(I feel maybe 10000 times faster)
you're likely to miss any moved data that less than 2K
(it'll be treated as new data) probably not a big problem.

> Using the saved data, you should be able to reconstruct either the
> original or the new copy from the other. So with the diff(s), and
> either the original file or latest file, you could recreate any copy in
> the sequence. And the (hopefully) much smaller saved data would
> probably mostly compress in a ratio similar to the original file.
>
> Using an example file size of 8GB, there would be 2,000,000 4KB samples
> and 2,000,000 markers. The samples would take up 32MB of space, which
> would easily fit into RAM. The 4KB chunk would easily fit into the L2
> cache (or L1 of most CPUs). You could easily fit 50,000 samples into
> most L2 caches. Then initially you are mostly/basically just doing
> string searches 50,000 times per cache miss (~40 times) and then doing
> the more complicated extended search for matching samples, all per 4KB
> chunk (2 million times).


if you put the samples into a hash table it'll speed things up significantly
(~1000 times) and only cost maybe 50% more ram.

> But then again, I'm not a real programmer, so I may be way off base.
> It would be nice though if someone made such a program, even if just to
> prove me wrong. :P


Bye.
Jasen
atamido@gmail.com

2006-05-29, 5:04 pm

Jasen Betts wrote:
>
>
> There will be many. 16 bytes is "only three words" of ascii text, and only 8
> characters in UTF 16. assuming there's a better way to represent each 4K
> block in 16 bytes it could work..


Well, it would only be a problem if your sample just happenned to be
part of some very common phrase occuring in the file. If it occurred
1000 time, then that's a 1000 times that you needlessly do a more
complex matching. Otherwise, three words (or two long ones) in a
particular order aren't that common, and 5 or 6 false hits in a file
aren't going to cause to big of an increase in the total CPU time.

There are a few things you might be able to do to prevent it also.
Like specifically search for samples that contain non-alpha characters,
or alpha-numeric. You could also try that when your sample starts at a
space, to shift it slightly so that starts in the middle of the word
instead, resulting in partial coverage of more words and less likely to
have a random match.

Of course, you could also increase the sample size. I mention 128 bits
because it the largest possible value to fit in a single register (I
think, isn't that for MMX?), making computations that much more
possibly quick.

But hey, if someone programs this, then they have the freedom to test
and try all different sorts of options. ;)


Atamido

Jasen Betts

2006-05-29, 5:04 pm

["Followup-To:" header set to comp.compression.]
On 2006-05-27, atamido@gmail.com <atamido@gmail.com> wrote:


> Of course, you could also increase the sample size. I mention 128 bits
> because it the largest possible value to fit in a single register (I
> think, isn't that for MMX?), making computations that much more
> possibly quick.


I think there is more speed to be gained by getting the algorythm right than by
tuning stuff for particular hardware

>
> But hey, if someone programs this, then they have the freedom to test
> and try all different sorts of options. ;)
>


yeah, I'm wondering if there's a way to calculate the crc of the
remainder of an n-byte file with the first byte removed. given only
the first byte,and the crc of the whole file, without going through
n loops of the crc equation.

I'm looking for a sliding-window hash function.

Bye.
Jasen
atamido@gmail.com

2006-05-31, 7:12 pm

atamido@gmail.com wrote:
> The computational requirements are largely dependant on how large you
> expect your similar blocks to be. For instance, in my case I expect
> similar fields data blocks to be at least 4KB in size (although it may
> actually be 1MB or more). Using that assumption, I can drasticaly
> reduce my data compare size. For example, having two files, File A
> (the original) and File B (the new version):
>
> 1. Take a 16 byte (128 bit) sample every 4KB in the file.
> 2. Do a quick check to make sure there are no identical samples and
> take new samples where appropriate.
> 3. Create an index of markers for ever 4KB in File B.
> 4. Take a 4KB chunk and do a search in it for each of the 16 byte
> samples.
> 5. For each of the matching samples, take larger segments from around
> the matching samples to compare between File A and File B.
> 6. For the sample with the largest matching segment, update your index
> markers to match the size of the segment.
> 7. Repeat for the entire file.
> 8. Save the index of matching segments, and save all of the unmatched
> data.


I've been reading through the excellent layman's terms description of
how the *nix bindiff program works and it has given me some really good
ideas about how to reduce the amount of data that must be compared:
http://www.ddj.com/184409550?pgno=3

1. In the original file, moving in predefined chunk sizes (for example,
64KB chunks) create an index of how many times each byte occurs in each
chunk. (I'm going to refer to this as the byte-type-index.)
a. You only really need to store 0 - 32,768 instances from a single
chunk for this compare, so 2 bytes per [byte combination] is 512 bytes
per chunk. (Using 64KB chunks in an 8GB file would create a
[8,000,000KB / 64KB * 512B =] 64MB byte-type-index if using a flat
list.)
b. If you used a 4 bytes to store the number of bytes per chunk, it
would double the amount of RAM required for the byte-type-index, but
make it 100% accurate and remove a step of checking to see if you're
going to flip the counter.
c. For this part, it may be beneficial to count sequences of the
same byte as a single instance as we will only be using it once later
on. (If 0xFF occurs 10 times in a row, only count it once in the flat
list)

2. Using the byte-type-index, find the byte that occurs in every chunk,
but occurs the least number of times. (I'm going to refer to this byte
as the matching-byte.)
a. It would probably be smart to have an option here where a byte
only has to occur in 99% of chunks as there could be instances of
entire chunks filled with something like 0xFF. (Not being able to
match 1% of the file could be a pretty small concern really given the
potential tradeoff in speed due to reduced matching.)
b. It might also be feasible to have a secondary matching-byte for
when the primary byte combination doesn't exist in a chunk. (This
would allow you to get a hit in a chunk that doesn't have our selected
matching-byte. You would basically want to have a sub-routine from
here on out that only handled chunks without a primary matching-byte.)
c. Some sort of weighted preference should be used to pick bytes
that aren't ASCII letters/numbers to prevent to afore mentioned problem
of common words and phrases.

3. In the updated file, moving in the same predefined chunk sizes,
locate the first matching byte and take a sample starting with the
first byte after the matching byte.
a. If there is a run of matching-bytes, then use the last one. (If
your matching byte is 0x3B, and you have the sequence 0x45AF3B3B3B01AC
then your sample would start with 0x01AC.)
b. Only take one sample from each chunk. (We only need one as we
only NEED to find matches of at least about the size of the chunk.)
c. If there is not a matching-byte in the chunk, then skip the
chunk. (If you are using a secondary matching-byte, then call the
subroutine to handle it.)
d. You would of course need pointers indicating where in the updated
file this sample was taken from.

4. Sort the samples.
a. Start with the first byte and work your way from there. ( Given
three samples 34AB, 340F, and CE45: their sorted order would be 340F,
34AB, CE45)
b. A tree would be helpful here, where each branch holds samples
with the same first byte or two.
c. If one branch holds more than xx% of the samples, then you may
have hit to common of a byte pattern and step three should be started
over with a new matching-byte.
d. You can delete the byte-type-index to recover some RAM after this
point. If you aren't going to check branch sizes, then you can delete
it after step 2.
e. This step could be combined with step 3.

5. Using the original file, find the first matching-byte in the first
chunk.
a. If no matching bytes are found, ignore the chunk. (If you are
using secondary matching-bytes, then call that subroutine.)
b. As above, if there is a run of matching-bytes, then only use the
last one.

6. Compare the data after the first matching-byte to the samples to
find matching samples.
a. Using the tree from step 4, you can match a branch using the
first byte(s) of the data, and only compare those samples.

7. For the matched samples, perform a more complex compare between the
matched section of the original file and the section from the updated
file the matched sample is from.
a. I don't know the optimal method for searching, but the idea is to
find the how much of the data from the original file and the updated
file match each other. However that is done fastest.
b. The bounds for this compare are not limited to the size of the
chunk.

8. Store the largest matching size and pointers to the location in the
original and updated files in the diff.
a. If the largest matching size is less than the size required to
store the size and pointers, then skip this step.

9. Repeat steps 6 - 8 for all matching-bytes in the chunk.
a. If the largest matching size from step 8 overlaps following
matching-bytes, skip the overlapped matching bytes.

10. Repeat steps 5 - 9 for all chunks in the file.
a. If the largest matching size from step 8 overlaps part or all of
the following chunks, skip the overlapped parts/chunks.

11. Build a map of all the matched sections of the updated file, and
write the unmatched sections to the diff file.


The byte-type-index would be similar to a flat file in that it wouldn't
need to store pointers or identifiers. It would be similar to a table:

Chunk1 Chunk2 Chunk3 Chunk4 ...

0x00 25 581 5981 8145
0x01 25813 2 856 581
0x02 5861 5862 5861 11684
0x03 52 58 598 816
0x04 18505 9535 175 189
....

Except that using fixed field sizes, knowing the size of each chunk,
and how many chunks there are, you know exactly what each byte
corresponds to. This makes mining it for certain pieces of information
easy. If you only needed to know what how many times a byte value
occured and how many chunks it didn't occur in, a much less memory
intensive (and possibly faster) table would be:

Occurences Chunks without occurence

0x00 8425 581
0x01 65813 2
0x02 615861 5862
0x03 9852 58
0x04 223505 9535


Are there any other ideas to help make this a bit faster?


Atamido

Jasen Betts

2006-06-02, 7:12 am

>
> The byte-type-index would be similar to a flat file in that it wouldn't
> need to store pointers or identifiers. It would be similar to a table:
>
> Chunk1 Chunk2 Chunk3 Chunk4 ...
>
> 0x00 25 581 5981 8145
> 0x01 25813 2 856 581
> 0x02 5861 5862 5861 11684
> 0x03 52 58 598 816
> 0x04 18505 9535 175 189
> ...
>
> Except that using fixed field sizes, knowing the size of each chunk,
> and how many chunks there are, you know exactly what each byte
> corresponds to. This makes mining it for certain pieces of information
> easy. If you only needed to know what how many times a byte value
> occured and how many chunks it didn't occur in, a much less memory
> intensive (and possibly faster) table would be:
>
> Occurences Chunks without occurence
>
> 0x00 8425 581
> 0x01 65813 2
> 0x02 615861 5862
> 0x03 9852 58
> 0x04 223505 9535
>
>
> Are there any other ideas to help make this a bit faster?


Trees are slow O(n*log(n)), use a hash table, ignore collisions => O(n)

use a FIFO-compatible hashing algorythm, byte count would be one example but
it has too many bits.

characterise the original file in eg: 4k blocks (store the hash and
offset in the hash table)
if you get ha hash collision when storing a hash step forwards one byte in
the file and try again.

then run through the new file one byte at a time looking for matches, the rest
is fairly obvious I think.

Bye.
Jasen
atamido@gmail.com

2006-06-05, 7:15 am

Jasen Betts wrote:
>
> Trees are slow O(n*log(n)), use a hash table, ignore collisions => O(n)
>
> use a FIFO-compatible hashing algorythm, byte count would be one example but
> it has too many bits.
>
> characterise the original file in eg: 4k blocks (store the hash and
> offset in the hash table)
> if you get ha hash collision when storing a hash step forwards one byte in
> the file and try again.
>
> then run through the new file one byte at a time looking for matches, the rest
> is fairly obvious I think.


I can honestly say I didn't understand any of that enough to comment on
it. Seriously. I'm not sure the extent that a single or double depth
tree would have on speed, nor how to use the above algo to calculate
that. I don't understand how FIFO...hashing works. I'm not sure what
characterising, stepping forward, or etc.

Yeah. Would you point me towards some resources that explain these
concepts?


Atamido

Sponsored Links






Free braindumps | Software forum | Database administration forum

Copyright 2003 - 2008 webservertalk.com