Need a Binary diff / compressor for win32
Web Server forum
Back To The Forum Home!Search!Private Messaging System

Web Server Talk Web Server Talk > WebserverTalk Community > Backup Software > Need a Binary diff / compressor for win32




Pages (2): [1] 2 »   Last Thread   Next Thread Next
  Show Printable Version Email this Page Subscribe to this Thread      Post New Thread    Post A Reply      

    Need a Binary diff / compressor for win32  
atamido@gmail.com


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
05-25-06 12:12 AM

I have some database-like files that I need to do regular backups of,
to an offsite (online) location.  These aren't real database files, so
SQL type differential backups aren't an option.  The files are large
(2-8GB each), and the upload speed is slow (2mbps), so getting small
sizes will be a necessity.  Between two backups, at least 95% of the
data will be the same, although it may have moved around in the file so
offsets of identical data will be completely different.

I can stagger uploading full copies of the different files to spread
out the uploading.  I can also keep some backups of the files locally.

My Plan:
1. Compress a given file using the maximum level of some compressor.
(Probably 7-Zip, although I'm not married to it.)
2. Upload the compressed version to the offsite location and keep the
original local to the system.
3. For the next backup, create a diff between the original and the new
version.  (I need a diff program that can be used to reconstruct one
file from the other.)
4. Compress the diff and upload it.  Delete the local original copy and
keep the current backup locally.
5. Repeat steps 3-4 until a current file can be uploaded in it's
entirety.

Questions:
1. Is this a workable plan, or is it flawed greatly in some way?  Are
there improvements to be made in it?
2. Is there some sort of binary-diff/compression program out there that
would meet my needs fully? (and cheaply)
3. If not 2, then is there a really good binary diff program I could
use to create diffs before compressing?
4. Is there another group more appropriate for my problem?


Atamido






[ Post a follow-up to this message ]



    Re: Need a Binary diff / compressor for win32  
H.D.


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
05-25-06 12:12 PM

On 24 May 2006 15:04:51 -0700, atamido@gmail.com wrote:

>I have some database-like files that I need to do regular backups of,
>to an offsite (online) location.

---
>My Plan:
>1. Compress a given file using the maximum level of some compressor.
>(Probably 7-Zip, although I'm not married to it.)
>2. Upload the compressed version to the offsite location and keep the
>original local to the system.
>3. For the next backup, create a diff between the original and the new
>version.  (I need a diff program that can be used to reconstruct one
>file from the other.)
>4. Compress the diff and upload it.  Delete the local original copy and
>keep the current backup locally.
>5. Repeat steps 3-4 until a current file can be uploaded in it's
>entirety.
>
>Questions:
>1. Is this a workable plan, or is it flawed greatly in some way?  Are
>there improvements to be made in it?
>2. Is there some sort of binary-diff/compression program out there that
>would meet my needs fully? (and cheaply)
>3. If not 2, then is there a really good binary diff program I could
>use to create diffs before compressing?
>4. Is there another group more appropriate for my problem?

I have almost the same workflow you want to use.
I use WinRAR in combination with OCB or One Click Backup. (see
www.acritum.com)
Tweak a bit with the switches and up you are ;-)






[ Post a follow-up to this message ]



    Re: Need a Binary diff / compressor for win32  
atamido@gmail.com


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
05-29-06 10:04 PM

H.D. wrote:
> On 24 May 2006 15:04:51 -0700, atamido@gmail.com wrote: 
>
> I have almost the same workflow you want to use.
> I use WinRAR in combination with OCB or One Click Backup. (see
> www.acritum.com)
> Tweak a bit with the switches and up you are ;-)

The software looks very useful, thanks!  Do you know if it will do
binary diffs for large files?


Atamido






[ Post a follow-up to this message ]



    Re: Need a Binary diff / compressor for win32  
Claudio Grondi


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
05-29-06 10:04 PM

atamido@gmail.com wrote:
> H.D. wrote: 
>
> The software looks very useful, thanks!  Do you know if it will do
> binary diffs for large files?
>
>
> Atamido
>
I looked into it some time ago and the only promising approach seemed to
be usage of a commercial package called Caché by Intersystems:
http://www.intersystems.com/cache/
pretending to be able to handle file sizes you are mentioning (far
beyond what can be stored in RAM) in case of differential backups on
usual PC hardware.

I haven't installed and tried it yet, so I would be glad to see if it
keeps what it promises and/or to get knowledge about any other solution
to the problem you mention.

If I have got it right WinRAR and 7zip can use very large dictionaries,
but can't handle file sizes not fitting into RAM or perform a
differential compression and there is no Open Source software able to do
what you need.

In my eyes the problem you have is still an unsolved general problem of
performing an effective compression of very large amount of data (> 4
Gigabyte) taking advantage from repetitions randomly spread over the
data using standard personal computer hardware.

I would be glad to be proven wrong and therefore eager awaiting expert
response on this subject.

Claudio





[ Post a follow-up to this message ]



    Re: Need a Binary diff / compressor for win32  
Jasen Betts


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
05-29-06 10:04 PM

On 2006-05-25, atamido@gmail.com <atamido@gmail.com> wrote:
> H.D. wrote: 
>

> The software looks very useful, thanks!  Do you know if it will do
> binary diffs for large files?

you won't find software to do that (finding where data has been moved to in
huge files given only before and after data) it'd take weeks to run.
(given a cluster of 1000 or so pcs you might be able to reduce it to an
overnight task (per file))

look into rsync, or a streaming backup system (which records a log of the
changes made as they are made)



--

Bye.
Jasen





[ Post a follow-up to this message ]



    Re: Need a Binary diff / compressor for win32  
atamido@gmail.com


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
05-29-06 10:04 PM

Claudio Grondi wrote:
> If I have got it right WinRAR and 7zip can use very large dictionaries,
> but can't handle file sizes not fitting into RAM or perform a
> differential compression and there is no Open Source software able to do
> what you need.
>
> In my eyes the problem you have is still an unsolved general problem of
> performing an effective compression of very large amount of data (> 4
> Gigabyte) taking advantage from repetitions randomly spread over the
> data using standard personal computer hardware.
>
> I would be glad to be proven wrong and therefore eager awaiting expert
> response on this subject.

While not what I'm looking for, and not exactly what you're asking
about, I did come across cromfs.  However, this is a compressed file
system for Linux, not a general archiver.  From their compression page:


http://bisqwit.iki.fi/source/cromfs.html#compression
"An explanation why mkcromfs beats 7-zip in the NES ROM packing test:

7-zip packs all the files together as one stream. The maximum
dictionary size is 256 MB. (Note: The default for "maximum compression"
is 32 MB.) When 256 MB of data has been packed and more data comes in,
similarities between the first megabytes of data and the latest data
are not utilized. For example, Mega Man and Rockman are two almost
identical versions of the same image, but because there's more than 400
MB of files in between of those when they are processed in alphabetical
order, 7-zip does not see that they are similar, and will compress each
one separately.
7-zip's chances could be improved by sorting the files so that it
will process similar images sequentially. It already attempts to
accomplish this by sorting the files by filename extension and
filename, but it is not always the optimal way, as shown here.

mkcromfs however keeps track of all blocks it has encoded, and will
remember similarities no matter how long ago they were added to the
archive.  This is why it outperforms 7-zip in this case, even when it
only used 16 MB fblocks."


Perhaps if someone found a way to use this file system's algorithms in
a general compression utility...


Atamido






[ Post a follow-up to this message ]



    Re: Need a Binary diff / compressor for win32  
atamido@gmail.com


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
05-29-06 10:04 PM

Jasen Betts wrote:
> On 2006-05-25, atamido@gmail.com <atamido@gmail.com> wrote: 
>
> you won't find software to do that (finding where data has been moved to i
n
> huge files given only before and after data) it'd take weeks to run.
> (given a cluster of 1000 or so pcs you might be able to reduce it to an
> overnight task (per file))

The computational requirements are largely dependant on how large you
expect your similar blocks to be.  For instance, in my case I expect
similar fields data blocks to be at least 4KB in size (although it may
actually be 1MB or more).  Using that assumption, I can drasticaly
reduce my data compare size.  For example, having two files, File A
(the original) and File B (the new version):

1.  Take a 16 byte (128 bit) sample every 4KB in the file.
2.  Do a quick check to make sure there are no identical samples and
take new samples where appropriate.
3.  Create an index of markers for ever 4KB in File B.
4.  Take a 4KB chunk and do a search in it for each of the 16 byte
samples.
5.  For each of the matching samples, take larger segments from around
the matching samples to compare between File A and File B.
6.  For the sample with the largest matching segment, update your index
markers to match the size of the segment.
7.  Repeat for the entire file.
8.  Save the index of matching segments, and save all of the unmatched
data.

Using the saved data,  you should be able to reconstruct either the
original or the new copy from the other.  So with the diff(s), and
either the original file or latest file, you could recreate any copy in
the sequence.  And the (hopefully) much smaller saved data would
probably mostly compress in a ratio similar to the original file.

Using an example file size of 8GB, there would be 2,000,000 4KB samples
and 2,000,000 markers.  The samples would take up 32MB of space, which
would easily fit into RAM.  The 4KB chunk would easily fit into the L2
cache (or L1 of most CPUs).  You could easily fit 50,000 samples into
most L2 caches.  Then initially you are mostly/basically just doing
string searches 50,000 times per cache miss (~40 times) and then doing
the more complicated extended search for matching samples, all per 4KB
chunk (2 million times).

With a decent L2 cache size and fast enough CPU/memory this really
shouldn't take all that long.  And someone familiar with search
optimizations would make it go even faster.  Of course, clustering this
would be relatively easy with with a linear speed growth, so 10 PCs
would just make it go at least 9 times faster.

But then again, I'm not a real programmer, so I may be way off base.
It would be nice though if someone made such a program, even if just to
prove me wrong.  :P

> look into rsync, or a streaming backup system (which records a log of the
> changes made as they are made)

Sadly, that isn't an option with these files.  


Atamido






[ Post a follow-up to this message ]



    Re: Need a Binary diff / compressor for win32  
Claudio Grondi


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
05-29-06 10:04 PM

atamido@gmail.com wrote:
> Jasen Betts wrote:
> 
>
>
> The computational requirements are largely dependant on how large you
> expect your similar blocks to be.  For instance, in my case I expect
> similar fields data blocks to be at least 4KB in size (although it may
> actually be 1MB or more).  Using that assumption, I can drasticaly
> reduce my data compare size.  For example, having two files, File A
> (the original) and File B (the new version):
>
> 1.  Take a 16 byte (128 bit) sample every 4KB in the file.
Your idea is to split a file into 4KB chunks, what you hope will help
you to do the job.
The problem with this approach is, that it assumes identical chunks
aligned at 4KB boundaries. I am quite sure, that this can't be assumed
for most files (including database files), so this approach won't work
because of misalignment of identical chunks.
In general case I see only the chance of moving e.g. a 4KB "window" byte
by byte over the entire file, so that for large files you will get
nearly as much chunks to consider as bytes in the file.
You will need for your chunk dictionary (which should allow multiple
entries for same index value) at least four times the size of the file
(for e.g. CRC32 of the chunks as hash) + the size of the information
about the chunk position in that file.
As pointed out lately in another thread the problem with databases and
dictionaries is the time of insertion of new records which increases
with the amount of already inserted records and can't be handled in
practice when the number of records goes far beyond millions of them
(and you will need 8 thousand millions of records for a 8 GByte file)
not mentioning here that the size of the database file storing the
indexed records will probably exceed the storage capability of existing
hard drives.

> 2.  Do a quick check to make sure there are no identical samples and
> take new samples where appropriate.
> 3.  Create an index of markers for ever 4KB in File B.
> 4.  Take a 4KB chunk and do a search in it for each of the 16 byte
> samples.
> 5.  For each of the matching samples, take larger segments from around
> the matching samples to compare between File A and File B.
> 6.  For the sample with the largest matching segment, update your index
> markers to match the size of the segment.
> 7.  Repeat for the entire file.
> 8.  Save the index of matching segments, and save all of the unmatched
> data.
>
> Using the saved data,  you should be able to reconstruct either the
> original or the new copy from the other.  So with the diff(s), and
> either the original file or latest file, you could recreate any copy in
> the sequence.  And the (hopefully) much smaller saved data would
> probably mostly compress in a ratio similar to the original file.
>
> Using an example file size of 8GB, there would be 2,000,000 4KB samples
> and 2,000,000 markers.  The samples would take up 32MB of space, which
> would easily fit into RAM.  The 4KB chunk would easily fit into the L2
> cache (or L1 of most CPUs).  You could easily fit 50,000 samples into
> most L2 caches.  Then initially you are mostly/basically just doing
> string searches 50,000 times per cache miss (~40 times) and then doing
> the more complicated extended search for matching samples, all per 4KB
> chunk (2 million times).
>
> With a decent L2 cache size and fast enough CPU/memory this really
> shouldn't take all that long.  And someone familiar with search
> optimizations would make it go even faster.  Of course, clustering this
> would be relatively easy with with a linear speed growth, so 10 PCs
> would just make it go at least 9 times faster.
>
> But then again, I'm not a real programmer, so I may be way off base.
> It would be nice though if someone made such a program, even if just to
> prove me wrong.  :P
see above for my comments on step 1. of the proposed procedure.

Claudio
>
> 
>
>
> Sadly, that isn't an option with these files.  
>
>
> Atamido
>





[ Post a follow-up to this message ]



    Re: Need a Binary diff / compressor for win32  
atamido@gmail.com


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
05-29-06 10:04 PM

Jasen Betts wrote: 
> 
>
> There will be many. 16 bytes is "only three words" of ascii text, and only
 8
> characters in UTF 16. assuming there's a better way to represent each 4K
> block in 16 bytes it could work..

Well, it would only be a problem if your sample just happenned to be
part of some very common phrase occuring in the file.  If it occurred
1000 time, then that's a 1000 times that you needlessly do a more
complex matching.  Otherwise, three words (or two long ones) in a
particular order aren't that common, and 5 or 6 false hits in a file
aren't going to cause to big of an increase in the total CPU time.

There are a few things you might be able to do to prevent it also.
Like specifically search for samples that contain non-alpha characters,
or alpha-numeric.  You could also try that when your sample starts at a
space, to shift it slightly so that starts in the middle of the word
instead, resulting in partial coverage of more words and less likely to
have a random match.

Of course, you could also increase the sample size.  I mention 128 bits
because it the largest possible value to fit in a single register (I
think, isn't that for MMX?), making computations that much more
possibly quick.

But hey, if someone programs this, then they have the freedom to test
and try all different sorts of options.  ;)


Atamido






[ Post a follow-up to this message ]



    Re: Need a Binary diff / compressor for win32  
Jasen Betts


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
05-29-06 10:04 PM

["Followup-To:" header set to comp.compression.]
On 2006-05-27, atamido@gmail.com <atamido@gmail.com> wrote:


> Of course, you could also increase the sample size.  I mention 128 bits
> because it the largest possible value to fit in a single register (I
> think, isn't that for MMX?), making computations that much more
> possibly quick.

I think there is more speed to be gained by getting the algorythm right than
 by
tuning stuff for particular hardware

>
> But hey, if someone programs this, then they have the freedom to test
> and try all different sorts of options.  ;)
>

yeah, I'm wondering if there's a way to calculate the crc of the
remainder of an n-byte file with the first byte removed. given only
the first byte,and the crc of the whole file, without going through
n loops of the crc equation.

I'm looking for a sliding-window hash function.

Bye.
Jasen





[ Post a follow-up to this message ]



    Sponsored Links  




 





   All times are GMT. The time now is 10:44 AM.      Post New Thread    Post A Reply      
Pages (2): [1] 2 »   Last Thread   Next Thread Next


Most Popular forums 

Forum Jump:
Rate This Thread:

Forum Rules:
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is OFF
vB code is ON
Smilies are ON
[IMG] code is OFF
 
Medical and Health forum | Computer Games Reviews | Graphics design forum

Back To The Top
Home | Usercp | Faq | Register