Unix administration - Correlating Files, Compute similarity threshold?


Marc David Ronell

2004-03-25, 8:43 am


Is there a UNIX utility, like diff or cmp, that compares or correlates
two files and reports some sort of figure of merit describing how
similar the files are?

The idea is to cross-correlate a group of files and locate the files
which are most similar. I would like to use a utility as described
above to get a rough cut and then manually review those files above a
certain cross-correlation threshold.

Thanks,

marc

Jeff Schwab

2004-03-25, 9:51 am

Marc David Ronell wrote:
> Is there a UNIX utility, like diff or cmp, that compares or correlates
> two files and reports some sort of figure of merit describing how
> similar the files are?
>
> The idea is to cross-correlate a group of files and locate the files
> which are most similar. I would like to use a utility as described
> above to get a rough cut and then manually review those files above a
> certain cross-correlation threshold.


I don't know of any standard utility, but the brain-sweat has been done
already. Memory-map the files (or load them into strings) and find the
Levenshtein Distance.

http://www.merriampark.com/ld.htm#CPLUSPLUS

The author of these particular implementations is a bit whiny; e.g., the
complaint about the difficulty of matrix manipulation would go away
entirely if the author would use the standard library properly. But I
digress...
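
For illustration, here is a minimal sketch of the approach in C++ using only the standard library. It is not the code from the linked page. The function names (load_file, levenshtein, similarity) and the normalization to a 0..1 score are assumptions made for this example. It loads files into strings rather than memory-mapping them, and it keeps two rolling rows of the distance table instead of a full matrix.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: read a whole file into a string
// (the "load them into strings" option from the post).
std::string load_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions that turn a into b.
// Only the previous and current rows of the table are stored.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Assumed figure of merit: 1.0 for identical inputs, 0.0 when the
// distance equals the length of the longer input.
double similarity(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0;  // two empty inputs count as identical
    return 1.0 - static_cast<double>(levenshtein(a, b)) /
                     static_cast<double>(longest);
}
```

To get the rough cut, one could call similarity(load_file(x), load_file(y)) for every pair of files and review the pairs that score above a chosen threshold. The running time grows with the product of the two file sizes, so this sketch is only practical for fairly small files.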