Correlating Files, Compute similarity threshold?
| Marc David Ronell 2004-03-25, 8:43 am |
Is there a UNIX utility, like diff or cmp, that compares or correlates
two files and reports some sort of figure of merit describing how
similar the files are?
The idea is to cross-correlate a group of files and locate the files
that are most similar. I would like to use such a utility to get a
rough cut and then manually review the files above a certain
cross-correlation threshold.
Thanks,
marc
| Jeff Schwab 2004-03-25, 9:51 am |
Marc David Ronell wrote:
> Is there a UNIX utility, like diff or cmp which allows two files to be
> compared or correlated but its result is some sort of a figure of
> merit, which describes the similarity of the files?
>
> The idea is to cross correlate a group of files and locate the files
> which are most similar. I would like to use a utility as described
> above to get a rough cut and then manually review those files above a
> certain cross-correlation threshold.
I don't know of any standard utility, but the brain-sweat has been done
already. Memory-map the files (or load them into strings) and find the
Levenshtein Distance.
http://www.merriampark.com/ld.htm#CPLUSPLUS
The author of these particular implementations is a bit whiny; e.g., the
complaint about the difficulty of matrix manipulation would go away
entirely if the author would use the standard library properly. But I
digress...
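
For illustration, here is a minimal Python sketch of the approach suggested above: load each file into a byte string, compute the Levenshtein distance for every pair, and keep the pairs above a threshold. It is not taken from the linked page. The function names, the normalisation of the distance into a 0-to-1 score, and the default threshold of 0.8 are all assumptions made for this example.

```python
from itertools import combinations
from pathlib import Path


def levenshtein(a: bytes, b: bytes) -> int:
    """Edit distance between two byte strings, keeping only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored rows as short as possible
    previous = list(range(len(b) + 1))
    for i, byte_a in enumerate(a, start=1):
        current = [i]
        for j, byte_b in enumerate(b, start=1):
            cost = 0 if byte_a == byte_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: bytes, b: bytes) -> float:
    """Figure of merit in [0, 1]; 1.0 means identical.

    Dividing by the longer length is an assumption of this sketch,
    not something the thread prescribes.
    """
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def similar_pairs(paths, threshold=0.8):
    """Return (score, path_a, path_b) for pairs at or above threshold, best first."""
    paths = list(paths)
    contents = {p: Path(p).read_bytes() for p in paths}
    scored = [(similarity(contents[p], contents[q]), p, q)
              for p, q in combinations(paths, 2)]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)
```

Calling `similar_pairs(["a.txt", "b.txt", "c.txt"], threshold=0.9)` with hypothetical file names would return the rough cut for manual review. The running time grows with the product of the two file lengths, so this is only practical for fairly small files.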