I decided to write a script that reports the similarity of two files, using
my simm. It has three mappings from file to superposition/list (byte, 2-byte and fragment simm), and gives both scaled and unscaled simm results. Here is
the code, and I decided to make it open source!
Some examples:
-- define some toy files:
$ echo -n "aa" > a.txt
$ echo -n "a" > b.txt
-- find their similarity:
$ ./simm.py a.txt b.txt
file 1: a.txt
file 2: b.txt
unscaled:
byte similarity: 50 %
2 byte similarity: 50 %
scaled:
byte similarity: 100 %
2 byte similarity: 50 %
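For anyone who wants to see roughly what is going on, here is a minimal sketch of the byte mapping. It is an assumption inferred from the byte results in this post, not a copy of simm.py: simm is taken to be the sum of element-wise minima divided by the larger total, and "scaled" is taken to mean each list is normalised to sum to 1 first. The names `simm` and `byte_counts` are made up for illustration, and the 2-byte mapping is not covered.

```python
from collections import Counter


def simm(f, g, scaled=False):
    """Sum of element-wise minima, divided by the larger of the two totals.

    f and g map items to counts. With scaled=True each mapping is first
    normalised to sum to 1 (an assumption about what "scaled" means).
    """
    total_f, total_g = sum(f.values()), sum(g.values())
    if total_f == 0 or total_g == 0:
        return 0.0
    if scaled:
        f = {k: v / total_f for k, v in f.items()}
        g = {k: v / total_g for k, v in g.items()}
        total_f = total_g = 1.0
    overlap = sum(min(v, g.get(k, 0)) for k, v in f.items())
    return overlap / max(total_f, total_g)


def byte_counts(data):
    """Byte mapping: count how often each byte value occurs."""
    return Counter(data)  # iterating over bytes yields integers


a, b = byte_counts(b"aa"), byte_counts(b"a")
print(f"unscaled byte similarity: {100 * simm(a, b):.2f} %")
print(f"scaled byte similarity: {100 * simm(a, b, scaled=True):.2f} %")
```

Under these assumptions the toy files "aa" and "a" come out at 50 % unscaled and 100 % scaled, matching the byte lines above.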
Next example:
-- define toy files:
$ echo -n "fish" > a.txt
$ echo -n "fsih" > b.txt
-- find similarity:
$ ./simm.py a.txt b.txt
file 1: a.txt
file 2: b.txt
unscaled:
byte similarity: 100 %
2 byte similarity: 25 %
scaled:
byte similarity: 100 %
2 byte similarity: 25 %
A slightly more interesting example, and this time we also use fragment simm:
-- define a couple of sentences:
$ echo "Shakespeare produced most of his known work between 1589 and 1613" > a.txt
$ echo "Shakespeare produced most of his work after 1589" > b.txt
-- find their similarity, and fragment simm on spaces:
$ ./simm.py a.txt b.txt ' '
file 1: a.txt
file 2: b.txt
split strings: ' '
unscaled:
byte similarity: 71.21 %
2 byte similarity: 65.15 %
fragment similarity: 63.64 %
scaled:
byte similarity: 82.19 %
2 byte similarity: 66.2 %
fragment similarity: 63.64 %
Now, on larger files. Recall the simm matrix I
made here? Well, let's test a couple of examples:
$ ./simm.py example-files/binary.exe example-files/ebook.txt
file 1: example-files/binary.exe
file 2: example-files/ebook.txt
unscaled:
byte similarity: 9.25 %
2 byte similarity: 1.92 %
scaled:
byte similarity: 12.44 %
2 byte similarity: 2.66 %
$ ./simm.py example-files/ebook.txt example-files/website.html
file 1: example-files/ebook.txt
file 2: example-files/website.html
unscaled:
byte similarity: 24.07 %
2 byte similarity: 17.82 %
scaled:
byte similarity: 72.7 %
2 byte similarity: 47.26 %
$ ./simm.py example-files/encrypted.gpg example-files/zipped.zip
file 1: example-files/encrypted.gpg
file 2: example-files/zipped.zip
unscaled:
byte similarity: 94.06 %
2 byte similarity: 63.75 %
scaled:
byte similarity: 97.6 %
2 byte similarity: 65.2 %
Now, a more interesting fragment simm example. Recall the simm matrices I
made here. Well, let's give some examples:
$ ./simm.py webpages-v2/abc-1.html webpages-v2/abc-7.html '<' '>'
file 1: webpages-v2/abc-1.html
file 2: webpages-v2/abc-7.html
split strings: '<' '>'
unscaled:
byte similarity: 98.28 %
2 byte similarity: 95.28 %
fragment similarity: 91.88 %
scaled:
byte similarity: 98.76 %
2 byte similarity: 95.65 %
fragment similarity: 91.88 %
$ ./simm.py webpages-v2/abc-1.html webpages-v2/adelaidenow-1.html '<' '>'
file 1: webpages-v2/abc-1.html
file 2: webpages-v2/adelaidenow-1.html
split strings: '<' '>'
unscaled:
byte similarity: 11.41 %
2 byte similarity: 10.63 %
fragment similarity: 7.4 %
scaled:
byte similarity: 82.51 %
2 byte similarity: 61.66 %
fragment similarity: 28.76 %
$ ./simm.py webpages-v2/abc-1.html webpages-v2/youtube-1.html '<' '>'
file 1: webpages-v2/abc-1.html
file 2: webpages-v2/youtube-1.html
split strings: '<' '>'
unscaled:
byte similarity: 13.23 %
2 byte similarity: 10.87 %
fragment similarity: 10.29 %
scaled:
byte similarity: 76.47 %
2 byte similarity: 49.16 %
fragment similarity: 24.27 %
And finally, some text files:
$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-Tom_Sawyer_74.txt ' '
file 1: text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2: text/ebook-Tom_Sawyer_74.txt
split strings: ' '
unscaled:
byte similarity: 40.02 %
2 byte similarity: 38.49 %
fragment similarity: 28.69 %
scaled:
byte similarity: 95.47 %
2 byte similarity: 86.92 %
fragment similarity: 54.34 %
$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-moby-shakespeare.txt ' '
file 1: text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2: text/ebook-moby-shakespeare.txt
split strings: ' '
unscaled:
byte similarity: 3.13 %
2 byte similarity: 3.11 %
fragment similarity: 2.5 %
scaled:
byte similarity: 90.8 %
2 byte similarity: 79.5 %
fragment similarity: 41.8 %
$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt ' '
file 1: text/WP-Adelaide.txt
file 2: text/WP-Australia.txt
split strings: ' '
unscaled:
byte similarity: 88.89 %
2 byte similarity: 81.13 %
fragment similarity: 46.52 %
scaled:
byte similarity: 95.47 %
2 byte similarity: 86.49 %
fragment similarity: 48.83 %
$ ./simm.py text/WP-Adelaide.txt text/WP-physics.txt ' '
file 1: text/WP-Adelaide.txt
file 2: text/WP-physics.txt
split strings: ' '
unscaled:
byte similarity: 54.64 %
2 byte similarity: 50.26 %
fragment similarity: 23.42 %
scaled:
byte similarity: 90.3 %
2 byte similarity: 77.31 %
fragment similarity: 35.54 %
And one final example. This is what happens if you use the wrong split strings for a given document type, eg less-than and greater-than on a text file:
$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt '<' '>'
file 1: text/WP-Adelaide.txt
file 2: text/WP-Australia.txt
split strings: '<' '>'
unscaled:
byte similarity: 88.89 %
2 byte similarity: 81.13 %
fragment similarity: 0 %
scaled:
byte similarity: 95.47 %
2 byte similarity: 86.49 %
fragment similarity: 0 %
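Note the 0% fragment similarity. Presumably the text files contain few or no '<' or '>' characters, so each file is essentially one big fragment, and the two never match.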
That's it for this post. BTW, in looking around to see if others have written file similarity tools I found
simhash. Heh, almost certainly faster than my method!
BTW, the above should work just fine on a large range of document types. You just need to choose the right split strings. eg, for C we might use:
./simm.py code-1.c code-2.c ' ' '{' '}' '(' ')' ';'
ie, space, curly-brackets, normal brackets and semi-colon.
Indeed, we could also apply it to DNA (though presumably real researchers already have tools to do this!). So, we can specify split-strings, and apply it to text files of DNA sequences.
eg:
./simm.py DNA-1.txt DNA-2.txt 'GAAATTCCCA' 'ATACCACT' 'AACCACACAC' 'TTAGGG'
That we can do this should not be a surprise, since my fragment similarity idea was originally motivated by the idea of
gel electrophoresis.
OK. Let's do a worked example!
Let's first choose 'TTAGGG' as our split string.
Let's choose these to be our 2 DNA sequences:
GAAATTCCCA TTAGGG ATACCACT
AACCACACAC TTAGGG GAAATTCCCA
By construction, we expect 50% fragment similarity: each sequence splits into two fragments, and they share exactly one (GAAATTCCCA).
Let's take a look:
$ echo -n "GAAATTCCCATTAGGGATACCACT" > a.txt
$ echo -n "AACCACACACTTAGGGGAAATTCCCA" > b.txt
$ ./simm.py a.txt b.txt 'TTAGGG'
file 1: a.txt
file 2: b.txt
split strings: 'TTAGGG'
unscaled:
byte similarity: 84.62 %
2 byte similarity: 73.08 %
fragment similarity: 50 %
scaled:
byte similarity: 89.1 %
2 byte similarity: 75.64 %
fragment similarity: 50 %
Success!
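The worked example can be sketched in a few lines of Python. Again, this is a guess at the idea rather than the actual simm.py: the helper names `fragments` and `simm` are hypothetical, splitting is done with a regular expression built from the split strings, and empty fragments are dropped (an assumption).

```python
import re
from collections import Counter


def fragments(text, split_strings):
    """Fragment mapping: cut text at every split string and count the pieces."""
    if not split_strings:
        return Counter([text])
    pattern = "|".join(re.escape(s) for s in split_strings)
    # Dropping empty pieces is an assumption, not something the post specifies.
    return Counter(piece for piece in re.split(pattern, text) if piece)


def simm(f, g, scaled=False):
    """Sum of element-wise minima over the larger total (assumed definition)."""
    total_f, total_g = sum(f.values()), sum(g.values())
    if total_f == 0 or total_g == 0:
        return 0.0
    if scaled:
        f = {k: v / total_f for k, v in f.items()}
        g = {k: v / total_g for k, v in g.items()}
        total_f = total_g = 1.0
    overlap = sum(min(v, g.get(k, 0)) for k, v in f.items())
    return overlap / max(total_f, total_g)


dna_1 = fragments("GAAATTCCCATTAGGGATACCACT", ["TTAGGG"])
dna_2 = fragments("AACCACACACTTAGGGGAAATTCCCA", ["TTAGGG"])
print(sorted(dna_1))  # ['ATACCACT', 'GAAATTCCCA']
print(sorted(dna_2))  # ['AACCACACAC', 'GAAATTCCCA']
print(f"unscaled fragment similarity: {100 * simm(dna_1, dna_2):.2f} %")
print(f"scaled fragment similarity: {100 * simm(dna_1, dna_2, scaled=True):.2f} %")
```

One fragment out of two is shared, so both the unscaled and scaled results are 50 %, in line with the output above.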
Finally, my code is somewhat slow! A C implementation would be nice.
Update: if your files are line based, '\n' (ie, new-line) should usually be a sufficient split string.
If you have csv files say, then maybe '\n' and ',' (ie, new-line and comma).
Update: talking of speeding it up, if we use lists of bits, we can speed it up somewhat.
wf = sum-of-bits(f)
wg = sum-of-bits(g)
wfg = sum-of-bits(xor(f,g))
result = (wf + wg - wfg)/(2*max(wf,wg))
Update: OK. I have given this "speed up" some thought, and decided it will not be as good as the full simm. For example, consider C code. If we use the bit method, a file containing 1 "int" is "identical" to another containing, say, 200 "int". Which is not all that useful!
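To make the difference concrete, here is a small sketch comparing the two. It assumes the full simm is the sum of element-wise minima over the larger total, and that the bit method only records whether a fragment is present, packed into a Python integer. The names `full_simm`, `bit_simm` and `popcount` are made up for this illustration.

```python
from collections import Counter


def full_simm(f, g):
    """Unscaled simm on counts: sum of minima over the larger total (assumed)."""
    total = max(sum(f.values()), sum(g.values()))
    if total == 0:
        return 0.0
    return sum(min(v, g.get(k, 0)) for k, v in f.items()) / total


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def bit_simm(f, g):
    """Bit method: (wf + wg - wfg) / (2 * max(wf, wg)), presence only."""
    # Give every distinct fragment its own bit position.
    index = {item: i for i, item in enumerate(set(f) | set(g))}

    def to_bits(counts):
        return sum(1 << index[item] for item, n in counts.items() if n > 0)

    bf, bg = to_bits(f), to_bits(g)
    wf, wg, wfg = popcount(bf), popcount(bg), popcount(bf ^ bg)
    if max(wf, wg) == 0:
        return 0.0
    return (wf + wg - wfg) / (2 * max(wf, wg))


one_int = Counter({"int": 1})
many_ints = Counter({"int": 200})
print(bit_simm(one_int, many_ints))   # 1.0, the counts are thrown away
print(full_simm(one_int, many_ints))  # 0.005
```

The term (wf + wg - wfg)/2 is just the number of bits set in both lists, so the bit method is the same min/max idea with every count clipped to 1, which is exactly why it cannot tell 1 "int" from 200.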