Let's start with some examples:
-- define some toy files:
$ echo -n "aa" > a.txt
$ echo -n "a" > b.txt

-- find their similarity:
$ ./simm.py a.txt b.txt
file 1: a.txt
file 2: b.txt

unscaled:
  byte similarity: 50 %
  2 byte similarity: 50 %
scaled:
  byte similarity: 100 %
  2 byte similarity: 50 %
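As an aside, here is a minimal Python sketch of where numbers like these could come from. It assumes byte similarity is just simm applied to byte-count vectors, with the scaled version normalizing each count vector to sum to 1 first (the 2 byte version presumably does the same with byte-pair counts; the real simm.py may differ in details):

from collections import Counter

def byte_simm(f_bytes, g_bytes, scaled=False):
    # build byte-count vectors:
    f = Counter(f_bytes)
    g = Counter(g_bytes)
    if scaled:
        # scaled version: normalize each count vector to sum to 1
        wf0, wg0 = sum(f.values()), sum(g.values())
        f = {k: v / wf0 for k, v in f.items()}
        g = {k: v / wg0 for k, v in g.items()}
    # simm(f, g) = (wf + wg - wfg) / (2 * max(wf, wg))
    wf = sum(f.values())
    wg = sum(g.values())
    wfg = sum(abs(f.get(k, 0) - g.get(k, 0)) for k in set(f) | set(g))
    return (wf + wg - wfg) / (2 * max(wf, wg))

print(byte_simm(b"aa", b"a"))               # 0.5  -> 50 %
print(byte_simm(b"aa", b"a", scaled=True))  # 1.0  -> 100 %

Which reproduces the 50% and 100% above.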
Next example:

-- define toy files:
$ echo -n "fish" > a.txt
$ echo -n "fsih" > b.txt

-- find similarity:
$ ./simm.py a.txt b.txt
file 1: a.txt
file 2: b.txt

unscaled:
  byte similarity: 100 %
  2 byte similarity: 25 %
scaled:
  byte similarity: 100 %
  2 byte similarity: 25 %

A slightly more interesting example, and this time we also use fragment simm:
-- define a couple of sentences:
$ echo "Shakespeare produced most of his known work between 1589 and 1613" > a.txt
$ echo "Shakespeare produced most of his work after 1589" > b.txt

-- find their similarity, and fragment simm on spaces:
$ ./simm.py a.txt b.txt ' '
file 1: a.txt
file 2: b.txt
split strings: ' '

unscaled:
  byte similarity: 71.21 %
  2 byte similarity: 65.15 %
  fragment similarity: 63.64 %
scaled:
  byte similarity: 82.19 %
  2 byte similarity: 66.2 %
  fragment similarity: 63.64 %

Now, on to larger files. Recall the simm matrix I made here? Well, let's test a couple of examples:
$ ./simm.py example-files/binary.exe example-files/ebook.txt
file 1: example-files/binary.exe
file 2: example-files/ebook.txt

unscaled:
  byte similarity: 9.25 %
  2 byte similarity: 1.92 %
scaled:
  byte similarity: 12.44 %
  2 byte similarity: 2.66 %

$ ./simm.py example-files/ebook.txt example-files/website.html
file 1: example-files/ebook.txt
file 2: example-files/website.html

unscaled:
  byte similarity: 24.07 %
  2 byte similarity: 17.82 %
scaled:
  byte similarity: 72.7 %
  2 byte similarity: 47.26 %

$ ./simm.py example-files/encrypted.gpg example-files/zipped.zip
file 1: example-files/encrypted.gpg
file 2: example-files/zipped.zip

unscaled:
  byte similarity: 94.06 %
  2 byte similarity: 63.75 %
scaled:
  byte similarity: 97.6 %
  2 byte similarity: 65.2 %

Now, a more interesting fragment simm example. Recall the simm matrices I made here. Well, let's give some examples:
$ ./simm.py webpages-v2/abc-1.html webpages-v2/abc-7.html '<' '>'
file 1: webpages-v2/abc-1.html
file 2: webpages-v2/abc-7.html
split strings: '<' '>'

unscaled:
  byte similarity: 98.28 %
  2 byte similarity: 95.28 %
  fragment similarity: 91.88 %
scaled:
  byte similarity: 98.76 %
  2 byte similarity: 95.65 %
  fragment similarity: 91.88 %

$ ./simm.py webpages-v2/abc-1.html webpages-v2/adelaidenow-1.html '<' '>'
file 1: webpages-v2/abc-1.html
file 2: webpages-v2/adelaidenow-1.html
split strings: '<' '>'

unscaled:
  byte similarity: 11.41 %
  2 byte similarity: 10.63 %
  fragment similarity: 7.4 %
scaled:
  byte similarity: 82.51 %
  2 byte similarity: 61.66 %
  fragment similarity: 28.76 %

$ ./simm.py webpages-v2/abc-1.html webpages-v2/youtube-1.html '<' '>'
file 1: webpages-v2/abc-1.html
file 2: webpages-v2/youtube-1.html
split strings: '<' '>'

unscaled:
  byte similarity: 13.23 %
  2 byte similarity: 10.87 %
  fragment similarity: 10.29 %
scaled:
  byte similarity: 76.47 %
  2 byte similarity: 49.16 %
  fragment similarity: 24.27 %

And finally, some text files:
$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-Tom_Sawyer_74.txt ' '
file 1: text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2: text/ebook-Tom_Sawyer_74.txt
split strings: ' '

unscaled:
  byte similarity: 40.02 %
  2 byte similarity: 38.49 %
  fragment similarity: 28.69 %
scaled:
  byte similarity: 95.47 %
  2 byte similarity: 86.92 %
  fragment similarity: 54.34 %

$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-moby-shakespeare.txt ' '
file 1: text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2: text/ebook-moby-shakespeare.txt
split strings: ' '

unscaled:
  byte similarity: 3.13 %
  2 byte similarity: 3.11 %
  fragment similarity: 2.5 %
scaled:
  byte similarity: 90.8 %
  2 byte similarity: 79.5 %
  fragment similarity: 41.8 %

$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt ' '
file 1: text/WP-Adelaide.txt
file 2: text/WP-Australia.txt
split strings: ' '

unscaled:
  byte similarity: 88.89 %
  2 byte similarity: 81.13 %
  fragment similarity: 46.52 %
scaled:
  byte similarity: 95.47 %
  2 byte similarity: 86.49 %
  fragment similarity: 48.83 %

$ ./simm.py text/WP-Adelaide.txt text/WP-physics.txt ' '
file 1: text/WP-Adelaide.txt
file 2: text/WP-physics.txt
split strings: ' '

unscaled:
  byte similarity: 54.64 %
  2 byte similarity: 50.26 %
  fragment similarity: 23.42 %
scaled:
  byte similarity: 90.3 %
  2 byte similarity: 77.31 %
  fragment similarity: 35.54 %

And one final example. This is what happens if you try the wrong split strings for a given document type, eg here less-than and greater-than on a text file:
$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt '<' '>'
file 1: text/WP-Adelaide.txt
file 2: text/WP-Australia.txt
split strings: '<' '>'

unscaled:
  byte similarity: 88.89 %
  2 byte similarity: 81.13 %
  fragment similarity: 0 %
scaled:
  byte similarity: 95.47 %
  2 byte similarity: 86.49 %
  fragment similarity: 0 %

Note the 0% fragment similarity: since '<' and '>' essentially never occur in these text files, each file ends up as a single giant fragment, and the two fragments don't match.
That's it for this post. BTW, while looking around to see if others have written file-similarity tools, I found simhash. Heh, almost certainly faster than my method!
BTW, the above should work just fine on a wide range of document types; you just need to choose the right split strings. eg, for C we might use:
./simm.py code-1.c code-2.c ' ' '{' '}' '(' ')' ';'

ie, space, curly brackets, round brackets and semicolon.
Indeed, we could also apply it to DNA (though presumably real researchers already have tools for this!): simply specify some DNA split strings and apply it to text files of DNA sequences.
eg:
./simm.py DNA-1.txt DNA-2.txt 'GAAATTCCCA' 'ATACCACT' 'AACCACACAC' 'TTAGGG'

That we can do this should not be a surprise, since my fragment-similarity idea was originally motivated by the idea of gel electrophoresis.
OK. Let's do a worked example!
Let's first choose 'TTAGGG' as our split string.
Let's choose these to be our 2 DNA sequences:
GAAATTCCCA TTAGGG ATACCACT
AACCACACAC TTAGGG GAAATTCCCA
And by construction, we expect 50% fragment similarity, since the two sequences share exactly one of their two fragments (GAAATTCCCA).
Let's take a look:
$ echo -n "GAAATTCCCATTAGGGATACCACT" > a.txt $ echo -n "AACCACACACTTAGGGGAAATTCCCA" > b.txt $ ./simm.py a.txt b.txt 'TTAGGG' file 1: a.txt file 2: b.txt split strings: 'TTAGGG' unscaled: byte similarity: 84.62 % 2 byte similarity: 73.08 % fragment similarity: 50 % scaled: byte similarity: 89.1 % 2 byte similarity: 75.64 % fragment similarity: 50 %Success!
Finally, my code is somewhat slow! A C implementation would be nice.
Update: if your files are line-based, '\n' (ie, new-line) should usually be a sufficient split string.
If you have CSV files, say, then maybe '\n' and ',' (ie, new-line and comma).
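One quoting caveat, with hypothetical file names, assuming a bash-like shell: if simm.py does not interpret escape sequences itself, then '\n' on the command line is just the two characters backslash and n, and you would pass a real newline using $'...' quoting:

./simm.py log-1.txt log-2.txt $'\n'
./simm.py data-1.csv data-2.csv $'\n' ','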
Update: talking of speed, if we use lists of bits, we can speed things up somewhat:
wf = sum-of-bits(f)
wg = sum-of-bits(g)
wfg = sum-of-bits(xor(f,g))
result = (wf + wg - wfg)/(2*max(wf, wg))
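In Python, a minimal sketch of that bit version (using plain integers as the bit vectors, just to show the formula in action):

def bit_simm(f, g):
    # f, g are ints used as bit vectors: bit k set = fragment k present
    wf = bin(f).count("1")       # sum-of-bits(f)
    wg = bin(g).count("1")       # sum-of-bits(g)
    wfg = bin(f ^ g).count("1")  # sum-of-bits(xor(f, g))
    # for bit vectors this reduces to overlap / max(wf, wg)
    return (wf + wg - wfg) / (2 * max(wf, wg))

# fragments {0, 1, 2} vs {1, 2, 3}: 2 shared out of 3
print(bit_simm(0b0111, 0b1110))  # 0.666... -> 66.67 %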
Update: OK, I've given this "speed up" some thought, and decided it will not be as good as the full simm. For example, consider C code: with the bit method, a file containing 1 "int" is "identical" to another containing, say, 200 of them, since the bits only record presence, not counts. Which is not all that useful!