The Semantic DB Project: announcing: command line file similarity tool

Decided to write a script that spits out the similarity of two files, using my simm. It has three mappings from file to superposition/list (byte, 2-byte and fragment simm) and gives both scaled and unscaled simm results. Here is the code, and I decided to make it open source!

I guess some examples:

-- define some toy files:
$ echo -n "aa" > a.txt
$ echo -n "a" > b.txt

-- find their similarity:
$ ./simm.py a.txt b.txt
file 1:                a.txt
file 2:                b.txt

unscaled:
  byte similarity:     50 %
  2 byte similarity:   50 %

scaled:
  byte similarity:     100 %
  2 byte similarity:   50 %

Next example:

-- define toy files:
$ echo -n "fish" > a.txt
$ echo -n "fsih" > b.txt

-- find similarity:
$ ./simm.py a.txt b.txt
file 1:                a.txt
file 2:                b.txt

unscaled:
  byte similarity:     100 %
  2 byte similarity:   25 %

scaled:
  byte similarity:     100 %
  2 byte similarity:   25 %

A slightly more interesting example, and this time we also use fragment simm:

-- define a couple of sentences:
$ echo "Shakespeare produced most of his known work between 1589 and 1613" > a.txt
$ echo "Shakespeare produced most of his work after 1589" > b.txt

-- find their similarity, and fragment simm on spaces:
$ ./simm.py a.txt b.txt ' '
file 1:                a.txt
file 2:                b.txt

split strings:         ' '

unscaled:
  byte similarity:     71.21 %
  2 byte similarity:   65.15 %
  fragment similarity: 63.64 %

scaled:
  byte similarity:     82.19 %
  2 byte similarity:   66.2 %
  fragment similarity: 63.64 %

Now, on larger files. Recall the simm matrix I made here? Well, let's test a couple of examples:

$ ./simm.py example-files/binary.exe example-files/ebook.txt
file 1:                example-files/binary.exe
file 2:                example-files/ebook.txt

unscaled:
  byte similarity:     9.25 %
  2 byte similarity:   1.92 %

scaled:
  byte similarity:     12.44 %
  2 byte similarity:   2.66 %

$ ./simm.py example-files/ebook.txt example-files/website.html
file 1:                example-files/ebook.txt
file 2:                example-files/website.html

unscaled:
  byte similarity:     24.07 %
  2 byte similarity:   17.82 %

scaled:
  byte similarity:     72.7 %
  2 byte similarity:   47.26 %

$ ./simm.py example-files/encrypted.gpg example-files/zipped.zip
file 1:                example-files/encrypted.gpg
file 2:                example-files/zipped.zip

unscaled:
  byte similarity:     94.06 %
  2 byte similarity:   63.75 %

scaled:
  byte similarity:     97.6 %
  2 byte similarity:   65.2 %

Now, a more interesting fragment simm example. Recall the simm matrices I made here. Well, let's give some examples:

$ ./simm.py webpages-v2/abc-1.html webpages-v2/abc-7.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/abc-7.html

split strings:         '<' '>'

unscaled:
  byte similarity:     98.28 %
  2 byte similarity:   95.28 %
  fragment similarity: 91.88 %

scaled:
  byte similarity:     98.76 %
  2 byte similarity:   95.65 %
  fragment similarity: 91.88 %

$ ./simm.py webpages-v2/abc-1.html webpages-v2/adelaidenow-1.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/adelaidenow-1.html

split strings:         '<' '>'

unscaled:
  byte similarity:     11.41 %
  2 byte similarity:   10.63 %
  fragment similarity: 7.4 %

scaled:
  byte similarity:     82.51 %
  2 byte similarity:   61.66 %
  fragment similarity: 28.76 %

$ ./simm.py webpages-v2/abc-1.html webpages-v2/youtube-1.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/youtube-1.html

split strings:         '<' '>'

unscaled:
  byte similarity:     13.23 %
  2 byte similarity:   10.87 %
  fragment similarity: 10.29 %

scaled:
  byte similarity:     76.47 %
  2 byte similarity:   49.16 %
  fragment similarity: 24.27 %

And finally, some text files:

$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-Tom_Sawyer_74.txt ' '
file 1:                text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2:                text/ebook-Tom_Sawyer_74.txt

split strings:         ' '

unscaled:
  byte similarity:     40.02 %
  2 byte similarity:   38.49 %
  fragment similarity: 28.69 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.92 %
  fragment similarity: 54.34 %

$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-moby-shakespeare.txt ' '
file 1:                text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2:                text/ebook-moby-shakespeare.txt

split strings:         ' '

unscaled:
  byte similarity:     3.13 %
  2 byte similarity:   3.11 %
  fragment similarity: 2.5 %

scaled:
  byte similarity:     90.8 %
  2 byte similarity:   79.5 %
  fragment similarity: 41.8 %

$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt ' '
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-Australia.txt

split strings:         ' '

unscaled:
  byte similarity:     88.89 %
  2 byte similarity:   81.13 %
  fragment similarity: 46.52 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.49 %
  fragment similarity: 48.83 %

$ ./simm.py text/WP-Adelaide.txt text/WP-physics.txt ' '
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-physics.txt

split strings:         ' '

unscaled:
  byte similarity:     54.64 %
  2 byte similarity:   50.26 %
  fragment similarity: 23.42 %

scaled:
  byte similarity:     90.3 %
  2 byte similarity:   77.31 %
  fragment similarity: 35.54 %

And one final example. This is what happens if you try the wrong split strings for a given document type. eg here less-than and greater-than on a text file:

$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt '<' '>'
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-Australia.txt

split strings:         '<' '>'

unscaled:
  byte similarity:     88.89 %
  2 byte similarity:   81.13 %
  fragment similarity: 0 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.49 %
  fragment similarity: 0 %

Note the 0% fragment similarity.

That's it for this post. BTW, in looking around to see if others have written file similarity tools I found simhash. Heh, almost certainly faster than my method!

BTW, the above should work just fine on a large range of document types. You just need to choose the right split strings. eg, for C we might use:

./simm.py code-1.c code-2.c ' ' '{' '}' '(' ')' ';'

ie, space, curly-brackets, normal brackets and semi-colon.

Indeed, we could also apply it to DNA (though presumably real researchers already have tools to do this!). So, we can specify split-strings, and apply it to text files of DNA sequences.
eg:

./simm.py DNA-1.txt DNA-2.txt 'GAAATTCCCA' 'ATACCACT' 'AACCACACAC' 'TTAGGG'

That we can do this should not be a surprise, since my fragment similarity idea was originally motivated by the idea of gel electrophoresis.

OK. Let's do a worked example!
Let's first choose 'TTAGGG' as our split string.
Let's choose these to be our 2 DNA sequences:
GAAATTCCCA TTAGGG ATACCACT
AACCACACAC TTAGGG GAAATTCCCA
And by construction, we expect 50% similarity.
Let's take a look:

$ echo -n "GAAATTCCCATTAGGGATACCACT" > a.txt
$ echo -n "AACCACACACTTAGGGGAAATTCCCA" > b.txt

$ ./simm.py a.txt b.txt 'TTAGGG'
file 1:                a.txt
file 2:                b.txt

split strings:         'TTAGGG'

unscaled:
  byte similarity:     84.62 %
  2 byte similarity:   73.08 %
  fragment similarity: 50 %

scaled:
  byte similarity:     89.1 %
  2 byte similarity:   75.64 %
  fragment similarity: 50 %

Success!

Finally, my code is somewhat slow! A C implementation would be nice.

Update: if your files are line based, '\n' (ie, new-line) should usually be a sufficient split string.
If you have csv files say, then maybe '\n' and ',' (ie, new-line and comma).

Update: talking of speeding it up, if we use lists of bits, we can speed it up somewhat.
wf = sum-of-bits(f)
wg = sum-of-bits(g)
wfg = sum-of-bits(xor(f,g))
result = (wf + wg - wfg)/2*max(wf,wg)

Update: OK. I given this "speed up" some thought, and decided it will not be as good as the full simm. A for example is, consider C code. If we use the bit method a file containing 1 "int" is "identical" to another containing say 200 "int". Which is not all that useful!

The Semantic DB Project

Thursday, 7 May 2015

announcing: command line file similarity tool

No comments:

Post a Comment