Saturday 30 May 2015

announcing: command line guess file name tool.

OK. Just a quick one. This is a command line tool that, given a string, spits out the file names that most closely match it (using unscaled simm). Here is the code, and once again, I have made it open source.
$ ./guess.py

Usage: ./guess.py string [directory]

If the directory is not given, then use the current directory.
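
For a feel of how a tool like this could work, here is a minimal Python sketch. It is not the actual guess.py. It assumes file names are compared as bags of character 1-, 2- and 3-grams, and that unscaled simm is the sum of the element-wise minimums divided by the larger of the two totals. The function names are made up for illustration, and the scores it prints will not necessarily match the percentages shown in this post.

```python
import os
from collections import Counter


def ngram_counts(text, sizes=(1, 2, 3)):
    """Count the character n-grams of the lower-cased text (assumed features)."""
    text = text.lower()
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def unscaled_simm(f, g):
    """Assumed definition: sum of min(f[k], g[k]) over max(sum(f), sum(g))."""
    largest = max(sum(f.values()), sum(g.values()))
    if largest == 0:
        return 0.0
    shared = sum(min(count, g[key]) for key, count in f.items())
    return shared / largest


def guess(string, names):
    """Return (score, name) pairs, best match first; ties sorted by name."""
    pattern = ngram_counts(string)
    scored = [(unscaled_simm(pattern, ngram_counts(name)), name) for name in names]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


def guess_directory(string, directory="."):
    """Rank the entries of a directory against the string."""
    return guess(string, os.listdir(directory))


# Small demonstration on a fixed list of names instead of a real directory.
for score, name in guess("wiki", ["1000-wiki.xml", "wiki.xml", "shares.sw"]):
    print(f"{100 * score:.2f} %  {name}")
```

Lower-casing both sides is what makes the match case insensitive, and because the features are an unordered bag of n-grams, no regular expression is needed to describe the pattern.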

And a quick example:
$ ./guess.py wiki work-on-wikipedia

string to match:      wiki
directory to search:  work-on-wikipedia
====================================

50 %     wiki.xml
23.08 %  1000-wiki.xml
23.08 %  5000-wiki.xml
22.22 %  wikipedia-links.sw
13.64 %  play_with_wikipedia.py
12.5 %   trial-wikipedia-links.sw
12 %     fragment_wikipedia_xml.py
10.34 %  fast-write-wikipedia-links.sw
8.82 %   play_with_wikipedia__fast_write.py
So, that should be clear enough. Whether anything like this is already out there, I don't know, but I presume so. E.g., the edit-distance algorithm used in spell-checkers could be used in a similar way to my simm.
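
As a rough illustration of the edit-distance alternative, here is a small Python sketch that ranks names by Levenshtein distance. The normalisation to a 0..1 score (one minus distance over the longer length) and the function names are assumptions chosen for illustration, not something taken from guess.py or from any particular spell-checker.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Map the distance onto 0..1 (assumed normalisation), ignoring case."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a.lower(), b.lower()) / longest


def rank_by_edit_distance(string, names):
    """Return (score, name) pairs, most similar first; ties sorted by name."""
    scored = [(edit_similarity(string, name), name) for name in names]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


for score, name in rank_by_edit_distance("wiki", ["1000-wiki.xml", "wiki.xml"]):
    print(f"{100 * score:.2f} %  {name}")
```

Unlike a bag of n-grams, edit distance depends on character order, so swapping the words in a query would be expected to change the ranking more.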

Heh. Perhaps I gave a bad example! In the wiki example above, a simple:
$ ls -1 work-on-wikipedia/ | grep wiki
1000-wiki.xml
5000-wiki.xml
fast-write-wikipedia-links.sw
fragment_wikipedia_xml.py
play_with_wikipedia.py
play_with_wikipedia__fast_write.py
trial-wikipedia-links.sw
wiki.xml
wikipedia-links.sw
would have given similar results!

So here is another example of guess:
$ ./guess.py "sam fred" sw-examples/ | head

string to match:      sam fred
directory to search:  sw-examples
====================================

30.77 %  small-fred.sw
26.67 %  Freds-family.sw
26.32 %  fred-sam-friends.sw
22.22 %  shares.sw
16.67 %  saved-commands.txt

$ ./guess.py "fred sam" sw-examples/ | head

string to match:      fred sam
directory to search:  sw-examples
====================================

33.33 %  Freds-family.sw
31.58 %  fred-sam-friends.sw
25 %     frog.sw
23.08 %  small-fred.sw
20 %     friends.sw
So some things to note are:
1) "fred sam" and "sam fred" give essentially the same answer
2) I made it case insensitive
3) we don't need to use any regexp, which we would have to do if using grep in a similar manner.
4) we could possibly use this to help find the exact ket when we only partly know its name, and the sw file is too large to search manually.

Anyway, should be useful here and there.
