The Semantic DB Project: mapping sw files to frequency lists

So the idea is: we have a large and rapidly growing collection of sw files, so maybe we can use find-topic[op] to find the one we need. I guess it is a kind of sorted, fuzzy grep. All we need to do is map each sw file to a frequency list, and we are done. Here is the code for that, and here is the resulting sw file. Note that for speed we currently only process sw files that are under 10 KB in size. Maybe we will relax that at some point.
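To make the idea concrete, here is a minimal Python sketch of the mapping and the lookup. It is not the project's code: the regex for pulling kets out of a file, the function names, and the toy file contents are all assumptions made for illustration. The scoring is plain relative frequency rescaled to sum to 100, a simplified stand-in for whatever find-topic[op] really computes, so the coefficients will not match the console output below.

```python
import re
from collections import Counter

# Assumption: a ket is anything of the form |some label>.
KET_PATTERN = re.compile(r"\|([^<|>]+)>")

def ket_frequency_list(sw_text):
    """Count how often each ket label appears in the text of one sw file."""
    return Counter(KET_PATTERN.findall(sw_text))

def find_topic(ket, frequency_lists):
    """Rank files by the relative frequency of `ket`, rescaled to sum to 100.

    Simplified stand-in for find-topic[op], not the real scoring.
    """
    scores = {}
    for name, freq in frequency_lists.items():
        total = sum(freq.values())
        if total and freq[ket]:
            scores[name] = freq[ket] / total
    norm = sum(scores.values())
    # Highest score first; ties broken by file name.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, round(100 * s / norm, 3)) for name, s in ranked]

# Toy stand-ins for real sw files (hypothetical content).
files = {
    "frog.sw": "is-green |animal: frog> => |yes>\nlikes |animal: frog> => |food: flies>",
    "pets.sw": "pets |person: Fred> => |animal: frog> + |animal: dog> + |animal: cat>",
}
frequency_lists = {name: ket_frequency_list(text) for name, text in files.items()}
print(find_topic("animal: frog", frequency_lists))
```

A file that mentions the ket more often, relative to its size, ranks higher. A ket that appears in no file gives an empty list, which mirrors the "get it exactly right or get nothing" behaviour of the real thing.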

Now in the console:

-- load the data:
sa: load sw-files-to-frequency-lists.sw

-- save some typing:
sa: h |*> #=> find-topic[kets] |_self>

-- find sw files that mention frogs:
sa: h |animal: frog>
75.0|sw file: frog.sw> + 25|sw file: spell-active-buffer.sw>

-- find sw files that mention George:
sa: h |george>
66.667|sw file: similar-chars-example.sw> + 33.333|sw file: matrix-as-network.sw>

sa: h |person: George>
27.273|sw file: blog-george.sw> + 27.273|sw file: george.sw> + 27.273|sw file: new-george.sw> + 18.182|sw file: recall-general-rules-example.sw>

-- find sw files that mention |bot: Emma>:
sa: h |bot: Emma>
50|sw file: bot-emma.sw> + 50|sw file: bots.sw>

-- find sw files that use |op: fib>:
sa: h |op: fib>
16.667|sw file: active-fib-play.sw> + 16.667|sw file: fib-play.sw> + 16.667|sw file: fibonacci.sw> + 16.667|sw file: memoizing-fibonacci.sw> + 16.667|sw file: next-fib-play.sw> + 16.667|sw file: small-fib.sw>

-- find sw files that mention |food: Belgian Waffles>:
sa: h |food: Belgian Waffles>
31.746|sw file: clean-breakfast-menu.sw> + 25.397|sw file: breakfast-menu.sw> + 23.81|sw file: breaky-presidents.sw> + 19.048|sw file: next-breakfast-menu.sw>

-- find sw files that mention |word: btw>:
sa: h |word: btw>
100|sw file: internet-acronyms.sw>

-- find sw files that mention |document: www proposal>:
sa: h |document: www proposal>
100|sw file: www-proposal.sw>

So I guess it works as expected. I am not sure it is much of an improvement over grep. Yes, we do get coefficients showing which sw file is more relevant, but still.

I guess also if we map source code to frequency lists, we could use this to search for the right code file. But like I just said, not sure it is that much better than grep. Heh, maybe a little worse in some cases.

That is it for this post. Another find-topic[op] example in the next post.

Update: so there is room for improvement here. One idea is to borrow Google's "did you mean" feature. Currently you have to get the ket exactly right (including capitalization), or else you get nothing. With some work we should be able to use a fuzzier ket search. The key component of that would of course be our friend similar[op].
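A rough sketch of what a "did you mean" step could look like, in Python. To be clear, this does not use similar[op]: it borrows difflib.get_close_matches from the Python standard library as a stand-in string-similarity measure. The function name, the cutoff value, and the list of known kets are all hypothetical.

```python
import difflib

def did_you_mean(query, known_kets, n=3, cutoff=0.6):
    """Suggest known ket labels that resemble `query`, ignoring case."""
    # Group the known kets by lowercase form so matching is case-insensitive.
    lowered = {}
    for ket in known_kets:
        lowered.setdefault(ket.lower(), []).append(ket)
    # difflib stands in for similar[op] here; best matches come first.
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=n, cutoff=cutoff
    )
    return [ket for m in matches for ket in lowered[m]]

# Hypothetical list of kets known to the frequency lists.
known_kets = ["person: George", "george", "animal: frog", "bot: Emma"]
print(did_you_mean("person: george", known_kets))
print(did_you_mean("Animal: Frog", known_kets))
```

The suggestions could then be fed back into the exact-match lookup, so a query with the wrong capitalization still finds its sw files.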

Friday, 27 March 2015
