sa: h1 |japan russia china>
40.697|WP: rivers> + 22.252|WP: Australia> + 19.422|WP: country list> + 12.804|WP: particle physics> + 4.825|WP: Adelaide>

But I also found that if I took the intersection of the separate results for Japan, Russia and China, I got a much better result:

sa: intn(h1 |japan>, h1 |russia>, h1 |china>)
12.308|WP: country list>

And hence some new python:
-- in the ket class (currently we don't have a superposition version of this, and probably don't need one!):
  def intn_find_topic(self,context,op):
    words = self.label.lower().split()          # we made it case insensitive.
    if len(words) == 0:
      return ket("",0)
    results = [context.map_to_topic(ket(x),op) for x in words]
    if len(results) == 0:                       # this should never be true!
      return ket("",0)
    r = results[0]
    for sp in results:
      r = intersection(r,sp)
    return r.normalize(100).coeff_sort()

Now, if the ket is a single word, this gives exactly the same answer as find-topic[words-1]. But if the ket is several words separated by spaces, it usually gives a much better result.
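To make the idea concrete, here is a minimal, self-contained sketch in plain Python, using dicts of coefficients as stand-ins for superpositions. The names word_topics, intersection, normalize and intn_find_topic here are illustrative only, not the actual ket/superposition classes or context.map_to_topic() from the project:

```python
def intersection(a, b):
    # pointwise minimum of coefficients, keeping only kets present in both
    return {k: min(a[k], b[k]) for k in a if k in b}

def normalize(sp, t=100):
    # rescale coefficients so they sum to t (mirrors .normalize(100))
    total = sum(sp.values())
    if total == 0:
        return sp
    return {k: v * t / total for k, v in sp.items()}

def intn_find_topic(phrase, word_topics):
    # look up a per-word topic distribution, then strictly intersect them all
    words = phrase.lower().split()
    results = [word_topics.get(w, {}) for w in words]
    if not results:
        return {}
    r = results[0]
    for sp in results[1:]:
        r = intersection(r, sp)
    return normalize(r)

# toy per-word topic distributions (made-up numbers, not real data):
word_topics = {
    "japan":  {"WP: country list": 10, "WP: rivers": 3},
    "russia": {"WP: country list": 8,  "WP: particle physics": 2},
    "china":  {"WP: country list": 12, "WP: Australia": 1},
}
print(intn_find_topic("japan russia china", word_topics))
```

Because every word's distribution must contain a page for it to survive the intersection, only "WP: country list" remains in this toy example, which is exactly the sharpening effect described above.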
So, let's try the troublesome examples from last time:
-- load the data:
sa: load improved-WP-word-frequencies.sw

-- save some typing:
-- h1 is the old method, F1 is the new method.
sa: h1 |*> #=> find-topic[words-1] split |_self>
sa: F1 |*> #=> intn-find-topic[words-1] |_self>

-- ask about the Nile:
sa: h1 |nile river>
76.811|WP: rivers> + 13.788|WP: Adelaide> + 9.401|WP: Australia>

sa: F1 |nile river>
100|WP: rivers>

-- ask about George Bush:
sa: h1 |george bush>
67.705|WP: US presidents> + 22.363|WP: Australia> + 9.932|WP: Adelaide>

sa: F1 |george bush>
77.465|WP: US presidents> + 22.535|WP: Australia>

-- ask about Japan, Russia and China:
sa: h1 |japan russia china>
40.697|WP: rivers> + 22.252|WP: Australia> + 19.422|WP: country list> + 12.804|WP: particle physics> + 4.825|WP: Adelaide>

sa: F1 |japan russia china>
100|WP: country list>

So it really is a good improvement on the standard find-topic[op]! At some stage I should probably scale it up to even more of Wikipedia.
Recall the discussion (from long ago) about the difference between intersection and soft intersection? Maybe I should find the link! Anyway, h1 can be considered a soft-intersection approach, and F1 the strict-intersection approach (which is really a better fit for a search-engine-style algorithm anyway: you generally don't want pages that ignore one or more of your query terms).
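In dict-as-superposition terms, the contrast between the two behaviours looks something like this. This is a hedged sketch with made-up coefficients; soft_union and strict_intersection are illustrative names, not sw-language operators:

```python
def strict_intersection(a, b):
    # F1-style: a page survives only if every query word supports it
    return {k: min(a[k], b[k]) for k in a if k in b}

def soft_union(a, b):
    # h1-style: sum coefficients over all kets, so a page matching
    # only some of the query words still scores
    return {k: a.get(k, 0) + b.get(k, 0) for k in set(a) | set(b)}

# toy topic distributions for two query words (made-up numbers):
nile  = {"WP: rivers": 70, "WP: Adelaide": 20}
river = {"WP: rivers": 60, "WP: Australia": 15}

print(soft_union(nile, river))           # Adelaide and Australia still leak in
print(strict_intersection(nile, river))  # only the shared "rivers" page survives
```

This is the sense in which strict intersection matches search-engine expectations: pages missing a query term are dropped outright rather than merely down-weighted.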
Anyway, my "ultra simple search algo" now looks like this:
|answer> => table[page,coeff,url] select[1,10] coeff-sort weight-pages intn-find-topic[words-1] |just some words>

I guess the final comment is that we can still perhaps tweak this further. The proposed algo does not take into account the relative closeness of words in a document. For example, if one word is at the top of the page and the other is at the bottom, you would presumably want that to score lower than if the two words were near each other. How to do that cleanly, I don't know.
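One possible (hypothetical) way to fold in word closeness, not something from the post itself: score a document by the smallest window of word positions that covers all query terms, so documents where the terms sit near each other get a tighter (better) window. A minimal sketch, assuming we already have a sorted position list per query word:

```python
import heapq

def min_cover_window(positions):
    # positions: one sorted list of word positions per query term.
    # Returns the width of the smallest span covering all terms.
    heads = [(p[0], i, 0) for i, p in enumerate(positions)]
    heapq.heapify(heads)
    cur_max = max(p[0] for p in positions)
    best = cur_max - min(p[0] for p in positions)
    while True:
        lo, i, j = heapq.heappop(heads)       # advance the smallest head
        best = min(best, cur_max - lo)
        if j + 1 == len(positions[i]):        # one term exhausted: done
            return best
        nxt = positions[i][j + 1]
        cur_max = max(cur_max, nxt)
        heapq.heappush(heads, (nxt, i, j + 1))

# word positions in two toy documents:
doc_a = [[2, 40], [3, 90]]   # terms adjacent near the top
doc_b = [[0], [500]]         # terms far apart
print(min_cover_window(doc_a))
print(min_cover_window(doc_b))
```

A proximity weight like 1 / (1 + window) could then be multiplied into the page coefficient before coeff-sort, though how (or whether) this maps cleanly onto the operator pipeline above is an open question.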
That's it for this post!
Update: here is a fun one:
sa: F1 |thomas ronald richard bill barack george james jimmy>
100|WP: US presidents>

Update: Another way to look at it is that h1 is "word-1 OR word-2", while F1 is "word-1 AND word-2".