Thursday, 2 April 2015

new function: intn-find-topic[op]

The motivation for this function is some of the results in the last post. Most of the time find-topic[op] gives good results, but I found that sometimes it gives really crappy results. In particular, when I asked about "Japan Russia China" I got this messy result:
sa: h1 |japan russia china>
40.697|WP: rivers> + 22.252|WP: Australia> + 19.422|WP: country list> + 12.804|WP: particle physics> + 4.825|WP: Adelaide>
But I also found if I took the intersection of the separate results for Japan, Russia and China I got a much better result:
sa: intn(h1 |japan>, h1 |russia>, h1 |china>)
12.308|WP: country list>
And hence some new python:
-- in the ket class (currently we don't have a superposition version of this, and probably don't need one!):
  def intn_find_topic(self,context,op):
    words = self.label.lower().split()       # lower-case the label, so it is case insensitive.
    if len(words) == 0:                      # empty label, so nothing to look up.
      return ket("",0)
    # look up each word separately:
    results = [context.map_to_topic(ket(x),op) for x in words]
    if len(results) == 0:                    # this should never be true!
      return ket("",0)
    # keep only the topics that are common to every word:
    r = results[0]
    for sp in results:
      r = intersection(r,sp)
    return r.normalize(100).coeff_sort()
Now, if the ket is a single word, then it gives exactly the same answer as find-topic[words-1]. But if the ket is several words separated by spaces, then it usually gives a much better result.
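For anyone who wants to play with the idea outside the project, here is a minimal standalone sketch. It assumes a superposition can be modelled as a plain dict of label -> coefficient, and that intersection keeps the smaller coefficient for labels present in both. The names toy_find_topic and TOY_TOPICS, and the numbers in the table, are made up for illustration; they stand in for context.map_to_topic and the real word-frequency data.

```python
def intersection(a, b):
    """Keep only the labels present in both dicts, with the smaller coefficient."""
    return {label: min(coeff, b[label]) for label, coeff in a.items() if label in b}

def normalize(sp, total=100.0):
    """Rescale the coefficients so they sum to total (empty stays empty)."""
    s = sum(sp.values())
    if s == 0:
        return {}
    return {label: total * coeff / s for label, coeff in sp.items()}

def coeff_sort(sp):
    """Order the entries from largest to smallest coefficient."""
    return dict(sorted(sp.items(), key=lambda kv: kv[1], reverse=True))

def intn_find_topic(query, find_topic):
    """Look up each word separately, then intersect the per-word results."""
    words = query.lower().split()
    if not words:
        return {}
    results = [find_topic(w) for w in words]
    r = results[0]
    for sp in results[1:]:
        r = intersection(r, sp)
    return coeff_sort(normalize(r))

# Made-up per-word results, for illustration only.
TOY_TOPICS = {
    "nile":  {"WP: rivers": 60.0, "WP: Adelaide": 5.0},
    "river": {"WP: rivers": 50.0, "WP: Australia": 20.0},
}

def toy_find_topic(word):
    """Hypothetical stand-in for the real per-word find-topic lookup."""
    return TOY_TOPICS.get(word, {})

result = intn_find_topic("Nile River", toy_find_topic)
print(result)
```

In this toy data only "WP: rivers" is matched by both words, so it is the only topic that survives the intersection.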

So, let's try the troublesome examples from last time:
-- load the data:
sa: load improved-WP-word-frequencies.sw

-- save some typing:
-- h1 is the old method, F1 is the new method.
sa: h1 |*> #=> find-topic[words-1] split |_self>
sa: F1 |*> #=> intn-find-topic[words-1] |_self>

-- ask about the Nile:
sa: h1 |nile river>
76.811|WP: rivers> + 13.788|WP: Adelaide> + 9.401|WP: Australia>

sa: F1 |nile river>
100|WP: rivers>

-- ask about George Bush:
sa: h1 |george bush>
67.705|WP: US presidents> + 22.363|WP: Australia> + 9.932|WP: Adelaide>

sa: F1 |george bush>
77.465|WP: US presidents> + 22.535|WP: Australia>

-- ask about Japan, Russia and China:
sa: h1 |japan russia china>
40.697|WP: rivers> + 22.252|WP: Australia> + 19.422|WP: country list> + 12.804|WP: particle physics> + 4.825|WP: Adelaide>

sa: F1 |japan russia china>
100|WP: country list>
So it really is a good improvement on the standard find-topic[op]! At some stage I should probably scale it up to even more of Wikipedia.

Recall the discussion (from long ago) about the difference between intersection and soft-intersection? Maybe I should find the link! Anyway, h1 can be considered to be using a soft intersection approach, and F1 the strict intersection approach (which is really a better fit for a search engine type algo anyway! You generally don't want pages that ignore one or more of your query terms.)
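To make the soft versus strict distinction concrete, here is a small sketch, again using plain dicts of label -> coefficient. The soft version simply adds coefficients across words; that is an assumption chosen to show the OR-like behaviour, not a claim about how find-topic[op] is implemented. The strict version keeps only topics that every word matched. The function names and the numbers are invented for illustration.

```python
def soft_combine(results):
    """OR-like: add coefficients, so a topic matched by any word survives."""
    combined = {}
    for sp in results:
        for label, coeff in sp.items():
            combined[label] = combined.get(label, 0.0) + coeff
    return combined

def strict_combine(results):
    """AND-like: keep only topics every word matched, with the smallest coefficient."""
    if not results:
        return {}
    common = set(results[0])
    for sp in results[1:]:
        common &= set(sp)
    return {label: min(sp[label] for sp in results) for label in common}

# Made-up per-word results, for illustration only.
per_word = [
    {"WP: country list": 30.0, "WP: rivers": 10.0},     # "japan"
    {"WP: country list": 25.0, "WP: rivers": 40.0},     # "russia"
    {"WP: country list": 20.0, "WP: Australia": 35.0},  # "china"
]

soft = soft_combine(per_word)
strict = strict_combine(per_word)
print(soft)
print(strict)
```

With these numbers the soft version still reports rivers and Australia, even though neither was matched by all three words, while the strict version keeps only the country list.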

Anyway, my "ultra simple search algo" now looks like this:
|answer> => table[page,coeff,url] select[1,10] coeff-sort weight-pages intn-find-topic[words-1] |just some words>
I guess the final comment is that we can perhaps still tweak this further. The proposed algo does not take into consideration the relative closeness of words in a document. E.g., if one word is at the top and the other is at the bottom, I presume you would want that to rank lower than if those two words were near each other. How to do that cleanly I don't know.
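One possible starting point, and only a sketch, is to measure the smallest gap between two query words in a page's token list and turn that into a weight. Nothing like this exists in the project; min_word_distance and proximity_weight are hypothetical names, and the 1/(1+distance) weighting is an arbitrary choice for illustration.

```python
def min_word_distance(tokens, word_a, word_b):
    """Smallest gap, in tokens, between any occurrence of word_a and word_b.

    Returns None if either word is missing from tokens.
    """
    last_a = last_b = None
    best = None
    for i, tok in enumerate(tokens):
        if tok == word_a:
            last_a = i
        if tok == word_b:
            last_b = i
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best

def proximity_weight(distance):
    """Hypothetical weighting: closer words give a weight nearer to 1."""
    if distance is None:
        return 0.0
    return 1.0 / (1.0 + distance)

near_page = "the nile river flows north".split()
far_page = ["nile"] + ["filler"] * 50 + ["river"]

near = proximity_weight(min_word_distance(near_page, "nile", "river"))
far = proximity_weight(min_word_distance(far_page, "nile", "river"))
print(near, far)
```

A weight like this could in principle multiply a page's coefficient before the final coeff-sort, though how to extend it cleanly to more than two words is still open.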

That's it for this post!

Update: here is a fun one:
sa: F1 |thomas ronald richard bill barack george james jimmy>
100|WP: US presidents>
Update: Another way to look at it is that h1 is "word-1 OR word-2", F1 is "word-1 AND word-2".
