Thursday, 2 April 2015

mapping wikipedia pages to frequency lists

This is a proof of concept to see whether we can use find-topic[op] to search the web, in this particular case a tiny sample of Wikipedia pages. Who knows, maybe we don't need PageRank to search the web? That is all highly speculative, though, and I haven't done any work in that direction. But I do have the Wikipedia version working. Here is the code to map my sample Wikipedia pages to frequency lists.
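
To make "frequency lists" concrete, here is a minimal Python sketch of the general idea. It is not the script used to build improved-WP-word-frequencies.sw: the function names, the tokenisation rule, and the exact layout of the generated rule are assumptions for illustration only. It counts 1-, 2- and 3-grams in a piece of text and prints the 1-gram counts in the same "op |ket> => superposition" shape used elsewhere in this post.

```python
import re
from collections import Counter


def ngram_frequencies(text, n):
    """Count the n-grams (runs of n consecutive words) in a piece of text."""
    # Assumed tokenisation: lower-case, keep runs of letters, digits and apostrophes.
    # Note that n-grams are allowed to run across sentence boundaries here.
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


def to_learn_rule(op, label, frequencies):
    """Format a frequency list as a single 'op |label> => ...' rule (assumed layout)."""
    terms = " + ".join(f"{count}|{gram}>" for gram, count in frequencies.most_common())
    return f"{op} |{label}> => {terms}"


# Hypothetical sample text standing in for a real page.
sample = "The River Torrens runs through Adelaide. The river is short."
words_1 = ngram_frequencies(sample, 1)
words_2 = ngram_frequencies(sample, 2)
words_3 = ngram_frequencies(sample, 3)

print(to_learn_rule("words-1", "WP: sample", words_1))
```

The same function covers words-1, words-2 and words-3 simply by changing n, which is why a 2-gram query has to match the page's word order exactly.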

Now, in the console:
-- load the data:
sa: load improved-WP-word-frequencies.sw

-- save some typing:
sa: h1 |*> #=> find-topic[words-1] split |_self>
sa: h2 |*> #=> find-topic[words-2] |_self>
sa: h3 |*> #=> find-topic[words-3] |_self>
sa: t1 |*> #=> table[page,coeff] find-topic[words-1] split |_self>
-- NB: note the "split" in h1 and t1 (h2 and h3 don't have it). This is important: it breaks the query into single words before the 1-gram lookup.
-- NB: words-1 are 1-gram word frequencies. words-2 are 2-gram word frequencies. words-3 are 3-gram word frequencies.

-- where will I find info on Adelaide?
sa: h1 |adelaide>
74.576|WP: Adelaide> + 25.424|WP: Australia>

-- where will I find info on Adelaide university?
sa: h1 |adelaide university>
66.236|WP: Adelaide> + 33.764|WP: Australia>

-- and again, this time using 3-grams:
sa: h3 |university of adelaide>
76.923|WP: Adelaide> + 23.077|WP: Australia>

-- where will I find info on Aami Stadium?
sa: h1 |aami stadium>
100|WP: Adelaide>

-- where will I find info on Perth?
sa: h1 |perth>
100|WP: Australia>

-- where will I find info on the Nile river?
sa: h1 |nile river>
76.811|WP: rivers> + 13.788|WP: Adelaide> + 9.401|WP: Australia>
-- hrmmm... Adelaide and Australia are in there because of "river"
-- let me show you:
sa: h1 |river>
53.621|WP: rivers> + 27.577|WP: Adelaide> + 18.802|WP: Australia>

-- so try again, this time using 2-grams: 
sa: h2 |nile river>
|>
-- null result.
-- so try again:
sa: h2 |river nile>
100.0|WP: rivers>
-- so we finally got there, but note how exact you have to be. Hence again, why we need a "did you mean" feature.

-- where will I find info on Bill Clinton:
sa: h1 |bill clinton>
100|WP: US presidents>

-- where will I find info on Nixon:
sa: h1 |nixon>
100|WP: US presidents>

-- where will I find info on George Bush (first try 2-grams):
sa: h2 |george bush>
|>

-- this time using 1-grams:
sa: h1 |george bush>
67.705|WP: US presidents> + 22.363|WP: Australia> + 9.932|WP: Adelaide>

-- now, why are Australia and Adelaide in there? I will show you:
sa: h1 |george>
62.077|WP: US presidents> + 19.865|WP: Adelaide> + 18.059|WP: Australia>

sa: h1 |bush>
73.333|WP: US presidents> + 26.667|WP: Australia>

-- where will I find info on physics:
sa: h1 |physics>
54.237|WP: physics> + 45.763|WP: particle physics>

-- where will I find info on electrons:
sa: h1 |electron>
62.791|WP: particle physics> + 37.209|WP: physics>

-- what about Newton?
sa: h1 |newton>
100|WP: physics>

-- and Einstein?
sa: h1 |einstein>
100|WP: physics>

-- and Feynman?
sa: h1 |feynman>
64|WP: physics> + 36|WP: particle physics>

-- where will I find info on Japan, Russia and China?
sa: h1 |japan russia china>
40.697|WP: rivers> + 22.252|WP: Australia> + 19.422|WP: country list> + 12.804|WP: particle physics> + 4.825|WP: Adelaide>
-- hrmm ... that didn't work very well.

-- let's look at the components:
-- Japan?
sa: h1 |japan>
53.598|WP: Australia> + 24.566|WP: particle physics> + 21.836|WP: country list>

-- Russia?
sa: h1 |russia>
73.846|WP: rivers> + 13.846|WP: particle physics> + 12.308|WP: country list>

-- China?
sa: h1 |china>
48.246|WP: rivers> + 24.123|WP: country list> + 14.474|WP: Adelaide> + 13.158|WP: Australia>

-- let's try an intersection:
sa: intn(h1 |japan>, h1 |russia>, h1 |china>)
12.308|WP: country list>
-- that worked much better!!
-- Indeed, I think that is a strong hint I should write an intn-find-topic[op] function!
So it all works pretty well. The important question is whether it works better than standard search. I don't know.
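
Looking back at the transcript, the multi-word h1 results are numerically consistent with adding the single-word results and rescaling to 100, and the intn result is consistent with keeping only the pages common to every result, each with its smallest coefficient. That is inferred from the printed numbers, not taken from the find-topic or intn source, so treat the Python below as a sketch under that assumption. The function name add_and_normalise is hypothetical; the coefficients are copied from the console output above.

```python
def add_and_normalise(results, total=100.0):
    """Add several {page: coeff} results and rescale so the coefficients sum to `total`."""
    summed = {}
    for result in results:
        for page, coeff in result.items():
            summed[page] = summed.get(page, 0.0) + coeff
    scale = total / sum(summed.values())
    ranked = sorted(summed.items(), key=lambda kv: kv[1], reverse=True)
    return {page: round(coeff * scale, 3) for page, coeff in ranked}


def intn(results):
    """Keep only pages present in every result, each with its smallest coefficient."""
    shared = set(results[0]).intersection(*results[1:])
    return {page: min(r[page] for r in results) for page in shared}


# Single-word results copied from the transcript.
george = {"WP: US presidents": 62.077, "WP: Adelaide": 19.865, "WP: Australia": 18.059}
bush = {"WP: US presidents": 73.333, "WP: Australia": 26.667}
japan = {"WP: Australia": 53.598, "WP: particle physics": 24.566, "WP: country list": 21.836}
russia = {"WP: rivers": 73.846, "WP: particle physics": 13.846, "WP: country list": 12.308}
china = {"WP: rivers": 48.246, "WP: country list": 24.123,
         "WP: Adelaide": 14.474, "WP: Australia": 13.158}

# Compare with the console output for h1 |george bush> and h1 |japan russia china>.
print(add_and_normalise([george, bush]))
print(add_and_normalise([japan, russia, china]))

# Compare with the console output for intn(h1 |japan>, h1 |russia>, h1 |china>).
print(intn([japan, russia, china]))
```

This also shows why the 1-gram query for three countries went wrong: adding the results lets a page that scores highly for just one or two of the words (here "WP: rivers") outrank the only page that contains all three.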

Another question is how well this will work if we scale it up to all of wikipedia? It will certainly be slow, at least with the current code, but how good would the results be compared to the search function already built into wikipedia, or searching wikipedia indirectly using google?

I guess at this point I could propose an ultra simple search algorithm.
Say you search for "just some words"; then the back end would run this BKO:
|answer> => table[page,coeff] select[1,10] coeff-sort weight-pages find-topic[words-1] split |just some words>
where weight-pages re-weights the pages returned by find-topic[op] according to some measure of page quality.
E.g., I'm thinking something like:
-- "url: a" is a good page:
weight-pages |url: a> => 7|url: a>

-- "url: b" is not a good page:
weight-pages |url: b> => 0.2|url: b>

-- "url: c" is an ok page:
weight-pages |url: c> => 2|url: c>
And I guess that is it for this post!

Update: BTW, to safely handle the case of an unknown url (which would otherwise map to |>), define this general rule:
weight-pages |url: *> #=> |_self>
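
For readers who think more easily in Python, the tail of that pipeline (weight-pages, coeff-sort, select[1,10]) could be sketched roughly as follows. This is an illustration only, not the project's code: the function name rank_pages and the find-topic coefficients are made up, and it assumes weight-pages multiplies each page's existing coefficient by its quality weight, with unknown pages left unchanged as in the general rule above.

```python
def rank_pages(topic_result, page_weights, top=10, default_weight=1.0):
    """Re-weight find-topic coefficients by page quality and keep the best `top` pages."""
    # weight-pages: multiply each coefficient by the page's quality weight.
    # Pages with no explicit weight keep their coefficient (default weight 1).
    weighted = {
        page: coeff * page_weights.get(page, default_weight)
        for page, coeff in topic_result.items()
    }
    # coeff-sort: largest coefficient first.
    ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
    # select[1,10]: keep only the first `top` entries.
    return ranked[:top]


# Quality weights from the examples above.
page_weights = {"url: a": 7, "url: b": 0.2, "url: c": 2}

# Hypothetical find-topic output for some query; "url: d" has no weight defined.
topic_result = {"url: a": 10.0, "url: b": 60.0, "url: c": 25.0, "url: d": 5.0}

ranking = rank_pages(topic_result, page_weights)
for page, coeff in ranking:
    print(f"{page:10} {coeff:8.3f}")
```

With these made-up numbers the low-quality "url: b" drops from first place to third even though it was the strongest raw match, which is the kind of effect weight-pages is meant to have.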
