Tuesday, 21 April 2015
Announcing phase 4: tying it all together
We are almost done! I have given the details of my project in the best way I can. Yeah, other people could do better, but I can't change that. Now I have to try and explain why I think my BKO model ties in closely with a simplified model of neural circuits, and why my notation serves as a good mathematical foundation for symbolic AI.
the full wage prediction results
So, I got bored of this, but I guess I should post my results! Spoiler: 77.1% success rate.
OK. First I did some precomputation:
    load adult-wage-pattern-recognition.sw
    simm |*> #=> select[1,100] similar[input-pattern,pattern] |_self>
    map[simm,similarity-result] rel-kets[input-pattern] |>
    save adult-wage-pattern-recognition--saved-simm.sw

This took about a week! Yeah, we could do with more speed. Thankfully similar[op] should be easy to parallelize. But now we have this it is very quick to play with settings.
    -- load up the results:
    sa: load adult-wage-pattern-recognition--saved-simm.sw

    -- find the number of "above 50k" and "below 50k" in the training set:
    $ grep "^M" adult-wage-pattern-recognition--saved-simm.sw | grep -c "above"
    7841
    $ grep "^M" adult-wage-pattern-recognition--saved-simm.sw | grep -c "below"
    24720

    -- define our norm matrix, that takes into account the relative frequencies of "above 50k" vs "below 50k":
    sa: norm |above-50K> => .000127534753220 |_self>
    sa: norm |below-50K> => .000040453074433 |_self>

    -- define our first attempt at a h:
    sa: h |*> #=> normalize[100] coeff-sort norm M select[1,5] similarity-result |_self>

    -- define a couple of useful operators:
    sa: equal? |*> #=> equal(h|_self>,100 answer |_self>)
    sa: is-equal? |*> #=> max-elt wif(equal? |_self>,|True>,|False>)

    -- find the table of results:
    sa: table[input,h,answer,is-equal?] rel-kets[input-pattern] |>
    result: adult-wage-prediction-table-select-1-5.txt

    -- now the results for this h:
    $ grep -c "example" adult-wage-prediction-table-select-1-5.txt
    16281
    $ grep -c "True" adult-wage-prediction-table-select-1-5.txt
    12195

    -- the percent correct:
    100*12195/16281 = 74.903 %

    -- next attempt at h (just pick the best match, and ignore the rest):
    sa: h |*> #=> 100 M select[1,1] similarity-result |_self>

    -- find the table of results:
    sa: table[input,h,answer,is-equal?] rel-kets[input-pattern] |>
    result: adult-wage-prediction-table-select-1-1.txt

    -- now the results for this h:
    $ grep -c "True" adult-wage-prediction-table-select-1-1.txt
    12549

    -- the percent correct:
    100*12549/16281 = 77.077 %

Finally, I tried using apply-weights, but I couldn't improve on 77.1%.
eg:
    h |*> #=> normalize[100] coeff-sort norm M apply-weights[5,4,3,2,1] similarity-result |_self>

Maybe if we had some iterative procedure to choose the weights on a sample set, and then apply that to the full set, we might improve on 77%. But I gave up!
And a note, these tables of 16,281 entries take about 2 minutes to generate. Without the precomputation, they would take the full week, for each tweak of h.
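To make the first h above a bit more concrete, here is a rough Python sketch of what "normalize[100] coeff-sort norm M select[1,5]" seems to amount to: keep the top few similarity matches, map each to its training label, scale by one over the class count, then rescale to 100 and sort. The function name `predict`, the list-of-pairs layout for the matches and the `labels` dict are assumptions made for illustration; this is not the project's actual code, and the example nodes and similarities are made up.

```python
# Class counts taken from the grep output above; everything else is hypothetical.
CLASS_COUNTS = {"above-50K": 7841, "below-50K": 24720}

def predict(matches, labels, k=5):
    """matches: list of (node, similarity) pairs, best match first.
    labels: dict mapping node -> class name (the role M plays above)."""
    votes = {}
    for node, sim in matches[:k]:                      # select[1,k]
        cls = labels[node]                             # M
        votes[cls] = votes.get(cls, 0.0) + sim / CLASS_COUNTS[cls]   # norm
    total = sum(votes.values())
    if total == 0:
        return []
    scaled = {c: 100 * v / total for c, v in votes.items()}          # normalize[100]
    return sorted(scaled.items(), key=lambda p: p[1], reverse=True)  # coeff-sort

# Made-up example: two "below" matches are outvoted by one "above" match,
# because the "above" class is rarer in the training set.
matches = [("node: 7", 0.9), ("node: 3", 0.8), ("node: 12", 0.8)]
labels = {"node: 7": "below-50K", "node: 3": "above-50K", "node: 12": "below-50K"}
ranking = predict(matches, labels)
```

Calling `predict(matches, labels, k=1)` would correspond to the "just pick the best match" variant.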
Another possible method to improve on 77% and get closer to the 84% I see with other methods is to tweak our supervised pattern recognition algo. The apply-weights is really trying to change weights after the similarity has been calculated. But we can also do it before, and pre-weight our superpositions before we feed them to simm.
So instead of:
    Given the training data set D:
      D = {(X1,Y1),(X2,Y2),...(Xn,Yn)}
    where Xi, and Yi are superpositions (and must not be empty superpositions that have all coeffs equal to 0)

    Then learn these rules:
      pattern |node: 1> => X1
      pattern |node: 2> => X2
      ...
      pattern |node: n> => Xn

      M |node: 1> => Y1
      M |node: 2> => Y2
      ...
      M |node: n> => Yn

    Then given the unlabeled data set U = {Z1,Z2,...Zm}, where Zi are superpositions of the same type as Xi, learn these rules:
      input-pattern |example: 1> => Z1
      input-pattern |example: 2> => Z2
      ...
      input-pattern |example: m> => Zm

We first find a matrix W that re-weights our Xk and Zk superpositions/patterns. Then do:

    Given the training data set D:
      D = {(X1,Y1),(X2,Y2),...(Xn,Yn)}
    where Xi, and Yi are superpositions (and must not be empty superpositions that have all coeffs equal to 0)

    Then learn these rules:
      pattern |node: 1> => W X1
      pattern |node: 2> => W X2
      ...
      pattern |node: n> => W Xn

      M |node: 1> => Y1
      M |node: 2> => Y2
      ...
      M |node: n> => Yn

    Then given the unlabeled data set U = {Z1,Z2,...Zm}, where Zi are superpositions of the same type as Xi, learn these rules:
      input-pattern |example: 1> => W Z1
      input-pattern |example: 2> => W Z2
      ...
      input-pattern |example: m> => W Zm

And note that W does not need to be square. Indeed, the output of "W Xk" can be a completely different type of superposition than Xk. But again, like the apply-weights idea, I don't know a good way to find W. Perhaps borrow some ideas from standard artificial neural networks?
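As a small illustration of the "W does not need to be square" remark, here is a minimal Python sketch that treats a superposition as a dict of label to coefficient and W as a sparse matrix stored as nested dicts. The helper name `apply_matrix`, the feature labels and the weights are all hypothetical; how to actually choose W is left open, exactly as in the text.

```python
def apply_matrix(W, x):
    """W: dict mapping an input label -> {output label: weight}.
    x: dict mapping label -> coeff (a superposition).
    Returns W x as a new dict; labels with no row in W are dropped."""
    out = {}
    for label, coeff in x.items():
        for new_label, w in W.get(label, {}).items():
            out[new_label] = out.get(new_label, 0.0) + w * coeff
    return out

# Hypothetical W: three input features map onto two output features,
# so the result is a different type of superposition than the input.
W = {
    "age: 40": {"mature": 1.0},
    "education: masters": {"educated": 2.0},
    "education: bachelors": {"educated": 1.5},
}
X1 = {"age: 40": 1.0, "education: masters": 1.0}
WX1 = apply_matrix(W, X1)
```

The same `apply_matrix(W, Zk)` call would be applied to each unlabeled pattern before the similarity step, so that training and input patterns stay comparable.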
That's it for this post!
Update: I tried a new h, but only got 74% success (12043/16281).
    h |*> #=> normalize[100] coeff-sort norm M invert subtraction-invert[1] select[1,5] similarity-result |_self>

I also tried select[1,3] and select[1,10] but they were worse.
Saturday, 18 April 2015
new function: apply-weights[n1,n2,..]
Another brief one. So the motivation was to try and improve my results in the adult wage prediction example. Previously I just used select[1,5] applied to the top similarity matches. Apply-weights[] can be considered a partial generalization of that, in the sense that apply-weights[1,1,1,1,1] is the same as select[1,5].
Here is the python:
    def apply_weights(one,weights):
        weights = weights.split(",")
        result = superposition()
        for k,x in enumerate(one):
            if k >= len(weights):
                break
            result += x.multiply(float(weights[k]))
        return result

And here is a brief example:

    sa: apply-weights[3.1415,0,6,7.3,13] split |a b c d e f g h i j>
    3.142|a> + 0|b> + 6|c> + 7.3|d> + 13|e>

That should be clear enough.
BTW, as for the wage prediction results, well, I tried some examples and I failed to improve on just picking the result with the highest match (ie, select[1,1]), at 77.1% success rate. Maybe if you choose the weights just right you will get a better result? I don't yet know how to do that though.
That's it for this post.
Update: apply-weights can be considered to be multiplication by a diagonal matrix, with the weights the values on the diagonal.
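Here is a minimal stand-alone sketch of that diagonal matrix view, using plain lists of (label, coeff) pairs instead of the project's superposition class. Both function names and the list layout are assumptions for illustration only.

```python
def apply_weights(sp, weights):
    """sp: ordered list of (label, coeff) pairs.
    Keeps only the first len(weights) terms, each scaled by its weight."""
    return [(label, coeff * w) for (label, coeff), w in zip(sp, weights)]

def diagonal_multiply(sp, weights):
    """The same operation written as an explicit diagonal matrix D
    multiplying the vector of coefficients."""
    n = min(len(sp), len(weights))
    D = [[weights[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    coeffs = [c for _, c in sp[:n]]
    return [(sp[i][0], sum(D[i][j] * coeffs[j] for j in range(n))) for i in range(n)]

# Same input as the console example: split |a b c d e f g h i j>
sp = [(ch, 1.0) for ch in "abcdefghij"]
weights = [3.1415, 0, 6, 7.3, 13]
result = apply_weights(sp, weights)
```

With all weights set to 1, `apply_weights(sp, [1, 1, 1, 1, 1])` just returns the first five terms, which is the select[1,5] special case mentioned above.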
new function: full-exp[op,n]
Just a brief one. I don't currently have a use for it, but part of my brain says it might be useful. Yeah, I guess I am not a minimalist. Anyway, a tweak on exp[op,n], this time we keep the 1/factorial(k) factor.
So, simply enough:
full-exp[op,n] |x>
maps to: (1 + op/1! + op^2/2! + ... + op^n/n!) |x>
No point giving the python.
Anyway, a quick example:
    -- define our X operator:
    sa: X |*> #=> algebra(|x>,|*>,|_self>)

    -- the previous exp operator:
    sa: exp[X,6] |1>
    |1> + |x> + |x*x> + |x*x*x> + |x*x*x*x> + |x*x*x*x*x> + |x*x*x*x*x*x>

    -- the new exp operator:
    sa: full-exp[X,6] |1>
    |1> + |x> + 0.5|x*x> + 0.167|x*x*x> + 0.042|x*x*x*x> + 0.008|x*x*x*x*x> + 0.001|x*x*x*x*x*x>

And I guess that is it. We see the standard Taylor series for exp(x) as expected.
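As a quick sanity check on those coefficients, here is a tiny Python sketch that computes the 1/k! factors directly. The function name `full_exp_coeffs` is made up for this example; it only reproduces the numbers, not the operator itself.

```python
from math import factorial

def full_exp_coeffs(n):
    """Coefficients 1/k! for k = 0..n, i.e. the weight full-exp[op,n] puts on op^k."""
    return [1 / factorial(k) for k in range(n + 1)]

# Rounded to 3 places, the way the console displays them.
coeffs = [round(c, 3) for c in full_exp_coeffs(6)]
print(coeffs)
```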
Update: and we can use exp as a "smear" operator, to expand a single point into a range of points. So, in a way, a little like the range function.
In the console:
    -- define a translation operator:
    T |*> #=> arithmetic(|_self>,|+>,|x: 1>)

    -- a test case (translate by 1):
    sa: T |x: 0>
    |x: 1>

    -- another test case (translate by 7):
    sa: T^7 |x: 0>
    |x: 7>

    -- use exp[op,n] as a smear operator:
    sa: exp[T,7] |x: 0>
    |x: 0> + |x: 1> + |x: 2> + |x: 3> + |x: 4> + |x: 5> + |x: 6> + |x: 7>
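The smear idea can be sketched in a few lines of Python, under the assumption that exp[op,n] simply collects x, op x, op^2 x, ..., op^n x. The names `exp_op` and `translate` are hypothetical stand-ins, and plain integers play the role of the |x: k> kets.

```python
def exp_op(op, n, x):
    """exp[op,n] applied to x, without the 1/k! factors:
    returns [x, op(x), op(op(x)), ...] with n+1 entries."""
    terms = [x]
    for _ in range(n):
        x = op(x)
        terms.append(x)
    return terms

def translate(x):
    """Stand-in for T: |x: k> maps to |x: k+1>."""
    return x + 1

smeared = exp_op(translate, 7, 0)
print(smeared)
```

With this T the result is the same list that `range(8)` gives, which is the "a little like the range function" observation.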
Tuesday, 14 April 2015
new function: list-kets
So, I guess this thing could be called the brother of rel-kets[op]. The idea is simply, return a superposition of all known kets that match the pattern.
eg:
list-kets |movie: *> returns the list of all movies (in the current context)
list-kets |person: *> returns the list of all people
list-kets |animal: *> returns the list of all animals
list-kets |*> returns all kets with learn rules.
Here is the python (in the new_context class):
    # e is a ket.
    def list_kets(self,e):
        label = e.the_label()
        if len(label) == 0:
            return ket("",0)
        if label[-1] != "*":
            return e
        label = label.rstrip("*").rstrip(": ")
        result = superposition()
        for trial_label in self.ket_rules_dict:
            if trial_label.startswith(label):
                result.data.append(ket(trial_label))
        return result

I hope that is simple and obvious enough. Should be useful in a bunch of places.
As usual, more to come!
Update: here is a quick example. What is the list of animals with 4 legs?
such-that[has-4-legs] list-kets |animal: *>
Update: For now I have deprecated list-kets. I have replaced it with starts-with.
eg, list of animals with 2 legs?
such-that[has-2-legs] starts-with |animal: >
eg, list all kets that have Fred as a first name?
starts-with |Fred >
eg, list all kets:
starts-with |>
Here is the new python:
    # e is a ket.
    def starts_with(self,e):
        label = e.the_label()
        result = superposition()
        for trial_label in self.ket_rules_dict:
            if trial_label.startswith(label):
                result.data.append(ket(trial_label))
        return result

That's it for now. Hrmm... I wonder if I broke any code with this change?
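For anyone without the context class to hand, here is a stand-alone sketch of the same prefix match over a plain list of labels. The labels are invented for the example, and the free function `starts_with` is only an illustration of the idea, not the method above.

```python
def starts_with(known_labels, prefix):
    """Return every known label that begins with prefix (the empty prefix matches all)."""
    return [label for label in known_labels if label.startswith(prefix)]

# Hypothetical set of kets that have learn rules.
known = ["animal: dog", "animal: bird", "Fred Smith", "Fred Jones", "Freddy Kruger", "movie: Alien"]

animals = starts_with(known, "animal: ")
freds = starts_with(known, "Fred ")      # trailing space, as in starts-with |Fred >
everything = starts_with(known, "")      # as in starts-with |>
```

Note that the trailing space in "Fred " is what keeps "Freddy Kruger" out of the result.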
Saturday, 11 April 2015
new function: subset
Now, rounding out one more maths piece. We already have set union in the form of union(SP1,SP2), set intersection in the form of intn(SP1,SP2), test for set membership in the form of mbr(KET,SP), and now today, test for subsetness subset(SP1,SP2). It returns |subset> if an exact subset, and c|subset> where c < 1 for not exactly subset.
Here is the python (yeah, trivial):
    def subset(one,two):
        if one.count_sum() == 0:          # prevent div by 0.
            return ket("",0)
        value = intersection(one,two).count_sum()/one.count_sum()
        return ket("subset",value)

And here are some simple examples in the console:

    -- full subset:
    sa: subset(|b>,|a> + |b> + |c>)
    |subset>

    -- partial subset:
    sa: subset(|a>,0.8|a>)
    0.8|subset>

    -- another partial subset:
    sa: subset(|a> + |d>,|a> + |b> + |c>)
    0.5|subset>

    -- another full subset:
    sa: subset(|b> + |d> + |e> + |f>,|a> + |b> + |c> + |d> + |e> + |f> + |g> + |h>)
    |subset>

    -- another full subset, this one with coeffs other than just 0 and 1:
    sa: subset(0.8|a> + 3|b> + 7|c> + 0.2|d>,|a> + 4|b> + 7|c> + 5|d> + 37|e>)
    |subset>

    -- not at all a subset:
    sa: subset(|d> + |e> + |f>,|a> + |b> + |c>)
    0|subset>

I guess that is it. I hope that is clear enough. Now, I don't yet know where I will use it, but presumably it will be useful!
And an observation. I guess one interpretation of subset(SP1,SP2) == 1 is that for every point of a "curve" in SP1, its value is bounded by (ie, less than or equal) the value of SP2 at that same point. I suspect this will be an interesting idea.
BTW, this post was partly motivated by multi-sets. Though multi-sets can be considered a subset of superpositions, since the latter can have non-integer coeffs.
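Here is a self-contained sketch of subset using dicts of label to coefficient. It assumes intersection takes the minimum of the two coefficients for each shared label, which is consistent with the console examples above but is an assumption about the project's intersection function. It returns a plain number rather than a ket.

```python
def intersection(one, two):
    """Pointwise min over labels present in both (assumed behaviour)."""
    return {k: min(v, two[k]) for k, v in one.items() if k in two}

def subset(one, two):
    """Fraction of one's total coefficient mass that is covered by two."""
    total = sum(one.values())
    if total == 0:                      # prevent div by 0.
        return 0.0
    return sum(intersection(one, two).values()) / total

full = subset({"b": 1}, {"a": 1, "b": 1, "c": 1})
partial = subset({"a": 1}, {"a": 0.8})
half = subset({"a": 1, "d": 1}, {"a": 1, "b": 1, "c": 1})
weighted = subset({"a": 0.8, "b": 3, "c": 7, "d": 0.2},
                  {"a": 1, "b": 4, "c": 7, "d": 5, "e": 37})
none = subset({"d": 1, "e": 1, "f": 1}, {"a": 1, "b": 1, "c": 1})
```

The `weighted` case is the "bounded curve" reading: every coefficient in the first argument is less than or equal to the matching one in the second, so the ratio comes out as 1.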
Thursday, 2 April 2015
this concludes phase 3
So, all the basics are out of the way! Phase 1 and literal operators, phase 2 and function operators, phase 3 making heavy use of similar[op], find-topic[op] and more complicated BKO examples. I guess it will be the job of phase 4 to try and tie it all together. I think I'm going to find that hard!
So that is it for now! More to come of course!
mapping mindpixels to BKO rules
So, borrowing the data from the now closed mindpixel project, we can show how you would map them to BKO. And I think I prefer the term "molecule of knowledge" rather than "mindpixel".
Anyway, some mindpixels:
    1.00 is icecream cold?
    1.00 is earth a planet?
    1.00 Is green a color?
    1.00 do airplanes fly?
    1.00 Is it hot during the summer?
    1.00 is chile in south america ?
    1.00 Was Socrates a man?
    1.00 Computers use electricity?
    1.00 The dominant language in france is french?
    1.00 was abraham lincoln once president of the united states?
    1.00 Is milk white?
    1.00 do people have emotions?
    1.00 do objects appear smaller as they move away from you?
    1.00 Does the human species have a male and female gender?
    1.00 Is a mountain mostly made of rock?
    1.00 is sun microsystems a computer company?
    1.00 Do you see with your eyes and smell with your nose?
    1.00 Is smoking bad for your health?
    1.00 Does a dog have four legs?
    1.00 Do mammals have hearts?
    1.00 is the Earth a planet?
    1.00 Is water a liquid?
    1.00 Is Bugs Bunny a cartoon character?
    1.00 Do Humans communicate by Telephone?
    1.00 is beer a drink ?
    1.00 are there 12 months in a year?
    1.00 does the sun hurt your eyes when you look at it?
    1.00 Do most cars have doors?
    1.00 is orange both a fruit and a colour?
    1.00 Is water a necessity?
    ...

Now as BKO:

    is-cold |icecream> => |yes>
    is-a-planet |earth> => |yes>
    is-a-color |green> => |yes>
    does-fly |airplanes> => |yes>
    is-hot-during |summer> => |yes>
    is-in-south-america |chile> => |yes>
    is-a-man |Socrates> => |yes>
    uses-electricity |computer> => |yes>
    spoken-language |france> => |french>
    was-a-us-president |Abraham Lincoln> => |yes>
    is-white |milk> => |yes>
    have-emotions |people> => |yes>
    ...
    gender |human species> => |male> + |female>
    is-made-of-rock |mountain> => |yes>
    is-a-computer-company |sun microsystems> => |yes>
    see-with |eyes> => |yes>
    smell-with |nose> => |yes>
    is-bad-for-your-health |smoking> => |yes>
    has-four-legs |dog> => |yes>
    has-a-heart |animal> => |yes>
    is-a-planet |earth> => |yes>          -- duplicate rule
    is-liquid |water> => |yes>
    is-a-cartoon-character |bugs bunny> => |yes>
    communicates-by-telephone |humans> => |yes>
    is-a-drink |beer> => |yes>
    how-many-months |year> => |number: 12>
    hurts-your-eyes-when-you-look-at-it |sun> => |yes>
    has-doors |car> => |yes>
    is-a-fruit |orange> => |yes>
    is-a-color |orange> => |yes>
    is-a-necessity |water> => |yes>

Note that "is-something" rules seem kind of pointless, but they are actually very useful when combined with "such-that[condition] some-list".
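To show why the "is-something" rules earn their keep, here is a minimal Python sketch of the such-that filter over a few of the rules above. The nested-dict layout and the function name `such_that` are assumptions for illustration; the real operator works on kets and superpositions.

```python
# A handful of the rules above, stored as operator -> {label: answer}.
rules = {
    "is-a-color": {"green": "yes", "orange": "yes"},
    "is-a-fruit": {"orange": "yes"},
    "is-a-planet": {"earth": "yes"},
}

def such_that(op, labels):
    """Keep only the labels for which the rule op answers yes."""
    return [x for x in labels if rules.get(op, {}).get(x) == "yes"]

things = ["green", "orange", "earth", "milk"]
colors = such_that("is-a-color", things)
fruit = such_that("is-a-fruit", things)
```

Labels with no matching rule (like "milk" here) simply drop out, so a missing rule behaves like a "no".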
That's it for this post. I think I made my point.
similarity matrices for wikipedia word frequency lists
Just a quick one this time. The similarity matrices for our wikipedia word frequency lists.
In the console:
    -- load the data:
    sa: load improved-WP-word-frequencies.sw

    -- create our matrices:
    sa: simm-1 |*> #=> 100 self-similar[words-1] |_self>
    sa: simm-2 |*> #=> 100 self-similar[words-2] |_self>
    sa: simm-3 |*> #=> 100 self-similar[words-3] |_self>
    sa: map[simm-1,similarity-1] rel-kets[words-1] |>
    sa: map[simm-2,similarity-2] rel-kets[words-2] |>
    sa: map[simm-3,similarity-3] rel-kets[words-3] |>

    -- display them:
    sa: matrix[similarity-1]
    [ WP: Adelaide         ] = [  100.0  56.86  15.13  37.53  38.7   27.41  24.25  ] [ WP: Adelaide         ]
    [ WP: Australia        ]   [  56.86  100.0  16.48  37.53  40.39  27.63  24.32  ] [ WP: Australia        ]
    [ WP: country list     ]   [  15.13  16.48  100.0  9.77   8.53   30.32  21.36  ] [ WP: country list     ]
    [ WP: particle physics ]   [  37.53  37.53  9.77   100.0  51.62  22.68  19.5   ] [ WP: particle physics ]
    [ WP: physics          ]   [  38.7   40.39  8.53   51.62  100.0  22.76  18.1   ] [ WP: physics          ]
    [ WP: rivers           ]   [  27.41  27.63  30.32  22.68  22.76  100.0  24.52  ] [ WP: rivers           ]
    [ WP: US presidents    ]   [  24.25  24.32  21.36  19.5   18.1   24.52  100.0  ] [ WP: US presidents    ]

    sa: matrix[similarity-2]
    [ WP: Adelaide         ] = [  100.0  15.04  1.73   6.4    7.71   4.39   3.2    ] [ WP: Adelaide         ]
    [ WP: Australia        ]   [  15.04  100.0  2.27   5.92   8.04   4.43   4.26   ] [ WP: Australia        ]
    [ WP: country list     ]   [  1.73   2.27   100.0  1.49   1.46   2.18   1.35   ] [ WP: country list     ]
    [ WP: particle physics ]   [  6.4    5.92   1.49   100.0  13.86  3.81   3.28   ] [ WP: particle physics ]
    [ WP: physics          ]   [  7.71   8.04   1.46   13.86  100.0  4.52   2.63   ] [ WP: physics          ]
    [ WP: rivers           ]   [  4.39   4.43   2.18   3.81   4.52   100.0  2.98   ] [ WP: rivers           ]
    [ WP: US presidents    ]   [  3.2    4.26   1.35   3.28   2.63   2.98   100.0  ] [ WP: US presidents    ]

    sa: matrix[similarity-3]
    [ WP: Adelaide         ] = [  100.0  2.59   0.16   0.64   0.73   0.24   0.1    ] [ WP: Adelaide         ]
    [ WP: Australia        ]   [  2.59   100.0  0.34   0.47   0.96   0.35   0.53   ] [ WP: Australia        ]
    [ WP: country list     ]   [  0.16   0.34   100.0  0.14   0.1    0.46   0.22   ] [ WP: country list     ]
    [ WP: particle physics ]   [  0.64   0.47   0.14   100.0  2.98   0.26   0.17   ] [ WP: particle physics ]
    [ WP: physics          ]   [  0.73   0.96   0.1    2.98   100.0  0.48   0.14   ] [ WP: physics          ]
    [ WP: rivers           ]   [  0.24   0.35   0.46   0.26   0.48   100.0  0.21   ] [ WP: rivers           ]
    [ WP: US presidents    ]   [  0.1    0.53   0.22   0.17   0.14   0.21   100.0  ] [ WP: US presidents    ]

That's all clear enough. No further comment needed.
Heh, we can even show the matrices of words, but they are way too big to post.
Instead:
matrix[words-1]
matrix[words-2]
matrix[words-3]
new function: intn-find-topic[op]
The motivation for this function is some of the results in the last post. Most of the time find-topic[op] gives good results, but I found sometimes it gives really crappy results. In particular, when I asked about "Japan Russia China" I got this messy result:
    sa: h1 |japan russia china>
    40.697|WP: rivers> + 22.252|WP: Australia> + 19.422|WP: country list> + 12.804|WP: particle physics> + 4.825|WP: Adelaide>

But I also found if I took the intersection of the separate results for Japan, Russia and China I got a much better result:

    sa: intn(h1 |japan>, h1 |russia>, h1 |china>)
    12.308|WP: country list>

And hence some new python:

    # in the ket class (currently we don't have a superposition version of this, and probably don't need one!):
    def intn_find_topic(self,context,op):
        words = self.label.lower().split()       # we made it case insensitive.
        if len(words) == 0:
            return ket("",0)
        results = [context.map_to_topic(ket(x),op) for x in words]
        if len(results) == 0:                    # this should never be true!
            return ket("",0)
        r = results[0]
        for sp in results:
            r = intersection(r,sp)
        return r.normalize(100).coeff_sort()

Now, if the ket is a single word, then it gives the exact same answer as find-topic[words-1]. But if the ket is words separated by space, then it usually gives a much better result.
So, let's try the troublesome examples from last time:
    -- load the data:
    sa: load improved-WP-word-frequencies.sw

    -- save some typing:
    -- h1 is the old method, F1 is the new method.
    sa: h1 |*> #=> find-topic[words-1] split |_self>
    sa: F1 |*> #=> intn-find-topic[words-1] |_self>

    -- ask about the Nile:
    sa: h1 |nile river>
    76.811|WP: rivers> + 13.788|WP: Adelaide> + 9.401|WP: Australia>
    sa: F1 |nile river>
    100|WP: rivers>

    -- ask about George Bush:
    sa: h1 |george bush>
    67.705|WP: US presidents> + 22.363|WP: Australia> + 9.932|WP: Adelaide>
    sa: F1 |george bush>
    77.465|WP: US presidents> + 22.535|WP: Australia>

    -- ask about Japan, Russia and China:
    sa: h1 |japan russia china>
    40.697|WP: rivers> + 22.252|WP: Australia> + 19.422|WP: country list> + 12.804|WP: particle physics> + 4.825|WP: Adelaide>
    sa: F1 |japan russia china>
    100|WP: country list>

So it really is a good improvement on standard find-topic[op]! So at some stage I should probably scale it up to even more of wikipedia.
Recall the discussion (from long ago) about the difference between intersection and soft-intersection? Maybe I should find the link! Anyway, h1 can be considered to be using a soft intersection approach, and F1 the strict intersection approach (which is really a better fit for a search engine type algo anyway! You generally don't want pages that ignore one or more of your query terms.)
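The soft versus strict distinction can be replayed in plain Python using the per-word results for Japan, Russia and China quoted in these posts. This is only a sketch: superpositions are plain dicts, intersection is assumed to be a pointwise min, and the "soft" version is assumed to be add-then-normalize (which does appear to reproduce the h1 numbers). The function names are made up for the example.

```python
def intersection(one, two):
    """Pointwise min over labels present in both (assumed behaviour)."""
    return {k: min(v, two[k]) for k, v in one.items() if k in two}

def normalize(sp, t=100):
    total = sum(sp.values())
    return {k: t * v / total for k, v in sp.items()} if total else {}

def and_topics(results):
    """Strict intersection of the per-word results, like F1."""
    r = results[0]
    for sp in results[1:]:
        r = intersection(r, sp)
    return normalize(r)

def or_topics(results):
    """Add the per-word results and renormalize, roughly what h1 does."""
    total = {}
    for sp in results:
        for k, v in sp.items():
            total[k] = total.get(k, 0.0) + v
    return normalize(total)

# Per-word h1 results copied from the console sessions in these posts.
japan = {"WP: Australia": 53.598, "WP: particle physics": 24.566, "WP: country list": 21.836}
russia = {"WP: rivers": 73.846, "WP: particle physics": 13.846, "WP: country list": 12.308}
china = {"WP: rivers": 48.246, "WP: country list": 24.123, "WP: Adelaide": 14.474, "WP: Australia": 13.158}

strict = and_topics([japan, russia, china])
soft = or_topics([japan, russia, china])
```

Here `strict` contains only "WP: country list", while the largest entry in `soft` is "WP: rivers", matching the messy h1 answer.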
Anyway, my "ultra simple search algo" now looks like this:
    |answer> => table[page,coeff,url] select[1,10] coeff-sort weight-pages intn-find-topic[words-1] |just some words>

I guess the final comment is we can still perhaps tweak this further. This proposed algo does not take into consideration relative closeness of words in a document. eg, if one word is at the top, and the other is at the bottom, I presume you would want that a lesser result than if those two words were near each other. How to do that cleanly I don't know.
That's it for this post!
Update: here is a fun one:
    sa: F1 |thomas ronald richard bill barack george james jimmy>
    100|WP: US presidents>

Update: Another way to look at it is that h1 is "word-1 OR word-2", F1 is "word-1 AND word-2".
mapping wikipedia pages to frequency lists
This is a proof of concept that maybe we can use find-topic[op] to search the web, in this particular case a tiny sample of wikipedia pages. Who knows, maybe we don't need page rank to search the web? But that is all highly speculative, and I haven't done any work in that direction. I do have the wikipedia version working, though. Here is the code to map my sample wikipedia pages to frequency lists.
Now, in the console:
    -- load the data:
    sa: load improved-WP-word-frequencies.sw

    -- save some typing:
    sa: h1 |*> #=> find-topic[words-1] split |_self>
    sa: h2 |*> #=> find-topic[words-2] |_self>
    sa: h3 |*> #=> find-topic[words-3] |_self>
    sa: t1 |*> #=> table[page,coeff] find-topic[words-1] split |_self>
    -- NB: note the "split" in there in h1. This is important to note!
    -- NB: words-1 are 1-gram word frequencies. words-2 are 2-gram word frequencies. words-3 are 3-gram word frequencies.

    -- where will I find info on Adelaide?
    sa: h1 |adelaide>
    74.576|WP: Adelaide> + 25.424|WP: Australia>

    -- where will I find info on Adelaide university?
    sa: h1 |adelaide university>
    66.236|WP: Adelaide> + 33.764|WP: Australia>

    -- and again, this time using 3-grams:
    sa: h3 |university of adelaide>
    76.923|WP: Adelaide> + 23.077|WP: Australia>

    -- where will I find info on Aami Stadium?
    sa: h1 |aami stadium>
    100|WP: Adelaide>

    -- where will I find info on Perth?
    sa: h1 |perth>
    100|WP: Australia>

    -- where will I find info on the Nile river?
    sa: h1 |nile river>
    76.811|WP: rivers> + 13.788|WP: Adelaide> + 9.401|WP: Australia>
    -- hrmmm... Adelaide and Australia are in there because of "river"
    -- let me show you:
    sa: h1 |river>
    53.621|WP: rivers> + 27.577|WP: Adelaide> + 18.802|WP: Australia>

    -- so try again, this time using 2-grams:
    sa: h2 |nile river>
    |>
    -- null result.

    -- so try again:
    sa: h2 |river nile>
    100.0|WP: rivers>
    -- so we finally got there, but note how exact you have to be. Hence again, why we need a "did you mean" feature.

    -- where will I find info on Bill Clinton:
    sa: h1 |bill clinton>
    100|WP: US presidents>

    -- where will I find info on Nixon:
    sa: h1 |nixon>
    100|WP: US presidents>

    -- where will I find info on George Bush (first try 2-grams):
    sa: h2 |george bush>
    |>

    -- this time using 1-grams:
    sa: h1 |george bush>
    67.705|WP: US presidents> + 22.363|WP: Australia> + 9.932|WP: Adelaide>
    -- now, why are Australia and Adelaide in there? I will show you:
    sa: h1 |george>
    62.077|WP: US presidents> + 19.865|WP: Adelaide> + 18.059|WP: Australia>
    sa: h1 |bush>
    73.333|WP: US presidents> + 26.667|WP: Australia>

    -- where will I find info on physics:
    sa: h1 |physics>
    54.237|WP: physics> + 45.763|WP: particle physics>

    -- where will I find info on electrons:
    sa: h1 |electron>
    62.791|WP: particle physics> + 37.209|WP: physics>

    -- what about Newton?
    sa: h1 |newton>
    100|WP: physics>

    -- and Einstein?
    sa: h1 |einstein>
    100|WP: physics>

    -- and Feynman?
    sa: h1 |feynman>
    64|WP: physics> + 36|WP: particle physics>

    -- where will I find info on Japan, Russia and China?
    sa: h1 |japan russia china>
    40.697|WP: rivers> + 22.252|WP: Australia> + 19.422|WP: country list> + 12.804|WP: particle physics> + 4.825|WP: Adelaide>
    -- hrmm ... that didn't work very well.
    -- let's look at the components:
    -- Japan?
    sa: h1 |japan>
    53.598|WP: Australia> + 24.566|WP: particle physics> + 21.836|WP: country list>
    -- Russia?
    sa: h1 |russia>
    73.846|WP: rivers> + 13.846|WP: particle physics> + 12.308|WP: country list>
    -- China?
    sa: h1 |china>
    48.246|WP: rivers> + 24.123|WP: country list> + 14.474|WP: Adelaide> + 13.158|WP: Australia>

    -- let's try an intersection:
    sa: intn(h1 |japan>, h1 |russia>, h1 |china>)
    12.308|WP: country list>
    -- that worked much better!!
    -- Indeed, I think that is a strong hint I should write an intn-find-topic[op] function!

So it all works pretty well. The important question is does it work better than standard search? I don't know.
Another question is how well this will work if we scale it up to all of wikipedia? It will certainly be slow, at least with the current code, but how good would the results be compared to the search function already built into wikipedia, or searching wikipedia indirectly using google?
I guess at this point I could propose an ultra simple search algo.
Say you search for "just some words", then this back-end BKO:
    |answer> => table[page,coeff] select[1,10] coeff-sort weight-pages find-topic[words-1] split |just some words>

where weight-pages re-weights the pages returned from find-topic[op] based on some measure of quality for a page.
eg, I'm thinking something like:
    -- "url: a" is a good page:
    weight-pages |url: a> => 7|url: a>
    -- "url: b" is not a good page:
    weight-pages |url: b> => 0.2|url: b>
    -- "url: c" is an ok page:
    weight-pages |url: c> => 2|url: c>

And I guess that is it for this post!
Update: BTW, to safely handle the case of an unknown url (which would otherwise map to |>), define this general rule:
weight-pages |url: *> #=> |_self>
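The weight-pages step, including the fallback rule for unknown urls, could be sketched in Python like this. The weights for "url: a", "url: b" and "url: c" come from the example above; the function names, the `found` results and "url: d" are made up for illustration.

```python
# Page quality weights from the example rules above.
PAGE_WEIGHTS = {"url: a": 7.0, "url: b": 0.2, "url: c": 2.0}

def weight_pages(results):
    """results: dict url -> coeff from the topic search.
    Unknown urls keep their coeff unchanged, like the general |url: *> rule."""
    return {url: PAGE_WEIGHTS.get(url, 1.0) * coeff for url, coeff in results.items()}

def top_pages(results, n=10):
    """weight-pages, then coeff-sort, then select[1,n]."""
    weighted = weight_pages(results)
    return sorted(weighted.items(), key=lambda p: p[1], reverse=True)[:n]

# Hypothetical search result: "url: b" matches best but is a poor page,
# and "url: d" has no quality rule at all.
found = {"url: a": 10.0, "url: b": 60.0, "url: d": 30.0}
ranking = top_pages(found)
```

In this made-up case the good page "url: a" ends up first even though its raw match was the weakest, and "url: d" survives thanks to the default weight of 1.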