Wednesday 25 March 2015

find-unique[op] applied to webpage superpositions

Recall that back here we mapped sample websites to superpositions, and then did pattern recognition on them. Well, using find-unique[op] we can get a vastly better result!
-- load up the data:
sa: load improved-fragment-webpages.sw
sa: load create-average-website-fragments.sw
Here is the tweaked BKO that makes use of find-unique[op] (this is inside this file):
-- define the list of average websites:
|ave list> => |average abc> + |average adelaidenow> + |average slashdot> + |average smh> + |average wikipedia> + |average youtube>

-- we want average hash to be distinct from the other hashes:
|null> => map[hash-4B,average-hash-4B] "" |ave list>

-- find unique kets for our average superpositions:
|null> => find-unique[average-hash-4B] |>

-- now, let's see how well these patterns recognize the pages we left out of our average:
result |abc 11> => 100 similar[hash-4B,unique-average-hash-4B] |abc 11>
result |adelaidenow 11> => 100 similar[hash-4B,unique-average-hash-4B] |adelaidenow 11>
result |slashdot 11> => 100 similar[hash-4B,unique-average-hash-4B] |slashdot 11>
result |smh 11> => 100 similar[hash-4B,unique-average-hash-4B] |smh 11>
result |wikipedia 11> => 100 similar[hash-4B,unique-average-hash-4B] |wikipedia 11>
result |youtube 11> => 100 similar[hash-4B,unique-average-hash-4B] |youtube 11>

-- tidy results:
tidy-result |abc 11> => normalize[100] result |_self>
tidy-result |adelaidenow 11> => normalize[100] result |_self>
tidy-result |slashdot 11> => normalize[100] result |_self>
tidy-result |smh 11> => normalize[100] result |_self>
tidy-result |wikipedia 11> => normalize[100] result |_self>
tidy-result |youtube 11> => normalize[100] result |_self>
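As an aside, here is a rough Python sketch of what the find-unique[op] step is doing: for each superposition, keep only the kets that appear in no other superposition. Superpositions are modelled as dicts of ket -> coeff, and the names here are just for illustration, not the actual implementation:

from collections import Counter

def find_unique(superpositions):
    # superpositions: {label: {ket: coeff}}
    # count how many of the superpositions each ket appears in:
    membership = Counter(ket for sp in superpositions.values() for ket in sp)
    # keep only the kets that appear in exactly one superposition:
    return {label: {ket: coeff for ket, coeff in sp.items() if membership[ket] == 1}
            for label, sp in superpositions.items()}

# tiny made-up example:
sps = {
    'average abc':      {'</a>': 50, 'abc-only-fragment': 3},
    'average slashdot': {'</a>': 70, 'slashdot-only-fragment': 5},
}
print(find_unique(sps))
# the shared </a> ket is dropped, the website-specific kets survive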
And here are the results:
sa: matrix[result]
[ average abc         ] = [  36.0  0      0      0      0      0      ] [ abc 11         ]
[ average adelaidenow ]   [  0     38.66  0      0      0      0      ] [ adelaidenow 11 ]
[ average slashdot    ]   [  0     0      35.48  0.04   0      0      ] [ slashdot 11    ]
[ average smh         ]   [  0     0.02   0.02   36.99  0      0      ] [ smh 11         ]
[ average wikipedia   ]   [  0     0.01   0.03   0      36.54  0      ] [ wikipedia 11   ]
[ average youtube     ]   [  0     0.02   0      0      0      36.72  ] [ youtube 11     ]

sa: matrix[tidy-result]
[ average abc         ] = [  100  0      0      0     0    0    ] [ abc 11         ]
[ average adelaidenow ]   [  0    99.87  0      0     0    0    ] [ adelaidenow 11 ]
[ average slashdot    ]   [  0    0      99.86  0.1   0    0    ] [ slashdot 11    ]
[ average smh         ]   [  0    0.05   0.05   99.9  0    0    ] [ smh 11         ]
[ average wikipedia   ]   [  0    0.03   0.1    0     100  0    ] [ wikipedia 11   ]
[ average youtube     ]   [  0    0.05   0      0     0    100  ] [ youtube 11     ]
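For reference, normalize[t] rescales the coefficients of a superposition so they sum to t, which is why each column of the tidy-result matrix sums to roughly 100. Here is a quick Python sketch of that idea (again, just an illustration, not the actual code):

def normalize(sp, t=1):
    # rescale a superposition (a dict of ket -> coeff) so its coefficients sum to t:
    total = sum(sp.values())
    return {ket: t * coeff / total for ket, coeff in sp.items()} if total else dict(sp)

# eg, the slashdot 11 column of the result matrix (values as displayed, so already rounded):
column = {'average slashdot': 35.48, 'average smh': 0.02, 'average wikipedia': 0.03}
print(normalize(column, 100))
# the best match rescales to roughly 99.86, and the tiny cross terms stay tiny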
Some notes:
1) that is a seriously good result! Yeah, without the normalize[100] we are down from 90% to 36% similarity, but the gap between the best result and the next best is now rather large, which we see clearly in the tidy-result matrix (which does make use of normalize[100]). Heh, and we don't need drop-below[t] anymore either!
2) it is interesting that we can get such a big improvement using only 1 new line of code (the find-unique[average-hash-4B] bit) and a few tweaks to the existing BKO.
3) this technique of dropping back to considering only unique kets only works some of the time. For a start, you need large superpositions, and a lot of kets that are unique from superposition to superposition. For example, this technique would not work for the Iris example, the wage prediction example, or the document-type example. I'm wondering if there is a way to borrow the general idea of suppressing duplicate kets, but one that is not as harsh as keeping only unique kets. Maybe something as simple as: if a ket is in n superpositions, map coeff => coeff/n? Or do we need something smarter than that? (There is a rough Python sketch of this idea at the end of these notes.)
4) let's take a look at how many unique kets we have:
sa: how-many-hash |*> #=> to-comma-number how-many average-hash-4B |_self>
sa: how-many-unique-hash |*> #=> to-comma-number how-many unique-average-hash-4B |_self>
sa: delta |*> #=> arithmetic(how-many average-hash-4B |_self>,|->,how-many unique-average-hash-4B |_self>)
sa: table[website,how-many-hash,how-many-unique-hash,delta] "" |ave list>
+---------------------+---------------+----------------------+-------+
| website             | how-many-hash | how-many-unique-hash | delta |
+---------------------+---------------+----------------------+-------+
| average abc         | 1,492         | 1,391                | 101   |
| average adelaidenow | 11,869        | 11,636               | 233   |
| average slashdot    | 5,462         | 5,275                | 187   |
| average smh         | 10,081        | 9,784                | 297   |
| average wikipedia   | 3,182         | 3,084                | 98    |
| average youtube     | 6,390         | 6,310                | 80    |
+---------------------+---------------+----------------------+-------+
I didn't really expect that. I thought there would be a lot more duplicate kets, but instead we only see a couple of hundred per website. But since removing them gave such an improvement in our results, presumably the duplicate kets had relatively large coeffs. eg, the ket generated from </a> in html will be universal across our webpages, and have a large coeff.
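And while I'm here, a quick, untested Python sketch of the gentler idea from note 3, ie, down-weight duplicate kets rather than dropping them entirely: if a ket appears in n superpositions, map coeff => coeff/n. Again, superpositions as dicts, names just for illustration:

from collections import Counter

def soften_duplicates(superpositions):
    # if a ket appears in n of the superpositions, divide its coeff by n:
    membership = Counter(ket for sp in superpositions.values() for ket in sp)
    return {label: {ket: coeff / membership[ket] for ket, coeff in sp.items()}
            for label, sp in superpositions.items()}

sps = {
    'average abc':      {'</a>': 60, 'abc-only-fragment': 3},
    'average slashdot': {'</a>': 90, 'slashdot-only-fragment': 5},
}
print(soften_duplicates(sps))
# the shared </a> ket is halved, the website-specific kets are untouched

Whether that beats plain find-unique[op] on the webpage data is something to test another day.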

I guess that is it for this post. Back to find-topic[op] in the next couple of posts.
