Friday 6 March 2015

creating average webpage superpositions

OK. I'm working towards something here, so bear with me. Recall I had six websites (abc, adelaidenow, slashdot, smh, wikipedia, youtube), and I downloaded a copy of each once a day for 11 days. In this post we will create average superpositions of these by adding up the first 10 copies. The 11th is the test case to run our pattern recognition against.

So, here is the BKO for this:
(note: I didn't type all of this by hand; cut-and-paste and search/replace helped!)
-- create average webpages:
-- we deliberately leave out |website 11> from our hashes, that's the test case
|abc list> => |abc 1> + |abc 2> + |abc 3> + |abc 4> + |abc 5> + |abc 6> + |abc 7> + |abc 8> + |abc 9> + |abc 10>
hash-64k |average abc> => hash-64k "" |abc list>
hash-1M |average abc> => hash-1M "" |abc list>
hash-4B |average abc> => hash-4B "" |abc list>

|adelaidenow list> => |adelaidenow 1> + |adelaidenow 2> + |adelaidenow 3> + |adelaidenow 4> + |adelaidenow 5> + |adelaidenow 6> + |adelaidenow 7> + |adelaidenow 8> + |adelaidenow 9> + |adelaidenow 10>
hash-64k |average adelaidenow> => hash-64k "" |adelaidenow list>
hash-1M |average adelaidenow> => hash-1M "" |adelaidenow list>
hash-4B |average adelaidenow> => hash-4B "" |adelaidenow list>

|slashdot list> => |slashdot 1> + |slashdot 2> + |slashdot 3> + |slashdot 4> + |slashdot 5> + |slashdot 6> + |slashdot 7> + |slashdot 8> + |slashdot 9> + |slashdot 10>
hash-64k |average slashdot> => hash-64k "" |slashdot list>
hash-1M |average slashdot> => hash-1M "" |slashdot list>
hash-4B |average slashdot> => hash-4B "" |slashdot list>

|smh list> => |smh 1> + |smh 2> + |smh 3> + |smh 4> + |smh 5> + |smh 6> + |smh 7> + |smh 8> + |smh 9> + |smh 10>
hash-64k |average smh> => hash-64k "" |smh list>
hash-1M |average smh> => hash-1M "" |smh list>
hash-4B |average smh> => hash-4B "" |smh list>

|wikipedia list> => |wikipedia 1> + |wikipedia 2> + |wikipedia 3> + |wikipedia 4> + |wikipedia 5> + |wikipedia 6> + |wikipedia 7> + |wikipedia 8> + |wikipedia 9> + |wikipedia 10>
hash-64k |average wikipedia> => hash-64k "" |wikipedia list>
hash-1M |average wikipedia> => hash-1M "" |wikipedia list>
hash-4B |average wikipedia> => hash-4B "" |wikipedia list>

|youtube list> => |youtube 1> + |youtube 2> + |youtube 3> + |youtube 4> + |youtube 5> + |youtube 6> + |youtube 7> + |youtube 8> + |youtube 9> + |youtube 10>
hash-64k |average youtube> => hash-64k "" |youtube list>
hash-1M |average youtube> => hash-1M "" |youtube list>
hash-4B |average youtube> => hash-4B "" |youtube list>
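
To make the intent of those rules concrete, here is a rough Python sketch of what they amount to. It assumes (my assumption, not spelled out in this post) that hash-64k and friends split a saved page into text fragments and hash each fragment into a table of a given size, so a superposition is essentially a Counter of bucket labels; the file names and the line-based fragmentation scheme below are placeholders, not the actual code behind improved-fragment-webpages.sw.

from collections import Counter
from pathlib import Path
import hashlib

def hash_superposition(text, table_size):
    """Rough stand-in for hash-64k / hash-1M / hash-4B: fragment the page
    and hash each fragment into a table of the given size, returning a
    Counter {bucket: count} as the superposition."""
    fragments = text.split("\n")              # placeholder fragmentation scheme
    buckets = Counter()
    for frag in fragments:
        h = int(hashlib.md5(frag.encode("utf8")).hexdigest(), 16)
        buckets[h % table_size] += 1
    return buckets

def average_superposition(site, days, table_size):
    """Sum the first `days` daily copies, leaving the later copies as test cases."""
    total = Counter()
    for day in range(1, days + 1):
        text = Path(f"{site}-{day}.html").read_text(errors="ignore")
        total += hash_superposition(text, table_size)   # Counter addition == superposition addition
    return total

# e.g. the 64k average for abc, built from days 1..10 (assuming 64k ~ 2**16 buckets):
# average_abc_64k = average_superposition("abc", 10, 2**16)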
Now a couple of notes about generating our average hashes.
1) we make use of linearity of operators.
e.g.:
hash-64k "" |abc list>
expands to:
hash-64k (|abc 1> + |abc 2> + |abc 3> + ...)
which expands to:
hash-64k |abc 1> + hash-64k |abc 2> + hash-64k |abc 3> + ...
2) we don't need to normalize our averages, i.e., we don't need to divide the sum by 10. This is because our similarity metric auto-rescales the incoming superpositions, so the shape of a superposition usually matters more than its overall amplitude (a small sketch of such a rescaled metric follows these notes).
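
Here is a minimal sketch of a rescaled similarity with that property. It is not necessarily the exact simm definition used elsewhere in this project; the point is only that once each side is rescaled to unit sum, multiplying one side by a constant (say 10) changes nothing, which is why the explicit sum/10 step is unnecessary.

from collections import Counter

def rescaled_simm(A, B):
    """Scale-invariant similarity between two superpositions (Counters):
    rescale each side to unit sum, then sum the pointwise minimums.
    A sketch of the kind of metric meant here, not the project's exact simm."""
    sA, sB = sum(A.values()), sum(B.values())
    if sA == 0 or sB == 0:
        return 0.0
    return sum(min(A[k] / sA, B[k] / sB) for k in A.keys() | B.keys())

# Rescaling one side by a constant leaves the similarity unchanged:
A = Counter({"x": 3, "y": 1})
B = Counter({"x": 2, "y": 2})
tenA = Counter({k: 10 * v for k, v in A.items()})
assert abs(rescaled_simm(A, B) - rescaled_simm(tenA, B)) < 1e-12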

Now, let's look at how big these superpositions are:
-- load the data:
sa: load improved-fragment-webpages.sw
sa: load create-average-website-fragments.sw

-- define a couple of operators:
sa: count-hash-64k |*> #=> count hash-64k |_self>
sa: count-hash-1M |*> #=> count hash-1M |_self>
sa: count-hash-4B |*> #=> count hash-4B |_self>
sa: delta-1M-64k |*> #=> arithmetic(count-hash-1M|_self>,|->,count-hash-64k|_self>)
sa: delta-4B-1M |*> #=> arithmetic(count-hash-4B|_self>,|->,count-hash-1M|_self>)
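
For intuition about what these operators measure, here is a hedged Python sketch: count-hash-64k is just the number of distinct buckets in the 64k superposition, and the delta operators count how many extra distinct buckets appear once the table is big enough to separate fragments that collided in the smaller table. I am assuming 64k, 1M and 4B correspond to roughly 2**16, 2**20 and 2**32 buckets.

import hashlib

def count_distinct_buckets(fragments, table_size):
    """count-hash-N analogue: number of distinct buckets the fragments land in."""
    return len({int(hashlib.md5(f.encode("utf8")).hexdigest(), 16) % table_size
                for f in fragments})

def deltas(fragments):
    """delta-1M-64k and delta-4B-1M analogues: extra buckets resolved by a bigger
    table, i.e. collisions present in the smaller table but not the larger one."""
    c64k = count_distinct_buckets(fragments, 2**16)   # assuming 64k ~ 2**16
    c1M  = count_distinct_buckets(fragments, 2**20)   # assuming 1M  ~ 2**20
    c4B  = count_distinct_buckets(fragments, 2**32)   # assuming 4B  ~ 2**32
    return c1M - c64k, c4B - c1M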

-- define the list of average websites:
|ave list> => |average abc> + |average adelaidenow> + |average slashdot> + |average smh> + |average wikipedia> + |average youtube>

-- Now, take a look at the tables:
sa: table[page,count-hash-64k,count-hash-1M,count-hash-4B] "" |ave list>
+---------------------+----------------+---------------+---------------+
| page                | count-hash-64k | count-hash-1M | count-hash-4B |
+---------------------+----------------+---------------+---------------+
| average abc         | 1476           | 1491          | 1492          |
| average adelaidenow | 10840          | 11798         | 11869         |
| average slashdot    | 5235           | 5441          | 5462          |
| average smh         | 9326           | 10044         | 10081         |
| average wikipedia   | 3108           | 3178          | 3182          |
| average youtube     | 6082           | 6370          | 6390          |
+---------------------+----------------+---------------+---------------+

sa: table[page,count-hash-64k,delta-1M-64k,delta-4B-1M] "" |ave list>
+---------------------+----------------+--------------+-------------+
| page                | count-hash-64k | delta-1M-64k | delta-4B-1M |
+---------------------+----------------+--------------+-------------+
| average abc         | 1476           | 15           | 1           |
| average adelaidenow | 10840          | 958          | 71          |
| average slashdot    | 5235           | 206          | 21          |
| average smh         | 9326           | 718          | 37          |
| average wikipedia   | 3108           | 70           | 4           |
| average youtube     | 6082           | 288          | 20          |
+---------------------+----------------+--------------+-------------+
Anyway, that is all exploratory I suppose. The take-home message is that the 4B superpositions are probably the best to work with: the delta-4B-1M column shows only a handful of extra buckets, presumably because hash collisions are essentially gone at that table size. Some similarity matrices in the next couple of posts.
