So, here is the BKO for this:
(note I didn't type all this. Cut and paste and search/replace helped!)
-- create average webpages:
-- we deliberately leave out |website 11> from our hashes, that's the test case

|abc list> => |abc 1> + |abc 2> + |abc 3> + |abc 4> + |abc 5> + |abc 6> + |abc 7> + |abc 8> + |abc 9> + |abc 10>
hash-64k |average abc> => hash-64k "" |abc list>
hash-1M |average abc> => hash-1M "" |abc list>
hash-4B |average abc> => hash-4B "" |abc list>

|adelaidenow list> => |adelaidenow 1> + |adelaidenow 2> + |adelaidenow 3> + |adelaidenow 4> + |adelaidenow 5> + |adelaidenow 6> + |adelaidenow 7> + |adelaidenow 8> + |adelaidenow 9> + |adelaidenow 10>
hash-64k |average adelaidenow> => hash-64k "" |adelaidenow list>
hash-1M |average adelaidenow> => hash-1M "" |adelaidenow list>
hash-4B |average adelaidenow> => hash-4B "" |adelaidenow list>

|slashdot list> => |slashdot 1> + |slashdot 2> + |slashdot 3> + |slashdot 4> + |slashdot 5> + |slashdot 6> + |slashdot 7> + |slashdot 8> + |slashdot 9> + |slashdot 10>
hash-64k |average slashdot> => hash-64k "" |slashdot list>
hash-1M |average slashdot> => hash-1M "" |slashdot list>
hash-4B |average slashdot> => hash-4B "" |slashdot list>

|smh list> => |smh 1> + |smh 2> + |smh 3> + |smh 4> + |smh 5> + |smh 6> + |smh 7> + |smh 8> + |smh 9> + |smh 10>
hash-64k |average smh> => hash-64k "" |smh list>
hash-1M |average smh> => hash-1M "" |smh list>
hash-4B |average smh> => hash-4B "" |smh list>

|wikipedia list> => |wikipedia 1> + |wikipedia 2> + |wikipedia 3> + |wikipedia 4> + |wikipedia 5> + |wikipedia 6> + |wikipedia 7> + |wikipedia 8> + |wikipedia 9> + |wikipedia 10>
hash-64k |average wikipedia> => hash-64k "" |wikipedia list>
hash-1M |average wikipedia> => hash-1M "" |wikipedia list>
hash-4B |average wikipedia> => hash-4B "" |wikipedia list>

|youtube list> => |youtube 1> + |youtube 2> + |youtube 3> + |youtube 4> + |youtube 5> + |youtube 6> + |youtube 7> + |youtube 8> + |youtube 9> + |youtube 10>
hash-64k |average youtube> => hash-64k "" |youtube list>
hash-1M |average youtube> => hash-1M "" |youtube list>
hash-4B |average youtube> => hash-4B "" |youtube list>

Now a couple of notes about generating our average hashes.
1) We make use of the linearity of operators.
eg:

hash-64k "" |abc list>

expands to:

hash-64k (|abc 1> + |abc 2> + |abc 3> + ...)

which expands to:

hash-64k |abc 1> + hash-64k |abc 2> + hash-64k |abc 3> + ...

2) We don't need to normalize our averages, ie, we don't need sum/10. This is because our similarity metric auto-rescales the incoming superpositions, so the shape of a superposition is usually more important than its amplitude.
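For readers who think better in Python than in BKO, here is a rough sketch of the two notes above. It is not the project's actual code: a superposition is modelled as a plain Counter of ket labels, toy_hash_op is a made-up stand-in for hash-64k, and simm assumes one particular rescaling (each superposition is divided by the sum of its coefficients, then the overlap is the sum of the smaller coefficients). The point is only that applying an operator to a sum is the same as summing the operator over the terms, and that dividing by the number of pages does not change a rescaled similarity.

```python
from collections import Counter
from math import isclose

def apply_linear(op, superposition):
    """Apply a ket -> superposition operator term by term (linearity)."""
    result = Counter()
    for ket, coeff in superposition.items():
        for out_ket, out_coeff in op(ket).items():
            result[out_ket] += coeff * out_coeff
    return result

def rescale(sp):
    """Divide by the coefficient sum (an assumed form of rescaling)."""
    total = sum(sp.values())
    return {k: v / total for k, v in sp.items()} if total else {}

def simm(f, g):
    """Overlap of two rescaled superpositions: sum of the smaller coefficients."""
    f, g = rescale(f), rescale(g)
    return sum(min(f[k], g[k]) for k in f.keys() & g.keys())

# Hypothetical stand-in for hash-64k: each page maps to fragment hashes.
toy_hashes = {
    "abc 1": Counter({"a1": 2, "b2": 1}),
    "abc 2": Counter({"a1": 1, "c3": 1}),
    "abc 3": Counter({"b2": 3}),
}

def toy_hash_op(ket):
    return toy_hashes[ket]

# |abc list> => |abc 1> + |abc 2> + |abc 3>
abc_list = Counter({"abc 1": 1, "abc 2": 1, "abc 3": 1})

# The operator applied to the sum equals the sum of the operator on each term.
summed = apply_linear(toy_hash_op, abc_list)

# Dividing by the number of pages gives the "true" average.
averaged = {k: v / 3 for k, v in summed.items()}

test_page = Counter({"a1": 1, "b2": 1})
print(simm(summed, test_page), simm(averaged, test_page))
```

Both printed similarities come out the same, which is why the sum/10 step can be skipped in the BKO above.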
Now, let's look at how big these superpositions are:
-- load the data:
sa: load improved-fragment-webpages.sw
sa: load create-average-website-fragments.sw

-- define a couple of operators:
sa: count-hash-64k |*> #=> count hash-64k |_self>
sa: count-hash-1M |*> #=> count hash-1M |_self>
sa: count-hash-4B |*> #=> count hash-4B |_self>
sa: delta-1M-64k |*> #=> arithmetic(count-hash-1M|_self>,|->,count-hash-64k|_self>)
sa: delta-4B-1M |*> #=> arithmetic(count-hash-4B|_self>,|->,count-hash-1M|_self>)

-- define the list of average websites:
|ave list> => |average abc> + |average adelaidenow> + |average slashdot> + |average smh> + |average wikipedia> + |average youtube>

-- Now, take a look at the tables:
sa: table[page,count-hash-64k,count-hash-1M,count-hash-4B] "" |ave list>
+---------------------+----------------+---------------+---------------+
| page                | count-hash-64k | count-hash-1M | count-hash-4B |
+---------------------+----------------+---------------+---------------+
| average abc         | 1476           | 1491          | 1492          |
| average adelaidenow | 10840          | 11798         | 11869         |
| average slashdot    | 5235           | 5441          | 5462          |
| average smh         | 9326           | 10044         | 10081         |
| average wikipedia   | 3108           | 3178          | 3182          |
| average youtube     | 6082           | 6370          | 6390          |
+---------------------+----------------+---------------+---------------+

sa: table[page,count-hash-64k,delta-1M-64k,delta-4B-1M] "" |ave list>
+---------------------+----------------+--------------+-------------+
| page                | count-hash-64k | delta-1M-64k | delta-4B-1M |
+---------------------+----------------+--------------+-------------+
| average abc         | 1476           | 15           | 1           |
| average adelaidenow | 10840          | 958          | 71          |
| average slashdot    | 5235           | 206          | 21          |
| average smh         | 9326           | 718          | 37          |
| average wikipedia   | 3108           | 70           | 4           |
| average youtube     | 6082           | 288          | 20          |
+---------------------+----------------+--------------+-------------+

Anyway, that is all exploratory I suppose. The take home message is, the 4B superpositions are probably the best to work with. Some similarity matrices in the next couple of posts.
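A footnote on why the counts in the tables creep up with hash size. The sketch below is only a guess at the mechanism, not the real fragment-hashing code: it assumes hash-64k, hash-1M and hash-4B keep 4, 5 and 8 hex digits of a fragment digest, and it uses MD5 and made-up fragment strings purely for illustration. Under that assumption, shorter hashes collide more often, so distinct fragments get merged into one ket and the count drops.

```python
import hashlib

def truncated_hash(fragment, hex_digits):
    """Keep only the first hex_digits of an MD5 digest (assumed scheme)."""
    return hashlib.md5(fragment.encode("utf-8")).hexdigest()[:hex_digits]

def distinct_count(fragments, hex_digits):
    """Number of distinct truncated hashes, ie the size of the superposition."""
    return len({truncated_hash(f, hex_digits) for f in fragments})

# Made-up fragments standing in for the pieces of a web page.
fragments = [f"fragment number {i}" for i in range(10000)]

# 4 hex digits = 16 bits (64k), 5 = 20 bits (1M), 8 = 32 bits (4B).
sizes = [("64k", 4), ("1M", 5), ("4B", 8)]
counts = {name: distinct_count(fragments, digits) for name, digits in sizes}

# Same idea as the delta-1M-64k and delta-4B-1M operators above.
delta_1M_64k = counts["1M"] - counts["64k"]
delta_4B_1M = counts["4B"] - counts["1M"]

print(counts, delta_1M_64k, delta_4B_1M)
```

Because each shorter hash here is a prefix of the longer one, the counts can only grow as digits are added, and the deltas shrink once collisions become rare. That matches the pattern in the tables, where going from 1M to 4B recovers only a handful of extra kets.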