Friday, 6 March 2015

website similarity matrices

This time, a look at how similar each website is to itself across the 11 days.

Here is the BKO:
-- create website similarity matrices:
-- list of abc websites, note we include |abc 11> and |average abc>
|full abc list> => |abc 1> + |abc 2> + |abc 3> + |abc 4> + |abc 5> + |abc 6> + |abc 7> + |abc 8> + |abc 9> + |abc 10> + |abc 11> + |average abc>

-- we want abc-hash to be distinct from standard hash, to reduce the matrix to abc only
abc-hash-4B |*> #=> hash-4B |_self>
|null> => map[abc-hash-4B] "" |full abc list>

-- we want the abc-simm to be distinct from the standard simm, to reduce the matrix to abc only
abc-simm |*> #=> 100 self-similar[abc-hash-4B] |_self>
|null> => map[abc-simm,abc-similarity] "" |full abc list>

-- now the rest of them:
|full adelaidenow list> => |adelaidenow 1> + |adelaidenow 2> + |adelaidenow 3> + |adelaidenow 4> + |adelaidenow 5> + |adelaidenow 6> + |adelaidenow 7> + |adelaidenow 8> + |adelaidenow 9> + |adelaidenow 10> + |adelaidenow 11> + |average adelaidenow>
adelaidenow-hash-4B |*> #=> hash-4B |_self>
|null> => map[adelaidenow-hash-4B] "" |full adelaidenow list>
adelaidenow-simm |*> #=> 100 self-similar[adelaidenow-hash-4B] |_self>
|null> => map[adelaidenow-simm,adelaidenow-similarity] "" |full adelaidenow list>

|full slashdot list> => |slashdot 1> + |slashdot 2> + |slashdot 3> + |slashdot 4> + |slashdot 5> + |slashdot 6> + |slashdot 7> + |slashdot 8> + |slashdot 9> + |slashdot 10> + |slashdot 11> + |average slashdot>
slashdot-hash-4B |*> #=> hash-4B |_self>
|null> => map[slashdot-hash-4B] "" |full slashdot list>
slashdot-simm |*> #=> 100 self-similar[slashdot-hash-4B] |_self>
|null> => map[slashdot-simm,slashdot-similarity] "" |full slashdot list>

|full smh list> => |smh 1> + |smh 2> + |smh 3> + |smh 4> + |smh 5> + |smh 6> + |smh 7> + |smh 8> + |smh 9> + |smh 10> + |smh 11> + |average smh>
smh-hash-4B |*> #=> hash-4B |_self>
|null> => map[smh-hash-4B] "" |full smh list>
smh-simm |*> #=> 100 self-similar[smh-hash-4B] |_self>
|null> => map[smh-simm,smh-similarity] "" |full smh list>

|full wikipedia list> => |wikipedia 1> + |wikipedia 2> + |wikipedia 3> + |wikipedia 4> + |wikipedia 5> + |wikipedia 6> + |wikipedia 7> + |wikipedia 8> + |wikipedia 9> + |wikipedia 10> + |wikipedia 11> + |average wikipedia>
wikipedia-hash-4B |*> #=> hash-4B |_self>
|null> => map[wikipedia-hash-4B] "" |full wikipedia list>
wikipedia-simm |*> #=> 100 self-similar[wikipedia-hash-4B] |_self>
|null> => map[wikipedia-simm,wikipedia-similarity] "" |full wikipedia list>

|full youtube list> => |youtube 1> + |youtube 2> + |youtube 3> + |youtube 4> + |youtube 5> + |youtube 6> + |youtube 7> + |youtube 8> + |youtube 9> + |youtube 10> + |youtube 11> + |average youtube>
youtube-hash-4B |*> #=> hash-4B |_self>
|null> => map[youtube-hash-4B] "" |full youtube list>
youtube-simm |*> #=> 100 self-similar[youtube-hash-4B] |_self>
|null> => map[youtube-simm,youtube-similarity] "" |full youtube list>
Here are the resulting matrices:
-- load the data:
sa: load improved-fragment-webpages.sw
sa: load create-average-website-fragments.sw
sa: load create-website-similarity-matrices.sw

-- show our resulting matrices:
sa: matrix[abc-similarity]
[ abc 1       ] = [  100.00  95.52   95.03   91.85   91.36   91.86   91.88   91.85   92.19   92.19   91.42   93.47   ] [ abc 1       ]
[ abc 2       ]   [  95.52   100.00  95.50   91.86   91.38   91.91   92.04   91.71   92.41   92.47   91.25   93.51   ] [ abc 2       ]
[ abc 3       ]   [  95.03   95.50   100.00  91.80   91.32   92.03   92.10   91.66   92.41   92.41   91.20   93.46   ] [ abc 3       ]
[ abc 4       ]   [  91.85   91.86   91.80   100.00  92.64   92.55   91.80   92.19   91.69   91.68   92.13   92.93   ] [ abc 4       ]
[ abc 5       ]   [  91.36   91.38   91.32   92.64   100.00  92.44   91.40   91.85   91.29   91.25   91.41   92.56   ] [ abc 5       ]
[ abc 6       ]   [  91.86   91.91   92.03   92.55   92.44   100.00  92.95   92.78   91.93   91.90   91.59   93.23   ] [ abc 6       ]
[ abc 7       ]   [  91.88   92.04   92.10   91.80   91.40   92.95   100.00  93.05   93.13   93.02   91.74   93.16   ] [ abc 7       ]
[ abc 8       ]   [  91.85   91.71   91.66   92.19   91.85   92.78   93.05   100.00  93.71   93.52   92.06   93.40   ] [ abc 8       ]
[ abc 9       ]   [  92.19   92.41   92.41   91.69   91.29   91.93   93.13   93.71   100.00  95.57   91.56   93.46   ] [ abc 9       ]
[ abc 10      ]   [  92.19   92.47   92.41   91.68   91.25   91.90   93.02   93.52   95.57   100.00  91.61   93.45   ] [ abc 10      ]
[ abc 11      ]   [  91.42   91.25   91.20   92.13   91.41   91.59   91.74   92.06   91.56   91.61   100.00  91.70   ] [ abc 11      ]
[ average abc ]   [  93.47   93.51   93.46   92.93   92.56   93.23   93.16   93.40   93.46   93.45   91.70   100.00  ] [ average abc ]

sa: matrix[adelaidenow-similarity]
[ adelaidenow 1       ] = [  100.00  86.13   83.22   78.11   77.27   76.23   75.86   75.61   75.79   76.08   76.13   80.64   ] [ adelaidenow 1       ]
[ adelaidenow 2       ]   [  86.13   100.00  87.38   81.46   77.52   77.60   77.07   76.90   76.26   77.26   76.53   82.36   ] [ adelaidenow 2       ]
[ adelaidenow 3       ]   [  83.22   87.38   100.00  83.60   78.24   77.29   76.68   76.56   76.45   77.41   76.22   82.18   ] [ adelaidenow 3       ]
[ adelaidenow 4       ]   [  78.11   81.46   83.60   100.00  83.39   78.50   77.28   77.24   75.75   76.77   76.67   81.82   ] [ adelaidenow 4       ]
[ adelaidenow 5       ]   [  77.27   77.52   78.24   83.39   100.00  81.76   77.29   76.84   76.31   76.39   77.43   81.12   ] [ adelaidenow 5       ]
[ adelaidenow 6       ]   [  76.23   77.60   77.29   78.50   81.76   100.00  79.82   78.85   76.34   77.09   76.98   81.08   ] [ adelaidenow 6       ]
[ adelaidenow 7       ]   [  75.86   77.07   76.68   77.28   77.29   79.82   100.00  84.92   78.38   77.13   76.90   81.08   ] [ adelaidenow 7       ]
[ adelaidenow 8       ]   [  75.61   76.90   76.56   77.24   76.84   78.85   84.92   100.00  82.06   79.31   77.56   81.59   ] [ adelaidenow 8       ]
[ adelaidenow 9       ]   [  75.79   76.26   76.45   75.75   76.31   76.34   78.38   82.06   100.00  85.68   78.15   80.79   ] [ adelaidenow 9       ]
[ adelaidenow 10      ]   [  76.08   77.26   77.41   76.77   76.39   77.09   77.13   79.31   85.68   100.00  82.97   80.86   ] [ adelaidenow 10      ]
[ adelaidenow 11      ]   [  76.13   76.53   76.22   76.67   77.43   76.98   76.90   77.56   78.15   82.97   100.00  78.11   ] [ adelaidenow 11      ]
[ average adelaidenow ]   [  80.64   82.36   82.18   81.82   81.12   81.08   81.08   81.59   80.79   80.86   78.11   100.00  ] [ average adelaidenow ]

sa: matrix[slashdot-similarity]
[ average slashdot ] = [  100.00  81.12   80.99   81.20   81.45   81.67   81.31   81.17   81.37   81.24   81.09   79.05   ] [ average slashdot ]
[ slashdot 1       ]   [  81.12   100.00  79.43   77.62   79.45   78.59   78.19   78.66   78.62   77.47   79.05   79.06   ] [ slashdot 1       ]
[ slashdot 2       ]   [  80.99   79.43   100.00  79.26   78.77   78.15   79.32   77.31   77.63   78.11   78.44   77.96   ] [ slashdot 2       ]
[ slashdot 3       ]   [  81.20   77.62   79.26   100.00  79.08   78.22   79.18   78.34   77.85   78.71   78.50   78.37   ] [ slashdot 3       ]
[ slashdot 4       ]   [  81.45   79.45   78.77   79.08   100.00  81.20   78.27   79.20   78.12   78.15   78.83   78.82   ] [ slashdot 4       ]
[ slashdot 5       ]   [  81.67   78.59   78.15   78.22   81.20   100.00  78.58   79.85   79.29   79.04   78.56   77.78   ] [ slashdot 5       ]
[ slashdot 6       ]   [  81.31   78.19   79.32   79.18   78.27   78.58   100.00  78.62   78.54   79.33   78.07   78.00   ] [ slashdot 6       ]
[ slashdot 7       ]   [  81.17   78.66   77.31   78.34   79.20   79.85   78.62   100.00  80.17   78.86   78.40   78.65   ] [ slashdot 7       ]
[ slashdot 8       ]   [  81.37   78.62   77.63   77.85   78.12   79.29   78.54   80.17   100.00  79.38   79.60   79.02   ] [ slashdot 8       ]
[ slashdot 9       ]   [  81.24   77.47   78.11   78.71   78.15   79.04   79.33   78.86   79.38   100.00  78.32   78.34   ] [ slashdot 9       ]
[ slashdot 10      ]   [  81.09   79.05   78.44   78.50   78.83   78.56   78.07   78.40   79.60   78.32   100.00  81.39   ] [ slashdot 10      ]
[ slashdot 11      ]   [  79.05   79.06   77.96   78.37   78.82   77.78   78.00   78.65   79.02   78.34   81.39   100.00  ] [ slashdot 11      ]

sa: matrix[smh-similarity]
[ average smh ] = [  100.00  87.76   88.16   87.95   87.62   86.82   87.31   87.12   87.72   87.54   87.61   85.55   ] [ average smh ]
[ smh 1       ]   [  87.76   100.00  89.44   87.69   86.23   85.33   85.30   84.97   85.34   84.70   85.04   84.75   ] [ smh 1       ]
[ smh 2       ]   [  88.16   89.44   100.00  89.80   86.25   85.27   85.58   85.36   85.54   85.61   85.38   85.40   ] [ smh 2       ]
[ smh 3       ]   [  87.95   87.69   89.80   100.00  86.81   85.04   85.31   85.40   85.21   85.19   85.47   84.93   ] [ smh 3       ]
[ smh 4       ]   [  87.62   86.23   86.25   86.81   100.00  86.63   86.12   85.06   85.64   85.15   85.34   85.00   ] [ smh 4       ]
[ smh 5       ]   [  86.82   85.33   85.27   85.04   86.63   100.00  85.24   84.36   85.20   84.55   85.00   84.99   ] [ smh 5       ]
[ smh 6       ]   [  87.31   85.30   85.58   85.31   86.12   85.24   100.00  86.19   86.59   85.03   85.26   85.81   ] [ smh 6       ]
[ smh 7       ]   [  87.12   84.97   85.36   85.40   85.06   84.36   86.19   100.00  86.36   85.48   85.38   85.48   ] [ smh 7       ]
[ smh 8       ]   [  87.72   85.34   85.54   85.21   85.64   85.20   86.59   86.36   100.00  87.76   86.94   85.89   ] [ smh 8       ]
[ smh 9       ]   [  87.54   84.70   85.61   85.19   85.15   84.55   85.03   85.48   87.76   100.00  90.65   85.16   ] [ smh 9       ]
[ smh 10      ]   [  87.61   85.04   85.38   85.47   85.34   85.00   85.26   85.38   86.94   90.65   100.00  85.75   ] [ smh 10      ]
[ smh 11      ]   [  85.55   84.75   85.40   84.93   85.00   84.99   85.81   85.48   85.89   85.16   85.75   100.00  ] [ smh 11      ]

sa: matrix[wikipedia-similarity]
[ average wikipedia ] = [  100.00  87.67   87.51   87.82   85.86   88.06   88.10   87.71   86.65   87.33   87.47   85.19   ] [ average wikipedia ]
[ wikipedia 1       ]   [  87.67   100.00  88.60   87.80   85.57   85.41   85.93   84.48   84.13   83.71   84.48   84.21   ] [ wikipedia 1       ]
[ wikipedia 2       ]   [  87.51   88.60   100.00  89.28   84.95   85.85   86.16   85.72   82.95   83.88   86.01   82.89   ] [ wikipedia 2       ]
[ wikipedia 3       ]   [  87.82   87.80   89.28   100.00  85.36   86.57   86.41   85.67   83.09   84.50   85.48   83.13   ] [ wikipedia 3       ]
[ wikipedia 4       ]   [  85.86   85.57   84.95   85.36   100.00  84.04   83.77   82.42   83.65   82.37   82.08   83.32   ] [ wikipedia 4       ]
[ wikipedia 5       ]   [  88.06   85.41   85.85   86.57   84.04   100.00  88.34   86.72   84.71   85.80   85.91   84.22   ] [ wikipedia 5       ]
[ wikipedia 6       ]   [  88.10   85.93   86.16   86.41   83.77   88.34   100.00  87.70   85.43   85.85   86.64   84.31   ] [ wikipedia 6       ]
[ wikipedia 7       ]   [  87.71   84.48   85.72   85.67   82.42   86.72   87.70   100.00  86.18   87.02   88.46   85.24   ] [ wikipedia 7       ]
[ wikipedia 8       ]   [  86.65   84.13   82.95   83.09   83.65   84.71   85.43   86.18   100.00  85.88   85.55   86.28   ] [ wikipedia 8       ]
[ wikipedia 9       ]   [  87.33   83.71   83.88   84.50   82.37   85.80   85.85   87.02   85.88   100.00  88.10   86.46   ] [ wikipedia 9       ]
[ wikipedia 10      ]   [  87.47   84.48   86.01   85.48   82.08   85.91   86.64   88.46   85.55   88.10   100.00  87.17   ] [ wikipedia 10      ]
[ wikipedia 11      ]   [  85.19   84.21   82.89   83.13   83.32   84.22   84.31   85.24   86.28   86.46   87.17   100.00  ] [ wikipedia 11      ]

sa: matrix[youtube-similarity]
[ average youtube ] = [  100.00  85.21   84.63   85.38   85.98   86.04   85.33   84.43   84.90   81.10   84.43   82.12   ] [ average youtube ]
[ youtube 1       ]   [  85.21   100.00  83.96   84.76   85.73   85.18   85.04   80.41   81.58   77.43   82.51   79.80   ] [ youtube 1       ]
[ youtube 2       ]   [  84.63   83.96   100.00  87.30   85.63   84.66   81.56   81.12   79.96   79.08   78.90   79.32   ] [ youtube 2       ]
[ youtube 3       ]   [  85.38   84.76   87.30   100.00  86.54   87.12   83.85   80.31   80.86   78.14   80.15   78.80   ] [ youtube 3       ]
[ youtube 4       ]   [  85.98   85.73   85.63   86.54   100.00  89.46   85.78   81.19   82.22   76.97   81.13   79.50   ] [ youtube 4       ]
[ youtube 5       ]   [  86.04   85.18   84.66   87.12   89.46   100.00  86.87   81.77   81.96   77.08   81.08   79.79   ] [ youtube 5       ]
[ youtube 6       ]   [  85.33   85.04   81.56   83.85   85.78   86.87   100.00  82.21   82.71   78.36   81.86   80.81   ] [ youtube 6       ]
[ youtube 7       ]   [  84.43   80.41   81.12   80.31   81.19   81.77   82.21   100.00  85.97   82.26   84.98   85.98   ] [ youtube 7       ]
[ youtube 8       ]   [  84.90   81.58   79.96   80.86   82.22   81.96   82.71   85.97   100.00  81.99   87.14   84.61   ] [ youtube 8       ]
[ youtube 9       ]   [  81.10   77.43   79.08   78.14   76.97   77.08   78.36   82.26   81.99   100.00  82.12   82.82   ] [ youtube 9       ]
[ youtube 10      ]   [  84.43   82.51   78.90   80.15   81.13   81.08   81.86   84.98   87.14   82.12   100.00  86.58   ] [ youtube 10      ]
[ youtube 11      ]   [  82.12   79.80   79.32   78.80   79.50   79.79   80.81   85.98   84.61   82.82   86.58   100.00  ] [ youtube 11      ]
OK. That is kind of cool. Though the matrices will presumably line-wrap if your screen is too small. The take-home message: every webpage is more than 75% similar to itself over the 11-day period. Which I guess means we don't even need to average over 10 days! Presumably the average will give better results though.
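For anyone who wants to play with the idea outside the console, here is a minimal sketch of building such a matrix in plain Python. It assumes a rescaled overlap measure (rescale each superposition to sum to 1, then add up the element-wise minimums), which may differ in detail from what self-similar[op] actually does. The names rescaled_simm and similarity_matrix and the toy kets are made up for illustration.

```python
from collections import Counter

def rescaled_simm(f, g):
    """Overlap of two non-negative superpositions after rescaling each to sum 1.

    Assumed measure for illustration, not necessarily the project's own.
    """
    total_f, total_g = sum(f.values()), sum(g.values())
    if total_f == 0 or total_g == 0:
        return 0.0
    return sum(min(f[k] / total_f, g[k] / total_g) for k in f.keys() & g.keys())

def similarity_matrix(pages):
    """Return {(name_a, name_b): percent similarity} for every pair of pages."""
    return {(a, b): 100 * rescaled_simm(pages[a], pages[b])
            for a in pages for b in pages}

# Toy superpositions (made-up hash kets, not real webpage data).
pages = {
    "day 1": Counter({"a1b2": 2, "c3d4": 1, "e5f6": 1}),
    "day 2": Counter({"a1b2": 2, "c3d4": 1, "0789": 1}),
}
matrix = similarity_matrix(pages)
for (a, b), value in matrix.items():
    print(f"{a} vs {b}: {value:.2f}")
```

The diagonal comes out at 100 and the matrix is symmetric, which matches the shape of the tables above.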

creating average webpage superpositions

OK. I'm working towards something here, so bear with me. Recall I had six websites (abc, adelaidenow, slashdot, smh, wikipedia, youtube), and I downloaded a copy once a day for 11 days. Well, in this post we will create average superpositions of these by adding up the first 10. The 11th is the test case to run our pattern recognition against.

So, here is the BKO for this:
(note I didn't type all this. Cut and paste and search/replace helped!)
-- create average webpages:
-- we deliberately leave out |website 11> from our hashes, that's the test case
|abc list> => |abc 1> + |abc 2> + |abc 3> + |abc 4> + |abc 5> + |abc 6> + |abc 7> + |abc 8> + |abc 9> + |abc 10>
hash-64k |average abc> => hash-64k "" |abc list>
hash-1M |average abc> => hash-1M "" |abc list>
hash-4B |average abc> => hash-4B "" |abc list>

|adelaidenow list> => |adelaidenow 1> + |adelaidenow 2> + |adelaidenow 3> + |adelaidenow 4> + |adelaidenow 5> + |adelaidenow 6> + |adelaidenow 7> + |adelaidenow 8> + |adelaidenow 9> + |adelaidenow 10>
hash-64k |average adelaidenow> => hash-64k "" |adelaidenow list>
hash-1M |average adelaidenow> => hash-1M "" |adelaidenow list>
hash-4B |average adelaidenow> => hash-4B "" |adelaidenow list>

|slashdot list> => |slashdot 1> + |slashdot 2> + |slashdot 3> + |slashdot 4> + |slashdot 5> + |slashdot 6> + |slashdot 7> + |slashdot 8> + |slashdot 9> + |slashdot 10>
hash-64k |average slashdot> => hash-64k "" |slashdot list>
hash-1M |average slashdot> => hash-1M "" |slashdot list>
hash-4B |average slashdot> => hash-4B "" |slashdot list>

|smh list> => |smh 1> + |smh 2> + |smh 3> + |smh 4> + |smh 5> + |smh 6> + |smh 7> + |smh 8> + |smh 9> + |smh 10>
hash-64k |average smh> => hash-64k "" |smh list>
hash-1M |average smh> => hash-1M "" |smh list>
hash-4B |average smh> => hash-4B "" |smh list>

|wikipedia list> => |wikipedia 1> + |wikipedia 2> + |wikipedia 3> + |wikipedia 4> + |wikipedia 5> + |wikipedia 6> + |wikipedia 7> + |wikipedia 8> + |wikipedia 9> + |wikipedia 10>
hash-64k |average wikipedia> => hash-64k "" |wikipedia list>
hash-1M |average wikipedia> => hash-1M "" |wikipedia list>
hash-4B |average wikipedia> => hash-4B "" |wikipedia list>

|youtube list> => |youtube 1> + |youtube 2> + |youtube 3> + |youtube 4> + |youtube 5> + |youtube 6> + |youtube 7> + |youtube 8> + |youtube 9> + |youtube 10>
hash-64k |average youtube> => hash-64k "" |youtube list>
hash-1M |average youtube> => hash-1M "" |youtube list>
hash-4B |average youtube> => hash-4B "" |youtube list>
Now a couple of notes about generating our average hashes.
1) we make use of linearity of operators.
eg:
hash-64k "" |abc list>
expands to:
hash-64k (|abc 1> + |abc 2> + |abc 3> + ...)
which expands to:
hash-64k |abc 1> + hash-64k |abc 2> + hash-64k |abc 3> + ...
2) we don't need to normalize our averages. ie, we don't need sum/10. This is because our similarity metric auto-rescales the incoming superpositions, so the shape of a superposition usually matters more than its amplitude.
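Both notes can be checked with a small plain-Python sketch. It assumes superpositions behave like Counters of hash kets, and that the similarity metric rescales each input to sum to 1 before taking the overlap. The function names and the toy data are hypothetical, not part of the project code.

```python
from collections import Counter

def rescale(sp):
    """Rescale a superposition so its coefficients sum to 1."""
    total = sum(sp.values())
    return {k: v / total for k, v in sp.items()}

def rescaled_simm(f, g):
    """Assumed similarity: overlap of the two rescaled superpositions."""
    f, g = rescale(f), rescale(g)
    return sum(min(f[k], g[k]) for k in f.keys() & g.keys())

# Toy per-day hash superpositions (made-up kets).
day_hashes = [
    Counter({"aaaa": 1, "bbbb": 2}),
    Counter({"aaaa": 1, "cccc": 1}),
    Counter({"aaaa": 1, "bbbb": 1, "dddd": 1}),
]

# Note 1: by linearity, the hash of the summed list is the sum of the hashes.
summed = sum(day_hashes, Counter())

# Note 2: dividing by the number of days changes amplitude, not shape.
averaged = Counter({k: v / len(day_hashes) for k, v in summed.items()})

test_page = Counter({"aaaa": 1, "bbbb": 1, "eeee": 1})
simm_summed = rescaled_simm(summed, test_page)
simm_averaged = rescaled_simm(averaged, test_page)
print(simm_summed, simm_averaged)
```

The two printed values agree up to floating-point rounding, which is why the sum/10 step can be skipped under this kind of metric.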

Now, let's look at how big these superpositions are:
-- load the data:
sa: load improved-fragment-webpages.sw
sa: load create-average-website-fragments.sw

-- define a couple of operators:
sa: count-hash-64k |*> #=> count hash-64k |_self>
sa: count-hash-1M |*> #=> count hash-1M |_self>
sa: count-hash-4B |*> #=> count hash-4B |_self>
sa: delta-1M-64k |*> #=> arithmetic(count-hash-1M|_self>,|->,count-hash-64k|_self>)
sa: delta-4B-1M |*> #=> arithmetic(count-hash-4B|_self>,|->,count-hash-1M|_self>)

-- define the list of average websites:
|ave list> => |average abc> + |average adelaidenow> + |average slashdot> + |average smh> + |average wikipedia> + |average youtube>

-- Now, take a look at the tables:
sa: table[page,count-hash-64k,count-hash-1M,count-hash-4B] "" |ave list>
+---------------------+----------------+---------------+---------------+
| page                | count-hash-64k | count-hash-1M | count-hash-4B |
+---------------------+----------------+---------------+---------------+
| average abc         | 1476           | 1491          | 1492          |
| average adelaidenow | 10840          | 11798         | 11869         |
| average slashdot    | 5235           | 5441          | 5462          |
| average smh         | 9326           | 10044         | 10081         |
| average wikipedia   | 3108           | 3178          | 3182          |
| average youtube     | 6082           | 6370          | 6390          |
+---------------------+----------------+---------------+---------------+

sa: table[page,count-hash-64k,delta-1M-64k,delta-4B-1M] "" |ave list>
+---------------------+----------------+--------------+-------------+
| page                | count-hash-64k | delta-1M-64k | delta-4B-1M |
+---------------------+----------------+--------------+-------------+
| average abc         | 1476           | 15           | 1           |
| average adelaidenow | 10840          | 958          | 71          |
| average slashdot    | 5235           | 206          | 21          |
| average smh         | 9326           | 718          | 37          |
| average wikipedia   | 3108           | 70           | 4           |
| average youtube     | 6082           | 288          | 20          |
+---------------------+----------------+--------------+-------------+
Anyway, that is all exploratory I suppose. The take home message is, the 4B superpositions are probably the best to work with. Some similarity matrices in the next couple of posts.

making the ket count differences clearer

Last post we looked at ket count versus the number of digits of the hash we kept (4, 5, or 8). This time, a table making that super clear by showing the count deltas.
sa: load improved-fragment-webpages.sw
sa: delta-1M-64k |*> #=> arithmetic(count-1-hash-1M|_self>,|->,count-1-hash-64k|_self>)
sa: delta-4B-1M |*> #=> arithmetic(count-1-hash-4B|_self>,|->,count-1-hash-1M|_self>)
sa: table[page,count-1-hash-64k,delta-1M-64k,delta-4B-1M] rel-kets[hash-64k] |>
+----------------+------------------+--------------+-------------+
| page           | count-1-hash-64k | delta-1M-64k | delta-4B-1M |
+----------------+------------------+--------------+-------------+
| abc 1          | 650              | 3            | 0           |
| abc 2          | 652              | 3            | 0           |
| abc 3          | 652              | 4            | 0           |
| abc 4          | 661              | 3            | 0           |
| abc 5          | 660              | 5            | 0           |
| abc 6          | 660              | 3            | 0           |
| abc 7          | 651              | 3            | 0           |
| abc 8          | 659              | 4            | 0           |
| abc 9          | 650              | 4            | 0           |
| abc 10         | 655              | 3            | 0           |
| abc 11         | 660              | 4            | 0           |
| adelaidenow 1  | 2635             | 63           | 6           |
| adelaidenow 2  | 2718             | 57           | 5           |
| adelaidenow 3  | 2697             | 54           | 6           |
| adelaidenow 4  | 2701             | 63           | 6           |
| adelaidenow 5  | 2674             | 71           | 5           |
| adelaidenow 6  | 2675             | 64           | 5           |
| adelaidenow 7  | 2716             | 51           | 5           |
| adelaidenow 8  | 2758             | 63           | 6           |
| adelaidenow 9  | 2704             | 70           | 4           |
| adelaidenow 10 | 2693             | 62           | 5           |
| adelaidenow 11 | 2748             | 59           | 7           |
| slashdot 1     | 1153             | 6            | 1           |
| slashdot 2     | 1157             | 7            | 1           |
| slashdot 3     | 1164             | 4            | 2           |
| slashdot 4     | 1133             | 4            | 1           |
| slashdot 5     | 1163             | 6            | 0           |
| slashdot 6     | 1182             | 7            | 1           |
| slashdot 7     | 1149             | 12           | 1           |
| slashdot 8     | 1155             | 15           | 0           |
| slashdot 9     | 1160             | 7            | 0           |
| slashdot 10    | 1131             | 14           | 0           |
| slashdot 11    | 1137             | 9            | 0           |
| smh 1          | 2584             | 52           | 0           |
| smh 2          | 2595             | 50           | 6           |
| smh 3          | 2613             | 48           | 0           |
| smh 4          | 2572             | 60           | 3           |
| smh 5          | 2578             | 47           | 3           |
| smh 6          | 2605             | 49           | 2           |
| smh 7          | 2610             | 65           | 1           |
| smh 8          | 2592             | 40           | 0           |
| smh 9          | 2569             | 53           | 0           |
| smh 10         | 2560             | 46           | 1           |
| smh 11         | 2629             | 57           | 4           |
| wikipedia 1    | 1078             | 4            | 0           |
| wikipedia 2    | 1037             | 6            | 0           |
| wikipedia 3    | 1070             | 4            | 1           |
| wikipedia 4    | 1144             | 7            | 1           |
| wikipedia 5    | 1080             | 4            | 1           |
| wikipedia 6    | 1068             | 8            | 1           |
| wikipedia 7    | 1054             | 5            | 1           |
| wikipedia 8    | 1112             | 7            | 0           |
| wikipedia 9    | 1101             | 10           | 0           |
| wikipedia 10   | 1046             | 9            | 0           |
| wikipedia 11   | 1116             | 9            | 0           |
| youtube 1      | 1363             | 16           | 0           |
| youtube 2      | 1289             | 12           | 0           |
| youtube 3      | 1284             | 15           | 0           |
| youtube 4      | 1374             | 14           | 1           |
| youtube 5      | 1369             | 18           | 2           |
| youtube 6      | 1363             | 20           | 1           |
| youtube 7      | 1382             | 14           | 4           |
| youtube 8      | 1383             | 13           | 2           |
| youtube 9      | 1104             | 10           | 1           |
| youtube 10     | 1376             | 15           | 1           |
| youtube 11     | 1378             | 9            | 1           |
+----------------+------------------+--------------+-------------+
That's it for this post. Some processing of the webpage superpositions in the next couple.

Thursday, 5 March 2015

how big are our webpage superpositions?

Last post I gave the complete code for mapping webpages to well defined superpositions. Now, just a brief look at how big those superpositions are.

If we just keep 4 digits of the hash, there are 64k possible kets.
5 digits, then roughly 1 million.
8 digits, then roughly 4 billion possible kets.
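Those sizes follow from the hash being kept as hex digits: each digit has 16 possible values, so n digits give 16^n possible kets. A quick check in Python (the dictionary name possible_kets is just for illustration):

```python
import hashlib

# Each hex digit has 16 values, so n kept digits give 16**n possible kets.
possible_kets = {label: 16 ** digits
                 for digits, label in [(4, "64k"), (5, "1M"), (8, "4B")]}
for label, count in possible_kets.items():
    print(label, count)

# Keeping the last n hex digits of a sha1 digest, as the fragment code does.
digest = hashlib.sha1("html".encode("utf-8")).hexdigest()
print(digest[-4:], digest[-5:], digest[-8:])
```

So "64k" is exactly 65,536, "1M" is 1,048,576 and "4B" is 4,294,967,296.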
Here are the actual numbers:
sa: load improved-fragment-webpages.sw
sa: table[page,count-1-hash-64k,count-1-hash-1M,count-1-hash-4B] rel-kets[hash-64k] |>
+----------------+------------------+-----------------+-----------------+
| page           | count-1-hash-64k | count-1-hash-1M | count-1-hash-4B |
+----------------+------------------+-----------------+-----------------+
| abc 1          | 650              | 653             | 653             |
| abc 2          | 652              | 655             | 655             |
| abc 3          | 652              | 656             | 656             |
| abc 4          | 661              | 664             | 664             |
| abc 5          | 660              | 665             | 665             |
| abc 6          | 660              | 663             | 663             |
| abc 7          | 651              | 654             | 654             |
| abc 8          | 659              | 663             | 663             |
| abc 9          | 650              | 654             | 654             |
| abc 10         | 655              | 658             | 658             |
| abc 11         | 660              | 664             | 664             |
| adelaidenow 1  | 2635             | 2698            | 2704            |
| adelaidenow 2  | 2718             | 2775            | 2780            |
| adelaidenow 3  | 2697             | 2751            | 2757            |
| adelaidenow 4  | 2701             | 2764            | 2770            |
| adelaidenow 5  | 2674             | 2745            | 2750            |
| adelaidenow 6  | 2675             | 2739            | 2744            |
| adelaidenow 7  | 2716             | 2767            | 2772            |
| adelaidenow 8  | 2758             | 2821            | 2827            |
| adelaidenow 9  | 2704             | 2774            | 2778            |
| adelaidenow 10 | 2693             | 2755            | 2760            |
| adelaidenow 11 | 2748             | 2807            | 2814            |
| slashdot 1     | 1153             | 1159            | 1160            |
| slashdot 2     | 1157             | 1164            | 1165            |
| slashdot 3     | 1164             | 1168            | 1170            |
| slashdot 4     | 1133             | 1137            | 1138            |
| slashdot 5     | 1163             | 1169            | 1169            |
| slashdot 6     | 1182             | 1189            | 1190            |
| slashdot 7     | 1149             | 1161            | 1162            |
| slashdot 8     | 1155             | 1170            | 1170            |
| slashdot 9     | 1160             | 1167            | 1167            |
| slashdot 10    | 1131             | 1145            | 1145            |
| slashdot 11    | 1137             | 1146            | 1146            |
| smh 1          | 2584             | 2636            | 2636            |
| smh 2          | 2595             | 2645            | 2651            |
| smh 3          | 2613             | 2661            | 2661            |
| smh 4          | 2572             | 2632            | 2635            |
| smh 5          | 2578             | 2625            | 2628            |
| smh 6          | 2605             | 2654            | 2656            |
| smh 7          | 2610             | 2675            | 2676            |
| smh 8          | 2592             | 2632            | 2632            |
| smh 9          | 2569             | 2622            | 2622            |
| smh 10         | 2560             | 2606            | 2607            |
| smh 11         | 2629             | 2686            | 2690            |
| wikipedia 1    | 1078             | 1082            | 1082            |
| wikipedia 2    | 1037             | 1043            | 1043            |
| wikipedia 3    | 1070             | 1074            | 1075            |
| wikipedia 4    | 1144             | 1151            | 1152            |
| wikipedia 5    | 1080             | 1084            | 1085            |
| wikipedia 6    | 1068             | 1076            | 1077            |
| wikipedia 7    | 1054             | 1059            | 1060            |
| wikipedia 8    | 1112             | 1119            | 1119            |
| wikipedia 9    | 1101             | 1111            | 1111            |
| wikipedia 10   | 1046             | 1055            | 1055            |
| wikipedia 11   | 1116             | 1125            | 1125            |
| youtube 1      | 1363             | 1379            | 1379            |
| youtube 2      | 1289             | 1301            | 1301            |
| youtube 3      | 1284             | 1299            | 1299            |
| youtube 4      | 1374             | 1388            | 1389            |
| youtube 5      | 1369             | 1387            | 1389            |
| youtube 6      | 1363             | 1383            | 1384            |
| youtube 7      | 1382             | 1396            | 1400            |
| youtube 8      | 1383             | 1396            | 1398            |
| youtube 9      | 1104             | 1114            | 1115            |
| youtube 10     | 1376             | 1391            | 1392            |
| youtube 11     | 1378             | 1387            | 1388            |
+----------------+------------------+-----------------+-----------------+
So we observe the actual ket count for the 64k case is only just smaller than the 4B case. So our 4B superpositions are really very sparse. A big win for the superposition representation over that of a vector/array representation. I guess it also means 64k superpositions are sufficient to represent webpages. Though it does leave open the question of how many webpages we can store using 64k superpositions before distinct webpages accidentally look similar.
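A rough back-of-envelope estimate helps with both observations. The sketch below assumes the hash digits behave like uniform random values and treats the 4B count as the number of distinct fragments on a page. Those are assumptions for illustration, and the function names are hypothetical.

```python
def expected_distinct(fragments, space):
    """Expected number of distinct hash values when `fragments` distinct
    fragments are hashed uniformly at random into `space` possible values."""
    return space * (1 - (1 - 1 / space) ** fragments)

def expected_shared(n_a, n_b, space):
    """Expected number of kets shared purely by chance between two unrelated
    pages holding n_a and n_b distinct kets out of `space` possible values."""
    return n_a * n_b / space

fragments = 2704  # the 4B ket count for adelaidenow 1 in the table above

# Collisions within one page: why the 64k count sits a little below the 4B count.
print(expected_distinct(fragments, 16 ** 4))
print(expected_distinct(fragments, 16 ** 5))

# Accidental overlap between two unrelated pages of that size.
shared_64k = expected_shared(fragments, fragments, 16 ** 4)
shared_4b = expected_shared(fragments, fragments, 16 ** 8)
print(shared_64k, shared_4b)
```

Under these assumptions the 64k estimate lands around 2650, in the same ballpark as the observed 2635. Two unrelated pages of that size would share on the order of 110 kets by chance at 64k, versus essentially none at 4B. So the question of accidental similarity is a real one for the larger pages if we only keep 4 digits.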

That's it for this post. More on this topic in future posts.

mapping webpages to well defined superpositions

This is an interesting one, not sure how to best describe it. I guess the general idea is to map webpages to superpositions, and then we can do "pattern recognition" on those superpositions using similar[op]. The code is quite general, and should work for things like XML, or program code, if you choose the right substrings to fragment on. The text of, say, ebooks requires more processing: you almost certainly need to look at word 3-grams, since if you only look at single-word frequencies, most English text looks similar.

The outline of the algo:
- you select substrings to fragment your document on.
(In html and xml "<" and ">" should be sufficient. For C perhaps "{", "}", ";" and so on.)
- split the document into fragments
- take a hash of the fragments (and choose how many digits of the hash to keep, eg 8)
- add up kets of those hashes, and that becomes your superposition.

A couple of advantages of this scheme:
- local changes in the document do not have non-local effects. Fragments further down the page are completely unaffected by modifications elsewhere. This is a powerful feature!
- webpages that change their content daily largely have a similar structure from day to day. On the order of 90% or so similarity. We will see the actual numbers in a future post. I guess the content changes daily, but the underlying html that sites use for their pages is largely invariant from day to day.
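
To make the locality point concrete, here is a toy sketch with two made-up "days" of the same page that differ in one fragment. The overlap function is a simple stand-in similarity measure for this example (shared count divided by the larger total); it is an assumption, not necessarily what similar[op] computes.

```python
import hashlib
from collections import Counter

def fragment_counts(text, digits=8):
    """Fragment on "<" and ">" and count the truncated sha1 hash of each piece."""
    pieces = text.replace(">", "<").split("<")
    return Counter(
        hashlib.sha1(p.strip().encode("utf-8")).hexdigest()[-digits:]
        for p in pieces
        if p.strip()
    )

def overlap(a, b):
    """Stand-in similarity: shared fragment count over the larger total."""
    shared = sum((a & b).values())  # Counter & keeps the minimum count per key
    return shared / max(sum(a.values()), sum(b.values()))

day1 = "<h1>News</h1><p>story one</p><p>footer</p>"
day2 = "<h1>News</h1><p>story two</p><p>footer</p>"

a, b = fragment_counts(day1), fragment_counts(day2)
print("changed hashes:", len(set(a) ^ set(b)))
print("overlap:", round(overlap(a, b), 3))
```

Only the hash of the edited fragment changes. Everything before and after it hashes exactly as it did the day before, so the two toy pages still overlap in 8 of their 9 fragments.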

And now the python:
import sys
import hashlib
from the_semantic_db_code import *
from the_semantic_db_functions import *
from the_semantic_db_processor import *

C = context_list("fragment webpages")
fragment_strings = ["<",">"]

def fragment_string(s,fragment_strings):
  # split s on each fragment string in turn, then drop empty/whitespace-only pieces
  pieces = [s]
  for frag in fragment_strings:
    split_pieces = []
    for piece in pieces:
      split_pieces += piece.split(frag)
    pieces = split_pieces
  return [piece.strip() for piece in pieces if len(piece.strip()) > 0]

def create_fragment_hashes(filename,fragment_strings,size):
  result = fast_superposition()
  with open(filename,'r') as f:
    text = f.read()
  for fragment in fragment_string(text,fragment_strings):
    # keep only the last `size` hex digits of the sha1 digest
    fragment_hash = hashlib.sha1(fragment.encode('utf-8')).hexdigest()[-size:]
    result += ket(fragment_hash)
  return result.superposition().coeff_sort()

size_list = [[4,"64k"],[5,"1M"],[8,"4B"]]

def learn_webpage_hashes(C,webpage,n):
  for k in range(n):
    web_file = "webpages-v2/clean-" + webpage + "-" + str(k+1) + ".html"
    print("web_file:",web_file)
    for pair in size_list:
      size, string_size = pair
      print("size:",size)
      print("string size:",string_size)
      ket_name = webpage + " " + str(k+1)
      print("ket_name:",ket_name)
      hash_name = "hash-" + string_size

      # now let's learn the superpositions:
      r = create_fragment_hashes(web_file,fragment_strings,size)
      C.learn(hash_name,ket_name,r)

      # learn the drop-n hashes:
      # an experiment really, trying to work out what is best.
      # Probably r by itself, but we need to check.
      for n_min in range(2,11):
        C.learn("drop-" + str(n_min) + "-" + hash_name,ket_name,r.drop_below(n_min))

      # learn how many of each:
      C.learn("count-1-" + hash_name,ket_name,r.number_count())
      for n_min in range(2,11):
        C.learn("count-" + str(n_min) + "-" + hash_name,ket_name,r.drop_below(n_min).number_count())

# learn them all!
sites = ["abc","adelaidenow","slashdot","smh","wikipedia","youtube"]
number = 11

for site in sites:
  learn_webpage_hashes(C,site,number)

name = "sw-examples/improved-fragment-webpages.sw"
save_sw(C,name)
I downloaded one page a day for 11 days from each of the listed sites. Ran my code on that, and here is the result. BTW, for 66 webpages the code took roughly 75 minutes.

That's it for this post. More on this in the next few posts.

Update: the original motivation for this algo was gel electrophoresis.

Wednesday, 4 March 2015

a table of integers and their factors

Just a brief one again. The main take-away point is to demonstrate how we can compose new operators out of existing operators.
-- define some operators:
sa: factors |*> #=> factor |_self>
sa: count-factors |*> #=> count-sum factor |_self>
sa: prime |*> #=> if(is-equal[1] count-factors |_self>,|prime>,count-factors|_self>)

-- now show the table:
sa: table[number,factors,prime] range(|number: 1>,|number: 250>)
+--------+-------------+-------+
| number | factors     | prime |
+--------+-------------+-------+
| 1      |             | 0     |
| 2      | 2           | prime |
| 3      | 3           | prime |
| 4      | 2 2         | 2     |
| 5      | 5           | prime |
| 6      | 2, 3        | 2     |
| 7      | 7           | prime |
| 8      | 3 2         | 3     |
| 9      | 2 3         | 2     |
| 10     | 2, 5        | 2     |
| 11     | 11          | prime |
| 12     | 2 2, 3      | 3     |
| 13     | 13          | prime |
| 14     | 2, 7        | 2     |
| 15     | 3, 5        | 2     |
| 16     | 4 2         | 4     |
| 17     | 17          | prime |
| 18     | 2, 2 3      | 3     |
| 19     | 19          | prime |
| 20     | 2 2, 5      | 3     |
| 21     | 3, 7        | 2     |
| 22     | 2, 11       | 2     |
| 23     | 23          | prime |
| 24     | 3 2, 3      | 4     |
| 25     | 2 5         | 2     |
| 26     | 2, 13       | 2     |
| 27     | 3 3         | 3     |
| 28     | 2 2, 7      | 3     |
| 29     | 29          | prime |
| 30     | 2, 3, 5     | 3     |
| 31     | 31          | prime |
| 32     | 5 2         | 5     |
| 33     | 3, 11       | 2     |
| 34     | 2, 17       | 2     |
| 35     | 5, 7        | 2     |
| 36     | 2 2, 2 3    | 4     |
| 37     | 37          | prime |
| 38     | 2, 19       | 2     |
| 39     | 3, 13       | 2     |
| 40     | 3 2, 5      | 4     |
| 41     | 41          | prime |
| 42     | 2, 3, 7     | 3     |
| 43     | 43          | prime |
| 44     | 2 2, 11     | 3     |
| 45     | 2 3, 5      | 3     |
| 46     | 2, 23       | 2     |
| 47     | 47          | prime |
| 48     | 4 2, 3      | 5     |
| 49     | 2 7         | 2     |
| 50     | 2, 2 5      | 3     |
| 51     | 3, 17       | 2     |
| 52     | 2 2, 13     | 3     |
| 53     | 53          | prime |
| 54     | 2, 3 3      | 4     |
| 55     | 5, 11       | 2     |
| 56     | 3 2, 7      | 4     |
| 57     | 3, 19       | 2     |
| 58     | 2, 29       | 2     |
| 59     | 59          | prime |
| 60     | 2 2, 3, 5   | 4     |
| 61     | 61          | prime |
| 62     | 2, 31       | 2     |
| 63     | 2 3, 7      | 3     |
| 64     | 6 2         | 6     |
| 65     | 5, 13       | 2     |
| 66     | 2, 3, 11    | 3     |
| 67     | 67          | prime |
| 68     | 2 2, 17     | 3     |
| 69     | 3, 23       | 2     |
| 70     | 2, 5, 7     | 3     |
| 71     | 71          | prime |
| 72     | 3 2, 2 3    | 5     |
| 73     | 73          | prime |
| 74     | 2, 37       | 2     |
| 75     | 3, 2 5      | 3     |
| 76     | 2 2, 19     | 3     |
| 77     | 7, 11       | 2     |
| 78     | 2, 3, 13    | 3     |
| 79     | 79          | prime |
| 80     | 4 2, 5      | 5     |
| 81     | 4 3         | 4     |
| 82     | 2, 41       | 2     |
| 83     | 83          | prime |
| 84     | 2 2, 3, 7   | 4     |
| 85     | 5, 17       | 2     |
| 86     | 2, 43       | 2     |
| 87     | 3, 29       | 2     |
| 88     | 3 2, 11     | 4     |
| 89     | 89          | prime |
| 90     | 2, 2 3, 5   | 4     |
| 91     | 7, 13       | 2     |
| 92     | 2 2, 23     | 3     |
| 93     | 3, 31       | 2     |
| 94     | 2, 47       | 2     |
| 95     | 5, 19       | 2     |
| 96     | 5 2, 3      | 6     |
| 97     | 97          | prime |
| 98     | 2, 2 7      | 3     |
| 99     | 2 3, 11     | 3     |
| 100    | 2 2, 2 5    | 4     |
| 101    | 101         | prime |
| 102    | 2, 3, 17    | 3     |
| 103    | 103         | prime |
| 104    | 3 2, 13     | 4     |
| 105    | 3, 5, 7     | 3     |
| 106    | 2, 53       | 2     |
| 107    | 107         | prime |
| 108    | 2 2, 3 3    | 5     |
| 109    | 109         | prime |
| 110    | 2, 5, 11    | 3     |
| 111    | 3, 37       | 2     |
| 112    | 4 2, 7      | 5     |
| 113    | 113         | prime |
| 114    | 2, 3, 19    | 3     |
| 115    | 5, 23       | 2     |
| 116    | 2 2, 29     | 3     |
| 117    | 2 3, 13     | 3     |
| 118    | 2, 59       | 2     |
| 119    | 7, 17       | 2     |
| 120    | 3 2, 3, 5   | 5     |
| 121    | 2 11        | 2     |
| 122    | 2, 61       | 2     |
| 123    | 3, 41       | 2     |
| 124    | 2 2, 31     | 3     |
| 125    | 3 5         | 3     |
| 126    | 2, 2 3, 7   | 4     |
| 127    | 127         | prime |
| 128    | 7 2         | 7     |
| 129    | 3, 43       | 2     |
| 130    | 2, 5, 13    | 3     |
| 131    | 131         | prime |
| 132    | 2 2, 3, 11  | 4     |
| 133    | 7, 19       | 2     |
| 134    | 2, 67       | 2     |
| 135    | 3 3, 5      | 4     |
| 136    | 3 2, 17     | 4     |
| 137    | 137         | prime |
| 138    | 2, 3, 23    | 3     |
| 139    | 139         | prime |
| 140    | 2 2, 5, 7   | 4     |
| 141    | 3, 47       | 2     |
| 142    | 2, 71       | 2     |
| 143    | 11, 13      | 2     |
| 144    | 4 2, 2 3    | 6     |
| 145    | 5, 29       | 2     |
| 146    | 2, 73       | 2     |
| 147    | 3, 2 7      | 3     |
| 148    | 2 2, 37     | 3     |
| 149    | 149         | prime |
| 150    | 2, 3, 2 5   | 4     |
| 151    | 151         | prime |
| 152    | 3 2, 19     | 4     |
| 153    | 2 3, 17     | 3     |
| 154    | 2, 7, 11    | 3     |
| 155    | 5, 31       | 2     |
| 156    | 2 2, 3, 13  | 4     |
| 157    | 157         | prime |
| 158    | 2, 79       | 2     |
| 159    | 3, 53       | 2     |
| 160    | 5 2, 5      | 6     |
| 161    | 7, 23       | 2     |
| 162    | 2, 4 3      | 5     |
| 163    | 163         | prime |
| 164    | 2 2, 41     | 3     |
| 165    | 3, 5, 11    | 3     |
| 166    | 2, 83       | 2     |
| 167    | 167         | prime |
| 168    | 3 2, 3, 7   | 5     |
| 169    | 2 13        | 2     |
| 170    | 2, 5, 17    | 3     |
| 171    | 2 3, 19     | 3     |
| 172    | 2 2, 43     | 3     |
| 173    | 173         | prime |
| 174    | 2, 3, 29    | 3     |
| 175    | 2 5, 7      | 3     |
| 176    | 4 2, 11     | 5     |
| 177    | 3, 59       | 2     |
| 178    | 2, 89       | 2     |
| 179    | 179         | prime |
| 180    | 2 2, 2 3, 5 | 5     |
| 181    | 181         | prime |
| 182    | 2, 7, 13    | 3     |
| 183    | 3, 61       | 2     |
| 184    | 3 2, 23     | 4     |
| 185    | 5, 37       | 2     |
| 186    | 2, 3, 31    | 3     |
| 187    | 11, 17      | 2     |
| 188    | 2 2, 47     | 3     |
| 189    | 3 3, 7      | 4     |
| 190    | 2, 5, 19    | 3     |
| 191    | 191         | prime |
| 192    | 6 2, 3      | 7     |
| 193    | 193         | prime |
| 194    | 2, 97       | 2     |
| 195    | 3, 5, 13    | 3     |
| 196    | 2 2, 2 7    | 4     |
| 197    | 197         | prime |
| 198    | 2, 2 3, 11  | 4     |
| 199    | 199         | prime |
| 200    | 3 2, 2 5    | 5     |
| 201    | 3, 67       | 2     |
| 202    | 2, 101      | 2     |
| 203    | 7, 29       | 2     |
| 204    | 2 2, 3, 17  | 4     |
| 205    | 5, 41       | 2     |
| 206    | 2, 103      | 2     |
| 207    | 2 3, 23     | 3     |
| 208    | 4 2, 13     | 5     |
| 209    | 11, 19      | 2     |
| 210    | 2, 3, 5, 7  | 4     |
| 211    | 211         | prime |
| 212    | 2 2, 53     | 3     |
| 213    | 3, 71       | 2     |
| 214    | 2, 107      | 2     |
| 215    | 5, 43       | 2     |
| 216    | 3 2, 3 3    | 6     |
| 217    | 7, 31       | 2     |
| 218    | 2, 109      | 2     |
| 219    | 3, 73       | 2     |
| 220    | 2 2, 5, 11  | 4     |
| 221    | 13, 17      | 2     |
| 222    | 2, 3, 37    | 3     |
| 223    | 223         | prime |
| 224    | 5 2, 7      | 6     |
| 225    | 2 3, 2 5    | 4     |
| 226    | 2, 113      | 2     |
| 227    | 227         | prime |
| 228    | 2 2, 3, 19  | 4     |
| 229    | 229         | prime |
| 230    | 2, 5, 23    | 3     |
| 231    | 3, 7, 11    | 3     |
| 232    | 3 2, 29     | 4     |
| 233    | 233         | prime |
| 234    | 2, 2 3, 13  | 4     |
| 235    | 5, 47       | 2     |
| 236    | 2 2, 59     | 3     |
| 237    | 3, 79       | 2     |
| 238    | 2, 7, 17    | 3     |
| 239    | 239         | prime |
| 240    | 4 2, 3, 5   | 6     |
| 241    | 241         | prime |
| 242    | 2, 2 11     | 3     |
| 243    | 5 3         | 5     |
| 244    | 2 2, 61     | 3     |
| 245    | 5, 2 7      | 3     |
| 246    | 2, 3, 41    | 3     |
| 247    | 13, 19      | 2     |
| 248    | 3 2, 31     | 4     |
| 249    | 3, 83       | 2     |
| 250    | 2, 3 5      | 4     |
+--------+-------------+-------+
I guess that is it for this post. I hope it is clear what I was trying to show!

Maybe I should also mention that the table really highlights the location of twin primes: look for pairs of "prime" entries separated by a single row, such as 11 and 13.
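
For readers more at home in Python, here is a rough analogue of the three operators in this post. It is a sketch only: the function names mirror the operators, factor uses plain trial division (the built-in factor operator may well work differently), and twin_primes is an extra helper made up for this example.

```python
def factor(n):
    """Prime factorisation of n as {prime: exponent}, by trial division."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def count_factors(n):
    """Number of prime factors counted with multiplicity (like count-sum factor)."""
    return sum(factor(n).values())

def prime(n):
    """'prime' if n has exactly one prime factor, otherwise the factor count."""
    count = count_factors(n)
    return "prime" if count == 1 else count

def twin_primes(limit):
    """Pairs (p, p + 2) with both members prime and p + 2 <= limit."""
    return [(p, p + 2) for p in range(2, limit - 1)
            if prime(p) == "prime" and prime(p + 2) == "prime"]

for n in range(1, 13):
    print(n, factor(n), prime(n))
print(twin_primes(50))
```

The point is the same as in the BKO version: count_factors is built from factor, and prime is built from count_factors, so each new operator is just a composition of the earlier ones.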

Tuesday, 3 March 2015

top level domains in sw format

Just another brief one. Decided it would be useful to have top level domain info in sw format.

Simply enough:
sa: load top-level-domains.sw
sa: table[country,top-level-domain] ket-sort rel-kets[top-level-domain] |>
+----------------------------------------------+------------------+
| country                                      | top-level-domain |
+----------------------------------------------+------------------+
| (Former) Soviet Union                        | su               |
| Afghanistan                                  | af               |
| Air-transport                                | aero             |
| Aland                                        | ax               |
| Albania                                      | al               |
| Algeria                                      | dz               |
| American Samoa                               | as               |
| Andorra                                      | ad               |
| Angola                                       | ao               |
| Anguilla                                     | ai               |
| Antarctica                                   | aq               |
| Antigua and Barbuda                          | ag               |
| Argentina                                    | ar               |
| Armenia                                      | am               |
| Aruba                                        | aw               |
| Ascension Island                             | ac               |
| Asia-Pacific community                       | asia             |
| Australia                                    | au               |
| Austria                                      | at               |
| Azerbaijan                                   | az               |
| Bahamas                                      | bs               |
| Bahrain                                      | bh               |
| Bangladesh                                   | bd               |
| Barbados                                     | bb               |
| Belarus                                      | by               |
| Belgium                                      | be               |
| Belize                                       | bz               |
| Benin                                        | bj               |
| Bermuda                                      | bm               |
| Bhutan                                       | bt               |
| Bolivia                                      | bo               |
| Bosnia and Herzegovina                       | ba               |
| Botswana                                     | bw               |
| Bouvet Island                                | bv               |
| Brazil                                       | br               |
| British Indian Ocean Territory               | io               |
| Brunei Darussalam                            | bn               |
| Bulgaria                                     | bg               |
| Burkina Faso                                 | bf               |
| Burundi                                      | bi               |
| Business                                     | biz              |
| Cambodia                                     | kh               |
| Cameroon                                     | cm               |
| Canada                                       | ca               |
| Cap Verde                                    | cv               |
| Catalan community                            | cat              |
| Cayman Islands                               | ky               |
| Central African Republic                     | cf               |
| Chad                                         | td               |
| Chile                                        | cl               |
| China                                        | cn               |
| Christmas Island                             | cx               |
| Cocos (Keeling) Islands                      | cc               |
| Colombia                                     | co               |
| Comoros                                      | km               |
| Cook Islands                                 | ck               |
| Cooperative association                      | coop             |
| Costa Rica                                   | cr               |
| Cote D'Ivoire (Ivory Coast)                  | ci               |
| Credentialed professional                    | pro              |
| Croatia/Hrvatska                             | hr               |
| Cuba                                         | cu               |
| Cyprus                                       | cy               |
| Czech Republic                               | cz               |
| Democratic People's Republic Korea           | kp               |
| Democratic Republic of the Congo             | cg               |
| Denmark                                      | dk               |
| Djibouti                                     | dj               |
| Dominica                                     | dm               |
| Dominican Republic                           | do               |
| East Timor                                   | tp               |
| Ecuador                                      | ec               |
| Egypt                                        | eg               |
| El Salvador                                  | sv               |
| Equatorial Guinea                            | gq               |
| Eritrea                                      | er               |
| Estonia                                      | ee               |
| Ethiopia                                     | et               |
| European Union                               | eu               |
| Falkland Islands (Malvina)                   | fk               |
| Faroe Islands                                | fo               |
| Federal State of Micronesia                  | fm               |
| Fiji                                         | fj               |
| Finland                                      | fi               |
| Former Yugoslav Republic Macedonia           | mk               |
| Former Yugoslavia                            | yu               |
| France                                       | fr               |
| French Guiana                                | gf               |
| French Polynesia                             | pf               |
| French Southern Territories                  | tf               |
| Gabon                                        | ga               |
| Gambia                                       | gm               |
| Georgia                                      | ge               |
| Germany                                      | de               |
| Ghana                                        | gh               |
| Gibraltar                                    | gi               |
| Greece                                       | gr               |
| Greenland                                    | gl               |
| Grenada                                      | gd               |
| Guadeloupe                                   | gp               |
| Guam                                         | gu               |
| Guatemala                                    | gt               |
| Guernsey                                     | gg               |
| Guinea                                       | gn               |
| Guinea-Bissau                                | gw               |
| Guyana                                       | gy               |
| Haiti                                        | ht               |
| Heard and McDonald Islands                   | hm               |
| Holy See (City Vatican State)                | va               |
| Honduras                                     | hn               |
| Hong Kong                                    | hk               |
| Human resource manager                       | jobs             |
| Hungary                                      | hu               |
| Iceland                                      | is               |
| India                                        | in               |
| Individual                                   | name             |
| Indonesia                                    | id               |
| Information                                  | info             |
| International                                | int              |
| Iraq                                         | iq               |
| Ireland                                      | ie               |
| Islamic Republic of Iran                     | ir               |
| Isle of Man                                  | im               |
| Israel                                       | il               |
| Italy                                        | it               |
| Jamaica                                      | jm               |
| Japan                                        | jp               |
| Jersey                                       | je               |
| Jordan                                       | jo               |
| Kazakhstan                                   | kz               |
| Kenya                                        | ke               |
| Kiribati                                     | ki               |
| Kuwait                                       | kw               |
| Kyrgyzstan                                   | kg               |
| Lao People's Democratic Republic             | la               |
| Latvia                                       | lv               |
| Lebanon                                      | lb               |
| Lesotho                                      | ls               |
| Liberia                                      | lr               |
| Libyan Arab Jamahiriya                       | ly               |
| Liechtenstein                                | li               |
| Lithuania                                    | lt               |
| Luxembourg                                   | lu               |
| Macau                                        | mo               |
| Madagascar                                   | mg               |
| Malawi                                       | mw               |
| Malaysia                                     | my               |
| Maldives                                     | mv               |
| Mali                                         | ml               |
| Malta                                        | mt               |
| Marshall Islands                             | mh               |
| Martinique                                   | mq               |
| Mauritania                                   | mr               |
| Mauritius                                    | mu               |
| Mayotte                                      | yt               |
| Mexico                                       | mx               |
| Mobile provider                              | mobi             |
| Monaco                                       | mc               |
| Mongolia                                     | mn               |
| Montenegro                                   | me               |
| Montserrat                                   | ms               |
| Morocco                                      | ma               |
| Mozambique                                   | mz               |
| Museum                                       | museum           |
| Myanmar                                      | mm               |
| Namibia                                      | na               |
| Nato field                                   | nato             |
| Nauru                                        | nr               |
| Nepal                                        | np               |
| Netherlands                                  | nl               |
| Netherlands Antilles                         | an               |
| Network provider                             | net              |
| New Caledonia                                | nc               |
| New Zealand (Aotearoa)                       | nz               |
| Nicaragua                                    | ni               |
| Niger                                        | ne               |
| Nigeria                                      | ng               |
| Niue                                         | nu               |
| Non-Profit Organization                      | org              |
| Norfolk Island                               | nf               |
| Northern Mariana Islands                     | mp               |
| Norway                                       | no               |
| Old style Arpanet                            | arpa             |
| Oman                                         | om               |
| Pakistan                                     | pk               |
| Palau                                        | pw               |
| Palestinian Territories                      | ps               |
| Panama                                       | pa               |
| Papua New Guinea                             | pg               |
| Paraguay                                     | py               |
| Peru                                         | pe               |
| Philippines                                  | ph               |
| Pitcairn Island                              | pn               |
| Poland                                       | pl               |
| Portugal                                     | pt               |
| Puerto Rico                                  | pr               |
| Qatar                                        | qa               |
| Republic of Korea                            | kr               |
| Republic of Moldova                          | md               |
| Reunion Island                               | re               |
| Romania                                      | ro               |
| Russian Federation                           | ru               |
| Rwanda                                       | rw               |
| Saint Kitts and Nevis                        | kn               |
| Saint Lucia                                  | lc               |
| Saint Vincent and the Grenadines             | vc               |
| San Marino                                   | sm               |
| Sao Tome and Principe                        | st               |
| Saudi Arabia                                 | sa               |
| Senegal                                      | sn               |
| Serbia                                       | rs               |
| Seychelles                                   | sc               |
| Sierra Leone                                 | sl               |
| Singapore                                    | sg               |
| Slovak Republic                              | sk               |
| Slovenia                                     | si               |
| Solomon Islands                              | sb               |
| Somalia                                      | so               |
| South Africa                                 | za               |
| South Georgia and the South Sandwich Islands | gs               |
| Spain                                        | es               |
| Sri Lanka                                    | lk               |
| St. Helena                                   | sh               |
| St. Pierre and Miquelon                      | pm               |
| Sudan                                        | sd               |
| Suriname                                     | sr               |
| Svalbard and Jan Mayen Islands               | sj               |
| Swaziland                                    | sz               |
| Sweden                                       | se               |
| Switzerland                                  | ch               |
| Syrian Arab Republic                         | sy               |
| Taiwan                                       | tw               |
| Tajikistan                                   | tj               |
| Tanzania                                     | tz               |
| Telephone service                            | tel              |
| Thailand                                     | th               |
| Togo                                         | tg               |
| Tokelau                                      | tk               |
| Tonga                                        | to               |
| Travel agent                                 | travel           |
| Trinidad and Tobago                          | tt               |
| Tunisia                                      | tn               |
| Turkey                                       | tr               |
| Turkmenistan                                 | tm               |
| Turks and Caicos Islands                     | tc               |
| Tuvalu                                       | tv               |
| Uganda                                       | ug               |
| Ukraine                                      | ua               |
| United Arab Emirates                         | ae               |
| United Kingdom                               | uk               |
| United States                                | us               |
| Uruguay                                      | uy               |
| US Commercial                                | com              |
| US Educational                               | edu              |
| US Government                                | gov              |
| US Military                                  | mil              |
| US Minor Outlying Islands                    | um               |
| Uzbekistan                                   | uz               |
| Vanuatu                                      | vu               |
| Venezuela                                    | ve               |
| Vietnam                                      | vn               |
| Virgin Islands (British)                     | vg               |
| Virgin Islands (USA)                         | vi               |
| Wallis and Futuna Islands                    | wf               |
| Western Sahara                               | eh               |
| Western Samoa                                | ws               |
| Yemen                                        | ye               |
| Zambia                                       | zm               |
| Zimbabwe                                     | zw               |
+----------------------------------------------+------------------+
And a brief note: the file contains all top level domains (generic ones such as com, org and museum too), not just those for countries.
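
The table above is ordered by the ket label, i.e. the country name. The same data can just as easily be ordered by the value of top-level-domain instead. As a rough Python analogy, assuming the data is held in a plain dict (the four entries are copied from the table, the variable names are made up):

```python
# country -> top-level-domain, a tiny sample of the table above
tld = {
    "Australia": "au",
    "Austria": "at",
    "Ascension Island": "ac",
    "Air-transport": "aero",
}

# ordered by the label (country), like the table above
by_country = sorted(tld.items())

# ordered by the value the operator returns (the domain)
by_domain = sorted(tld.items(), key=lambda pair: pair[1])

for country, domain in by_domain:
    print(f"{country:<20} {domain}")
```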

And may as well do it once more, this time sorted by top level domain:
sa: table[country,top-level-domain] sort-by[top-level-domain] rel-kets[top-level-domain] |>
+----------------------------------------------+------------------+
| country                                      | top-level-domain |
+----------------------------------------------+------------------+
| Ascension Island                             | ac               |
| Andorra                                      | ad               |
| United Arab Emirates                         | ae               |
| Air-transport                                | aero             |
| Afghanistan                                  | af               |
| Antigua and Barbuda                          | ag               |
| Anguilla                                     | ai               |
| Albania                                      | al               |
| Armenia                                      | am               |
| Netherlands Antilles                         | an               |
| Angola                                       | ao               |
| Antarctica                                   | aq               |
| Argentina                                    | ar               |
| Old style Arpanet                            | arpa             |
| American Samoa                               | as               |
| Asia-Pacific community                       | asia             |
| Austria                                      | at               |
| Australia                                    | au               |
| Aruba                                        | aw               |
| Aland                                        | ax               |
| Azerbaijan                                   | az               |
| Bosnia and Herzegovina                       | ba               |
| Barbados                                     | bb               |
| Bangladesh                                   | bd               |
| Belgium                                      | be               |
| Burkina Faso                                 | bf               |
| Bulgaria                                     | bg               |
| Bahrain                                      | bh               |
| Burundi                                      | bi               |
| Business                                     | biz              |
| Benin                                        | bj               |
| Bermuda                                      | bm               |
| Brunei Darussalam                            | bn               |
| Bolivia                                      | bo               |
| Brazil                                       | br               |
| Bahamas                                      | bs               |
| Bhutan                                       | bt               |
| Bouvet Island                                | bv               |
| Botswana                                     | bw               |
| Belarus                                      | by               |
| Belize                                       | bz               |
| Canada                                       | ca               |
| Catalan community                            | cat              |
| Cocos (Keeling) Islands                      | cc               |
| Central African Republic                     | cf               |
| Democratic Republic of the Congo             | cg               |
| Switzerland                                  | ch               |
| Cote D'Ivoire (Ivory Coast)                  | ci               |
| Cook Islands                                 | ck               |
| Chile                                        | cl               |
| Cameroon                                     | cm               |
| China                                        | cn               |
| Colombia                                     | co               |
| US Commercial                                | com              |
| Cooperative association                      | coop             |
| Costa Rica                                   | cr               |
| Cuba                                         | cu               |
| Cap Verde                                    | cv               |
| Christmas Island                             | cx               |
| Cyprus                                       | cy               |
| Czech Republic                               | cz               |
| Germany                                      | de               |
| Djibouti                                     | dj               |
| Denmark                                      | dk               |
| Dominica                                     | dm               |
| Dominican Republic                           | do               |
| Algeria                                      | dz               |
| Ecuador                                      | ec               |
| US Educational                               | edu              |
| Estonia                                      | ee               |
| Egypt                                        | eg               |
| Western Sahara                               | eh               |
| Eritrea                                      | er               |
| Spain                                        | es               |
| Ethiopia                                     | et               |
| European Union                               | eu               |
| Finland                                      | fi               |
| Fiji                                         | fj               |
| Falkland Islands (Malvina)                   | fk               |
| Federal State of Micronesia                  | fm               |
| Faroe Islands                                | fo               |
| France                                       | fr               |
| Gabon                                        | ga               |
| Grenada                                      | gd               |
| Georgia                                      | ge               |
| French Guiana                                | gf               |
| Guernsey                                     | gg               |
| Ghana                                        | gh               |
| Gibraltar                                    | gi               |
| Greenland                                    | gl               |
| Gambia                                       | gm               |
| Guinea                                       | gn               |
| US Government                                | gov              |
| Guadeloupe                                   | gp               |
| Equatorial Guinea                            | gq               |
| Greece                                       | gr               |
| South Georgia and the South Sandwich Islands | gs               |
| Guatemala                                    | gt               |
| Guam                                         | gu               |
| Guinea-Bissau                                | gw               |
| Guyana                                       | gy               |
| Hong Kong                                    | hk               |
| Heard and McDonald Islands                   | hm               |
| Honduras                                     | hn               |
| Croatia/Hrvatska                             | hr               |
| Haiti                                        | ht               |
| Hungary                                      | hu               |
| Indonesia                                    | id               |
| Ireland                                      | ie               |
| Israel                                       | il               |
| Isle of Man                                  | im               |
| India                                        | in               |
| Information                                  | info             |
| International                                | int              |
| British Indian Ocean Territory               | io               |
| Iraq                                         | iq               |
| Islamic Republic of Iran                     | ir               |
| Iceland                                      | is               |
| Italy                                        | it               |
| Jersey                                       | je               |
| Jamaica                                      | jm               |
| Jordan                                       | jo               |
| Human resource manager                       | jobs             |
| Japan                                        | jp               |
| Kenya                                        | ke               |
| Kyrgyzstan                                   | kg               |
| Cambodia                                     | kh               |
| Kiribati                                     | ki               |
| Comoros                                      | km               |
| Saint Kitts and Nevis                        | kn               |
| Democratic People's Republic Korea           | kp               |
| Republic of Korea                            | kr               |
| Kuwait                                       | kw               |
| Cayman Islands                               | ky               |
| Kazakhstan                                   | kz               |
| Lao People's Democratic Republic             | la               |
| Lebanon                                      | lb               |
| Saint Lucia                                  | lc               |
| Liechtenstein                                | li               |
| Sri Lanka                                    | lk               |
| Liberia                                      | lr               |
| Lesotho                                      | ls               |
| Lithuania                                    | lt               |
| Luxembourg                                   | lu               |
| Latvia                                       | lv               |
| Libyan Arab Jamahiriya                       | ly               |
| Morocco                                      | ma               |
| Monaco                                       | mc               |
| Republic of Moldova                          | md               |
| Montenegro                                   | me               |
| Madagascar                                   | mg               |
| Marshall Islands                             | mh               |
| US Military                                  | mil              |
| Former Yugoslav Republic Macedonia           | mk               |
| Mali                                         | ml               |
| Myanmar                                      | mm               |
| Mongolia                                     | mn               |
| Macau                                        | mo               |
| Mobile provider                              | mobi             |
| Northern Mariana Islands                     | mp               |
| Martinique                                   | mq               |
| Mauritania                                   | mr               |
| Montserrat                                   | ms               |
| Malta                                        | mt               |
| Mauritius                                    | mu               |
| Museum                                       | museum           |
| Maldives                                     | mv               |
| Malawi                                       | mw               |
| Mexico                                       | mx               |
| Malaysia                                     | my               |
| Mozambique                                   | mz               |
| Namibia                                      | na               |
| Individual                                   | name             |
| Nato field                                   | nato             |
| New Caledonia                                | nc               |
| Niger                                        | ne               |
| Network provider                             | net              |
| Norfolk Island                               | nf               |
| Nigeria                                      | ng               |
| Nicaragua                                    | ni               |
| Netherlands                                  | nl               |
| Norway                                       | no               |
| Nepal                                        | np               |
| Nauru                                        | nr               |
| Niue                                         | nu               |
| New Zealand (Aotearoa)                       | nz               |
| Oman                                         | om               |
| Non-Profit Organization                      | org              |
| Panama                                       | pa               |
| Peru                                         | pe               |
| French Polynesia                             | pf               |
| Papua New Guinea                             | pg               |
| Philippines                                  | ph               |
| Pakistan                                     | pk               |
| Poland                                       | pl               |
| St. Pierre and Miquelon                      | pm               |
| Pitcairn Island                              | pn               |
| Puerto Rico                                  | pr               |
| Credentialed professional                    | pro              |
| Palestinian Territories                      | ps               |
| Portugal                                     | pt               |
| Palau                                        | pw               |
| Paraguay                                     | py               |
| Qatar                                        | qa               |
| Reunion Island                               | re               |
| Romania                                      | ro               |
| Serbia                                       | rs               |
| Russian Federation                           | ru               |
| Rwanda                                       | rw               |
| Saudi Arabia                                 | sa               |
| Solomon Islands                              | sb               |
| Seychelles                                   | sc               |
| Sudan                                        | sd               |
| Sweden                                       | se               |
| Singapore                                    | sg               |
| St. Helena                                   | sh               |
| Slovenia                                     | si               |
| Svalbard and Jan Mayen Islands               | sj               |
| Slovak Republic                              | sk               |
| Sierra Leone                                 | sl               |
| San Marino                                   | sm               |
| Senegal                                      | sn               |
| Somalia                                      | so               |
| Suriname                                     | sr               |
| Sao Tome and Principe                        | st               |
| (Former) Soviet Union                        | su               |
| El Salvador                                  | sv               |
| Syrian Arab Republic                         | sy               |
| Swaziland                                    | sz               |
| Turks and Caicos Islands                     | tc               |
| Chad                                         | td               |
| Telephone service                            | tel              |
| French Southern Territories                  | tf               |
| Togo                                         | tg               |
| Thailand                                     | th               |
| Tajikistan                                   | tj               |
| Tokelau                                      | tk               |
| Turkmenistan                                 | tm               |
| Tunisia                                      | tn               |
| Tonga                                        | to               |
| East Timor                                   | tp               |
| Turkey                                       | tr               |
| Travel agent                                 | travel           |
| Trinidad and Tobago                          | tt               |
| Tuvalu                                       | tv               |
| Taiwan                                       | tw               |
| Tanzania                                     | tz               |
| Ukraine                                      | ua               |
| Uganda                                       | ug               |
| United Kingdom                               | uk               |
| US Minor Outlying Islands                    | um               |
| United States                                | us               |
| Uruguay                                      | uy               |
| Uzbekistan                                   | uz               |
| Holy See (City Vatican State)                | va               |
| Saint Vincent and the Grenadines             | vc               |
| Venezuela                                    | ve               |
| Virgin Islands (British)                     | vg               |
| Virgin Islands (USA)                         | vi               |
| Vietnam                                      | vn               |
| Vanuatu                                      | vu               |
| Wallis and Futuna Islands                    | wf               |
| Western Samoa                                | ws               |
| Yemen                                        | ye               |
| Mayotte                                      | yt               |
| Former Yugoslavia                            | yu               |
| South Africa                                 | za               |
| Zambia                                       | zm               |
| Zimbabwe                                     | zw               |
+----------------------------------------------+------------------+
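For readers without the console handy, here is a rough Python sketch of what the sort-by[top-level-domain] step is doing. It assumes a plain string sort on the tld, which is consistent with the ordering in the table above (ae before aero before af) but is not taken from the actual implementation. The rows list is just a handful of pairs copied from the table, and sort_by_tld is a name made up for this illustration.

```python
# A few (country, top-level-domain) pairs copied from the table above.
# The list and the function name are illustrative only, not part of the project.
rows = [
    ("United Kingdom", "uk"),
    ("Ascension Island", "ac"),
    ("US Commercial", "com"),
    ("Andorra", "ad"),
    ("Air-transport", "aero"),
    ("United Arab Emirates", "ae"),
]


def sort_by_tld(pairs):
    """Return the pairs ordered by their top-level-domain string (assumed plain string sort)."""
    return sorted(pairs, key=lambda pair: pair[1])


for country, tld in sort_by_tld(rows):
    print(f"{country:<22} {tld}")
```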
And that's it for this post!

Update: here is a table of only those entries where the tld is longer than 2 characters:
sa: tld-is-longer-than-2 |*> #=> is-greater-than[2] ket-length top-level-domain |_self> 
sa: table[org,top-level-domain] sort-by[top-level-domain] such-that[tld-is-longer-than-2] rel-kets[top-level-domain] |>
+---------------------------+------------------+
| org                       | top-level-domain |
+---------------------------+------------------+
| Air-transport             | aero             |
| Old style Arpanet         | arpa             |
| Asia-Pacific community    | asia             |
| Business                  | biz              |
| Catalan community         | cat              |
| US Commercial             | com              |
| Cooperative association   | coop             |
| US Educational            | edu              |
| US Government             | gov              |
| Information               | info             |
| International             | int              |
| Human resource manager    | jobs             |
| US Military               | mil              |
| Mobile provider           | mobi             |
| Museum                    | museum           |
| Individual                | name             |
| Nato field                | nato             |
| Network provider          | net              |
| Non-Profit Organization   | org              |
| Credentialed professional | pro              |
| Telephone service         | tel              |
| Travel agent              | travel           |
+---------------------------+------------------+
Cool!
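
As a cross-check, the same filter can be sketched in Python. This is only an approximation of the idea: tld_is_longer_than plays the role of the tld-is-longer-than-2 operator, and long_tlds combines the such-that and sort-by steps. Both names, and the small rows sample, are assumptions for this illustration rather than anything from the project code.

```python
# A few (org, top-level-domain) pairs copied from the tables above.
# Names here are illustrative only.
rows = [
    ("United Kingdom", "uk"),
    ("US Commercial", "com"),
    ("Air-transport", "aero"),
    ("Andorra", "ad"),
    ("Museum", "museum"),
    ("Business", "biz"),
]


def tld_is_longer_than(pair, n=2):
    """True when the top-level-domain string has more than n characters."""
    return len(pair[1]) > n


def long_tlds(pairs, n=2):
    """Keep only pairs whose tld is longer than n, sorted by tld."""
    kept = [pair for pair in pairs if tld_is_longer_than(pair, n)]
    return sorted(kept, key=lambda pair: pair[1])


for org, tld in long_tlds(rows):
    print(f"{org:<16} {tld}")
```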

Monday, 2 March 2015

how many movies?

A quick one in this post: print a table of the top 200 actors ranked by movie count.
-- load the data:
sa: load improved-imdb.sw

-- first define our "how many movies" operator:
sa: how-many-movies |*> #=> how-many movies |_self>

-- now show the table:
sa: rank-table[actor,how-many-movies] select[1,200] reverse sort-by[how-many-movies] rel-kets[movies] |>
+------+-------------------------------------+-----------------+
| rank | actor                               | how-many-movies |
+------+-------------------------------------+-----------------+
| 1    | Mel Blanc                           | 969             |
| 2    | Brahmanandam                        | 958             |
| 3    | Matsunosuke Onoe                    | 926             |
| 4    | Bess Flowers                        | 812             |
| 5    | Herman Hack                         | 648             |
| 6    | Lee (I) Phelps                      | 639             |
| 7    | Edmund Cobb                         | 633             |
| 8    | Frank (I) O'Connor                  | 614             |
| 9    | Ron Jeremy                          | 608             |
| 10   | Tom London                          | 605             |
| 11   | Shakti (I) Kapoor                   | 597             |
| 12   | Jack Mower                          | 591             |
| 13   | Bud (I) Osborne                     | 588             |
| 14   | Sam (II) Harris                     | 566             |
| 15   | Adoor Bhasi                         | 559             |
| 16   | Jack (I) Richardson                 | 549             |
| 17   | Franklyn Farnum                     | 546             |
| 18   | Jagathi Sreekumar                   | 544             |
| 19   | Frank (I) Ellis                     | 542             |
| 20   | Larry Steers                        | 540             |
| 21   | 'Snub' Pollard                      | 540             |
| 22   | Eddie (I) Garcia                    | 534             |
| 23   | Harold (I) Miller                   | 530             |
| 24   | Stuart Holmes                       | 524             |
| 25   | Stanley Blystone                    | 521             |
| 26   | Irving Bacon                        | 508             |
| 27   | Prem Nazir                          | 506             |
| 28   | Charles (I) Sullivan                | 504             |
| 29   | Emmett Vogan                        | 495             |
| 30   | Bud Jamison                         | 495             |
| 31   | Francis (I) Ford                    | 489             |
| 32   | Helen (I)                           | 485             |
| 33   | Harry Strang                        | 476             |
| 34   | Lee (I) Moran                       | 475             |
| 35   | Heinie Conklin                      | 472             |
| 36   | Donald (I) Kerr                     | 471             |
| 37   | Fred Kelsey                         | 470             |
| 38   | Paquito Diaz                        | 467             |
| 39   | Jimmy Aubrey                        | 465             |
| 40   | Leo (I) White                       | 464             |
| 41   | Oliver Hardy                        | 457             |
| 42   | Ernie (I) Adams                     | 455             |
| 43   | Lee Shumway                         | 454             |
| 44   | Aruna Irani                         | 450             |
| 45   | Wade Boteler                        | 448             |
| 46   | Lafe McKee                          | 447             |
| 47   | Victor Potel                        | 445             |
| 48   | Edgar Kennedy                       | 438             |
| 49   | Vernon Dent                         | 437             |
| 50   | Jack Mercer                         | 436             |
| 51   | Lester Dorr                         | 436             |
| 52   | Billy Bletcher                      | 436             |
| 53   | Charles (II) King                   | 432             |
| 54   | Jack Mulhall                        | 431             |
| 55   | George Magrill                      | 429             |
| 56   | Ethan Laidlaw                       | 429             |
| 57   | Edward Peil Sr.                     | 427             |
| 58   | George (I) Chesebro                 | 426             |
| 59   | Billy Franey                        | 425             |
| 60   | Edward (I) Earle                    | 422             |
| 61   | Hank (I) Mann                       | 421             |
| 62   | George Morrell                      | 419             |
| 63   | William H. O'Brien                  | 416             |
| 64   | Wilfred Lucas                       | 415             |
| 65   | Bob (II) Burns                      | 411             |
| 66   | Chuck (I) Hamilton                  | 408             |
| 67   | Cyril Ring                          | 407             |
| 68   | Hank Bell                           | 407             |
| 69   | James Flavin                        | 406             |
| 70   | Jeffrey Sayre                       | 405             |
| 71   | Gino Corrado                        | 404             |
| 72   | Pat (I) O'Malley                    | 402             |
| 73   | Eddie (I) Lyons                     | 402             |
| 74   | Joey Silvera                        | 401             |
| 75   | Frank Hagney                        | 401             |
| 76   | Herbert (I) Rawlinson               | 400             |
| 77   | Selmer Jackson                      | 399             |
| 78   | Harry Todd                          | 398             |
| 79   | Kader (I) Khan                      | 398             |
| 80   | Raza Murad                          | 395             |
| 81   | Milton Kibbee                       | 391             |
| 82   | Kenner G. Kemp                      | 391             |
| 83   | Horace B. Carpenter                 | 390             |
| 84   | Robert Homans                       | 388             |
| 85   | Bryant Washburn                     | 387             |
| 86   | Syd Saylor                          | 387             |
| 87   | J. Warren Kerrigan                  | 387             |
| 88   | Asrani                              | 386             |
| 89   | Paul Panzer                         | 384             |
| 90   | Edward Hearn                        | 383             |
| 91   | Tom (I) Byron                       | 383             |
| 92   | Milburn Morante                     | 382             |
| 93   | James Conaty                        | 382             |
| 94   | Mack Sennett                        | 380             |
| 95   | Mike (I) Lally                      | 380             |
| 96   | Gulshan Grover                      | 380             |
| 97   | Harry (I) Tenbrook                  | 379             |
| 98   | Madan (I) Puri                      | 378             |
| 99   | Eddie (I) Dunn                      | 377             |
| 100  | Mammootty                           | 376             |
| 101  | Andy (I) Clyde                      | 376             |
| 102  | Forrest (I) Taylor                  | 375             |
| 103  | Birbal                              | 375             |
| 104  | Pran (I)                            | 374             |
| 105  | Raymond Hatton                      | 374             |
| 106  | George (I) Chandler                 | 374             |
| 107  | Claire McDowell                     | 370             |
| 108  | Roy Bucko                           | 366             |
| 109  | Tom (I) Kennedy                     | 362             |
| 110  | Slim Whitaker                       | 361             |
| 111  | Kit Guard                           | 360             |
| 112  | Byron Foulger                       | 360             |
| 113  | Ralph Brooks                        | 359             |
| 114  | Satyendra Kapoor                    | 358             |
| 115  | Bert (I) Roach                      | 357             |
| 116  | Anupam Kher                         | 357             |
| 117  | Philo McCullough                    | 356             |
| 118  | Jagdeep (I)                         | 356             |
| 119  | Ed (III) Brady                      | 355             |
| 120  | Gilbert M. 'Broncho Billy' Anderson | 354             |
| 121  | Mae Questel                         | 353             |
| 122  | Dot Farley                          | 352             |
| 123  | Frank (I) Mayo                      | 351             |
| 124  | Frank (I) Mills                     | 349             |
| 125  | Jack (I) Kirk                       | 348             |
| 126  | Lane Chandler                       | 348             |
| 127  | Jack Perrin                         | 347             |
| 128  | Max Alvarado                        | 346             |
| 129  | Pierre Watkin                       | 345             |
| 130  | Max (I) Wagner                      | 345             |
| 131  | Bert Moorhouse                      | 345             |
| 132  | Eddy Chandler                       | 344             |
| 133  | Al St. John                         | 342             |
| 134  | Harry (I) Semels                    | 340             |
| 135  | Viju Khote                          | 340             |
| 136  | Adolf Hitler                        | 340             |
| 137  | Joseph Crehan                       | 340             |
| 138  | Robert (I) McKenzie                 | 339             |
| 139  | Sidney Bracey                       | 339             |
| 140  | Lalita Pawar                        | 338             |
| 141  | Edward Dillon                       | 338             |
| 142  | Sukumari                            | 336             |
| 143  | Jack Tornek                         | 336             |
| 144  | Jack (I) Evans                      | 336             |
| 145  | Jim Corey                           | 336             |
| 146  | Blackie Whiteford                   | 335             |
| 147  | J. Farrell MacDonald                | 335             |
| 148  | Glen Cavender                       | 334             |
| 149  | Henry B. Walthall                   | 333             |
| 150  | Stanley (I) Andrews                 | 333             |
| 151  | Anup (I) Kumar                      | 331             |
| 152  | Prem Chopra                         | 330             |
| 153  | Sharada (I)                         | 329             |
| 154  | Ralph Dunn                          | 329             |
| 155  | Artie Ortego                        | 328             |
| 156  | Gaston Modot                        | 328             |
| 157  | Arthur V. Johnson                   | 328             |
| 158  | Al (I) Hill                         | 328             |
| 159  | Harry (II) Harvey                   | 328             |
| 160  | Charles Ogle                        | 327             |
| 161  | Paul (I) Hurst                      | 327             |
| 162  | Jack Chefe                          | 327             |
| 163  | Mary Pickford                       | 326             |
| 164  | Edmund Mortimer                     | 326             |
| 165  | Mohanlal (I)                        | 325             |
| 166  | Herbert Prior                       | 323             |
| 167  | Forbes Murray                       | 323             |
| 168  | Edward LeSaint                      | 323             |
| 169  | Charles (I) Sherlock                | 322             |
| 170  | Dinesh Hingoo                       | 322             |
| 171  | John (I) Hamilton                   | 322             |
| 172  | Nedumudi Venu                       | 321             |
| 173  | Carl Stockdale                      | 321             |
| 174  | Humberto (I) Rodríguez              | 321             |
| 175  | Mithun (I) Chakraborty              | 321             |
| 176  | King (I) Baggot                     | 321             |
| 177  | Thikkurisi Sukumaran Nair           | 320             |
| 178  | Al (I) Ferguson                     | 320             |
| 179  | William B. Davidson                 | 320             |
| 180  | George Estregan                     | 319             |
| 181  | Arthur Housman                      | 318             |
| 182  | José Chávez                         | 318             |
| 183  | Brooks Benedict                     | 318             |
| 184  | Polidor                             | 317             |
| 185  | Tom (I) Mix                         | 317             |
| 186  | Kiran (I) Kumar                     | 317             |
| 187  | Bert (I) Stevens                    | 316             |
| 188  | Hermann Picha                       | 315             |
| 189  | Dell Henderson                      | 315             |
| 190  | Shivaji Ganesan                     | 315             |
| 191  | Mickey (I) Rooney                   | 314             |
| 192  | Edward Keane                        | 314             |
| 193  | William (I) Bailey                  | 314             |
| 194  | Charles (I) West                    | 313             |
| 195  | Jamie Gillis                        | 313             |
| 196  | Edgar Dearing                       | 313             |
| 197  | Ben (I) Corbett                     | 313             |
| 198  | Steve (I) Clark                     | 312             |
| 199  | Tom Santschi                        | 311             |
| 200  | Edward Gargan                       | 311             |
+------+-------------------------------------+-----------------+
And that's it for this post! More to come.

similar[movies] and similar[actors]

In the past I gave some examples of common[movies] and common[actors]. Today, let's try similar[op].
sa: load improved-imdb.sw
-- who has similar movies to Tom Cruise:
sa: table[actor,coeff] select[1,15] 100 self-similar[movies] |actor: Tom Cruise>
+------------------+---------+
| actor            | coeff   |
+------------------+---------+
| Tom Cruise       | 100.000 |
| Nicole Kidman    | 11.940  |
| William Mapother | 8.065   |
| Steven Spielberg | 8.065   |
| Ving Rhames      | 6.579   |
| Brad Pitt        | 6.061   |
| John Travolta    | 5.634   |
| Ron (I) Dean     | 4.839   |
| Dale Dye         | 4.839   |
| Cuba Gooding Jr. | 4.839   |
| Michael G. Kehoe | 4.839   |
| Simon Pegg       | 4.839   |
| Sydney Pollack   | 4.839   |
| Jeremy Renner    | 4.839   |
| George C. Scott  | 4.839   |
+------------------+---------+

-- now dig a little into these results:
-- how many movies has Tom Cruise starred in?
sa: how-many movies |actor: Tom Cruise>
|number: 62>

-- how many movies has Nicole Kidman starred in?
sa: how-many movies |actor: Nicole Kidman>
|number: 67>

-- how many common movies have Tom and Nicole starred in?
sa: how-many common[movies] (|actor: Tom Cruise> + |actor: Nicole Kidman>)
|number: 8>

-- show a table of common movies for Tom Cruise and Nicole Kidman:
sa: table[movies] common[movies] (|actor: Tom Cruise> + |actor: Nicole Kidman>)
+---------------------------------------------------+
| movies                                            |
+---------------------------------------------------+
| August (2008)                                     |
| Boffo! Tinseltown's Bombs and Blockbusters (2006) |
| Days of Thunder (1990)                            |
| Der Geist des Geldes (2007)                       |
| Eyes Wide Shut (1999)                             |
| Far and Away (1992)                               |
| A Life in Pictures (2001)                         |
| The Queen (2006)                                  |
+---------------------------------------------------+
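
The three counts above line up with the 11.940 coefficient in the first table if the coefficient is the number of shared movies divided by the larger of the two filmographies: 8 / max(62, 67). That is an inference from the figures in this post, not a statement of how self-similar is implemented. Here is a minimal Python sketch of that arithmetic, outside the sa: console. The function name overlap_percent is hypothetical.

```python
def overlap_percent(common: int, size_a: int, size_b: int) -> float:
    """Shared items as a percentage of the larger set.

    Assumed formula, inferred from the console output in this post.
    """
    larger = max(size_a, size_b)
    if larger == 0:
        return 0.0
    return 100.0 * common / larger


# Counts reported by the console: Tom Cruise 62, Nicole Kidman 67, common 8.
tom_nicole = overlap_percent(8, 62, 67)
print(f"{tom_nicole:.3f}")  # prints 11.940
```

Dividing by the larger count means an actor with a huge filmography does not score highly just by overlapping with everyone.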

-- now another example:
-- who has similar movies to Matt Damon:
sa: table[actor,coeff] select[1,15] 100 self-similar[movies] |actor: Matt Damon>
+---------------------+---------+
| actor               | coeff   |
+---------------------+---------+
| Matt Damon          | 100.000 |
| Ben Affleck         | 13.333  |
| George Clooney      | 10.667  |
| Casey Affleck       | 9.333   |
| Johnny Cicco        | 9.333   |
| Brad Pitt           | 6.667   |
| Don Cheadle         | 5.333   |
| Eddie Jemison       | 5.333   |
| Jason (I) Lee       | 5.333   |
| Ernest O'Donnell    | 5.333   |
| Carl Reiner         | 5.333   |
| Jerry (I) Weintraub | 5.333   |
| Franka Potente      | 5.333   |
| Julia (I) Roberts   | 5.333   |
| Josh Brolin         | 4.000   |
+---------------------+---------+

-- a table of common movies for Matt Damon and Ben Affleck:
sa: table[movies] common[movies] (|actor: Matt Damon> + |actor: Ben Affleck>)
+---------------------------------------+
| movies                                |
+---------------------------------------+
| Chasing Amy (1997)                    |
| Dogma (1999)                          |
| Field of Dreams (1989)                |
| Glory Daze (1995)                     |
| Good Will Hunting (1997)              |
| Jay and Silent Bob Strike Back (2001) |
| Jersey Girl (2004)                    |
| School Ties (1992)                    |
| The Third Wheel (2002)                |
| Unite for Japan (2011)                |
+---------------------------------------+

-- a table of common movies for Matt Damon and George Clooney:
sa: table[movies] common[movies] (|actor: Matt Damon> + |actor: George Clooney>)
+--------------------------------------------+
| movies                                     |
+--------------------------------------------+
| Confessions of a Dangerous Mind (2002)     |
| George W. Bush Battles Jesus Christ (2008) |
| Ocean's Eleven (2001)                      |
| Ocean's Thirteen (2007)                    |
| Ocean's Twelve (2004)                      |
| Radioman (2012)                            |
| Syriana (2005)                             |
| The Monuments Men (2014)                   |
+--------------------------------------------+
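
The post does not show a how-many result for Matt Damon, but the two common-movie tables above give a rough consistency check. Assuming the coefficient is 100 * common / larger filmography (an assumption inferred from the Tom Cruise numbers, not a documented formula), each pair implies a size for the larger filmography. The helper name implied_larger_size is hypothetical.

```python
def implied_larger_size(common: int, coeff_percent: float) -> int:
    """Invert coeff = 100 * common / larger_size (assumed formula)."""
    return round(100.0 * common / coeff_percent)


# common = rows counted in the common-movie tables above,
# coefficient = value from the Matt Damon similarity table.
pairs = {
    "Matt Damon / Ben Affleck": (10, 13.333),
    "Matt Damon / George Clooney": (8, 10.667),
}
implied = {name: implied_larger_size(c, k) for name, (c, k) in pairs.items()}
print(implied)
```

Both pairs imply the same size. That is what you would expect if the larger filmography in each pair is Matt Damon's own, though the console output above does not confirm it.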

-- who has similar movies to Morgan Freeman:
sa: table[actor,coeff] select[1,15] 100 self-similar[movies] |actor: Morgan (I) Freeman>
+----------------------+---------+
| actor                | coeff   |
+----------------------+---------+
| Morgan (I) Freeman   | 100.000 |
| Clint Eastwood       | 6.542   |
| Aaron Eckhart        | 4.673   |
| Bruce Willis         | 4.673   |
| Ashley Judd          | 4.673   |
| Steve Carell         | 3.738   |
| Kevin Costner        | 3.738   |
| Alfonso Freeman      | 3.738   |
| Gene Hackman         | 3.738   |
| Cillian Murphy       | 3.738   |
| James (III) Rawlings | 3.738   |
| Tim (I) Robbins      | 3.738   |
| Steven Spielberg     | 3.738   |
| Radha Mitchell       | 3.738   |
| Jessica Tandy        | 3.738   |
+----------------------+---------+

-- a table of common movies for Morgan Freeman and Clint Eastwood:
sa: table[movies] common[movies] (|actor: Morgan (I) Freeman> + |actor: Clint Eastwood>)
+---------------------------------------------------+
| movies                                            |
+---------------------------------------------------+
| A Century of Cinema (1994)                        |
| Boffo! Tinseltown's Bombs and Blockbusters (2006) |
| The Story of Richard D. Zanuck (2013)             |
| The Untold Story (2013)                           |
| Million Dollar Baby (2004)                        |
| Tales from the Warner Bros. Lot (2013)            |
| Unforgiven (1992)                                 |
+---------------------------------------------------+

-- which movies have similar actors to Pulp Fiction:
sa: table[movie,coeff] select[1,15] 100 self-similar[actors] |movie: Pulp Fiction (1994)>
+-----------------------------------------+---------+
| movie                                   | coeff   |
+-----------------------------------------+---------+
| Pulp Fiction (1994)                     | 100.000 |
| Reservoir Dogs (1992)                   | 16.981  |
| You're Still Not Fooling Anybody (1997) | 13.208  |
| Jackie Brown (1997)                     | 9.434   |
| Vol. 2 (2004)                           | 7.576   |
| Full Tilt Boogie (1997)                 | 7.547   |
| Somebody to Love (1994)                 | 7.547   |
| From Dusk Till Dawn (1996)              | 7.547   |
| My Best Friend's Birthday (1987)        | 7.547   |
| The Prophecy (1995)                     | 5.660   |
| It's Pat (1994)                         | 5.660   |
| Jumpin' at the Boneyard (1992)          | 5.660   |
| Out of Sight (1998)                     | 5.660   |
| Kiss of Death (1995)                    | 5.660   |
| Inside Hollywood Movies (2013)          | 5.660   |
+-----------------------------------------+---------+

-- a table of common actors for Pulp Fiction and Reservoir Dogs:
sa: table[actor] common[actors] (|movie: Pulp Fiction (1994)> + |movie: Reservoir Dogs (1992)>)
+-------------------+
| actor             |
+-------------------+
| Lawrence Bender   |
| Steve Buscemi     |
| Harvey (I) Keitel |
| Tim (I) Roth      |
| Robert (I) Ruth   |
| Burr Steers       |
| Quentin Tarantino |
| Rich (I) Turner   |
| Linda (I) Kaye    |
+-------------------+

-- which movies have similar actors to Star Trek: The Motion Picture:
-- NB: the extract-value operator strips the "Star Trek: " prefix, which is why the Star Trek titles in the table are shortened. I don't yet know how to fix this.
sa: table[movie,coeff] select[1,15] 100 self-similar[actors] |movie: Star Trek: The Motion Picture (1979)>
+---------------------------------+---------+
| movie                           | coeff   |
+---------------------------------+---------+
| The Motion Picture (1979)       | 100.000 |
| The Voyage Home (1986)          | 15.625  |
| The Search for Spock (1984)     | 15.625  |
| The Undiscovered Country (1991) | 15.625  |
| The Final Frontier (1989)       | 10.938  |
| The Wrath of Khan (1982)        | 10.938  |
| Road Trek 2011 (2012)           | 10.938  |
| Star Trek Adventure (1991)      | 10.938  |
| To Be Takei (2014)              | 7.812   |
| Generations (1994)              | 5.797   |
| Trekkies (1997)                 | 5.670   |
| Trek Nation (2010)              | 4.688   |
| The Other Movie (1981)          | 4.688   |
| The Captains (2011)             | 4.688   |
| Backyard Blockbusters (2012)    | 4.545   |
+---------------------------------+---------+

-- a table of common actors for the original series Star Trek movies:
-- NB: we know to select the top 6 movies because of the results in the table just above
sa: table[actor] common[actors] select[1,6] self-similar[actors] |movie: Star Trek: The Motion Picture (1979)>
+-------------------+
| actor             |
+-------------------+
| James Doohan      |
| DeForest Kelley   |
| Walter (I) Koenig |
| Leonard Nimoy     |
| William Shatner   |
| George Takei      |
| Nichelle Nichols  |
+-------------------+
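
Read right to left, the query above takes the six most similar movies and keeps only the actors who appear in every one of them. Here is a rough Python sketch of that "common to all" step, using hypothetical toy casts rather than the IMDB data. The function name common_to_all is also an assumption.

```python
from functools import reduce

# Hypothetical toy casts, not taken from the IMDB data.
casts = {
    "movie 1": {"actor a", "actor b", "actor c", "actor d"},
    "movie 2": {"actor a", "actor b", "actor c", "actor e"},
    "movie 3": {"actor a", "actor b", "actor f"},
}


def common_to_all(sets):
    """Elements present in every set; empty input gives an empty set."""
    sets = list(sets)
    if not sets:
        return set()
    return reduce(set.intersection, sets)


shared = common_to_all(casts.values())
print(sorted(shared))  # ['actor a', 'actor b']
```

The shared set can only shrink as more movies are added, so widening select[1,6] to more rows would eventually drop some of the actors listed above.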
And that's it for this post. I think we are really starting to see some cool results, and starting to show how the Feynman Knowledge Engine works in practice. As usual, heaps more to come!

Update: use this to find the top 100 sci-fi movies:
table[movie,rating] select[1,100] reverse sort-by[rating] such-that[genre-is-scifi] rel-kets[actors] |>
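
The operator chain above reads right to left: start from every ket that has actors defined, keep the ones that pass the genre test, sort by rating, reverse so the highest comes first, and keep the first 100. Here is a hedged Python sketch of the same filter, sort and select shape. The records, the Movie class and top_rated are all hypothetical. genre-is-scifi is not defined in this post, so it is modelled here as a simple genre membership test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Movie:
    title: str
    rating: float
    genres: frozenset


# Hypothetical records, not taken from the IMDB data.
movies = [
    Movie("movie A", 7.1, frozenset({"sci-fi"})),
    Movie("movie B", 8.4, frozenset({"sci-fi", "drama"})),
    Movie("movie C", 9.0, frozenset({"drama"})),
    Movie("movie D", 6.2, frozenset({"sci-fi"})),
]


def top_rated(records, genre, k):
    """Filter by genre, sort by rating descending, keep the first k."""
    matching = [m for m in records if genre in m.genres]  # such-that[...]
    ranked = sorted(matching, key=lambda m: m.rating, reverse=True)  # reverse sort-by[rating]
    return ranked[:k]  # select[1,k]


for m in top_rated(movies, "sci-fi", 2):
    print(m.title, m.rating)
```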