Sunday, 28 June 2015

ebook letter frequencies

I wrote this one roughly a year ago, but figured I may as well add it to the blog. Given a set of ebooks (mostly from Project Gutenberg), find their letter frequencies. Not super interesting, but let's add it anyway.

Here is the code, and the resulting sw file.
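
For anyone reading without the linked script to hand, the counting step is easy to sketch. To be clear, this is not the original code, just a minimal Python illustration. The function name letter_counts and the sample string are made up for the example, and it assumes we only care about the 26 letters a-z, case-folded, with everything else ignored.

```python
from collections import Counter
import string


def letter_counts(text):
    """Count occurrences of a-z in text, ignoring case and all other characters.

    Hypothetical helper for illustration; not the function used in the post.
    """
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    # Report all 26 letters, including any that never occur.
    return {letter: counts[letter] for letter in string.ascii_lowercase}


# Made-up sample text; a real run would read a whole ebook file instead.
sample = "Hello, World!"
counts = letter_counts(sample)
print(counts["l"], counts["o"], counts["z"])
```

Running this per book and storing the 26 counts against the book's name gives something shaped like the letter-count data shown in the matrices.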

Now a couple of matrices in the console (rows are letters; the book names listed down the right label the columns, in order):
sa: load ebook-letter-counts.sw
sa: matrix[letter-count]
[ a ] = [  9083   26317  142241  23325  76232   35669  260565  35285  23871  ] [ Alice-in-Wonderland  ]
[ b ]   [  1621   4766   25476   4829   15699   6847   50138   6117   4763   ] [ Frankenstein         ]
[ c ]   [  2817   9055   37297   7379   21938   11349  72409   10725  6942   ] [ Gone-with-Wind       ]
[ d ]   [  5228   16720  85897   12139  37966   18763  144619  18828  15168  ] [ I-Robot              ]
[ e ]   [  15084  45720  228415  37293  117608  59029  440119  54536  37230  ] [ Moby-Dick            ]
[ f ]   [  2248   8516   34779   5940   20363   9936   73859   9105   6270   ] [ nineteen-eighty-four ]
[ g ]   [  2751   5762   38283   6037   20489   9113   61948   8023   6822   ] [ Shakespeare          ]
[ h ]   [  7581   19400  119901  16803  61947   28093  234301  28284  19130  ] [ Sherlock-Holmes      ]
[ i ]   [  7803   21411  101987  20074  62942   30304  214275  27361  18380  ] [ Tom-Sawyer           ]
[ j ]   [  222    431    1501    346    915     310    2955    421    465    ]
[ k ]   [  1202   1722   18290   2370   8011    3512   32029   3590   3136   ]
[ l ]   [  5053   12603  79783   12870  42338   18395  156371  17276  12426  ]
[ m ]   [  2245   10295  39595   6534   22871   10513  101507  11391  7255   ]
[ n ]   [  7871   24220  123989  21302  65429   31516  231652  29337  20858  ]
[ o ]   [  9245   25050  130230  24555  69648   34287  299732  34452  24251  ]
[ p ]   [  1796   5939   23979   5148   16553   8058   50638   6987   4766   ]
[ q ]   [  135    323    1270    321    1244    397    2998    416    182    ]
[ r ]   [  6400   20708  105074  17003  52446   25861  224994  25378  16262  ]
[ s ]   [  6980   20808  107430  18044  62734   28382  232317  27105  17852  ]
[ t ]   [  11631  29706  157163  28316  86983   42127  311911  39232  28389  ]
[ u ]   [  3867   10340  50453   9483   26933   12903  121631  13527  9376   ]
[ v ]   [  911    3788   15224   3062   8540    4252   36692   4471   2451   ]
[ w ]   [  2696   7335   43623   6761   21174   11225  78929   10754  7735   ]
[ x ]   [  170    675    1700    508    1037    779    4867    567    326    ]
[ y ]   [  2442   7743   37639   6552   16849   9071   90162   9267   6830   ]
[ z ]   [  79     243    1045    208    598     303    1418    150    155    ]

sa: norm |*> #=> normalize[100] letter-count |_self>
sa: map[norm,normalized-letter-count] rel-kets[letter-count]
sa: matrix[normalized-letter-count]
[ a ] = [  7.75   7.75   8.12   7.85   8.11   7.91   7.38   8.16   7.92   ] [ Alice-in-Wonderland  ]
[ b ]   [  1.38   1.4    1.45   1.62   1.67   1.52   1.42   1.41   1.58   ] [ Frankenstein         ]
[ c ]   [  2.4    2.67   2.13   2.48   2.34   2.52   2.05   2.48   2.3    ] [ Gone-with-Wind       ]
[ d ]   [  4.46   4.92   4.9    4.08   4.04   4.16   4.09   4.35   5.03   ] [ I-Robot              ]
[ e ]   [  12.87  13.46  13.04  12.55  12.52  13.09  12.46  12.61  12.36  ] [ Moby-Dick            ]
[ f ]   [  1.92   2.51   1.98   2.0    2.17   2.2    2.09   2.1    2.08   ] [ nineteen-eighty-four ]
[ g ]   [  2.35   1.7    2.18   2.03   2.18   2.02   1.75   1.85   2.26   ] [ Shakespeare          ]
[ h ]   [  6.47   5.71   6.84   5.65   6.59   6.23   6.63   6.54   6.35   ] [ Sherlock-Holmes      ]
[ i ]   [  6.66   6.3    5.82   6.75   6.7    6.72   6.06   6.32   6.1    ] [ Tom-Sawyer           ]
[ j ]   [  0.19   0.13   0.09   0.12   0.1    0.07   0.08   0.1    0.15   ]
[ k ]   [  1.03   0.51   1.04   0.8    0.85   0.78   0.91   0.83   1.04   ]
[ l ]   [  4.31   3.71   4.55   4.33   4.51   4.08   4.43   3.99   4.12   ]
[ m ]   [  1.92   3.03   2.26   2.2    2.43   2.33   2.87   2.63   2.41   ]
[ n ]   [  6.72   7.13   7.08   7.17   6.96   6.99   6.56   6.78   6.92   ]
[ o ]   [  7.89   7.38   7.43   8.26   7.41   7.6    8.48   7.96   8.05   ]
[ p ]   [  1.53   1.75   1.37   1.73   1.76   1.79   1.43   1.62   1.58   ]
[ q ]   [  0.12   0.1    0.07   0.11   0.13   0.09   0.08   0.1    0.06   ]
[ r ]   [  5.46   6.1    6.0    5.72   5.58   5.73   6.37   5.87   5.4    ]
[ s ]   [  5.96   6.13   6.13   6.07   6.68   6.29   6.58   6.27   5.93   ]
[ t ]   [  9.93   8.75   8.97   9.53   9.26   9.34   8.83   9.07   9.42   ]
[ u ]   [  3.3    3.04   2.88   3.19   2.87   2.86   3.44   3.13   3.11   ]
[ v ]   [  0.78   1.12   0.87   1.03   0.91   0.94   1.04   1.03   0.81   ]
[ w ]   [  2.3    2.16   2.49   2.27   2.25   2.49   2.23   2.49   2.57   ]
[ x ]   [  0.15   0.2    0.1    0.17   0.11   0.17   0.14   0.13   0.11   ]
[ y ]   [  2.08   2.28   2.15   2.2    1.79   2.01   2.55   2.14   2.27   ]
[ z ]   [  0.07   0.07   0.06   0.07   0.06   0.07   0.04   0.03   0.05   ]
sa: save ebook-letter-counts--normalized.sw
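
As a cross-check on what normalize[100] is doing, here is a small Python sketch that rescales a set of counts so they sum to 100. The normalize function below is my own stand-in, not the console's implementation; the data is the Alice-in-Wonderland column from the first matrix.

```python
def normalize(counts, total=100.0):
    """Rescale the values so they sum to `total` (stand-in for normalize[100])."""
    s = sum(counts.values())
    if s == 0:
        # Nothing to rescale; avoid dividing by zero.
        return {k: 0.0 for k in counts}
    return {k: total * v / s for k, v in counts.items()}


letters = "abcdefghijklmnopqrstuvwxyz"
# Alice-in-Wonderland column of the letter-count matrix above.
alice = dict(zip(letters, [
    9083, 1621, 2817, 5228, 15084, 2248, 2751, 7581, 7803, 222, 1202, 5053,
    2245, 7871, 9245, 1796, 135, 6400, 6980, 11631, 3867, 911, 2696, 170,
    2442, 79,
]))

percent = normalize(alice)
print(round(percent["a"], 2), round(percent["e"], 2))  # 7.75 12.87
```

Those two values agree with the first column of the normalized matrix, so the normalized entries are simply each letter's percentage share of the book's total letter count.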
And I guess that is it.

Update: while we are here, may as well give the simm matrix:
sa: simm |*> #=> 100 self-similar[letter-count] |_self>
sa: map[simm,simm-matrix] rel-kets[letter-count]
sa: matrix[simm-matrix]
[ Alice-in-Wonderland  ] = [  100.0  94.94  96.52  97.32  96.76  97.11  95.57  97.09  97.49  ] [ Alice-in-Wonderland  ]
[ Frankenstein         ]   [  94.94  100.0  95.97  96.01  95.22  96.48  95.24  96.52  95.54  ] [ Frankenstein         ]
[ Gone-with-Wind       ]   [  96.52  95.97  100.0  96.0   96.98  97.01  95.91  97.12  97.17  ] [ Gone-with-Wind       ]
[ I-Robot              ]   [  97.32  96.01  96.0   100.0  97.3   97.87  96.06  97.35  97.12  ] [ I-Robot              ]
[ Moby-Dick            ]   [  96.76  95.22  96.98  97.3   100.0  98.05  96.07  97.39  96.85  ] [ Moby-Dick            ]
[ nineteen-eighty-four ]   [  97.11  96.48  97.01  97.87  98.05  100.0  95.55  97.88  97.1   ] [ nineteen-eighty-four ]
[ Shakespeare          ]   [  95.57  95.24  95.91  96.06  96.07  95.55  100    97.08  95.89  ] [ Shakespeare          ]
[ Sherlock-Holmes      ]   [  97.09  96.52  97.12  97.35  97.39  97.88  97.08  100    97.54  ] [ Sherlock-Holmes      ]
[ Tom-Sawyer           ]   [  97.49  95.54  97.17  97.12  96.85  97.1   95.89  97.54  100    ] [ Tom-Sawyer           ]
So we see that English text has largely the same letter frequencies across different ebooks: every pair scores above 94%. That makes sense of course, but it is nice to see it visually.
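
To make those numbers a little less magic, here is a rough Python sketch of a rescaled similarity measure. It assumes the measure rescales both frequency lists to sum to 1 and then adds up the element-wise minima; that is my reading rather than a copy of the console's code, and the name rescaled_simm is invented for the example. The inputs are the Alice-in-Wonderland and Frankenstein columns of the normalized matrix.

```python
def rescaled_simm(f, g):
    """Assumed form of the similarity: rescale each input to sum to 1,
    then sum the element-wise minima. Returns a value in [0, 1]."""
    f_total = sum(f.values())
    g_total = sum(g.values())
    if f_total == 0 or g_total == 0:
        return 0.0
    keys = set(f) | set(g)
    return sum(min(f.get(k, 0) / f_total, g.get(k, 0) / g_total) for k in keys)


letters = "abcdefghijklmnopqrstuvwxyz"
# First two columns of the normalized-letter-count matrix above.
alice = dict(zip(letters, [
    7.75, 1.38, 2.4, 4.46, 12.87, 1.92, 2.35, 6.47, 6.66, 0.19, 1.03, 4.31,
    1.92, 6.72, 7.89, 1.53, 0.12, 5.46, 5.96, 9.93, 3.3, 0.78, 2.3, 0.15,
    2.08, 0.07,
]))
frankenstein = dict(zip(letters, [
    7.75, 1.4, 2.67, 4.92, 13.46, 2.51, 1.7, 5.71, 6.3, 0.13, 0.51, 3.71,
    3.03, 7.13, 7.38, 1.75, 0.1, 6.1, 6.13, 8.75, 3.04, 1.12, 2.16, 0.2,
    2.28, 0.07,
]))

# Scale to a percentage, as the "100 self-similar[...]" rule does.
print(100 * rescaled_simm(alice, frankenstein))
```

Because the inputs here are already rounded to two decimal places, the result only lands within rounding error of the 94.94 entry in the matrix, but it is close enough to suggest the assumed form is the right one.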

It would also be nice to have an "unscaled-similar[op]" operator. The problem is that it would require an entirely new function in the new_context class, which I am reluctant to add, since unscaled simm is a rare use case. For now, it can be done on special occasions by changing the simm function in new_context.pattern_recognition() to unscaled_simm(A,B).
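
For a feel of why the unscaled version would behave differently, here is a toy Python sketch. It assumes the unscaled measure is the sum of element-wise minima divided by the larger of the two totals, with no rescaling first. The function below is a hypothetical stand-in that borrows the unscaled_simm name; it is not the project's actual function, and the two tiny "books" are made-up data.

```python
def unscaled_simm(f, g):
    """Assumed unscaled similarity: sum of element-wise minima divided by
    the larger total. Unlike a rescaled measure, overall size matters."""
    largest = max(sum(f.values()), sum(g.values()))
    if largest == 0:
        return 0.0
    keys = set(f) | set(g)
    return sum(min(f.get(k, 0), g.get(k, 0)) for k in keys) / largest


# Toy data: identical letter proportions, but one "book" is twice as long.
short_book = {"a": 2, "b": 2}
long_book = {"a": 4, "b": 4}

print(unscaled_simm(short_book, long_book))   # 0.5
print(unscaled_simm(short_book, short_book))  # 1.0
```

So on raw letter counts, an unscaled measure would mostly report how different the books are in length, whereas a rescaled one compares only the proportions.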
