Wednesday 5 August 2015

start and end chars for 3grams that precede a full stop

Another quick one. Not super useful, but I feel like doing it anyway: the start and end characters for the 3grams that precede both commas and full stops.

First, we need a new function operator (note it is not perfect yet, but will do for now):
# select-chars[3,4,7] |abcdefgh> == |cdg>
#
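# one is a ket, positions is a comma separated string of 1-indexed positions
# positions past the end of the label are skipped
# position 0 maps to index -1, ie the last char (used by EC below)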
def select_chars(one,positions):
  try:
    positions = positions.split(",")
    chars = list(one.label)
    text = "".join(chars[int(x)-1] for x in positions if int(x) <= len(chars))
    return ket(text)
  except Exception:
    return ket("",0)
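To try the operator outside the console, here is a minimal standalone sketch. The Ket class is a hypothetical stand-in for the project's ket (just a label and a coefficient), and catching only ValueError and IndexError is my assumption about which failures matter here:

```python
class Ket:
    """Hypothetical stand-in for the project's ket: a label plus a coefficient."""
    def __init__(self, label, value=1):
        self.label = label
        self.value = value


def select_chars(one, positions):
    """Pick the 1-indexed characters listed in `positions` from the ket label."""
    try:
        chars = list(one.label)
        picked = positions.split(",")
        # positions past the end are skipped; position 0 becomes index -1 (last char)
        text = "".join(chars[int(x) - 1] for x in picked if int(x) <= len(chars))
        return Ket(text)
    except (ValueError, IndexError):
        # non-numeric position, or nothing to index into
        return Ket("", 0)


print(select_chars(Ket("abcdefgh"), "3,4,7").label)  # cdg
print(select_chars(Ket("the"), "1").label)           # t
print(select_chars(Ket("the"), "0").label)           # e
```

The last call shows why select-chars[0] works as an end-char operator: position 0 turns into Python index -1.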
Now we can do this:
sa: load ngram-letter-pairs--sherlock-holmes.sw
sa: find-inverse[next-2-letters]
sa: SC |*> #=> select-chars[1] |_self>
sa: EC |*> #=> select-chars[0] |_self>

sa: table[start-char,coeff] ket-sort SC common[inverse-next-2-letters] (|, > + |. >)
+------------+-------+
| start-char | coeff |
+------------+-------+
| 2          | 1     |
| 3          | 1     |
| 4          | 1     |
|            | 18    |
| "          | 1     |
| '          | 1     |
| -          | 2     |
| a          | 54    |
| b          | 9     |
| c          | 10    |
| d          | 17    |
| e          | 49    |
| f          | 7     |
| F          | 1     |
| g          | 10    |
| h          | 19    |
| i          | 55    |
| I          | 1     |
| k          | 6     |
| l          | 22    |
| L          | 1     |
| m          | 13    |
| n          | 29    |
| o          | 53    |
| p          | 10    |
| q          | 1     |
| r          | 34    |
| s          | 23    |
| t          | 24    |
| u          | 27    |
| v          | 7     |
| w          | 5     |
| W          | 1     |
| x          | 1     |
| y          | 6     |
| Y          | 1     |
| z          | 1     |
+------------+-------+

sa: table[end-char,coeff] ket-sort EC common[inverse-next-2-letters] (|, > + |. >)
+----------+-------+
| end-char | coeff |
+----------+-------+
| 3        | 1     |
| 4        | 1     |
| 5        | 1     |
| a        | 8     |
| A        | 1     |
| b        | 1     |
| c        | 2     |
| d        | 43    |
| e        | 82    |
| f        | 8     |
| g        | 8     |
| h        | 21    |
| I        | 2     |
| k        | 16    |
| l        | 26    |
| m        | 14    |
| n        | 38    |
| o        | 15    |
| p        | 12    |
| r        | 33    |
| s        | 82    |
| t        | 44    |
| u        | 2     |
| w        | 9     |
| x        | 3     |
| y        | 49    |
+----------+-------+
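For anyone without the console handy, the two tables can be roughly approximated in plain Python. This is only a sketch under a couple of assumptions: that |, > and |. > mean a comma or full stop followed by a space, and that the coeff column counts distinct 3grams. The function names and the sample text are mine, not part of the project:

```python
from collections import Counter, defaultdict


def inverse_next_2_letters(text):
    """Map each 2-character string to the set of distinct 3grams seen just before it."""
    inverse = defaultdict(set)
    for i in range(len(text) - 4):
        inverse[text[i + 3:i + 5]].add(text[i:i + 3])
    return inverse


def char_tables(text, followers=(", ", ". ")):
    """Count start and end characters of the 3grams that precede every follower."""
    inverse = inverse_next_2_letters(text)
    # rough equivalent of common[inverse-next-2-letters] (|, > + |. >)
    shared = set.intersection(*(inverse[f] for f in followers))
    start = Counter(gram[0] for gram in shared)
    end = Counter(gram[-1] for gram in shared)
    return start, end


sample = "I saw the cat, and the cat. He fed the dog, not the dog. "
start, end = char_tables(sample)
print(sorted(start.items()))  # [('c', 1), ('d', 1)]
print(sorted(end.items()))    # [('g', 1), ('t', 1)]
```

In the sample only "cat" and "dog" precede both a comma and a full stop, so each start and end character gets a count of 1.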
I don't think these tables are super useful, though knowing which characters are allowed to precede a full stop is mildly interesting. Note that only two capital letters, "A" and "I", show up in the end-char table.

To pick a rather random example of why this might be interesting, consider "C. elegans". Since a C followed by a dot is rare in text, we can guess that "C." marks an abbreviation rather than the end of a sentence.

Doh! So much for that idea. Here is the table when we only look at 3grams that precede a full stop, i.e. we no longer require that they also precede a comma:
sa: table[end-char,coeff] ket-sort EC inverse-next-2-letters |. >
+----------+-------+
| end-char | coeff |
+----------+-------+
| 0        | 1     |
| 1        | 4     |
| 2        | 3     |
| 3        | 5     |
| 4        | 3     |
| 5        | 4     |
| 6        | 2     |
| 7        | 1     |
| 8        | 2     |
| 9        | 1     |
| )        | 1     |
| a        | 15    |
| A        | 2     |
| b        | 1     |
| B        | 2     |
| c        | 4     |
| C        | 2     |
| d        | 46    |
| D        | 2     |
| e        | 94    |
| E        | 3     |
| f        | 8     |
| F        | 1     |
| g        | 13    |
| h        | 29    |
| H        | 4     |
| I        | 12    |
| J        | 1     |
| k        | 17    |
| K        | 5     |
| l        | 31    |
| L        | 1     |
| m        | 22    |
| n        | 46    |
| o        | 18    |
| p        | 19    |
| q        | 1     |
| r        | 44    |
| R        | 1     |
| s        | 98    |
| S        | 5     |
| t        | 55    |
| T        | 1     |
| u        | 2     |
| U        | 1     |
| V        | 3     |
| w        | 10    |
| X        | 2     |
| x        | 4     |
| y        | 64    |
+----------+-------+
Hrmm... lots of capitals in there this time, though they do have lower frequency than the lower case letters. Still, it breaks what I was just saying above.
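To put a rough number on "lower frequency", here is a quick tally of the full-stop-only table by character class. The counts are copied by hand from that table; the char_class helper is just my own grouping for illustration:

```python
from collections import Counter

# end-char counts copied by hand from the full-stop-only table above
end_char_counts = {
    "0": 1, "1": 4, "2": 3, "3": 5, "4": 3, "5": 4, "6": 2, "7": 1, "8": 2, "9": 1,
    ")": 1,
    "a": 15, "A": 2, "b": 1, "B": 2, "c": 4, "C": 2, "d": 46, "D": 2, "e": 94,
    "E": 3, "f": 8, "F": 1, "g": 13, "h": 29, "H": 4, "I": 12, "J": 1, "k": 17,
    "K": 5, "l": 31, "L": 1, "m": 22, "n": 46, "o": 18, "p": 19, "q": 1, "r": 44,
    "R": 1, "s": 98, "S": 5, "t": 55, "T": 1, "u": 2, "U": 1, "V": 3, "w": 10,
    "X": 2, "x": 4, "y": 64,
}


def char_class(ch):
    """Group a character as upper, lower, digit or other."""
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    if ch.isdigit():
        return "digit"
    return "other"


totals = Counter()
for ch, count in end_char_counts.items():
    totals[char_class(ch)] += count

grand_total = sum(totals.values())
for name in ("upper", "lower", "digit", "other"):
    print(name, totals[name], f"{100 * totals[name] / grand_total:.1f}%")
```

On these counts the capitals add up to 48 of 716 3grams against 641 for lower case, so a capital before a full stop is uncommon but far from rare enough to settle the "C." question on its own.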
