First, we need a new function operator (note it is not perfect yet, but will do for now):
# select-chars[3,4,7] |abcdefgh> == |cdg>
#
# one is a ket
def select_chars(one,positions):
    try:
        positions = positions.split(",")
        chars = list(one.label)
        text = "".join(chars[int(x)-1] for x in positions if int(x) <= len(chars))
        return ket(text)
    except:
        return ket("",0)

Now we can do this:
sa: load ngram-letter-pairs--sherlock-holmes.sw
sa: find-inverse[next-2-letters]
sa: SC |*> #=> select-chars[1] |_self>
sa: EC |*> #=> select-chars[0] |_self>

sa: table[start-char,coeff] ket-sort SC common[inverse-next-2-letters] (|, > + |. >)
+------------+-------+
| start-char | coeff |
+------------+-------+
| 2          | 1     |
| 3          | 1     |
| 4          | 1     |
|            | 18    |
| "          | 1     |
| '          | 1     |
| -          | 2     |
| a          | 54    |
| b          | 9     |
| c          | 10    |
| d          | 17    |
| e          | 49    |
| f          | 7     |
| F          | 1     |
| g          | 10    |
| h          | 19    |
| i          | 55    |
| I          | 1     |
| k          | 6     |
| l          | 22    |
| L          | 1     |
| m          | 13    |
| n          | 29    |
| o          | 53    |
| p          | 10    |
| q          | 1     |
| r          | 34    |
| s          | 23    |
| t          | 24    |
| u          | 27    |
| v          | 7     |
| w          | 5     |
| W          | 1     |
| x          | 1     |
| y          | 6     |
| Y          | 1     |
| z          | 1     |
+------------+-------+

sa: table[end-char,coeff] ket-sort EC common[inverse-next-2-letters] (|, > + |. >)
+----------+-------+
| end-char | coeff |
+----------+-------+
| 3        | 1     |
| 4        | 1     |
| 5        | 1     |
| a        | 8     |
| A        | 1     |
| b        | 1     |
| c        | 2     |
| d        | 43    |
| e        | 82    |
| f        | 8     |
| g        | 8     |
| h        | 21    |
| I        | 2     |
| k        | 16    |
| l        | 26    |
| m        | 14    |
| n        | 38    |
| o        | 15    |
| p        | 12    |
| r        | 33    |
| s        | 82    |
| t        | 44    |
| u        | 2     |
| w        | 9     |
| x        | 3     |
| y        | 49    |
+----------+-------+

I don't think this is super useful, though knowing which characters are allowed to precede a full stop is mildly interesting. Note that only two capital letters, "A" and "I", show up in the end-char table.
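As an aside, the EC operator works because of how select-chars treats position 0: the code subtracts 1 from each position, and Python reads index -1 as "last element". Here is a minimal standalone sketch of that behaviour. Note the Ket class below is a hypothetical stand-in for the project's real ket class (just a label plus a coefficient), so treat it as an illustration rather than the actual implementation:

```python
class Ket:
    """Hypothetical stand-in for the real ket class: a label and a coefficient."""
    def __init__(self, label, value=1):
        self.label = label
        self.value = value


def select_chars(one, positions):
    # Same logic as the operator above, written against the stand-in Ket.
    try:
        chars = list(one.label)
        # Positions are 1-based, so position 0 becomes index -1 (the last char).
        text = "".join(chars[int(x) - 1]
                       for x in positions.split(",")
                       if int(x) <= len(chars))
        return Ket(text)
    except Exception:
        # Bad position strings, or position 0 on an empty label, end up here.
        return Ket("", 0)


print(select_chars(Ket("abcdefgh"), "3,4,7").label)  # cdg
print(select_chars(Ket("abcdefgh"), "1").label)      # a  (what SC uses)
print(select_chars(Ket("abcdefgh"), "0").label)      # h  (what EC uses)
```

This is part of why the operator is "not perfect yet": position 0 meaning "last character" is a side effect of negative indexing, not a designed feature.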
To pick a rather random example of why this might be interesting, consider "C. elegans". Since a "C" followed by a dot is rare in text, we can guess that "C." probably marks an abbreviation rather than the end of a sentence.
Doh! So much for that idea. Here is the table when we only look at characters that precede the full stop, i.e. we no longer require that they also precede a comma:
sa: table[end-char,coeff] ket-sort EC inverse-next-2-letters |. >
+----------+-------+
| end-char | coeff |
+----------+-------+
| 0        | 1     |
| 1        | 4     |
| 2        | 3     |
| 3        | 5     |
| 4        | 3     |
| 5        | 4     |
| 6        | 2     |
| 7        | 1     |
| 8        | 2     |
| 9        | 1     |
| )        | 1     |
| a        | 15    |
| A        | 2     |
| b        | 1     |
| B        | 2     |
| c        | 4     |
| C        | 2     |
| d        | 46    |
| D        | 2     |
| e        | 94    |
| E        | 3     |
| f        | 8     |
| F        | 1     |
| g        | 13    |
| h        | 29    |
| H        | 4     |
| I        | 12    |
| J        | 1     |
| k        | 17    |
| K        | 5     |
| l        | 31    |
| L        | 1     |
| m        | 22    |
| n        | 46    |
| o        | 18    |
| p        | 19    |
| q        | 1     |
| r        | 44    |
| R        | 1     |
| s        | 98    |
| S        | 5     |
| t        | 55    |
| T        | 1     |
| u        | 2     |
| U        | 1     |
| V        | 3     |
| w        | 10    |
| X        | 2     |
| x        | 4     |
| y        | 64    |
+----------+-------+

Hrmm... lots of capitals in there this time. Though they do have lower frequency than lower case. But still, breaks what I was just saying above.
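To put a rough number on "lower frequency than lower case", here is a small Python sketch that totals the coefficients in the table above by character class. The counts are copied by hand from that table; the variable names are just for illustration and nothing here is part of the sa: console:

```python
# Coefficients copied from the end-char table above
# (characters that directly precede ". " in the corpus).
end_char_counts = {
    "0": 1, "1": 4, "2": 3, "3": 5, "4": 3, "5": 4, "6": 2, "7": 1,
    "8": 2, "9": 1, ")": 1,
    "a": 15, "A": 2, "b": 1, "B": 2, "c": 4, "C": 2, "d": 46, "D": 2,
    "e": 94, "E": 3, "f": 8, "F": 1, "g": 13, "h": 29, "H": 4, "I": 12,
    "J": 1, "k": 17, "K": 5, "l": 31, "L": 1, "m": 22, "n": 46, "o": 18,
    "p": 19, "q": 1, "r": 44, "R": 1, "s": 98, "S": 5, "t": 55, "T": 1,
    "u": 2, "U": 1, "V": 3, "w": 10, "X": 2, "x": 4, "y": 64,
}

# Split the total into upper case, lower case, and everything else.
upper = sum(n for ch, n in end_char_counts.items() if ch.isupper())
lower = sum(n for ch, n in end_char_counts.items() if ch.islower())
other = sum(n for ch, n in end_char_counts.items() if not ch.isalpha())
total = upper + lower + other

print("upper:", upper, "lower:", lower, "other:", other, "total:", total)
print("upper-case share: %.1f%%" % (100.0 * upper / total))
```

So capitals before a full stop are a small minority of the counts, but with 17 distinct capital letters present (including "C") they are clearly not rare enough to treat "capital letter plus dot" as a reliable abbreviation signal on its own.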