Sunday, 19 July 2015

introducing the ngram stitch

Otherwise known as the Rambler algo. The basic outline: you take a big corpus of conversational text, e.g. from a web-board, process it a little, and then the algo creative-writes/rambles.

I'll just give the algo for the 3/5 ngram stitch, but it should extend in the obvious way to other p/q.
Simply:
```
extract all the 5-grams from your seed text
loop {
  extract the last 3 words from string
  find the set of 5-grams that start with those 3 words and pick one randomly
  append the last 2 words of that 5-gram to string
}
```
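As a rough illustration, the loop above can be sketched in plain Python, with a dictionary standing in for the learned rules. The names `build_table` and `stitch`, and the `p`/`q`/`steps` parameters, are made up for this sketch and are not part of the project code; `p=3, q=5` gives the 3/5 stitch, and other values should give the other p/q variants.

```python
import random
from collections import defaultdict

def build_table(words, p=3, q=5):
    # map each p-word head to the list of (q - p)-word tails that follow it
    table = defaultdict(list)
    for i in range(len(words) - q + 1):
        head = " ".join(words[i:i + p])
        tail = " ".join(words[i + p:i + q])
        table[head].append(tail)
    return table

def stitch(seed, table, p=3, steps=10, rng=random):
    # repeatedly look up the last p words and append a randomly chosen tail
    out = seed.split()
    for _ in range(steps):
        tails = table.get(" ".join(out[-p:]))
        if not tails:
            break  # no known continuation, so stop rambling
        out.extend(rng.choice(tails).split())
    return " ".join(out)

# toy corpus: every head has exactly one tail, so the result is deterministic
words = "a b c d e f g h i".split()
table = build_table(words)
result = stitch("a b c", table)
print(result)
```

With a real corpus most heads have several tails, so each run wanders off in a different direction.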
Then we use this code to find our n-grams:
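```
import re

def create_ngram_pairs(s):
    # s is a list of words; returns [3-word head, 2-word tail] pairs, one per 5-gram
    return [[" ".join(s[i:i+3])," ".join(s[i+3:i+5])] for i in range(len(s) - 4)]

# learn ngram pairs:
def learn_ngram_pairs(context,filename):
    with open(filename,'r') as f:
        text = f.read()
        # remove <, |, > and = so the text cannot clash with ket notation
        words = re.sub('[<|>=]','',text)
        for ngram_pairs in create_ngram_pairs(words.split()):
            try:
                head,tail = ngram_pairs
                # assumption: the original line here was lost; add_learn is taken
                # to append tail to the "next-2" rule for head
                context.add_learn("next-2",head,tail)
            except:
                continue

# C (the context) and filename (the corpus file) are assumed to be defined earlier
learn_ngram_pairs(C,filename)

dest = "sw-examples/ngram-pairs--webboard.sw"
save_sw(C,dest)
```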
Some example learn rules in that sw file are:
```
next-2 |Looking forward to> => |that. it> + |doing something> + |it. I> + |when the> + |the Paranoid> + |tomorrow's. ("flow",> + |seeing The> + |tomorrow. 3.1415926...can't> + |you posting> + |the "Geometric> + |it. Breaking> + |being a> + |Joe Biden>
next-2 |forward to that.> => |it was>
next-2 |to that. it> => |was 4>
next-2 |that. it was> => |4 below> + |only 100db>
next-2 |it was 4> => |below zero> + |years ago>
next-2 |was 4 below> => |zero maybe>
```
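To see where rules like these come from, here is a small sketch that runs the pair extraction on a single sentence. The sentence is one I pieced together from the rules above for illustration, not a quote from the corpus, and the print format only imitates learn rules; it is not the real `save_sw` output.

```python
def create_ngram_pairs(s):
    # s is a list of words; each pair is [3-word head, 2-word tail]
    return [[" ".join(s[i:i+3]), " ".join(s[i+3:i+5])] for i in range(len(s) - 4)]

# illustrative sentence, reconstructed from the example rules
sentence = "Looking forward to that. it was 4 below zero maybe"
pairs = create_ngram_pairs(sentence.split())

for head, tail in pairs:
    # learn-rule-like form, for comparison with the sw file only
    print("next-2 |%s> => |%s>" % (head, tail))
```

A 10-word sentence gives 6 pairs, one per 5-word window. In the real sw file the tails for a shared head pile up into a superposition, which is why `|Looking forward to>` has so many options.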
And then we need this function operator:
```
# extract-3-tail |a b c d e f g h> == |f g h>
#
# assumes one is a ket
def extract_3_tail(one):
    split_str = one.label.rsplit(' ',3)
    if len(split_str) < 4:
        return one
    return ket(" ".join(split_str[1:]))
```
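If you want to try the tail extraction outside the project, here is a minimal sketch. The `Ket` class is a hypothetical stand-in that only holds a label; the real ket class does a lot more. It also shows the short-label case, where the ket comes back unchanged.

```python
class Ket:
    # hypothetical stand-in for the project's ket class: just holds a label
    def __init__(self, label):
        self.label = label

def extract_3_tail(one):
    # split off at most the last 3 words; fewer than 4 pieces means
    # the label has 3 words or less, so return it as is
    split_str = one.label.rsplit(' ', 3)
    if len(split_str) < 4:
        return one
    return Ket(" ".join(split_str[1:]))

long_tail = extract_3_tail(Ket("a b c d e f g h")).label
short_tail = extract_3_tail(Ket("a b")).label
print(long_tail)
print(short_tail)
```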
Then after all that preparation, our Rambler algo simplifies to:
```
ramble |*> #=> merge-labels(|_self> + | > + pick-elt next-2 extract-3-tail |_self>)
```
Examples in the next post.

BTW, I find it interesting that we can compact down the Rambler algo to 1 line of BKO.