The Semantic DB Project: edit distance in BKO

I've been meaning to do this one for a long time now! Finally got around to it today. Not very hard. So, start with the wikipedia page. Now, we need to implement these 3 operators:

    Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted ε→x, using ε to denote the empty string.
    Deletion of a single symbol changes uxv to uv (x→ε).
    Substitution of a single symbol x for a symbol y ≠ x changes uxv to uyv (x→y).

A few lines of python, and now we can reproduce the example on the wikipage:

The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is:

    kitten → sitten (substitution of "s" for "k")
    sitten → sittin (substitution of "i" for "e")
    sittin → sitting (insertion of "g" at the end).

LCS distance (insertions and deletions only) gives a different distance and minimal edit script:

    delete k at 0
    insert s at 0
    delete e at 4
    insert i at 4
    insert g at 6

for a total cost/distance of 5 operations.

And now reproduce that example in BKO:

-- Levenshtein distance:
sa: insert[g,6] substitute[i,e,4] substitute[s,k,0] |kitten>
|sitting>

-- LCS distance:
sa: insert[g,6] insert[i,4] delete[e,4] insert[s,0] delete[k,0] |kitten>
|sitting>

And now a couple of comments. First, it seems BKO operators are a natural way to represent these text operators. The second is that the number of operators from starting superposition (in the above example, kets), to the final one seems to form a natural kind of distance metric. This being general, not just for the above 3 operators, but all operators in our project. For example, when you say something is a deep thought, in BKO that would correspond to a very long op sequence from starting facts. And related to that, the brain must be doing something interesting, because for a computer, finding the exact op sequence that matches starting point to desired destination is vastly expensive big O wise. How the brain seems to do this efficiently, I don't know. I mean, it is an extreme example, but consider the 100 page proof of Fermat's last theorem. If we convert that to a big op sequence, naive search to find it would be impossible. I guess humans must divide the problem into smaller problems which require smaller op sequences to find, but still it is rather amazing!

And oh yeah, I think eventually we will find a natural way to encode mathematics using the BKO structure. For example a proof is just an operator (itself constructed from an op sequence) from one superposition (in this case starting axioms and knowledge) to a desired end superposition. And math objects such as metrics or groups correspond to particular network structures. And possibly homomorphisms are maps between objects with similar network structures.

Another thing I should mention is I'm not a big fan of always using edit distance to measure similarity of two strings. Sure, seems to work great for spell check, which so far I can't reproduce well using my simm. But for longer text I prefer my similarity metric. I guess the best link I can give is to this blog post, and this code.

BTW, here was my attempt to use similar[op] for spell check:

sa: load common-English-words.sw
sa: word2ngram-op |*> #=> letter-ngrams[1,2,3] |_self>
sa: map[word2ngram-op,word2ngram] list-of |common English words>
sa: map[word2ngram-op,word2ngram] |elefant>
sa: table[word,coeff] select[1,30] 100 self-similar[word2ngram] |elefant>
+-----------+--------+
| word      | coeff  |
+-----------+--------+
| elefant   | 100.0  |
| elegant   | 66.667 |
| mantelet  | 57.937 |
| elephant  | 57.143 |
| relevant  | 57.143 |
| fantail   | 55.556 |
| infante   | 55.556 |
| teleran   | 55.556 |
| leant     | 52.778 |
| inelegant | 51.389 |
| infantile | 51.389 |
| antler    | 51.111 |
| cantle    | 51.111 |
| levant    | 51.111 |
| mantel    | 51.111 |
| mantle    | 51.111 |
| anele     | 50.0   |
| antefix   | 50.0   |
| defiant   | 50.0   |
| element   | 50.0   |
| fantasm   | 50.0   |
| fantast   | 50.0   |
| fantasy   | 50.0   |
| fantom    | 50.0   |
| gantlet   | 50.0   |
| infant    | 50.0   |
| infanta   | 50.0   |
| pantile   | 50.0   |
| relent    | 50.0   |
| reliant   | 50.0   |
+-----------+--------+

So sort of works. But we have things in there like "mantelet" that has a high match because of "ele" and "ant". This is from the 3-grams, and the fact that we have no real ordering info. So "ele" + "ant" is just as similar as "ant" + "ele". I guess an improvement is not to run the similar[op] on ngrams, but first convert letters into how they are pronounced. And then run similar[op] on that. I don't know the easiest way to do that.

BTW, here is how I made the "common-English-words.sw" file:

-- download data from here: http://icon.shef.ac.uk/Moby/mwords.html
-- then run this script:
$ ./list-to-sp.sh "list-of |common English words> => " 74550com.mon > common-English-words.sw

Update: ahh... one definition of intelligence is the speed with which an agent finds the op sequence that maps from the current state (superposition) to the desired end state (superposition). Especially for the case of non-trivial op sequences. There are other possible definitions. eg, the speed with which an agent finds a good internal representation (ie a well constructed network) for a concept.

BTW, a comment on "well constructed network". This is probably the difference between rote learning a concept (which corresponds to a poorly constructed network), and having a deep understanding of a concept.

The Semantic DB Project

Thursday, 17 December 2015

edit distance in BKO

No comments:

Post a Comment