Monday, 16 March 2015

some wage prediction results

OK. Using the resulting sw file from last time, let's check out how well our proposed algo works. Decided to do a run with a random sample of 1000 test cases.

Here is the BKO, but I have since written a tidier version:
```
sa: load adult-wage-pattern-recognition.sw
sa: norm |above-50K> => .000127534753220 |_self>
sa: norm |below-50K> => .000040453074433 |_self>
sa: norm-h4 |*> #=> normalize[100] coeff-sort norm M select[1,5] similar[input-pattern,pattern] |_self>
sa: equal? |*> #=> equal(norm-h4|_self>,100 answer|_self>)
sa: is-equal? |*> #=> if(is-greater-equal-than[0.5] push-float equal? |_self>,equal? |_self>,|False>)
```
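For readers who don't speak BKO, here is a rough Python sketch of what I understand norm-h4 and is-equal? to be doing, reading the operators right to left. The function names, the `LABEL_COUNTS` dictionary and the list-of-pairs representation are made up for illustration; they are not part of the actual implementation. The label counts are the ones from the training set discussed in the notes below, and the sample input is the top of the example-9861 similarity table shown further down (coefficients divided by 100).

```python
from collections import defaultdict

# Training-set label counts (see note 2 below); norm is assumed to be 1/count.
LABEL_COUNTS = {"above-50K": 7841, "below-50K": 24720}

def norm_h4(matches, top_n=5):
    """matches: list of (similarity, label) pairs, one per stored pattern."""
    # select[1,5]: keep only the best top_n matches
    best = sorted(matches, key=lambda m: m[0], reverse=True)[:top_n]
    # M then norm: map each match to its label, weighted by 1/label-count
    totals = defaultdict(float)
    for similarity, label in best:
        totals[label] += similarity / LABEL_COUNTS[label]
    if not totals:
        return []
    # coeff-sort then normalize[100]: sort by weight, rescale so weights sum to 100
    scale = 100.0 / sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, weight * scale) for label, weight in ranked]

def is_equal(prediction, answer, threshold=0.5):
    """Share of the prediction that sits on the known answer, or False if below threshold."""
    share = dict(prediction).get(answer, 0.0) / 100.0
    return share if share >= threshold else False

# Top of the similarity table for example-9861 (known answer: below-50K)
matches = [
    (0.99998, "above-50K"), (0.99998, "above-50K"), (0.99998, "below-50K"),
    (0.99997, "below-50K"), (0.99997, "below-50K"), (0.99997, "below-50K"),
]
prediction = norm_h4(matches)
for label, weight in prediction:
    print(f"{weight:.2f} {label}")        # 67.76 above-50K, then 32.24 below-50K
print(is_equal(prediction, "below-50K"))  # False
```

That reproduces the example-9861 row in the result table below: two above-50K votes out of the top five are enough to outweigh three below-50K votes once the 1/label-count weighting is applied.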
Now, due to recalculating norm-h4 (and hence similar[op]) a bunch of times, this code was rather slow! It took over a day to finish. So for the next attempt, which I'm going to run against the full test suite, I'm going to do some precomputation:
```
-- keeping only 100 of the best matching patterns should be sufficient.
simm |*> #=> select[1,100] similar[input-pattern,pattern] |_self>
map[simm,similarity-result] rel-kets[input-pattern] |>
```
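The idea behind the precomputation, sketched in Python under the assumption that the similarity search is the only expensive step: run it once per input, store the best 100 matches, and let every later variant of h read from the stored result. `precompute`, `top_n` and `fake_similar` are hypothetical names for this illustration only, and `fake_similar` is a stand-in for the real similar[input-pattern,pattern].

```python
def precompute(similar_fn, input_names, keep=100):
    """Run the expensive similarity search once per input, keeping the best `keep` matches."""
    return {name: similar_fn(name)[:keep] for name in input_names}

def top_n(cache, name, n):
    """Cheap stand-in for select[1,n] applied to the stored similarity-result."""
    return cache[name][:n]

# Stand-in for the slow search: records each call, returns matches sorted best first.
calls = []
def fake_similar(name):
    calls.append(name)
    return [(1.0 - i / 1000, f"node-{i}") for i in range(500)]

cache = precompute(fake_similar, ["example-1", "example-2"])

# Trying several values of n no longer triggers any new similarity searches.
for n in (5, 10, 15):
    assert len(top_n(cache, "example-1", n)) == n
print(len(calls))  # 2
```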
Now, what results did I get from the random sample of 1000 test cases? 761 correct, i.e. 76.1%. I'd like mid eighties, but still not too bad. Unfortunately, I can't think of any easy tweak to improve the results. Besides, when I looked at the similarity results, it was apparent that superpositions with 14 elements just don't have enough data points for a data set this large.

The full result table is here.
```
-- a sample of that table:
+---------------+----------------------------------+-----------+-----------+
| input         | norm-h4                          | answer    | is-equal? |
+---------------+----------------------------------+-----------+-----------+
| example-5707  | 55.92 below-50K, 44.08 above-50K | below-50K | 0.56 True |
| example-9629  | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
| example-14039 | 82.55 above-50K, 17.45 below-50K | below-50K | False     |
| example-731   | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
| example-5362  | 100.00 below-50K                 | below-50K | 1.00 True |
| example-8     | 67.76 above-50K, 32.24 below-50K | above-50K | 0.68 True |
| example-5354  | 100 below-50K                    | below-50K | True      |
| example-9825  | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
| example-14761 | 100 below-50K                    | below-50K | True      |
| example-14586 | 55.92 below-50K, 44.08 above-50K | below-50K | 0.56 True |
| example-8934  | 100 below-50K                    | below-50K | True      |
| example-11743 | 100 below-50K                    | below-50K | True      |
| example-10676 | 100 above-50K                    | above-50K | True      |
| example-9861  | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
| example-7228  | 55.92 below-50K, 44.08 above-50K | below-50K | 0.56 True |
| example-8039  | 82.54 above-50K, 17.46 below-50K | below-50K | False     |
| example-11380 | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
| example-7476  | 55.92 below-50K, 44.08 above-50K | below-50K | 0.56 True |
| example-10797 | 82.55 above-50K, 17.45 below-50K | above-50K | 0.83 True |
| example-13409 | 82.54 above-50K, 17.46 below-50K | above-50K | 0.83 True |
| example-14147 | 100 above-50K                    | above-50K | True      |
| example-4809  | 100 below-50K                    | below-50K | True      |
| example-5080  | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
...
```
Some notes:
1) Here is an example similarity result. Note that large numbers of superpositions are essentially identical, making it hard for my particular algo. Recall that my webpage example had superpositions with around 2000 distinct kets, compared to just 14 here.
```
sa: table[input,coeff,M] select[1,15] 100 similar[input-pattern,pattern] |example-9861>
+------------+--------+-----------+
| input      | coeff  | M         |
+------------+--------+-----------+
| node-16657 | 99.998 | above-50K |
| node-24189 | 99.998 | above-50K |
| node-18407 | 99.998 | below-50K |
| node-9655  | 99.997 | below-50K |
| node-23605 | 99.997 | below-50K |
| node-821   | 99.997 | below-50K |
| node-18534 | 99.997 | below-50K |
| node-12712 | 99.997 | above-50K |
| node-4948  | 99.997 | above-50K |
| node-7466  | 99.997 | above-50K |
| node-10614 | 99.997 | above-50K |
| node-16614 | 99.997 | below-50K |
| node-4179  | 99.997 | below-50K |
| node-15637 | 99.997 | above-50K |
| node-12012 | 99.997 | below-50K |
+------------+--------+-----------+
```
Point is, this data set fails this pattern recognition requirement (since above-50K and below-50K are not easily distinguishable):
`  3) distinctive means different object types have easily distinguishable superpositions`
2) In trying to find a good mapping function h, on first attempt I used:
`h |*> #=> coeff-sort M similar[input-pattern,pattern] |_self>`
and got terrible results! Everything was classified as "below-50K". The fix was to take into account the relative numbers of below-50K (which was the vast majority) and above-50K. I had a look at the training data set:
```
$ grep -c "<=50K" adult.data
24720

$ grep -c ">50K" adult.data
7841
```
And hence my norm term, which is 1/label-count:
```
norm |above-50K> => .000127534753220 |_self>
norm |below-50K> => .000040453074433 |_self>
```
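As a quick sanity check on those two coefficients, here is the 1/label-count arithmetic in Python. The dictionary names are just for illustration.

```python
# Label counts from the grep output on the training data.
label_counts = {"below-50K": 24720, "above-50K": 7841}

# norm is 1/label-count, so the rarer label gets the larger weight.
norm = {label: 1 / count for label, count in label_counts.items()}

# The values agree with the coefficients used in the norm rules.
assert abs(norm["above-50K"] - 0.000127534753220) < 1e-12
assert abs(norm["below-50K"] - 0.000040453074433) < 1e-12

# With this weighting, each label's total possible vote comes to 1,
# so the sheer number of below-50K examples no longer dominates.
for label, count in label_counts.items():
    assert abs(count * norm[label] - 1.0) < 1e-12
```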
But that was still not enough. I tried a drop-below[t] but that too failed miserably. Eventually, I found select[1,n] was the way to go, and the success rate sat pretty stubbornly around 75% for n in {5,10,15}. Point is, so far this is the h that works best:
```
sa: norm-h4 |*> #=> normalize[100] coeff-sort norm M select[1,5] similar[input-pattern,pattern] |_self>
```
The select[1,5] essentially means we are just taking an average of only the top 5 best matches.
3) Long term, the speed of similar[op] on large data sets is not an issue. It will be trivial to write parallel versions.
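To give a feel for what a parallel version might look like, here is a minimal Python sketch. It assumes each input pattern can be scored against the stored patterns independently of every other input, and it uses an assumed rescaled-min overlap as the similarity measure, since this post doesn't spell out how similar[op] is computed. All the names (`simm`, `best_matches`, `similar_all`) and the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def simm(f, g):
    """Assumed similarity: rescaled-min overlap of two non-negative superpositions (dicts)."""
    overlap = sum(min(f.get(k, 0.0), g.get(k, 0.0)) for k in f.keys() | g.keys())
    biggest = max(sum(f.values()), sum(g.values()))
    return overlap / biggest if biggest else 0.0

def best_matches(patterns, keep, item):
    """Score one input against every stored pattern and keep the best `keep`."""
    name, superposition = item
    scored = sorted(((simm(superposition, p), node) for node, p in patterns.items()),
                    reverse=True)
    return name, scored[:keep]

def similar_all(executor, input_patterns, patterns, keep=100):
    """Farm the per-input searches out to whatever executor is passed in."""
    work = partial(best_matches, patterns, keep)
    return dict(executor.map(work, input_patterns.items()))

# Toy data, two features instead of 14.
patterns = {"node-1": {"age": 30, "hours": 40}, "node-2": {"age": 60, "hours": 10}}
inputs = {"example-1": {"age": 31, "hours": 40}, "example-2": {"age": 58, "hours": 12}}

with ThreadPoolExecutor(max_workers=2) as pool:
    results = similar_all(pool, inputs, patterns, keep=1)
for name, matches in results.items():
    print(name, matches[0][1])  # example-1 node-1, then example-2 node-2
```

Threads keep the demo simple. For CPU-bound pure Python, a `ProcessPoolExecutor` (created under an `if __name__ == "__main__":` guard) would be the thing to try instead, at the cost of shipping the stored patterns to each worker.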

That's it for this post. I think I will move on to other things in my next few posts. I'm reasonably happy with my proof of concept for my supervised pattern recognition algo.