Friday, 5 June 2015

more inverse-simm results

OK. In the last post we discovered that similar[inverse-links-to] seems to give good results. Let's expand our test set and try it on a few more examples.
-- define our test set:
|list> => |WP: Erwin_Schrdinger> + |WP: Richard_Feynman> + |WP: Cat> + |WP: Dog> + |WP: Apple> + |WP: Adelaide> + |WP: University_of_Adelaide> + |WP: Particle_physics> + |WP: Lisp_(programming_language)> + |WP: APL_(programming_language)> + |WP: SQL> + |WP: SPARQL> + |WP: The_Doors> + |WP: Rugby> + |WP: Australian_Football_League>

-- how many incoming links?
sa: how-many-in-links |*> #=> how-many inverse-links-to |_self>
sa: table[wikipage,how-many-in-links] "" |list>
+-----------------------------+-------------------+
| wikipage                    | how-many-in-links |
+-----------------------------+-------------------+
| Erwin_Schrdinger            | 53                |
| Richard_Feynman             | 79                |
| Cat                         | 14                |
| Dog                         | 24                |
| Apple                       | 21                |
| Adelaide                    | 81                |
| University_of_Adelaide      | 10                |
| Particle_physics            | 17                |
| Lisp_(programming_language) | 64                |
| APL_(programming_language)  | 24                |
| SQL                         | 41                |
| SPARQL                      | 6                 |
| The_Doors                   | 41                |
| Rugby                       | 0                 |
| Australian_Football_League  | 30                |
+-----------------------------+-------------------+
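
As a mental model (not the actual engine internals), you can picture the links-to data as a dict from page to the set of pages it links to; inverse-links-to is then the transposed dict, and how-many just counts the kets. A minimal Python sketch, with hypothetical toy data:

# hypothetical toy data: page -> set of pages it links to
links_to = {
    "Cat": {"Dog", "Lion", "Felidae"},
    "Dog": {"Cat", "Gray_wolf", "Canidae"},
    "Lion": {"Cat", "Felidae"},
}

# inverse-links-to is the transposed mapping: page -> pages that link to it
inverse_links_to = {}
for page, targets in links_to.items():
    for target in targets:
        inverse_links_to.setdefault(target, set()).add(page)

# how-many inverse-links-to |page> is then just the size of the in-link set:
print(len(inverse_links_to.get("Cat", set())))   # 2: linked from Dog and Lion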

-- create the data:
sa: inverse-simm-op |WP: *> #=> select[1,500] 100 self-similar[inverse-links-to] |_self>
sa: |null> => map[inverse-simm-op,inverse-simm] "" |list>

-- define an operator to explore the resulting data:
sa: T |*> #=> table[wikipage,coeff] select[1,20] inverse-simm |_self>
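
For reference, similar[inverse-links-to] is built on the simm similarity measure; on plain equal-coefficient link sets it reduces to |A intersect B| / max(|A|, |B|). Here is a minimal plain-Python sketch of that idea (a sketch only, not the engine's implementation, which handles weighted superpositions, roughly sum_i min(f_i, g_i) / max(sum f, sum g)):

def set_simm(A, B):
    # unscaled simm on two plain sets: |A & B| / max(|A|, |B|), in [0, 1]
    if not A or not B:
        return 0.0
    return float(len(A & B)) / max(len(A), len(B))

def similar_in_links(page, inverse_links_to, top=20):
    # rank every page by the simm of its in-link set against that of `page`
    target = inverse_links_to.get(page, set())
    scores = [(100 * set_simm(target, links), other)
              for other, links in inverse_links_to.items()]
    return sorted(scores, reverse=True)[:top]

As a sanity check against the tables below: Cat has 14 in-links, Dog has 24, and their coefficient is 20.833 = 100 * 5/24, i.e., the two pages share 5 incoming links.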

-- now our examples:
sa: T |WP: Erwin_Schrdinger>
+---------------------------+--------+
| wikipage                  | coeff  |
+---------------------------+--------+
| Erwin_Schrdinger          | 100.0  |
| Max_Born                  | 32.075 |
| Niels_Bohr                | 31.646 |
| Schrdinger_equation       | 30.189 |
| Paul_Dirac                | 29.31  |
| Wolfgang_Pauli            | 28.302 |
| Werner_Heisenberg         | 28.049 |
| Max_Planck                | 26.984 |
| uncertainty_principle     | 26.415 |
| photoelectric_effect      | 24.528 |
| Roger_Penrose             | 22.642 |
| Bohr_model                | 20.755 |
| Arnold_Sommerfeld         | 20.755 |
| Louis_de_Broglie          | 20.755 |
| wave_function             | 20.755 |
| Copenhagen_interpretation | 18.868 |
| quantum_state             | 18.868 |
| Ernest_Rutherford         | 17.742 |
| Maxwell's_equations       | 17.241 |
| Pauli_exclusion_principle | 16.981 |
+---------------------------+--------+

sa: T |WP: Richard_Feynman>
+------------------------------------+--------+
| wikipage                           | coeff  |
+------------------------------------+--------+
| Richard_Feynman                    | 100.0  |
| Werner_Heisenberg                  | 24.39  |
| special_relativity                 | 20.792 |
| Niels_Bohr                         | 20.253 |
| Paul_Dirac                         | 20.253 |
| particle_physics                   | 20.225 |
| classical_mechanics                | 20.0   |
| fermion                            | 18.987 |
| spin_(physics)                     | 18.987 |
| Standard_Model                     | 17.722 |
| Schrdinger_equation                | 17.722 |
| quantum_field_theory               | 17.722 |
| electromagnetism                   | 17.241 |
| Erwin_Schrdinger                   | 16.456 |
| Pauli_exclusion_principle          | 16.456 |
| quark                              | 16.456 |
| Stephen_Hawking                    | 16.456 |
| quantum_electrodynamics            | 16.456 |
| Julian_Schwinger                   | 16.456 |
| Category:Concepts_in_physics       | 16.279 |
+------------------------------------+--------+

sa: T |WP: Cat>
+----------+--------+
| wikipage | coeff  |
+----------+--------+
| Cat      | 100.0  |
| Horse    | 31.25  |
| Donkey   | 28.571 |
| Goat     | 28.571 |
| Elephant | 21.429 |
| Pig      | 21.429 |
| Rabbit   | 21.429 |
| Deer     | 21.429 |
| Mule     | 21.429 |
| Goose    | 21.429 |
| Dog      | 20.833 |
| Sheep    | 20     |
| Lion     | 18.75  |
| Almond   | 14.286 |
| Alder    | 14.286 |
| Ant      | 14.286 |
| Bear     | 14.286 |
| Bee      | 14.286 |
| Fox      | 14.286 |
| Lizard   | 14.286 |
+----------+--------+

sa: T |WP: Dog>
+------------------+--------+
| wikipage         | coeff  |
+------------------+--------+
| Dog              | 100.0  |
| Horse            | 29.167 |
| coyote           | 22.222 |
| Gray_wolf        | 21.429 |
| Arctic_fox       | 20.833 |
| Cat              | 20.833 |
| Canidae          | 20.833 |
| Elephant         | 20.833 |
| bobcat           | 20.833 |
| red_fox          | 20.833 |
| Donkey           | 20.833 |
| red_wolf         | 20.833 |
| Rabbit           | 16.667 |
| African_wild_dog | 16.667 |
| gray_wolf        | 16.667 |
| Domestic_sheep   | 16.667 |
| dingo            | 16.667 |
| Goat             | 16.667 |
| Cattle           | 16.667 |
| otter            | 16.667 |
+------------------+--------+

sa: T |WP: Apple>
+----------------+--------+
| wikipage       | coeff  |
+----------------+--------+
| Apple          | 100.0  |
| Strawberry     | 33.333 |
| Cranberry      | 23.81  |
| Grape          | 23.81  |
| Tomato         | 23.81  |
| Cherry         | 23.81  |
| Kiwifruit      | 19.048 |
| Blackberry     | 19.048 |
| plum           | 19.048 |
| Lime_(fruit)   | 19.048 |
| Pineapple      | 19.048 |
| Lemon          | 19.048 |
| Blueberry      | 19.048 |
| peach          | 17.5   |
| pear           | 17.391 |
| Orange_(fruit) | 16.667 |
| Pear           | 14.286 |
| Banana         | 14.286 |
| Peach          | 14.286 |
| Squash_(plant) | 14.286 |
+----------------+--------+

sa: T |WP: Adelaide>
+-------------------------------------+--------+
| wikipage                            | coeff  |
+-------------------------------------+--------+
| Adelaide                            | 100.0  |
| Brisbane                            | 37.079 |
| Perth                               | 32.099 |
| South_Australia                     | 26.042 |
| Melbourne                           | 18.687 |
| Canberra                            | 15.044 |
| The_Age                             | 14.634 |
| Sydney                              | 14.583 |
| Australian_Broadcasting_Corporation | 13.445 |
| Australian_rules_football           | 12.346 |
| Auckland                            | 12.195 |
| Australian_Labor_Party              | 11.111 |
| Darwin,_Northern_Territory          | 11.111 |
| Triple_J                            | 11.111 |
| Seven_Network                       | 11.111 |
| States_and_territories_of_Australia | 11.111 |
| Karachi                             | 10.989 |
| Australian_Football_League          | 9.877  |
| The_Australian                      | 9.877  |
| Western_Australia                   | 8.911  |
+-------------------------------------+--------+

sa: T |WP: University_of_Adelaide>
+----------------------------------------+-------+
| wikipage                               | coeff |
+----------------------------------------+-------+
| University_of_Adelaide                 | 100.0 |
| Port_Adelaide_Football_Club            | 20    |
| Adelaide_Oval                          | 20    |
| Adelaide_city_centre                   | 20    |
| University_of_South_Australia          | 20    |
| Port_Adelaide                          | 20    |
| Australian_Grand_Prix                  | 20    |
| State_Bank_of_South_Australia          | 20    |
| Mount_Lofty                            | 20    |
| South_Eastern_Freeway                  | 20    |
| Southern_Expressway_(Australia)        | 20    |
| Government_of_South_Australia          | 20    |
| Flinders_University_of_South_Australia | 20    |
| South_Australian_Museum                | 20    |
| Adelaide_Crows                         | 20    |
| Glenelg,_South_Australia               | 20    |
| Australian_Central_Standard_Time       | 20    |
| Australian_Central_Daylight_Time       | 20    |
| Fleurieu_Peninsula                     | 20    |
| River_Torrens                          | 20    |
+----------------------------------------+-------+

sa: T |WP: Particle_physics>
+----------------------------------------+--------+
| wikipage                               | coeff  |
+----------------------------------------+--------+
| Particle_physics                       | 100    |
| Optics                                 | 20     |
| Cosmology                              | 20     |
| Acoustics                              | 17.647 |
| Condensed_matter_physics               | 17.647 |
| Fluid_dynamics                         | 17.647 |
| Thermodynamics                         | 17.647 |
| kinematics                             | 17.647 |
| atomic,_molecular,_and_optical_physics | 17.647 |
| cosmic_inflation                       | 17.647 |
| Fluid_statics                          | 17.647 |
| Lambda-CDM_model                       | 17.647 |
| Biophysics                             | 17.647 |
| Category:Physics                       | 17.647 |
| Lev_Landau                             | 15.789 |
| Nuclear_physics                        | 14.286 |
| virtual_particle                       | 14.286 |
| quantum_chemistry                      | 13.636 |
| American_Physical_Society              | 13.333 |
| Elementary_particle                    | 13.043 |
+----------------------------------------+--------+

sa: T |WP: Lisp_(programming_language)>
+------------------------------------+--------+
| wikipage                           | coeff  |
+------------------------------------+--------+
| Lisp_(programming_language)        | 100    |
| Smalltalk                          | 28.125 |
| Pascal_(programming_language)      | 24.675 |
| Fortran                            | 23.881 |
| Scheme_(programming_language)      | 23.438 |
| Ruby_(programming_language)        | 23.188 |
| object-oriented_programming        | 21.875 |
| PHP                                | 20.779 |
| Prolog                             | 20.312 |
| Haskell_(programming_language)     | 20.312 |
| Ada_(programming_language)         | 18.75  |
| APL_(programming_language)         | 18.75  |
| BASIC                              | 18.75  |
| COBOL                              | 18.75  |
| functional_programming             | 18.75  |
| John_McCarthy_(computer_scientist) | 18.75  |
| C_Sharp_(programming_language)     | 18.75  |
| programming_language               | 17.647 |
| JavaScript                         | 17.526 |
| compiler                           | 17     |
+------------------------------------+--------+

sa: T |WP: APL_(programming_language)>
+------------------------------------+--------+
| wikipage                           | coeff  |
+------------------------------------+--------+
| APL_(programming_language)         | 100.0  |
| Kenneth_E._Iverson                 | 33.333 |
| John_McCarthy_(computer_scientist) | 25.0   |
| John_Backus                        | 25.0   |
| Prolog                             | 23.333 |
| Alan_Kay                           | 20.833 |
| AWK                                | 20.833 |
| Grace_Hopper                       | 20.833 |
| ML_(programming_language)          | 20.833 |
| Niklaus_Wirth                      | 20.833 |
| logic_programming                  | 20.833 |
| J_(programming_language)           | 20.833 |
| bytecode                           | 19.231 |
| Lisp_(programming_language)        | 18.75  |
| programmer                         | 17.857 |
| Objective-C                        | 17.241 |
| BASIC                              | 17.188 |
| Mathematica                        | 17.143 |
| SQL                                | 17.073 |
| ALGOL                              | 16.667 |
+------------------------------------+--------+

sa: T |WP: SQL>
+----------------------------------------+--------+
| wikipage                               | coeff  |
+----------------------------------------+--------+
| SQL                                    | 100.0  |
| Haskell_(programming_language)         | 22.727 |
| PHP                                    | 19.481 |
| APL_(programming_language)             | 17.073 |
| Category:Cross-platform_software       | 17.073 |
| Visual_Basic                           | 17.073 |
| relational_database                    | 17.073 |
| COBOL                                  | 16.327 |
| PostgreSQL                             | 14.634 |
| R_(programming_language)               | 14.634 |
| Run_time_(program_lifecycle_phase)     | 14.634 |
| Ruby_(programming_language)            | 14.493 |
| JavaScript                             | 13.402 |
| C_Sharp_(programming_language)         | 13.208 |
| database                               | 12.644 |
| Lisp_(programming_language)            | 12.5   |
| Common_Lisp                            | 12.195 |
| Graphical_user_interface               | 12.195 |
| MySQL                                  | 12.195 |
| Mathematica                            | 12.195 |
+----------------------------------------+--------+

sa: T |WP: SPARQL>
+--------------------------------------------------------------------------------------------+--------+
| wikipage                                                                                   | coeff  |
+--------------------------------------------------------------------------------------------+--------+
| SPARQL                                                                                     | 100.0  |
| Web_Ontology_Language                                                                      | 33.333 |
| Agris:_International_Information_System_for_the_Agricultural_Sciences_and_Technology       | 33.333 |
| W3C_XML_Schema                                                                             | 33.333 |
| GRDDL                                                                                      | 33.333 |
| Conceptual_interoperability                                                                | 33.333 |
| Category:Web_services                                                                      | 33.333 |
| Resource_Description_Framework                                                             | 20     |
| Analytic_geometry                                                                          | 16.667 |
| DHTML                                                                                      | 16.667 |
| Interpolation                                                                              | 16.667 |
| GNU_nano                                                                                   | 16.667 |
| Pico_(text_editor)                                                                         | 16.667 |
| Relational_database                                                                        | 16.667 |
| Sir_Charles_Lyell                                                                          | 16.667 |
| Synchronized_Multimedia_Integration_Language                                               | 16.667 |
| Semantic_network                                                                           | 16.667 |
| Backronym                                                                                  | 16.667 |
| Interoperability                                                                           | 16.667 |
| RAS_syndrome                                                                               | 16.667 |
+--------------------------------------------------------------------------------------------+--------+

sa: T |WP: The_Doors>
+---------------------------------------+--------+
| wikipage                              | coeff  |
+---------------------------------------+--------+
| The_Doors                             | 100.0  |
| Jim_Morrison                          | 31.707 |
| Ray_Manzarek                          | 21.951 |
| Jefferson_Airplane                    | 17.073 |
| The_Band                              | 14.634 |
| Bee_Gees                              | 14.634 |
| Janis_Joplin                          | 12.195 |
| Summer_of_Love                        | 12.195 |
| Timothy_Leary                         | 12.195 |
| folk_rock                             | 12.195 |
| Governor_of_American_Samoa            | 12.195 |
| Cream_(band)                          | 12.195 |
| Sgt._Pepper's_Lonely_Hearts_Club_Band | 12.195 |
| The_Byrds                             | 12.195 |
| Joan_Baez                             | 12.195 |
| Iron_Butterfly                        | 12.195 |
| Brian_Jones                           | 12.195 |
| Creedence_Clearwater_Revival          | 12.195 |
| Johnny_Winter                         | 12.195 |
| The_Yardbirds                         | 12.195 |
+---------------------------------------+--------+

sa: T |WP: Rugby>
+----------+-------+
| wikipage | coeff |
+----------+-------+
+----------+-------+
The empty table is expected: the how-many-in-links table above shows Rugby has zero incoming links in this subset, so there is nothing for inverse-simm to match on.

sa: T |WP: Australian_Football_League>
+---------------------------------+--------+
| wikipage                        | coeff  |
+---------------------------------+--------+
| Australian_Football_League      | 100.0  |
| West_Coast_Eagles               | 40     |
| Richmond_Football_Club          | 40     |
| Sydney_Swans                    | 36.667 |
| St_Kilda_Football_Club          | 36.667 |
| Collingwood_Football_Club       | 36.667 |
| Hawthorn_Football_Club          | 36.667 |
| Australian_rules_football       | 33.871 |
| Essendon_Football_Club          | 33.333 |
| North_Melbourne_Football_Club   | 33.333 |
| Western_Bulldogs                | 33.333 |
| Carlton_Football_Club           | 33.333 |
| Brownlow_Medal                  | 33.333 |
| Seven_Network                   | 32.353 |
| Melbourne_Cricket_Ground        | 30     |
| Australian_Bureau_of_Statistics | 30     |
| Special_Broadcasting_Service    | 30     |
| Melbourne_Football_Club         | 30     |
| Fitzroy_Football_Club           | 30     |
| 2001_AFL_season                 | 30     |
+---------------------------------+--------+
Wow! That works unbelievably well. I don't know exactly why, but hey. And I don't know why similar[inverse-links-to] works better than similar[links-to] either! A question that comes to mind: if we use a larger subset of wikipedia, will these results get better or worse? I suspect better, but I'm not sure.

BTW, these results come from only:
100*30000/15559125
= 0.19 %
of the total English wikipedia.

And the resulting sw file is here.

what do we know about bananas?

Now that we have a subset of wikipedia in sw format, we need to look around and see what we have.

There are 4 things that might be interesting to ask (with this data-set):
links-to |wikipage>
inverse-links-to |wikipage>
similar[links-to] |wikipage>
similar[inverse-links-to] |wikipage>
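
In the dict picture sketched above, the first two are plain lookups, and the last two are the same ranking function applied to the forward map versus the transposed map (set_simm as defined earlier):

def similar(op_map, page, top=20):
    # similar[links-to] and similar[inverse-links-to] are this one function,
    # fed the forward dict and the transposed dict respectively
    A = op_map.get(page, set())
    ranked = [(100 * set_simm(A, B), other) for other, B in op_map.items()]
    return sorted(ranked, reverse=True)[:top]

# links-to |wikipage>              ->  links_to[page]
# inverse-links-to |wikipage>      ->  inverse_links_to[page]
# similar[links-to] |wikipage>     ->  similar(links_to, page)
# similar[inverse-links-to] |...>  ->  similar(inverse_links_to, page)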

So, let's start with bananas:
sa: table[wikipage,coeff] select[1,50] coeff-sort links-to |WP: Banana>
+-------------------------------------------------+-------+
| wikipage                                        | coeff |
+-------------------------------------------------+-------+
| Musa_balbisiana                                 | 5     |
| Southeast_Asia                                  | 5     |
| Musa_acuminata                                  | 4     |
| Kerala                                          | 4     |
| Musa_(genus)                                    | 3     |
| starch                                          | 3     |
| Philippines                                     | 3     |
| India                                           | 3     |
| Plantain_(true)                                 | 3     |
| Chiquita_Brands_International                   | 3     |
| Malaysia                                        | 3     |
| Indonesia                                       | 3     |
| Thailand                                        | 3     |
| Central_America                                 | 3     |
| Uganda                                          | 3     |
| Panama_disease                                  | 3     |
| fruit                                           | 2     |
| herbaceous                                      | 2     |
| Papua_New_Guinea                                | 2     |
| fiber                                           | 2     |
| #Cavendish                                      | 2     |
| List_of_banana_cultivars                        | 2     |
| Musa_velutina                                   | 2     |
| Fe'i_banana                                     | 2     |
| Ensete_ventricosum                              | 2     |
| Musaceae                                        | 2     |
| pseudostem                                      | 2     |
| inflorescence                                   | 2     |
| List_of_banana_cultivars#AAB_Group              | 2     |
| Synonym_(taxonomy)                              | 2     |
| Palestine                                       | 2     |
| Madagascar                                      | 2     |
| China                                           | 2     |
| Dole_Food_Company                               | 2     |
| Burundi                                         | 2     |
| Rwanda                                          | 2     |
| Hybrid_(biology)                                | 2     |
| genetic_engineering                             | 2     |
| Food_and_Agriculture_Organization               | 2     |
| Guatemala                                       | 2     |
| ICTSD                                           | 2     |
| Fyffes                                          | 2     |
| United_Fruit_Company                            | 2     |
| coffee                                          | 2     |
| International_Institute_of_Tropical_Agriculture | 2     |
| Daily_Value                                     | 2     |
| potatoes                                        | 2     |
| dopamine                                        | 2     |
| South_Asia                                      | 2     |
| Pisang_goreng                                   | 2     |
+-------------------------------------------------+-------+

sa: table[wikipage,coeff] select[1,50] coeff-sort inverse-links-to |WP: Banana>
+-----------------------+-------+
| wikipage              | coeff |
+-----------------------+-------+
| Economy_of_Costa_Rica | 1     |
| Flavor                | 1     |
| Economy_of_Honduras   | 1     |
| Economy_of_Jamaica    | 1     |
| Jericho               | 1     |
| Jell-O                | 1     |
| Economy_of_Malawi     | 1     |
| Neapolitan_ice_cream  | 1     |
| Thai_cuisine          | 1     |
| Vitamin_C             | 1     |
| Zingiberales          | 1     |
+-----------------------+-------+

sa: table[wikipage,coeff] select[1,50] 100 self-similar[links-to] |WP: Banana>
+---------------------------------------------------------------------------------+--------+
| wikipage                                                                        | coeff  |
+---------------------------------------------------------------------------------+--------+
| Banana                                                                          | 100.0  |
| Cooking_plantain                                                                | 15.096 |
| ITU_prefix                                                                      | 9.52   |
| Wikipedia:Status_of_the_porting_of_U.S._Dept_of_State_info                      | 9.088  |
| United_Nations_Industrial_Development_Organization                              | 8.376  |
| Wikipedia:Status_of_the_porting_of_the_CIA_World_Factbook                       | 8.096  |
| Abac                                                                            | 8.036  |
| International_Tropical_Timber_Agreement,_1994                                   | 7.813  |
| Deforestation                                                                   | 7.483  |
| International_Tropical_Timber_Agreement,_1983                                   | 7.366  |
| Western_imperialism_in_Asia                                                     | 6.983  |
| International_Hydrographic_Organization                                         | 6.92   |
| Foreign_relations_of_China                                                      | 6.806  |
| Lions_Clubs_International                                                       | 6.326  |
| Asia                                                                            | 6.242  |
| Passport                                                                        | 6.24   |
| Diaspora                                                                        | 5.931  |
| International_Electrotechnical_Commission                                       | 5.925  |
| Member_states_of_the_United_Nations                                             | 5.645  |
| Indian_Ocean                                                                    | 5.625  |
| History_of_Southeast_Asia                                                       | 5.62   |
| Solanaceae                                                                      | 5.581  |
| Tiger                                                                           | 5.56   |
| Kyoto_Protocol                                                                  | 5.527  |
| Curry                                                                           | 5.495  |
| Sugar                                                                           | 5.41   |
| Blood_alcohol_content                                                           | 5.389  |
| List_of_mountains                                                               | 5.388  |
| Rice                                                                            | 5.37   |
| Southeast_Asia                                                                  | 5.361  |
| Ceiba_pentandra                                                                 | 5.357  |
| Pigeon_pea                                                                      | 5.357  |
| Rat                                                                             | 5.219  |
| Abugida                                                                         | 5.116  |
| Chinatown                                                                       | 5.1    |
| Video_CD                                                                        | 5.086  |
| History_of_the_Pacific_Islands                                                  | 5.052  |
| Time_zone                                                                       | 5.037  |
| Asian_Development_Bank                                                          | 5.035  |
| Hindu                                                                           | 5.03   |
| Foreign_relations_of_Indonesia                                                  | 5.004  |
| Foreign_relations_of_Singapore                                                  | 5.002  |
| Economy_of_Burma                                                                | 4.993  |
| Spice                                                                           | 4.952  |
| Convention_on_Fishing_and_Conservation_of_the_Living_Resources_of_the_High_Seas | 4.911  |
| Foreign_relations_of_Taiwan                                                     | 4.842  |
| Tamil_language                                                                  | 4.828  |
| Risk_(game)                                                                     | 4.792  |
| Foreign_relations_of_the_Philippines                                            | 4.776  |
| Numismatics                                                                     | 4.748  |
+---------------------------------------------------------------------------------+--------+
  Time taken: 10 minutes, 7 seconds, 599 milliseconds

sa: table[wikipage,coeff] select[1,100] 100 self-similar[inverse-links-to] |WP: Banana>
+-----------------------------------------+--------+
| wikipage                                | coeff  |
+-----------------------------------------+--------+
| Banana                                  | 100.0  |
| Grape                                   | 27.273 |
| Pineapple                               | 27.273 |
| Mango                                   | 27.273 |
| Ascorbic_acid                           | 18.182 |
| Cranberry                               | 18.182 |
| Kiwifruit                               | 18.182 |
| Pear                                    | 18.182 |
| Tea                                     | 18.182 |
| Vanilla                                 | 18.182 |
| Chicken                                 | 18.182 |
| Peach                                   | 18.182 |
| apparel                                 | 18.182 |
| lime_(fruit)                            | 18.182 |
| pudding                                 | 18.182 |
| mangosteen                              | 18.182 |
| Coconut                                 | 18.182 |
| gelatin_dessert                         | 18.182 |
| Lemon                                   | 18.182 |
| galangal                                | 18.182 |
| Melon                                   | 18.182 |
| Grapefruit                              | 18.182 |
| Raspberry                               | 18.182 |
| Apricot                                 | 18.182 |
| Watermelon                              | 18.182 |
| Coffee                                  | 16.667 |
| Strawberry                              | 15.789 |
| Cherry                                  | 15.385 |
| custard                                 | 15.385 |
| Pork                                    | 15.385 |
| Sugar                                   | 14.286 |
| Apple                                   | 14.286 |
| condensed_milk                          | 14.286 |
| Garden_strawberry                       | 14.286 |
| Blackberry                              | 13.333 |
| raspberry                               | 13.333 |
| Fruit                                   | 12.5   |
| pistachio                               | 12.5   |
| Orange_(fruit)                          | 12.5   |
| tapioca                                 | 12     |
| Tomato                                  | 11.765 |
| Center_for_Economic_and_Policy_Research | 11.111 |
| Lime_(fruit)                            | 10.526 |
| turmeric                                | 10.526 |
| pineapple                               | 10.256 |
| Domestic_sheep                          | 9.524  |
| broccoli                                | 9.524  |
| Adobe                                   | 9.091  |
| Analytical_chemistry                    | 9.091  |
| Commelinales                            | 9.091  |
| Calf                                    | 9.091  |
| Celebrity                               | 9.091  |
| Cream                                   | 9.091  |
| Buddhist_cuisine                        | 9.091  |
| Celery                                  | 9.091  |
| Chocolate                               | 9.091  |
| Cola                                    | 9.091  |
| Ester                                   | 9.091  |
| Epipaleolithic                          | 9.091  |
| Food_additive                           | 9.091  |
| Glycine                                 | 9.091  |
| Gelatin                                 | 9.091  |
| Gelatin_dessert                         | 9.091  |
| James_Lind                              | 9.091  |
| Kathleen_Kenyon                         | 9.091  |
| Ceiba_pentandra                         | 9.091  |
| Mya_(unit)                              | 9.091  |
| Monosaccharide                          | 9.091  |
| Parsley                                 | 9.091  |
| Potato                                  | 9.091  |
| Pia_colada                              | 9.091  |
| Spice                                   | 9.091  |
| Soft_drink                              | 9.091  |
| Economy_of_South_Africa                 | 9.091  |
| Scurvy                                  | 9.091  |
| Tarsiidae                               | 9.091  |
| Zingiberaceae                           | 9.091  |
| Beef                                    | 9.091  |
| Cooking_plantain                        | 9.091  |
| Casimir_Funk                            | 9.091  |
| Pigeon_pea                              | 9.091  |
| Pistachio                               | 9.091  |
| Cod                                     | 9.091  |
| Tropical                                | 9.091  |
| tadpole                                 | 9.091  |
| water_buffalo                           | 9.091  |
| Phosphoric_acid                         | 9.091  |
| Tartaric_acid                           | 9.091  |
| Citric_acid                             | 9.091  |
| weak_acid                               | 9.091  |
| Lactic_acid                             | 9.091  |
| Dietary_Reference_Intake                | 9.091  |
| Canaanite_language                      | 9.091  |
| cell_adhesion_molecule                  | 9.091  |
| salivary_gland                          | 9.091  |
| bananas                                 | 9.091  |
| gourami                                 | 9.091  |
| fishing_industry                        | 9.091  |
| Molding_(process)                       | 9.091  |
| Carrot                                  | 9.091  |
+-----------------------------------------+--------+
  Time taken: 14 minutes, 27 seconds, 470 milliseconds
So, similar[inverse-links-to] works best, and quite noticeably so. Exactly why, I'm not sure. Anyway, now we know to apply it to more examples. I'll do that in the next post.

how many wikipage links?

Just a simple one. Find the wikipages (in our processed fragment/subset of wikipedia) with the most outgoing links, and the most incoming links.
sa: load 30k--wikipedia-links.sw
sa: how-many-out-links |*> #=> how-many links-to |_self>
sa: rank-table[wikipage,how-many-out-links] select[1,100] reverse sort-by[how-many-out-links] starts-with |WP: >
+------+----------------------------------------------+--------------------+
| rank | wikipage                                     | how-many-out-links |
+------+----------------------------------------------+--------------------+
| 1    | List_of_Latin_words_with_English_derivatives | 3171               |
| 2    | List_of_Greek_words_with_English_derivatives | 3163               |
| 3    | Index_of_philosophy_articles_(AC)            | 2978               |
| 4    | List_of_chess_players                        | 1996               |
| 5    | 2000s_(decade)                               | 1861               |
| 6    | March_27                                     | 1844               |
| 7    | 1989                                         | 1789               |
| 8    | 1991                                         | 1708               |
| 9    | 1990                                         | 1673               |
| 10   | List_of_Scots                                | 1663               |
| 11   | 1979                                         | 1619               |
| 12   | 1972                                         | 1617               |
| 13   | 1945                                         | 1597               |
| 14   | 1988                                         | 1595               |
| 15   | 1973                                         | 1583               |
| 16   | 1977                                         | 1574               |
| 17   | 1992                                         | 1556               |
| 18   | 1983                                         | 1546               |
| 19   | List_of_mountains                            | 1545               |
| 20   | 1967                                         | 1538               |
| 21   | 1966                                         | 1530               |
| 22   | Russia                                       | 1516               |
| 23   | 1970                                         | 1514               |
| 24   | List_of_historical_period_drama_films        | 1499               |
| 25   | 1976                                         | 1492               |
| 26   | 1993                                         | 1492               |
| 27   | 1980                                         | 1491               |
| 28   | England                                      | 1481               |
| 29   | 1985                                         | 1476               |
| 30   | 1981                                         | 1475               |
| 31   | 1986                                         | 1470               |
| 32   | March_4                                      | 1450               |
| 33   | Poland                                       | 1446               |
| 34   | 1971                                         | 1435               |
| 35   | 1965                                         | 1422               |
| 36   | 1984                                         | 1415               |
| 37   | 1982                                         | 1408               |
| 38   | 1975                                         | 1404               |
| 39   | 1964                                         | 1402               |
| 40   | 1960s                                        | 1395               |
| 41   | History_of_painting                          | 1392               |
| 42   | 1963                                         | 1375               |
| 43   | 1944                                         | 1360               |
| 44   | Pakistan                                     | 1349               |
| 45   | 1997                                         | 1347               |
| 46   | 1987                                         | 1342               |
| 47   | 1978                                         | 1340               |
| 48   | 1968                                         | 1330               |
| 49   | 1969                                         | 1325               |
| 50   | 1946                                         | 1319               |
| 51   | 1996                                         | 1315               |
| 52   | Portugal                                     | 1315               |
| 53   | List_of_comedians                            | 1313               |
| 54   | 1962                                         | 1307               |
| 55   | 1960                                         | 1302               |
| 56   | List_of_fictional_robots_and_androids        | 1287               |
| 57   | United_Kingdom                               | 1283               |
| 58   | Greece                                       | 1276               |
| 59   | Rock_music                                   | 1275               |
| 60   | 1943                                         | 1262               |
| 61   | List_of_economists                           | 1262               |
| 62   | Academy_Award_for_Best_Production_Design     | 1246               |
| 63   | Italy                                        | 1245               |
| 64   | 1941                                         | 1228               |
| 65   | 1940                                         | 1212               |
| 66   | 1948                                         | 1211               |
| 67   | 1995                                         | 1210               |
| 68   | 1947                                         | 1210               |
| 69   | 1990s                                        | 1206               |
| 70   | 1957                                         | 1198               |
| 71   | History_of_Austria                           | 1196               |
| 72   | 1961                                         | 1189               |
| 73   | January_1                                    | 1187               |
| 74   | British_Museum                               | 1187               |
| 75   | 1942                                         | 1175               |
| 76   | 2007                                         | 1163               |
| 77   | Germany                                      | 1159               |
| 78   | 1998                                         | 1156               |
| 79   | Sculpture                                    | 1156               |
| 80   | Andalusia                                    | 1156               |
| 81   | List_of_French_people                        | 1154               |
| 82   | Missouri                                     | 1153               |
| 83   | List_of_agnostics                            | 1151               |
| 84   | World_War_II                                 | 1146               |
| 85   | List_of_marine_aquarium_fish_species         | 1144               |
| 86   | Spain                                        | 1142               |
| 87   | 1950s                                        | 1133               |
| 88   | 1958                                         | 1132               |
| 89   | Chicago                                      | 1131               |
| 90   | E_number                                     | 1130               |
| 91   | Pittsburgh                                   | 1118               |
| 92   | Lists_of_office-holders                      | 1115               |
| 93   | 1953                                         | 1113               |
| 94   | List_of_wine-producing_regions               | 1112               |
| 95   | Ukraine                                      | 1111               |
| 96   | 1959                                         | 1109               |
| 97   | Dubbing_(filmmaking)                         | 1109               |
| 98   | 1956                                         | 1104               |
| 99   | List_of_dog_breeds                           | 1103               |
| 100  | 1955                                         | 1102               |
+------+----------------------------------------------+--------------------+
  Time taken: 1 minute, 4 seconds, 532 milliseconds
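
In dict terms (the toy links_to from the sketches above), this rank table is just a sort by out-degree:

# rank pages by how many outgoing links they have (cf. rank-table above)
ranking = sorted(links_to, key=lambda page: len(links_to[page]), reverse=True)
for rank, page in enumerate(ranking[:100], start=1):
    print("%s  %s  %s" % (rank, page, len(links_to[page])))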

sa: find-inverse[links-to]
  Time taken: 2 minutes, 23 seconds, 162 milliseconds

sa: how-many-in-links |*> #=> how-many inverse-links-to |_self>
sa: rank-table[wikipage,how-many-in-links] select[1,100] reverse sort-by[how-many-in-links] starts-with |WP: >
+------+--------------------------------+-------------------+
| rank | wikipage                       | how-many-in-links |
+------+--------------------------------+-------------------+
| 1    | United_States                  | 3027              |
| 2    | World_War_II                   | 2242              |
| 3    | Julian_calendar                | 1864              |
| 4    | Wikipedia:Persondata           | 1819              |
| 5    | Roman_numerals                 | 1773              |
| 6    | France                         | 1621              |
| 7    | United_Kingdom                 | 1617              |
| 8    | Germany                        | 1446              |
| 9    | Soviet_Union                   | 1412              |
| 10   | Italy                          | 1343              |
| 11   | Latin                          | 1277              |
| 12   | India                          | 1262              |
| 13   | China                          | 1247              |
| 14   | The_New_York_Times             | 1239              |
| 15   | Japan                          | 1210              |
| 16   | World_War_I                    | 1145              |
| 17   | Canada                         | 1130              |
| 18   | Europe                         | 1071              |
| 19   | Russia                         | 1037              |
| 20   | England                        | 1034              |
| 21   | New_York_City                  | 1017              |
| 22   | London                         | 1001              |
| 23   | United_Nations                 | 993               |
| 24   | Egypt                          | 955               |
| 25   | Rome                           | 944               |
| 26   | Australia                      | 934               |
| 27   | English_language               | 916               |
| 28   | Spain                          | 899               |
| 29   | Netherlands                    | 896               |
| 30   | Roman_Empire                   | 895               |
| 31   | Christianity                   | 860               |
| 32   | European_Union                 | 844               |
| 33   | Anno_Domini                    | 841               |
| 34   | BBC                            | 822               |
| 35   | Ottoman_Empire                 | 817               |
| 36   | Oxford_University_Press        | 798               |
| 37   | Turkey                         | 796               |
| 38   | Category:Living_people         | 793               |
| 39   | California                     | 793               |
| 40   | Greek_language                 | 777               |
| 41   | Israel                         | 775               |
| 42   | Sweden                         | 762               |
| 43   | calendar_era                   | 751               |
| 44   | New_York                       | 749               |
| 45   | Paris                          | 741               |
| 46   | Iran                           | 715               |
| 47   | Poland                         | 711               |
| 48   | The_Guardian                   | 706               |
| 49   | Byzantine_Empire               | 704               |
| 50   | Brazil                         | 704               |
| 51   | Scotland                       | 696               |
| 52   | New_York_Times                 | 664               |
| 53   | Islam                          | 650               |
| 54   | President_of_the_United_States | 650               |
| 55   | Catholic_Church                | 647               |
| 56   | French_language                | 641               |
| 57   | Middle_Ages                    | 640               |
| 58   | American_Civil_War             | 635               |
| 59   | Cambridge_University_Press     | 623               |
| 60   | Philippines                    | 623               |
| 61   | Mexico                         | 613               |
| 62   | Cold_War                       | 611               |
| 63   | Roman_Catholic_Church          | 610               |
| 64   | Denmark                        | 603               |
| 65   | Switzerland                    | 603               |
| 66   | Norway                         | 599               |
| 67   | Greece                         | 584               |
| 68   | Nazi_Germany                   | 583               |
| 69   | Ancient_Rome                   | 575               |
| 70   | Constantinople                 | 573               |
| 71   | NASA                           | 568               |
| 72   | Pakistan                       | 568               |
| 73   | Washington,_D.C.               | 566               |
| 74   | BBC_News                       | 565               |
| 75   | mathematics                    | 565               |
| 76   | Portugal                       | 553               |
| 77   | Syria                          | 548               |
| 78   | Austria                        | 546               |
| 79   | Iraq                           | 543               |
| 80   | New_Zealand                    | 540               |
| 81   | Vietnam_War                    | 538               |
| 82   | Time_(magazine)                | 538               |
| 83   | Ireland                        | 536               |
| 84   | Argentina                      | 534               |
| 85   | Buddhism                       | 524               |
| 86   | South_Africa                   | 520               |
| 87   | Ukraine                        | 516               |
| 88   | Indonesia                      | 513               |
| 89   | Africa                         | 506               |
| 90   | Ab_urbe_condita                | 504               |
| 91   | German_language                | 503               |
| 92   | Belgium                        | 490               |
| 93   | Romania                        | 487               |
| 94   | Afghanistan                    | 485               |
| 95   | Holy_Roman_Empire              | 476               |
| 96   | North_America                  | 473               |
| 97   | Hungary                        | 463               |
| 98   | Gregorian_calendar             | 462               |
| 99   | UNESCO                         | 461               |
| 100  | Massachusetts                  | 455               |
+------+--------------------------------+-------------------+
  Time taken: 6 minutes, 58 seconds, 581 milliseconds
OK. That is hopefully self-explanatory. More in the next couple of posts!

towards processing all of wikipedia

I now have aspirations to process all of wikipedia, and see what we can get out of it. It is certainly the best-curated collection of knowledge on the planet (and structurally much more homogeneous than the web at large), and being fully hyperlinked makes it great for my project. We can think of wikipedia as one giant network, and my project is all about networks.

Now, the problem is that wikipedia is somewhat large: a 50 GB XML file (for English wikipedia), with just over 15 million pages. Prior to this I gave optimization little thought (I was more interested in getting things working than in run speed), but with wikipedia we have to at least start to think about more efficient ways to do things!

So, this was my first attempt to process wikipedia. But it was memory hungry!! I threw the largest-memory EC2 instance type, with 244 GB of RAM, at it and that still wasn't enough (yeah, that's what you get when you don't think about efficiency!). I thought this was the section of code that was problematic:
    # now learn it all:
    for x in r:
      link, anchor = ket_process_anchor_text(x)
      # the link structure, in both directions:
      C.add_learn("links-to",page_name_ket,link)
      C.add_learn("inverse-links-to",link,page_name_ket)

      # the anchor text, in both directions:
      C.add_learn("contains-anchor",page_name_ket,anchor)
      C.add_learn("inverse-contains-anchor",anchor,page_name_ket)

      # and learn the anchor text as a link source too:
      C.add_learn("links-to",anchor,link)
      C.add_learn("inverse-links-to",link,anchor)
So I tweaked it to this. The learn-it-all section now writes to disk as it goes, instead of loading everything into memory (and we cut down what we are trying to learn to just "links-to"):
    # accumulate this page's links in memory, then write the rule straight to disk:
    result = fast_superposition()
    for x in r:
      result += ket_process_link(x)
    dest.write("links-to " + str(page_name_ket) + " => " + str(result.superposition().coeff_sort()) + "\n\n")
But that still chomped all the RAM I could throw at it. I eventually figured out the real problem was trying to load the entire 50 GB XML file into memory at once, i.e., this at the top of the code:
document = open(sys.argv[1],'rb')
soup = BeautifulSoup(document)
So the fix was to write some code to split that 50 GB file into more manageable pieces. Hence my fragment-wikipedia-xml code.
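
The idea is simply to stream the dump line by line and start a new output file every N </page> tags, so we never hold more than one line in memory at a time. A minimal sketch of that idea (not the actual fragment_wikipedia_xml.py; among other things, the real thing would also need to repeat the XML wrapper so every fragment parses cleanly, and the output names here are illustrative):

#!/usr/bin/env python
import sys

def fragment_pages(path, pages_per_fragment=30000):
    # stream the giant XML dump, rotating output files on </page> boundaries
    count, index = 0, 0
    out = open("%s.xml" % index, "w")
    with open(path) as f:
        for line in f:
            out.write(line)
            if "</page>" in line:
                count += 1
                if count % pages_per_fragment == 0:
                    out.close()
                    index += 1
                    out = open("%s.xml" % index, "w")
    out.close()

if __name__ == "__main__":
    fragment_pages(sys.argv[1], int(sys.argv[2]) if len(sys.argv) > 2 else 30000)
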
Now, put it to use:
$ ./fragment_wikipedia_xml.py

Usage: ./fragment_wikipedia_xml.py wiki.xml [pages-per-fragment]

$ ./fragment_wikipedia_xml.py wiki.xml 30000
And after a couple of days of processing we have 517 XML files that are small enough to load into memory. Small enough that I no longer need EC2, and small enough that BeautifulSoup takes less than 2 GB of RAM to load one. Cool. So we still can't process all of wikipedia, not yet, but we have a decent subset to play with. Now, put it to use:
$ du -h data/fragments/0.xml
676M    data/fragments/0.xml

$ grep -c "<page>" data/fragments/0.xml
30000

$ ./play_with_wikipedia__fast_write.py data/fragments/0.xml
And we now have this (the full hyperlink structure of that fragment of wikipedia, in sw format): 30k--wikipedia-links.sw
$ ./spit-out-sw-stats.sh 30k--wikipedia-links.sw
(92M, 1 op types and 30000 learn rules)
links-to: 30000 learn rules
I'll put this sw file to use in the next post.

Now, it is useful to have the inverse-links-to as well, so:
sa: find-inverse[links-to]
sa: save 30k--wikipedia-links--with-inverse.sw
Which only takes a couple of minutes.

More to come!

Update: some stats:
$ ./spit-out-sw-stats.sh 30k--wikipedia-links--with-inverse-simm.sw
(317M, 5 op types and 1326090 learn rules)
how-many-in-links: 1 learn rules
inverse-links-to: 1296087 learn rules
inverse-simm: 14 learn rules
inverse-simm-op: 1 learn rules
links-to: 29984 learn rules
Now, why are there fewer than 30k links-to learn rules in this second example? Simple enough. The fast-write code even wrote out rules of the form:
op |ket> => |>
but in the process of loading them into memory (then processing them), the |> rules are ignored rather than learned.
So there must be 30000 - 29984 = 16 empty-sp learn rules. Let's see what grep has to say:
$ grep " => |>" 30k--wikipedia-links.sw
links-to |WP: Cogency> => |>
links-to |WP: Wikipedia:Complete_list_of_language_wikis_available> => |>
links-to |WP: Floccinaucinihilipilification> => |>
links-to |WP: Wikipedia:GNE_Project_Files> => |>
links-to |WP: Wikipedia:GNE_Project_Files/Proposed_GNU_Moderation_System> => |>
links-to |WP: Wikipedia:GNE_Project_Files/GNE_Project_Design> => |>
links-to |WP: Wikipedia:The_future_of_Wikipedia> => |>
links-to |WP: Portal:Contents/Outlines> => |>
links-to |WP: Wikipedia:WikiBiblion> => |>
links-to |WP: Wikipedia:PHP_script_tech_talk> => |>
links-to |WP: Normoxic> => |>
links-to |WP: Conducted_interference> => |>
links-to |WP: Local_battery> => |>
links-to |WP: Wikipedia:Blocked_IPs> => |>

$ grep -c " => |>" 30k--wikipedia-links.sw
14
So, still 2 missing :). Not sure why. Don't really care.

Update: Let's show the full steps from wikipedia to wikipedia link structure:
wiki.xml from here: http://en.wikipedia.org/wiki/Wikipedia:Database_download
$ bzip2 -dc enwiki-20150515-pages-articles.xml.bz2 > wiki.xml
$ ./fragment_wikipedia_xml.py wiki.xml 30000

for file in $(ls -1 data/fragments/*); do
  echo "file: $file"
  ./play_with_wikipedia__fast_write.py "$file"
done

$ cd sw-results/

for i in $(seq 0 9); do
  echo "$i"
  cat $i--30k--wikipedia-links.sw >> 300k--wikipedia-links.sw
done

$ bzip2 -v *.sw
And the resulting sw files are compressed and stored here.

BTW:
$ ./spit-out-sw-stats.sh 300k--wikipedia-links.sw
(494M, 1 op types and 300000 learn rules)
links-to: 300000 learn rules

$ ./spit-out-sw-stats.sh 300k--wikipedia-links--with-inverse.sw
(1.5G, 2 op types and 4571683 learn rules)
inverse-links-to: 4272183 learn rules
links-to: 299499 learn rules

Wednesday, 3 June 2015

new function: find-inverse[op]

Just a short one. "create inverse" was painfully slow; most of that is because of the O(n^2) add_learn, and the long lists created by "inverse-supported-ops". So the speed-up was to specify which ops we want the inverse for, side-stepping the expensive ones. If you still want the full inverse you can still run "create inverse", but now, most of the time this is sufficient:
find-inverse[op]
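
In dict terms, find-inverse[op] is a single pass over the op's rules, appending each source ket to the inverse rule of every ket in its superposition; that is linear in the total number of kets, which is where the speed-up comes from. A minimal sketch, assuming the rules are a dict from ket to a list of kets:

def find_inverse(rules):
    # one-pass construction of inverse-op rules from op rules
    inverse = {}
    for source, superposition in rules.items():
        for target in superposition:
            inverse.setdefault(target, []).append(source)
    return inverse
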
Here is a short example:
sa: load fred-sam-friends.sw
sa: dump
----------------------------------------
|context> => |context: friends>

friends |Fred> => |Jack> + |Harry> + |Ed> + |Mary> + |Rob> + |Patrick> + |Emma> + |Charlie>

friends |Sam> => |Charlie> + |George> + |Emma> + |Jack> + |Rober> + |Frank> + |Julie>
----------------------------------------

sa: find-inverse[friends]
sa: dump
----------------------------------------
|context> => |context: friends>

friends |Fred> => |Jack> + |Harry> + |Ed> + |Mary> + |Rob> + |Patrick> + |Emma> + |Charlie>

friends |Sam> => |Charlie> + |George> + |Emma> + |Jack> + |Rober> + |Frank> + |Julie>

inverse-friends |Jack> => |Fred> + |Sam>

inverse-friends |Harry> => |Fred>

inverse-friends |Ed> => |Fred>

inverse-friends |Mary> => |Fred>

inverse-friends |Rob> => |Fred>

inverse-friends |Patrick> => |Fred>

inverse-friends |Emma> => |Fred> + |Sam>

inverse-friends |Charlie> => |Fred> + |Sam>

inverse-friends |George> => |Sam>

inverse-friends |Rober> => |Sam>

inverse-friends |Frank> => |Sam>

inverse-friends |Julie> => |Sam>
----------------------------------------
Anyway, the main point is that this is much faster (and cleaner) than the old "create inverse". And of course it is a brother to find-unique[op].

That's it for this post.

Saturday, 30 May 2015

announcing: command line guess file name tool.

OK. Just a quick one. A command line tool that, given a string, spits out the file-names that most closely match it (using unscaled simm). Here is the code, and once again, I have made it open source.
$ ./guess.py

Usage: ./guess.py string [directory]

If the directory is not given, then use the current directory.

And a quick example:
$ ./guess.py wiki work-on-wikipedia

string to match:      wiki
directory to search:  work-on-wikipedia
====================================

50 %     wiki.xml
23.08 %  1000-wiki.xml
23.08 %  5000-wiki.xml
22.22 %  wikipedia-links.sw
13.64 %  play_with_wikipedia.py
12.5 %   trial-wikipedia-links.sw
12 %     fragment_wikipedia_xml.py
10.34 %  fast-write-wikipedia-links.sw
8.82 %   play_with_wikipedia__fast_write.py
So, that should be clear enough. Whether anything like this is already out there, I don't know. I presume so. eg, the edit-distance algorithm used in spell-checkers could play a similar role to my simm.
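As a rough sketch of the idea (my reconstruction, not the actual guess.py code): map each name to a bag of characters, then rank by unscaled simm, ie the summed overlap of the two bags divided by the size of the larger one:
import os
from collections import Counter

def unscaled_simm(f,g):                   # f,g are Counters of character frequencies
  overlap = sum((f & g).values())         # sum over min(f[c],g[c])
  return 100*overlap/max(sum(f.values()),sum(g.values()),1)

def guess(string,directory="."):
  target = Counter(string.lower())
  scores = [(unscaled_simm(target,Counter(name.lower())),name) for name in os.listdir(directory)]
  for score,name in sorted(scores,reverse=True):
    print("%.2f %%  %s" % (score,name))
eg, guess("wiki","work-on-wikipedia") should produce something in the flavour of the table above.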

Heh. Perhaps I gave a bad example! In the above case, a simple:
$ ls -1 work-on-wikipedia/ | grep wiki
1000-wiki.xml
5000-wiki.xml
fast-write-wikipedia-links.sw
fragment_wikipedia_xml.py
play_with_wikipedia.py
play_with_wikipedia__fast_write.py
trial-wikipedia-links.sw
wiki.xml
wikipedia-links.sw
would have given similar results!

So here is another example of guess:
$ ./guess.py "sam fred" sw-examples/ | head

string to match:      sam fred
directory to search:  sw-examples
====================================

30.77 %  small-fred.sw
26.67 %  Freds-family.sw
26.32 %  fred-sam-friends.sw
22.22 %  shares.sw
16.67 %  saved-commands.txt

$ ./guess.py "fred sam" sw-examples/ | head

string to match:      fred sam
directory to search:  sw-examples
====================================

33.33 %  Freds-family.sw
31.58 %  fred-sam-friends.sw
25 %     frog.sw
23.08 %  small-fred.sw
20 %     friends.sw
So some things to note are:
1) "fred sam" and "sam fred" give essentially the same answer
2) I made it case insensitive
3) we don't need to use any regexp, which we would have to do if using grep in a similar manner.
4) we could possibly use this to help find the exact ket when we only partly know its name, and the sw file is too large to search manually.

Anyway, should be useful here and there.

Tuesday, 19 May 2015

comment on ket independence

Just a quick comment on the idea of ket independence.

We can say |x_i> for i in {1,2,...,n} are independent with respect to the op-sequence "op4 op3 op2 op1" if:
op4 op3 op2 op1 (|x_1> + |x_2> + ... + |x_n>)
  = op4 op3 op2 op1 |x_1>
  + op4 op3 op2 op1 |x_2>
  ...
  + op4 op3 op2 op1 |x_n>
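
For example (my reading of the definition): literal operators are linear, so kets are always independent with respect to a sequence of them. Function operators are another story. Assuming normalize rescales the coefficients so they sum to 1, then:
normalize (|a> + |b>) = 0.5|a> + 0.5|b>
but:
normalize |a> + normalize |b> = |a> + |b>
so |a> and |b> are not independent with respect to normalize.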

Not yet sure where this will be useful, but I figure I should mention it.

Tuesday, 12 May 2015

average-categorize websites

Now for a more interesting example. Recall work on pattern recognition of websites. In that case, we constructed the averages manually. We manually took the average of abc {1,2,...,10}, adelaidenow {1,2,...,10} etc. Then applied simm to webpage-11. This time, we get average-categorize to find our sample classes (again, webpage-11's are filtered out). Here is the BKO:
sa: load improved-fragment-webpages.sw
sa: load label-training-data-for-website-fragments.sw
sa: average-categorize[training-hash-4B,0.7,phi,average-cat]
sa: load create-cat-ave-pat-rec-matrix.sw
Here is the result:
sa: matrix[result]
[ phi: 1 ] = [  91.69  28.73  25.76  37.77  29.45  24.33  ] [ abc 11         ]
[ phi: 2 ]   [  28.77  78.04  26.72  29.86  25.25  28.19  ] [ adelaidenow 11 ]
[ phi: 3 ]   [  25.76  26.88  79.05  28.28  26.86  23.21  ] [ slashdot 11    ]
[ phi: 4 ]   [  37.8   29.75  28.16  85.54  32.06  24.95  ] [ smh 11         ]
[ phi: 5 ]   [  29.71  25.25  26.91  31.87  85.17  22.09  ] [ wikipedia 11   ]
[ phi: 6 ]   [  24.32  28.18  23.47  24.92  21.94  82.02  ] [ youtube 11     ]
Which counts as a success! This matrix is essentially identical to this one. Except this time, we didn't specify the classes, average-categorize did that for us, resulting in phi {1,2,3,4,5,6}.

So, a nice example of unlabeled learning. Though admittedly, webpage superpositions are rather well behaved. Certainly better behaved than the adult wage prediction example! In the sense that different classes have distinct superpositions. The wage prediction example completely failed that test.

Now that we have distinct classes we can give them names. Pretty simple actually:
M |phi: 1> => |abc>
M |phi: 2> => |adelaidenow>
M |phi: 3> => |slashdot>
M |phi: 4> => |smh>
M |phi: 5> => |wikipedia>
M |phi: 6> => |youtube>
Now, take a look:
sa: matrix[M,result]
[ abc         ] = [  1  0  0  0  0  0  ] [  91.69  28.73  25.76  37.77  29.45  24.33  ] [ abc 11         ]
[ adelaidenow ]   [  0  1  0  0  0  0  ] [  28.77  78.04  26.72  29.86  25.25  28.19  ] [ adelaidenow 11 ]
[ slashdot    ]   [  0  0  1  0  0  0  ] [  25.76  26.88  79.05  28.28  26.86  23.21  ] [ slashdot 11    ]
[ smh         ]   [  0  0  0  1  0  0  ] [  37.8   29.75  28.16  85.54  32.06  24.95  ] [ smh 11         ]
[ wikipedia   ]   [  0  0  0  0  1  0  ] [  29.71  25.25  26.91  31.87  85.17  22.09  ] [ wikipedia 11   ]
[ youtube     ]   [  0  0  0  0  0  1  ] [  24.32  28.18  23.47  24.92  21.94  82.02  ] [ youtube 11     ]
Or, as a merged matrix:
sa: merged-matrix[M,result]
[ abc         ] = [  91.69  28.73  25.76  37.77  29.45  24.33  ] [ abc 11         ]
[ adelaidenow ]   [  28.77  78.04  26.72  29.86  25.25  28.19  ] [ adelaidenow 11 ]
[ slashdot    ]   [  25.76  26.88  79.05  28.28  26.86  23.21  ] [ slashdot 11    ]
[ smh         ]   [  37.8   29.75  28.16  85.54  32.06  24.95  ] [ smh 11         ]
[ wikipedia   ]   [  29.71  25.25  26.91  31.87  85.17  22.09  ] [ wikipedia 11   ]
[ youtube     ]   [  24.32  28.18  23.47  24.92  21.94  82.02  ] [ youtube 11     ]
Presumably humans/children do something similar. They build up and recognize classes of objects before they even have a name for them. Only later, after the classes are well specified, do they attach a label (in the above case, the M learn rules).

average-categorize H-I simple pattern recognition example

Last post I outlined and gave the code for average-categorize. This post a simple example. Recall the simple pattern recognition example, H-I-pat-rec.sw. Let's apply average-categorize to it.

Now, we have these patterns:
x: letter: H
1       1
1       1
1       1
1 1 1 1 1
1       1
1       1
1       1

x: noisy: H
        1
1       1
1       1
1 1 1   1
1
1       1
1       1

x: noisy: H2
1       1
1
1   1 1 1
1 1 1 1 1
1 1     1
1       1
1 1 1   1

x: letter: I
1 1 1 1 1
    1
    1
    1
    1
    1
1 1 1 1 1

x: noisy: I
1 1 1 1
    1


    1
    1
1   1 1 1

x: noisy: I2
1 1     1
  1 1 1
    1
    1
    1 1 1
1 1 1 1
1 1 1 1 1

x: letter: O
1 1 1 1 1
1
1
1
1
1
1 1 1 1 1
Run this code:
average_categorize(C,"pixels",0.65,"phi","pixels")
And now we have:
x: phi: 1
1       2
2       1
2   0 0 2
2 2 2 1 2
2 0     1
2       2
2 0 0   2

x: phi: 2
1 1 1 1 1
    1
    1
    1
    1
    1
1 1 1 1 1

x: phi: 3
1 1     1
  1 1 1
    1
    1
    1 1 1
1 1 1 1
1 1 1 1 1

x: phi: 4
1 1 1 1 1
1
1
1
1
1
1 1 1 1 1
I think that works pretty well!

Webpage superposition example in the next post.

new function: average-categorize

This one I think will be super useful, and super powerful. I only just came up with it, so I haven't fully explored its properties, but I have already found some. And it's looking promising!

So the idea is, given a collection of points, how do we classify them into groups? I'm pretty sure this is a solved problem, but I wanted to see what I could come up with.

It goes something like this (though see the code for exact details). We have a list of currently known patterns (out_list), and a list of incoming patterns (one). For each incoming pattern (r), compare it with each of the known patterns (sp) using simm. Store the index and similarity value of the best match. At the end of the loop, if the best similarity is below the threshold t (recall for simm, a higher value is a better match, with exactly 1 for an exact match), then store r as its own known pattern. Otherwise "average" it with the best matching known pattern, weighted by its similarity. And in BKO land, simm scales/normalizes superpositions, so to average we just add the superpositions, and we don't need to divide by n.

Here is the code:
def average_categorize(context,parameters):
  try:
    op,t,phi,ave = parameters.split(',')    # eg: average-categorize[pixels,0.65,phi,pixels]
    t = float(t)
  except:
    return ket("",0)

  one = context.relevant_kets(op)           # all kets with an "op" learn rule
  out_list = []                             # the currently known patterns
  for x in one:
    r = x.apply_op(context,op)              # the incoming pattern
    best_k = -1
    best_simm = 0
    for k,sp in enumerate(out_list):        # find the best matching known pattern
      similarity = silent_simm(r,sp)
      if similarity > best_simm:
        best_k = k
        best_simm = similarity
    if best_k == -1 or best_simm < t:       # not similar enough to anything known,
      out_list.append(r)                    # so r becomes its own known pattern
    else:
      k = best_k
#      out_list[k] += r
      out_list[k] += r.multiply(best_simm)  # reweight based on result of simm.
  for k,sp in enumerate(out_list):          # learn the results: ave |phi: k+1> => sp
    context.learn(ave,phi + ": " + str(k+1),sp)
  return ket("average categorize")
Now, I may tweak this code yet, but it is already a good start. I will give a couple of examples in future posts.

Some properties:
1) it converges as you add more examples
2) the order you learn is important, earlier examples have a bigger effect than later ones
3) it reproduces the light-bulb effect, ie stronger examples have a bigger effect
4) it reproduces the reverse-light-bulb effect, ie, weaker examples have essentially no impact
5) I think it is roughly O(n^2), since each incoming pattern is compared against every known pattern so far. Certainly cheaper than my previous categorize code.

I guess I need to explain what I mean by "stronger examples" and "weaker examples". Superpositions have a thing I call currency. Let's drop to the console:
sa: measure-currency |x>
|number: 1>

sa: measure-currency 2.71|y>
|number: 2.71>

sa: measure-currency (|a> + |b> + |c>)
|number: 3>

sa: measure-currency (0.2|a> + 4|b> + 11|c> + 0.3|d>)
|number: 15.5>
So when I talk of a "stronger example" I mean r has a high currency, and by a "weaker example" I mean r has a small currency.
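BTW, in terms of the python in these posts, currency is presumably just the superposition's count_sum(), ie the sum of its coefficients. So a one-line sketch, under that assumption:
def measure_currency(sp):
  return ket("number: " + str(sp.count_sum()))   # assumes currency == sum of the coefficients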

If we look at the line:
      out_list[k] += r.multiply(best_simm)
then (ignoring the simm reweighting), the higher the currency of r, the more power it has to affect out_list[k]. I'm not sure that is clear! I'm working from mental images, and it is hard to translate them into words.
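Maybe a quick worked example will help: say out_list[k] = |a> + |b>, and the incoming r = 10|a> + 10|c> matched it with best_simm = 0.8. Then out_list[k] becomes |a> + |b> + 8|a> + 8|c> = 9|a> + |b> + 8|c>, ie the high-currency r now dominates the stored pattern. A weak r, say 0.1|a>, would leave it almost unchanged.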

More to come!

BTW, I don't think I have mentioned this anywhere yet, but we can say an operator "op" preserves currency if:
measure-currency SP == measure-currency op SP
A related definition (and probably a slightly better one) for currency conservation is: if, in matrix format, the sum of each column is exactly 1, then that operator conserves currency.
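A quick hypothetical example of the column-sum version: if M |a> => 0.3|x> + 0.7|y> and M |b> => |z>, then both columns sum to 1, and:
measure-currency M (|a> + 2|b>)
  = measure-currency (0.3|x> + 0.7|y> + 2|z>)
  = |number: 3>
  = measure-currency (|a> + 2|b>)
So M conserves currency.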

Thursday, 7 May 2015

an interesting little observation

OK. Just a quick one. There seems to be an interesting analogy between the different layers in the BKO scheme.

Take the BKO:
|y> => op4 op3 op2 op1 "" |x>
which simply corresponds to stepping from superposition to superposition, then storing the result in |y>.
Well, a level above that, we have something similar with .sw files. Note that just loading an sw file can do computation (the only time it doesn't is when all the rules are literal superpositions).

So the analogy of the above, at the sw level is:
load x.sw
load op1.sw
load op2.sw
load op3.sw
load op4.sw
save y.sw

Anyway, I just thought it was an interesting little observation.

announcing: command line file similarity tool

Decided to write a script that spits out the similarity of two files, using my simm. It has three mappings from file to superposition/list (byte, 2-byte and fragment simm) and gives both scaled and unscaled simm results. Here is the code, and I decided to make it open source!
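For what it's worth, here is a rough sketch of the three file-to-frequency-list mappings and the two flavours of simm. This is my reconstruction of the idea, not the actual simm.py code; in particular the wrap-around byte pair is a guess on my part, though it is what reproduces the small "aa" vs "a" and "fish" vs "fsih" numbers below:
from collections import Counter
import re

def byte_counts(data):                       # map a file's bytes to a frequency list
  return Counter(data)

def two_byte_counts(data):                   # overlapping byte pairs, treated as cyclic
  d = data + data[:1]
  return Counter(d[i:i+2] for i in range(len(data)))

def fragment_counts(text,split_strings):     # fragments between the split strings
  pattern = "|".join(re.escape(s) for s in split_strings)
  return Counter(re.split(pattern,text))

def unscaled_simm(f,g):                      # summed overlap over the larger total
  return 100*sum((f & g).values())/max(sum(f.values()),sum(g.values()),1)

def scaled_simm(f,g):                        # rescale both to sum 1, then overlap
  sf,sg = sum(f.values()),sum(g.values())
  if sf == 0 or sg == 0:
    return 0
  return 100*sum(min(f[k]/sf,g[k]/sg) for k in f)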

I guess some examples:
-- define some toy files:
$ echo -n "aa" > a.txt
$ echo -n "a" > b.txt

-- find their similarity:
$ ./simm.py a.txt b.txt
file 1:                a.txt
file 2:                b.txt

unscaled:
  byte similarity:     50 %
  2 byte similarity:   50 %

scaled:
  byte similarity:     100 %
  2 byte similarity:   50 %
Next example:
-- define toy files:
$ echo -n "fish" > a.txt
$ echo -n "fsih" > b.txt

-- find similarity:
$ ./simm.py a.txt b.txt
file 1:                a.txt
file 2:                b.txt

unscaled:
  byte similarity:     100 %
  2 byte similarity:   25 %

scaled:
  byte similarity:     100 %
  2 byte similarity:   25 %
A slightly more interesting example, and this time we also use fragment simm:
-- define a couple of sentences:
$ echo "Shakespeare produced most of his known work between 1589 and 1613" > a.txt
$ echo "Shakespeare produced most of his work after 1589" > b.txt

-- find their similarity, and fragment simm on spaces:
$ ./simm.py a.txt b.txt ' '
file 1:                a.txt
file 2:                b.txt

split strings:         ' '

unscaled:
  byte similarity:     71.21 %
  2 byte similarity:   65.15 %
  fragment similarity: 63.64 %

scaled:
  byte similarity:     82.19 %
  2 byte similarity:   66.2 %
  fragment similarity: 63.64 %
Now, on larger files. Recall the simm matrix I made here? Well, let's test a couple of examples:
$ ./simm.py example-files/binary.exe example-files/ebook.txt
file 1:                example-files/binary.exe
file 2:                example-files/ebook.txt

unscaled:
  byte similarity:     9.25 %
  2 byte similarity:   1.92 %

scaled:
  byte similarity:     12.44 %
  2 byte similarity:   2.66 %

$ ./simm.py example-files/ebook.txt example-files/website.html
file 1:                example-files/ebook.txt
file 2:                example-files/website.html

unscaled:
  byte similarity:     24.07 %
  2 byte similarity:   17.82 %

scaled:
  byte similarity:     72.7 %
  2 byte similarity:   47.26 %

$ ./simm.py example-files/encrypted.gpg example-files/zipped.zip
file 1:                example-files/encrypted.gpg
file 2:                example-files/zipped.zip

unscaled:
  byte similarity:     94.06 %
  2 byte similarity:   63.75 %

scaled:
  byte similarity:     97.6 %
  2 byte similarity:   65.2 %
Now, a more interesting fragment simm example. Recall the simm matrices I made here. Well, let's give some examples:
$ ./simm.py webpages-v2/abc-1.html webpages-v2/abc-7.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/abc-7.html

split strings:         '<' '>'

unscaled:
  byte similarity:     98.28 %
  2 byte similarity:   95.28 %
  fragment similarity: 91.88 %

scaled:
  byte similarity:     98.76 %
  2 byte similarity:   95.65 %
  fragment similarity: 91.88 %

$ ./simm.py webpages-v2/abc-1.html webpages-v2/adelaidenow-1.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/adelaidenow-1.html

split strings:         '<' '>'

unscaled:
  byte similarity:     11.41 %
  2 byte similarity:   10.63 %
  fragment similarity: 7.4 %

scaled:
  byte similarity:     82.51 %
  2 byte similarity:   61.66 %
  fragment similarity: 28.76 %

$ ./simm.py webpages-v2/abc-1.html webpages-v2/youtube-1.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/youtube-1.html

split strings:         '<' '>'

unscaled:
  byte similarity:     13.23 %
  2 byte similarity:   10.87 %
  fragment similarity: 10.29 %

scaled:
  byte similarity:     76.47 %
  2 byte similarity:   49.16 %
  fragment similarity: 24.27 %
And finally, some text files:
$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-Tom_Sawyer_74.txt ' '
file 1:                text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2:                text/ebook-Tom_Sawyer_74.txt

split strings:         ' '

unscaled:
  byte similarity:     40.02 %
  2 byte similarity:   38.49 %
  fragment similarity: 28.69 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.92 %
  fragment similarity: 54.34 %

$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-moby-shakespeare.txt ' '
file 1:                text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2:                text/ebook-moby-shakespeare.txt

split strings:         ' '

unscaled:
  byte similarity:     3.13 %
  2 byte similarity:   3.11 %
  fragment similarity: 2.5 %

scaled:
  byte similarity:     90.8 %
  2 byte similarity:   79.5 %
  fragment similarity: 41.8 %

$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt ' '
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-Australia.txt

split strings:         ' '

unscaled:
  byte similarity:     88.89 %
  2 byte similarity:   81.13 %
  fragment similarity: 46.52 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.49 %
  fragment similarity: 48.83 %

$ ./simm.py text/WP-Adelaide.txt text/WP-physics.txt ' '
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-physics.txt

split strings:         ' '

unscaled:
  byte similarity:     54.64 %
  2 byte similarity:   50.26 %
  fragment similarity: 23.42 %

scaled:
  byte similarity:     90.3 %
  2 byte similarity:   77.31 %
  fragment similarity: 35.54 %
And one final example. This is what happens if you try the wrong split strings for a given document type. eg here less-than and greater-than on a text file:
$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt '<' '>'
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-Australia.txt

split strings:         '<' '>'

unscaled:
  byte similarity:     88.89 %
  2 byte similarity:   81.13 %
  fragment similarity: 0 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.49 %
  fragment similarity: 0 %
Note the 0% fragment similarity.

That's it for this post. BTW, in looking around to see if others have written file similarity tools I found simhash. Heh, almost certainly faster than my method!

BTW, the above should work just fine on a large range of document types. You just need to choose the right split strings. eg, for C we might use:
./simm.py code-1.c code-2.c ' ' '{' '}' '(' ')' ';'
ie, space, curly-brackets, normal brackets and semi-colon.

Indeed, we could also apply it to DNA (though presumably real researchers already have tools to do this!). So, we can specify split-strings, and apply it to text files of DNA sequences.
eg:
./simm.py DNA-1.txt DNA-2.txt 'GAAATTCCCA' 'ATACCACT' 'AACCACACAC' 'TTAGGG'
That we can do this should not be a surprise, since my fragment similarity idea was originally motivated by the idea of gel electrophoresis.

OK. Let's do a worked example!
Let's first choose 'TTAGGG' as our split string.
Let's choose these to be our 2 DNA sequences:
GAAATTCCCA TTAGGG ATACCACT
AACCACACAC TTAGGG GAAATTCCCA
And by construction, we expect 50% fragment similarity: each sequence splits into two fragments on 'TTAGGG', and exactly one of them (GAAATTCCCA) is shared.
Let's take a look:
$ echo -n "GAAATTCCCATTAGGGATACCACT" > a.txt
$ echo -n "AACCACACACTTAGGGGAAATTCCCA" > b.txt

$ ./simm.py a.txt b.txt 'TTAGGG'
file 1:                a.txt
file 2:                b.txt

split strings:         'TTAGGG'

unscaled:
  byte similarity:     84.62 %
  2 byte similarity:   73.08 %
  fragment similarity: 50 %

scaled:
  byte similarity:     89.1 %
  2 byte similarity:   75.64 %
  fragment similarity: 50 %
Success!

Finally, my code is somewhat slow! A C implementation would be nice.

Update: if your files are line based, '\n' (ie, new-line) should usually be a sufficient split string.
If you have csv files say, then maybe '\n' and ',' (ie, new-line and comma).

Update: talking of speed, if we use lists of bits we can speed things up somewhat:
wf = sum-of-bits(f)
wg = sum-of-bits(g)
wfg = sum-of-bits(xor(f,g))
result = (wf + wg - wfg)/(2*max(wf,wg))
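In python, with ints standing in for lists of bits, that sketch becomes:
def bit_simm(f,g):                       # f,g are ints used as bit-vectors
  wf = bin(f).count("1")
  wg = bin(g).count("1")
  wfg = bin(f ^ g).count("1")            # number of positions where f and g differ
  if max(wf,wg) == 0:
    return 0
  return (wf + wg - wfg)/(2*max(wf,wg))  # == shared bits / max(wf,wg)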

Update: OK. I've given this "speed up" some thought, and decided it will not be as good as the full simm. For example, consider C code. If we use the bit method, a file containing 1 "int" is "identical" to another containing say 200 of them. Which is not all that useful!

Tuesday, 21 April 2015

Announcing phase 4: tying it all together

We are almost done! I have given the details of my project in the best way I can. Yeah, other people could do better, but I can't change that. Now I have to try and explain why I think my BKO model ties in closely to a simplified model of neural circuits, and my notation serves as a good mathematical foundation for symbolic AI.

the full wage prediction results

So, I got bored of this, but I guess I should post my results! Spoiler: 77.1% success rate.

OK. First I did some precomputation:
  load adult-wage-pattern-recognition.sw
  simm |*> #=> select[1,100] similar[input-pattern,pattern] |_self>
  map[simm,similarity-result] rel-kets[input-pattern] |>
  save adult-wage-pattern-recognition--saved-simm.sw
This took about a week! Yeah, we could do with more speed. Thankfully similar[op] should be easy to parallelize. But now that we have this, it is very quick to play with settings.
-- load up the results:
sa: load adult-wage-pattern-recognition--saved-simm.sw

-- find the number of "above 50k" and "below 50k" in the training set:
$ grep "^M" adult-wage-pattern-recognition--saved-simm.sw | grep -c "above"
7841

$ grep "^M" adult-wage-pattern-recognition--saved-simm.sw | grep -c "below"
24720

-- define our norm matrix, that takes into account the relative frequencies of "above 50k" vs "below 50k"
-- (the coefficients are just 1/7841 and 1/24720):
sa: norm |above-50K> => .000127534753220 |_self>
sa: norm |below-50K> => .000040453074433 |_self>

-- define our first attempt at a h:
sa: h |*> #=> normalize[100] coeff-sort norm M select[1,5] similarity-result |_self>
-- reading right to left: take the stored similarity results, keep the top 5 matches, map them through M to wage labels, reweight by class frequency, coeff-sort, then rescale to sum to 100.

-- define a couple of useful operators:
sa: equal? |*> #=> equal(h|_self>,100 answer |_self>)
sa: is-equal? |*> #=> max-elt wif(equal? |_self>,|True>,|False>)

-- find the table of results:
sa: table[input,h,answer,is-equal?] rel-kets[input-pattern] |>
result: adult-wage-prediction-table-select-1-5.txt

-- now the results for this h:
$ grep -c "example" adult-wage-prediction-table-select-1-5.txt
16281

$ grep -c "True" adult-wage-prediction-table-select-1-5.txt
12195

-- the percent correct:
100*12195/16281
= 74.903 %

-- next attempt at h (just pick the best match, and ignore the rest):
sa: h |*> #=> 100 M select[1,1] similarity-result |_self>

-- find the table of results:
sa: table[input,h,answer,is-equal?] rel-kets[input-pattern] |>
result: adult-wage-prediction-table-select-1-1.txt

-- now the results for this h:
$ grep -c "True" adult-wage-prediction-table-select-1-1.txt
12549

-- the percent correct:
100*12549/16281
= 77.077 %
Finally, I tried using apply-weights, but I couldn't improve on 77.1%.
eg:
h |*> #=> normalize[100] coeff-sort norm M apply-weights[5,4,3,2,1] similarity-result |_self>
Maybe if we had some iterative procedure to choose the weights on a sample set, and then apply that to the full set, we might improve on 77%. But I gave up!

And a note, these tables of 16,281 entries take about 2 minutes to generate. Without the precomputation, they would take the full week, for each tweak of h.

Another possible method to improve on 77% and get closer to the 84% I see with other methods is to tweak our supervised pattern recognition algo. The apply-weights is really trying to change weights after the similarity has been calculated. But we can also do it before, and pre-weight our superpositions before we feed them to simm.

So instead of:
Given the training data set D:
D = {(X1,Y1),(X2,Y2),...(Xn,Yn)}
where Xi, and Yi are superpositions (and must not be empty superpositions that have all coeffs equal to 0)

Then learn these rules:
pattern |node: 1> => X1
pattern |node: 2> => X2
...
pattern |node: n> => Xn

M |node: 1> => Y1
M |node: 2> => Y2
...
M |node: n> => Yn

Then given the unlabeled data set U = {Z1,Z2,...Zm}, where Zi are superpositions of the same type as Xi, learn these rules:
input-pattern |example: 1> =>  Z1
input-pattern |example: 2> =>  Z2
...
input-pattern |example: m> =>  Zm
We first find a matrix W that re-weights our Xk and Zk superpositions/patterns. Then do:
Given the training data set D:
D = {(X1,Y1),(X2,Y2),...(Xn,Yn)}
where Xi, and Yi are superpositions (and must not be empty superpositions that have all coeffs equal to 0)

Then learn these rules:
pattern |node: 1> => W X1
pattern |node: 2> => W X2
...
pattern |node: n> => W Xn

M |node: 1> => Y1
M |node: 2> => Y2
...
M |node: n> => Yn

Then given the unlabeled data set U = {Z1,Z2,...Zm}, where Zi are superpositions of the same type as Xi, learn these rules:
input-pattern |example: 1> =>  W Z1
input-pattern |example: 2> =>  W Z2
...
input-pattern |example: m> =>  W Zm
And note that W does not need to be square. Indeed, the output of "W Xk" can be a completely different type of superposition than Xk. But again, like the apply-weights idea, I don't know a good way to find W. Perhaps borrow some ideas from standard artificial neural networks?

That's it for this post!

Update: I tried a new h, but only got 74% success (12043/16281).
h |*> #=> normalize[100] coeff-sort norm M invert subtraction-invert[1] select[1,5] similarity-result |_self>
I also tried select[1,3] and select[1,10] but they were worse.

Saturday, 18 April 2015

new function: apply-weights[n1,n2,..]

Another brief one. So the motivation was to try and improve my results in the adult wage prediction example. Previously I just used select[1,5] applied to the top similarity matches. Apply-weights[] can be considered a partial generalization of that, in the sense that apply-weights[1,1,1,1,1] is the same as select[1,5].

Here is the python:
def apply_weights(one,weights):
  weights = weights.split(",")
  result = superposition()
  for k,x in enumerate(one):
    if k >= len(weights):
      break
    result += x.multiply(float(weights[k]))
  return result
And here is a brief example:
sa: apply-weights[3.1415,0,6,7.3,13] split |a b c d e f g h i j>
3.142|a> + 0|b> + 6|c> + 7.3|d> + 13|e>
That should be clear enough.

BTW, as for the wage prediction results, well, I tried some examples and I failed to improve on just picking the result with the highest match (ie, select[1,1]), at 77.1% success rate. Maybe if you choose the weights just right you will get a better result? I don't yet know how to do that though.

That's it for this post.

Update: apply-weights can be considered to be multiplication by a diagonal matrix, with the weights as the values on the diagonal (and since the loop breaks when it runs out of weights, kets past the n-th are dropped).

new function: full-exp[op,n]

Just a brief one. I don't currently have a use for it, but part of my brain says it might be useful. Yeah, I guess I am not a minimalist. Anyway, a tweak on exp[op,n], this time we keep the 1/factorial(k) factor.

So, simply enough:
full-exp[op,n] |x>
maps to: (1 + op/1! + op^2/2! + ... + op^n/n!) |x>

No point giving the python.

Anyway, a quick example:
-- define our X operator:
sa: X |*> #=> algebra(|x>,|*>,|_self>)

-- the previous exp operator:
sa: exp[X,6] |1>
|1> + |x> + |x*x> + |x*x*x> + |x*x*x*x> + |x*x*x*x*x> + |x*x*x*x*x*x>

-- the new exp operator:
sa: full-exp[X,6] |1>
|1> + |x> + 0.5|x*x> + 0.167|x*x*x> + 0.042|x*x*x*x> + 0.008|x*x*x*x*x> + 0.001|x*x*x*x*x*x>
And I guess that is it. We see the standard Taylor series for exp(x) as expected.

Update: and we can use exp as a "smear" operator, to expand a single point into a range of points. So, in a way, a little like the range function.

In the console:
-- define a translation operator:
T |*> #=> arithmetic(|_self>,|+>,|x: 1>)

-- a test case (translate by 1):
sa: T |x: 0>
|x: 1>

-- another test case (translate by 7):
sa: T^7 |x: 0>
|x: 7>

-- use exp[op,n] as a smear operator:
sa: exp[T,7] |x: 0>
|x: 0> + |x: 1> + |x: 2> + |x: 3> + |x: 4> + |x: 5> + |x: 6> + |x: 7>

Tuesday, 14 April 2015

new function: list-kets

So, I guess this thing could be called the brother of rel-kets[op]. The idea is simple: return a superposition of all known kets that match the pattern.

eg:
list-kets |movie: *> returns the list of all movies (in the current context)
list-kets |person: *> returns the list of all people
list-kets |animal: *> returns the list of all animals
list-kets |*> returns all kets with learn rules.

Here is the python (in the new_context class):
# e is a ket.
  def list_kets(self,e):
    label = e.the_label()
    if len(label) == 0:
      return ket("",0)
    if label[-1] != "*":
      return e
    label = label.rstrip("*").rstrip(": ")
    result = superposition()
    for trial_label in self.ket_rules_dict:
      if trial_label.startswith(label):
        result.data.append(ket(trial_label))  
    return result
I hope that is simple and obvious enough. Should be useful in a bunch of places.

As usual, more to come!

Update: here is a quick example. What is the list of animals with 4 legs?
such-that[has-4-legs] list-kets |animal: *>

Update: For now I have deprecated list-kets. I have replaced it with starts-with.
eg, list of animals with 2 legs?
such-that[has-2-legs] starts-with |animal: >

eg, list all kets that have Fred as a first name?
starts-with |Fred >

eg, list all kets:
starts-with |>

Here is the new python:
# e is a ket.
  def starts_with(self,e):
    label = e.the_label()
    result = superposition()
    for trial_label in self.ket_rules_dict:
      if trial_label.startswith(label):
        result.data.append(ket(trial_label))  
    return result
That's it for now. Hrmm... I wonder if I broke any code with this change?

Saturday, 11 April 2015

new function: subset

Now, rounding out one more maths piece. We already have set union in the form of union(SP1,SP2), set intersection in the form of intn(SP1,SP2), and a test for set membership in the form of mbr(KET,SP). Today we add a test for subsetness: subset(SP1,SP2). It returns |subset> if SP1 is an exact subset, and c|subset> with c < 1 if it is not.

Here is the python (yeah, trivial):
def subset(one,two):
  if one.count_sum() == 0:           # prevent div by 0.
    return ket("",0)
  value = intersection(one,two).count_sum()/one.count_sum()
  return ket("subset",value)
And here are some simple examples in the console:
-- full subset:
sa: subset(|b>,|a> + |b> + |c>)
|subset>

-- partial subset:
sa: subset(|a>,0.8|a>)
0.8|subset>

-- another partial subset:
sa: subset(|a> + |d>,|a> + |b> + |c>)
0.5|subset>

-- another full subset:
sa: subset(|b> + |d> + |e> + |f>,|a> + |b> + |c> + |d> + |e> + |f> + |g> + |h>)
|subset>

-- another full subset, this one with coeffs other than just 0 and 1:
sa: subset(0.8|a> + 3|b> + 7|c> + 0.2|d>,|a> + 4|b> + 7|c> + 5|d> + 37|e>)
|subset>

-- not at all a subset:
sa: subset(|d> + |e> + |f>,|a> + |b> + |c>)
0|subset>
I guess that is it. I hope that is clear enough. Now, I don't yet know where I will use it, but presumably it will be useful!

And an observation. I guess one interpretation of subset(SP1,SP2) == 1 is that for every point of a "curve" in SP1, its value is bounded by (ie, less than or equal) the value of SP2 at that same point. I suspect this will be an interesting idea.

BTW, this post was partly motivated by multi-sets. Though multi-sets can be considered a special case of superpositions, since superpositions can also have non-integer coeffs.

Thursday, 2 April 2015

this concludes phase 3

So, all the basics are out of the way! Phase 1 and literal operators, phase 2 and function operators, phase 3 making heavy use of similar[op], find-topic[op] and more complicated BKO examples. I guess it will be the job of phase 4 to try and tie it all together. I think I'm going to find that hard!

So that is it for now! More to come of course!

mapping mindpixels to BKO rules

So, borrowing the data from the now closed mindpixel project, we can show how you would map them to BKO. And I think I prefer the term "molecule of knowledge" rather than "mindpixel".

Anyway, some mindpixels:
	1.00	is icecream cold?
	1.00	is earth a planet?
	1.00	Is green a color?
	1.00	do airplanes fly?
	1.00	Is it hot during the summer?
	1.00	is chile in south america ?
	1.00	Was Socrates a man?
	1.00	Computers use electricity?
	1.00	The dominant language in france is french?
	1.00	was abraham lincoln once president of the united states?
	1.00	Is milk white?
	1.00	do people have emotions?
	1.00	do objects appear smaller as they move away from you?
	1.00	Does the human species have a male and female gender?
	1.00	Is a mountain mostly made of rock?
	1.00	is sun microsystems a computer company?
	1.00	Do you see with your eyes and smell with your nose?
	1.00	Is smoking bad for your health?
	1.00	Does a dog have four legs?
	1.00	Do mammals have hearts?
	1.00	is the Earth a planet?
	1.00	Is water a liquid?
	1.00	Is Bugs Bunny a cartoon character?
	1.00	Do Humans communicate by Telephone?
	1.00	is beer a drink ?
	1.00	are there 12 months in a year?
	1.00	does the sun hurt your eyes when you look at it?
	1.00	Do most cars have doors?
	1.00	is orange both a fruit and a colour?
	1.00	Is water a necessity?
...
Now as BKO:
is-cold |icecream> => |yes>
is-a-planet |earth> => |yes>
is-a-color |green> => |yes>
does-fly |airplanes> => |yes>
is-hot-during |summer> => |yes>
is-in-south-america |chile> => |yes>
is-a-man |Socrates> => |yes>
uses-electricity |computer> => |yes>
spoken-language |france> => |french>
was-a-us-president |Abraham Lincoln> => |yes>
is-white |milk> => |yes>
have-emotions |people> => |yes>
...
gender |human species> => |male> + |female>
is-made-of-rock |mountain> => |yes>
is-a-computer-company |sun microsystems> => |yes>
see-with |eyes> => |yes>
smell-with |nose> => |yes>
is-bad-for-your-health |smoking> => |yes>
has-four-legs |dog> => |yes>
has-a-heart |animal> => |yes>
is-a-planet |earth> => |yes> -- duplicate rule
is-liquid |water> => |yes>
is-a-cartoon-character |bugs bunny> => |yes>
communicates-by-telephone |humans> => |yes>
is-a-drink |beer> => |yes>
how-many-months |year> => |number: 12>
hurts-your-eyes-when-you-look-at-it |sun> => |yes>
has-doors |car> => |yes>
is-a-fruit |orange> => |yes>
is-a-color |orange> => |yes>
is-a-necessity |water> => |yes>
Note that "is-something" rules seem kind of pointless, but they are actually very useful when combined with "such-that[condition] some-list".
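eg, presumably something like this (using the such-that pattern from the list-kets post above), with the rules as given:
sa: such-that[is-a-color] (|green> + |milk> + |orange>)
|green> + |orange>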

That's it for this post. I think I made my point.

similarity matrices for wikipedia word frequency lists

Just a quick one this time. The similarity matrices for our wikipedia word frequency lists.

In the console:
-- load the data:
sa: load improved-WP-word-frequencies.sw

-- create our matrices:
sa: simm-1 |*> #=> 100 self-similar[words-1] |_self>
sa: simm-2 |*> #=> 100 self-similar[words-2] |_self>
sa: simm-3 |*> #=> 100 self-similar[words-3] |_self>
sa: map[simm-1,similarity-1] rel-kets[words-1] |>
sa: map[simm-2,similarity-2] rel-kets[words-2] |>
sa: map[simm-3,similarity-3] rel-kets[words-3] |>

-- display them:
sa: matrix[similarity-1]
[ WP: Adelaide         ] = [  100.0  56.86  15.13  37.53  38.7   27.41  24.25  ] [ WP: Adelaide         ]
[ WP: Australia        ]   [  56.86  100.0  16.48  37.53  40.39  27.63  24.32  ] [ WP: Australia        ]
[ WP: country list     ]   [  15.13  16.48  100.0  9.77   8.53   30.32  21.36  ] [ WP: country list     ]
[ WP: particle physics ]   [  37.53  37.53  9.77   100.0  51.62  22.68  19.5   ] [ WP: particle physics ]
[ WP: physics          ]   [  38.7   40.39  8.53   51.62  100.0  22.76  18.1   ] [ WP: physics          ]
[ WP: rivers           ]   [  27.41  27.63  30.32  22.68  22.76  100.0  24.52  ] [ WP: rivers           ]
[ WP: US presidents    ]   [  24.25  24.32  21.36  19.5   18.1   24.52  100.0  ] [ WP: US presidents    ]

sa: matrix[similarity-2]
[ WP: Adelaide         ] = [  100.0  15.04  1.73   6.4    7.71   4.39   3.2    ] [ WP: Adelaide         ]
[ WP: Australia        ]   [  15.04  100.0  2.27   5.92   8.04   4.43   4.26   ] [ WP: Australia        ]
[ WP: country list     ]   [  1.73   2.27   100.0  1.49   1.46   2.18   1.35   ] [ WP: country list     ]
[ WP: particle physics ]   [  6.4    5.92   1.49   100.0  13.86  3.81   3.28   ] [ WP: particle physics ]
[ WP: physics          ]   [  7.71   8.04   1.46   13.86  100.0  4.52   2.63   ] [ WP: physics          ]
[ WP: rivers           ]   [  4.39   4.43   2.18   3.81   4.52   100.0  2.98   ] [ WP: rivers           ]
[ WP: US presidents    ]   [  3.2    4.26   1.35   3.28   2.63   2.98   100.0  ] [ WP: US presidents    ]

sa: matrix[similarity-3]
[ WP: Adelaide         ] = [  100.0  2.59   0.16   0.64   0.73   0.24   0.1    ] [ WP: Adelaide         ]
[ WP: Australia        ]   [  2.59   100.0  0.34   0.47   0.96   0.35   0.53   ] [ WP: Australia        ]
[ WP: country list     ]   [  0.16   0.34   100.0  0.14   0.1    0.46   0.22   ] [ WP: country list     ]
[ WP: particle physics ]   [  0.64   0.47   0.14   100.0  2.98   0.26   0.17   ] [ WP: particle physics ]
[ WP: physics          ]   [  0.73   0.96   0.1    2.98   100.0  0.48   0.14   ] [ WP: physics          ]
[ WP: rivers           ]   [  0.24   0.35   0.46   0.26   0.48   100.0  0.21   ] [ WP: rivers           ]
[ WP: US presidents    ]   [  0.1    0.53   0.22   0.17   0.14   0.21   100.0  ] [ WP: US presidents    ]
That's all clear enough. No further comment needed.

Heh, we can even show the matrices of words, but they are way too big to post.
Instead:
matrix[words-1]
matrix[words-2]
matrix[words-3]

new function: intn-find-topic[op]

The motivation for this function is some of the results in the last post. Most of the time find-topic[op] gives good results, but I found sometimes it gives really crappy results. In particular, when I asked about "Japan Russia China" I got this messy result:
sa: h1 |japan russia china>
40.697|WP: rivers> + 22.252|WP: Australia> + 19.422|WP: country list> + 12.804|WP: particle physics> + 4.825|WP: Adelaide>
But I also found if I took the intersection of the separate results for Japan, Russia and China I got a much better result:
sa: intn(h1 |japan>, h1 |russia>, h1 |china>)
12.308|WP: country list>
And hence some new python:
-- in the ket class (currently we don't have a superposition version of this, and probably don't need one!):
  def intn_find_topic(self,context,op):
    words = self.label.lower().split()       # we made it case insensitive.
    if len(words) == 0:
      return ket("",0)
    results = [context.map_to_topic(ket(x),op) for x in words]
    if len(results) == 0:                    # this should never be true!
      return ket("",0)
    r = results[0]
    for sp in results:
      r = intersection(r,sp)
    return r.normalize(100).coeff_sort()
Now, if the ket is a single word, then it gives the exact same answer as find-topic[words-1]. But if the ket is words separated by space, then it usually gives a much better result.

So, let's try the troublesome examples from last time:
-- load the data:
sa: load improved-WP-word-frequencies.sw

-- save some typing:
-- h1 is the old method, F1 is the new method.
sa: h1 |*> #=> find-topic[words-1] split |_self>
sa: F1 |*> #=> intn-find-topic[words-1] |_self>

-- ask about the Nile:
sa: h1 |nile river>
76.811|WP: rivers> + 13.788|WP: Adelaide> + 9.401|WP: Australia>

sa: F1 |nile river>
100|WP: rivers>

-- ask about George Bush:
sa: h1 |george bush>
67.705|WP: US presidents> + 22.363|WP: Australia> + 9.932|WP: Adelaide>

sa: F1 |george bush>
77.465|WP: US presidents> + 22.535|WP: Australia>

-- ask about Japan, Russia and China:
sa: h1 |japan russia china>
40.697|WP: rivers> + 22.252|WP: Australia> + 19.422|WP: country list> + 12.804|WP: particle physics> + 4.825|WP: Adelaide>

sa: F1 |japan russia china>
100|WP: country list>
So it really is a good improvement on standard find-topic[op]! So at some stage I should probably scale it up to even more of wikipedia.

Recall the discussion (from long ago) about the difference between intersection and soft-intersection? Maybe I should find the link! Anyway, h1 can be considered to be using a soft intersection approach, and F1 the strict intersection approach (which is really a better fit for a search engine type algo anyway! You generally don't want pages that ignore one or more of your query terms.)

Anyway, my "ultra simple search algo" now looks like this:
|answer> => table[page,coeff,url] select[1,10] coeff-sort weight-pages intn-find-topic[words-1] |just some words>
I guess the final comment is we can still perhaps tweak this further. This proposed algo does not take into consideration the relative closeness of words in a document. eg, if one word is at the top, and the other is at the bottom, I presume you would want that to score lower than if the two words were near each other. How to do that cleanly I don't know.

That's it for this post!

Update: here is a fun one:
sa: F1 |thomas ronald richard bill barack george james jimmy>
100|WP: US presidents>
Update: Another way to look at it is that h1 is "word-1 OR word-2", F1 is "word-1 AND word-2".

mapping wikipedia pages to frequency lists

This is a proof of concept that maybe we can use find-topic[op] to search the web, in this particular case applied to a tiny sample of wikipedia pages. Who knows, maybe we don't need page rank to search the web? But that is all highly speculative, and I haven't done any work in that direction. But I do have the wikipedia version working. Here is the code to map my sample wikipedia posts to frequency lists.

Now, in the console:
-- load the data:
sa: load improved-WP-word-frequencies.sw

-- save some typing:
sa: h1 |*> #=> find-topic[words-1] split |_self>
sa: h2 |*> #=> find-topic[words-2] |_self>
sa: h3 |*> #=> find-topic[words-3] |_self>
sa: t1 |*> #=> table[page,coeff] find-topic[words-1] split |_self>
-- NB: note the "split" in there in h1. This is important to note!
-- NB: words-1 are 1-gram word frequencies. words-2 are 2-gram word frequencies. words-3 are 3-gram word frequencies.

-- where will I find info on Adelaide?
sa: h1 |adelaide>
74.576|WP: Adelaide> + 25.424|WP: Australia>

-- where will I find info on Adelaide university?
sa: h1 |adelaide university>
66.236|WP: Adelaide> + 33.764|WP: Australia>

-- and again, this time using 3-grams:
sa: h3 |university of adelaide>
76.923|WP: Adelaide> + 23.077|WP: Australia>

-- where will I find info on Aami Stadium?
sa: h1 |aami stadium>
100|WP: Adelaide>

-- where will I find info on Perth?
sa: h1 |perth>
100|WP: Australia>

-- where will I find info on the Nile river?
sa: h1 |nile river>
76.811|WP: rivers> + 13.788|WP: Adelaide> + 9.401|WP: Australia>
-- hrmmm... Adelaide and Australia are in there because of "river"
-- let me show you:
sa: h1 |river>
53.621|WP: rivers> + 27.577|WP: Adelaide> + 18.802|WP: Australia>

-- so try again, this time using 2-grams: 
sa: h2 |nile river>
|>
-- null result.
-- so try again:
sa: h2 |river nile>
100.0|WP: rivers>
-- so we finally got there, but note how exact you have to be. Hence again, why we need a "did you mean" feature.

-- where will I find info on Bill Clinton:
sa: h1 |bill clinton>
100|WP: US presidents>

-- where will I find info on Nixon:
sa: h1 |nixon>
100|WP: US presidents>

-- where will I find info on George Bush (first try 2-grams):
sa: h2 |george bush>
|>

-- this time using 1-grams:
sa: h1 |george bush>
67.705|WP: US presidents> + 22.363|WP: Australia> + 9.932|WP: Adelaide>

-- now, why are Australia and Adelaide in there? I will show you:
sa: h1 |george>
62.077|WP: US presidents> + 19.865|WP: Adelaide> + 18.059|WP: Australia>

sa: h1 |bush>
73.333|WP: US presidents> + 26.667|WP: Australia>

-- where will I find info on physics:
sa: h1 |physics>
54.237|WP: physics> + 45.763|WP: particle physics>

-- where will I find info on electrons:
sa: h1 |electron>
62.791|WP: particle physics> + 37.209|WP: physics>

-- what about Newton?
sa: h1 |newton>
100|WP: physics>

-- and Einstein?
sa: h1 |einstein>
100|WP: physics>

-- and Feynman?
sa: h1 |feynman>
64|WP: physics> + 36|WP: particle physics>

-- where will I find info on Japan, Russia and China?
sa: h1 |japan russia china>
40.697|WP: rivers> + 22.252|WP: Australia> + 19.422|WP: country list> + 12.804|WP: particle physics> + 4.825|WP: Adelaide>
-- hrmm ... that didn't work very well.

-- let's look at the components:
-- Japan?
sa: h1 |japan>
53.598|WP: Australia> + 24.566|WP: particle physics> + 21.836|WP: country list>

-- Russia?
sa: h1 |russia>
73.846|WP: rivers> + 13.846|WP: particle physics> + 12.308|WP: country list>

-- China?
sa: h1 |china>
48.246|WP: rivers> + 24.123|WP: country list> + 14.474|WP: Adelaide> + 13.158|WP: Australia>

-- let's try an intersection:
sa: intn(h1 |japan>, h1 |russia>, h1 |china>)
12.308|WP: country list>
-- that worked much better!!
-- Indeed, I think that is a strong hint I should write an intn-find-topic[op] function!
So it all works pretty well. The important question is does it work better than standard search? I don't know.

Another question is how well this will work if we scale it up to all of wikipedia? It will certainly be slow, at least with the current code, but how good would the results be compared to the search function already built into wikipedia, or searching wikipedia indirectly using google?

I guess at this point I could propose an ultra simple search algo.
Say you search for "just some words", then this back-end BKO:
|answer> => table[page,coeff] select[1,10] coeff-sort weight-pages find-topic[words-1] split |just some words>
where weight-pages re-weights the pages returned from find-topic[op] based on some measure of quality for a page.
eg, I'm thinking something like:
-- "url: a" is a good page:
weight-pages |url: a> => 7|url: a>

-- "url: b" is not a good page:
weight-pages |url: b> => 0.2|url: b>

-- "url: c" is an ok page:
weight-pages |url: c> => 2|url: c>
And I guess that is it for this post!

Update: BTW, to safely handle the case of an unknown url (which would otherwise map to |>), define this general rule:
weight-pages |url: *> #=> |_self>