The Semantic DB Project

Sunday, 21 June 2015

some more similar[inverse-links-to] results

This time using 300,000 pages of wikipedia (out of 15,000,000 total). So roughly 2% of total. Even with EC2, I don't really have the processing power (with the current code) to use much larger sets than this.

sa: load 300k--wikipedia-links.sw
sa: find-inverse[links-to]
sa: H |*> #=> how-many inverse-links-to merge-labels(|WP: > + |_self>)
sa: S |*> #=> table[wikipage,coeff] select[1,60] 100 self-similar[inverse-links-to] merge-labels(|WP: > + |_self>)

sa: S |Love>
+-----------------------------------+--------+
| wikipage                          | coeff  |
+-----------------------------------+--------+
| Love                              | 100.0  |
| Pride                             | 17.391 |
| Pleasure                          | 13.043 |
| Jealousy                          | 13.043 |
| Philotes_(mythology)              | 13.043 |
| Imagination                       | 13.043 |
| Pity                              | 13.043 |
| Envy                              | 13.043 |
| Peace                             | 12.121 |
| Matter                            | 12     |
| Fear                              | 8.696  |
| Measurement                       | 8.696  |
| Number                            | 8.696  |
| Observation                       | 8.696  |
| Misanthropy                       | 8.696  |
| Piety                             | 8.696  |
| Courage                           | 8.696  |
| Hope                              | 8.696  |
| Lust                              | 8.696  |
| Asteria                           | 8.696  |
| Orthrus                           | 8.696  |
| Modesty                           | 8.696  |
| Punishment                        | 8.696  |
| Idea                              | 8.696  |
| Politeness                        | 8.696  |
| Learning                          | 8.696  |
| Luck                              | 8.696  |
| Sexual_attraction                 | 8.696  |
| Necessity                         | 8.696  |
| Physical_intimacy                 | 8.696  |
| Wrath                             | 8.696  |
| Gluttony                          | 8.696  |
| Prediction                        | 8.696  |
| Darkness                          | 8.696  |
| Safety                            | 8.696  |
| Optimism                          | 8.696  |
| Doubt                             | 8.696  |
| Moderation                        | 8.696  |
| Compassion                        | 8.696  |
| Respect                           | 8.696  |
| Nomenclature                      | 8.696  |
| Courtship                         | 8.696  |
| Jonathan_Barnes                   | 8.696  |
| DielsKranz_numbering_system       | 8.696  |
| John_Raven                        | 8.696  |
| De_amore_(Andreas_Capellanus)     | 8.696  |
| Infatuation                       | 8.696  |
| Category:Love               | 8.696  |
| Contempt                          | 8.696  |
| Memory                            | 8.696  |
| Quantity                          | 8.696  |
| cyclops                           | 8.696  |
| Curiosity                         | 8.696  |
| Passion_(emotion)                 | 8.696  |
| Category:Philosophy_of_love | 8.696  |
| nonverbal_communication           | 8.696  |
| Air                               | 8.696  |
| Neikea                            | 8.696  |
| Peter_Kingsley_(scholar)          | 8.696  |
| Inquiry                           | 8.696  |
+-----------------------------------+--------+
  Time taken: 1 hour, 42 minutes, 23 seconds, 210 milliseconds

sa: S |Knowledge>
+----------------------------+--------+
| wikipage                   | coeff  |
+----------------------------+--------+
| Knowledge                  | 100.0  |
| Inquiry                    | 16     |
| Measurement                | 12     |
| Pride                      | 12     |
| Idea                       | 12     |
| Learning                   | 12     |
| Prediction                 | 12     |
| Experience                 | 12     |
| Memory                     | 12     |
| Intelligence_(trait)       | 12     |
| understanding              | 10.345 |
| Imre_Lakatos               | 8.333  |
| Beauty                     | 8      |
| Outline_of_education       | 8      |
| Faith                      | 8      |
| Love                       | 8      |
| Meaning_of_life            | 8      |
| Metaphor                   | 8      |
| Nominalism                 | 8      |
| Number                     | 8      |
| Observation                | 8      |
| Platonic_idealism          | 8      |
| Pain                       | 8      |
| Pathological_science       | 8      |
| Problem_of_other_minds     | 8      |
| Misanthropy                | 8      |
| Piety                      | 8      |
| Virtue                     | 8      |
| Lust                       | 8      |
| Discovery_(observation)    | 8      |
| Ineffability               | 8      |
| Belief                     | 8      |
| Organization               | 8      |
| Modesty                    | 8      |
| Placebo                    | 8      |
| Punishment                 | 8      |
| Quasi-empirical_method     | 8      |
| Pleasure                   | 8      |
| Jealousy                   | 8      |
| Authority                  | 8      |
| Karl_Mannheim              | 8      |
| Paradigm                   | 8      |
| Intensionality             | 8      |
| Problem_of_induction       | 8      |
| Necessity                  | 8      |
| Elegance                   | 8      |
| Prattyasamutpda            | 8      |
| Moderation                 | 8      |
| Phenomenalism              | 8      |
| Nomenclature               | 8      |
| Potentiality_and_actuality | 8      |
| Max_Scheler                | 8      |
| Matter                     | 8      |
| Panpsychism                | 8      |
| Information                | 8      |
| knowledge_management       | 8      |
| Lev_Shestov                | 8      |
| Interpretation_(logic)     | 8      |
| Outline_of_philosophy      | 8      |
| Outline_of_logic           | 8      |
+----------------------------+--------+
  Time taken: 1 hour, 48 minutes, 29 seconds, 868 milliseconds

sa: H |Google>
|number: 704>

sa: S |Google>
+---------------------------------------+--------+
| wikipage                              | coeff  |
+---------------------------------------+--------+
| Google                                | 100.0  |
| Apple_Inc.                            | 14.063 |
| Microsoft                             | 12.732 |
| Facebook                              | 11.222 |
| Yahoo!                                | 9.375  |
| World_Wide_Web                        | 8.807  |
| IBM                                   | 8.093  |
| Sun_Microsystems                      | 7.955  |
| Android_(operating_system)            | 7.812  |
| Internet                              | 7.487  |
| Amazon.com                            | 7.102  |
| Intel                                 | 6.676  |
| Linux                                 | 6.537  |
| Hewlett-Packard                       | 6.25   |
| Stanford_University                   | 6.108  |
| Twitter                               | 6.108  |
| web_browser                           | 6.108  |
| HTML                                  | 5.824  |
| operating_system                      | 5.803  |
| YouTube                               | 5.657  |
| Forbes                                | 5.384  |
| Massachusetts_Institute_of_Technology | 5.324  |
| Java_(programming_language)           | 4.83   |
| AOL                                   | 4.687  |
| smartphone                            | 4.687  |
| open_source                           | 4.687  |
| C_(programming_language)              | 4.608  |
| Silicon_Valley                        | 4.545  |
| Nokia                                 | 4.403  |
| C++                                   | 4.403  |
| Microsoft_Windows                     | 4.354  |
| JavaScript                            | 4.261  |
| Wired_(magazine)                      | 4.261  |
| Motorola                              | 4.119  |
| XML                                   | 4.119  |
| Wall_Street_Journal                   | 4.119  |
| CNET                                  | 4.119  |
| copyright                             | 4.119  |
| software                              | 4.119  |
| Oracle_Corporation                    | 3.977  |
| Sony                                  | 3.977  |
| Unix                                  | 3.977  |
| Mac_OS_X                              | 3.977  |
| Wikipedia                             | 3.977  |
| Internet_Explorer                     | 3.835  |
| OS_X                                  | 3.835  |
| source_code                           | 3.835  |
| eBay                                  | 3.835  |
| computer_science                      | 3.748  |
| University_of_California,_Berkeley    | 3.732  |
| IP_address                            | 3.693  |
| Larry_Page                            | 3.693  |
| iPhone                                | 3.693  |
| algorithm                             | 3.693  |
| free_software                         | 3.693  |
| University_of_Michigan                | 3.551  |
| GNU_General_Public_License            | 3.551  |
| database                              | 3.551  |
| Carnegie_Mellon_University            | 3.409  |
| Cisco_Systems                         | 3.409  |
+---------------------------------------+--------+
  Time taken: 1 day, 18 hours, 53 minutes, 1 second, 791 milliseconds

sa: H |Blog>
|number: 32>

sa: S |Blog>
+-----------------------------------------------------------------+-------+
| wikipage                                                        | coeff |
+-----------------------------------------------------------------+-------+
| Blog                                                            | 100   |
| Active_Server_Pages                                             | 9.375 |
| Desktop_publishing                                              | 9.375 |
| Online_chat                                                     | 9.375 |
| CAPTCHA                                                         | 9.375 |
| RSS                                                             | 9.302 |
| Dynamic_HTML                                                    | 6.25  |
| Malware                                                         | 6.25  |
| Chat_room                                                       | 6.25  |
| Content_management_system                                       | 6.25  |
| ABC_World_News_Tonight                                          | 6.25  |
| Cross-site_scripting                                            | 6.25  |
| Primetime_(TV_series)                                           | 6.25  |
| Phishing                                                        | 6.25  |
| home_page                                                       | 6.25  |
| Open_source_software                                            | 6.25  |
| impact_factor                                                   | 6.25  |
| Terminate_and_Stay_Resident                                     | 6.25  |
| electronic_mailing_list                                         | 6.25  |
| Podcast                                                         | 6.25  |
| Google_Scholar                                                  | 6.25  |
| OPML                                                            | 6.25  |
| feed_aggregator                                                 | 6.25  |
| peer-review                                                     | 6.25  |
| Social_networking_service                                       | 6.25  |
| Digg                                                            | 6.25  |
| carbon_copy                                                     | 6.25  |
| online_community                                                | 6.25  |
| Freemium                                                        | 6.25  |
| Microsoft_Silverlight                                           | 6.25  |
| Wikia                                                           | 6.25  |
| Peer-to-peer_file_sharing                                       | 6.25  |
| Fully_qualified_domain_name                                     | 6.25  |
| Category:Internet_forums                                  | 6.25  |
| Category:American_broadcast_news_analysts                 | 6.25  |
| arXiv.org                                                       | 6.25  |
| preprint                                                        | 6.25  |
| Cicada_3301                                                     | 6.25  |
| fansite                                                         | 6.25  |
| Affiliate_marketing                                             | 6.25  |
| Category:American_television_news_anchors                 | 6.25  |
| Category:ABC_News_personalities                           | 6.25  |
| Category:American_television_reporters_and_correspondents | 6.25  |
| Lisa_McRee                                                      | 6.25  |
| Category:Electronic_publishing                            | 6.25  |
| Kevin_Newman_(journalist)                                       | 6.25  |
| Robin_Roberts_(sportscaster)                                    | 6.25  |
| Internet_Information_Services                                   | 6.061 |
| newsmagazine                                                    | 5.882 |
| George_Stephanopoulos                                           | 5.714 |
| news_presenter                                                  | 5.714 |
| FAQ                                                             | 5.556 |
| Internet_meme                                                   | 5.405 |
| Common_Gateway_Interface                                        | 5.263 |
| Bulletin_board_system                                           | 5.172 |
| Internet_slang                                                  | 5     |
| news_anchor                                                     | 4.651 |
| Document_Object_Model                                           | 4.444 |
| Staff_writer                                                    | 4.444 |
| web_application                                                 | 4.348 |
+-----------------------------------------------------------------+-------+
  Time taken: 2 hours, 12 minutes, 50 seconds, 381 milliseconds

sa: H |arXiv.org>
|number: 3>

sa: S |arXiv.org>
+------------------------------------------------------------------------------------+--------+
| wikipage                                                                           | coeff  |
+------------------------------------------------------------------------------------+--------+
| arXiv.org                                                                          | 100    |
| citation_impact                                                                    | 40     |
| serials_crisis                                                                     | 40     |
| NEC_Research_Institute                                                             | 40     |
| postprint                                                                          | 40     |
| institutional_repository                                                           | 40     |
| OAIster                                                                            | 40     |
| SHERPA_(organisation)                                                              | 40     |
| Category:Electronic_publishing                                               | 40     |
| Paul_Ginsparg                                                                      | 33.333 |
| preprint                                                                           | 27.273 |
| self-archiving                                                                     | 25     |
| Category:Academic_publishing                                                 | 23.077 |
| Methodological_naturalism                                                          | 20     |
| Presocratics                                                                       | 20     |
| Cryptology_ePrint_Archive                                                          | 20     |
| Open_publishing                                                                    | 20     |
| Hubble_diagram                                                                     | 20     |
| GZK_paradox                                                                        | 20     |
| List_of_unsolved_problems_in_physics                                               | 20     |
| Print_on_demand                                                                    | 20     |
| TeV                                                                                | 20     |
| Boundary_condition                                                                 | 20     |
| Black_body_radiation                                                               | 20     |
| Subscriptions                                                                      | 20     |
| R.P._Feynman                                                                       | 20     |
| Citeseer                                                                           | 20     |
| Citation_index                                                                     | 20     |
| File:Solvay_conference_1927.jpg                                              | 20     |
| File:Senenmut-Grab.JPG                                                       | 20     |
| bioacoustics                                                                       | 20     |
| pattern_formation                                                                  | 20     |
| University_Physics                                                                 | 20     |
| File:Archimedes-screw_one-screw-threads_with-ball_3D-view_animated_small.gif | 20     |
| Bryn_Mawr_Classical_Review                                                         | 20     |
| File:Acceleration_components.JPG                                             | 20     |
| Delayed_open-access_journal                                                        | 20     |
| Astronomical_ceiling_of_Senemut_Tomb                                               | 20     |
| quantitative_finance                                                               | 20     |
| File:CMS_Higgs-event.jpg                                                     | 20     |
| James_Madison_Award                                                                | 20     |
| Public_Knowledge_Project                                                           | 20     |
| the_central_science                                                                | 20     |
| Difference_between_chemistry_and_physics                                           | 20     |
| theses                                                                             | 20     |
| Optical_physics                                                                    | 20     |
| analytic_solution                                                                  | 20     |
| weakly_interacting_massive_particle                                                | 20     |
| superclusters                                                                      | 20     |
| Open_Humanities_Press                                                              | 20     |
| iBooks_Author                                                                      | 20     |
| econophysics                                                                       | 20     |
| ultrasonics                                                                        | 20     |
| OAI-PMH                                                                            | 20     |
| Journal_of_Library_Administration                                                  | 20     |
| File:Einstein1921_by_F_Schmutzer_2.jpg                                       | 20     |
| Ancient_Greek_poetry                                                               | 20     |
| Publish_or_perish                                                                  | 20     |
| higher_dimension                                                                   | 20     |
| IBEX                                                                               | 20     |
+------------------------------------------------------------------------------------+--------+
  Time taken: 41 minutes, 16 seconds, 954 milliseconds

sa: H |Theory_of_everything>
|number: 13>

sa: S |Theory_of_everything>
+-------------------------------------------------------------+--------+
| wikipage                                                    | coeff  |
+-------------------------------------------------------------+--------+
| Theory_of_everything                                        | 100.0  |
| Ultimate_fate_of_the_universe                               | 21.429 |
| Planck_scale                                                | 17.391 |
| Big_Rip                                                     | 15.385 |
| Eddington_limit                                             | 15.385 |
| Supersymmetry                                               | 15.385 |
| Arrow_of_time                                               | 15.385 |
| Dimensionless_physical_constant                             | 15.385 |
| Plumian_Professor_of_Astronomy_and_Experimental_Philosophy  | 15.385 |
| Sir_Roger_Penrose                                           | 15.385 |
| Bakerian_Lecture                                            | 15.385 |
| grand_unified_theory                                        | 15.385 |
| Big_Freeze                                                  | 15.385 |
| Topological_order                                           | 15.385 |
| Baryon_asymmetry                                            | 15.385 |
| Neutrino_mass                                               | 15.385 |
| Unified_field_theory                                        | 15.385 |
| Membrane_(M-theory)                                         | 15.385 |
| Static_forces_and_virtual-particle_exchange                 | 15.385 |
| Generation_(particle_physics)                               | 15.385 |
| Stellar_nucleosynthesis                                     | 14.286 |
| Compact_Muon_Solenoid                                       | 13.333 |
| Cosmic_inflation                                            | 13.333 |
| neutrino_oscillation                                        | 12.5   |
| Hermann_Bondi                                               | 11.765 |
| Category:Presidents_of_the_Royal_Astronomical_Society | 11.111 |
| YangMills_theory                                            | 11.111 |
| anthropic_principle                                         | 10.345 |
| Dark_matter                                                 | 9.524  |
| James_Watson                                                | 9.091  |
| CP_violation                                                | 8      |
| Anisotropy                                                  | 7.692  |
| Antiparticle                                                | 7.692  |
| Acts                                                        | 7.692  |
| Centripetal_force                                           | 7.692  |
| Graviton                                                    | 7.692  |
| Gluon                                                       | 7.692  |
| Hydrogen_atom                                               | 7.692  |
| Liquid_crystal                                              | 7.692  |
| Main_sequence                                               | 7.692  |
| Morphogenesis                                               | 7.692  |
| Panspermia                                                  | 7.692  |
| Proton_decay                                                | 7.692  |
| Qubit                                                       | 7.692  |
| Tokamak                                                     | 7.692  |
| Quintessence_(physics)                                      | 7.692  |
| Sonoluminescence                                            | 7.692  |
| Gravitational_lens                                          | 7.692  |
| High-temperature_superconductor                             | 7.692  |
| Fact                                                        | 7.692  |
| Timeline_of_gravitational_physics_and_relativity            | 7.692  |
| Timeline_of_stellar_astronomy                               | 7.692  |
| List_of_astronomers                                         | 7.692  |
| Astrophysicist                                              | 7.692  |
| Triple-alpha_process                                        | 7.692  |
| Religious                                                   | 7.692  |
| Quark_matter                                                | 7.692  |
| Gravity_assist                                              | 7.692  |
| Theory_of_Everything                                        | 7.692  |
| Color_confinement                                           | 7.692  |
+-------------------------------------------------------------+--------+
  Time taken: 1 hour, 8 minutes, 42 seconds, 470 milliseconds

OK. Some cool results in there. Actually, I think they are amazing! I think I have done enough examples of this now.

Though maybe I should note, that the bigger the number H returns, the better the result. Which presumably means if we used even more of wikipedia, we would get even better results! And brings to mind the question, how many wikipages do we need to know more than the average human?

BTW, I don't think I have linked to this yet, the full wikipedia link structure in sw notation. bzip2 down to about 2 GB I seem to recall.

Sunday, 14 June 2015

non-linear resonance

The idea is this. The brain stores a vast collection of patterns, and in the BKO scheme we store these patterns as superpositions. But then given an input pattern the brain can resonate if it detects a specific pattern. In the extreme case of a "non-linear resonance" the pattern has to be very precise to trigger the resonance. Though we can have weak resonance too, in which case even a loosely similar pattern can trigger a resonance.

Anyway, words are boring, lets give an example in the console:

-- load up this data:
sa: dump
----------------------------------------
|context> => |context: non-linear resonance>

non-linear-resonance |*> #=> 1000 drop-below[0.99] simm(""|_self>, ""|g>) |g>
weak-resonance |*> #=> 200 drop-below[0.6] simm(""|_self>, ""|g>) |g>

 |g> => |a> + |b> + |c> + |d>

 |f1> => |a>
 |f2> => |a> + |b>
 |f3> => |a> + |b> + |c>
 |f4> => |a> + |b> + |c> + 0.9|d>
 |f5> => 0.95|a> + |b> + |c> + |d>
 |f6> => |a> + |b> + |c> + |d>
 |f7> => |a> + |b> + |c> + |d> + |e>
 |list> => |f1> + |f2> + |f3> + |f4> + |f5> + |f6> + |f7>
----------------------------------------

where g is our incoming pattern, and f_k are our stored patterns.
And we have defined our weak-resonance and non-linear-resonance operators (where we need at least 60% similarity for the weak-resonance, and 99% similarity for the non-linear-resonance, and the amplitude of the non-linear-resonance is much higher).

And then, let's look at the resulting table:

sa: table[pattern,weak-resonance,non-linear-resonance] "" |list>
+---------+----------------+----------------------+
| pattern | weak-resonance | non-linear-resonance |
+---------+----------------+----------------------+
| f1      |                |                      |
| f2      |                |                      |
| f3      | 150 g          |                      |
| f4      | 196.15 g       |                      |
| f5      | 198.10 g       | 990.51 g             |
| f6      | 200 g          | 1000 g               |
| f7      | 160 g          |                      |
+---------+----------------+----------------------+

I'm not sure how that table looks to others, but to me it is very, very beautiful. It is showing hints of stuff I have been thinking about for a long, long time now.

Anyway, some comments:
1) the patterns can of course be anything. eg, a very specific sequence of sounds could non-linear resonate with the "frog" neuron. The specific sequence of letters for "beach" might weak-resonate with the "sun", "sand", "waves" and "beach-goers" neurons.
2) the above of course has a lot of similarity with the similar[op] operator.
3) I suspect something very similar to that table happens in the hippocampus. But that is for later!

Update: the above is essentially a 1-D version of the landscape function:
L(f,x) = simm(f,g(x))
Though here we have f and g swapped, so it is:
L(g,x) = simm(g,f(x))

Update: on reflection, I don't think the "beach" thing is a good example of a weak resonance. I'll have to see if I can think of a better example.

Update: we can also have a "square resonance":

square-resonance |*> #=> 200 clean drop-below[0.6] simm(""|_self>, ""|g>) |g>

Now, look at the table:

sa: table[pattern,weak-resonance,square-resonance,non-linear-resonance] "" |list>
+---------+----------------+------------------+----------------------+
| pattern | weak-resonance | square-resonance | non-linear-resonance |
+---------+----------------+------------------+----------------------+
| f1      |                |                  |                      |
| f2      |                |                  |                      |
| f3      | 150 g          | 200 g            |                      |
| f4      | 196.15 g       | 200 g            |                      |
| f5      | 198.10 g       | 200 g            | 990.51 g             |
| f6      | 200 g          | 200 g            | 1000 g               |
| f7      | 160 g          | 200 g            |                      |
+---------+----------------+------------------+----------------------+

Anyway, that should be clear enough. And of course, we can make (almost) arbitrary permutations of the shape of the resonance.

Also, we could call "weak resonance" a "fuzzy resonance". Ie, the pattern only has to be fuzzy close, yet it still resonates.

Wednesday, 10 June 2015

wikipedia fragment to frequency list

Now, we had a lot of success in mapping wikipedia to its link structure, and finding semantic similarities. That uses "similar[op]". This time, we map wikipedia (well, a small piece of it) to frequency lists, and see how "find-topic[op]" works.

Here is my code to convert wikipedia xml to single word frequencies.

$ ./play_with_wikipedia_freq_list.py data/fragments/0.xml

10,604 minutes later, we have this.

Now, some examples:

sa: T |*> #=> table[wikipage,coeff] select[1,300] 100 intn-find-topic[words-1] |_self>
sa: T |river torrens>
+-----------------------------+----------+
| wikipage                    | coeff    |
+-----------------------------+----------+
| Murray_River                | 2210.857 |
| The_Bronx                   | 1450.875 |
| South_Australia             | 1243.607 |
| Adelaide                    | 1130.552 |
| Prince_Edward_Island        | 746.164  |
| Gypsum                      | 710.633  |
| Port_Adelaide_Football_Club | 678.331  |
| June_14                     | 552.714  |
| Trade                       | 552.714  |
| October_25                  | 497.443  |
| Dinosaur                    | 226.11   |
+-----------------------------+----------+
  Time taken: 27 minutes, 19 seconds, 709 milliseconds

sa: T |adelaide university>
+-------------------------------------------+--------+
| wikipage                                  | coeff  |
+-------------------------------------------+--------+
| Macquarie_University                      | 90.953 |
| Immanuel_Kant                             | 74.416 |
| Robert_Menzies                            | 71.625 |
| David_Hume                                | 71.625 |
| Theology                                  | 68.214 |
| Adelaide                                  | 65.114 |
| Austin,_Texas                             | 65.114 |
| Yoga                                      | 65.114 |
| Gregor_Mendel                             | 63.951 |
| Mike_Moore_(New_Zealand_politician)       | 63.951 |
| New_South_Wales                           | 61.393 |
| Perth                                     | 61.393 |
| Aristophanes                              | 61.393 |
| Bob_Hawke                                 | 61.393 |
| Culture_of_Canada                         | 61.393 |
| John_Milton                               | 61.393 |
| West_Bengal                               | 61.393 |
| Brewing                                   | 61.393 |
| Fyodor_Dostoyevsky                        | 61.393 |
| Hunter_College                            | 61.393 |
| John_Stuart_Mill                          | 61.393 |
...
  Time taken: 27 minutes, 8 seconds, 965 milliseconds

sa: T |apple juice>
+---------------------------------------+---------+
| wikipage                              | coeff   |
+---------------------------------------+---------+
| Vinegar                               | 402.189 |
| McIntosh_(apple)                      | 367.835 |
| Fruit                                 | 361.97  |
| Cuisine_of_the_United_States          | 329.064 |
| Drink                                 | 321.751 |
| Vietnamese_cuisine                    | 321.751 |
| List_of_cocktails                     | 321.751 |
| Hungarian_language                    | 294.268 |
| Arsenic                               | 289.576 |
| Chardonnay                            | 289.576 |
| Pear                                  | 271.478 |
| Swedish_cuisine                       | 271.478 |
| Cuisine_of_the_Southern_United_States | 271.478 |
| Food_preservation                     | 241.314 |
| Turkish_cuisine                       | 241.314 |
| Mead                                  | 241.314 |
| French_cuisine                        | 217.182 |
| Mojito                                | 206.84  |
...
  Time taken: 27 minutes, 25 seconds, 378 milliseconds

T |russia china japan australia new zealand egypt>
+-----------------------------------------------------------+--------+
| wikipage                                                  | coeff  |
+-----------------------------------------------------------+--------+
| Tram                                                      | 77.349 |
| List_of_national_capitals_and_largest_cities_by_country   | 75.967 |
| General_Motors                                            | 74.448 |
| 2000s_(decade)                                            | 70.903 |
| History_of_painting                                       | 70.903 |
| 2010s                                                     | 67.68  |
| British_Empire                                            | 67.68  |
| Foreign_relations_of_China                                | 67.68  |
| Self-determination                                        | 67.68  |
| Foreign_relations_of_Taiwan                               | 67.68  |
| Toyota                                                    | 67.68  |
| Dwight_D._Eisenhower                                      | 65.991 |
| Psychology                                                | 65.991 |
| 2008                                                      | 63.813 |
| List_of_former_sovereign_states                           | 63.813 |
| Foreign_relations_of_Indonesia                            | 63.813 |
| Foreign_relations_of_Japan                                | 63.813 |
| Foreign_relations_of_North_Korea                          | 63.813 |
| Peninsula                                                 | 63.813 |
| Pandemic                                                  | 63.813 |
| United_Nations_Security_Council                           | 63.813 |
| 1996                                                      | 63.813 |
| List_of_mountains                                         | 63.813 |
...
  Time taken: 1 hour, 39 minutes, 16 seconds, 194 milliseconds

Anyway, largely rubbish results! Doesn't mean find-topic[op] is completely useless, eg, seems to work well with finding name type (male, female, last), but just doesn't work that well on wikipedia.

even more inverse simm results

OK. I have been throwing example after example at my similar[inverse-links-to] code, and almost all of them gave great results. Some even with only 6 incoming links. The worst I think was "kangaroo" which only had 4 incoming links.

Anyway, too big to paste in here, so here is the text file.
And here are my operators:

H |*> #=> how-many inverse-links-to |_self>
S |*> #=> table[wikipage,coeff] select[1,200] 100 self-similar[inverse-links-to] |_self>

I guess the point is made. Not sure how much more I want to do with it, at least for now.

Update: here is a thought. You know the game, one person says a word/concept then the other person has to answer with the first thing that crosses their mind. We could do that (though the current code would be rather slow!). Something like:

what-do-you-think-of-when-you-hear |*> #=> select[1,1] clean similar[inverse-links-to] |_self>

Noting that we don't need the multiply by 100, and we use similar[op] not self-similar[op].
So, some examples in the console:

sa: what-do-you-think-of-when-you-hear |*> #=> select[1,1] clean similar[inverse-links-to] |_self>
sa: what-do-you-think-of-when-you-hear |WP: Pen>
|WP: Marker_pen>

sa: what-do-you-think-of-when-you-hear |WP: Fox>
|WP: Hyena>

Cool! Though we can add a random element in there too:

-- version 1:
sa: what-do-you-think-of-when-you-hear |*> #=> clean select[1,1] shuffle select[1,5] similar[inverse-links-to] |_self>

-- version 2:
sa: what-do-you-think-of-when-you-hear |*> #=> clean pick-elt select[1,5] similar[inverse-links-to] |_self>

sa: what-do-you-think-of-when-you-hear |WP: Fox>
|WP: Weasel>

sa: what-do-you-think-of-when-you-hear |WP: Diamond>
|WP: Selenium>

-- version 3:
sa: what-do-you-think-of-when-you-hear |*> #=> clean pick-elt select[1,5] similar[inverse-links-to] merge-labels(|WP: > + |_self>)

sa: what-do-you-think-of-when-you-hear |Physics>
|WP: Chemistry>

-- version 4:
sa: what-do-you-think-of-when-you-hear |*> #=> extract-value clean pick-elt select[1,5] similar[inverse-links-to] merge-labels(|WP: > + |_self>)

sa: what-do-you-think-of-when-you-hear |Student>
|Uniform>

sa: what-do-you-think-of-when-you-hear |Quentin_Tarantino>
|Brian_De_Palma>

sa: what-do-you-think-of-when-you-hear |House>
|Amateur_astronomy>

sa: what-do-you-think-of-when-you-hear |Elephant>
|Ant>

sa: what-do-you-think-of-when-you-hear |Elephant>
|Donkey>

sa: what-do-you-think-of-when-you-hear |Rock_and_roll>
|Jimmie_Rodgers_(country_singer)>

sa: what-do-you-think-of-when-you-hear |Rock_and_roll>
|American_Bandstand>

sa: what-do-you-think-of-when-you-hear |Paris>
|Moscow>

sa: what-do-you-think-of-when-you-hear |Paris>
|Rome>

sa: what-do-you-think-of-when-you-hear |Cryptography>
|Data_Encryption_Standard>

sa: what-do-you-think-of-when-you-hear |Cryptography>
|one-time_pad>

sa: what-do-you-think-of-when-you-hear |Drake_equation>
|Kepler_(spacecraft)>

sa: what-do-you-think-of-when-you-hear |Drake_equation>
|Frank_Drake>

sa: what-do-you-think-of-when-you-hear |Slashdot>
|Bruce_Schneier>

sa: what-do-you-think-of-when-you-hear |Slashdot>
|Adrian_Carmack>

sa: what-do-you-think-of-when-you-hear |Slashdot>
|Neal_Stephenson>

sa: what-do-you-think-of-when-you-hear |Scotland>
|Edinburgh>

sa: what-do-you-think-of-when-you-hear |Scotland>
|Spain>

sa: what-do-you-think-of-when-you-hear |Food>
|Wood>

sa: what-do-you-think-of-when-you-hear |Food>
|Cement>

sa: what-do-you-think-of-when-you-hear |Food>
|glycemic_index>

sa: what-do-you-think-of-when-you-hear |The_Terminator>
|T-1000>

sa: what-do-you-think-of-when-you-hear |The_Terminator>
|Aliens_(film)>

sa: what-do-you-think-of-when-you-hear |Algebra>
|Axiom>

sa: what-do-you-think-of-when-you-hear |Algebra>
|Constructivism_(mathematics)>

sa: what-do-you-think-of-when-you-hear |Functional_analysis>
|Measure_theory>

sa: what-do-you-think-of-when-you-hear |Functional_analysis>
|spanning_tree>

OK. Some weird ones in there, but some fun ones too. And if we use more of wikipedia then we should get even better results.

Update: we can now also do this:

sa: table[wikipage] exp[what-do-you-think-of-when-you-hear,10] |The_Terminator>
+-----------------------------+
| wikipage                    |
+-----------------------------+
| The_Terminator              |
| Sylvester_Stallone          |
| Eddie_Murphy                |
| Kevin_Costner               |
| Michelle_Pfeiffer           |
| Sharon_Stone                |
| Frasier                     |
| sitcom                      |
| Cheers                      |
| CBS_Television_Distribution |
+-----------------------------+
  Time taken: 1 hour, 18 minutes, 41 seconds, 294 milliseconds

Update: I applied similar[inverse-links-to] to a 300,000 sample of wikipedia (as opposed to 30,000 used here) and the results were even better!

Friday, 5 June 2015

more inverse-simm results

OK. In the last post we discovered similar[inverse-links-to] seems to give some good results. Let's expand our test set, and try it on a few more examples.

-- define our test set:
|list> => |WP: Erwin_Schrdinger> + |WP: Richard_Feynman> + |WP: Cat> + |WP: Dog> + |WP: Apple> + |WP: Adelaide> + |WP: University_of_Adelaide> + |WP: Particle_physics> + |WP: Lisp_(programming_language)> + |WP: APL_(programming_language)> + |WP: SQL> + |WP: SPARQL> + |WP: The_Doors> + |WP: Rugby> + |WP: Australian_Football_League>

-- how many incoming links?
sa: how-many-in-links |*> #=> how-many inverse-links-to |_self>
sa: table[wikipage,how-many-in-links] "" |list>
+-----------------------------+-------------------+
| wikipage                    | how-many-in-links |
+-----------------------------+-------------------+
| Erwin_Schrdinger            | 53                |
| Richard_Feynman             | 79                |
| Cat                         | 14                |
| Dog                         | 24                |
| Apple                       | 21                |
| Adelaide                    | 81                |
| University_of_Adelaide      | 10                |
| Particle_physics            | 17                |
| Lisp_(programming_language) | 64                |
| APL_(programming_language)  | 24                |
| SQL                         | 41                |
| SPARQL                      | 6                 |
| The_Doors                   | 41                |
| Rugby                       | 0                 |
| Australian_Football_League  | 30                |
+-----------------------------+-------------------+

-- create the data:
sa: inverse-simm-op |WP: *> #=> select[1,500] 100 self-similar[inverse-links-to] |_self>
sa: |null> => map[inverse-simm-op,inverse-simm] "" |list>

-- define an operator to explore the resulting data:
sa: T |*> #=> table[wikipage,coeff] select[1,20] inverse-simm |_self>

-- now our examples:
sa: T |WP: Erwin_Schrdinger>
+---------------------------+--------+
| wikipage                  | coeff  |
+---------------------------+--------+
| Erwin_Schrdinger          | 100.0  |
| Max_Born                  | 32.075 |
| Niels_Bohr                | 31.646 |
| Schrdinger_equation       | 30.189 |
| Paul_Dirac                | 29.31  |
| Wolfgang_Pauli            | 28.302 |
| Werner_Heisenberg         | 28.049 |
| Max_Planck                | 26.984 |
| uncertainty_principle     | 26.415 |
| photoelectric_effect      | 24.528 |
| Roger_Penrose             | 22.642 |
| Bohr_model                | 20.755 |
| Arnold_Sommerfeld         | 20.755 |
| Louis_de_Broglie          | 20.755 |
| wave_function             | 20.755 |
| Copenhagen_interpretation | 18.868 |
| quantum_state             | 18.868 |
| Ernest_Rutherford         | 17.742 |
| Maxwell's_equations       | 17.241 |
| Pauli_exclusion_principle | 16.981 |
+---------------------------+--------+

sa: T |WP: Richard_Feynman>
+------------------------------------+--------+
| wikipage                           | coeff  |
+------------------------------------+--------+
| Richard_Feynman                    | 100.0  |
| Werner_Heisenberg                  | 24.39  |
| special_relativity                 | 20.792 |
| Niels_Bohr                         | 20.253 |
| Paul_Dirac                         | 20.253 |
| particle_physics                   | 20.225 |
| classical_mechanics                | 20.0   |
| fermion                            | 18.987 |
| spin_(physics)                     | 18.987 |
| Standard_Model                     | 17.722 |
| Schrdinger_equation                | 17.722 |
| quantum_field_theory               | 17.722 |
| electromagnetism                   | 17.241 |
| Erwin_Schrdinger                   | 16.456 |
| Pauli_exclusion_principle          | 16.456 |
| quark                              | 16.456 |
| Stephen_Hawking                    | 16.456 |
| quantum_electrodynamics            | 16.456 |
| Julian_Schwinger                   | 16.456 |
| Category:Concepts_in_physics | 16.279 |
+------------------------------------+--------+

sa: T |WP: Cat>
+----------+--------+
| wikipage | coeff  |
+----------+--------+
| Cat      | 100.0  |
| Horse    | 31.25  |
| Donkey   | 28.571 |
| Goat     | 28.571 |
| Elephant | 21.429 |
| Pig      | 21.429 |
| Rabbit   | 21.429 |
| Deer     | 21.429 |
| Mule     | 21.429 |
| Goose    | 21.429 |
| Dog      | 20.833 |
| Sheep    | 20     |
| Lion     | 18.75  |
| Almond   | 14.286 |
| Alder    | 14.286 |
| Ant      | 14.286 |
| Bear     | 14.286 |
| Bee      | 14.286 |
| Fox      | 14.286 |
| Lizard   | 14.286 |
+----------+--------+

sa: T |WP: Dog>
+------------------+--------+
| wikipage         | coeff  |
+------------------+--------+
| Dog              | 100.0  |
| Horse            | 29.167 |
| coyote           | 22.222 |
| Gray_wolf        | 21.429 |
| Arctic_fox       | 20.833 |
| Cat              | 20.833 |
| Canidae          | 20.833 |
| Elephant         | 20.833 |
| bobcat           | 20.833 |
| red_fox          | 20.833 |
| Donkey           | 20.833 |
| red_wolf         | 20.833 |
| Rabbit           | 16.667 |
| African_wild_dog | 16.667 |
| gray_wolf        | 16.667 |
| Domestic_sheep   | 16.667 |
| dingo            | 16.667 |
| Goat             | 16.667 |
| Cattle           | 16.667 |
| otter            | 16.667 |
+------------------+--------+

sa: T |WP: Apple>
+----------------+--------+
| wikipage       | coeff  |
+----------------+--------+
| Apple          | 100.0  |
| Strawberry     | 33.333 |
| Cranberry      | 23.81  |
| Grape          | 23.81  |
| Tomato         | 23.81  |
| Cherry         | 23.81  |
| Kiwifruit      | 19.048 |
| Blackberry     | 19.048 |
| plum           | 19.048 |
| Lime_(fruit)   | 19.048 |
| Pineapple      | 19.048 |
| Lemon          | 19.048 |
| Blueberry      | 19.048 |
| peach          | 17.5   |
| pear           | 17.391 |
| Orange_(fruit) | 16.667 |
| Pear           | 14.286 |
| Banana         | 14.286 |
| Peach          | 14.286 |
| Squash_(plant) | 14.286 |
+----------------+--------+

sa: T |WP: Adelaide>
+-------------------------------------+--------+
| wikipage                            | coeff  |
+-------------------------------------+--------+
| Adelaide                            | 100.0  |
| Brisbane                            | 37.079 |
| Perth                               | 32.099 |
| South_Australia                     | 26.042 |
| Melbourne                           | 18.687 |
| Canberra                            | 15.044 |
| The_Age                             | 14.634 |
| Sydney                              | 14.583 |
| Australian_Broadcasting_Corporation | 13.445 |
| Australian_rules_football           | 12.346 |
| Auckland                            | 12.195 |
| Australian_Labor_Party              | 11.111 |
| Darwin,_Northern_Territory          | 11.111 |
| Triple_J                            | 11.111 |
| Seven_Network                       | 11.111 |
| States_and_territories_of_Australia | 11.111 |
| Karachi                             | 10.989 |
| Australian_Football_League          | 9.877  |
| The_Australian                      | 9.877  |
| Western_Australia                   | 8.911  |
+-------------------------------------+--------+

sa: T |WP: University_of_Adelaide>
+----------------------------------------+-------+
| wikipage                               | coeff |
+----------------------------------------+-------+
| University_of_Adelaide                 | 100.0 |
| Port_Adelaide_Football_Club            | 20    |
| Adelaide_Oval                          | 20    |
| Adelaide_city_centre                   | 20    |
| University_of_South_Australia          | 20    |
| Port_Adelaide                          | 20    |
| Australian_Grand_Prix                  | 20    |
| State_Bank_of_South_Australia          | 20    |
| Mount_Lofty                            | 20    |
| South_Eastern_Freeway                  | 20    |
| Southern_Expressway_(Australia)        | 20    |
| Government_of_South_Australia          | 20    |
| Flinders_University_of_South_Australia | 20    |
| South_Australian_Museum                | 20    |
| Adelaide_Crows                         | 20    |
| Glenelg,_South_Australia               | 20    |
| Australian_Central_Standard_Time       | 20    |
| Australian_Central_Daylight_Time       | 20    |
| Fleurieu_Peninsula                     | 20    |
| River_Torrens                          | 20    |
+----------------------------------------+-------+

sa: T |WP: Particle_physics>
+----------------------------------------+--------+
| wikipage                               | coeff  |
+----------------------------------------+--------+
| Particle_physics                       | 100    |
| Optics                                 | 20     |
| Cosmology                              | 20     |
| Acoustics                              | 17.647 |
| Condensed_matter_physics               | 17.647 |
| Fluid_dynamics                         | 17.647 |
| Thermodynamics                         | 17.647 |
| kinematics                             | 17.647 |
| atomic,_molecular,_and_optical_physics | 17.647 |
| cosmic_inflation                       | 17.647 |
| Fluid_statics                          | 17.647 |
| Lambda-CDM_model                       | 17.647 |
| Biophysics                             | 17.647 |
| Category:Physics                 | 17.647 |
| Lev_Landau                             | 15.789 |
| Nuclear_physics                        | 14.286 |
| virtual_particle                       | 14.286 |
| quantum_chemistry                      | 13.636 |
| American_Physical_Society              | 13.333 |
| Elementary_particle                    | 13.043 |
+----------------------------------------+--------+

sa: T |WP: Lisp_(programming_language)>
+------------------------------------+--------+
| wikipage                           | coeff  |
+------------------------------------+--------+
| Lisp_(programming_language)        | 100    |
| Smalltalk                          | 28.125 |
| Pascal_(programming_language)      | 24.675 |
| Fortran                            | 23.881 |
| Scheme_(programming_language)      | 23.438 |
| Ruby_(programming_language)        | 23.188 |
| object-oriented_programming        | 21.875 |
| PHP                                | 20.779 |
| Prolog                             | 20.312 |
| Haskell_(programming_language)     | 20.312 |
| Ada_(programming_language)         | 18.75  |
| APL_(programming_language)         | 18.75  |
| BASIC                              | 18.75  |
| COBOL                              | 18.75  |
| functional_programming             | 18.75  |
| John_McCarthy_(computer_scientist) | 18.75  |
| C_Sharp_(programming_language)     | 18.75  |
| programming_language               | 17.647 |
| JavaScript                         | 17.526 |
| compiler                           | 17     |
+------------------------------------+--------+

sa: T |WP: APL_(programming_language)>
+------------------------------------+--------+
| wikipage                           | coeff  |
+------------------------------------+--------+
| APL_(programming_language)         | 100.0  |
| Kenneth_E._Iverson                 | 33.333 |
| John_McCarthy_(computer_scientist) | 25.0   |
| John_Backus                        | 25.0   |
| Prolog                             | 23.333 |
| Alan_Kay                           | 20.833 |
| AWK                                | 20.833 |
| Grace_Hopper                       | 20.833 |
| ML_(programming_language)          | 20.833 |
| Niklaus_Wirth                      | 20.833 |
| logic_programming                  | 20.833 |
| J_(programming_language)           | 20.833 |
| bytecode                           | 19.231 |
| Lisp_(programming_language)        | 18.75  |
| programmer                         | 17.857 |
| Objective-C                        | 17.241 |
| BASIC                              | 17.188 |
| Mathematica                        | 17.143 |
| SQL                                | 17.073 |
| ALGOL                              | 16.667 |
+------------------------------------+--------+

sa: T |WP: SQL>
+----------------------------------------+--------+
| wikipage                               | coeff  |
+----------------------------------------+--------+
| SQL                                    | 100.0  |
| Haskell_(programming_language)         | 22.727 |
| PHP                                    | 19.481 |
| APL_(programming_language)             | 17.073 |
| Category:Cross-platform_software | 17.073 |
| Visual_Basic                           | 17.073 |
| relational_database                    | 17.073 |
| COBOL                                  | 16.327 |
| PostgreSQL                             | 14.634 |
| R_(programming_language)               | 14.634 |
| Run_time_(program_lifecycle_phase)     | 14.634 |
| Ruby_(programming_language)            | 14.493 |
| JavaScript                             | 13.402 |
| C_Sharp_(programming_language)         | 13.208 |
| database                               | 12.644 |
| Lisp_(programming_language)            | 12.5   |
| Common_Lisp                            | 12.195 |
| Graphical_user_interface               | 12.195 |
| MySQL                                  | 12.195 |
| Mathematica                            | 12.195 |
+----------------------------------------+--------+

sa: T |WP: SPARQL>
+--------------------------------------------------------------------------------------------+--------+
| wikipage                                                                                   | coeff  |
+--------------------------------------------------------------------------------------------+--------+
| SPARQL                                                                                     | 100.0  |
| Web_Ontology_Language                                                                      | 33.333 |
| Agris:_International_Information_System_for_the_Agricultural_Sciences_and_Technology | 33.333 |
| W3C_XML_Schema                                                                             | 33.333 |
| GRDDL                                                                                      | 33.333 |
| Conceptual_interoperability                                                                | 33.333 |
| Category:Web_services                                                                | 33.333 |
| Resource_Description_Framework                                                             | 20     |
| Analytic_geometry                                                                          | 16.667 |
| DHTML                                                                                      | 16.667 |
| Interpolation                                                                              | 16.667 |
| GNU_nano                                                                                   | 16.667 |
| Pico_(text_editor)                                                                         | 16.667 |
| Relational_database                                                                        | 16.667 |
| Sir_Charles_Lyell                                                                          | 16.667 |
| Synchronized_Multimedia_Integration_Language                                               | 16.667 |
| Semantic_network                                                                           | 16.667 |
| Backronym                                                                                  | 16.667 |
| Interoperability                                                                           | 16.667 |
| RAS_syndrome                                                                               | 16.667 |
+--------------------------------------------------------------------------------------------+--------+

sa: T |WP: The_Doors>
+---------------------------------------+--------+
| wikipage                              | coeff  |
+---------------------------------------+--------+
| The_Doors                             | 100.0  |
| Jim_Morrison                          | 31.707 |
| Ray_Manzarek                          | 21.951 |
| Jefferson_Airplane                    | 17.073 |
| The_Band                              | 14.634 |
| Bee_Gees                              | 14.634 |
| Janis_Joplin                          | 12.195 |
| Summer_of_Love                        | 12.195 |
| Timothy_Leary                         | 12.195 |
| folk_rock                             | 12.195 |
| Governor_of_American_Samoa            | 12.195 |
| Cream_(band)                          | 12.195 |
| Sgt._Pepper's_Lonely_Hearts_Club_Band | 12.195 |
| The_Byrds                             | 12.195 |
| Joan_Baez                             | 12.195 |
| Iron_Butterfly                        | 12.195 |
| Brian_Jones                           | 12.195 |
| Creedence_Clearwater_Revival          | 12.195 |
| Johnny_Winter                         | 12.195 |
| The_Yardbirds                         | 12.195 |
+---------------------------------------+--------+

sa: T |WP: Rugby>
+----------+-------+
| wikipage | coeff |
+----------+-------+
+----------+-------+

sa: T |WP: Australian_Football_League>
+---------------------------------+--------+
| wikipage                        | coeff  |
+---------------------------------+--------+
| Australian_Football_League      | 100.0  |
| West_Coast_Eagles               | 40     |
| Richmond_Football_Club          | 40     |
| Sydney_Swans                    | 36.667 |
| St_Kilda_Football_Club          | 36.667 |
| Collingwood_Football_Club       | 36.667 |
| Hawthorn_Football_Club          | 36.667 |
| Australian_rules_football       | 33.871 |
| Essendon_Football_Club          | 33.333 |
| North_Melbourne_Football_Club   | 33.333 |
| Western_Bulldogs                | 33.333 |
| Carlton_Football_Club           | 33.333 |
| Brownlow_Medal                  | 33.333 |
| Seven_Network                   | 32.353 |
| Melbourne_Cricket_Ground        | 30     |
| Australian_Bureau_of_Statistics | 30     |
| Special_Broadcasting_Service    | 30     |
| Melbourne_Football_Club         | 30     |
| Fitzroy_Football_Club           | 30     |
| 2001_AFL_season                 | 30     |
+---------------------------------+--------+

Wow! That works unbelievably well. I don't know exactly why, but hey. And why similar[inverse-links-to] works better than similar[links-to], I don't know either! A question that comes to mind is, if we use a larger subset of wikipedia, will these results get better or worse? I suspect better, but not sure.

BTW, this is from:

100*30000/15559125
= 0.19 %

of the total English wikipedia.

And the resulting sw file is here.

what do we know about bananas?

Now we have a subset of wikipedia in sw format, we need to look around, see what we have.

There are 4 things that might be interesting to ask (with this data-set):
links-to |wikipage>
inverse-links-to |wikipage>
similar[links-to] |wikipage>
similar[inverse-links-to] |wikipage>

So, let's start with bananas:

sa: table[wikipage,coeff] select[1,50] coeff-sort links-to |WP: Banana>
+-------------------------------------------------+-------+
| wikipage                                        | coeff |
+-------------------------------------------------+-------+
| Musa_balbisiana                                 | 5     |
| Southeast_Asia                                  | 5     |
| Musa_acuminata                                  | 4     |
| Kerala                                          | 4     |
| Musa_(genus)                                    | 3     |
| starch                                          | 3     |
| Philippines                                     | 3     |
| India                                           | 3     |
| Plantain_(true)                                 | 3     |
| Chiquita_Brands_International                   | 3     |
| Malaysia                                        | 3     |
| Indonesia                                       | 3     |
| Thailand                                        | 3     |
| Central_America                                 | 3     |
| Uganda                                          | 3     |
| Panama_disease                                  | 3     |
| fruit                                           | 2     |
| herbaceous                                      | 2     |
| Papua_New_Guinea                                | 2     |
| fiber                                           | 2     |
| #Cavendish                                      | 2     |
| List_of_banana_cultivars                        | 2     |
| Musa_velutina                                   | 2     |
| Fe'i_banana                                     | 2     |
| Ensete_ventricosum                              | 2     |
| Musaceae                                        | 2     |
| pseudostem                                      | 2     |
| inflorescence                                   | 2     |
| List_of_banana_cultivars#AAB_Group              | 2     |
| Synonym_(taxonomy)                              | 2     |
| Palestine                                       | 2     |
| Madagascar                                      | 2     |
| China                                           | 2     |
| Dole_Food_Company                               | 2     |
| Burundi                                         | 2     |
| Rwanda                                          | 2     |
| Hybrid_(biology)                                | 2     |
| genetic_engineering                             | 2     |
| Food_and_Agriculture_Organization               | 2     |
| Guatemala                                       | 2     |
| ICTSD                                           | 2     |
| Fyffes                                          | 2     |
| United_Fruit_Company                            | 2     |
| coffee                                          | 2     |
| International_Institute_of_Tropical_Agriculture | 2     |
| Daily_Value                                     | 2     |
| potatoes                                        | 2     |
| dopamine                                        | 2     |
| South_Asia                                      | 2     |
| Pisang_goreng                                   | 2     |
+-------------------------------------------------+-------+

sa: table[wikipage,coeff] select[1,50] coeff-sort inverse-links-to |WP: Banana>
+-----------------------+-------+
| wikipage              | coeff |
+-----------------------+-------+
| Economy_of_Costa_Rica | 1     |
| Flavor                | 1     |
| Economy_of_Honduras   | 1     |
| Economy_of_Jamaica    | 1     |
| Jericho               | 1     |
| Jell-O                | 1     |
| Economy_of_Malawi     | 1     |
| Neapolitan_ice_cream  | 1     |
| Thai_cuisine          | 1     |
| Vitamin_C             | 1     |
| Zingiberales          | 1     |
+-----------------------+-------+

sa: table[wikipage,coeff] select[1,50] 100 self-similar[links-to] |WP: Banana>
+---------------------------------------------------------------------------------+--------+
| wikipage                                                                        | coeff  |
+---------------------------------------------------------------------------------+--------+
| Banana                                                                          | 100.0  |
| Cooking_plantain                                                                | 15.096 |
| ITU_prefix                                                                      | 9.52   |
| Wikipedia:Status_of_the_porting_of_U.S._Dept_of_State_info                      | 9.088  |
| United_Nations_Industrial_Development_Organization                              | 8.376  |
| Wikipedia:Status_of_the_porting_of_the_CIA_World_Factbook                       | 8.096  |
| Abac                                                                            | 8.036  |
| International_Tropical_Timber_Agreement,_1994                                   | 7.813  |
| Deforestation                                                                   | 7.483  |
| International_Tropical_Timber_Agreement,_1983                                   | 7.366  |
| Western_imperialism_in_Asia                                                     | 6.983  |
| International_Hydrographic_Organization                                         | 6.92   |
| Foreign_relations_of_China                                                      | 6.806  |
| Lions_Clubs_International                                                       | 6.326  |
| Asia                                                                            | 6.242  |
| Passport                                                                        | 6.24   |
| Diaspora                                                                        | 5.931  |
| International_Electrotechnical_Commission                                       | 5.925  |
| Member_states_of_the_United_Nations                                             | 5.645  |
| Indian_Ocean                                                                    | 5.625  |
| History_of_Southeast_Asia                                                       | 5.62   |
| Solanaceae                                                                      | 5.581  |
| Tiger                                                                           | 5.56   |
| Kyoto_Protocol                                                                  | 5.527  |
| Curry                                                                           | 5.495  |
| Sugar                                                                           | 5.41   |
| Blood_alcohol_content                                                           | 5.389  |
| List_of_mountains                                                               | 5.388  |
| Rice                                                                            | 5.37   |
| Southeast_Asia                                                                  | 5.361  |
| Ceiba_pentandra                                                                 | 5.357  |
| Pigeon_pea                                                                      | 5.357  |
| Rat                                                                             | 5.219  |
| Abugida                                                                         | 5.116  |
| Chinatown                                                                       | 5.1    |
| Video_CD                                                                        | 5.086  |
| History_of_the_Pacific_Islands                                                  | 5.052  |
| Time_zone                                                                       | 5.037  |
| Asian_Development_Bank                                                          | 5.035  |
| Hindu                                                                           | 5.03   |
| Foreign_relations_of_Indonesia                                                  | 5.004  |
| Foreign_relations_of_Singapore                                                  | 5.002  |
| Economy_of_Burma                                                                | 4.993  |
| Spice                                                                           | 4.952  |
| Convention_on_Fishing_and_Conservation_of_the_Living_Resources_of_the_High_Seas | 4.911  |
| Foreign_relations_of_Taiwan                                                     | 4.842  |
| Tamil_language                                                                  | 4.828  |
| Risk_(game)                                                                     | 4.792  |
| Foreign_relations_of_the_Philippines                                            | 4.776  |
| Numismatics                                                                     | 4.748  |
+---------------------------------------------------------------------------------+--------+
  Time taken: 10 minutes, 7 seconds, 599 milliseconds

sa: table[wikipage,coeff] select[1,100] 100 self-similar[inverse-links-to] |WP: Banana>
+-----------------------------------------+--------+
| wikipage                                | coeff  |
+-----------------------------------------+--------+
| Banana                                  | 100.0  |
| Grape                                   | 27.273 |
| Pineapple                               | 27.273 |
| Mango                                   | 27.273 |
| Ascorbic_acid                           | 18.182 |
| Cranberry                               | 18.182 |
| Kiwifruit                               | 18.182 |
| Pear                                    | 18.182 |
| Tea                                     | 18.182 |
| Vanilla                                 | 18.182 |
| Chicken                                 | 18.182 |
| Peach                                   | 18.182 |
| apparel                                 | 18.182 |
| lime_(fruit)                            | 18.182 |
| pudding                                 | 18.182 |
| mangosteen                              | 18.182 |
| Coconut                                 | 18.182 |
| gelatin_dessert                         | 18.182 |
| Lemon                                   | 18.182 |
| galangal                                | 18.182 |
| Melon                                   | 18.182 |
| Grapefruit                              | 18.182 |
| Raspberry                               | 18.182 |
| Apricot                                 | 18.182 |
| Watermelon                              | 18.182 |
| Coffee                                  | 16.667 |
| Strawberry                              | 15.789 |
| Cherry                                  | 15.385 |
| custard                                 | 15.385 |
| Pork                                    | 15.385 |
| Sugar                                   | 14.286 |
| Apple                                   | 14.286 |
| condensed_milk                          | 14.286 |
| Garden_strawberry                       | 14.286 |
| Blackberry                              | 13.333 |
| raspberry                               | 13.333 |
| Fruit                                   | 12.5   |
| pistachio                               | 12.5   |
| Orange_(fruit)                          | 12.5   |
| tapioca                                 | 12     |
| Tomato                                  | 11.765 |
| Center_for_Economic_and_Policy_Research | 11.111 |
| Lime_(fruit)                            | 10.526 |
| turmeric                                | 10.526 |
| pineapple                               | 10.256 |
| Domestic_sheep                          | 9.524  |
| broccoli                                | 9.524  |
| Adobe                                   | 9.091  |
| Analytical_chemistry                    | 9.091  |
| Commelinales                            | 9.091  |
| Calf                                    | 9.091  |
| Celebrity                               | 9.091  |
| Cream                                   | 9.091  |
| Buddhist_cuisine                        | 9.091  |
| Celery                                  | 9.091  |
| Chocolate                               | 9.091  |
| Cola                                    | 9.091  |
| Ester                                   | 9.091  |
| Epipaleolithic                          | 9.091  |
| Food_additive                           | 9.091  |
| Glycine                                 | 9.091  |
| Gelatin                                 | 9.091  |
| Gelatin_dessert                         | 9.091  |
| James_Lind                              | 9.091  |
| Kathleen_Kenyon                         | 9.091  |
| Ceiba_pentandra                         | 9.091  |
| Mya_(unit)                              | 9.091  |
| Monosaccharide                          | 9.091  |
| Parsley                                 | 9.091  |
| Potato                                  | 9.091  |
| Pia_colada                              | 9.091  |
| Spice                                   | 9.091  |
| Soft_drink                              | 9.091  |
| Economy_of_South_Africa                 | 9.091  |
| Scurvy                                  | 9.091  |
| Tarsiidae                               | 9.091  |
| Zingiberaceae                           | 9.091  |
| Beef                                    | 9.091  |
| Cooking_plantain                        | 9.091  |
| Casimir_Funk                            | 9.091  |
| Pigeon_pea                              | 9.091  |
| Pistachio                               | 9.091  |
| Cod                                     | 9.091  |
| Tropical                                | 9.091  |
| tadpole                                 | 9.091  |
| water_buffalo                           | 9.091  |
| Phosphoric_acid                         | 9.091  |
| Tartaric_acid                           | 9.091  |
| Citric_acid                             | 9.091  |
| weak_acid                               | 9.091  |
| Lactic_acid                             | 9.091  |
| Dietary_Reference_Intake                | 9.091  |
| Canaanite_language                      | 9.091  |
| cell_adhesion_molecule                  | 9.091  |
| salivary_gland                          | 9.091  |
| bananas                                 | 9.091  |
| gourami                                 | 9.091  |
| fishing_industry                        | 9.091  |
| Molding_(process)                       | 9.091  |
| Carrot                                  | 9.091  |
+-----------------------------------------+--------+
  Time taken: 14 minutes, 27 seconds, 470 milliseconds

So, similar[inverse-links-to] works best. And quite noticeably better too! Exactly why, I'm not sure. Anyway, now we know to apply this to more examples. I'll do that in the next post.

how many wikipage links?

Just a simple one. Find the wikipages (in our processed fragment/subset of wikipedia) with the most out-going links, and the most in-coming links.

sa: load 30k--wikipedia-links.sw
sa: how-many-out-links |*> #=> how-many links-to |_self>
sa: rank-table[wikipage,how-many-out-links] select[1,100] reverse sort-by[how-many-out-links] starts-with |WP: >
+------+----------------------------------------------+--------------------+
| rank | wikipage                                     | how-many-out-links |
+------+----------------------------------------------+--------------------+
| 1    | List_of_Latin_words_with_English_derivatives | 3171               |
| 2    | List_of_Greek_words_with_English_derivatives | 3163               |
| 3    | Index_of_philosophy_articles_(AC)            | 2978               |
| 4    | List_of_chess_players                        | 1996               |
| 5    | 2000s_(decade)                               | 1861               |
| 6    | March_27                                     | 1844               |
| 7    | 1989                                         | 1789               |
| 8    | 1991                                         | 1708               |
| 9    | 1990                                         | 1673               |
| 10   | List_of_Scots                                | 1663               |
| 11   | 1979                                         | 1619               |
| 12   | 1972                                         | 1617               |
| 13   | 1945                                         | 1597               |
| 14   | 1988                                         | 1595               |
| 15   | 1973                                         | 1583               |
| 16   | 1977                                         | 1574               |
| 17   | 1992                                         | 1556               |
| 18   | 1983                                         | 1546               |
| 19   | List_of_mountains                            | 1545               |
| 20   | 1967                                         | 1538               |
| 21   | 1966                                         | 1530               |
| 22   | Russia                                       | 1516               |
| 23   | 1970                                         | 1514               |
| 24   | List_of_historical_period_drama_films        | 1499               |
| 25   | 1976                                         | 1492               |
| 26   | 1993                                         | 1492               |
| 27   | 1980                                         | 1491               |
| 28   | England                                      | 1481               |
| 29   | 1985                                         | 1476               |
| 30   | 1981                                         | 1475               |
| 31   | 1986                                         | 1470               |
| 32   | March_4                                      | 1450               |
| 33   | Poland                                       | 1446               |
| 34   | 1971                                         | 1435               |
| 35   | 1965                                         | 1422               |
| 36   | 1984                                         | 1415               |
| 37   | 1982                                         | 1408               |
| 38   | 1975                                         | 1404               |
| 39   | 1964                                         | 1402               |
| 40   | 1960s                                        | 1395               |
| 41   | History_of_painting                          | 1392               |
| 42   | 1963                                         | 1375               |
| 43   | 1944                                         | 1360               |
| 44   | Pakistan                                     | 1349               |
| 45   | 1997                                         | 1347               |
| 46   | 1987                                         | 1342               |
| 47   | 1978                                         | 1340               |
| 48   | 1968                                         | 1330               |
| 49   | 1969                                         | 1325               |
| 50   | 1946                                         | 1319               |
| 51   | 1996                                         | 1315               |
| 52   | Portugal                                     | 1315               |
| 53   | List_of_comedians                            | 1313               |
| 54   | 1962                                         | 1307               |
| 55   | 1960                                         | 1302               |
| 56   | List_of_fictional_robots_and_androids        | 1287               |
| 57   | United_Kingdom                               | 1283               |
| 58   | Greece                                       | 1276               |
| 59   | Rock_music                                   | 1275               |
| 60   | 1943                                         | 1262               |
| 61   | List_of_economists                           | 1262               |
| 62   | Academy_Award_for_Best_Production_Design     | 1246               |
| 63   | Italy                                        | 1245               |
| 64   | 1941                                         | 1228               |
| 65   | 1940                                         | 1212               |
| 66   | 1948                                         | 1211               |
| 67   | 1995                                         | 1210               |
| 68   | 1947                                         | 1210               |
| 69   | 1990s                                        | 1206               |
| 70   | 1957                                         | 1198               |
| 71   | History_of_Austria                           | 1196               |
| 72   | 1961                                         | 1189               |
| 73   | January_1                                    | 1187               |
| 74   | British_Museum                               | 1187               |
| 75   | 1942                                         | 1175               |
| 76   | 2007                                         | 1163               |
| 77   | Germany                                      | 1159               |
| 78   | 1998                                         | 1156               |
| 79   | Sculpture                                    | 1156               |
| 80   | Andalusia                                    | 1156               |
| 81   | List_of_French_people                        | 1154               |
| 82   | Missouri                                     | 1153               |
| 83   | List_of_agnostics                            | 1151               |
| 84   | World_War_II                                 | 1146               |
| 85   | List_of_marine_aquarium_fish_species         | 1144               |
| 86   | Spain                                        | 1142               |
| 87   | 1950s                                        | 1133               |
| 88   | 1958                                         | 1132               |
| 89   | Chicago                                      | 1131               |
| 90   | E_number                                     | 1130               |
| 91   | Pittsburgh                                   | 1118               |
| 92   | Lists_of_office-holders                      | 1115               |
| 93   | 1953                                         | 1113               |
| 94   | List_of_wine-producing_regions               | 1112               |
| 95   | Ukraine                                      | 1111               |
| 96   | 1959                                         | 1109               |
| 97   | Dubbing_(filmmaking)                         | 1109               |
| 98   | 1956                                         | 1104               |
| 99   | List_of_dog_breeds                           | 1103               |
| 100  | 1955                                         | 1102               |
+------+----------------------------------------------+--------------------+
  Time taken: 1 minute, 4 seconds, 532 milliseconds

sa: find-inverse[links-to]
  Time taken: 2 minutes, 23 seconds, 162 milliseconds

sa: how-many-in-links |*> #=> how-many inverse-links-to |_self>
sa: rank-table[wikipage,how-many-in-links] select[1,100] reverse sort-by[how-many-in-links] starts-with |WP: >
+------+--------------------------------+-------------------+
| rank | wikipage                       | how-many-in-links |
+------+--------------------------------+-------------------+
| 1    | United_States                  | 3027              |
| 2    | World_War_II                   | 2242              |
| 3    | Julian_calendar                | 1864              |
| 4    | Wikipedia:Persondata     | 1819              |
| 5    | Roman_numerals                 | 1773              |
| 6    | France                         | 1621              |
| 7    | United_Kingdom                 | 1617              |
| 8    | Germany                        | 1446              |
| 9    | Soviet_Union                   | 1412              |
| 10   | Italy                          | 1343              |
| 11   | Latin                          | 1277              |
| 12   | India                          | 1262              |
| 13   | China                          | 1247              |
| 14   | The_New_York_Times             | 1239              |
| 15   | Japan                          | 1210              |
| 16   | World_War_I                    | 1145              |
| 17   | Canada                         | 1130              |
| 18   | Europe                         | 1071              |
| 19   | Russia                         | 1037              |
| 20   | England                        | 1034              |
| 21   | New_York_City                  | 1017              |
| 22   | London                         | 1001              |
| 23   | United_Nations                 | 993               |
| 24   | Egypt                          | 955               |
| 25   | Rome                           | 944               |
| 26   | Australia                      | 934               |
| 27   | English_language               | 916               |
| 28   | Spain                          | 899               |
| 29   | Netherlands                    | 896               |
| 30   | Roman_Empire                   | 895               |
| 31   | Christianity                   | 860               |
| 32   | European_Union                 | 844               |
| 33   | Anno_Domini                    | 841               |
| 34   | BBC                            | 822               |
| 35   | Ottoman_Empire                 | 817               |
| 36   | Oxford_University_Press        | 798               |
| 37   | Turkey                         | 796               |
| 38   | Category:Living_people   | 793               |
| 39   | California                     | 793               |
| 40   | Greek_language                 | 777               |
| 41   | Israel                         | 775               |
| 42   | Sweden                         | 762               |
| 43   | calendar_era                   | 751               |
| 44   | New_York                       | 749               |
| 45   | Paris                          | 741               |
| 46   | Iran                           | 715               |
| 47   | Poland                         | 711               |
| 48   | The_Guardian                   | 706               |
| 49   | Byzantine_Empire               | 704               |
| 50   | Brazil                         | 704               |
| 51   | Scotland                       | 696               |
| 52   | New_York_Times                 | 664               |
| 53   | Islam                          | 650               |
| 54   | President_of_the_United_States | 650               |
| 55   | Catholic_Church                | 647               |
| 56   | French_language                | 641               |
| 57   | Middle_Ages                    | 640               |
| 58   | American_Civil_War             | 635               |
| 59   | Cambridge_University_Press     | 623               |
| 60   | Philippines                    | 623               |
| 61   | Mexico                         | 613               |
| 62   | Cold_War                       | 611               |
| 63   | Roman_Catholic_Church          | 610               |
| 64   | Denmark                        | 603               |
| 65   | Switzerland                    | 603               |
| 66   | Norway                         | 599               |
| 67   | Greece                         | 584               |
| 68   | Nazi_Germany                   | 583               |
| 69   | Ancient_Rome                   | 575               |
| 70   | Constantinople                 | 573               |
| 71   | NASA                           | 568               |
| 72   | Pakistan                       | 568               |
| 73   | Washington,_D.C.               | 566               |
| 74   | BBC_News                       | 565               |
| 75   | mathematics                    | 565               |
| 76   | Portugal                       | 553               |
| 77   | Syria                          | 548               |
| 78   | Austria                        | 546               |
| 79   | Iraq                           | 543               |
| 80   | New_Zealand                    | 540               |
| 81   | Vietnam_War                    | 538               |
| 82   | Time_(magazine)                | 538               |
| 83   | Ireland                        | 536               |
| 84   | Argentina                      | 534               |
| 85   | Buddhism                       | 524               |
| 86   | South_Africa                   | 520               |
| 87   | Ukraine                        | 516               |
| 88   | Indonesia                      | 513               |
| 89   | Africa                         | 506               |
| 90   | Ab_urbe_condita                | 504               |
| 91   | German_language                | 503               |
| 92   | Belgium                        | 490               |
| 93   | Romania                        | 487               |
| 94   | Afghanistan                    | 485               |
| 95   | Holy_Roman_Empire              | 476               |
| 96   | North_America                  | 473               |
| 97   | Hungary                        | 463               |
| 98   | Gregorian_calendar             | 462               |
| 99   | UNESCO                         | 461               |
| 100  | Massachusetts                  | 455               |
+------+--------------------------------+-------------------+
  Time taken: 6 minutes, 58 seconds, 581 milliseconds

OK. That is hopefully self explanatory. More in the next couple of posts!

towards processing all of wikipedia

I now have aspirations to process all of wikipedia, and see what we can get out of it. Certainly the best curated collection of knowledge on the planet (and structurally much more homogenous than the web at large), and being all hyperlinked makes it great for my project. We can think of wikipedia as one giant network, and my project is all about networks.

Now, the problem is wikipedia is somewhat large. 50 GB xml file (for English wikipedia), and just over 15 million pages. Prior to this I gave optimizations little thought (I was more interested in getting things working, rather than run speed), but with wikipedia, we have to at least start to think about more efficient ways to do things!

So, this was my first attempt to process wikipedia. But it was memory hungry!! I threw the largest memory EC2 instance type with 244 GB of RAM and it still wasn't enough (yeah, that's what you get when you don't think about efficiencies!). I thought this was the section of code that was problematic:

# now learn it all:
    for x in r:
      link, anchor = ket_process_anchor_text(x)
      C.add_learn("links-to",page_name_ket,link)
      C.add_learn("inverse-links-to",link,page_name_ket)

      C.add_learn("contains-anchor",page_name_ket,anchor)
      C.add_learn("inverse-contains-anchor",anchor,page_name_ket)

      C.add_learn("links-to",anchor,link)
      C.add_learn("inverse-links-to",link,anchor)

So I tweaked it to this. The learn it all section now writes it to disk as it goes, instead of loading it all into memory (and we cut down on what we are trying to learn to just "links-to"):

    result = fast_superposition()
    for x in r:
      result += ket_process_link(x)
    dest.write("links-to " + str(page_name_ket) + " => " + str(result.superposition().coeff_sort()) + "\n\n")

But that still chomped all the RAM I could throw at it. I eventually figured out the real problem was trying to load the entire 50 GB XML into memory at once, ie, this at the top of the code:

document = open(sys.argv[1],'rb')
soup = BeautifulSoup(document)

So the fix was to write some code to split that 50 GB into more manageable pieces. Hence my fragment-wikipedia-xml code.
Now, put it to use:

$ ./fragment_wikipedia_xml.py

Usage: ./fragment_wikipedia_xml.py wiki.xml [pages-per-fragment]

$ ./fragment_wikipedia_xml.py wiki.xml 30000

And after a couple of days of processing we have 517 xml files that are small enough to load into memory. Small enough that I no longer need EC2, and small enough to take less than 2 GB of RAM for BeautifulSoup to load it. Cool. So, we still can't process all of wikipedia, not yet, but we have a decent subset to play with. Now, put it to use:

$ du -h data/fragments/0.xml
676M    data/fragments/0.xml

$ grep -c "<page>" data/fragments/0.xml
30000

$ ./play_with_wikipedia__fast_write.py data/fragments/0.xml

And we now have this (the full hyperlink structure of that fragment of wikipedia, in sw format): 30k--wikipedia-links.sw

$ ./spit-out-sw-stats.sh 30k--wikipedia-links.sw
(92M, 1 op types and 30000 learn rules)
links-to: 30000 learn rules

I'll put this sw file to use in the next post.

Now, it is useful to have the inverse-links-to as well, so:

sa: find-inverse[links-to]
sa: save 30k--wikipedia-links--with-inverse.sw

Which only takes a couple of minutes.

More to come!

Update: some stats:

$ ./spit-out-sw-stats.sh 30k--wikipedia-links--with-inverse-simm.sw
(317M, 5 op types and 1326090 learn rules)
how-many-in-links: 1 learn rules
inverse-links-to: 1296087 learn rules
inverse-simm: 14 learn rules
inverse-simm-op: 1 learn rules
links-to: 29984 learn rules

Now, why are there less than 30k links-to learn rules in this second example? Simple enough. The fast_write even wrote out:
op |ket> => |>
but in the process of loading them into memory (then processing them), the |> rules are not learned/ignored.
So, there must be 30000 - 29984 = 16 learn empty-sp rules. Let's see what grep has to say:

$ grep " => |>" 30k--wikipedia-links.sw
links-to |WP: Cogency> => |>
links-to |WP: Wikipedia:Complete_list_of_language_wikis_available> => |>
links-to |WP: Floccinaucinihilipilification> => |>
links-to |WP: Wikipedia:GNE_Project_Files> => |>
links-to |WP: Wikipedia:GNE_Project_Files/Proposed_GNU_Moderation_System> => |>
links-to |WP: Wikipedia:GNE_Project_Files/GNE_Project_Design> => |>
links-to |WP: Wikipedia:The_future_of_Wikipedia> => |>
links-to |WP: Portal:Contents/Outlines> => |>
links-to |WP: Wikipedia:WikiBiblion> => |>
links-to |WP: Wikipedia:PHP_script_tech_talk> => |>
links-to |WP: Normoxic> => |>
links-to |WP: Conducted_interference> => |>
links-to |WP: Local_battery> => |>
links-to |WP: Wikipedia:Blocked_IPs> => |>

$ grep -c " => |>" 30k--wikipedia-links.sw
14

So, still 2 missing :). Not sure why. Don't really care.

Update: Let's show the full steps from wikipedia, to wikipedia link structure:

wiki.xml from here: http://en.wikipedia.org/wiki/Wikipedia:Database_download
$ bzip2 -dc enwiki-20150515-pages-articles.xml.bz2 > wiki.xml
$ ./fragment_wikipedia_xml.py wiki.xml 30000

for file in $(ls -1 data/fragments/*); do
  echo "file: $file"
  ./play_with_wikipedia__fast_write.py "$file"
done

$ cd sw-results/

for i in $(seq 0 9); do
  echo "$i"
  cat $i--30k--wikipedia-links.sw >> 300k--wikipedia-links.sw
done

$ bzip2 -v *.sw

And the resulting sw files are compressed and stored here.

BTW:

$ ./spit-out-sw-stats.sh 300k--wikipedia-links.sw
(494M, 1 op types and 300000 learn rules)
links-to: 300000 learn rules

$ ./spit-out-sw-stats.sh 300k--wikipedia-links--with-inverse.sw
(1.5G, 2 op types and 4571683 learn rules)
inverse-links-to: 4272183 learn rules
links-to: 299499 learn rules

Wednesday, 3 June 2015

new function: find-inverse[op]

Just a short one. "create inverse" was painfully slow, most of that is because of the O(n^2) for add_learn, and the long lists created by "inverse-supported-ops". So, speed up was, we specify what ops we want inverse for, side-stepping the expensive ones. If you still want full inverse, we can still run "create inverse", but now, most of the time this is sufficient:

find-inverse[op]

Here is a short example:

sa: load fred-sam-friends.sw
sa: dump
----------------------------------------
|context> => |context: friends>

friends |Fred> => |Jack> + |Harry> + |Ed> + |Mary> + |Rob> + |Patrick> + |Emma> + |Charlie>

friends |Sam> => |Charlie> + |George> + |Emma> + |Jack> + |Rober> + |Frank> + |Julie>
----------------------------------------

sa: find-inverse[friends]
sa: dump
----------------------------------------
|context> => |context: friends>

friends |Fred> => |Jack> + |Harry> + |Ed> + |Mary> + |Rob> + |Patrick> + |Emma> + |Charlie>

friends |Sam> => |Charlie> + |George> + |Emma> + |Jack> + |Rober> + |Frank> + |Julie>

inverse-friends |Jack> => |Fred> + |Sam>

inverse-friends |Harry> => |Fred>

inverse-friends |Ed> => |Fred>

inverse-friends |Mary> => |Fred>

inverse-friends |Rob> => |Fred>

inverse-friends |Patrick> => |Fred>

inverse-friends |Emma> => |Fred> + |Sam>

inverse-friends |Charlie> => |Fred> + |Sam>

inverse-friends |George> => |Sam>

inverse-friends |Rober> => |Sam>

inverse-friends |Frank> => |Sam>

inverse-friends |Julie> => |Sam>
----------------------------------------

Anyway, main point is this is much faster (and cleaner) than the old "create inverse". And of course is a brother to find-unique[op].

That's it for this post.

Saturday, 30 May 2015

announcing: command line guess file name tool.

OK. Just a quick one. A command line tool that given a string, spits out the file-names that most closely match it (using unscaled simm). Here is the code, and once again, I have made it open source.

$ ./guess.py

Usage: ./guess.py string [directory]

If the directory is not given, then use the current directory.

And a quick example:

$ ./guess.py wiki work-on-wikipedia

string to match:      wiki
directory to search:  work-on-wikipedia
====================================

50 %     wiki.xml
23.08 %  1000-wiki.xml
23.08 %  5000-wiki.xml
22.22 %  wikipedia-links.sw
13.64 %  play_with_wikipedia.py
12.5 %   trial-wikipedia-links.sw
12 %     fragment_wikipedia_xml.py
10.34 %  fast-write-wikipedia-links.sw
8.82 %   play_with_wikipedia__fast_write.py

So, that should be clear enough. And whether anything like this is already out there, I don't know. I presume so. eg, using the edit-distance algorithm as used in spell-checkers could be used in a similar way to my simm.

Heh. Perhaps I gave a bad example! In the above case, a simple:

$ ls -1 work-on-wikipedia/ | grep wiki
1000-wiki.xml
5000-wiki.xml
fast-write-wikipedia-links.sw
fragment_wikipedia_xml.py
play_with_wikipedia.py
play_with_wikipedia__fast_write.py
trial-wikipedia-links.sw
wiki.xml
wikipedia-links.sw

would have given similar results!

So here is another example of guess:

$ ./guess.py "sam fred" sw-examples/ | head

string to match:      sam fred
directory to search:  sw-examples
====================================

30.77 %  small-fred.sw
26.67 %  Freds-family.sw
26.32 %  fred-sam-friends.sw
22.22 %  shares.sw
16.67 %  saved-commands.txt

$ ./guess.py "fred sam" sw-examples/ | head

string to match:      fred sam
directory to search:  sw-examples
====================================

33.33 %  Freds-family.sw
31.58 %  fred-sam-friends.sw
25 %     frog.sw
23.08 %  small-fred.sw
20 %     friends.sw

So some things to note are:
1) "fred sam" and "sam fred" give essentially the same answer
2) I made it case insenstive
3) we don't need to use any regexp, which we would have to do if using grep in a similar manner.
4) we could possibly use this to help find the exact ket when we only partly know its name, and the sw file is too large to search manually.

Anyway, should be useful here and there.

Tuesday, 19 May 2015

comment on ket independence

Just a quick comment on the idea of ket independence.

We can say |x_i> for i in {1,2,..,n} are independent with respect to the op-sequence "op4 op3 op2 op1" if:
op4 op3 op2 op1 (|x_1> + |x_2> + ... + |x_n>)
= op4 op3 op2 op1 |x_1>
+ op4 op3 op2 op1 |x_2>
...
+ op4 op3 op2 op1 |x_n>

Not yet sure where this will be useful, but figure I should mention it.

Tuesday, 12 May 2015

average-categorize websites

Now for a more interesting example. Recall work on pattern recognition of websites. In that case, we constructed the averages manually. We manually took the average of abc {1,2,...,10}, adelaidenow {1,2,...,10} etc. Then applied simm to webpage-11. This time, we get average-categorize to find our sample classes (again, webpage-11's are filtered out). Here is the BKO:

sa: load improved-fragment-webpages.sw
sa: load label-training-data-for-website-fragments.sw
sa: average-categorize[training-hash-4B,0.7,phi,average-cat]
sa: load create-cat-ave-pat-rec-matrix.sw

Here is the result:

sa: matrix[result]
[ phi: 1 ] = [  91.69  28.73  25.76  37.77  29.45  24.33  ] [ abc 11         ]
[ phi: 2 ]   [  28.77  78.04  26.72  29.86  25.25  28.19  ] [ adelaidenow 11 ]
[ phi: 3 ]   [  25.76  26.88  79.05  28.28  26.86  23.21  ] [ slashdot 11    ]
[ phi: 4 ]   [  37.8   29.75  28.16  85.54  32.06  24.95  ] [ smh 11         ]
[ phi: 5 ]   [  29.71  25.25  26.91  31.87  85.17  22.09  ] [ wikipedia 11   ]
[ phi: 6 ]   [  24.32  28.18  23.47  24.92  21.94  82.02  ] [ youtube 11     ]

Which counts as a success! This matrix is essentially identical to this one. Except this time, we didn't specify the classes, average-categorize did that for us, resulting in phi {1,2,3,4,5,6}.

So, a nice example of unlabeled learning. Though admittedly, webpage superpositions are rather well behaved. Certainly better behaved than the adult wage prediction example! In the sense that different classes have distinct superpositions. The wage prediction example completely failed that test.

Now we have distinct classes we can give them names. Pretty simply actually:

M |phi: 1> => |abc>
M |phi: 2> => |adelaidenow>
M |phi: 3> => |slashdot>
M |phi: 4> => |smh>
M |phi: 5> => |wikipedia>
M |phi: 6> => |youtube>

Now, take a look:

sa: matrix[M,result]
[ abc         ] = [  1  0  0  0  0  0  ] [  91.69  28.73  25.76  37.77  29.45  24.33  ] [ abc 11         ]
[ adelaidenow ]   [  0  1  0  0  0  0  ] [  28.77  78.04  26.72  29.86  25.25  28.19  ] [ adelaidenow 11 ]
[ slashdot    ]   [  0  0  1  0  0  0  ] [  25.76  26.88  79.05  28.28  26.86  23.21  ] [ slashdot 11    ]
[ smh         ]   [  0  0  0  1  0  0  ] [  37.8   29.75  28.16  85.54  32.06  24.95  ] [ smh 11         ]
[ wikipedia   ]   [  0  0  0  0  1  0  ] [  29.71  25.25  26.91  31.87  85.17  22.09  ] [ wikipedia 11   ]
[ youtube     ]   [  0  0  0  0  0  1  ] [  24.32  28.18  23.47  24.92  21.94  82.02  ] [ youtube 11     ]

Or, as a merged matrix:

sa: merged-matrix[M,result]
[ abc         ] = [  91.69  28.73  25.76  37.77  29.45  24.33  ] [ abc 11         ]
[ adelaidenow ]   [  28.77  78.04  26.72  29.86  25.25  28.19  ] [ adelaidenow 11 ]
[ slashdot    ]   [  25.76  26.88  79.05  28.28  26.86  23.21  ] [ slashdot 11    ]
[ smh         ]   [  37.8   29.75  28.16  85.54  32.06  24.95  ] [ smh 11         ]
[ wikipedia   ]   [  29.71  25.25  26.91  31.87  85.17  22.09  ] [ wikipedia 11   ]
[ youtube     ]   [  24.32  28.18  23.47  24.92  21.94  82.02  ] [ youtube 11     ]

Presumably humans/children do something similar. They build up and recognize classes of objects before they even have a name for them. Only later after the classes are well specified they discover a label (in the above case, the M learn rules).

average-categorize H-I simple pattern recognition example

Last post I outlined and gave the code for average-categorize. This post a simple example. Recall the simple pattern recognition example, H-I-pat-rec.sw. Let's apply average-categorize to it.

Now, we have these patterns:

x: letter: H
1       1
1       1
1       1
1 1 1 1 1
1       1
1       1
1       1

x: noisy: H
        1
1       1
1       1
1 1 1   1
1
1       1
1       1

x: noisy: H2
1       1
1
1   1 1 1
1 1 1 1 1
1 1     1
1       1
1 1 1   1

x: letter: I
1 1 1 1 1
    1
    1
    1
    1
    1
1 1 1 1 1

x: noisy: I
1 1 1 1
    1


    1
    1
1   1 1 1

x: noisy: I2
1 1     1
  1 1 1
    1
    1
    1 1 1
1 1 1 1
1 1 1 1 1

x: letter: O
1 1 1 1 1
1
1
1
1
1
1 1 1 1 1

Run this code:

average_categorize(C,"pixels",0.65,"phi","pixels")

And now we have:

I think that works pretty well!

Webpage superposition example in the next post.

new function: average-categorize

This one I think will be super useful, and super powerful. I only just came up with it, so I haven't fully explored its properties, but I have already found some. And it's looking promising!

So the idea is, given a collection of points, how do we classify them into groups? I'm pretty sure this is a solved problem, but I wanted to see what I could come up with.

It goes something like this (though see the code for exact details). We have a list of currently known patterns (out_list), and a list of incoming patterns (one). For each incoming pattern (r), compare it with each of the known patterns (sp) using simm. Store the name and similarity value for the best match. At the end of the loop, if the best match is lower than a threshold (recall for simm, higher threshold is a better match, exactly 1 for exact match) then store it as its' own known pattern. Otherwise "average" it with the best matching known pattern, weighted based on its similarity. And in BKO land, simm scales/normalizes superpositions, so to average just means add the superpositions, and we don't need to divide by n.

Here is the code:

def average_categorize(context,parameters):
  try:
    op,t,phi,ave = parameters.split(',')
    t = float(t)
  except:
    return ket("",0)
    
  one = context.relevant_kets(op)
  out_list = []
  for x in one:
    r = x.apply_op(context,op)
    best_k = -1
    best_simm = 0
    for k,sp in enumerate(out_list):
      similarity = silent_simm(r,sp)
      if similarity > best_simm:
        best_k = k
        best_simm = similarity
    if best_k == -1 or best_simm < t:
      out_list.append(r)
    else:
      k = best_k
#      out_list[k] += r
      out_list[k] += r.multiply(best_simm)  # reweight based on result of simm.
  for k,sp in enumerate(out_list):
    context.learn(ave,phi + ": " + str(k+1),sp)
  return ket("average categorize")

Now, I may tweak this code yet, but it is already a good start. I will give a couple of examples in future posts.

Some properties:
1) it converges as you add more examples
2) the order you learn is important, earlier examples have a bigger effect than later ones
3) it reproduces the light-bulb effect, ie stronger examples have a bigger effect
4) it reproduces the reverse-light-bulb effect, ie, weaker examples have essentially no impact
5) I think it is roughly O(n^2), certainly cheaper than my previous categorize code.

I guess I need to explain what I mean by "stronger examples" and "weaker examples". Superpositions have a thing I call currency. Let's drop to the console:

sa: measure-currency |x>
|number: 1>

sa: measure-currency 2.71|y>
|number: 2.71>

sa: measure-currency (|a> + |b> + |c>)
|number: 3>

sa: measure-currency (0.2|a> + 4|b> + 11|c> + 0.3|d>)
|number: 15.5>

So when I talk of "stronger example" I mean r has a high currency, and a "weaker example" I mean r has a small currency.

If we look at the line:

      out_list[k] += r.multiply(best_simm)

then (ignoring the simm reweighting), the higher the currency of r, the more power it has to effect out_list[k]. I'm not sure that is clear! I'm working from mental images, and it is hard to translate them to words.

More to come!

BTW, I don't think I have mentioned this anywhere yet, but we can say an operator "op" preserves currency if:

measure-currency SP == measure-currency op SP

A related definition (and probably a slightly better one) for currency conservation is that if in matrix format the sum of each column is exactly 1, then we can say that operator conserves currency.

Thursday, 7 May 2015

an interesting little observation

OK. Just a quick one. There seems to be an interesting analogy between the different layers in the BKO scheme.

Take the BKO:
|y> => op4 op3 op2 op1 "" |x>
which simply corresponds to stepping from superposition to superposition, then storing it in |y>
Well, a level above that, we have something similar with .sw files. Noting that just loading an sw file can do computation (the only time they don't is if all the rules are literal superpositions).

So the analogy of the above, at the sw level is:
load x.sw
load op1.sw
load op2.sw
load op3.sw
load op4.sw
save y.sw

Anyway, I just thought it was an interesting little observation.

announcing: command line file similarity tool

Decided to write a script that spits out the similarity of two files, using my simm. It has three mappings from file to superposition/list (byte, 2-byte and fragment simm) and gives both scaled and unscaled simm results. Here is the code, and I decided to make it open source!

I guess some examples:

-- define some toy files:
$ echo -n "aa" > a.txt
$ echo -n "a" > b.txt

-- find their similarity:
$ ./simm.py a.txt b.txt
file 1:                a.txt
file 2:                b.txt

unscaled:
  byte similarity:     50 %
  2 byte similarity:   50 %

scaled:
  byte similarity:     100 %
  2 byte similarity:   50 %

Next example:

-- define toy files:
$ echo -n "fish" > a.txt
$ echo -n "fsih" > b.txt

-- find similarity:
$ ./simm.py a.txt b.txt
file 1:                a.txt
file 2:                b.txt

unscaled:
  byte similarity:     100 %
  2 byte similarity:   25 %

scaled:
  byte similarity:     100 %
  2 byte similarity:   25 %

A slightly more interesting example, and this time we also use fragment simm:

-- define a couple of sentences:
$ echo "Shakespeare produced most of his known work between 1589 and 1613" > a.txt
$ echo "Shakespeare produced most of his work after 1589" > b.txt

-- find their similarity, and fragment simm on spaces:
$ ./simm.py a.txt b.txt ' '
file 1:                a.txt
file 2:                b.txt

split strings:         ' '

unscaled:
  byte similarity:     71.21 %
  2 byte similarity:   65.15 %
  fragment similarity: 63.64 %

scaled:
  byte similarity:     82.19 %
  2 byte similarity:   66.2 %
  fragment similarity: 63.64 %

Now, on larger files. Recall the simm matrix I made here? Well, let's test a couple of examples:

$ ./simm.py example-files/binary.exe example-files/ebook.txt
file 1:                example-files/binary.exe
file 2:                example-files/ebook.txt

unscaled:
  byte similarity:     9.25 %
  2 byte similarity:   1.92 %

scaled:
  byte similarity:     12.44 %
  2 byte similarity:   2.66 %

$ ./simm.py example-files/ebook.txt example-files/website.html
file 1:                example-files/ebook.txt
file 2:                example-files/website.html

unscaled:
  byte similarity:     24.07 %
  2 byte similarity:   17.82 %

scaled:
  byte similarity:     72.7 %
  2 byte similarity:   47.26 %

$ ./simm.py example-files/encrypted.gpg example-files/zipped.zip
file 1:                example-files/encrypted.gpg
file 2:                example-files/zipped.zip

unscaled:
  byte similarity:     94.06 %
  2 byte similarity:   63.75 %

scaled:
  byte similarity:     97.6 %
  2 byte similarity:   65.2 %

Now, a more interesting fragment simm example. Recall the simm matrices I made here. Well, let's give some examples:

$ ./simm.py webpages-v2/abc-1.html webpages-v2/abc-7.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/abc-7.html

split strings:         '<' '>'

unscaled:
  byte similarity:     98.28 %
  2 byte similarity:   95.28 %
  fragment similarity: 91.88 %

scaled:
  byte similarity:     98.76 %
  2 byte similarity:   95.65 %
  fragment similarity: 91.88 %

$ ./simm.py webpages-v2/abc-1.html webpages-v2/adelaidenow-1.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/adelaidenow-1.html

split strings:         '<' '>'

unscaled:
  byte similarity:     11.41 %
  2 byte similarity:   10.63 %
  fragment similarity: 7.4 %

scaled:
  byte similarity:     82.51 %
  2 byte similarity:   61.66 %
  fragment similarity: 28.76 %

$ ./simm.py webpages-v2/abc-1.html webpages-v2/youtube-1.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/youtube-1.html

split strings:         '<' '>'

unscaled:
  byte similarity:     13.23 %
  2 byte similarity:   10.87 %
  fragment similarity: 10.29 %

scaled:
  byte similarity:     76.47 %
  2 byte similarity:   49.16 %
  fragment similarity: 24.27 %

And finally, some text files:

$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-Tom_Sawyer_74.txt ' '
file 1:                text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2:                text/ebook-Tom_Sawyer_74.txt

split strings:         ' '

unscaled:
  byte similarity:     40.02 %
  2 byte similarity:   38.49 %
  fragment similarity: 28.69 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.92 %
  fragment similarity: 54.34 %

$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-moby-shakespeare.txt ' '
file 1:                text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2:                text/ebook-moby-shakespeare.txt

split strings:         ' '

unscaled:
  byte similarity:     3.13 %
  2 byte similarity:   3.11 %
  fragment similarity: 2.5 %

scaled:
  byte similarity:     90.8 %
  2 byte similarity:   79.5 %
  fragment similarity: 41.8 %

$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt ' '
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-Australia.txt

split strings:         ' '

unscaled:
  byte similarity:     88.89 %
  2 byte similarity:   81.13 %
  fragment similarity: 46.52 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.49 %
  fragment similarity: 48.83 %

$ ./simm.py text/WP-Adelaide.txt text/WP-physics.txt ' '
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-physics.txt

split strings:         ' '

unscaled:
  byte similarity:     54.64 %
  2 byte similarity:   50.26 %
  fragment similarity: 23.42 %

scaled:
  byte similarity:     90.3 %
  2 byte similarity:   77.31 %
  fragment similarity: 35.54 %

And one final example. This is what happens if you try the wrong split strings for a given document type. eg here less-than and greater-than on a text file:

$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt '<' '>'
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-Australia.txt

split strings:         '<' '>'

unscaled:
  byte similarity:     88.89 %
  2 byte similarity:   81.13 %
  fragment similarity: 0 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.49 %
  fragment similarity: 0 %

Note the 0% fragment similarity.

That's it for this post. BTW, in looking around to see if others have written file similarity tools I found simhash. Heh, almost certainly faster than my method!

BTW, the above should work just fine on a large range of document types. You just need to choose the right split strings. eg, for C we might use:

./simm.py code-1.c code-2.c ' ' '{' '}' '(' ')' ';'

ie, space, curly-brackets, normal brackets and semi-colon.

Indeed, we could also apply it to DNA (though presumably real researchers already have tools to do this!). So, we can specify split-strings, and apply it to text files of DNA sequences.
eg:

./simm.py DNA-1.txt DNA-2.txt 'GAAATTCCCA' 'ATACCACT' 'AACCACACAC' 'TTAGGG'

That we can do this should not be a surprise, since my fragment similarity idea was originally motivated by the idea of gel electrophoresis.

OK. Let's do a worked example!
Let's first choose 'TTAGGG' as our split string.
Let's choose these to be our 2 DNA sequences:
GAAATTCCCA TTAGGG ATACCACT
AACCACACAC TTAGGG GAAATTCCCA
And by construction, we expect 50% similarity.
Let's take a look:

$ echo -n "GAAATTCCCATTAGGGATACCACT" > a.txt
$ echo -n "AACCACACACTTAGGGGAAATTCCCA" > b.txt

$ ./simm.py a.txt b.txt 'TTAGGG'
file 1:                a.txt
file 2:                b.txt

split strings:         'TTAGGG'

unscaled:
  byte similarity:     84.62 %
  2 byte similarity:   73.08 %
  fragment similarity: 50 %

scaled:
  byte similarity:     89.1 %
  2 byte similarity:   75.64 %
  fragment similarity: 50 %

Success!

Finally, my code is somewhat slow! A C implementation would be nice.

Update: if your files are line based, '\n' (ie, new-line) should usually be a sufficient split string.
If you have csv files say, then maybe '\n' and ',' (ie, new-line and comma).

Update: talking of speeding it up, if we use lists of bits, we can speed it up somewhat.
wf = sum-of-bits(f)
wg = sum-of-bits(g)
wfg = sum-of-bits(xor(f,g))
result = (wf + wg - wfg)/2*max(wf,wg)

Update: OK. I given this "speed up" some thought, and decided it will not be as good as the full simm. A for example is, consider C code. If we use the bit method a file containing 1 "int" is "identical" to another containing say 200 "int". Which is not all that useful!