Sunday 21 June 2015

some more similar[inverse-links-to] results

This time I'm using 300,000 pages of Wikipedia (out of 15,000,000 total), so roughly 2% of the whole. Even with EC2, I don't really have the processing power (with the current code) to use much larger sets than this.
sa: load 300k--wikipedia-links.sw
sa: find-inverse[links-to]
sa: H |*> #=> how-many inverse-links-to merge-labels(|WP: > + |_self>)
sa: S |*> #=> table[wikipage,coeff] select[1,60] 100 self-similar[inverse-links-to] merge-labels(|WP: > + |_self>)
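
For readers who don't know the notation, here is a minimal Python sketch of roughly what the operators above seem to be doing. It is not the project's code. The helper names `invert_links`, `similarity` and `similar_pages` are hypothetical, the toy link data is made up, and the similarity formula (shared link weight divided by the larger of the two totals, scaled to 100) is an assumption, chosen because it fits the coefficients in the tables in this post.

```python
from collections import Counter, defaultdict

def invert_links(links_to):
    """Map each page to a Counter of the pages that link to it."""
    inverse = defaultdict(Counter)
    for source, targets in links_to.items():
        for target in targets:
            inverse[target][source] += 1
    return inverse

def similarity(a, b):
    """Assumed measure: overlap of two weighted link sets, scaled to 0..100."""
    if not a or not b:
        return 0.0
    shared = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    return 100.0 * shared / max(sum(a.values()), sum(b.values()))

def similar_pages(inverse, page, top=5):
    """Rank all pages by how similar their inbound links are to those of `page`."""
    target = inverse[page]
    scores = {other: similarity(target, links) for other, links in inverse.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(p, s) for p, s in ranked if s > 0][:top]

# Made-up toy data: page -> pages it links to.
links_to = {
    "A": ["Love", "Pride", "Envy"],
    "B": ["Love", "Pride"],
    "C": ["Love", "Matter"],
    "D": ["Matter"],
}
inverse = invert_links(links_to)
for name, coeff in similar_pages(inverse, "Love"):
    print(f"{name:8} {coeff:.3f}")
```

In this toy graph "Pride" ranks just below "Love" itself, because two of the three pages that link to "Love" also link to "Pride".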

sa: S |Love>
+-----------------------------------+--------+
| wikipage                          | coeff  |
+-----------------------------------+--------+
| Love                              | 100.0  |
| Pride                             | 17.391 |
| Pleasure                          | 13.043 |
| Jealousy                          | 13.043 |
| Philotes_(mythology)              | 13.043 |
| Imagination                       | 13.043 |
| Pity                              | 13.043 |
| Envy                              | 13.043 |
| Peace                             | 12.121 |
| Matter                            | 12     |
| Fear                              | 8.696  |
| Measurement                       | 8.696  |
| Number                            | 8.696  |
| Observation                       | 8.696  |
| Misanthropy                       | 8.696  |
| Piety                             | 8.696  |
| Courage                           | 8.696  |
| Hope                              | 8.696  |
| Lust                              | 8.696  |
| Asteria                           | 8.696  |
| Orthrus                           | 8.696  |
| Modesty                           | 8.696  |
| Punishment                        | 8.696  |
| Idea                              | 8.696  |
| Politeness                        | 8.696  |
| Learning                          | 8.696  |
| Luck                              | 8.696  |
| Sexual_attraction                 | 8.696  |
| Necessity                         | 8.696  |
| Physical_intimacy                 | 8.696  |
| Wrath                             | 8.696  |
| Gluttony                          | 8.696  |
| Prediction                        | 8.696  |
| Darkness                          | 8.696  |
| Safety                            | 8.696  |
| Optimism                          | 8.696  |
| Doubt                             | 8.696  |
| Moderation                        | 8.696  |
| Compassion                        | 8.696  |
| Respect                           | 8.696  |
| Nomenclature                      | 8.696  |
| Courtship                         | 8.696  |
| Jonathan_Barnes                   | 8.696  |
| DielsKranz_numbering_system       | 8.696  |
| John_Raven                        | 8.696  |
| De_amore_(Andreas_Capellanus)     | 8.696  |
| Infatuation                       | 8.696  |
| Category:Love                     | 8.696  |
| Contempt                          | 8.696  |
| Memory                            | 8.696  |
| Quantity                          | 8.696  |
| cyclops                           | 8.696  |
| Curiosity                         | 8.696  |
| Passion_(emotion)                 | 8.696  |
| Category:Philosophy_of_love       | 8.696  |
| nonverbal_communication           | 8.696  |
| Air                               | 8.696  |
| Neikea                            | 8.696  |
| Peter_Kingsley_(scholar)          | 8.696  |
| Inquiry                           | 8.696  |
+-----------------------------------+--------+
  Time taken: 1 hour, 42 minutes, 23 seconds, 210 milliseconds

sa: S |Knowledge>
+----------------------------+--------+
| wikipage                   | coeff  |
+----------------------------+--------+
| Knowledge                  | 100.0  |
| Inquiry                    | 16     |
| Measurement                | 12     |
| Pride                      | 12     |
| Idea                       | 12     |
| Learning                   | 12     |
| Prediction                 | 12     |
| Experience                 | 12     |
| Memory                     | 12     |
| Intelligence_(trait)       | 12     |
| understanding              | 10.345 |
| Imre_Lakatos               | 8.333  |
| Beauty                     | 8      |
| Outline_of_education       | 8      |
| Faith                      | 8      |
| Love                       | 8      |
| Meaning_of_life            | 8      |
| Metaphor                   | 8      |
| Nominalism                 | 8      |
| Number                     | 8      |
| Observation                | 8      |
| Platonic_idealism          | 8      |
| Pain                       | 8      |
| Pathological_science       | 8      |
| Problem_of_other_minds     | 8      |
| Misanthropy                | 8      |
| Piety                      | 8      |
| Virtue                     | 8      |
| Lust                       | 8      |
| Discovery_(observation)    | 8      |
| Ineffability               | 8      |
| Belief                     | 8      |
| Organization               | 8      |
| Modesty                    | 8      |
| Placebo                    | 8      |
| Punishment                 | 8      |
| Quasi-empirical_method     | 8      |
| Pleasure                   | 8      |
| Jealousy                   | 8      |
| Authority                  | 8      |
| Karl_Mannheim              | 8      |
| Paradigm                   | 8      |
| Intensionality             | 8      |
| Problem_of_induction       | 8      |
| Necessity                  | 8      |
| Elegance                   | 8      |
| Prattyasamutpda            | 8      |
| Moderation                 | 8      |
| Phenomenalism              | 8      |
| Nomenclature               | 8      |
| Potentiality_and_actuality | 8      |
| Max_Scheler                | 8      |
| Matter                     | 8      |
| Panpsychism                | 8      |
| Information                | 8      |
| knowledge_management       | 8      |
| Lev_Shestov                | 8      |
| Interpretation_(logic)     | 8      |
| Outline_of_philosophy      | 8      |
| Outline_of_logic           | 8      |
+----------------------------+--------+
  Time taken: 1 hour, 48 minutes, 29 seconds, 868 milliseconds

sa: H |Google>
|number: 704>

sa: S |Google>
+---------------------------------------+--------+
| wikipage                              | coeff  |
+---------------------------------------+--------+
| Google                                | 100.0  |
| Apple_Inc.                            | 14.063 |
| Microsoft                             | 12.732 |
| Facebook                              | 11.222 |
| Yahoo!                                | 9.375  |
| World_Wide_Web                        | 8.807  |
| IBM                                   | 8.093  |
| Sun_Microsystems                      | 7.955  |
| Android_(operating_system)            | 7.812  |
| Internet                              | 7.487  |
| Amazon.com                            | 7.102  |
| Intel                                 | 6.676  |
| Linux                                 | 6.537  |
| Hewlett-Packard                       | 6.25   |
| Stanford_University                   | 6.108  |
| Twitter                               | 6.108  |
| web_browser                           | 6.108  |
| HTML                                  | 5.824  |
| operating_system                      | 5.803  |
| YouTube                               | 5.657  |
| Forbes                                | 5.384  |
| Massachusetts_Institute_of_Technology | 5.324  |
| Java_(programming_language)           | 4.83   |
| AOL                                   | 4.687  |
| smartphone                            | 4.687  |
| open_source                           | 4.687  |
| C_(programming_language)              | 4.608  |
| Silicon_Valley                        | 4.545  |
| Nokia                                 | 4.403  |
| C++                                   | 4.403  |
| Microsoft_Windows                     | 4.354  |
| JavaScript                            | 4.261  |
| Wired_(magazine)                      | 4.261  |
| Motorola                              | 4.119  |
| XML                                   | 4.119  |
| Wall_Street_Journal                   | 4.119  |
| CNET                                  | 4.119  |
| copyright                             | 4.119  |
| software                              | 4.119  |
| Oracle_Corporation                    | 3.977  |
| Sony                                  | 3.977  |
| Unix                                  | 3.977  |
| Mac_OS_X                              | 3.977  |
| Wikipedia                             | 3.977  |
| Internet_Explorer                     | 3.835  |
| OS_X                                  | 3.835  |
| source_code                           | 3.835  |
| eBay                                  | 3.835  |
| computer_science                      | 3.748  |
| University_of_California,_Berkeley    | 3.732  |
| IP_address                            | 3.693  |
| Larry_Page                            | 3.693  |
| iPhone                                | 3.693  |
| algorithm                             | 3.693  |
| free_software                         | 3.693  |
| University_of_Michigan                | 3.551  |
| GNU_General_Public_License            | 3.551  |
| database                              | 3.551  |
| Carnegie_Mellon_University            | 3.409  |
| Cisco_Systems                         | 3.409  |
+---------------------------------------+--------+
  Time taken: 1 day, 18 hours, 53 minutes, 1 second, 791 milliseconds

sa: H |Blog>
|number: 32>

sa: S |Blog>
+-----------------------------------------------------------------+-------+
| wikipage                                                        | coeff |
+-----------------------------------------------------------------+-------+
| Blog                                                            | 100   |
| Active_Server_Pages                                             | 9.375 |
| Desktop_publishing                                              | 9.375 |
| Online_chat                                                     | 9.375 |
| CAPTCHA                                                         | 9.375 |
| RSS                                                             | 9.302 |
| Dynamic_HTML                                                    | 6.25  |
| Malware                                                         | 6.25  |
| Chat_room                                                       | 6.25  |
| Content_management_system                                       | 6.25  |
| ABC_World_News_Tonight                                          | 6.25  |
| Cross-site_scripting                                            | 6.25  |
| Primetime_(TV_series)                                           | 6.25  |
| Phishing                                                        | 6.25  |
| home_page                                                       | 6.25  |
| Open_source_software                                            | 6.25  |
| impact_factor                                                   | 6.25  |
| Terminate_and_Stay_Resident                                     | 6.25  |
| electronic_mailing_list                                         | 6.25  |
| Podcast                                                         | 6.25  |
| Google_Scholar                                                  | 6.25  |
| OPML                                                            | 6.25  |
| feed_aggregator                                                 | 6.25  |
| peer-review                                                     | 6.25  |
| Social_networking_service                                       | 6.25  |
| Digg                                                            | 6.25  |
| carbon_copy                                                     | 6.25  |
| online_community                                                | 6.25  |
| Freemium                                                        | 6.25  |
| Microsoft_Silverlight                                           | 6.25  |
| Wikia                                                           | 6.25  |
| Peer-to-peer_file_sharing                                       | 6.25  |
| Fully_qualified_domain_name                                     | 6.25  |
| Category:Internet_forums                                        | 6.25  |
| Category:American_broadcast_news_analysts                       | 6.25  |
| arXiv.org                                                       | 6.25  |
| preprint                                                        | 6.25  |
| Cicada_3301                                                     | 6.25  |
| fansite                                                         | 6.25  |
| Affiliate_marketing                                             | 6.25  |
| Category:American_television_news_anchors                       | 6.25  |
| Category:ABC_News_personalities                                 | 6.25  |
| Category:American_television_reporters_and_correspondents       | 6.25  |
| Lisa_McRee                                                      | 6.25  |
| Category:Electronic_publishing                                  | 6.25  |
| Kevin_Newman_(journalist)                                       | 6.25  |
| Robin_Roberts_(sportscaster)                                    | 6.25  |
| Internet_Information_Services                                   | 6.061 |
| newsmagazine                                                    | 5.882 |
| George_Stephanopoulos                                           | 5.714 |
| news_presenter                                                  | 5.714 |
| FAQ                                                             | 5.556 |
| Internet_meme                                                   | 5.405 |
| Common_Gateway_Interface                                        | 5.263 |
| Bulletin_board_system                                           | 5.172 |
| Internet_slang                                                  | 5     |
| news_anchor                                                     | 4.651 |
| Document_Object_Model                                           | 4.444 |
| Staff_writer                                                    | 4.444 |
| web_application                                                 | 4.348 |
+-----------------------------------------------------------------+-------+
  Time taken: 2 hours, 12 minutes, 50 seconds, 381 milliseconds

sa: H |arXiv.org>
|number: 3>

sa: S |arXiv.org>
+------------------------------------------------------------------------------------+--------+
| wikipage                                                                           | coeff  |
+------------------------------------------------------------------------------------+--------+
| arXiv.org                                                                          | 100    |
| citation_impact                                                                    | 40     |
| serials_crisis                                                                     | 40     |
| NEC_Research_Institute                                                             | 40     |
| postprint                                                                          | 40     |
| institutional_repository                                                           | 40     |
| OAIster                                                                            | 40     |
| SHERPA_(organisation)                                                              | 40     |
| Category:Electronic_publishing                                                     | 40     |
| Paul_Ginsparg                                                                      | 33.333 |
| preprint                                                                           | 27.273 |
| self-archiving                                                                     | 25     |
| Category:Academic_publishing                                                       | 23.077 |
| Methodological_naturalism                                                          | 20     |
| Presocratics                                                                       | 20     |
| Cryptology_ePrint_Archive                                                          | 20     |
| Open_publishing                                                                    | 20     |
| Hubble_diagram                                                                     | 20     |
| GZK_paradox                                                                        | 20     |
| List_of_unsolved_problems_in_physics                                               | 20     |
| Print_on_demand                                                                    | 20     |
| TeV                                                                                | 20     |
| Boundary_condition                                                                 | 20     |
| Black_body_radiation                                                               | 20     |
| Subscriptions                                                                      | 20     |
| R.P._Feynman                                                                       | 20     |
| Citeseer                                                                           | 20     |
| Citation_index                                                                     | 20     |
| File:Solvay_conference_1927.jpg                                                    | 20     |
| File:Senenmut-Grab.JPG                                                             | 20     |
| bioacoustics                                                                       | 20     |
| pattern_formation                                                                  | 20     |
| University_Physics                                                                 | 20     |
| File:Archimedes-screw_one-screw-threads_with-ball_3D-view_animated_small.gif       | 20     |
| Bryn_Mawr_Classical_Review                                                         | 20     |
| File:Acceleration_components.JPG                                                   | 20     |
| Delayed_open-access_journal                                                        | 20     |
| Astronomical_ceiling_of_Senemut_Tomb                                               | 20     |
| quantitative_finance                                                               | 20     |
| File:CMS_Higgs-event.jpg                                                           | 20     |
| James_Madison_Award                                                                | 20     |
| Public_Knowledge_Project                                                           | 20     |
| the_central_science                                                                | 20     |
| Difference_between_chemistry_and_physics                                           | 20     |
| theses                                                                             | 20     |
| Optical_physics                                                                    | 20     |
| analytic_solution                                                                  | 20     |
| weakly_interacting_massive_particle                                                | 20     |
| superclusters                                                                      | 20     |
| Open_Humanities_Press                                                              | 20     |
| iBooks_Author                                                                      | 20     |
| econophysics                                                                       | 20     |
| ultrasonics                                                                        | 20     |
| OAI-PMH                                                                            | 20     |
| Journal_of_Library_Administration                                                  | 20     |
| File:Einstein1921_by_F_Schmutzer_2.jpg                                             | 20     |
| Ancient_Greek_poetry                                                               | 20     |
| Publish_or_perish                                                                  | 20     |
| higher_dimension                                                                   | 20     |
| IBEX                                                                               | 20     |
+------------------------------------------------------------------------------------+--------+
  Time taken: 41 minutes, 16 seconds, 954 milliseconds

sa: H |Theory_of_everything>
|number: 13>

sa: S |Theory_of_everything>
+-------------------------------------------------------------+--------+
| wikipage                                                    | coeff  |
+-------------------------------------------------------------+--------+
| Theory_of_everything                                        | 100.0  |
| Ultimate_fate_of_the_universe                               | 21.429 |
| Planck_scale                                                | 17.391 |
| Big_Rip                                                     | 15.385 |
| Eddington_limit                                             | 15.385 |
| Supersymmetry                                               | 15.385 |
| Arrow_of_time                                               | 15.385 |
| Dimensionless_physical_constant                             | 15.385 |
| Plumian_Professor_of_Astronomy_and_Experimental_Philosophy  | 15.385 |
| Sir_Roger_Penrose                                           | 15.385 |
| Bakerian_Lecture                                            | 15.385 |
| grand_unified_theory                                        | 15.385 |
| Big_Freeze                                                  | 15.385 |
| Topological_order                                           | 15.385 |
| Baryon_asymmetry                                            | 15.385 |
| Neutrino_mass                                               | 15.385 |
| Unified_field_theory                                        | 15.385 |
| Membrane_(M-theory)                                         | 15.385 |
| Static_forces_and_virtual-particle_exchange                 | 15.385 |
| Generation_(particle_physics)                               | 15.385 |
| Stellar_nucleosynthesis                                     | 14.286 |
| Compact_Muon_Solenoid                                       | 13.333 |
| Cosmic_inflation                                            | 13.333 |
| neutrino_oscillation                                        | 12.5   |
| Hermann_Bondi                                               | 11.765 |
| Category:Presidents_of_the_Royal_Astronomical_Society       | 11.111 |
| YangMills_theory                                            | 11.111 |
| anthropic_principle                                         | 10.345 |
| Dark_matter                                                 | 9.524  |
| James_Watson                                                | 9.091  |
| CP_violation                                                | 8      |
| Anisotropy                                                  | 7.692  |
| Antiparticle                                                | 7.692  |
| Acts                                                        | 7.692  |
| Centripetal_force                                           | 7.692  |
| Graviton                                                    | 7.692  |
| Gluon                                                       | 7.692  |
| Hydrogen_atom                                               | 7.692  |
| Liquid_crystal                                              | 7.692  |
| Main_sequence                                               | 7.692  |
| Morphogenesis                                               | 7.692  |
| Panspermia                                                  | 7.692  |
| Proton_decay                                                | 7.692  |
| Qubit                                                       | 7.692  |
| Tokamak                                                     | 7.692  |
| Quintessence_(physics)                                      | 7.692  |
| Sonoluminescence                                            | 7.692  |
| Gravitational_lens                                          | 7.692  |
| High-temperature_superconductor                             | 7.692  |
| Fact                                                        | 7.692  |
| Timeline_of_gravitational_physics_and_relativity            | 7.692  |
| Timeline_of_stellar_astronomy                               | 7.692  |
| List_of_astronomers                                         | 7.692  |
| Astrophysicist                                              | 7.692  |
| Triple-alpha_process                                        | 7.692  |
| Religious                                                   | 7.692  |
| Quark_matter                                                | 7.692  |
| Gravity_assist                                              | 7.692  |
| Theory_of_Everything                                        | 7.692  |
| Color_confinement                                           | 7.692  |
+-------------------------------------------------------------+--------+
  Time taken: 1 hour, 8 minutes, 42 seconds, 470 milliseconds
OK. Some cool results in there. Actually, I think they are amazing! I think I have done enough examples of this now.

Though I should note that the bigger the number H returns, the better the result. This presumably means that if we used even more of Wikipedia, we would get even better results! It also brings to mind the question: how many wikipages do we need to know more than the average human?
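
One way to see why a bigger H helps, as a rough sketch rather than a claim about the actual code: if the coefficient is the number of shared inbound links divided by the larger of the two pages' totals (an assumption that fits the tables above), then a page with few inbound links can only produce a handful of distinct coefficients, so lots of unrelated pages tie. The hypothetical helper `possible_coeffs` below lists the values available when the other page has no more inbound links than the query page.

```python
def possible_coeffs(h):
    """Coefficients 100*k/h for k = 1..h shared links (hypothetical helper)."""
    return [round(100.0 * k / h, 3) for k in range(1, h + 1)]

# H values reported above for Theory_of_everything, Blog and Google.
for h in (13, 32, 704):
    coeffs = possible_coeffs(h)
    print(h, len(coeffs), coeffs[:3])
```

For Blog (H = 32) the second and third values are 6.25 and 9.375, both of which appear in that table. With H = 704 there are far more steps available, which is one plausible reason the Google table has so few ties.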

BTW, I don't think I have linked to this yet: the full Wikipedia link structure in sw notation. It compresses with bzip2 down to about 2 GB, I seem to recall.
