Wednesday 25 November 2015

revisiting wikipedia inverse-links-to semantic similarities

Nothing new here, just some more wikipedia semantic similarity examples. The motivation was partly word2vec, which ships with some nice examples of semantic similarity using its word vectors. I had planned to write my own word2sp, but so far my idea has failed! And I couldn't simply reuse their word vectors, because they have negative coefficients, while my similarity metric requires positive coefficients.
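
To see why the sign of the coefficients matters, here is a minimal Python sketch of a simm-style overlap metric on sparse vectors stored as {key: coeff} dicts. The name simm and the exact normalisation are illustrative; my actual code differs in the details:
# a toy simm-style similarity on non-negative sparse vectors
def simm(f, g):
    # sum of element-wise minimums, over the larger of the two totals
    overlap = sum(min(f.get(k, 0), g.get(k, 0)) for k in set(f) | set(g))
    norm = max(sum(f.values()), sum(g.values()))
    return overlap / norm if norm > 0 else 0

# eg: simm({'a': 1, 'b': 2}, {'b': 1, 'c': 3}) == 1/4
# negative coefficients would break the min/max logic, which is why
# word2vec vectors can't be dropped straight into a metric like this.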

So in the meantime I decided to re-run my wikipedia code. I tried to use the 300,000 wikipedia links sw file, but that failed too: it needed too much RAM and took too long to run. I thought I had used it in the past, in which case I don't know why it failed this time!

Here is the first word2vec example (distance to "france"):
                 Word       Cosine distance
-------------------------------------------
                spain              0.678515
              belgium              0.665923
          netherlands              0.652428
                italy              0.633130
          switzerland              0.622323
           luxembourg              0.610033
             portugal              0.577154
               russia              0.571507
              germany              0.563291
            catalonia              0.534176
Here it is using my code (a rough Python sketch of the underlying idea follows the table):
sa: load 30k--wikipedia-links.sw
sa: find-inverse[links-to]
sa: T |*> #=> table[page,coeff] select[1,200] 100 self-similar[inverse-links-to] |_self>
sa: T |WP: France>
+--------------------------------+--------+
| page                           | coeff  |
+--------------------------------+--------+
| France                         | 100.0  |
| Germany                        | 31.771 |
| United_Kingdom                 | 30.537 |
| Italy                          | 27.452 |
| Spain                          | 23.566 |
| United_States                  | 20.152 |
| Japan                          | 19.556 |
| Netherlands                    | 19.309 |
| Russia                         | 18.877 |
| Canada                         | 18.384 |
| Europe                         | 17.273 |
| India                          | 17.212 |
| China                          | 16.78  |
| Paris                          | 16.595 |
| England                        | 16.286 |
| World_War_II                   | 15.923 |
| Australia                      | 15.238 |
| Soviet_Union                   | 14.867 |
| Belgium                        | 14.189 |
| Poland                         | 14.127 |
| Portugal                       | 13.819 |
| World_War_I                    | 13.757 |
| Austria                        | 13.695 |
| Sweden                         | 13.572 |
| Switzerland                    | 13.51  |
| Egypt                          | 12.647 |
| European_Union                 | 12.4   |
| Brazil                         | 12.338 |
| United_Nations                 | 12.091 |
| Greece                         | 11.906 |
| London                         | 11.906 |
| Israel                         | 11.783 |
| Turkey                         | 11.783 |
| Denmark                        | 11.598 |
| French_language                | 11.536 |
| Norway                         | 11.413 |
| Latin                          | 10.611 |
| Rome                           | 10.364 |
| Mexico                         | 10.364 |
| English_language               | 9.994  |
| South_Africa                   | 9.747  |
...
which works pretty well, I must say.
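
As promised, here is a rough Python sketch of what that pipeline is doing, assuming the data is stored as {page: set of pages it links to}. Since links in this data all have coefficient 1, plain sets suffice, and the simm normalisation reduces to set overlap. The function names find_inverse and self_similar are just illustrative, not my real implementation:
from collections import defaultdict

# invert the links-to graph: inverse[p] = the set of pages that link to p
def find_inverse(links_to):
    inverse = defaultdict(set)
    for page, targets in links_to.items():
        for target in targets:
            inverse[target].add(page)
    return inverse

# score every page by overlap of its in-link set with the target's,
# normalised simm-style: 100 * |A & B| / max(|A|, |B|)
def self_similar(inverse, target, top=40):
    A = inverse[target]
    scores = {}
    for page, B in inverse.items():
        denom = max(len(A), len(B))
        if denom:
            scores[page] = 100 * len(A & B) / denom
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]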

Here is the next word2vec example (distance to San Francisco):
                 Word       Cosine distance
-------------------------------------------
          los_angeles              0.666175
          golden_gate              0.571522
              oakland              0.557521
           california              0.554623
            san_diego              0.534939
             pasadena              0.519115
              seattle              0.512098
                taiko              0.507570
              houston              0.499762
     chicago_illinois              0.491598
Here it is using my code:
sa: T |WP: San_Francisco>
+---------------------------------------+--------+
| page                                  | coeff  |
+---------------------------------------+--------+
| San_Francisco                         | 100.0  |
| Los_Angeles                           | 16.704 |
| Chicago                               | 15.919 |
| 1924                                  | 15.522 |
| 1916                                  | 14.566 |
| California                            | 14.502 |
| 1915                                  | 14.286 |
| 2014                                  | 14.217 |
| 1933                                  | 14.031 |
| 1913                                  | 14.006 |
| 1918                                  | 14.0   |
| 1930                                  | 14.0   |
| Philadelphia                          | 13.99  |
| 1925                                  | 13.984 |
| 1931                                  | 13.904 |
| 1920                                  | 13.802 |
| 1932                                  | 13.776 |
| 1942                                  | 13.744 |
| 1999                                  | 13.725 |
...
Hrmm... that didn't work so great. The results are swamped by year pages, presumably because years are linked from so many articles that their in-link sets overlap heavily with almost everything.

Here is the next word2vec example:
Enter word or sentence (EXIT to break): /en/geoffrey_hinton

                        Word       Cosine distance
--------------------------------------------------
           /en/marvin_minsky              0.457204
             /en/paul_corkum              0.443342
 /en/william_richard_peltier              0.432396
           /en/brenda_milner              0.430886
    /en/john_charles_polanyi              0.419538
          /en/leslie_valiant              0.416399
         /en/hava_siegelmann              0.411895
            /en/hans_moravec              0.406726
         /en/david_rumelhart              0.405275
             /en/godel_prize              0.405176
And here it is using my code:
sa: T |WP: Geoffrey_Hinton>
+------------------------------------------------------------+--------+
| page                                                       | coeff  |
+------------------------------------------------------------+--------+
| Geoffrey_Hinton                                            | 100    |
| perceptron                                                 | 66.667 |
| Tom_M._Mitchell                                            | 66.667 |
| computational_learning_theory                              | 66.667 |
| Nils_Nilsson_(researcher)                                  | 66.667 |
| beam_search                                                | 66.667 |
| Raj_Reddy                                                  | 50     |
| AI_effect                                                  | 40     |
| ant_colony_optimization                                    | 40     |
| List_of_artificial_intelligence_projects                   | 33.333 |
| AI-complete                                                | 33.333 |
| Cyc                                                        | 33.333 |
| Hugo_de_Garis                                              | 33.333 |
| Joyce_K._Reynolds                                          | 33.333 |
| Kleene_closure                                             | 33.333 |
| Mondegreen                                                 | 33.333 |
| Supervised_learning                                        | 33.333 |
...
And now a couple more examples:
sa: T |WP: Linux>
+------------------------------------------------+--------+
| page                                           | coeff  |
+------------------------------------------------+--------+
| Linux                                          | 100.0  |
| Microsoft_Windows                              | 46.629 |
| operating_system                               | 37.333 |
| Unix                                           | 28.956 |
| Mac_OS_X                                       | 26.936 |
| C_(programming_language)                       | 24.242 |
| Microsoft                                      | 22.535 |
| GNU_General_Public_License                     | 22.222 |
| Mac_OS                                         | 19.529 |
| Unix-like                                      | 19.192 |
| IBM                                            | 19.048 |
| open_source                                    | 17.845 |
| FreeBSD                                        | 17.845 |
| Apple_Inc.                                     | 16.498 |
| Java_(programming_language)                    | 15.825 |
| OS_X                                           | 15.488 |
| free_software                                  | 15.488 |
| Sun_Microsystems                               | 15.152 |
| C++                                            | 15.152 |
| source_code                                    | 15.152 |
| Macintosh                                      | 14.815 |
| MS-DOS                                         | 13.468 |
| Solaris_(operating_system)                     | 13.468 |
| PowerPC                                        | 13.131 |
| DOS                                            | 13.131 |
| Android_(operating_system)                     | 13.131 |
| Windows_NT                                     | 12.795 |
| Intel                                          | 12.458 |
| programming_language                           | 12.121 |
| personal_computer                              | 12.121 |
| OpenBSD                                        | 11.785 |
| Unicode                                        | 11.111 |
| graphical_user_interface                       | 10.774 |
| video_game                                     | 10.774 |
| Cross-platform                                 | 10.774 |
| Internet                                       | 10.574 |
| OS/2                                           | 10.438 |
...

sa: T |WP: Ronald_Reagan>
+---------------------------------------------------------+--------+
| page                                                    | coeff  |
+---------------------------------------------------------+--------+
| Ronald_Reagan                                           | 100.0  |
| John_F._Kennedy                                         | 22.951 |
| Bill_Clinton                                            | 22.404 |
| Barack_Obama                                            | 22.283 |
| George_H._W._Bush                                       | 22.131 |
| Jimmy_Carter                                            | 22.131 |
| Richard_Nixon                                           | 22.131 |
| George_W._Bush                                          | 22.131 |
| Republican_Party_(United_States)                        | 21.785 |
| Democratic_Party_(United_States)                        | 20.779 |
| United_States_Senate                                    | 19.444 |
| President_of_the_United_States                          | 17.538 |
| White_House                                             | 15.574 |
| Franklin_D._Roosevelt                                   | 15.301 |
| Vietnam_War                                             | 15.242 |
| United_States_House_of_Representatives                  | 14.754 |
| United_States_Congress                                  | 14.085 |
| Supreme_Court_of_the_United_States                      | 13.388 |
| Lyndon_B._Johnson                                       | 13.388 |
| Margaret_Thatcher                                       | 13.115 |
| Cold_War                                                | 13.093 |
| Dwight_D._Eisenhower                                    | 12.568 |
| Nobel_Peace_Prize                                       | 12.368 |
| The_Washington_Post                                     | 12.295 |
| Gerald_Ford                                             | 12.022 |
...

sa: T |WP: Los_Angeles>
+--------------------------------------------------+--------+
| page                                             | coeff  |
+--------------------------------------------------+--------+
| Los_Angeles                                      | 100.0  |
| Chicago                                          | 20.852 |
| California                                       | 18.789 |
| Los_Angeles_Times                                | 17.833 |
| San_Francisco                                    | 16.704 |
| New_York_City                                    | 15.536 |
| Philadelphia                                     | 14.221 |
| NBC                                              | 12.641 |
| Washington,_D.C.                                 | 11.484 |
| Boston                                           | 11.061 |
| USA_Today                                        | 10.609 |
| Texas                                            | 10.384 |
| Academy_Award                                    | 10.158 |
| Seattle                                          | 9.932  |
| New_York                                         | 9.88   |
| Time_(magazine)                                  | 9.851  |
| Mexico_City                                      | 9.707  |
| The_New_York_Times                               | 9.685  |
| Rolling_Stone                                    | 9.481  |
| CBS                                              | 9.481  |
| Toronto                                          | 9.481  |
...
Now for a final couple of comments. First, my code is painfully slow. But to be fair, I'm a terrible programmer, and this is research code, not production code. The point I'm trying to make is that a real programmer could probably get the speed up to an acceptable, and hence usable, level.

For the second point, let me quote from word2vec:
"It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen')"

I haven't tested it, but I'm pretty sure my inverse-links-to sparse vectors do not have this fun property. That is why I want to create my own word2sp code, though it would take a slightly different form in BKO. Something along the lines of:
vector |some object> => exclude(vector|France>,vector|Paris>) + vector|Italy>
table[object,coeff] select[1,20] 100 self-similar[vector] |some object>
should, if it works, return |Rome> as one of the top similarity matches.
Likewise:
vector |some object> => exclude(vector|man>,vector|king>) + vector|woman>
table[object,coeff] select[1,20] 100 self-similar[vector] |some object>
should return |queen> as a top hit.
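
Here is a toy Python sketch of that analogy test, reusing the simm sketch from above, and assuming exclude(a, b) simply drops from b every basis element that appears in a, so coefficients stay non-negative, unlike word2vec's vector subtraction:
# drop from b every key present in a; a non-negative stand-in for subtraction
def exclude(a, b):
    return {k: v for k, v in b.items() if k not in a}

# element-wise addition of two sparse vectors
def add(f, g):
    out = dict(f)
    for k, v in g.items():
        out[k] = out.get(k, 0) + v
    return out

# eg: analogy(vec, 'France', 'Paris', 'Italy') should, if the idea works,
# rank 'Rome' near the top of the results
def analogy(vec, a, b, c, top=20):
    target = add(exclude(vec[a], vec[b]), vec[c])
    scores = {k: simm(target, v) for k, v in vec.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]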

Definitely something I would like to test, but first I need working word2sp code to generate the vectors.
