So in the meantime I decided to re-run my Wikipedia code. I tried to use the 300,000-link Wikipedia sw file, but that failed too: it needed too much RAM and took too long to run. I thought I had used it successfully in the past, in which case I don't know why it failed this time!
Here is the first word2vec example (distance to "france"):
Word             Cosine distance
-------------------------------------------
spain            0.678515
belgium          0.665923
netherlands      0.652428
italy            0.633130
switzerland      0.622323
luxembourg       0.610033
portugal         0.577154
russia           0.571507
germany          0.563291
catalonia        0.534176
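As a reminder, the ranking word2vec's distance tool prints is just cosine similarity against every other word's dense vector. Something like this Python sketch (assuming a dict of unit-normalised numpy vectors; the names here are illustrative, not from the word2vec code):

import numpy as np

def nearest(vectors, word, top_n=10):
    # Cosine similarity reduces to a dot product for unit-length vectors.
    target = vectors[word]
    scores = {w: float(np.dot(target, v))
              for w, v in vectors.items() if w != word}
    # Highest cosine similarity first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]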
Here it is using my code:

sa: load 30k--wikipedia-links.sw
sa: find-inverse[links-to]
sa: T |*> #=> table[page,coeff] select[1,200] 100 self-similar[inverse-links-to] |_self>
sa: T |WP: France>
+--------------------------------+--------+
| page                           | coeff  |
+--------------------------------+--------+
| France                         | 100.0  |
| Germany                        | 31.771 |
| United_Kingdom                 | 30.537 |
| Italy                          | 27.452 |
| Spain                          | 23.566 |
| United_States                  | 20.152 |
| Japan                          | 19.556 |
| Netherlands                    | 19.309 |
| Russia                         | 18.877 |
| Canada                         | 18.384 |
| Europe                         | 17.273 |
| India                          | 17.212 |
| China                          | 16.78  |
| Paris                          | 16.595 |
| England                        | 16.286 |
| World_War_II                   | 15.923 |
| Australia                      | 15.238 |
| Soviet_Union                   | 14.867 |
| Belgium                        | 14.189 |
| Poland                         | 14.127 |
| Portugal                       | 13.819 |
| World_War_I                    | 13.757 |
| Austria                        | 13.695 |
| Sweden                         | 13.572 |
| Switzerland                    | 13.51  |
| Egypt                          | 12.647 |
| European_Union                 | 12.4   |
| Brazil                         | 12.338 |
| United_Nations                 | 12.091 |
| Greece                         | 11.906 |
| London                         | 11.906 |
| Israel                         | 11.783 |
| Turkey                         | 11.783 |
| Denmark                        | 11.598 |
| French_language                | 11.536 |
| Norway                         | 11.413 |
| Latin                          | 10.611 |
| Rome                           | 10.364 |
| Mexico                         | 10.364 |
| English_language               | 9.994  |
| South_Africa                   | 9.747  |
...

Which works pretty well, I must say.
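For anyone who doesn't read BKO, the idea behind those commands is simple enough to sketch in Python: represent each page by the set of pages that link to it, then rank pages by how much those sets overlap. Here I'm using intersection size over the larger set, scaled to 100, as a stand-in for my simm operator, and all the names are illustrative:

from collections import defaultdict

def find_inverse(links_to):
    # links_to: page -> iterable of pages it links to.
    # Returns inverse-links-to: page -> set of pages that link to it.
    inverse = defaultdict(set)
    for page, targets in links_to.items():
        for target in targets:
            inverse[target].add(page)
    return inverse

def simm(a, b):
    # Similarity of two sets: shared members over the larger set, in [0, 1].
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))

def self_similar(inverse, page, top_n=200):
    # Rank every page by how similar its inverse-link set is to page's.
    target = inverse[page]
    scores = [(p, 100 * simm(target, links)) for p, links in inverse.items()]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:top_n]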
Here is the next word2vec example (distance to San Francisco):
Word                 Cosine distance
-------------------------------------------
los_angeles          0.666175
golden_gate          0.571522
oakland              0.557521
california           0.554623
san_diego            0.534939
pasadena             0.519115
seattle              0.512098
taiko                0.507570
houston              0.499762
chicago_illinois     0.491598

Here it is using my code:
+---------------------------------------+--------+
| page                                  | coeff  |
+---------------------------------------+--------+
| San_Francisco                         | 100.0  |
| Los_Angeles                           | 16.704 |
| Chicago                               | 15.919 |
| 1924                                  | 15.522 |
| 1916                                  | 14.566 |
| California                            | 14.502 |
| 1915                                  | 14.286 |
| 2014                                  | 14.217 |
| 1933                                  | 14.031 |
| 1913                                  | 14.006 |
| 1918                                  | 14.0   |
| 1930                                  | 14.0   |
| Philadelphia                          | 13.99  |
| 1925                                  | 13.984 |
| 1931                                  | 13.904 |
| 1920                                  | 13.802 |
| 1932                                  | 13.776 |
| 1942                                  | 13.744 |
| 1999                                  | 13.725 |
...

Hrmm... that didn't work so great. The results are swamped by year pages; presumably so many articles link to them that their inverse-link sets overlap with nearly everything.
Here is the next word2vec example:
Enter word or sentence (EXIT to break): /en/geoffrey_hinton

Word                              Cosine distance
--------------------------------------------------
/en/marvin_minsky                 0.457204
/en/paul_corkum                   0.443342
/en/william_richard_peltier       0.432396
/en/brenda_milner                 0.430886
/en/john_charles_polanyi          0.419538
/en/leslie_valiant                0.416399
/en/hava_siegelmann               0.411895
/en/hans_moravec                  0.406726
/en/david_rumelhart               0.405275
/en/godel_prize                   0.405176

And here it is using my code:
+------------------------------------------------------------+--------+
| page                                                       | coeff  |
+------------------------------------------------------------+--------+
| Geoffrey_Hinton                                            | 100    |
| perceptron                                                 | 66.667 |
| Tom_M._Mitchell                                            | 66.667 |
| computational_learning_theory                              | 66.667 |
| Nils_Nilsson_(researcher)                                  | 66.667 |
| beam_search                                                | 66.667 |
| Raj_Reddy                                                  | 50     |
| AI_effect                                                  | 40     |
| ant_colony_optimization                                    | 40     |
| List_of_artificial_intelligence_projects                   | 33.333 |
| AI-complete                                                | 33.333 |
| Cyc                                                        | 33.333 |
| Hugo_de_Garis                                              | 33.333 |
| Joyce_K._Reynolds                                          | 33.333 |
| Kleene_closure                                             | 33.333 |
| Mondegreen                                                 | 33.333 |
| Supervised_learning                                        | 33.333 |
...

And now a couple more examples:
sa: T |WP: Linux>
+------------------------------------------------+--------+
| page                                           | coeff  |
+------------------------------------------------+--------+
| Linux                                          | 100.0  |
| Microsoft_Windows                              | 46.629 |
| operating_system                               | 37.333 |
| Unix                                           | 28.956 |
| Mac_OS_X                                       | 26.936 |
| C_(programming_language)                       | 24.242 |
| Microsoft                                      | 22.535 |
| GNU_General_Public_License                     | 22.222 |
| Mac_OS                                         | 19.529 |
| Unix-like                                      | 19.192 |
| IBM                                            | 19.048 |
| open_source                                    | 17.845 |
| FreeBSD                                        | 17.845 |
| Apple_Inc.                                     | 16.498 |
| Java_(programming_language)                    | 15.825 |
| OS_X                                           | 15.488 |
| free_software                                  | 15.488 |
| Sun_Microsystems                               | 15.152 |
| C++                                            | 15.152 |
| source_code                                    | 15.152 |
| Macintosh                                      | 14.815 |
| MS-DOS                                         | 13.468 |
| Solaris_(operating_system)                     | 13.468 |
| PowerPC                                        | 13.131 |
| DOS                                            | 13.131 |
| Android_(operating_system)                     | 13.131 |
| Windows_NT                                     | 12.795 |
| Intel                                          | 12.458 |
| programming_language                           | 12.121 |
| personal_computer                              | 12.121 |
| OpenBSD                                        | 11.785 |
| Unicode                                        | 11.111 |
| graphical_user_interface                       | 10.774 |
| video_game                                     | 10.774 |
| Cross-platform                                 | 10.774 |
| Internet                                       | 10.574 |
| OS/2                                           | 10.438 |
...

sa: T |WP: Ronald_Reagan>
+---------------------------------------------------------+--------+
| page                                                    | coeff  |
+---------------------------------------------------------+--------+
| Ronald_Reagan                                           | 100.0  |
| John_F._Kennedy                                         | 22.951 |
| Bill_Clinton                                            | 22.404 |
| Barack_Obama                                            | 22.283 |
| George_H._W._Bush                                       | 22.131 |
| Jimmy_Carter                                            | 22.131 |
| Richard_Nixon                                           | 22.131 |
| George_W._Bush                                          | 22.131 |
| Republican_Party_(United_States)                        | 21.785 |
| Democratic_Party_(United_States)                        | 20.779 |
| United_States_Senate                                    | 19.444 |
| President_of_the_United_States                          | 17.538 |
| White_House                                             | 15.574 |
| Franklin_D._Roosevelt                                   | 15.301 |
| Vietnam_War                                             | 15.242 |
| United_States_House_of_Representatives                  | 14.754 |
| United_States_Congress                                  | 14.085 |
| Supreme_Court_of_the_United_States                      | 13.388 |
| Lyndon_B._Johnson                                       | 13.388 |
| Margaret_Thatcher                                       | 13.115 |
| Cold_War                                                | 13.093 |
| Dwight_D._Eisenhower                                    | 12.568 |
| Nobel_Peace_Prize                                       | 12.368 |
| The_Washington_Post                                     | 12.295 |
| Gerald_Ford                                             | 12.022 |
...

sa: T |WP: Los_Angeles>
+--------------------------------------------------+--------+
| page                                             | coeff  |
+--------------------------------------------------+--------+
| Los_Angeles                                      | 100.0  |
| Chicago                                          | 20.852 |
| California                                       | 18.789 |
| Los_Angeles_Times                                | 17.833 |
| San_Francisco                                    | 16.704 |
| New_York_City                                    | 15.536 |
| Philadelphia                                     | 14.221 |
| NBC                                              | 12.641 |
| Washington,_D.C.                                 | 11.484 |
| Boston                                           | 11.061 |
| USA_Today                                        | 10.609 |
| Texas                                            | 10.384 |
| Academy_Award                                    | 10.158 |
| Seattle                                          | 9.932  |
| New_York                                         | 9.88   |
| Time_(magazine)                                  | 9.851  |
| Mexico_City                                      | 9.707  |
| The_New_York_Times                               | 9.685  |
| Rolling_Stone                                    | 9.481  |
| CBS                                              | 9.481  |
| Toronto                                          | 9.481  |
...

Now for a final couple of comments. First, my code is painfully slow. But to be fair, I'm a terrible programmer, and this is research code, not production code. The point I'm trying to make is that a real programmer could probably make the speed acceptable, and hence usable.
For the second point, let me quote from word2vec:
"It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen')"
I haven't tested it, but I'm pretty sure my inverse-links-to sparse vectors do not have this fun property. That is why I want to create my own word2sp code, though it would take a slightly different form in BKO. Something along the lines of:
vector |some object> => exclude(vector|France>,vector|Paris>) + vector|Italy>
table[object,coeff] select[1,20] 100 self-similar[vector] |some object>

should, if it works, return |Rome> as one of the top similarity matches.
Likewise:
vector |some object> => exclude(vector|man>,vector|king>) + vector|woman>
table[object,coeff] select[1,20] 100 self-similar[vector] |some object>

should return |queen> as a top hit.
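The pieces needed to test those two examples are easy enough to mock up for sparse vectors stored as Python dicts. A rough sketch, assuming exclude simply drops from the second superposition every coordinate that appears in the first (all names illustrative):

def exclude(a, b):
    # Assumed semantics: remove from sparse vector b every coordinate in a.
    return {k: v for k, v in b.items() if k not in a}

def add(a, b):
    # Coordinate-wise sum of two sparse vectors.
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return out

# The Paris - France + Italy example, with vec a dict of sparse word vectors:
# query = add(exclude(vec['France'], vec['Paris']), vec['Italy'])
# ...then rank everything against query, as in the earlier similarity sketch.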
Definitely something I would like to test, but I need working code first.