Wednesday 10 June 2015

wikipedia fragment to frequency list

We had a lot of success mapping wikipedia to its link structure and using that to find semantic similarities. That used "similar[op]". This time, we map wikipedia (well, a small piece of it) to word frequency lists, and see how well "find-topic[op]" works.

Here is my code to convert wikipedia xml to single word frequencies.
$ ./play_with_wikipedia_freq_list.py data/fragments/0.xml
10,604 minutes later, we have this.
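
For a rough idea of what such a conversion involves, here is a minimal sketch. It is not the script above: it assumes a MediaWiki-style export where each <page> holds a <title> and a <text> element, it does not strip wiki markup, and the function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip any XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit('}', 1)[-1]


def page_word_frequencies(xml_text):
    """Map each page title to a Counter of the lower-cased words in its text."""
    root = ET.fromstring(xml_text)
    result = {}
    for page in root.iter():
        if local_name(page.tag) != 'page':
            continue
        title, text = None, ''
        for node in page.iter():
            name = local_name(node.tag)
            if name == 'title':
                title = node.text
            elif name == 'text':
                text = node.text or ''
        if title is None:
            continue
        # Assumption: a "word" is any run of ASCII letters or digits.
        words = re.findall(r"[a-z0-9]+", text.lower())
        result[title.replace(' ', '_')] = Counter(words)
    return result


# Tiny hand-written sample, standing in for a real fragment file.
sample = """<mediawiki>
  <page>
    <title>Murray River</title>
    <revision><text>The Murray River is a river in Australia.</text></revision>
  </page>
</mediawiki>"""

freq = page_word_frequencies(sample)
print(freq['Murray_River'].most_common(3))
```

On a real fragment you would read the file contents and pass them in instead of the sample string.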

Now, some examples:
sa: T |*> #=> table[wikipage,coeff] select[1,300] 100 intn-find-topic[words-1] |_self>
sa: T |river torrens>
+-----------------------------+----------+
| wikipage                    | coeff    |
+-----------------------------+----------+
| Murray_River                | 2210.857 |
| The_Bronx                   | 1450.875 |
| South_Australia             | 1243.607 |
| Adelaide                    | 1130.552 |
| Prince_Edward_Island        | 746.164  |
| Gypsum                      | 710.633  |
| Port_Adelaide_Football_Club | 678.331  |
| June_14                     | 552.714  |
| Trade                       | 552.714  |
| October_25                  | 497.443  |
| Dinosaur                    | 226.11   |
+-----------------------------+----------+
  Time taken: 27 minutes, 19 seconds, 709 milliseconds

sa: T |adelaide university>
+-------------------------------------------+--------+
| wikipage                                  | coeff  |
+-------------------------------------------+--------+
| Macquarie_University                      | 90.953 |
| Immanuel_Kant                             | 74.416 |
| Robert_Menzies                            | 71.625 |
| David_Hume                                | 71.625 |
| Theology                                  | 68.214 |
| Adelaide                                  | 65.114 |
| Austin,_Texas                             | 65.114 |
| Yoga                                      | 65.114 |
| Gregor_Mendel                             | 63.951 |
| Mike_Moore_(New_Zealand_politician)       | 63.951 |
| New_South_Wales                           | 61.393 |
| Perth                                     | 61.393 |
| Aristophanes                              | 61.393 |
| Bob_Hawke                                 | 61.393 |
| Culture_of_Canada                         | 61.393 |
| John_Milton                               | 61.393 |
| West_Bengal                               | 61.393 |
| Brewing                                   | 61.393 |
| Fyodor_Dostoyevsky                        | 61.393 |
| Hunter_College                            | 61.393 |
| John_Stuart_Mill                          | 61.393 |
...
  Time taken: 27 minutes, 8 seconds, 965 milliseconds

sa: T |apple juice>
+---------------------------------------+---------+
| wikipage                              | coeff   |
+---------------------------------------+---------+
| Vinegar                               | 402.189 |
| McIntosh_(apple)                      | 367.835 |
| Fruit                                 | 361.97  |
| Cuisine_of_the_United_States          | 329.064 |
| Drink                                 | 321.751 |
| Vietnamese_cuisine                    | 321.751 |
| List_of_cocktails                     | 321.751 |
| Hungarian_language                    | 294.268 |
| Arsenic                               | 289.576 |
| Chardonnay                            | 289.576 |
| Pear                                  | 271.478 |
| Swedish_cuisine                       | 271.478 |
| Cuisine_of_the_Southern_United_States | 271.478 |
| Food_preservation                     | 241.314 |
| Turkish_cuisine                       | 241.314 |
| Mead                                  | 241.314 |
| French_cuisine                        | 217.182 |
| Mojito                                | 206.84  |
...
  Time taken: 27 minutes, 25 seconds, 378 milliseconds

sa: T |russia china japan australia new zealand egypt>
+-----------------------------------------------------------+--------+
| wikipage                                                  | coeff  |
+-----------------------------------------------------------+--------+
| Tram                                                      | 77.349 |
| List_of_national_capitals_and_largest_cities_by_country   | 75.967 |
| General_Motors                                            | 74.448 |
| 2000s_(decade)                                            | 70.903 |
| History_of_painting                                       | 70.903 |
| 2010s                                                     | 67.68  |
| British_Empire                                            | 67.68  |
| Foreign_relations_of_China                                | 67.68  |
| Self-determination                                        | 67.68  |
| Foreign_relations_of_Taiwan                               | 67.68  |
| Toyota                                                    | 67.68  |
| Dwight_D._Eisenhower                                      | 65.991 |
| Psychology                                                | 65.991 |
| 2008                                                      | 63.813 |
| List_of_former_sovereign_states                           | 63.813 |
| Foreign_relations_of_Indonesia                            | 63.813 |
| Foreign_relations_of_Japan                                | 63.813 |
| Foreign_relations_of_North_Korea                          | 63.813 |
| Peninsula                                                 | 63.813 |
| Pandemic                                                  | 63.813 |
| United_Nations_Security_Council                           | 63.813 |
| 1996                                                      | 63.813 |
| List_of_mountains                                         | 63.813 |
...
  Time taken: 1 hour, 39 minutes, 16 seconds, 194 milliseconds
Anyway, largely rubbish results! That doesn't mean find-topic[op] is completely useless. For example, it seems to work well at finding name type (male, female, last). It just doesn't work that well on wikipedia.
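
To make the idea of scoring pages against per-page frequency lists concrete, here is a toy sketch. It is an assumption, not the actual find-topic[op] code: it scores each query word with a normalised version of the standard "frequency class" (floor(0.5 - log2(f / f_max))), then takes the minimum over the query words, so a page only scores if it contains every word. The function names and the counts are made up.

```python
import math


def normed_frequency_class(word, freq):
    """Score in (0, 1]: 1 for the page's most frequent word, 0 if absent."""
    positive = {w: c for w, c in freq.items() if c > 0}
    if word not in positive:
        return 0.0
    largest = max(positive.values())
    smallest = min(positive.values())
    # Frequency class of the word: 0 for the most frequent word, larger for rarer ones.
    fc = math.floor(0.5 - math.log2(positive[word] / largest))
    # One more than the class of the rarest word, so present words never score 0.
    fc_max = math.floor(0.5 - math.log2(smallest / largest)) + 1
    return 1.0 - fc / fc_max


def topic_scores(query, pages):
    """Rank pages by the weakest per-word score (an intersection-style match)."""
    words = query.lower().split()
    if not words:
        return []
    scores = {}
    for title, freq in pages.items():
        score = min(normed_frequency_class(w, freq) for w in words)
        if score > 0:
            scores[title] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up counts for two hypothetical pages.
pages = {
    'Page_A': {'the': 16, 'river': 8, 'torrens': 4},
    'Page_B': {'the': 16, 'bone': 8, 'river': 1},
}

ranked = topic_scores('river torrens', pages)
print(ranked)
```

A score like this only looks at how common each word is within a page, so long pages that happen to mention every query word can outrank the page you actually wanted, which may be part of why the tables above look so noisy.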
