Here is my code to convert Wikipedia XML to single-word frequencies.
$ ./play_with_wikipedia_freq_list.py data/fragments/0.xml

10,604 minutes later, we have this.
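For anyone who just wants the gist of that conversion step, here is a minimal sketch. It is not the script above. It assumes the XML fragment has un-namespaced <page> elements, each with a <title> and a <revision><text> child, and it assumes a "word" is simply a run of lowercase ASCII letters. The function name page_word_frequencies is made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def page_word_frequencies(xml_text):
    """Map each page title to a Counter of single-word frequencies.

    Assumptions (not taken from the original script): pages are
    un-namespaced <page> elements with <title> and <revision><text>
    children, and a word is a run of lowercase ASCII letters.
    """
    root = ET.fromstring(xml_text)
    frequencies = {}
    for page in root.iter("page"):
        title = page.findtext("title", default="")
        text = page.findtext("revision/text", default="") or ""
        # Lowercase first so "River" and "river" count as the same word.
        words = re.findall(r"[a-z]+", text.lower())
        # Use underscores in titles, matching the wikipage names in the tables.
        frequencies[title.replace(" ", "_")] = Counter(words)
    return frequencies
```

Calling page_word_frequencies on the contents of a fragment file would give one Counter per page, which is roughly the shape of data a per-page word lookup needs. Real Wikipedia dumps declare an XML namespace, so the tag names would have to be adjusted for those.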
Now, some examples:
sa: T |*> #=> table[wikipage,coeff] select[1,300] 100 intn-find-topic[words-1] |_self>

sa: T |river torrens>
+-----------------------------+----------+
| wikipage                    | coeff    |
+-----------------------------+----------+
| Murray_River                | 2210.857 |
| The_Bronx                   | 1450.875 |
| South_Australia             | 1243.607 |
| Adelaide                    | 1130.552 |
| Prince_Edward_Island        | 746.164  |
| Gypsum                      | 710.633  |
| Port_Adelaide_Football_Club | 678.331  |
| June_14                     | 552.714  |
| Trade                       | 552.714  |
| October_25                  | 497.443  |
| Dinosaur                    | 226.11   |
+-----------------------------+----------+
Time taken: 27 minutes, 19 seconds, 709 milliseconds

sa: T |adelaide university>
+-------------------------------------------+--------+
| wikipage                                  | coeff  |
+-------------------------------------------+--------+
| Macquarie_University                      | 90.953 |
| Immanuel_Kant                             | 74.416 |
| Robert_Menzies                            | 71.625 |
| David_Hume                                | 71.625 |
| Theology                                  | 68.214 |
| Adelaide                                  | 65.114 |
| Austin,_Texas                             | 65.114 |
| Yoga                                      | 65.114 |
| Gregor_Mendel                             | 63.951 |
| Mike_Moore_(New_Zealand_politician)       | 63.951 |
| New_South_Wales                           | 61.393 |
| Perth                                     | 61.393 |
| Aristophanes                              | 61.393 |
| Bob_Hawke                                 | 61.393 |
| Culture_of_Canada                         | 61.393 |
| John_Milton                               | 61.393 |
| West_Bengal                               | 61.393 |
| Brewing                                   | 61.393 |
| Fyodor_Dostoyevsky                        | 61.393 |
| Hunter_College                            | 61.393 |
| John_Stuart_Mill                          | 61.393 |
...
Time taken: 27 minutes, 8 seconds, 965 milliseconds

sa: T |apple juice>
+---------------------------------------+---------+
| wikipage                              | coeff   |
+---------------------------------------+---------+
| Vinegar                               | 402.189 |
| McIntosh_(apple)                      | 367.835 |
| Fruit                                 | 361.97  |
| Cuisine_of_the_United_States          | 329.064 |
| Drink                                 | 321.751 |
| Vietnamese_cuisine                    | 321.751 |
| List_of_cocktails                     | 321.751 |
| Hungarian_language                    | 294.268 |
| Arsenic                               | 289.576 |
| Chardonnay                            | 289.576 |
| Pear                                  | 271.478 |
| Swedish_cuisine                       | 271.478 |
| Cuisine_of_the_Southern_United_States | 271.478 |
| Food_preservation                     | 241.314 |
| Turkish_cuisine                       | 241.314 |
| Mead                                  | 241.314 |
| French_cuisine                        | 217.182 |
| Mojito                                | 206.84  |
...
Time taken: 27 minutes, 25 seconds, 378 milliseconds

T |russia china japan australia new zealand egypt>
+-----------------------------------------------------------+--------+
| wikipage                                                  | coeff  |
+-----------------------------------------------------------+--------+
| Tram                                                      | 77.349 |
| List_of_national_capitals_and_largest_cities_by_country   | 75.967 |
| General_Motors                                            | 74.448 |
| 2000s_(decade)                                            | 70.903 |
| History_of_painting                                       | 70.903 |
| 2010s                                                     | 67.68  |
| British_Empire                                            | 67.68  |
| Foreign_relations_of_China                                | 67.68  |
| Self-determination                                        | 67.68  |
| Foreign_relations_of_Taiwan                               | 67.68  |
| Toyota                                                    | 67.68  |
| Dwight_D._Eisenhower                                      | 65.991 |
| Psychology                                                | 65.991 |
| 2008                                                      | 63.813 |
| List_of_former_sovereign_states                           | 63.813 |
| Foreign_relations_of_Indonesia                            | 63.813 |
| Foreign_relations_of_Japan                                | 63.813 |
| Foreign_relations_of_North_Korea                          | 63.813 |
| Peninsula                                                 | 63.813 |
| Pandemic                                                  | 63.813 |
| United_Nations_Security_Council                           | 63.813 |
| 1996                                                      | 63.813 |
| List_of_mountains                                         | 63.813 |
...
Time taken: 1 hour, 39 minutes, 16 seconds, 194 milliseconds

Anyway, largely rubbish results! That doesn't mean find-topic[op] is completely useless. E.g., it seems to work well for finding name type (male, female, last), but it just doesn't work that well on Wikipedia.