Now, the problem is that wikipedia is somewhat large: a 50 GB xml file (for English wikipedia), and just over 15 million pages. Prior to this I gave optimization little thought (I was more interested in getting things working than in run speed), but with wikipedia we have to at least start to think about more efficient ways to do things!
So, this was my first attempt at processing wikipedia. But it was memory hungry!! I threw the largest-memory EC2 instance type, with 244 GB of RAM, at it, and that still wasn't enough (yeah, that's what you get when you don't think about efficiency!). I thought this was the problematic section of code:
# now learn it all:
for x in r:
  link, anchor = ket_process_anchor_text(x)
  C.add_learn("links-to",page_name_ket,link)
  C.add_learn("inverse-links-to",link,page_name_ket)
  C.add_learn("contains-anchor",page_name_ket,anchor)
  C.add_learn("inverse-contains-anchor",anchor,page_name_ket)
  C.add_learn("links-to",anchor,link)
  C.add_learn("inverse-links-to",link,anchor)

So I tweaked it to this. The learn-it-all section now writes to disk as it goes, instead of loading it all into memory (and we cut down what we are trying to learn to just "links-to"):
result = fast_superposition()
for x in r:
  result += ket_process_link(x)
dest.write("links-to " + str(page_name_ket) + " => " + str(result.superposition().coeff_sort()) + "\n\n")

But that still chomped through all the RAM I could throw at it. I eventually figured out the real problem was trying to load the entire 50 GB of XML into memory at once, ie, this at the top of the code:
document = open(sys.argv[1],'rb')
soup = BeautifulSoup(document)

So the fix was to write some code to split that 50 GB into more manageable pieces. Hence my fragment-wikipedia-xml code.
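The idea behind the fragmenting code can be sketched roughly like this: stream the dump one line at a time, count closing page tags, and roll over to a new fragment file every N pages. This is my reconstruction of the approach, not the actual fragment-wikipedia-xml code; the function and directory names are made up, and for simplicity it doesn't re-wrap each fragment in the enclosing mediawiki tags.

```python
import os

def fragment_wiki_xml(path, pages_per_fragment=30000, out_dir="data/fragments"):
    """Stream the dump line by line, counting </page> tags, and start a
    new fragment file every pages_per_fragment pages. Only one line is
    ever held in memory, so a 50 GB file is no problem."""
    os.makedirs(out_dir, exist_ok=True)
    frag, pages = 0, 0
    out = open(os.path.join(out_dir, "0.xml"), "w")
    with open(path) as f:
        for line in f:
            out.write(line)
            if "</page>" in line:
                pages += 1
                if pages % pages_per_fragment == 0:
                    out.close()
                    frag += 1
                    out = open(os.path.join(out_dir, f"{frag}.xml"), "w")
    out.close()
```

The key point is that memory use is bounded by one line of the file, no matter how big the dump is.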
Now, put it to use:
$ ./fragment_wikipedia_xml.py
Usage: ./fragment_wikipedia_xml.py wiki.xml [pages-per-fragment]

$ ./fragment_wikipedia_xml.py wiki.xml 30000

And after a couple of days of processing we have 517 xml files that are small enough to load into memory. Small enough that I no longer need EC2, and small enough that BeautifulSoup takes less than 2 GB of RAM to load one. Cool. So, we still can't process all of wikipedia, not yet, but we have a decent subset to play with. Now, put it to use:
$ du -h data/fragments/0.xml
676M    data/fragments/0.xml

$ grep -c "<page>" data/fragments/0.xml
30000

$ ./play_with_wikipedia__fast_write.py data/fragments/0.xml

And we now have this (the full hyperlink structure of that fragment of wikipedia, in sw format): 30k--wikipedia-links.sw
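A minimal sketch of the write-as-you-go pattern that produces the links-to rules in this sw file, with Python's Counter standing in for fast_superposition and most_common playing the role of coeff_sort. The helper name and the "WP:" ket prefix here are just illustrative assumptions:

```python
from collections import Counter

def write_links_to(dest, page_name, links):
    """Write one "links-to" learn rule straight to disk, so memory use
    stays bounded by a single page rather than the whole dump.
    Repeated links become coefficients, sorted by descending count."""
    sp = Counter(links)
    terms = " + ".join(
        f"{n}|WP: {link}>" if n > 1 else f"|WP: {link}>"
        for link, n in sp.most_common()
    )
    dest.write(f"links-to |WP: {page_name}> => {terms}\n\n")
```

For example, a page "Australia" linking twice to "Adelaide" and once to "Sydney" would write out: links-to |WP: Australia> => 2|WP: Adelaide> + |WP: Sydney>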
$ ./spit-out-sw-stats.sh 30k--wikipedia-links.sw
(92M, 1 op types and 30000 learn rules)
links-to: 30000 learn rules

I'll put this sw file to use in the next post.
Now, it is useful to have the inverse-links-to as well, so:
sa: find-inverse[links-to]
sa: save 30k--wikipedia-links--with-inverse.sw

Which only takes a couple of minutes.
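Conceptually, find-inverse is a single pass over all the links-to rules, flipping each (page, target) pair. A minimal sketch with plain dicts standing in for superpositions (the function name and data shape are my assumptions, not the actual implementation):

```python
from collections import defaultdict

def find_inverse(links_to):
    """Given {page: [targets]}, build {target: [pages]} -- the
    inverse-links-to structure -- in one pass over all rules."""
    inverse = defaultdict(list)
    for page, targets in links_to.items():
        for t in targets:
            inverse[t].append(page)
    return dict(inverse)
```

One pass over 30k rules is cheap, which is why the real thing only takes a couple of minutes.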
More to come!
Update: some stats:
$ ./spit-out-sw-stats.sh 30k--wikipedia-links--with-inverse-simm.sw
(317M, 5 op types and 1326090 learn rules)
how-many-in-links: 1 learn rules
inverse-links-to: 1296087 learn rules
inverse-simm: 14 learn rules
inverse-simm-op: 1 learn rules
links-to: 29984 learn rules

Now, why are there fewer than 30k links-to learn rules in this second example? Simple enough. The fast_write even wrote out rules of the form:
op |ket> => |>
but in the process of loading them into memory (and then processing them), rules with an empty |> right-hand side are ignored rather than learned.
So, there must be 30000 - 29984 = 16 empty-sp learn rules. Let's see what grep has to say:
$ grep " => |>" 30k--wikipedia-links.sw links-to |WP: Cogency> => |> links-to |WP: Wikipedia:Complete_list_of_language_wikis_available> => |> links-to |WP: Floccinaucinihilipilification> => |> links-to |WP: Wikipedia:GNE_Project_Files> => |> links-to |WP: Wikipedia:GNE_Project_Files/Proposed_GNU_Moderation_System> => |> links-to |WP: Wikipedia:GNE_Project_Files/GNE_Project_Design> => |> links-to |WP: Wikipedia:The_future_of_Wikipedia> => |> links-to |WP: Portal:Contents/Outlines> => |> links-to |WP: Wikipedia:WikiBiblion> => |> links-to |WP: Wikipedia:PHP_script_tech_talk> => |> links-to |WP: Normoxic> => |> links-to |WP: Conducted_interference> => |> links-to |WP: Local_battery> => |> links-to |WP: Wikipedia:Blocked_IPs> => |> $ grep -c " => |>" 30k--wikipedia-links.sw 14So, still 2 missing :). Not sure why. Don't really care.
Update: Let's show the full steps from wikipedia, to wikipedia link structure:
wiki.xml from here: http://en.wikipedia.org/wiki/Wikipedia:Database_download

$ bzip2 -dc enwiki-20150515-pages-articles.xml.bz2 > wiki.xml
$ ./fragment_wikipedia_xml.py wiki.xml 30000
$ for file in $(ls -1 data/fragments/*); do
    echo "file: $file"
    ./play_with_wikipedia__fast_write.py "$file"
  done
$ cd sw-results/
$ for i in $(seq 0 9); do
    echo "$i"
    cat $i--30k--wikipedia-links.sw >> 300k--wikipedia-links.sw
  done
$ bzip2 -v *.sw

And the resulting sw files are compressed and stored here.
$ ./spit-out-sw-stats.sh 300k--wikipedia-links.sw
(494M, 1 op types and 300000 learn rules)
links-to: 300000 learn rules

$ ./spit-out-sw-stats.sh 300k--wikipedia-links--with-inverse.sw
(1.5G, 2 op types and 4571683 learn rules)
inverse-links-to: 4272183 learn rules
links-to: 299499 learn rules