Friday, 5 June 2015

towards processing all of wikipedia

I now have aspirations to process all of wikipedia, and see what we can get out of it. It is certainly the best curated collection of knowledge on the planet (and structurally much more homogeneous than the web at large), and being fully hyperlinked makes it great for my project. We can think of wikipedia as one giant network, and my project is all about networks.

Now, the problem is that wikipedia is somewhat large: a 50 GB XML file (for English wikipedia), and just over 15 million pages. Prior to this I gave optimization little thought (I was more interested in getting things working than in run speed), but with wikipedia we have to at least start thinking about more efficient ways to do things!

So, this was my first attempt to process wikipedia. But it was memory hungry!! I threw the largest-memory EC2 instance type, with 244 GB of RAM, at it and it still wasn't enough (yeah, that's what you get when you don't think about efficiency!). I thought this was the problematic section of code:
    # now learn it all:
    for x in r:
      link, anchor = ket_process_anchor_text(x)
      C.add_learn("links-to",page_name_ket,link)
      C.add_learn("inverse-links-to",link,page_name_ket)

      C.add_learn("contains-anchor",page_name_ket,anchor)
      C.add_learn("inverse-contains-anchor",anchor,page_name_ket)

      C.add_learn("links-to",anchor,link)
      C.add_learn("inverse-links-to",link,anchor)
So I tweaked it to this. The "learn it all" section now writes its results to disk as it goes, instead of holding everything in memory (and we cut down on what we are trying to learn to just "links-to"):
    result = fast_superposition()
    for x in r:
      result += ket_process_link(x)
    dest.write("links-to " + str(page_name_ket) + " => " + str(result.superposition().coeff_sort()) + "\n\n")
But that still chomped all the RAM I could throw at it. I eventually figured out that the real problem was trying to load the entire 50 GB XML file into memory at once, i.e., this at the top of the code:
document = open(sys.argv[1],'rb')
soup = BeautifulSoup(document)
So the fix was to write some code to split that 50 GB into more manageable pieces. Hence my fragment-wikipedia-xml code.
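To give a feel for the idea, here is a minimal sketch of streaming a dump line by line and cutting it into fixed-size groups of pages. This is not the actual fragment-wikipedia-xml code; the function names are made up for illustration, and it assumes each `<page>` and `</page>` tag sits on its own line in the dump. Anything outside a page (the site header, for example) is simply dropped.

```python
import io
import os

def fragment_pages(lines, pages_per_fragment):
    """Yield lists of lines, each holding at most pages_per_fragment pages.

    Assumption: <page> and </page> each appear alone on a line.
    Only one fragment is held in memory at a time.
    """
    fragment, in_page, count = [], False, 0
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            in_page = True
        if in_page:
            fragment.append(line)
        if stripped == "</page>":
            in_page = False
            count += 1
            if count == pages_per_fragment:
                yield fragment
                fragment, count = [], 0
    if fragment:
        yield fragment  # the last, possibly smaller, fragment

def write_fragments(xml_path, out_dir, pages_per_fragment):
    """Write 0.xml, 1.xml, ... into out_dir (hypothetical layout)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(xml_path, encoding="utf-8") as src:
        for k, fragment in enumerate(fragment_pages(src, pages_per_fragment)):
            path = os.path.join(out_dir, f"{k}.xml")
            with open(path, "w", encoding="utf-8") as dest:
                # Wrap in a root element so each piece parses on its own.
                dest.write("<mediawiki>\n")
                dest.writelines(fragment)
                dest.write("</mediawiki>\n")

# Tiny in-memory demo: three pages, two pages per fragment.
sample = io.StringIO(
    "<mediawiki>\n"
    "  <page>\n    <title>A</title>\n  </page>\n"
    "  <page>\n    <title>B</title>\n  </page>\n"
    "  <page>\n    <title>C</title>\n  </page>\n"
    "</mediawiki>\n"
)
fragments = list(fragment_pages(sample, 2))
print(len(fragments))
```

The point of the generator is that memory use is bounded by the fragment size rather than by the size of the whole dump.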
Now, put it to use:
$ ./fragment_wikipedia_xml.py

Usage: ./fragment_wikipedia_xml.py wiki.xml [pages-per-fragment]

$ ./fragment_wikipedia_xml.py wiki.xml 30000
And after a couple of days of processing we have 517 XML files that are small enough to load into memory. Small enough that I no longer need EC2, and small enough that BeautifulSoup needs less than 2 GB of RAM to load one. Cool. So, we still can't process all of wikipedia, not yet, but we have a decent subset to play with. Let's try the first fragment:
$ du -h data/fragments/0.xml
676M    data/fragments/0.xml

$ grep -c "<page>" data/fragments/0.xml
30000

$ ./play_with_wikipedia__fast_write.py data/fragments/0.xml
And we now have this (the full hyperlink structure of that fragment of wikipedia, in sw format): 30k--wikipedia-links.sw
$ ./spit-out-sw-stats.sh 30k--wikipedia-links.sw
(92M, 1 op types and 30000 learn rules)
links-to: 30000 learn rules
I'll put this sw file to use in the next post.

Now, it is useful to have the inverse-links-to as well, so:
sa: find-inverse[links-to]
sa: save 30k--wikipedia-links--with-inverse.sw
Which only takes a couple of minutes.
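For readers who don't have the console handy, the idea behind the inverse is just flipping every link around. A minimal sketch in plain Python, assuming the link structure is a dict from page name to a list of link targets (this ignores coefficients and is not how find-inverse is actually implemented; `find_inverse` here is a made-up helper):

```python
from collections import defaultdict

def find_inverse(links_to):
    """Map each link target back to the list of pages that link to it."""
    inverse = defaultdict(list)
    for page, targets in links_to.items():
        for target in targets:
            inverse[target].append(page)
    return dict(inverse)

# Toy link structure; "WP: D" has no outgoing links.
links_to = {
    "WP: A": ["WP: B", "WP: C"],
    "WP: B": ["WP: C"],
    "WP: D": [],
}
inverse_links_to = find_inverse(links_to)
print(inverse_links_to)
```

Every distinct link target gets its own rule, so the inverse can end up with many more rules than the forward operator.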

More to come!

Update: some stats:
$ ./spit-out-sw-stats.sh 30k--wikipedia-links--with-inverse-simm.sw
(317M, 5 op types and 1326090 learn rules)
how-many-in-links: 1 learn rules
inverse-links-to: 1296087 learn rules
inverse-simm: 14 learn rules
inverse-simm-op: 1 learn rules
links-to: 29984 learn rules
Now, why are there fewer than 30k links-to learn rules in this second example? Simple enough. The fast_write code even wrote out rules with an empty right-hand side:
op |ket> => |>
but when the file is loaded into memory (and then processed), the |> rules are ignored rather than learned.
So, there must be 30000 - 29984 = 16 empty-superposition learn rules. Let's see what grep has to say:
$ grep " => |>" 30k--wikipedia-links.sw
links-to |WP: Cogency> => |>
links-to |WP: Wikipedia:Complete_list_of_language_wikis_available> => |>
links-to |WP: Floccinaucinihilipilification> => |>
links-to |WP: Wikipedia:GNE_Project_Files> => |>
links-to |WP: Wikipedia:GNE_Project_Files/Proposed_GNU_Moderation_System> => |>
links-to |WP: Wikipedia:GNE_Project_Files/GNE_Project_Design> => |>
links-to |WP: Wikipedia:The_future_of_Wikipedia> => |>
links-to |WP: Portal:Contents/Outlines> => |>
links-to |WP: Wikipedia:WikiBiblion> => |>
links-to |WP: Wikipedia:PHP_script_tech_talk> => |>
links-to |WP: Normoxic> => |>
links-to |WP: Conducted_interference> => |>
links-to |WP: Local_battery> => |>
links-to |WP: Wikipedia:Blocked_IPs> => |>

$ grep -c " => |>" 30k--wikipedia-links.sw
14
So, still 2 missing :). Not sure why. Don't really care.
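If someone did want to chase this down, one way would be to count the rules directly rather than via grep. The sketch below is not something I ran on the real file; `rule_stats` is a made-up helper, and it assumes each rule sits on one line in the form "op |ket> => superposition". It reports total rules, distinct left-hand kets, kets with an empty right-hand side, and any ket that appears more than once. Whether repeated kets have anything to do with the gap here is untested.

```python
from collections import Counter

def rule_stats(lines, op="links-to"):
    """Count learn rules for one operator in sw-style text lines."""
    prefix = op + " "
    seen = Counter()   # number of rules per left-hand ket
    empty = set()      # kets whose right-hand side is the empty ket |>
    for line in lines:
        line = line.strip()
        if not line.startswith(prefix) or " => " not in line:
            continue
        ket, rhs = line[len(prefix):].split(" => ", 1)
        seen[ket] += 1
        if rhs == "|>":
            empty.add(ket)
    return {
        "rules": sum(seen.values()),
        "distinct_kets": len(seen),
        "empty_kets": len(empty),
        "repeated_kets": sorted(k for k, n in seen.items() if n > 1),
    }

# Toy input: three rules, one empty, one ket defined twice.
sample = [
    "links-to |WP: A> => |WP: B> + |WP: C>",
    "",
    "links-to |WP: B> => |>",
    "",
    "links-to |WP: A> => |WP: D>",
]
stats = rule_stats(sample)
print(stats)
```

On a real file this would be called as `rule_stats(open("30k--wikipedia-links.sw"))`, reading one line at a time.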

Update: Let's show the full steps, from the wikipedia dump to the wikipedia link structure:
wiki.xml from here: http://en.wikipedia.org/wiki/Wikipedia:Database_download
$ bzip2 -dc enwiki-20150515-pages-articles.xml.bz2 > wiki.xml
$ ./fragment_wikipedia_xml.py wiki.xml 30000

# loop over the fragments with a glob rather than parsing ls output
for file in data/fragments/*; do
  echo "file: $file"
  ./play_with_wikipedia__fast_write.py "$file"
done

$ cd sw-results/

# join the first 10 fragments' results into one 300k-page file
for i in $(seq 0 9); do
  echo "$i"
  cat "${i}--30k--wikipedia-links.sw" >> 300k--wikipedia-links.sw
done

$ bzip2 -v *.sw
And the resulting sw files are compressed and stored here.

BTW:
$ ./spit-out-sw-stats.sh 300k--wikipedia-links.sw
(494M, 1 op types and 300000 learn rules)
links-to: 300000 learn rules

$ ./spit-out-sw-stats.sh 300k--wikipedia-links--with-inverse.sw
(1.5G, 2 op types and 4571683 learn rules)
inverse-links-to: 4272183 learn rules
links-to: 299499 learn rules
