Thursday, 5 March 2015

mapping webpages to well defined superpositions

This is an interesting one, not sure how to best describe it. I guess the general idea is to map webpages to superpositions, and then we can do "pattern recognition" on those superpositions using similar[op]. The code is quite general, and should work for things like XML, or program code, if you choose the right substrings to fragment on. Text of ebooks say require more processing, you almost certainly need to look at word 3-grams, since if you just look at single word frequency, most English text is similar.

The outline of the algo:
- you select substrings to fragment your document on.
(In html and xml "<" and ">" should be sufficient. For C perhaps "{", "}", ";" and so on.)
- split the document into fragments
- take a hash of the fragments (and choose how many digits of the hash to keep, eg 8)
- add up kets of those hashes, and that becomes your superposition.

A couple of advantages of this scheme:
- local changes in the document do not have non-local effects. Fragments further down the page are completely unaffected by modifications elsewhere. This is a powerful feature!
- webpages that change their content daily largely have a similar structure from day to day. On the order of 90% or so similarity. We will see the actual numbers in a future post. I guess the content changes daily, but the underlying html sites use for their pages is largely invariant from day to day.

And now the python:
import sys
import hashlib
from the_semantic_db_code import *
from the_semantic_db_functions import *
from the_semantic_db_processor import *

C = context_list("fragment webpages")
fragment_strings = ["<",">"]

def fragment_string(s,fragment_strings):
  r = [s]
  for frag in fragment_strings:
    list = r
    r = []
    for s in list:
      r += s.split(frag)
  return [s.strip() for s in r if len(s.strip()) > 0 ]

def create_fragment_hashes(filename,fragment_strings,size):
  result = fast_superposition()
  with open(filename,'r') as f:
    text = f.read()
    for fragment in fragment_string(text,fragment_strings):
      hash = hashlib.sha1(fragment.encode('utf-8')).hexdigest()[-size:]
      result += ket(hash)
  return result.superposition().coeff_sort()

size_list = [[4,"64k"],[5,"1M"],[8,"4B"]]

def learn_webpage_hashes(C,webpage,n):
  for k in range(n):
    web_file = "webpages-v2/clean-" + webpage + "-" + str(k+1) + ".html"
    print("web_file:",web_file)
    for pair in size_list:
      size, string_size = pair
      print("size:",size)
      print("string size:",string_size)
      ket_name = webpage + " " + str(k+1)
      print("ket_name:",ket_name)
      hash_name = "hash-" + string_size

# now lets learn the superpositions:
      r = create_fragment_hashes(web_file,fragment_strings,size)
      C.learn(hash_name,ket_name,r)

# learn the drop-n hashes:
# an experiment really, trying to work out what is best.
# Probably r by itself, but we need to check.
      C.learn("drop-2-" + hash_name,ket_name,r.drop_below(2))
      C.learn("drop-3-" + hash_name,ket_name,r.drop_below(3))
      C.learn("drop-4-" + hash_name,ket_name,r.drop_below(4))
      C.learn("drop-5-" + hash_name,ket_name,r.drop_below(5))
      C.learn("drop-6-" + hash_name,ket_name,r.drop_below(6))
      C.learn("drop-7-" + hash_name,ket_name,r.drop_below(7))
      C.learn("drop-8-" + hash_name,ket_name,r.drop_below(8))
      C.learn("drop-9-" + hash_name,ket_name,r.drop_below(9))
      C.learn("drop-10-" + hash_name,ket_name,r.drop_below(10))

# learn how many of each:
      C.learn("count-1-" + hash_name,ket_name,r.number_count())
      C.learn("count-2-" + hash_name,ket_name,r.drop_below(2).number_count())
      C.learn("count-3-" + hash_name,ket_name,r.drop_below(3).number_count())
      C.learn("count-4-" + hash_name,ket_name,r.drop_below(4).number_count())
      C.learn("count-5-" + hash_name,ket_name,r.drop_below(5).number_count())
      C.learn("count-6-" + hash_name,ket_name,r.drop_below(6).number_count())
      C.learn("count-7-" + hash_name,ket_name,r.drop_below(7).number_count())
      C.learn("count-8-" + hash_name,ket_name,r.drop_below(8).number_count())
      C.learn("count-9-" + hash_name,ket_name,r.drop_below(9).number_count())
      C.learn("count-10-" + hash_name,ket_name,r.drop_below(10).number_count())

# learn them all!
sites = ["abc","adelaidenow","slashdot","smh","wikipedia","youtube"]
number = 11

for site in sites:
  learn_webpage_hashes(C,site,number)

name = "sw-examples/improved-fragment-webpages.sw"
save_sw(C,name)
I downloaded one page a day for 11 days from each of the listed sites. Ran my code on that, and here is the result. BTW, for 66 webpages the code took roughly 75 minutes.

That's it for this post. More on this in the next few posts.

Update: the original motivation for this algo was gel electrophoresis.

No comments:

Post a Comment