The outline of the algo:
- you select substrings to fragment your document on.
(In HTML and XML, "<" and ">" should be sufficient. For C, perhaps "{", "}", ";" and so on.)
- split the document into fragments
- take a hash of each fragment (and choose how many digits of the hash to keep, e.g. 8)
- add up kets of those hashes, and that becomes your superposition.
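To make that concrete, here is a minimal, self-contained sketch of the outline in plain Python. It uses a collections.Counter mapping hash -> count in place of a superposition of kets, and sha1 truncated to the last few hex digits, the same way the full code further down does; the function names here are mine, not part of the semantic DB code.

import hashlib
from collections import Counter

def fragment(text, fragment_strings):
    # split the text on each fragment string in turn, dropping empty pieces
    pieces = [text]
    for frag in fragment_strings:
        next_pieces = []
        for piece in pieces:
            next_pieces += piece.split(frag)
        pieces = next_pieces
    return [p.strip() for p in pieces if p.strip()]

def fragment_hashes(text, fragment_strings=("<", ">"), size=8):
    # keep only the last `size` hex digits of each fragment's sha1 hash
    hashes = [hashlib.sha1(p.encode('utf-8')).hexdigest()[-size:]
              for p in fragment(text, fragment_strings)]
    # the Counter plays the role of the superposition: hash -> coefficient
    return Counter(hashes)

# e.g. fragment_hashes("<p>hello</p><p>world</p>") gives a Counter of truncated hashes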
A couple of advantages of this scheme:
- local changes in the document do not have non-local effects. Fragments further down the page are completely unaffected by modifications elsewhere. This is a powerful feature!
- webpages that change their content daily still have a largely similar structure from day to day, on the order of 90% or so similarity. We will see the actual numbers in a future post. I suspect this is because the content changes daily, but the underlying HTML that sites use for their pages is largely invariant from day to day (a quick way to check this is sketched below).
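Here is a quick way to check that locality/similarity claim using the fragment_hashes() sketch above: hash two versions of a page and measure the overlap of the two hash multisets. The similarity function below is a simple min-over-max multiset overlap chosen for illustration, not necessarily the simm operator used elsewhere in this project, and the file names are placeholders.

from collections import Counter

def multiset_similarity(a, b):
    # fraction of hash counts the two pages share, in [0, 1]
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)
    total = sum(max(a[k], b[k]) for k in keys)
    return shared / total if total else 0.0

# usage (placeholder file names):
# day1 = fragment_hashes(open('clean-abc-1.html').read())
# day2 = fragment_hashes(open('clean-abc-2.html').read())
# print(multiset_similarity(day1, day2))  # the claim above suggests something around 0.9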
And now the Python:
import sys
import hashlib

from the_semantic_db_code import *
from the_semantic_db_functions import *
from the_semantic_db_processor import *

C = context_list("fragment webpages")

fragment_strings = ["<", ">"]

def fragment_string(s, fragment_strings):
    r = [s]
    for frag in fragment_strings:
        list = r
        r = []
        for s in list:
            r += s.split(frag)
    return [s.strip() for s in r if len(s.strip()) > 0]

def create_fragment_hashes(filename, fragment_strings, size):
    result = fast_superposition()
    with open(filename, 'r') as f:
        text = f.read()
        for fragment in fragment_string(text, fragment_strings):
            hash = hashlib.sha1(fragment.encode('utf-8')).hexdigest()[-size:]
            result += ket(hash)
    return result.superposition().coeff_sort()

size_list = [[4, "64k"], [5, "1M"], [8, "4B"]]

def learn_webpage_hashes(C, webpage, n):
    for k in range(n):
        web_file = "webpages-v2/clean-" + webpage + "-" + str(k + 1) + ".html"
        print("web_file:", web_file)
        for pair in size_list:
            size, string_size = pair
            print("size:", size)
            print("string size:", string_size)
            ket_name = webpage + " " + str(k + 1)
            print("ket_name:", ket_name)
            hash_name = "hash-" + string_size

            # now let's learn the superpositions:
            r = create_fragment_hashes(web_file, fragment_strings, size)
            C.learn(hash_name, ket_name, r)

            # learn the drop-n hashes:
            # an experiment really, trying to work out what is best.
            # Probably r by itself, but we need to check.
            C.learn("drop-2-" + hash_name, ket_name, r.drop_below(2))
            C.learn("drop-3-" + hash_name, ket_name, r.drop_below(3))
            C.learn("drop-4-" + hash_name, ket_name, r.drop_below(4))
            C.learn("drop-5-" + hash_name, ket_name, r.drop_below(5))
            C.learn("drop-6-" + hash_name, ket_name, r.drop_below(6))
            C.learn("drop-7-" + hash_name, ket_name, r.drop_below(7))
            C.learn("drop-8-" + hash_name, ket_name, r.drop_below(8))
            C.learn("drop-9-" + hash_name, ket_name, r.drop_below(9))
            C.learn("drop-10-" + hash_name, ket_name, r.drop_below(10))

            # learn how many of each:
            C.learn("count-1-" + hash_name, ket_name, r.number_count())
            C.learn("count-2-" + hash_name, ket_name, r.drop_below(2).number_count())
            C.learn("count-3-" + hash_name, ket_name, r.drop_below(3).number_count())
            C.learn("count-4-" + hash_name, ket_name, r.drop_below(4).number_count())
            C.learn("count-5-" + hash_name, ket_name, r.drop_below(5).number_count())
            C.learn("count-6-" + hash_name, ket_name, r.drop_below(6).number_count())
            C.learn("count-7-" + hash_name, ket_name, r.drop_below(7).number_count())
            C.learn("count-8-" + hash_name, ket_name, r.drop_below(8).number_count())
            C.learn("count-9-" + hash_name, ket_name, r.drop_below(9).number_count())
            C.learn("count-10-" + hash_name, ket_name, r.drop_below(10).number_count())

# learn them all!
sites = ["abc", "adelaidenow", "slashdot", "smh", "wikipedia", "youtube"]
number = 11
for site in sites:
    learn_webpage_hashes(C, site, number)

name = "sw-examples/improved-fragment-webpages.sw"
save_sw(C, name)

I downloaded one page a day for 11 days from each of the listed sites, ran my code on that, and here is the result. BTW, for 66 webpages the code took roughly 75 minutes.
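A side note on the drop-n and count-n rules in the code above. For readers without the semantic DB code handy, here is a rough dict-based picture of what drop_below() and number_count() do in this context (a simplified analogue in terms of the Counter sketch from earlier, not the actual implementation):

from collections import Counter

def drop_below(sp, t):
    # drop hashes seen fewer than t times
    return Counter({k: v for k, v in sp.items() if v >= t})

def number_count(sp):
    # how many distinct hashes are left in the superposition
    return len(sp)

# usage with the fragment_hashes() sketch above:
# sp = fragment_hashes(open('clean-abc-1.html').read(), size=5)
# print(number_count(drop_below(sp, 2)))  # roughly what "count-2-hash-1M" records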
That's it for this post. More on this in the next few posts.
Update: the original motivation for this algo was gel electrophoresis.