Monday, 23 March 2015

find-topic[names]

In this post we use normed-frequency-class, map-to-topic and find-topic[op] to estimate how likely a given name is to be a male name, a female name, or a last name. The data comes from "Frequently Occurring Surnames from Census 1990", and our sw version of it is the names.sw file loaded below.

In the console:
-- load our data:
sa: load names.sw

-- let's see how big our data set is:
sa: how-many names |female name>
|number: 4275>

sa: how-many names |male name>
|number: 1219>

sa: how-many names |last name>
|number: 88799>

-- now play with the data:
sa: find-topic[names] |alice>
91.304|female name> + 8.696|last name> 

sa: find-topic[names] |bob>
61.404|male name> + 38.596|last name> 

sa: find-topic[names] |alex>
51.012|male name> + 33.401|last name> + 15.587|female name>

sa: find-topic[names] |sam>
47.818|male name> + 37.571|last name> + 14.611|female name>

-- save some typing:
h |*> #=> find-topic[names] |_self>

sa: h |smith>
100.000|last name>

sa: h |frank>
47.324|male name> + 41.831|last name> + 10.845|female name>

sa: h |bella>
53.333|last name> + 46.667|female name>

sa: h |lisa>
92.105|female name> + 7.895|last name>

sa: h |tim>
88.421|male name> + 11.579|last name>

sa: h |jane>
91.304|female name> + 8.696|last name>

sa: h |alexandria>
82.353|female name> + 17.647|last name>
Hopefully that is all clear enough. Judging by the relative probabilities in these results, it does quite a good job. Interestingly, find-topic[op] works with any frequency list. Say you had frequency lists of names from different countries, i.e., names that are common in Italy, France, Germany, Japan, Scotland, Ireland and so on. Given just those lists, find-topic[op] could guess the country of origin of a sample name.
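
To make the "any frequency list" point concrete, here is a rough Python sketch of the idea. It is not the project's actual code: the scoring formula (a normed frequency class built from log2 ratios to the most frequent entry), the function names, and the toy counts are all assumptions for illustration only.

```python
import math

# Hypothetical toy counts, NOT the census data.
toy_lists = {
    "male name": {"james": 3318, "alex": 115, "sam": 92},
    "female name": {"mary": 2629, "alice": 357, "alex": 2},
    "last name": {"smith": 1006, "alex": 5, "alice": 1},
}

def normed_frequency_class(name, freq):
    """Assumed scoring: 1.0 for the most frequent entry, 0.0 if absent."""
    count = freq.get(name, 0)
    if count <= 0:
        return 0.0
    largest = max(freq.values())
    smallest = min(freq.values())
    # Frequency class: roughly how many halvings below the top entry.
    fc = math.floor(0.5 - math.log2(count / largest))
    fc_max = math.floor(0.5 - math.log2(smallest / largest)) + 1
    return 1 - fc / fc_max

def find_topic(name, lists):
    """Score name against every list, rescale to sum to 100, sort descending."""
    scores = {label: normed_frequency_class(name, freq)
              for label, freq in lists.items()}
    total = sum(scores.values())
    if total == 0:
        return []
    ranked = [(label, 100 * s / total) for label, s in scores.items() if s > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

for name in ["alice", "alex", "smith"]:
    print(name, [(label, round(p, 3)) for label, p in find_topic(name, toy_lists)])
```

Nothing in this sketch knows that the labels are name types. Swap the three dictionaries for per-country frequency lists and the same function would rank countries instead.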

Another thing to note is that frequency lists give much better results than plain lists with all coeffs equal. With plain lists, a ket that appears in n lists gets a coeff of 100/n for each of them, so there is no information about whether it is, say, more of a female name than a last name.
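
The 100/n behaviour is easy to see in a small Python sketch. The sets and the function name here are hypothetical, and the code only models the equal-coeff case described above, not the real find-topic[op].

```python
# Hypothetical plain (unweighted) lists: membership only, no frequencies.
plain_lists = {
    "male name": {"james", "alex"},
    "female name": {"mary", "alex", "alice"},
    "last name": {"smith", "alex", "alice"},
}

def find_topic_plain(name, lists):
    """With equal coeffs, every list containing the name gets the same share."""
    hits = [label for label, members in lists.items() if name in members]
    return [(label, 100 / len(hits)) for label in hits]

print(find_topic_plain("alex", plain_lists))   # three equal shares of 100/3
print(find_topic_plain("alice", plain_lists))  # two equal shares of 50
```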

Also, it may be interesting to process the data and find the set of all names that are in one name type only (i.e., those that return a coeff of 100 for find-topic[names]). I'll give some thought to the easiest way to do that.

More uses of find-topic[op] coming up.

Update: this is how you find names that are in only 1 of the frequency lists:
is-unique |*> #=> is-equal[100] push-float find-topic[names] |_self>

|unique male names> => such-that[is-unique] names |male name>
|unique female names> => such-that[is-unique] names |female name>
|unique last names> => such-that[is-unique] names |last name>
Hrmm... seems slow. Here is another approach:
|unique names> => drop-above[1] (clean names |male name> + clean names |female name> + clean names |last name>)

is-male-name |*> #=> do-you-know intn(|_self>,names |male name>)
|unique male names> => such-that[is-male-name] "" |unique names>

is-female-name |*> #=> do-you-know intn(|_self>,names |female name>)
|unique female names> => such-that[is-female-name] "" |unique names>

is-last-name |*> #=> do-you-know intn(|_self>,names |last name>)
|unique last names> => such-that[is-last-name] "" |unique names>
Nope! That is even slower! I think I need to write some Python to fix this.
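
One possible shape for that Python, as a sketch only: count how many lists each name appears in with a single pass, then keep the names whose count is 1. The function name and toy data are hypothetical, and it assumes that "coeff of 100" is the same thing as "appears in exactly one list".

```python
from collections import Counter

# Hypothetical toy counts, NOT the census data.
toy_lists = {
    "male name": {"james": 3318, "alex": 115},
    "female name": {"mary": 2629, "alex": 2, "alice": 357},
    "last name": {"smith": 1006, "alice": 1},
}

def unique_names(lists):
    """Map each list label to the names that appear in that list only."""
    # One pass over all the data: how many lists contain each name?
    seen = Counter(name for freq in lists.values() for name in freq)
    return {label: sorted(n for n in freq if seen[n] == 1)
            for label, freq in lists.items()}

print(unique_names(toy_lists))
```

This touches each name a constant number of times, instead of running find-topic or an intersection once per name.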

Update: these examples represent what will probably become a common pattern:
is-something |*> #=> do-you-know mbr(|_self>,some-op |some list>)
-- where mbr is a kind of optimization of intersection.
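
In Python terms, the optimization might look like the sketch below: build a set from the list once, then answer each membership question with a hash lookup rather than a full intersection. This is only a guess at the intent; make_is_member and the sample names are hypothetical, not part of the project.

```python
def make_is_member(names):
    """Return a predicate that tests membership against a fixed list."""
    members = set(names)  # built once, reused for every query

    def is_member(name):
        # Hash lookup: no need to intersect the whole list each time.
        return name in members

    return is_member

is_male_name = make_is_member(["james", "alex", "sam"])
print(is_male_name("alex"), is_male_name("mary"))
```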
