Tuesday 17 February 2015

a similarity metric

In this post I will try to explain the derivation of my similarity metric. It is useful in a bunch of places.

First, note that in my travels I have seen this given as a similarity metric:
similarity-metric(f,g,s) = exp(-||f - g||^2/2s)
So, if f == g, it returns 1, and as ||f - g||^2 gets big, it approaches 0.
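For concreteness, here is a quick sketch of that metric in Python (numpy assumed; I am reading /2s as /(2s), with s > 0 a smoothing parameter):

import numpy as np

def gaussian_similarity(f, g, s):
    # exp(-||f - g||^2 / (2s)): 1 for identical vectors, approaches 0 as they diverge
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return np.exp(-np.sum((f - g) ** 2) / (2 * s))

print(gaussian_similarity([1, 2, 3], [1, 2, 3], s=1.0))  # 1.0
print(gaussian_similarity([1, 2, 3], [4, 5, 6], s=1.0))  # ~1.4e-06, close to 0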

Now, for the derivation of my metric.
First, define a multiplication operator that will simplify the equations a little:
a*b = \Sum_k abs(a[k] . b[k]) -- discrete version
a*b = \Int dt abs(a(t) . b(t)) -- continuous version
where . is the standard multiplication operator.
Both of these have the property:
0 <= w*[f - g] <= w*f + w*g
where the lower bound 0 is attained if f == g, and the upper bound w*f + w*g if f and g are perfectly disjoint.
which is really just another way of stating two of the standard metric axioms:
d(x,y) = 0 if and only if x = y      -- identity of indiscernibles
d(x,z) <= d(x,y) + d(y,z)            -- triangle inequality
And it is illustrative to note that:
w*[f - g] = \Sum_k |w[k].f[k] - w[k].g[k]|  -- discrete version
w*[f - g] = \Int dt |w(t).f(t) - w(t).g(t)| -- continuous version
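As a quick aside, here is a minimal sketch of the discrete operator and the bounds above in Python (numpy assumed):

import numpy as np

def star(a, b):
    # the a*b operator, discrete version: \Sum_k abs(a[k] . b[k])
    return np.sum(np.abs(a * b))

w = np.array([1.0, 1.0, 1.0])
f = np.array([1.0, 0.0, 2.0])
g = np.array([0.0, 3.0, 2.0])
print(star(w, f - g))           # 4.0
print(star(w, f) + star(w, g))  # 8.0, so 0 <= 4.0 <= 8.0 as claimed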
Now, just divide through by the right-hand side, casting it into the range [0,1]:
0 <= w*[f - g]/(w*f + w*g) <= 1
Now, this is a good start, but for some examples it doesn't work as well as expected (maybe one day I will dig up an example to show my point!).
But we can fix this by adding a term, R.abs(w*f - w*g), where R >= 0; in practice it works best with R = 1.
And hence simm:
0 <= (w*[f - g] + R.abs(w*f - w*g))/(w*f + w*g + R.abs(w*f - w*g)) <= 1
So, OK. Let's set R = 1, and note that a + b + abs(a - b) = 2.max(a,b). We can now compact it down to:
0 <= (w*[f - g] + abs(w*f - w*g))/2.max(w*f,w*g) <= 1
Cool. And now observe, as in the physics tradition, this equation has some symmetries:
given the scalar k (not equal to 0), we have:
1) symmetry under w => k.w                       
2) symmetry under f => k.f and g => k.g
3) symmetry under w => w.exp(i*t)/R(t), f => R(t).exp(-i*t).f, g => R(t).exp(-i*t).g, where R(t) is an arbitrary nonzero function (not the constant R above)
Now, I can't think of a use case for the third symmetry there, but I put it in because it reminds me of gauge transformations in physics, and it is a good decider on which terms are valid to use in our metric, eg: w*f, w*g, w*[f - g], abs(w*f - w*g).
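Since this is all just arithmetic, the first two symmetries are easy to check numerically. A minimal sketch (numpy assumed, using star() for the a*b operator):

import numpy as np

def star(a, b):
    # the a*b operator, discrete version: \Sum_k abs(a[k] . b[k])
    return np.sum(np.abs(a * b))

def metric(w, f, g):
    # the R = 1 form: (w*[f - g] + abs(w*f - w*g)) / 2.max(w*f, w*g)
    return (star(w, f - g) + abs(star(w, f) - star(w, g))) / (2 * max(star(w, f), star(w, g)))

w = np.array([1.0, 2.0, 0.5])
f = np.array([3.0, 0.0, 1.0])
g = np.array([1.0, 1.0, 4.0])
k = 3.7
print(metric(w, f, g))          # 0.7
print(metric(k * w, f, g))      # 0.7, symmetry under w => k.w
print(metric(w, k * f, k * g))  # 0.7, symmetry under f => k.f and g => k.g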

We are about done. The final thing to note is that the above metric returns 0 for an exact match, and 1 for perfectly disjoint. If we want to swap this, just map the metric m to 1 - m (presuming m "attains" its upper and lower bounds)
ie:
(w*[f - g] + R.abs(w*f - w*g))/(w*f + w*g + R.abs(w*f - w*g))
becomes:
1 - (w*[f - g] + R.abs(w*f - w*g))/(w*f + w*g + R.abs(w*f - w*g))
  = (w*f + w*g + R.abs(w*f - w*g) - w*[f - g] - R.abs(w*f - w*g))/(w*f + w*g + R.abs(w*f - w*g))
  = (w*f + w*g - w*[f - g])/(w*f + w*g + R.abs(w*f - w*g)) 
  = (w*f + w*g - w*[f - g])/2.max(w*f,w*g)                -- assuming R = 1
So there we have it:
simm(w,f,g) = (w*f + w*g - w*[f - g])/2.max(w*f,w*g)
NB: if w is not given, i.e. we are using simm(f,g), then assume w[k] = 1 for all k, or w(t) = 1 for all t.
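The Python implementation proper is coming in the next post, but as a preview, here is a minimal sketch of the discrete version (numpy assumed; I have added a guard for the zero-vector case, which the equation leaves undefined):

import numpy as np

def simm(w, f, g):
    # simm(w,f,g) = (w*f + w*g - w*[f - g]) / 2.max(w*f, w*g)
    w, f, g = (np.asarray(x, dtype=float) for x in (w, f, g))
    wf = np.sum(np.abs(w * f))
    wg = np.sum(np.abs(w * g))
    wfg = np.sum(np.abs(w * (f - g)))
    if max(wf, wg) == 0:
        return 0.0  # both f and g are zero vectors; the equation is 0/0 here
    return (wf + wg - wfg) / (2 * max(wf, wg))

f = np.array([1.0, 2.0, 3.0])
w = np.ones_like(f)   # w not given: w[k] = 1 for all k
print(simm(w, f, f))                              # 1.0, exact match
print(simm(w, f, np.array([-1.0, -2.0, -3.0])))   # 0.0, perfectly disjoint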
Now a couple of observations:
1) the above derivation of my simm was based on the Manhattan metric. A similar derivation could be done using the standard Euclidean metric. Or presumably any metric. Though I think Manhattan is better than Euclidean because it is in some sense linear, and doesn't require any sqrt()s.
2) the fact that the denominator collapses down only when R = 1 is a hint that that is the right choice!
3) from experience playing with this, simm(w,f,g) usually, though not always, gives best results when f and g are rescaled so that w*f == w*g (excluding the case where they are length-1 vectors, in which case you should not rescale!) -- see the sketch after this list
4) if w[k], f[k] and g[k] are >= 0 for all terms, then the equation can be compressed using a + b - abs(a - b) = 2.min(a,b), and becomes:
simm(w,f,g) = \Sum_k w[k] min(f[k],g[k]) / max(w*f,w*g)
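Here is the rescaling sketch mentioned in observation (3) (numpy assumed; rescale_to_match is my name for it, and it assumes w*g != 0):

import numpy as np

def rescale_to_match(w, f, g):
    # return a scaled copy of g such that w*g == w*f
    wf = np.sum(np.abs(w * f))
    wg = np.sum(np.abs(w * g))
    return g * (wf / wg)

w = np.array([1.0, 1.0])
f = np.array([2.0, 2.0])
g = np.array([0.5, 1.5])
g2 = rescale_to_match(w, f, g)
print(np.sum(np.abs(w * g2)))  # 4.0, matching w*f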

And that's it! An implementation in python, and some examples coming up next.

Update: A quick expansion on observation (1). So, using the Manhattan metric as the base, we have:
w*[f - g] = \Sum_k |w[k].f[k] - w[k].g[k]|
If we used the Euclidean metric instead, we would have:
w*[f - g] = sqrt(\Sum_k |w[k].f[k] - w[k].g[k]|^2)
and hence my comment "in some sense linear".
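Side by side, the two candidate bases look like this (a sketch, numpy assumed):

import numpy as np

def manhattan_base(w, f, g):
    # \Sum_k |w[k].f[k] - w[k].g[k]| -- linear in the components, no sqrt
    return np.sum(np.abs(w * f - w * g))

def euclidean_base(w, f, g):
    # sqrt(\Sum_k |w[k].f[k] - w[k].g[k]|^2)
    return np.sqrt(np.sum(np.abs(w * f - w * g) ** 2))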

And an expansion on observation (4):
Consider just the numerator:
w*f + w*g - w*[f - g]
Now, expand that out using our definition of a*b:
w*f + w*g - w*[f - g]
= \Sum_k |w[k].f[k]| + \Sum_k |w[k].g[k]| - \Sum_k |w[k].f[k] - w[k].g[k]|
= \Sum_k (|w[k].f[k]| + |w[k].g[k]| - |w[k].f[k] - w[k].g[k]|)
= \Sum_k |w[k]| (|f[k]| + |g[k]| - |f[k] - g[k]|)
Now, make use of "f[k] and g[k] are >= 0 for all terms":
= \Sum_k |w[k]| (f[k] + g[k] - |f[k] - g[k]|)
Now, make use of a + b - abs(a - b) = 2.min(a,b):
= \Sum_k |w[k]| 2.min(f[k],g[k])
And hence (assuming w[k] >= 0 for all k):
simm(w,f,g) = \Sum_k w[k] min(f[k],g[k]) / max(w*f,w*g)
as promised! And we take this pretty compression as a strong hint that Manhattan was a better choice than Euclidean! BTW, also note that the factor of 2 in front of min cancelled the 2 in front of max.
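And since that compression is easy to get wrong, here is a quick numeric check that the compressed form agrees with the general one for nonnegative w, f, g (a sketch, numpy assumed):

import numpy as np

def simm_general(w, f, g):
    wf, wg = np.sum(np.abs(w * f)), np.sum(np.abs(w * g))
    wfg = np.sum(np.abs(w * (f - g)))
    return (wf + wg - wfg) / (2 * max(wf, wg))

def simm_compressed(w, f, g):
    # valid when w[k], f[k], g[k] >= 0 for all k
    return np.sum(w * np.minimum(f, g)) / max(np.sum(w * f), np.sum(w * g))

w = np.array([1.0, 2.0, 0.5])
f = np.array([0.2, 0.0, 3.0])
g = np.array([0.1, 1.0, 2.0])
print(simm_general(w, f, g))     # 0.3548...
print(simm_compressed(w, f, g))  # 0.3548..., the same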

Update: OK. I don't know a use case for these, but we can generalize beyond pairs {f,g}.
simm(w,f1,f2) = \Sum_k w[k] min(f1[k],f2[k])/max(w*f1,w*f2)
simm(w,f1,f2,f3) = \Sum_k w[k] min(f1[k],f2[k],f3[k])/max(w*f1,w*f2,w*f3)
simm(w,f1,f2,f3,f4) = \Sum_k w[k] min(f1[k],f2[k],f3[k],f4[k])/max(w*f1,w*f2,w*f3,w*f4)
simm(w,f1,f2,f3,f4,f5) = \Sum_k w[k] min(f1[k],f2[k],f3[k],f4[k],f5[k])/max(w*f1,w*f2,w*f3,w*f4,w*f5)
and so on.
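A sketch of the n-function version in Python (nonnegative case, numpy assumed; simm_n is my name for it):

import numpy as np

def simm_n(w, *fs):
    # simm(w,f1,...,fn) = \Sum_k w[k] min(f1[k],...,fn[k]) / max(w*f1,...,w*fn)
    fs = [np.asarray(f, dtype=float) for f in fs]
    numerator = np.sum(w * np.minimum.reduce(fs))
    denominator = max(np.sum(np.abs(w * f)) for f in fs)
    return numerator / denominator

w = np.array([1.0, 1.0])
f1 = np.array([1.0, 2.0])
f2 = np.array([2.0, 1.0])
f3 = np.array([1.0, 1.0])
print(simm_n(w, f1, f2, f3))  # 2/3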
