I really liked this paper. The approach is refreshingly uncomplicated. The use of circular correlation is well-argued, and the experimental results speak for themselves.
Knowledge graph expansion is an active area of research. If I know that
- cats have fur
- dogs have fur
- mice have fur
It would be great if I could infer that “rabbits have fur”, even if it hasn’t been manually added. In some sense, we can think of this as identifying a class of animals that are “furry”, assuming they share some other attributes. We can represent a “fact” as a tuple: (relation, x, y), or for convenience relation(x, y). In the case of our furry examples, we might write isFurry(cat).
The paper focuses on binary relations, e.g. bornIn(BarackObama, Hawaii). When we consider expanding our knowledge graph, we’re interested in giving a probability that a given unknown relation R(a, b) is true. The standard way to do this is to learn a representation for a and b, combine the representations in some manner, and compute a function p_r(a . b) to output a final probability.
The tricky part is how we combine our representations of a and b. Many approaches first compute some distance function of the terms and and then apply a relationship specific adjustment. Unfortunately, this fails to capture inter-dimensional relationships between terms and makes it difficult to learn relationship specific interactions. Similar problems occur when we combine terms by projection or concatenation.
At the other end of the spectrum we can compute the tensor (outer) product of our representations. Now we have every pair-wise interaction of a and b. Unfortunately, we now need to learn d^2 parameters to make a prediction for each relation.
So what’s the way out? The authors propose using circular correlation to combine the representations of a and b.
This can be efficiently implemented using Fourier transforms. The authors argue that this representation can be viewed as a compression of the full tensor product: we’re computing the interaction of each source term with various partitions of the target term. These partitions correspond to requiring different parts of the target to be similar.
This operation is non-commutative so p_r(a, b) != p_r(b, a); it’s also dimension preserving, so we can still perform some interesting operations on the output before we emit our final probability. Conveniently, the first dimension of the output is the dot product of a and b; if the overall similarity of the 2 vectors is important than this is trivially easy to make use of for a relation.
Finally, the authors note how this technique has many analogues to the idea of associative memory for neural networks.