I'm trying to choose the right database and schemas for the following problem:
There are millions of type A nodes and type B nodes in the system. A and B nodes are disjoint and don't have direct connections within the same type. Each type A node can be associated up to thousands of type B nodes, and the edge between an A node and a B node can carry weight.
The queries to the system are, for a given type A node, what are the top-k most similar type A nodes, measured by the number of type B nodes they share (multiplied by the edge weights)?
My research led me to some potential solutions, none of which are satisfying.
- A naive SQL approach, where we have a table to track the pair-wise similarity score between every pair of A nodes. This is not scalable to millions of A nodes as the space complexity is O(N^2).
- A graph database, and something like SimRank for computing a non-exact result. This feels like overkill though, given the specific graph in this case is more restrained.