0

I'm trying to choose the right database and schemas for the following problem:

There are millions of type A nodes and type B nodes in the system. A and B nodes are disjoint and don't have direct connections within the same type. Each type A node can be associated up to thousands of type B nodes, and the edge between an A node and a B node can carry weight.

The queries to the system are, for a given type A node, what are the top-k most similar type A nodes, measured by the number of type B nodes they share (multiplied by the edge weights)?

My research led me to some potential solutions, none of which are satisfying.

  1. A naive SQL approach, where we have a table to track the pair-wise similarity score between every pair of A nodes. This is not scalable to millions of A nodes as the space complexity is O(N^2).
  2. A graph database, and something like SimRank for computing a non-exact result. This feels like overkill though, given the specific graph in this case is more restrained.

0