sql - Algorithm for finding top-k similar nodes in database

I'm trying to choose the right database and schemas for the following problem:

There are millions of type A nodes and type B nodes in the system. A and B nodes are disjoint and don't have direct connections within the same type. Each type A node can be associated with up to thousands of type B nodes, and the edge between an A node and a B node can carry a weight.

The system needs to answer queries of this form: for a given type A node, which are the top-k most similar type A nodes, where similarity is measured by the number of type B nodes they share (multiplied by the edge weights)?
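To make the similarity measure concrete, here is a minimal Python sketch. It rests on one assumption the question does not spell out: that "multiplied by the edge weights" means each shared B node contributes the product of the two edge weights, so score(a, a') is the sum of w(a, b) * w(a', b) over every B node b that both share. The function name `top_k_similar` and the `(a_id, b_id, weight)` tuple layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
import heapq


def top_k_similar(edges, query, k):
    """Return up to k (a_id, score) pairs most similar to `query`.

    edges: iterable of (a_id, b_id, weight) tuples.
    Assumed score: sum over shared B nodes of w(query, b) * w(other, b).
    """
    a_to_b = defaultdict(dict)  # a_id -> {b_id: weight}
    b_to_a = defaultdict(dict)  # b_id -> {a_id: weight}
    for a, b, w in edges:
        a_to_b[a][b] = w
        b_to_a[b][a] = w

    scores = defaultdict(float)
    # Visit only the A nodes reachable through the query node's B neighbours.
    for b, w_query in a_to_b.get(query, {}).items():
        for other, w_other in b_to_a[b].items():
            if other != query:
                scores[other] += w_query * w_other

    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

With all weights equal to 1, this score reduces to the plain count of shared B nodes. Only A nodes that share at least one B node with the query are ever scored.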

My research led me to some potential solutions, none of which is satisfactory.

1. A naive SQL approach, where a table tracks the pairwise similarity score between every pair of A nodes. This does not scale to millions of A nodes, as the space complexity is O(N^2).
2. A graph database, with something like SimRank to compute an approximate result. This feels like overkill though, given that the graph in this case is more constrained.
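For contrast with the pairwise-score table in the first option, the sketch below stores only the A-B edges and computes scores at query time with a self-join. It is an illustration, not a claim that this meets any latency target at the stated scale. The table name `edge`, its columns, the index `edge_by_b`, and the helper names are all hypothetical, and the score again assumes each shared B node contributes the product of the two edge weights. It uses Python's built-in `sqlite3` module so the SQL can be run as-is.

```python
import sqlite3


def build_db(edges):
    """Create an in-memory SQLite database holding only the A-B edges."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE edge (
            a_id   TEXT NOT NULL,
            b_id   TEXT NOT NULL,
            weight REAL NOT NULL,
            PRIMARY KEY (a_id, b_id)
        )
        """
    )
    # Secondary index so the join can look up all A nodes attached to a B node.
    conn.execute("CREATE INDEX edge_by_b ON edge (b_id, a_id, weight)")
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", edges)
    return conn


# q = edges of the query node, o = edges of other A nodes sharing a B node.
TOP_K_SQL = """
SELECT o.a_id, SUM(q.weight * o.weight) AS score
FROM edge AS q
JOIN edge AS o
  ON o.b_id = q.b_id AND o.a_id <> q.a_id
WHERE q.a_id = ?
GROUP BY o.a_id
ORDER BY score DESC, o.a_id
LIMIT ?
"""


def top_k_sql(conn, query_a, k):
    """Return up to k (a_id, score) rows for the given A node."""
    return conn.execute(TOP_K_SQL, (query_a, k)).fetchall()
```

Storage here grows with the number of edges rather than with the number of A-node pairs. The cost moves to query time: each query touches every edge of every B node attached to the query node, so a few very popular B nodes could dominate the running time.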

asked Jul 19, 2023 at 3:33

Zizheng Tai

