Re: Full Outer Join Primary Key Issue

ivancans · 06-04-2024 01:41 AM

Hi,

I would appreciate any help related to the following situation.

I have 3 views:

Users - Unique table of users
Orders - All the orders users have
Service Purchases - All the subscriptions users have

The model starts with orders. Users is joined to it via a full outer join, and service purchases is then joined to users via a regular left join.

join: user {
  relationship: many_to_one
  type: full_outer
  sql_on: ${order.user_id} = ${user.id} ;;
}

join: service_purchase {
  relationship: one_to_many
  sql_on: ${user.id} = ${service_purchase.user_id} ;;
}

There are 2 reasons why I need to use a full outer join: 1. The starting point of the model is the order view; 2. Not all users who have orders also have service purchases, and vice versa.

So far so good, but the issue starts when I need to create a measure that combines the $ amount from orders and the $ amount from service purchases. Looker allows that, but it can't deduplicate the results based on primary keys: by default it uses the primary key of the view the measure is built in (service purchases in this case), and using sql_distinct_key with the key from orders doesn't work either. Why? In a full outer join, rows where one table has no match in the other return nulls for the unmatched side, including its primary key columns. So when I try to use this measure that combines both views connected through the full outer join, the IDs repeat where they are present and are null where there is no match.

I tried using a row_number window function to create an artificial primary key for each view, and then coalescing the two keys for rows where either table A or table B returns nulls. That doesn't work: the deduplication logic runs inside a sum function, the deduplication key would contain my window function, and window functions aren't allowed inside aggregate functions.

So this is the issue. Please help 🙂

andy4

My top advice for all situations which have an x_to_many join is to make a change so that it is unnecessary. There are lots of reasons I believe that, but in this case the simple fact that you are having an issue is reason enough. Consider creating a Derived Table of the right-hand table which summarizes it so that its primary key is one which exists in the left-hand table. In this case that probably means rolling up service_purchase to a new table with the primary key of user_id. This will improve performance and help steer you away from the "one explore to do absolutely everything" anti-pattern. I also highly recommend doing this with the Native Derived Table feature instead of a SQL Derived Table. That way you can use measures you already have, and can easily add new ones in the future as your needs evolve for fields in the right-hand table.
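As a rough illustration of that rollup, a Native Derived Table might look like the sketch below. It rests on assumptions not stated in the thread: that an explore named service_purchase exists to act as the explore_source, and that the service_purchase view already has a measure called total_amount. The view name user_service_purchase_facts is hypothetical.

```lookml
# Sketch only: rolls service_purchase up to one row per user.
# Assumes an explore named service_purchase and a measure
# service_purchase.total_amount already exist (both hypothetical).
view: user_service_purchase_facts {
  derived_table: {
    explore_source: service_purchase {
      column: user_id { field: service_purchase.user_id }
      column: total_amount { field: service_purchase.total_amount }
    }
  }

  dimension: user_id {
    primary_key: yes
    hidden: yes
    sql: ${TABLE}.user_id ;;
  }

  measure: total_service_purchase_amount {
    type: sum
    sql: ${TABLE}.total_amount ;;
  }
}

# In the explore, the rollup then joins one-to-one on the user key,
# so it no longer fans out the rows it is joined to.
join: user_service_purchase_facts {
  relationship: one_to_one
  sql_on: ${user.id} = ${user_service_purchase_facts.user_id} ;;
}
```

Because the rolled-up view has exactly one row per user, the join no longer multiplies rows, though it also drops any per-purchase detail such as timestamps.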

While I strongly recommend the above approach, I can also offer a more direct answer to your question. The secret to fixing a variety of Measure-related issues is to split up their logic into smaller pieces. Take any complex SQL in the sql property and split that into yesno dimensions and additional measures. Once you've done that, you will probably notice measure(s) which only have dimensions from one or the other LookML view. If not... you aren't done splitting up the logic yet. Once you have that, you can now move the measures to the correct view and they will use the correct primary key to perform the deduplication for that portion of the logic. The location of all asymmetric measures (sum, count, average, etc) matters, as you have seen!

Eventually you will need to decide "where to put" the measures of type number which stitch together the other measures. For that, I prefer to think about the inaccessible field problem to help me decide. You will be referencing another view which means it must be joined to all explores which have that field (i.e. creating a dependency), so try and put those measures in the LookML view which will likely always come with the other one anyway. In this case you have three very helpful tables in many cases, so you might want to look into the option of a bare join - see this post from Looker's founder for a short explanation:
https://www.googlecloudcommunity.com/gc/Modeling/Best-practices-for-excluding-erroring-fields-from-e...
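To make the splitting concrete, here is a minimal sketch. It assumes each view has a primary key dimension named id and an amount column named amount, and that the stitching measure lives in the user view; none of these names come from the thread.

```lookml
# Sketch only: each sum stays in the view that owns its rows, so Looker
# deduplicates it with that view's own primary key.
view: order {
  dimension: id {
    primary_key: yes
    sql: ${TABLE}.id ;;
  }

  measure: total_order_amount {
    type: sum
    sql: ${TABLE}.amount ;;   # "amount" is an assumed column name
  }
}

view: service_purchase {
  dimension: id {
    primary_key: yes
    sql: ${TABLE}.id ;;
  }

  measure: total_service_purchase_amount {
    type: sum
    sql: ${TABLE}.amount ;;   # "amount" is an assumed column name
  }
}

view: user {
  # A type: number measure does no aggregation of its own; it only
  # combines the two already-deduplicated measures.
  measure: total_combined_amount {
    type: number
    sql: COALESCE(${order.total_order_amount}, 0)
       + COALESCE(${service_purchase.total_service_purchase_amount}, 0) ;;
  }
}
```

The trade-off is the dependency mentioned above: any explore that exposes total_combined_amount must also join order and service_purchase, otherwise the field is inaccessible.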

PS: You can use SDTs (SQL Derived Tables) if you must, but I really prefer NDTs!

ivancans

Hey Andy,

Thanks for the suggestion. The reason a rollup to user level wouldn't work is that we have, and need, the timestamp dimension. The final solution, in simple terms, after lots of trial and error was this:

1. Have the model start from user

2. Left join order on user_id

3. Full outer join service purchases on user_id to the user view and on created_date_raw to orders

At this stage, the final "table" has all the combinations of users + timestamps + orders and/or service purchases, but we are still missing 2 main things:

one, united, timestamp column - to filter all the dates at once or to plot this all on one timeline for charts;
one, united, user to identify who made which purchase

4.a. To get those united columns, we created an empty view (select null) and joined it to the model using the condition 1=1, which joins it to everything

4.b. In the new "dummy" view, which is used for unified date or user filtering, we created 2 dimensions:

one to coalesce date from orders and service purchases
another to coalesce user id from orders and service purchases

5. Added another view to the model, based on the users table, which we joined on the user_id that was coalesced in step 4.

6. In the "dummy" view, we pulled the username from the new user view from step 5 as a new dimension

That's it. At this point we get correct values in total, for a specific day, and for a specific day with users added.
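For anyone following along, the steps above could be sketched roughly as below. This is a reconstruction, not the actual model: the view names unified and purchasing_user, the dimension_group name event, the username field, and all relationship values are assumptions.

```lookml
# Sketch only: reconstructs steps 1-6 under assumed names.
explore: user {
  join: order {
    type: left_outer
    relationship: one_to_many
    sql_on: ${user.id} = ${order.user_id} ;;
  }

  join: service_purchase {
    type: full_outer
    relationship: many_to_many   # assumed; pick what matches your data
    sql_on: ${user.id} = ${service_purchase.user_id}
        AND ${order.created_date_raw} = ${service_purchase.created_date_raw} ;;
  }

  # Step 4.a: the "dummy" view, attached to every row via 1=1.
  join: unified {
    relationship: many_to_one
    sql_on: 1 = 1 ;;
  }

  # Step 5: the users table again, joined on the coalesced user id.
  join: purchasing_user {
    from: user
    relationship: many_to_one
    sql_on: ${unified.user_id} = ${purchasing_user.id} ;;
  }
}

view: unified {
  derived_table: {
    sql: SELECT NULL AS placeholder ;;
  }

  # Step 4.b: one timeline for orders and service purchases.
  dimension_group: event {
    type: time
    timeframes: [raw, date, week, month]
    sql: COALESCE(${order.created_date_raw}, ${service_purchase.created_date_raw}) ;;
  }

  # Step 4.b: one user id, whichever side the row came from.
  dimension: user_id {
    sql: COALESCE(${order.user_id}, ${service_purchase.user_id}) ;;
  }

  # Step 6: username pulled from the re-joined users view.
  dimension: username {
    sql: ${purchasing_user.username} ;;
  }
}
```

Since the unified dimensions reference order, service_purchase, and purchasing_user, all of those views need to be joined in any explore that uses the unified view.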