Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chore: Added stage tables for portal pageviews #1267

Open
wants to merge 51 commits into
base: master
Choose a base branch
from

Conversation

munish7771
Copy link
Contributor

@munish7771 munish7771 commented May 2, 2023

Summary

  • Adding stage Portal prod pageview* tables to capture cloud signups.
  • Adding intermediate table to capture signup stages.
  • Verified the join to Stripe customers.

select count(*) from ANALYTICS.DBT_CLOUD_PR_226810_1267.INT_PORTAL_PROD_SIGNUPS_AGGREGATED_TO_USERS signups join analytics.stripe.customers c on c.cws_customer = signups.portal_customer_id where account_created and c.id is not null; -- equal to unique customers in identify.

Ticket Link

Required for changes to PR -> #1261

Copy link
Contributor

@ifoukarakis ifoukarakis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No changes here :)

@munish7771 munish7771 added the Work In Progress Not yet ready for review label May 2, 2023
@@ -0,0 +1,5 @@
{% set rudder_relations = dbt_utils.get_relations_by_prefix(schema="PORTAL_PROD", database="RAW", prefix="PAGEVIEW_") %}

{{ dbt_utils.union_relations(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all pageview columns required?

Unioning the tables to create a staging "tracks" from events is not really following DBT's suggestion for project structure, but definitely a necessary thing to do. One thing that can be done to make things more predictable is to explicitly define which columns to keep. In this case it should be the columns shared between event tables. This way whenever a new property is added on any pageview_ table the current model won't be affected.

Comment on lines 1 to 14
WITH identifies as(
SELECT
{{ dbt_utils.star(source('portal_prod', 'identifies')) }}
FROM
{{ source('portal_prod', 'identifies') }}
)

SELECT
user_id,
portal_customer_id,
context_traits_portal_customer_id,
received_at
FROM
identifies
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach introduces a new approach for staging models. Let's follow the pattern described in DBT's documentation. DBT codegen automatically generated this, so less effort there.

Comment on lines 2 to 15
with pageviews as(
SELECT
{{ dbt_utils.star(ref('base_portal_prod__tracks')) }}
FROM
{{ ref ('base_portal_prod__tracks') }}
)

select
id as pageview_id,
user_id,
event as event_table,
received_at
from
pageviews
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the CTE is needed.

FROM
{{ ref('stg_portal_prod__identifies') }}
WHERE
coalesce(portal_customer_id, context_traits_portal_customer_id) IS NOT NULL and received_at >= '2023-04-04'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 2023-04-04? Any chance this is a left-over to make things run faster?

false AS workspace_created
FROM
pageviews
JOIN (select * from identifies where row_number = 1) identifies
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use qualify in identifies CTE?

{{
config({
"materialized": "table",
"incremental_strategy": "merge",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incremental strategy is defined, but the model is not incremental.

-- Email is verified and user is redirected to `pageview_create_workspace` screen.
MAX(CASE WHEN pageviews.event_table = 'pageview_create_workspace' THEN true ELSE false END) AS email_verified,
-- Setting to false as we consider `workspace_installation_id` from stripe as source of truth
false AS workspace_created
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this false always? If it is information that will come from a different source, why add it here?

@@ -0,0 +1,45 @@
{{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this model capture the users that have completed signups. The only aggregation that happens seems to be happening in order to achieve some deduplication (?). Perhaps the name of the model could be int_user_signups.

WITH identifies as (
SELECT user_id,
coalesce(portal_customer_id, context_traits_portal_customer_id) as portal_customer_id,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY RECEIVED_AT) AS row_number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use qualify in identifies CTE?

If I understand correctly, the idea is to create a mapping between user_id and portal_customer_id. For this case, the logic is rather simple, as for each user_id a single portal_customer_id must exist and vice versa. This can be moved to a separate intermediate model because:

  • it's extremely likely that it will be reused,
  • this way tests on the 1:1 mapping can be added.

In fact it's a common challenge when using data from CDP platforms. Here's a few examples on how it's performed using DBT:

@@ -0,0 +1,17 @@
version: 2
Copy link
Contributor

@ifoukarakis ifoukarakis May 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better place to place signup related models can be under the name of the team responsible for this flow. Placing them under data_eng feels a bit strange.

WHEN pageview_create_workspace.pageview_id IS NOT NULL
THEN TRUE
ELSE FALSE
END AS email_verified,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps is_email_verified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding the is_ prefix.

account_created,
email_verified
from user_signup_stages
qualify row_number() over (partition by portal_customer_id order by timestamp) = 1
Copy link
Contributor

@ifoukarakis ifoukarakis May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very sure on email_verified specification is.

Here's an example of two :

| portal_customer_id | email_verified | timestamp  |
|--------------------|----------------|------------|
| 1                  | false          | 2023-05-01 |
| 1                  | true           | 2023-05-02 |
| 1                  | false          | 2023-05-03 |
| 2                  | true           | 2023-05-02 |
| 2                  | false          | 2023-05-03 |
  • With quailify, resulting email_verified will be false.
  • With i.e. group by portal_customer_id and max(email_verified), email_verified will be true .

1/5 which one should be used.

FROM
{{ ref('stg_portal_prod__identifies') }}
WHERE
portal_customer_id IS NOT NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use same ordering like in the SELECT

@catalintomai catalintomai added Do Not Merge Should not be merged until this label is removed and removed 2: Dev Review Requires review by a core committer labels Jun 26, 2023
@ifoukarakis ifoukarakis added 2: Dev Review Requires review by a core committer and removed 2: Dev Review Requires review by a core committer labels Jun 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Do Not Merge Should not be merged until this label is removed
3 participants