Enhancing cross-product insights within dbt workflows
For multi-product companies, one critical metric is often what is called “cross-product adoption”. (i.e. understanding how users engage with multiple offerings in a given product portfolio)
One measure suggested to calculate cross-product or cross-feature usage in the popular book Hacking Growth [1] is the Jaccard Index. Traditionally used to measure the similarity between two sets, the Jaccard Index can also serve as a powerful tool for assessing product adoption patterns. It does this by quantifying the overlap in users between products, you can identify cross-product synergies and growth opportunities.
A dbt package dbt_set_similarity is designed to simplify the calculation of set similarity metrics directly within an analytics workflow. This package provides a method to calculate the Jaccard Indices within SQL transformation workloads.
To import this package into your dbt project, add the following to the packages.yml
file. We will also need dbt_utils for the purposes of this articles example. Run a dbt deps
command within your project to install the package.
packages:
- package: Matts52/dbt_set_similarity
version: 0.1.1
- package: dbt-labs/dbt_utils
version: 1.3.0
The Jaccard Index, also known as the Jaccard Similarity Coefficient, is a metric used to measure the similarity between two sets. It is defined as the size of the intersection of the sets divided by the size of their union.
Mathematically, it can be expressed as:
Where:
- A and B are two sets (ex. users of product A and product B)
- The numerator represents the number of elements in both sets
- The denominator represents the total number of distinct elements across both sets
The Jaccard Index is particularly useful in the context of cross-product adoption because:
- It focuses on the overlap between two sets, making it ideal for understanding shared user bases
- It accounts for differences in the total size of the sets, ensuring that results are proportional and not skewed by outliers
For example:
- If 100 users adopt Product A and 50 adopt Product B, with 25 users adopting both, the Jaccard Index is 25 / (100 + 50 — 25) = 0.2, indicating a 20% overlap between the two user bases by the Jaccard Index.
The example dataset we will be using is a fictional SaaS company which offers storage space as a product for consumers. This company provides two distinct storage products: document storage (doc_storage) and photo storage (photo_storage). These are either true, indicating the product has been adopted, or false, indicating the product has not been adopted.
Additionally, the demographics (user_category) that this company serves are either tech enthusiasts or homeowners.
For the sake of this example, we will read this csv file in as a “seed” model named seed_example
within the dbt project.
Now, let’s say we want to calculate the jaccard index (cross-adoption) between our document storage and photo storage products. First, we need to create an array (list) of the users who have the document storage product, alongside an array of the users who have the photo storage product. In the second cte, we apply the jaccard_coef
function from the dbt_set_similarity
package to help us easily compute the jaccard coefficient between the two arrays of user id’s.
with product_users as (
select
array_agg(user_id) filter (where doc_storage = true)
as doc_storage_users,
array_agg(user_id) filter (where photo_storage = true)
as photo_storage_users
from {{ ref('seed_example') }}
)select
doc_storage_users,
photo_storage_users,
{{
dbt_set_similarity.jaccard_coef(
'doc_storage_users',
'photo_storage_users'
)
}} as cross_product_jaccard_coef
from product_users
As we can interpret, it seems that just over half (60%) of users who have adopted either of products, have adopted both. We can graphically verify our result by placing the user id sets into a Venn diagram, where we see three users have adopted both products, amongst five total users: 3/5 = 0.6.
Using the dbt_set_similarity
package, creating segmented jaccard indices for our different user categories should be fairly natural. We will follow the same pattern as before, however, we will simply group our aggregations on the user category that a user belongs to.
with product_users as (
select
user_category,
array_agg(user_id) filter (where doc_storage = true)
as doc_storage_users,
array_agg(user_id) filter (where photo_storage = true)
as photo_storage_users
from {{ ref('seed_example') }}
group by user_category
)select
user_category,
doc_storage_users,
photo_storage_users,
{{
dbt_set_similarity.jaccard_coef(
'doc_storage_users',
'photo_storage_users'
)
}} as cross_product_jaccard_coef
from product_users
We can see from the data that amongst homeowners, cross-product adoption is higher, when considering jaccard indices. As shown in the output, all homeowners who have adopted one of the product, have adopted both. Meanwhile, only one-third of the tech enthusiasts who have adopted one product have adopted both of the products. Thus, in our very small dataset, cross-product adoption is higher amongst homeowners as opposed to tech enthusiasts.
We can graphically verify the output by again creating Venn diagram:
dbt_set_similarity provides a straightforward and efficient way to calculate cross-product adoption metrics such as the Jaccard Index directly within a dbt workflow. By applying this method, multi-product companies can gain valuable insights into user behavior and adoption patterns across their product portfolio. In our example, we demonstrated the calculation of overall cross-product adoption as well as segmented adoption for distinct user categories.
Using the package for cross-product adoption is simply one straightforward application. In reality, there exists countless other potential applications of this technique, for example some areas are:
- Feature usage analysis
- Marketing campaign impact analysis
- Support analysis
Additionally, this style of analysis is certainly not limited to just SaaS, but can apply to virtually any industry. Happy Jaccard-ing!
References
[1] Sean Ellis and Morgan Brown, Hacking Growth (2017)
Resources
dbt package hub