Back to the Presentation Page: https://raulingaverage.dev/Presentations

Analytics Engineering with

dbt

A little bit about me

Note:

  • I am not a dbt core-contributor, but rather a user
  • None of the information or examples today is a reflection of and my own work @Autodesk
  • I hope you and your families continue to be safe and healthy during these times!
    • Moreover, I encourage y’all to Social Distance, Wear Masks, and support social-movements in the upcoming future

What will we be talking about today?

dbt

What is dbt?

(In regards to technology)
Thinking

dbt, Data Build Tool, is an open-source package developed by Fishtown Analytics, that applies the principles of software engineering to analytics...

hence the recent terminology "Analytics Engineering."

Mainly, dbt helps in creating positive analytics engineering in the Transformation in ELT (Extract, Load, Transform) Lifecycle Source

What's in it for me?

Evil Grin

Let's check out one scenario

We have this fake data ingested from a fake API Pizza-is-Cool-blog.com.

[{
  "id": 1,
  "first_name": "Reamonn",
  "last_name": "Dreier",
  "email": "rdreier0@tinyurl.fake",
  "gender": "Male",
  "ip_address": "34.14.555.55"
}, {
  "id": 2,
  "first_name": "Ulric",
  "last_name": "Skarr",
  "email": "uskarr1@csmonitor.yahoo",
  "gender": "Male",
  "ip_address": "Unknown"
}, {
  "id": 3,
  "first_name": "Ransom",
  "last_name": "Doe",
  "email": "rspeirs2@usgs.nah",
  "gender": "Male",
  "ip_address": 'N/A'
}]

We may want to do some cleaning/transformations within our extraction. (Of the older ETL philosophy)

#PseudoCode.
# Note: Also pretend we extracted the data in another script or unseen function call.

# import json
# pizzaJson = json.loads('Resources/Pizza/Is/Cool/Fam/api_pizza.json)
### Etc cleaning, removing corner cases....
### Blah....


# Then use another script/function to push data to database

What could be some disadvantages in this case?

Case 1. Do we actually have 'raw' data?

-- Raw data in base_pizza_users_logins.sql file.
-- Note: We do not necessarily have raw data.
WITH source AS (

    SELECT * FROM {{ source('API_PIZZA', 'Logins')}}

)

SELECT * FROM source

Case 2. Transformations, on transformations?

Next, we clean up the data for pre-reporting data model creation

-- Next layer of data model called staging_pizza_user_logins.sql
WITH raw_source AS (

    SELECT 
        id AS user_id, 
        first_name, 
        last_name, 
        email, 
        gender, 
        CASE 
            WHEN ip_address = 'Unknown' OR ip_address = 'N/A'
            THEN NULL
            ELSE ip_address
        END AS ip_address
    FROM {{ ref('base_pizza_users_logins')}}

)

SELECT * FROM raw_source

Case 3. How are these queries stored?

  • Are they temporarily created to accomplish a reporting effort?
  • Where are these queries stored?
  • Where is the business logic upheld?

Again, What's in it for me?

  • D.R.Y
  • Unified logic
  • Quality and Assurance (Maybe insurance too?)

Don't Repeat Yourself (D.R.Y)

Version Control Data Models & Logic

SQL Queries

Templating

Jinja Statements <3

In dbt, you can combine SQL with Jinja, a templating package in Python.

Using Jinja turns your dbt project into a programming environment for SQL, giving you the ability to do things that aren’t normally possible in SQL. For example, with Jinja you can:

dbt Source

Macros

Create Macros/User Defined Functions (UDF) and additional optimizations to SQL logic

Example

{% macro haversine_distance_calculation(column_long1, column_lat1, column_long2, column_lat2 ) %}
    DEGREES(
        ACOS(
            LEAST(1.0,
                 COS(RADIANS({{column_lat1}})) * COS(RADIANS({{column_lat2}})) *
                 COS(RADIANS({{column_long1}}) - RADIANS({{column_long2}})) +
                 SIN(RADIANS({{column_lat1}})) * SIN(RADIANS({{column_lat2}}))
            )
        )
    )
{% endmacro %}

Unified Logic

Source of truth

What is the source of truth for metrics, definitions, and more?

Transformations or "feature engineering" can be created in the Mart layer, or another location, as the designated zone for source of truth of core business logic.

Documentation

“Documentation is a love letter to your future self” - Damian Conway

Information about your project: including model code, a DAG of your project, any tests you’ve added to a column, and more.

Documentation

yml
version: 2

models:

  - name: household_demographics_dim
    columns:
      - name: hd_demo_sk
        description: "Blah"
      - name: hd_income_band_sk
        description: "Blah"
      - name: hd_buy_potential
        description: "Blah"
      - name: hd_dep_count
        description: "Blah"
      - name: hd_dep_count
        description: "Blah"
      - name: hd_vehicle_count
        description: "Blah"
      - name: income_band_bound
        description: "Blah"
  - name: customers_dim
    meta:
      contains_pii: true

Quality and Assurance (Maybe insurance too?)

Tests

dbt Test SQL Assertions or other SQL-related testing

  • Logging
  • Alerting

And more!

Again, there are other powerful dbt features

But I'll end this as a cliffhanger...

In the meantime,