Note:
dbt, Data Build Tool, is an open-source package developed by Fishtown Analytics, that applies the principles of software engineering to analytics...
hence the recent terminology "Analytics Engineering."
Mainly, dbt helps in creating positive analytics engineering in the Transformation
in ELT (Extract, Load, Transform)
Source
We have this fake data ingested from a fake API Pizza-is-Cool-blog.com.
[{
"id": 1,
"first_name": "Reamonn",
"last_name": "Dreier",
"email": "rdreier0@tinyurl.fake",
"gender": "Male",
"ip_address": "34.14.555.55"
}, {
"id": 2,
"first_name": "Ulric",
"last_name": "Skarr",
"email": "uskarr1@csmonitor.yahoo",
"gender": "Male",
"ip_address": "Unknown"
}, {
"id": 3,
"first_name": "Ransom",
"last_name": "Doe",
"email": "rspeirs2@usgs.nah",
"gender": "Male",
"ip_address": 'N/A'
}]
We may want to do some cleaning/transformations within our extraction. (Of the older ETL philosophy)
#PseudoCode.
# Note: Also pretend we extracted the data in another script or unseen function call.
# import json
# pizzaJson = json.loads('Resources/Pizza/Is/Cool/Fam/api_pizza.json)
### Etc cleaning, removing corner cases....
### Blah....
# Then use another script/function to push data to database
What could be some disadvantages in this case?
-- Raw data in base_pizza_users_logins.sql file.
-- Note: We do not necessarily have raw data.
WITH source AS (
SELECT * FROM {{ source('API_PIZZA', 'Logins')}}
)
SELECT * FROM source
Next, we clean up the data for pre-reporting data model creation
-- Next layer of data model called staging_pizza_user_logins.sql
WITH raw_source AS (
SELECT
id AS user_id,
first_name,
last_name,
email,
gender,
CASE
WHEN ip_address = 'Unknown' OR ip_address = 'N/A'
THEN NULL
ELSE ip_address
END AS ip_address
FROM {{ ref('base_pizza_users_logins')}}
)
SELECT * FROM raw_source
Create Macros/User Defined Functions (UDF) and additional optimizations to SQL logic
Example
{% macro haversine_distance_calculation(column_long1, column_lat1, column_long2, column_lat2 ) %}
DEGREES(
ACOS(
LEAST(1.0,
COS(RADIANS({{column_lat1}})) * COS(RADIANS({{column_lat2}})) *
COS(RADIANS({{column_long1}}) - RADIANS({{column_long2}})) +
SIN(RADIANS({{column_lat1}})) * SIN(RADIANS({{column_lat2}}))
)
)
)
{% endmacro %}
What is the source of truth for metrics, definitions, and more?
Transformations or "feature engineering" can be created in the Mart layer, or another location, as the designated zone for source of truth of core business logic.
“Documentation is a love letter to your future self” - Damian Conway
Information about your project: including model code, a DAG of your project, any tests you’ve added to a column, and more.
yml
version: 2
models:
- name: household_demographics_dim
columns:
- name: hd_demo_sk
description: "Blah"
- name: hd_income_band_sk
description: "Blah"
- name: hd_buy_potential
description: "Blah"
- name: hd_dep_count
description: "Blah"
- name: hd_dep_count
description: "Blah"
- name: hd_vehicle_count
description: "Blah"
- name: income_band_bound
description: "Blah"
- name: customers_dim
meta:
contains_pii: true
In the meantime,
— dbt (@getdbt) September 8, 2020