SQL Readability (2020)
Scenario: You submitted a Pull Request (PR) for a sophisticated contribution to production (prod), that you spent several hours working on.
Kudos to you!
But then a code reviewer does not approve your PR.
Your query worked, so what happened?
It may turn out that you’re writing style may not be up-to-par with the team’s standards.
And that’s not your fault! It’s something you should not be ashamed about, but understanding about.
But why? First…
What is Readability?
Readability is the style, standardization, and practice for “readable” code. An analogy, it’s like correcting for punctuation in a writing practice.
The enforcement of readability practices, typically by teams, are tied to code reviewing between peers to push code to production.
Why should I encourage readability best practices?
From the introduction, I emphasized that coding practices are enforced by teams in agreement for better practices, standards, and alignment to contribute value to a product or organization. But what’s in it for the team? What’s in it for you
Let’s look at a scenario.
The Scenario
When a developer initially writes the code their knowledge of the system is very detailed and complex. And in it, they’ve spent a substantial amount of time to know the logical conditions, output, and corner cases–well hopefully.
1 year later, another person would like to change this query due to updated business logic regarding the query implementation.
Recall the logic being complex? Well, this developer added another layer of complexity as well–poor formatting, which can actually put-off, confuse, or just reduce productivity for the current individual.
This new developer may experience:
- Difficult to understand logic and outputs
- Longer to debug
- Maintenance issues
- The complexity of reading code
With readability practices, we alleviate the complexity of reading the code for the developer.
This can enable:
- Increased Productivity
- More time to focus on the core logic.
- Quicker turnover of implementations
Why am I writing about this?
Being an upcoming analyst, I didn’t mind code readability along with the content I produced.
However, one job I submitted SQL, Python, and Javascript PR’s (Change Lists[CL]) to code production, and I learned the impact of readability for teams, and the long-term benefits from it.
Being with Autodesk, I wanted to drive impact to both colleagues and individuals that should be encouraged to enforce these same practices, as we evolve in analytics and (soft) engineering.
Now,
SQL Readability
The Scenario
Let’s say we want to have the total amount of logins for the past 7 days of Myspace data per user, and some attributes along with that.
We create a workable query using ASCII SQL syntax.
select
id, createdDate, events.login_id, geoState, firstName, lastName, sum(login.logIn) total_logins
from mySpace_events as events
LEFT OUTER JOIN mySpace_users users
on users.id = events.user_id
WHERE login.date >= current_date - (select min(createdDate) from mySpace_users where profile = 'Musician')
group by 1,2,3,4,5,6
Statement Syntax
-
Upper case
declarative statements withing your SQL queries. This allows for another dimension to mentally filter out non-essential information when going to your main target–the logic. -
Have the statements be on
Seperate Lines
-
Ensure particular words are reserved in your SQL queries.
- For example, some reserve words are
SELECT
,INNER
,JOIN
–so don’t use those words as columns.
- For example, some reserve words are
Here, we modified the query to have declarative statements be on seperate lines and capitalize said statements.
SELECT
id, createdDate, events.login_id, geoState, firstName, lastName, SUM(login.logIn) total_logins
FROM mySpace_events AS events
LEFT OUTER JOIN mySpace_users users ON users.id = events.user_id
WHERE login.date >= CURRENT_DATE - (SELECT MIN(createdDate) FROM mySpace_users WHERE profile = 'Musician')
GROUP BY 1,2,3,4,5,6
Spacing
“Do not optimize for a smaller number of lines of code. newlines are cheap, brain time is expensive” - Source
SELECT
id,
createdDate,
events.login_id,
geoState,
firstName,
lastName,
SUM(login.logIn) total_logins
FROM mySpace_events AS events
LEFT OUTER JOIN mySpace_users users ON
users.id = events.user_id
WHERE login.date >= CURRENT_DATE - (SELECT MIN(createdDate) FROM mySpace_users WHERE profile = 'Musician')
GROUP BY 1,2,3,4,5,6
80 Characters
Here, I am abiding by past enforcement of reading ergonomics that I learned from the engineering community. However, I think there is something more, historical to this…seen here.
Here, the long-winded sub-query went from one line, to multiple lines, within an 80 character limit.
SELECT
id,
createdDate,
events.login_id,
geoState,
firstName,
lastName,
SUM(login.logIn) total_logins
FROM mySpace_events AS events
LEFT OUTER JOIN mySpace_users users
ON users.id = events.user_id
WHERE login.date >= CURRENT_DATE - (
SELECT
MIN(createdDate)
FROM mySpace_users
WHERE profile = 'Musician'
)
GROUP BY 1,2,3,4,5,6
Subqueries and WITH statements
WITH earliest_musician_user_date AS (
SELECT
MIN(createdDate)
FROM mySpace_users
WHERE profile = 'Musician'
)
SELECT
id,
createdDate,
events.login_id,
geoState,
firstName,
lastName,
SUM(login.logIn) total_logins
FROM mySpace_events AS events
LEFT OUTER JOIN mySpace_users users
ON users.id = events.user_id
WHERE login.date >= CURRENT_DATE - earliest_musician_user_date
GROUP BY 1,2,3,4,5,6
Aliases
Don’t pass on the AS
statement. With adding the alias, we are ensuring we are explicit with defining nick-names of ingested information, to what we can expect for output information.
Here, we move the subqueries from the bottom of the list to the top, and then call the reference in the same location.
WITH earliest_musician_user_date AS (
SELECT
MIN(createdDate)
FROM mySpace_users
WHERE profile = 'Musician'
)
SELECT
id,
createdDate,
events.login_id,
geoState,
firstName,
lastName,
SUM(login.logIn) AS total_logins
FROM mySpace_events AS events
LEFT OUTER JOIN AS mySpace_users users ON
users.id = events.user_id
WHERE login.date >= CURRENT_DATE - earliest_musician_user_date
GROUP BY 1,2,3,4,5,6
Camel Case or Snake?
In other coding language readability guidelines, there is an enforcement of code readability. Particularly, in Python PEP 8 variable names are set to camelCase
and file names are set to snake_case
.
with SQL, we have our naming conventions be ‘Snake” conventions.
Here, we have all variable names be snake case (e.g. createdDate
$\rightarrow$ created_date
), along with other columns.
WITH earliest_musician_user_date AS (
SELECT
MIN(created_date)
FROM mySpace_users
WHERE profile = 'Musician'
)
SELECT
id,
created_date,
events.login_id,
geo_state,
first_name,
last_name,
SUM(login.logIn) AS total_logins
FROM mySpace_events AS events
LEFT OUTER JOIN AS myspace_users users
ON users.id = events.user_id
WHERE login.date >= CURRENT_DATE - earliest_musician_user_date
GROUP BY 1,2,3,4,5,6
Comments
Both in the beginning of the code snippet to say what data the snippet pulls, the business logic it includes, and throughout, as much as possible, to reference subqueries. SQL makes it hard to read through an entire piece of code without running specific pieces one by one to get where you’re going and comments will help jog the memory. Include them at the end of lines in code, or in blocks at the beginning of code.
Here, we add a block-quote comment /* */
and single-line comments --
to clarify logic
/*
* Querying the Myspace Events table with user information for the time
* after the first Musician profile-user was created.
*/
WITH earliest_musician_user_date AS (
SELECT
MIN(created_date)
FROM mySpace_users
WHERE profile = 'Musician'
)
SELECT
id,
created_date,
events.login_id,
geo_state,
first_name,
last_name,
SUM(login.logIn) AS total_logins -- Obtaining total logins
FROM mySpace_events AS events
LEFT OUTER JOIN AS myspace_users users
ON users.id = events.user_id
WHERE login.date >= CURRENT_DATE - earliest_musician_user_date
GROUP BY 1,2,3,4,5,6
There are other things to refine here, but let’s stop here.
e.g. GROUP BY names should be actual names of columns, when code is in production
Now, take a look at the query above compared to the original one?
select
id, createdDate, events.login_id, geoState, firstName, lastName, sum(login.logIn) total_logins
from mySpace_events as events
LEFT OUTER JOIN mySpace_users users
on users.id = events.user_id
WHERE login.date >= current_date - (select min(createdDate) from mySpace_users where profile = 'Musician')
group by 1,2,3,4,5,6
Isn’t that better?!
Recap
We learned by encouraging a readability practice for submitting code to a team’s repository, production, developers do not have to face:
- Difficult to understand logic and outputs
- Longer to debug
- Maintenance issues
- The complexity of reading code
And possible enable:
- Increased Productivity
- More time to focus on the core logic.
- Quicker turnover of fixes or tasks
Now,
Never RIGHT, move it to the LEFT..and SQL ON…
Bonus
SQL Linters
VS Code
- https://marketplace.visualstudio.com/items?itemName=sensourceinc.vscode-sql-beautify
Mode Analytics
Resources
- Unofficial: My own past experience on SQL Readability form 2018 learnings
Note: Company never open-sourced Readability standards for SQL, but has for others
- PEP 8