Why Learn dbt
At Shipyard, one of our biggest target audiences is data engineers. Our core platform is designed to help anyone rapidly construct pipelines and automate processes throughout the data lifecycle.
But in my day-to-day as an engineer on the team, I'm not actually doing any data engineering myself. I wanted to learn more and get some first-hand experience from the perspective of our users, understanding the technology that powers their everyday work.
To that end I enrolled in the co:rise course on analytics engineering and dbt since dbt is a huge tool in the data community.
Where I'm Coming From
I'm an engineer at Shipyard, working full-stack building out all aspects of the application. While I had a high-level understanding of the data analytics and engineering landscape, it was primarily from the perspective of the tools that we rope together within our Shipyard-powered pipelines.
My focus has been on how things stitch together not so much which tools or why they should be stitched together. Occasionally when helping our users develop solutions I'd gain some insight into particular pain points but nothing holistic. The course offered an opportunity to get some hands-on experience working with a tool addressing at least one segment of the data lifecycle - in this case the "transform" segment with dbt.
With regards to dbt specifically, I also had a very basic understanding of what it was and how it operated. I basically conceptualized it as a way to create “pipelines” for collecting data into a format that was easily digestible. Another way I looked at it was as a sort of “SQL wrangler” that helped to organize disparate SQL queries and statements into one, cohesive project.
In any case, I also understood dbt to be the clear leader in the “transform” space (the "T" of ETL).
Data Analytics Industry
While I previously knew some things about the space, the class expanded them.
First is that the industry is in its infancy (if you haven't heard that said a million times already). This means that industry participants are still working through very basic things like best practices and even how to define different roles.
Another core point is that analytics tools must be cloud based. Removing the need for lower-level infrastructure maintenance coupled with the increasing processing capacity in the cloud allows analysts and engineers to move rapidly.
Lastly, there is an array of different roles available at various points within the data lifecycle. Data engineers might be responsible for the infrastructure to collect and move data while data analysts are responsible for maintaining data structure and integrity with business analysts pulling that data to make decisions. Separate from these two is the analytics engineer which sort of sits in the middle and is responsible for the “glue” between the other two roles - connecting the various tools and services together to pull data from sources and deliver it to users - which was more the focus of this class.
Settling in to learn a tool in one of these roles would be value-added in my daily work.
Understanding dbt Core
The primary tool taught in the course was dbt Core. dbt (data build tool) is an open-source Python framework and CLI use to compile SQL statements into an executable DAG. As mentioned previously, it's one of the most prominent players in the "transform" section of ELT and has a rapidly growing user base.
Under the hood, dbt is a relatively simple tool essentially weaving together a series of SQL
SELECT statements ("models" in dbt terminology) into a comprehensive DAG. A number of features allow the user to reduce boilerplate and duplication throughout their queries. In the class example of running the analytics team for a small startup, a number of these features were able to be used such as tests to ensure data integrity or macros to remove large chunks of repetitive query statements.
All of this felt very familiar as a software engineer as principles such as DRY, test coverage, and logging were all being adhered to by using dbt. For example, in my day-to-day software development work, I’ll be writing tests (in a
*_test.go file for example) that I run individually as I make changes or together with all unit tests to ensure that I haven’t broken anything. In dbt, a similar concept is at play where running
dbt test runs a series of
*.sql files you’ve configured that ensure the data is in the format you expect it to be in - for example, checking to make sure there are no null values.
So while the subject matter and tool were new, the concepts being applied were not.
After completing the course, I had a few takeaways.
- dbt is a powerful but simple tool: applying software development best practices to data analytics processes is super helpful. The more things that can be scripted, tested, and automated, the less room there is for human error and the more rapidly a team can scale their data operations. This is a paradigm something I would expect to see more tools moving towards.
- Small features can create significant value: using common features in the ubiquitous SQL highlighted how value can be created with small improvements on existing tools. For example, the chaining together of statements helps to break up and increase comprehensibility of complex SQL statements. Others like the built-in templating language allows for dynamic statements while
snapshotscreate a “time-lapse” of changes to a table with nothing more than, you guessed it, a simple
selectstatement! My initial image of dbt as a “SQL wrangler” wasn’t far off but it definitely didn’t do justice to the feature set available in the tool.
- "Push" is missing from dbt: one area where the tool appears to fall short is that even after the DAG completes its run, the end consumers of the output data still need to fetch it. This is a "pull"-based model as opposed to a "push"-based one that could trigger actions based on its outcome. This may be an area that the dbt development team focuses on going forwards or some other player in the data space.
- Data teams must realize value quickly: while data utilization is growing, data teams need to be fast and nimble, quickly generating useful data to draw insights from, in order prove their value. It's still a young industry and small and large companies may be hesitant to invest the necessary resources. Tools that focus on reducing the time-to-value are what data analytics teams should gravitate towards.
- Declarative tooling is a work in progress: an ideal state for data pipeline workflows would be instructing the tools what to do while avoiding having to go into detail on how to do it. Declarative versus imperative. This becomes sharply into focus with the explosion of tools in the data space and the need to coordinate all of them. While dbt is a powerful tool, it does come with a significant upfront time investment cost to setup and get running even for relatively simple projects.
All in all, it was great getting a first-hand perspective on the data space. Learning about the internal workings of the data lifecycle as well as the roles and tools associated with it was eye opening. It's definitely a young industry with a lot of potential