Not able to attend the great sessions on the third day of the dbt Coalesce conference? We've written down the highlights! There's 2 more days of content (catered to EMEA and APAC) that you can register to attend for free, but this is the last recap from our team.
After attending most of the sessions yesterday, here's what stood out to us the most.
The next evolution of the data stack is the metrics layer
We've heard a lot about the metrics layer over the past 6 months, but it's been hard to pin down exactly what that means. Fortunately, between the presentations from Benn Stancil, Chief Analytics Officer at Mode, and Drew Banin, Chief Product Officer at dbt Labs, that vision started to become a lot clearer.
In the current data framework, every tool that's building off your data has to define metrics separately. This results in a situation where each tool could potentially have important metrics (like Revenue, ROI, and Churn) get defined differently in every tool. Metrics also have to be recalculated at every level of aggregation, opening up room for error in situations where someone might average an average or improperly join the data.
Typically BI is built upon tables and columns. Metrics don't fit into that paradigm. You don't care about the table.... you care about the metric.
The future state with a metrics layer is an entirely different approach to analyzing data. In the current world, you have to think of all the tables the data lives in, join them together, and then aggregate the data at the level of specificity you desire. When metrics are pre-defined in a separate layer, you will be able to select metrics you want to look at first, then filter down those metrics by multiple dimensions, having the underlying SQL be auto-generated.
What really made it click for us was an example in SQL. In the future, instead of running the following query:
SELECT EXTRACT(MONTH FROM order_date) as order_month customer, sum(order_total)-sum(tax)-sum(discounts) as revenue FROM orders LEFT JOIN customers on customers.id = orders.customer_id WHERE EXTRACT(YEAR FROM order_date) = 2021 GROUP BY customer
we might instead run a query that looks like this:
SELECT revenue FROM metrics WHERE DIMENSIONS IN (customer, month, year)
These two statements would get the same end result, but one is a lot cleaner approach for inherently asking questions of your data and trying to find trends at varying levels of granularity. If you want to pull revenue by location instead of customer, you just substitute
country,region instead of having to build out entirely new join statements or worry about what tables the data lives in. The syntax for working with the metrics layer might vary, but the underlying concept remains the same. You select metrics first and the right tables/columns get pulled in as a result.
If implemented successfully, Benn thinks this will be the end of the "Modern Data Stack" and the beginning of the "Modern Data Experience" where new, innovate data apps are able to help teams create value at a faster pace.
We're excited to explore more of the technologies like Transform, Metricql, and dbt that are opening up possibilities for the metric layer!
The Next Era of Data is Polyglot
- In the 1970s SQL was introduced, letting teams access data in a warehouse.
- In the 1990s, spreadsheets took over as a way to allow anyone to manipulate data and move between computers easily.
- In the 2010s Python and R started taking over as a way to process massive amounts of data at scale.
- With the rise of the cloud warehouse, SQL gained popularity again towards 2017.
The technology will continue to evolve over time, but rather than choosing one language and sticking with it, we're seeing that data teams need to have the ability to work with multiple technologies hand-in-hand. Each technology has its own pros/cons and we should be maximizing the use of each.
It's not really possible to just use one language. Well, it is possible.... but in practice it might not be the most effective.
For those of you that have worked with Pandas dataframes - they're great for bringing data into Python! But sometimes the complex syntax structure can be overkill and you just want to be able to query your dataframe using some SQL.
In the example below, you could accomplish linear regression in SQL, but it's way simpler to run this type of logic in Python.
Whether you use the term polyglot or language agnostic, it's good to see others in the space agree that data workspaces and workflows should allow teams to build solutions in a flexible manner that combines languages that are best fit for the problem, rather than forcing a specific language.
Using the dbt meta config to power external workflows
Teghan Nightengal, Data Engineer at Community.com, walked through an intriguing use case of the dbt
meta field. In an effort to tie his team's dbt tests to Slack alerts, they wanted to define a set of parameters that should be used for sending a slack message if any data test fails, such as the recipients and the message. They wanted the capability to customize that slack message per test for a higher level of specificity.
Since anything in the meta field would be generated in the
manifest.json, their team was able to parse these files and trigger downsteam processes after all dbt commands were run. This parsing processes was built as part of a package called "Metahub" which should be open-sourced shortly.
Teghan mentioned that the team relied on the meta field, rather than exposures, because it allowed them to tie actions you want to take to individual tests, columns, or nodes. This gave the team more customization for the level of depth in which they could trigger actions internally.
We hadn't seen this approach used before and will definitely be running some tests of our own to see if we can better get Shipyard to react to the end results of dbt, beyond successes and failures.
New Technologies to Check Out
While the sessions were great, Day 3 resulted in an introduction of way more new technologies that we weren't aware of (and need to dig into over the holidays).
- Apache Arrow - Described as "Pandas Dataframes, but for any language and more performant". It sounds like a great way to help enhance the types of language-agnostic workflows we're helping customers build by allowing easier data sharing without having to rely on file generation/reading.
- Duck DB - An in-memory, serverless database that allows you to quickly manipulate data with SQL. This is another language-agnostic tool to quickly work with data across a workflow.
- Snowpark - Query and transform data from your Snowflake warehouse without the data ever having to leave Snowflake. This sounds like a great way to ease security concerns when working with a 3rd party vendor. There's still no Python API, but that should be coming soon according to Christiaan Kleinerman, Snowflake's SVP of Product.
- Materialize - Streaming database that you can query with SQL, giving you access to real-time data. We were really impressed by the speed and ease of the technology, especially since you only need to publish dbt models once and the source continues to update.
- Matik - New presentation SaaS that enables you to embed data from your warehouse directly into a Powerpoint or Google Slide.
- dbt-core v1.0.0 - The latest and greatest update to dbt Core includes loads of improvements, including being able to define metrics.
That's the end of our Coalesce recaps! We hope this gives you a taste of what you missed out on from day 3 of the conference.
And be sure to check out some of freebies you can get from attending this free conference!