dbt and the Modern Data Team
Building a data science portfolio means owning the entire data stack, from raw data to the final product (e.g. a Tableau dashboard), and I have most often relied on an extract, transform, load (ETL) process to prepare data for analysis. However, this approach created complications: it lacked both the modularity to experiment with transformations without compromising data integrity and the transparency to trace where in the process a particular error originated. Traditional data teams have faced similar challenges over the past decade, but recent advances such as cloud-based data warehouses and data pipeline services have accelerated an industry shift from ETL to ELT.
In addition, a new title, analytics engineer, has become increasingly common to describe the person who owns the entire data stack, as I had for my past projects. I recently discovered data build tool (dbt), a popular ELT transformation tool that enables data analysts and engineers to transform, test, and document data in their cloud warehouse, and decided to revisit my most recent project by rebuilding the data stack from scratch as an analytics engineer using BigQuery and dbt.
Modular data modeling in dbt allows analytics engineers to build dependencies between sources (defined in YML files) and models (SQL files written as select statements) using dbt functions such as source() and ref(). As a result, source-centric models, such as staging models, can easily be recombined as needed into business-centric mart models that inform decisions downstream. The ref() function also replaces the error-prone practice of hard-coding table names, while the source() function allows dbt to check the freshness of raw tables in the data warehouse with the dbt source freshness command.
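To make this layering concrete, here is a minimal sketch of a staging model and a mart model; the jaffle_shop source name, table names, and columns are illustrative assumptions rather than details from the original project.

```sql
-- models/staging/stg_orders.sql (hypothetical names)
-- Staging model: selects from a raw table declared as a source in a YML file,
-- applying light renaming so downstream models never reference raw tables directly.
with source as (
    select * from {{ source('jaffle_shop', 'orders') }}
),

renamed as (
    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status
    from source
)

select * from renamed
```

```sql
-- models/marts/fct_orders.sql (hypothetical names)
-- Mart model: combines staging models via ref(), which also lets dbt infer the
-- dependency graph instead of relying on hard-coded table names.
with orders as (
    select * from {{ ref('stg_orders') }}
),

payments as (
    select * from {{ ref('stg_payments') }}
),

order_payments as (
    select order_id, sum(amount) as amount
    from payments
    group by 1
)

select
    orders.order_id,
    orders.customer_id,
    orders.status,
    coalesce(order_payments.amount, 0) as amount
from orders
left join order_payments using (order_id)
```

Because the mart model only references other models through ref(), it can be rebuilt on top of revised staging logic without any table names being rewritten by hand.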
dbt also provides transparency into models, sources, and their interdependencies by automatically generating documentation, including descriptions of models, sources, and columns, as well as a directed acyclic graph (DAG) that visualizes the dependencies between them. Combined with documentation the user adds in YML files close to the actual code, these resources empower other stakeholders in the organization to answer their own questions and thereby reduce the burden on analytics engineers.
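For example, a schema YML file alongside the models can carry the descriptions that the documentation site and DAG surface; the model and column names here are again assumed for illustration.

```yml
# models/staging/stg_orders.yml (hypothetical names)
version: 2

models:
  - name: stg_orders
    description: "One row per order, lightly cleaned from the raw orders table."
    columns:
      - name: order_id
        description: "Primary key for orders."
      - name: status
        description: "Current order status (e.g. placed, shipped, returned)."
```

Running dbt docs generate builds the documentation site from these files, so descriptions live next to the code they describe.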
Robust testing functionality in dbt identifies exactly where a problem occurred and prevents further errors by verifying assertions about sources and models. Generic tests (e.g. accepted_values, not_null) are specified in YML files and run on particular columns of a model, while singular tests are written as a SQL query run against an entire model, as in the example below verifying that a credit card balance has not exceeded its limit. These tests automatically catch broken models before they cause problems downstream, rather than the team only learning about an issue from an analyst after something has gone wrong.
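As a sketch (with hypothetical model, table, and column names), a generic test is declared in YML, while the singular credit-limit test is simply a select statement saved under the tests/ directory; dbt treats any rows returned by that query as failures.

```yml
# models/staging/stg_payments.yml (hypothetical names)
version: 2

models:
  - name: stg_payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: payment_method
        tests:
          - accepted_values:
              values: ['credit_card', 'bank_transfer', 'gift_card']
```

```sql
-- tests/assert_balance_within_limit.sql (hypothetical names)
-- Singular test: fails if any account's balance exceeds its credit limit.
select
    account_id,
    balance,
    credit_limit
from {{ ref('fct_credit_accounts') }}
where balance > credit_limit
```

Running dbt test executes both kinds of tests and reports exactly which assertion failed and on which model.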
Additionally, separate development and deployment environments allow dbt to run on a schedule in deployment, providing up-to-date model outputs to power dashboards, reporting, and other key business decisions. Meanwhile, analytics engineers can continue working on the dbt project in the development environment, fixing potential issues without affecting end users.
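With dbt Core, one way to express this separation is through targets in profiles.yml (dbt Cloud configures the same idea as named environments); the project, dataset, and keyfile values below are placeholders.

```yml
# profiles.yml (illustrative sketch for BigQuery)
my_project:
  target: dev                     # default target used while developing
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project     # placeholder GCP project
      dataset: dbt_dev            # personal development dataset
      threads: 4
    prod:
      type: bigquery
      method: service-account
      keyfile: /path/to/keyfile.json
      project: my-gcp-project
      dataset: analytics          # production dataset read by dashboards
      threads: 4
```

A scheduled job can then run something like dbt build --target prod against the production dataset, while day-to-day development defaults to the dev target.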
dbt empowers data teams through transparent modeling, testing, and documentation features that make the analytics data stack accessible to engineers and analysts alike. Its modularity lets models be recombined so that end users can answer their own questions or consult the automatically generated dbt documentation site to explore the datasets further, improving organizational efficiency. Meanwhile, its robust testing functionality pinpoints potential issues, which can then be swiftly resolved in a separate development environment so the organization never misses a beat.