In his 2016 post “Building a Mature Analytics Workflow,” our own Tristan Handy outlined the significant issues facing analytics teams, and how we might begin solving them.
The crux of the issue was that data practitioners clean and transform their datasets in isolation, as part of building their own analytics outputs (reports, notebooks, ML models, etc.).
This defeats the whole purpose of having a data team.
So how do we write analytics code collaboratively? Tristan looked to software engineering for inspiration:
Analytics doesn’t have to be this way. In fact, the playbook for solving these problems already exists — on our software engineering teams. The same techniques that software engineering teams use to collaborate on the rapid creation of quality applications can apply to analytics.
In that founding post, Tristan laid out 5 key techniques that analytics teams can adopt from software engineering teams.
Five years later, let’s explore how each of these transformations is playing out in practice:
1. Analytics code should be version controlled
The Before Times: Chaos! Analysts, engineers and data scientists saved queries to their personal machines, or at best checked them into a repository as a stored procedure. When the pipeline broke, analysts looked to the data engineers and vice versa, without anyone being sure which code change had broken it.
Today: Data people (analysts, analytics engineers, data engineers, data scientists) collaborate on analytics code in a shared git repository. We review each other’s code changes, and test them for quality before merging into production. No one can remember why we ever used to store our precious queries in a DO_NOT_DELETE folder on our desktop.
2. Analytics code should have quality assurance
The Before Times: Data teams were always on defense, reacting to issues in the data pipeline. Spot checks happened here and there, but we lacked programmatic testing of data transformations.
Today: Thousands of teams are bulking up their data transformation projects with dbt tests, breaking the cycle of reactivity that can plague our work with data. Catching bugs before they hit production means fewer hotfixes, more reliability, and more trust in your numbers.
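To make programmatic testing concrete, here is a minimal sketch in Python with an in-memory SQLite database. It is not dbt itself: it only mimics the convention dbt tests follow, where a test is a query that selects offending rows and passes when it returns none. The `orders` table, its columns, and the helper names are all hypothetical.

```python
import sqlite3


def not_null_test(table, column):
    """Build a query selecting rows where `column` is NULL (hypothetical helper)."""
    return f"SELECT * FROM {table} WHERE {column} IS NULL"


def unique_test(table, column):
    """Build a query selecting values of `column` that appear more than once."""
    return (
        f"SELECT {column}, COUNT(*) AS n FROM {table} "
        f"WHERE {column} IS NOT NULL GROUP BY {column} HAVING COUNT(*) > 1"
    )


def failing_rows(conn, sql):
    """Run a test query; an empty list means the test passes."""
    return conn.execute(sql).fetchall()


# Illustrative data with one duplicate order_id and one missing customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10), (2, 11), (2, 12), (3, None)],
)

results = {
    "unique_order_id": failing_rows(conn, unique_test("orders", "order_id")),
    "not_null_customer_id": failing_rows(conn, not_null_test("orders", "customer_id")),
}

for name, rows in results.items():
    status = "PASS" if not rows else f"FAIL ({len(rows)} offending row(s))"
    print(f"{name}: {status}")
```

Run before every merge, checks like these surface the duplicate and the missing value before anyone sees them in a report.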
3. Analytics code should be modular
The Before Times: Analysts lived in their own silos. How did she formulate that metric? How did they filter that dataset? Each new spreadsheet or report started with running the same queries and copying the same spreadsheet formulas. Everyone muttered about why it was so hard to share code.
Today: Analysts, data engineers and data scientists collaborate on a unified code base, where the work of each person can build directly on the work of others. One model can serve as the basis for several others. Template languages like Jinja help teams write more modular SQL. Less boilerplate work means more time to tackle big problems.
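As a rough illustration of one model feeding several, the sketch below uses Python and SQLite views. A plain Python function stands in for the kind of reusable snippet a Jinja macro would provide; it is not Jinja. The table, model, and function names are hypothetical.

```python
import sqlite3


def cents_to_dollars(column):
    """Reusable SQL fragment, standing in for what a template macro would do."""
    return f"ROUND({column} / 100.0, 2)"


# One staging model, two downstream models that both build on it.
# Dict order doubles as build order in this small example.
MODELS = {
    "stg_payments": (
        "SELECT id AS payment_id, order_id, "
        f"{cents_to_dollars('amount')} AS amount_usd FROM raw_payments"
    ),
    "order_totals": (
        "SELECT order_id, SUM(amount_usd) AS total_usd "
        "FROM stg_payments GROUP BY order_id"
    ),
    "big_orders": (
        "SELECT order_id FROM stg_payments "
        "GROUP BY order_id HAVING SUM(amount_usd) >= 20"
    ),
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_payments (id INTEGER, order_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO raw_payments VALUES (?, ?, ?)",
    [(1, 100, 1500), (2, 100, 1000), (3, 101, 950)],
)

# Materialize every model as a view so later models can select from earlier ones.
for name, sql in MODELS.items():
    conn.execute(f"CREATE VIEW {name} AS {sql}")

order_totals = conn.execute("SELECT * FROM order_totals ORDER BY order_id").fetchall()
big_orders = conn.execute("SELECT * FROM big_orders").fetchall()
print(order_totals, big_orders)
```

The cents-to-dollars logic lives in one place, and both downstream models inherit it through `stg_payments` rather than re-deriving it.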
4. Analytics should use environments
The Before Times: Many teams were running hot, testing in production and crossing their fingers that the daily marketing attribution report wouldn’t break on the next hotfix.
Today: Teams develop, test, and review changes or additions in a development environment, and see the results of those changes before committing them to production. You know, like software.
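One way to picture environments is the same build step pointed at different targets. The sketch below, in Python with SQLite, attaches two in-memory databases named `dev` and `prod` and builds a hypothetical `orders_by_status` model into whichever one is requested. The names and the candidate change are assumptions made for illustration.

```python
import sqlite3

CURRENT_SQL = "SELECT status, COUNT(*) AS n_orders FROM raw_orders GROUP BY status"
# A proposed change under review: exclude cancelled orders.
CANDIDATE_SQL = (
    "SELECT status, COUNT(*) AS n_orders FROM raw_orders "
    "WHERE status != 'cancelled' GROUP BY status"
)


def build(conn, target, model_sql):
    """Rebuild the model inside the given target schema ('dev' or 'prod')."""
    conn.execute(f"DROP TABLE IF EXISTS {target}.orders_by_status")
    conn.execute(f"CREATE TABLE {target}.orders_by_status AS {model_sql}")


def read(conn, target):
    """Fetch the built model from one environment, in a stable order."""
    return conn.execute(
        f"SELECT * FROM {target}.orders_by_status ORDER BY status"
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Attach the environments first; SQLite does not allow ATTACH inside a transaction.
conn.execute("ATTACH DATABASE ':memory:' AS dev")
conn.execute("ATTACH DATABASE ':memory:' AS prod")
conn.execute("CREATE TABLE raw_orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, "completed"), (2, "completed"), (3, "cancelled")],
)

build(conn, "prod", CURRENT_SQL)   # what stakeholders see today
build(conn, "dev", CANDIDATE_SQL)  # the change, built in isolation

dev_rows = read(conn, "dev")
prod_rows = read(conn, "prod")
print("dev: ", dev_rows)
print("prod:", prod_rows)
```

The candidate change can be inspected and reviewed in `dev` while `prod` keeps serving the current numbers.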
5. Analytics code should be designed for maintainability
The Before Times: New data routinely broke existing data models, without so much as an alert. When things did break down, teams had limited visibility across the stack to isolate the issue.
Today: By practicing modular data modeling, teams isolate the risk of changing source data to the single model that ingests that data. When that schema changes down the road, teams need to make only one update, which automatically propagates downstream to the rest of the pipeline.
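Here is a small sketch of that isolation, again in Python with SQLite and entirely hypothetical names. The source table renames a column from `user_id` to `customer_ref`; only the staging query is edited, and the downstream query runs unchanged.

```python
import sqlite3

# Downstream logic only ever reads from the staging model, never the raw table.
DOWNSTREAM_SQL = (
    "SELECT customer_id, COUNT(*) AS n_orders "
    "FROM stg_orders GROUP BY customer_id"
)


def run_pipeline(source_ddl, rows, staging_sql):
    """Build source table, staging view, and downstream view; return the final rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(source_ddl)
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    conn.execute(f"CREATE VIEW stg_orders AS {staging_sql}")
    conn.execute(f"CREATE VIEW customer_orders AS {DOWNSTREAM_SQL}")
    return conn.execute(
        "SELECT * FROM customer_orders ORDER BY customer_id"
    ).fetchall()


rows = [(1, 7), (2, 7), (3, 8)]

# Before the source change: the raw column is called user_id.
before = run_pipeline(
    "CREATE TABLE raw_orders (id INTEGER, user_id INTEGER)",
    rows,
    "SELECT id AS order_id, user_id AS customer_id FROM raw_orders",
)

# After the source change: the column is renamed, and only the staging query is updated.
after = run_pipeline(
    "CREATE TABLE raw_orders (id INTEGER, customer_ref INTEGER)",
    rows,
    "SELECT id AS order_id, customer_ref AS customer_id FROM raw_orders",
)

print(before, after)
```

Because the staging model renames source columns to stable names, everything downstream keeps working after a one-line fix.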
It’s not easy
This has been a mindset shift. Working collaboratively in public can be exposing.
But once the lightbulb clicks on, and teams get a taste of working collaboratively in the analytics engineering workflow, we find that they rarely go back to the old way of working.
Working like software engineers enables a data team to produce quality datasets faster, make data pipelines more reliable, and actually function as a team.