I love anything that helps me get deep, anxiety-free, sleep. Dog-on-air-vent, cat-in-window kind of sleep.
Back in September 2010, for my first job in analytics, I careened into the office at 6:30am every day, unsure of what issues might be awaiting me. My assignment was to sign off on risk analytics reporting for BlackRock’s banking clients.
Before analytics data made it to my team, it traveled through what must’ve been at the time one of the world’s most complex ETL workflows.
The early wakeup time wasn’t really the issue—it was the shallowness of the sleep I did get, knowing that when I woke up I might be drawn into a time-sensitive flurry of debugging.
ETL is the process of ingesting data into a centralized data warehouse by extracting it from its original source, transforming or normalizing it for upload, and finally loading it into the warehouse.
This initial data extraction process forms the basis of almost every downstream data & analytics workflow (reporting, ML modeling, exploratory analysis, operational workflows).
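The three stages can be sketched in a few lines. This is a toy illustration, not a real pipeline: the source system and warehouse are stubbed out as in-memory Python objects, and the field names are invented.

```python
# A minimal sketch of the three ETL stages, with the source API and
# warehouse stubbed as in-memory data (both are hypothetical here).

def extract():
    # Pull raw records from a source system (stubbed as a list of dicts)
    return [
        {"Order ID": "1001", "Total": "25.00"},
        {"Order ID": "1002", "Total": "40.50"},
    ]

def transform(raw_rows):
    # Normalize column names and types so rows match the warehouse schema
    return [
        {"order_id": int(r["Order ID"]), "total": float(r["Total"])}
        for r in raw_rows
    ]

def load(rows, warehouse):
    # Append the cleaned rows to the target table (stubbed as a dict of lists)
    warehouse.setdefault("orders", []).extend(rows)

warehouse = {}
load(transform(extract()), warehouse)
```

In a real pipeline each stub would be replaced by API calls, a transformation layer, and warehouse inserts, but the shape of the work is the same.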
Our ETL process ingested data from thousands of sources, using a suite of proprietary tools + custom scripts, all running in on-prem data centers. A couple hundred people around the world monitored this process as the US team slept—contacting vendors to track down missing data, fixing data formatting issues, and generally keeping the ETL machine running.
The workflow looked roughly as follows:
Fast forward to today, and this work can largely be accomplished reliably with modern ETL tools running on cloud data warehouses. In fact, my former employer does just that.
Crucially, these tools are accessible to a team of almost any size or budget level, not just the largest enterprises. As a consulting “data team of one,” I helped many small businesses stand up an automated data pipeline with just a few hours of work.
We could call reliable data ingestion tools + fast, inexpensive and scalable cloud data warehouses the foundation that makes the practice of analytics engineering possible. They provide a quality of life and sense of freedom that were difficult for me to imagine back in 2010.
Let’s have a quick lovefest to mark the occasion – I’ve personally experienced 3 key freedoms from implementing these tools:
Please indulge me as I express my undying gratitude…
While we now agree that Engineers Shouldn’t Write ETL, we weren’t always so wise.
I’ve had more than one analytics project held up because we couldn’t get raw data flowing on time, due to a backed up engineering queue.
For all involved, that’s a state that I can only describe as a bummer. Not much you can do as an analytics engineer without raw data, and it’s not like engineers enjoy a queue full of ETL scripting chores.
Worse than the initial build of data feeds was the maintenance—remember that huge team that we had at BlackRock? ETL scripting requires human care and feeding, as API outputs can change constantly.
I guess the perpetual question was: when this data integration inevitably breaks, who’s responsible for fixing it and when? Do we need an internal team dedicated to this stuff (which was not tenable for most of my consulting clients), or will someone else come in and own this problem for us?
When data ingestion tools like Fivetran, HVR, Stitch (and later on their open-source cousins Airbyte & Singer) arrived, they blew open the ETL bottleneck.
As an analytics engineer, I could now log in to a SaaS tool, authenticate my data warehouse + data sources (Postgres application database, Google Analytics, Shopify, Facebook Ads, etc), select the tables + columns to sync, schedule it, and away I go.
Data is flowing in 15 minutes, and I can get to work transforming it into human-usable metrics.
I’m incredibly grateful that, as a non-engineer, I can now just say “yes” to layering new data integrations into a pipeline, vs throwing the work over the fence to an engineer.
And a side benefit! API documentation quality varies widely—some data sources (like Shopify) present very clear definitions for each field returned in an endpoint’s response, making it easy to map the endpoints + columns required for your ETL pipeline.
Some APIs do…not make it so easy.
But—ETL tools know the table schemas available from your raw data sources, and can present them to you in a normalized way. That way, you can scan through available tables + fields in one place, as opposed to hunting and pecking through a minefield of API docs.
That may not seem like a huge deal, but the simple quality-of-life improvements brought by ETL tools meant a lot in my day-to-day work as an analytics engineer.
At that first analytics job, the output of our vast ETL workflow ended up in a MySQL database.
Note: By the time I left in 2013 we were exploring implementing Cassandra, and as I write their entire platform has moved to the cloud. I’m recounting ancient history here for the benefit of our understanding.
When you do analytics in a transactional (OLTP) database, life is full of tradeoffs. They’re great at acting on individual rows extremely quickly, but for large query loads like those required for reporting, row-based databases like MySQL are sloooow.
What that meant is that we accepted having to pre-aggregate data for reporting. Analysts would run a script on the command line (or kick it off in a GUI) that’d take anywhere from 2 to 20 minutes to generate a suite of HTML-based reports, accessible via a web portal.
In the modern data stack, that’d be like rerunning dbt models every time you wanted to refresh data or modify a column in your dashboard. I know. I know.
But think of the upside—so much time for coffee breaks or to check your email while your reports compile.
What we truly needed was a lightning fast cloud data warehouse, where a reporting tool could query fresh data directly and cache it, instead of reading from pre-aggregated static files.
We would all get our wish soon after, with the introduction of AWS Redshift in late 2012, and the many other analytics warehouses (BigQuery, Databricks, Snowflake et al) that’ve become available since then.
Among other differences, cloud data warehouses are columnar, meaning they store + query each column separately.
That means if you only select 2 columns from a table, the query will truly only fetch those two columns—as opposed to a row-based transactional database, which would first get entire rows, and then lop off all but those two columns.
Call it zero-waste analytics.
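The difference is easy to see in miniature. This is a toy illustration, not a real database engine: the two layouts are modeled as plain Python structures to show why a columnar store reads only the columns you ask for.

```python
# A toy illustration (not a real database engine) of why columnar storage
# reads less data: selecting two columns touches only those two arrays,
# while a row store must fetch whole rows and discard the rest.

# Row layout: every record carries all of its fields together
rows = [
    {"id": 1, "amount": 10.0, "region": "US", "note": "gift"},
    {"id": 2, "amount": 20.0, "region": "EU", "note": "rush"},
]
# To project two fields, we still iterate over complete rows
row_result = [(r["id"], r["amount"]) for r in rows]

# Columnar layout: each column lives in its own array
columns = {
    "id": [1, 2],
    "amount": [10.0, 20.0],
    "region": ["US", "EU"],
    "note": ["gift", "rush"],
}
# Only the two requested columns are ever read
col_result = list(zip(columns["id"], columns["amount"]))
```

Both produce the same answer, but the columnar version never touches `region` or `note`—at warehouse scale, across billions of rows and wide tables, that's the whole ballgame.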
The accessibility and affordability of these warehouses meant that, as analytics engineers, we now had the keys to an essentially infinitely-scalable supercomputer on which to run SQL queries.
No more building Rube Goldberg machines or complex pre-compilation processes between the database and downstream usage, because you could now do this work directly in the data warehouse.
This unlocked a wave of analytics creativity: working on raw data in the cloud data warehouse rather than around the constraints of transactional databases, a process we have now come to understand as the ELT workflow.
The cosmic cost to an analytics team of having to pre-aggregate data for reporting cannot be overstated.
Everything requires a rerun: each column change, each issue that you find during QC, each late arrival of fresh data. Rerun, rerun, rerun—this time of course adds up quickly and causes undue stress, especially when you’re up against a release time SLA.
Being able to transform data directly in the warehouse would’ve saved my team thousands of hours per year in aggregate (1hr per person per week x 100 people conservatively), as we could’ve rerun only the transformations that were affected by a change.
This is the concept of continuous integration that’s well understood in software engineering, but newer in our analytics field.
For more info on why CI is such a quality of life improvement for analytics teams, see Joel Labes’ writeup of his team’s Slim CI implementation on Discourse.
Let’s level-set real quick about all of these acronyms.
ETL tools extract, transform and load data from APIs and external sources into a data warehouse.
This first transformation step could also be called normalization: data from many APIs requires some initial column-naming standardization or type correction before it’s ready to load. This is critical to avoid breaking your warehouse table schemas—like a duck paddling beneath the surface, ETL tools do a ton of these normalizing transforms out of sight to keep your data flowing consistently.
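Here's a rough sketch of what that normalizing transform looks like under the hood. The field names and coercion rules are invented for illustration—real ETL tools apply far more robust, source-specific logic.

```python
# A sketch of the kind of normalizing transform an ETL tool performs
# before load: standardize column names and coerce types. Field names
# and coercion rules here are hypothetical.

import re
from datetime import date

def normalize_key(key):
    # "Created At" / "createdAt" -> "created_at"
    key = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", key)
    return re.sub(r"[^0-9a-zA-Z]+", "_", key).strip("_").lower()

def normalize_record(record):
    out = {}
    for key, value in record.items():
        k = normalize_key(key)
        # Coerce common string-typed values before load (assumed rules)
        if k.endswith("_at") or k.endswith("_date"):
            value = date.fromisoformat(value)
        elif isinstance(value, str) and value.replace(".", "", 1).isdigit():
            value = float(value) if "." in value else int(value)
        out[k] = value
    return out

raw = {"Order ID": "1001", "createdAt": "2023-05-01", "Total Price": "19.99"}
clean = normalize_record(raw)
```

Every source gets funneled through logic like this, which is why the tables an ETL tool lands in your warehouse all share consistent naming and typing conventions.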
After that normalized raw data is loaded, analytics engineers then transform that source-conformed data into human-usable metrics and dimensions.
You could call this the ETLT workflow (extract, normalizing transform, load, analytics transform), but most of us just drop the first T and refer to this as the ELT workflow (extract, load, transform).
These naming conventions do not need to be an exact science. What’s important is that there’s a final T at the end of “ELT workflow.” This finality implies freedom—the freedom of analytics teams to transform data to meet the needs of many downstream users (reporting, ML modeling, operational analytics, exploratory analysis, etc).
Without this T living at the end of the workflow, analytics teams are boxed in by the format the data arrives in.
With this final transformation step at the end of the ELT workflow, we have the power to work like software engineers.
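Concretely, that final T is just SQL run inside the warehouse. In this sketch, SQLite stands in for a cloud warehouse, and the table and column names are invented—but the shape is the same: raw source-conformed tables in, human-usable metric tables out, no pre-aggregation machinery in between.

```python
# The final T in practice: once raw data is loaded, the analytics
# transform is SQL run inside the warehouse itself. SQLite stands in
# for a cloud warehouse here; table and column names are invented.

import sqlite3

conn = sqlite3.connect(":memory:")

# Source-conformed raw data, as landed by an ETL/ingestion tool
conn.executescript("""
    CREATE TABLE raw_orders (order_id INT, customer_id INT, total REAL);
    INSERT INTO raw_orders VALUES (1, 10, 25.0), (2, 10, 15.0), (3, 11, 40.0);
""")

# The analytics transform: build a human-usable metric table
# entirely within the database, no pre-aggregation pipeline needed
conn.execute("""
    CREATE TABLE customer_revenue AS
    SELECT customer_id, SUM(total) AS lifetime_revenue
    FROM raw_orders
    GROUP BY customer_id
""")

result = conn.execute(
    "SELECT customer_id, lifetime_revenue FROM customer_revenue "
    "ORDER BY customer_id"
).fetchall()
```

When a column changes or late data arrives, you rerun only this transform—not an entire upstream export pipeline—which is exactly the freedom the final T buys you.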
In the next section on data transformations, we’ll get into how that power is manifested, in terms of our work on data modeling, testing, documentation and how all of those can be version controlled like a software product.