This recorded session is from DataEngConf NYC 17. Slides are available on the event page.
There’s an often-quoted statistic that says that data analysts spend 80% of their time preparing data and only 20% actually analyzing it. There’s a lot that we as data engineers can do to help our analytics teams be more productive and spend less time worrying about data preparation. This session discusses common problems in data warehousing infrastructure from the point of view of analytics teams, and suggests practical solutions.
Watch the session video or read the key takeaways below.
Video: Building ETL Infrastructure Analysts Love
Key Takeaways
An ETL infrastructure that analysts and engineers love has two main properties:
It satisfies data analysts’ needs: Correctness, availability, recency, cleanliness and speed, and completeness.
It empowers analysts to manage it with ease, while alleviating resource-constrained engineering teams.
This summary covers reasons for why both those properties are important and should be considered when building out ETL infrastructure, whether starting from scratch or transforming an existing, inefficient system with one that analysts and engineers will love.
Part 1: Data Analysts’ Hierarchy of Needs
For data analysts, there’s a hierarchy where they will only care about the higher levels once the more basic levels are taken care of, and in my experience, it can be roughly classified as follows:
Correct – Data correctness is the most fundamental of analysts’ needs. Building trust in the data is key.
Available – If data isn’t available—eg, if there’s some infrastructure component going down, or a change to the data that the ETL system isn’t setup to handle—analysts can’t do their job for an indefinite amount of time.
Up-to-date – Once analysts are confident that data is correct, and that they can rely on it being available the vast majority of the time, this is the next thing they’ll care about.
Clean and Fast – Now we are transitioning from enabling analysts to helping them be as effective at their jobs as they can be.
Complete – Does the warehouse have all the data that analysts care about? When analysts realize they have fast access to up-to-date data, they start asking more questions, which leads to demands for more sources of data to be integrated.
Part 2: Opening Up the ETL System to Analysts
As the size and needs of the analyst team grow, so does the complexity of the ETL system. It needs ongoing maintenance and improvements, and the burden of maintaining this system falls on engineers while analysts are forced to wait for weeks or months.
We can plot the few number of tasks involved with ETL and data warehousing at on a spectrum:
The solution is to move the line to the right: make it possible for analysts to maintain and scale the ETL infrastructure with less dependence from engineering.
Benefits of moving the line to the right:
Flexible systems
Decouple operational from analytical data systems
Faster time-to-insight
Data is trusted more
Analysts can satisfy all their data needs
How to Open the ETL System to Analysts:
When moving the line of ownership from left-to-right it’s critical to ensure analysts aren’t just tasked with maintaining the ETL system but are empowered to do so.
Some business intelligence (BI) tools shift the line slightly to the right, for instance, by allowing on-the-fly transformations and storing materialized views in a “cache layer.” This is a step in the right direction, but still leaves a large part of the ownership spectrum with the resource-constrained engineering team.
A data pipeline automation tool like Etleap takes care of many maintenance and optimization ETL tasks—such as managing data warehouse schemas—and simplifies the rest—such as adding new data from existing sources—so that analysts can do them on their own.
The end result is an infrastructure that gives analysts the data they need. What’s more, engineers love this infrastructure, too, because there’s less burden on them to maintain or scale the ETL infrastructure.