Right now, the world of analytical data engineering and data
architecture is awash with confusion and controversy about how we should handle
data for analytics. A lot of the controversy centers on which is better: a data warehouse architecture, a data lake architecture, or some combination of the two.
I think we’re all asking the wrong question.
Analytics end users don’t care where or how data is
stored.
Executives, business analysts, data scientists, even line-of-business
workers – they’re interested in the analytics, but not as much in the data.
As data engineers and architects, we have to care where the data is stored, how it gets there, and how it gets cleaned and managed to feed into analytics. We have to worry about real-time pipelines, historical storage, and
combining incoming time-series sensor data with geographic data about the weather
at that timestamp.
However, the people driving the business, the people who write
our paychecks, don’t care. As a profession, we need to stop forcing them to
worry about where their data resides.
Unify the analytics, not the data
Once upon a time, data warehouse architectures were designed
to gather data, combine it, polish it, and present it to visualization tools
that showed everyone how the business was performing. Business analysts put in
SQL queries as needed.
Then, along came Doug Laney’s three V’s – massive increases in data volume, velocity, and
variety, including streaming real-time data from devices that, in many ways,
encompassed all three. Also, new people called data scientists, who looked a
lot like our old quants, statisticians, and actuaries, needed all that data to
do sophisticated predictive analytics and machine learning.
The data lake was touted as the solution. Dump everything
here and do analytics on top of that crazy mess. It’ll be great.
But it wasn’t so great.
Governance wasn’t there; security wasn’t there. Most
importantly to end users, concurrency and response times weren’t there. The
architecture could no longer support all the people who wanted to perform
analytics, much less expand to allow more people in the company to use data to
drive their decisions. Nor could it provide analytical answers at the speed
they wanted to ask the questions, much less the speed of automation.
A data scientist training a model might cause the system to
crash or bog down, causing a business analyst to miss their SLA, or reports to
be generated only after the CEO went into a stockholder meeting, or an
automated threshold not to be triggered and miss shutting off a valve before it
spewed. Putting everything in one giant lake meant everyone competed for the
same resources.
The brave new world poses
a false choice
Now, some of the folks who failed to deliver on the promises of the data lake are trying to convince us that they’re going to add a few features from the old data warehouse, call it some silly new name, and ta-da, all is solved.
We’ve got data architects struggling to find workarounds.
They’re building complex combination architectures that include both a data
warehouse and a data lake. Then, depending on who the end user is, they tell
them where to find their data, what condition it’s in, force them to fetch data
from several different locations in different formats, and let the consumers figure
out how to analyze it.
Folks are arguing over whether they should do everything streaming-first, whether they should use only open source software or only proprietary software, or even only one brand of proprietary software. Should they put all the data in the cloud?
They’re missing the point.
Depending on who you ask, data scientists end up spending 60
to 90 percent of their time combining and cleaning data to get it ready for
analytics – which is something data engineers and data architects get paid to
do.
What’s more, business analysts and dashboard consumers
really don’t care if the architecture is built all on the cloud, or if it’s
from a proudly open-source shop. They genuinely don’t care where you put their
data any more than an Amazon shopper cares what warehouse the retail giant
stored their product in.
Would you like to drive to a particular Amazon warehouse,
find your product, put it in a box, and drive it home yourself? Similarly, analytics
consumers don’t want you to just tell them where the data is and wish them
luck.
So, what do analytics consumers really need?
- Ease of Use – How hard is it going to be to get
the analysis I need?
- Accuracy – Can I trust that the analysis will be
accurate?
- Workload Isolation – Can I ask the analytical
questions I need to ask without crashing the system or slowing down my boss’s
dashboard?
- Concurrency – When I need access to analytics, can
I get it, or will I have to wait in line?
- Response Speed – Am I going to get an analytical
answer back fast enough to matter?
In other words, they care about the analytics.
How can data engineers and architects deliver better
analytics?
Stop focusing on unifying the data storage and focus on
unifying the analytics experience. You
might think, “But processing and storing data is what a data engineer
does.” That’s like saying, “Moving boxes is what an Amazon delivery person
does.”
An Amazon delivery person needs to focus on making sure the
right package is delivered to the right address within the stated delivery
window. They need to know things like storage location, packaging process, and the best transportation route, but that’s not the focus.
The people designing and building data architectures should
not be focused on where and how to transport, store, and process data. They
have to be focused on how to serve analytics.
Architects need to work backward. Look first at what the
analytics consumer needs. Analytics consumer requirements, and keeping costs
reasonable, are the primary concerns.
This throws a lot of things out the window that you might
have thought were important.
Open source or proprietary? Doesn’t matter. Choose
what will do the job best and keep costs reasonable, not just software costs,
but also maintenance, support, and operational costs.
Cloud, on-premises, hybrid, or something else?
Doesn’t matter. Choose what will do the job best now, and expect it to change
over time, so also prepare for the future.
Data warehouse, data lake, combination, or something
else? Doesn’t matter. Choose architecture based on making analytics
accessible, not on where data is stored, and be wary of vendors who insist that
you must store all your data in their platform.
Focusing on analytic consumer needs sounds simple, but it’s
a lot easier said than done.
What makes a solid data architecture?
Working backwards, start with the analytical consumer needs:
1) Ease of use
To an executive or line-of-business
person, ease of use means a
dashboard or report that shows them what they need to know quickly and
understandably. It also means that if they click on that visualization
dashboard to drill down on a particular region or important fact, they get back
a new visualization quickly. From a data engineering perspective, that means
solid integration with visualization software, a data querying engine that
robustly supports full ANSI SQL since those visualization tools send some
gnarly SQL queries, and fast performance.
To a business analyst, ease of use means generating reports
and building dashboards quickly and easily, without worrying about where the
data they need is stored. It means sending an ad-hoc SQL query to get answers
to questions they were just asked, without having to go back to an ETL or data
engineering team and ask them to add a column of data that they left out
before. And it means SQL again, the business analyst language of choice.
To a data scientist, ease of use means using familiar tools
like Python, R, or a notebook like Jupyter. Since SQL is needed by other users,
the flexibility to use different tools to access data is a key aspect of good
architecture. Ease of use also means addressing the entire end-to-end data
science workflow in one place without moving chunks of data somewhere else. It means making complex data preparation operations quick and easy, whether that’s geo-fencing, joining disparate time series, or interpolating missing values. Training
models should happen on a distributed system for speed and accuracy, without
moving data or re-doing work. This includes not having the data engineering
team re-do their work in a different framework to operationalize. The
environment they develop on should be virtually identical to production to make
that essential final jump to production as easy as possible. And it would be
nice if they could manage model life cycles as well, without moving either the
model or the data that trained it.
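To make that concrete, here is a minimal sketch of two of those preparation steps in pandas: an as-of join between two disparate time series, followed by missing-value interpolation. The sensor and weather data are made up for illustration; in a real architecture this work would run where the data already lives rather than on an extract.

import pandas as pd

# Hypothetical sensor readings and weather observations on mismatched timestamps.
sensors = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 00:00:05",
                          "2024-01-01 00:00:35",
                          "2024-01-01 00:01:10"]),
    "valve_pressure": [101.2, None, 99.8],
}).sort_values("ts")

weather = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 00:00:00",
                          "2024-01-01 00:01:00"]),
    "temp_c": [4.1, 3.9],
}).sort_values("ts")

# Join each reading to the most recent weather observation (an as-of join),
# then fill the missing pressure value by interpolating between its neighbors.
combined = pd.merge_asof(sensors, weather, on="ts", direction="backward")
combined["valve_pressure"] = combined["valve_pressure"].interpolate()
print(combined)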
2) Accuracy
For the data scientist and the business
analyst, who are building analytics, accuracy means knowing where to find
the right data. But you don’t want to point them at a giant warehouse or lake,
and say, “Go fishing.” They’ll need a specific inventory, so they know where to
find exactly what they need.
For all analytics consumers, accurate analysis requires knowing where the data came from and knowing the data is clean and verified as fit for use. Accuracy often
comes down to data quality, data
lineage, and data governance. If you thought you could let those slide
in the age of big data, I’ve got some bad news. Clean, known data from the
right source is just as important now as ever.
Accuracy also depends on the business
analyst and the data scientist building good analytics. You might think that
isn’t the data engineer’s problem. But, to some extent, it is.
For business analysts, a big part of building accurate analyses is
having a complete picture from all relevant data sets. Access to some or most of the data, with the rest off in a silo somewhere, won’t give them a complete picture of the organization. Now, this may sound like the old story – move all
the data to one place first. But providing access to all the relevant data sets
doesn’t necessarily mean moving all the data to one place. Storage location
doesn’t matter, but analytic access does.
For data scientists, a big part of building accurate analyses is
having complete data sets. Machine learning requires a lot of data for
training. More data even beats a better algorithm for increasing accuracy. Provide
data scientists with access to the entire data set, no matter how big. Taking a
small sub-sample that can fit in memory on a laptop, and building a model from
that, is a recipe for reduced model accuracy, not to mention re-doing work.
Focus on building an architecture where the only reason data scientists need to
sample data is to separate out training and verification samples.
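When sampling does come into play, it should look like the split below rather than a laptop-sized extract. This is a sketch assuming a scikit-learn style workflow, with random stand-in data for the full feature matrix and labels.

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the full data set -- the point is to split the entire thing,
# not a sub-sample that happens to fit in memory on a laptop.
X = np.random.rand(10_000, 20)
y = np.random.randint(0, 2, size=10_000)

# Hold out 20 percent strictly for verification; everything else stays
# available for training the model.
X_train, X_verify, y_train, y_verify = train_test_split(
    X, y, test_size=0.2, random_state=42
)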
3) Workload isolation
This is a subject that hasn’t gotten the
attention it deserves, especially when many big data vendors want you to shove all your data in one place first and foremost.
- Business intelligence teams are the essential heart of many data-driven organizations, building reports for everyone to use, and answering questions as they come up with ad-hoc queries.
- Data science teams need access to the same data, and in bursts, huge amounts of compute power to train models.
- Executives and line-of-business people want to drill down on dashboards and get fast responses.
- And what about data engineering? When and where are the data transformation jobs going to run if three other teams need those same resources?
If every group uses the same compute
resources on the same data, there are going to be some obvious conflicts. Isolating workloads from each other by
providing dedicated compute resources and separate access to data can make your
business a lot more harmonious. It can also provide each team with what
it needs to do the job right.
The first thought most people have on this
is making copies of the data for each team. That’s how data marts proliferated
back in the day. The data inconsistencies and the constant need to update
multiple locations make that less than ideal. Don’t make yourself crazy trying to build that spiderweb of pipelines.
These days, we have a better option – cheap
shared storage in HDFS, S3, etc. Cloud computing has the concept of spinning up
sub-clusters. Whatever data is needed is copied from communal storage, and whatever
compute is needed for that particular job or team is ephemerally assigned just
to them and no one else. The beauty of the sub-cluster concept is that it isn’t
just a cloud thing. HDFS or S3 style shared storage options are available
on-premises as well.
Sub-clustering to isolate workloads makes sense, and it doesn’t tie you to one deployment option the way you might think.
There are other ways to isolate workloads, but sub-clustering is a really good
one.
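As a rough illustration, here is what an isolated, ephemeral workload might look like with PySpark reading from shared object storage. The bucket path and application name are hypothetical, and the same pattern applies to HDFS-backed storage on-premises.

from pyspark.sql import SparkSession

# Each team or job gets its own short-lived Spark session on dedicated compute,
# reading what it needs from communal storage and releasing resources when done.
spark = (
    SparkSession.builder
    .appName("data-science-training-subcluster")  # hypothetical job name
    .getOrCreate()
)

# Pull only the slice of shared data this workload needs.
events = spark.read.parquet("s3a://shared-analytics-storage/events/2024/")
readings = events.where("event_type = 'valve_reading'")

# ... train a model or run transformations on this isolated compute ...

spark.stop()  # hand the ephemeral resources back when the job finishes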
4) Concurrency
Concurrency is pretty straightforward. If the
goal is for more aspects of a business to be data-driven, you have to provide
access to data analytics to more people. Make sure your data analytics architecture can support
everyone who can benefit from it. Don’t think you’re doing the organization any
favors by using a cheaper option if it unreasonably limits the number of people
who can use it.
5) Response speed
From an architecture perspective,
analytical response speed comes down to the performance of whatever engine
you’re using to do the analysis. Concurrency matters, too, though. Some
analytic technologies have great response speed until more than ten people use
them at the same time, and then performance drops like a rock.
And, of course, your SLA matters. For some
situations, getting an answer back in an hour is great. For others, three
seconds is too long.
You may need to do things like train a model in one place on a large historical data set, then deploy it out to the edge where it can detect a pattern, and react to it in sub-second time frames. The flexibility to meet various demands is a big concept to keep in mind.
Unify the analytics, not the data
These days – with data from devices, data from transactional systems, data from external sources, structured data, unstructured data, complex hierarchical data – the data landscape is far too complex for moving all the data to one place to be practical. Instead of focusing on where the data lives, focus on making the analytics experience as smooth as possible for everyone in your organization.
Put
those packages of analytics right on consumers’ doorsteps.