Hadoop Data in the Dark? How Governance, Metadata Can Help


Data governance and metadata synchronization can prevent Hadoop data from going dark.

“We have to evolve how we manage data,” said Philip Russom, TDWI’s senior research director for data management, during a recent webinar on Hadoop’s role in big data.

For those entrenched in the big data world, this isn’t necessarily a new idea. The issue has been around ever since the advent of big data itself.

But, according to Russom, central to the discussion of properly managing data is how businesses navigate governance, control metadata, and make analytics accessible. And according to his research, the difficulties in all three areas can be addressed with Hadoop’s open-source architecture.

Russom said, “Hadoop is known for its linear scalability. Hadoop can become, essentially, a bigger and better data staging area for both warehousing and data integration.” It’s not just a storage space but also a processing engine for handling massive volumes of data, which is relevant for companies that get their data from sensors or telematics.

He added: “Hadoop has desirable use cases, but it can be a challenge in terms of data governance. Don’t forget—Hadoop is still kind of new, and it’s still kind of spartan in a lot of ways. That’s part of the secret sauce.”

Most relevant to businesses is governance, followed by managing metadata. Once those are in place, businesses can start creating systems that allow “self-service” analytics, which let non-programmer employees work with analytics to make big business decisions.

Hadoop data governance challenges

Russom noted that while Hadoop offers enormous advantages in linear scalability and the ability to offload data analytics, it does have some governance concerns. “Hadoop is good with high availability, but … Hadoop replicates data into multiple places,” he said. “That’s actually a data lineage problem. It’s hard to know where data went, how many copies there are. That’s a governance issue.”
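
To make the lineage problem concrete, here is a minimal sketch of the kind of bookkeeping Russom describes: recording each copy of a dataset so that “where data went” and “how many copies there are” can be answered later. It is illustrative only. The `LineageRegistry` class, its method names, and the dataset and location names are all hypothetical; none of this is a Hadoop API, and a real deployment would rely on a dedicated governance tool rather than an in-memory structure.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LineageRegistry:
    """Hypothetical in-memory record of where each dataset has been copied."""

    # dataset name -> ordered list of (source, destination) copy events
    copies: dict = field(default_factory=lambda: defaultdict(list))

    def record_copy(self, dataset, source, destination):
        """Log that `dataset` was copied from `source` to `destination`."""
        self.copies[dataset].append((source, destination))

    def locations(self, dataset, origin):
        """Return every location holding the dataset, starting from `origin`.

        Assumes copy events were recorded in the order they happened.
        """
        found = {origin}
        for source, destination in self.copies[dataset]:
            if source in found:
                found.add(destination)
        return found

    def copy_count(self, dataset, origin):
        """Number of known locations, including the origin itself."""
        return len(self.locations(dataset, origin))


# Example with made-up names: sensor data lands in staging, then spreads.
registry = LineageRegistry()
registry.record_copy("sensor_readings", "staging", "warehouse")
registry.record_copy("sensor_readings", "warehouse", "analytics_sandbox")
print(sorted(registry.locations("sensor_readings", "staging")))
print(registry.copy_count("sensor_readings", "staging"))
```

The point of the sketch is only that lineage has to be captured at the moment a copy is made; reconstructing it afterward is the hard part Russom is warning about.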

To tackle this, Russom insists that companies need to think of data governance as much more than business compliance, or negotiating the often-complex regulations that define a business’s activities. Technical standards need to be incorporated into the wider picture. These guidelines can cover data usage, privacy, and security, and if they’re not followed, a business can expose itself to legal issues and potential erosion of its brand’s value.

Metadata is a part of this as well. Jean-Michel Franco, the director of product marketing at Talend, said in the webinar that delivering metadata by design and synchronizing it across data platforms is critical not only to keeping that metadata under control, but also to establishing the self-service tools that enable all employees to make analytics-based decisions.

Where Hadoop gets it right

“The primary path to getting business value from big data, and a lot of new data, like machine data, is through analytics,” Russom said. “There are challenges around Hadoop, but I don’t see them stopping anybody.”

Cost and complexity are additional challenges that can make companies stumble if they’re not prepared (see “Help! I’ve Been Told I’m Supposed to Use the Cloud and Hadoop But I Don’t Know Why”). TDWI has even published an online tool to help gauge Hadoop readiness.

Franco said that Accolade, which brands itself as an “on-demand healthcare concierge,” has been using Hadoop-enabled analytics to achieve important efficiency gains in its processes. Now, the company can better individualize services for those in need of medical care, which has resulted in a 75 percent drop in onboarding effort and time. Many other industries are finding similar success.

While speed can be key to any big data investment, maybe companies should focus first on ensuring that whatever tools they do use are going to assist in the governance process.


Joel Hans

About Joel Hans

Joel Hans is a copywriter and technical content creator for open source, B2B, and SaaS companies at Commit Copy, bringing experience in infrastructure monitoring, time-series databases, blockchain, streaming analytics, and more. Find him on Twitter @joelhans.
