Big Data Trends for 2017: Hadoop Meets Machine Learning


Data- and source-agnostic platforms will beat out siloed systems; Spark and machine learning continue to thrive.

Big data has proven to be a dynamic, rapidly moving area of innovation over the last few years, and that pace seems to be accelerating rather than slowing down. A new report from TDWI and Tableau reiterates this conclusion while tackling the opportunities and challenges behind some of the big-picture trends.

As an overall movement, the report claims that companies will put a priority on systems that “support large volumes of both structured and unstructured data.” On top of that, there will be a rapid rise in demand for platforms that help data custodians govern and control their big data implementations while also keeping that data accessible to scientists, analysts, and general business users. That is, in fact, the focus of the May 2017 Data Platforms Conference, which includes sessions on modern big data platforms and big data as a service.

What else can we expect in 2017?

Hadoop’s role is changing

Of the ten key trends outlined in the Tableau report, Hadoop gets most of the attention. In many ways, Hadoop is the industry standard for big data, but it isn’t the fastest performer out there. It’s not ideal for machine learning or SQL-based queries, which are “the conduit to business users” who want to create dashboards or do “exploratory analysis,” according to the report.

Many organizations are moving to other databases that offer faster queries, such as Exasol and MemSQL. Others are taking advantage of SQL-on-Hadoop engines like Apache Impala, Hive LLAP, Presto, Phoenix, and Drill in an effort to get the best of both worlds.

On top of this, analytics based solely on Hadoop data is a thing of the past. According to the report’s authors, “enterprises … no longer want to adopt a siloed BI access point just for one data source (Hadoop).” Instead, they want to explore both structured and unstructured data living in all types of databases. The report argues that only platforms that are “data- and source-agnostic” will be successful in 2017 and beyond.
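
To make the “data- and source-agnostic” idea concrete, here is a minimal Python sketch. It is not taken from the report. It assumes a hypothetical `orders` table in an in-memory SQLite database standing in for a relational source, and a few hypothetical JSON event lines standing in for semi-structured data such as logs landed in a Hadoop cluster. A single function reads from both so the analyst sees one combined result.

```python
import json
import sqlite3

# Hypothetical structured source: an in-memory SQLite table of orders.
def build_orders_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 20.0), (1, 5.0), (2, 12.5)],
    )
    return conn

# Hypothetical semi-structured source: JSON lines, e.g. clickstream events.
EVENT_LINES = [
    '{"customer_id": 1, "event": "page_view"}',
    '{"customer_id": 1, "event": "page_view"}',
    '{"customer_id": 2, "event": "page_view"}',
    '{"customer_id": 3, "event": "page_view"}',
]

def combined_report(conn, event_lines):
    """Return {customer_id: {"spend": float, "events": int}} across both sources."""
    report = {}
    # Structured side: total spend per customer, computed in SQL.
    rows = conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    )
    for customer_id, total in rows:
        report[customer_id] = {"spend": total, "events": 0}
    # Semi-structured side: count events per customer from the JSON lines.
    for line in event_lines:
        record = json.loads(line)
        entry = report.setdefault(
            record["customer_id"], {"spend": 0.0, "events": 0}
        )
        entry["events"] += 1
    return report

report = combined_report(build_orders_db(), EVENT_LINES)
for customer_id in sorted(report):
    print(customer_id, report[customer_id])
```

Customer 3 appears only in the event data and still shows up in the result, which is the practical payoff of not tying analysis to a single source.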

If it’s starting to feel like Hadoop is taking a beating, there is good news as well: new tools are designed to make it more compliant than ever. Three Apache projects (Sentry, Atlas, and Ranger) are all built to enable more fine-grained authorization and administration. The authors say these new capabilities are “eliminating yet another barrier to enterprise adoption.”

Data prep goes self-service

The report paints a rather bleak picture of how many businesses operate today: They have many IoT devices creating structured and unstructured data, which lives across “multiple relational and non-relational systems, from Hadoop clusters to NoSQL databases.” Because more business users want to take this data and build intelligence around it, demand is growing for analytics tools that seamlessly bridge the gaps between these systems and data types, whether they’re hosted on-premises or in the cloud.

The report says that “making Hadoop data accessible to business users is one of the biggest challenges of our time.” These users, who aren’t computer or data scientists, simply can’t spend hours and hours preparing data. Thus, according to the authors, demand is going to rise dramatically for “agile self-service data-prep tools” that both lower the learning curve on Hadoop data and support data snapshots. They call out Alteryx, Trifacta, and Paxata as early leaders in this rapidly growing niche.
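
As a rough illustration of what such preparation involves, the Python sketch below cleans a few hypothetical device readings and then takes a snapshot. The report does not define “data snapshots,” so treating one as a frozen, independent copy of the prepared data is an assumption here, as are the field names and the `prepare` and `take_snapshot` helpers. None of this reflects how Alteryx, Trifacta, or Paxata work.

```python
import copy

# Hypothetical raw device readings with inconsistent names and missing values.
RAW_READINGS = [
    {"device": " Sensor-A ", "temp_c": "21.5"},
    {"device": "sensor-a", "temp_c": None},
    {"device": "SENSOR-B", "temp_c": "19"},
]

def prepare(records):
    """Drop rows with no reading, normalize device names, cast to float."""
    cleaned = []
    for rec in records:
        if rec["temp_c"] is None:
            continue
        cleaned.append({
            "device": rec["device"].strip().lower(),
            "temp_c": float(rec["temp_c"]),
        })
    return cleaned

def take_snapshot(records):
    """Return an independent copy that later edits to `records` cannot change."""
    return copy.deepcopy(records)

cleaned = prepare(RAW_READINGS)
snapshot = take_snapshot(cleaned)
cleaned[0]["temp_c"] = 99.0  # a later edit; the snapshot is unaffected
print(snapshot)
```

Self-service tools aim to hide steps like these behind a visual interface so that business users never have to write them by hand.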

Machine learning takes center stage

Machine learning (ML) with big data is a notoriously complex technology to master, but a number of new systems are making it more user-friendly than ever before, according to the report. Microsoft Azure ML, for example, is leading the friendliness charge with a platform that helps users create ML workflows with visual tools and even offers a free tier for experimentation.

The report says, “Opening up ML to the masses will lead to the creation of more models and applications generating petabytes of data.” Of course, all this ML-generated data will need to be easily accessible, which leads us back to the data-prep and self-service developments described above.

For best-in-class ML capability, however, most organizations are turning to Apache Spark. A recent Syncsort report shows that 70 percent of IT managers and BI analysts are interested in Spark over MapReduce, and we’ve seen a similar trend. Spark is already being used to invigorate mainframes and detect fraud, and IBM offers Spark-as-a-Service because of its ability to both ingest real-time data and run analytics on it.

Of course, the report’s predictions are just that, predictions, and we’ve seen others argue that SQL databases aren’t going away, even in the face of Hadoop. Either way, we can be certain of one demand, difficult as it may be to meet: platforms must handle more data, do it faster than ever before, and make it safe and easy for anyone to use.


Joel Hans

About Joel Hans

Joel Hans is a copywriter and technical content creator for open source, B2B, and SaaS companies at Commit Copy, bringing experience in infrastructure monitoring, time-series databases, blockchain, streaming analytics, and more. Find him on Twitter @joelhans.
