An open-source GPU initiative could drastically speed analytics, including analyses using deep learning.
MapD is widely recognized as a leader in using the unique computing power of the graphics processing unit (GPU) to make big data analytics faster than previously thought possible. According to a recent webinar, the company’s technology has cut the time needed to process 10 GB of data from 30 seconds to 75 milliseconds.
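To put the webinar figures in perspective, a quick back-of-the-envelope calculation gives the implied speedup and throughput. This is only arithmetic on the numbers quoted above, not a benchmark, and the variable names are illustrative.

```python
# Back-of-the-envelope arithmetic on the figures quoted from the webinar.
data_gb = 10        # dataset size in GB
before_s = 30.0     # processing time before, in seconds
after_s = 0.075     # processing time after: 75 milliseconds

speedup = before_s / after_s           # how many times faster
throughput_gb_s = data_gb / after_s    # implied GB processed per second

print(f"speedup: {speedup:.0f}x")
print(f"implied throughput: {throughput_gb_s:.1f} GB/s")
```

That works out to roughly a 400-fold speedup, or about 133 GB processed per second.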
Now, in a blog post by CEO Todd Mostak, MapD has announced that it is making its GPU-powered database technology open source under the permissive Apache 2.0 license. This means the MapD database core, including the tiered caching system and LLVM query compilation engine, can be used by individuals, academics, and businesses that want GPU power for analytics but previously couldn’t, or didn’t want to, pay for the enterprise-level edition.
Mostak says, “We are doing this first and foremost out of our belief in the transformative power of open source software. Whether in the Hadoop or deep learning ecosystems, open source is driving tremendous innovation that simply has not been possible with proprietary software.”
Early on, Mostak says, he hesitated to make the company’s core technology open source because he was focused on building the business and the technology simultaneously. Now that both are performing well, he says, it makes sense to open the technology up, particularly since he has always believed that open source could be “disruptive” and wanted to shake up the existing GPU-based analytics space.
On top of that, Mostak says MapD can now integrate with other open source ecosystems and communities, such as the newly formed GPU Open Analytics Initiative (GOAI) with H2O.ai and Continuum Analytics.
Deep learning and GPUs
In the blog post, Mostak explains the situation: “We noted that while GPU-accelerated machine learning was eating the world, there was a gaping hole in the analytics stack running on GPUs. Almost the entire GPU ML and deep learning stack was open source, but there was no open source data processing engine to complement it.”
For example, one could already use a GPU-based deep learning engine like H2O or Caffe, but couldn’t run other kinds of analytics on that same data while still harnessing the sheer power of GPUs.
Now, MapD is working with H2O.ai on what they call the GPU Data Frame, which can act as an interchange format for data between different applications that run on the GPU. That means the MapD database can be used in conjunction with not only deep learning analysis, but also other open source analytics software, without having to manipulate the data or send it through the CPU at any point.
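The idea of a shared interchange buffer can be illustrated with a loose CPU-side analogy. The sketch below is not the GPU Data Frame API. It uses Python’s standard `array` and `memoryview` types to show two hypothetical consumers (`run_sql_like_filter` and `run_ml_like_mean`, both invented for illustration) reading the same buffer without copying it.

```python
from array import array

# One column of sample data held in a single contiguous buffer.
fares = array("d", [5.5, 12.0, 7.25, 30.0, 9.0])

# A memoryview exposes the buffer without copying it.
shared = memoryview(fares)

def run_sql_like_filter(column, threshold):
    """Hypothetical 'database' consumer: count values above a threshold."""
    return sum(1 for value in column if value > threshold)

def run_ml_like_mean(column):
    """Hypothetical 'machine learning' consumer: compute a feature mean."""
    return sum(column) / len(column)

count_over_8 = run_sql_like_filter(shared, 8.0)
mean_fare = run_ml_like_mean(shared)

# A write through the original array is visible through the view,
# which shows that both consumers were reading the same memory.
fares[0] = 100.0
same_memory = shared[0] == 100.0
```

On a GPU the analogous benefit would be that data already resident in device memory does not have to travel back through the CPU between tools.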
Mostak says, “Our hope is that this project will be a step towards enabling an open end-to-end pipeline on GPUs.”
GPU big data analytics
That could translate into much faster analytics applications in certain situations. One of the MapD database’s primary functions is to keep so-called “hot” data in GPU memory for maximum speed. A single node can support 192 GB of GPU RAM, which enables analysis of huge datasets. And, by splitting queries into small batches across many cores, the technology aims to prevent bottlenecks.
GPUs now contain upwards of 5,000 cores, compared to 16 to 32 in today’s most powerful CPUs, which makes them ideal for parallel workloads. They can perform significant analysis with less hardware, dramatically reducing costs compared with many CPU-based options.
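The batch-splitting idea can be sketched in a few lines. The example below is a CPU-only illustration that uses Python threads as stand-ins for GPU cores. It is not MapD’s actual execution engine, and the function names, batch size, and sample data are all assumptions made for illustration. It computes the equivalent of an average over rows that pass a filter by handing each worker a small batch and merging the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_batches(rows, batch_size):
    """Cut the input into fixed-size batches (the last may be shorter)."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

def partial_aggregate(batch, threshold):
    """Per-batch work: filter, then return a partial (sum, count)."""
    kept = [value for value in batch if value > threshold]
    return sum(kept), len(kept)

def parallel_average(rows, threshold, batch_size=4, workers=4):
    """Average of the values above threshold, computed batch by batch."""
    batches = split_into_batches(rows, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(
            pool.map(lambda batch: partial_aggregate(batch, threshold), batches)
        )
    # Partial sums and counts can be merged in any order.
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count if count else None

fares = [5.5, 12.0, 7.25, 30.0, 9.0, 8.0, 15.5, 3.0, 22.0, 10.0]
average_over_8 = parallel_average(fares, threshold=8.0)
```

Because each batch yields an independent sum and count, no worker has to wait on another, which is the property that lets this style of query spread across thousands of cores.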
How the new open source MapD project might be used
MapD is moving two products into the open source community. First is the core database, which allows for multi-GPU acceleration of SQL queries. Second are the MapD visualization libraries, which will allow anyone to build web-based visualization apps.
In the past, MapD has touted the platform’s ability to, for example, visualize every New York City taxi ride in real time to analyze human patterns, identify trends in stock prices, and figure out where New Yorkers are most likely to go on a weekday morning (Starbucks, unsurprisingly). The company has also talked about how telecoms have troubleshot issues in real time by analyzing streaming call records, and how visualization has been used to identify clothing purchase trends at the store level.
Now, some of these operations are possible without paying for an enterprise-level plan with MapD. A paid enterprise version still exists, of course, and includes a higher level of support and additional technology that isn’t included in the open source version.
Academics seem to be excited about the news, given their often-restrictive budgets. John Owens, a professor of engineering and entrepreneurship at the University of California, Davis, says, “We’ve been impressed for some time with the work MapD is doing, only wishing we could use it as a real-world test bed for our research.” Scientists and engineers at MIT are also on board.
Mostak says that “we’re at the beginning of a GPU era of analytics,” and it’s hard not to believe him. CPU-based analytics is built atop a vibrant ecosystem of open source options, such as Hadoop and Apache Spark, and it’s only a matter of time before the GPU world accelerates in similar fashion.