LinkedIn Open Sources Tool to Deploy TensorFlow on Hadoop

LinkedIn tool reduces the amount of time required to create AI models by making massive amounts of data stored in Hadoop more accessible.

At the Strata Data Conference today, LinkedIn announced that it is releasing as an open source project the code it developed to run TensorFlow, the open source framework for building artificial intelligence (AI) applications, on Hadoop clusters managed by Yet Another Resource Negotiator (YARN).

TensorFlow on YARN (TonY) was originally developed to facilitate access to massive amounts of data that LinkedIn requires to feed AI models employing deep learning algorithms, also known as neural networks, built using TensorFlow. LinkedIn uses those models to enhance the relevance of feeds and enable a Smart-Replies capability on its social media network.

TonY proved especially useful in reducing the amount of time required to create AI models by making massive amounts of data stored in Hadoop more accessible, says Jonathan Hung, senior software engineer for LinkedIn.

“We wanted to speed up the training,” says Hung.

That’s critical because AI training takes place on graphics processing units (GPUs), which are expensive resources that need to be optimally employed, notes Hung.

Training remains the most challenging aspect of building AI models. Each AI model needs access to massive amounts of data to increase the accuracy of the machine and deep learning algorithms being applied. Hadoop provides a natural source for that data, which can then be more easily aggregated and managed across multiple clusters.

The baseline for TonY has already been completed. LinkedIn expects organizations to extend TonY for use cases that go beyond social media networks, says Hung.

TensorFlow has emerged as a flexible framework for building AI applications that can be deployed on top of everything from Hadoop to Kubernetes clusters. That’s critical because while AI models tend to be built or trained in the cloud or data center, the models themselves tend to be deployed as close as possible to the processes they are intended to automate at the network edge. AI models need to interact with those processes in near real time. Deploying AI models in data centers often introduces latency that prevents AI recommendations from being generated in near real time. In many cases those AI models are tapping directly into data as it streams from the network edge.

Just about every application will soon either have AI models embedded within it or will be able to access an AI model via REST application programming interfaces. Most organizations are still in the early stages of mastering AI. In fact, AI will force many of them to finally embrace more consistent approaches to data management. Data may be the new oil, but few organizations have the pipelines and refineries in place required to process it in a way that enables massive amounts of data to be consumed by AI models. Those pipelines and refineries will require IT organizations to not only acquire new tools, but also master the processes required to pervasively embed AI models across the distributed enterprise.

Naturally, that may take a while. But as more open source AI tools become available, the range of available AI expertise in the enterprise will considerably increase in the months and years ahead.
