A discussion about the challenges data scientists face, why they’re looking to Python for help, and the need for enterprise-class features and support.
Businesses constantly strive to transform operations and differentiate their offerings to stay ahead of the competition. At the heart of most efforts is the need to rapidly develop and deploy many new data-centric applications. Unfortunately, projects are often slowed or delayed because data scientists are swamped with ever more work, or because of inefficiencies in the handoff between development and production.
Increasingly, Python, by virtue of its ease of use and powerful automation capabilities, is being used to speed the creation and deployment of data-driven applications. However, a limiting factor is that open-source tools such as Python may not meet the performance, security, and replicability demands of production business environments.
To sort through these and other issues, RTInsights sat down with Stanley Seibert, Director, Community Innovation, at Anaconda; and Heidi Pan, Director, Data Analytics Software, at Intel. We discussed the challenges data scientists face, why they’re looking to Python for help, and the need for enterprise-class features and support.
Getting the Most Out of Data Scientists
RTInsights: With data scientists in such high demand, how can companies help them work to their fullest potential?
Seibert: Early data scientists were jacks of all trades, doing many different tasks. They would be involved in data prep, data quality checks, modeling, and then figuring out how to deploy the models. They did a bit of everything, which was part of the reason they were in such high demand. It was hard to find people with skills in so many different areas. But the field has matured. We’re starting to see specializations emerge. We’re starting to see organizations focus more on building out teams of people with specialized skill sets.
You can have someone who focuses just on data prep and data quality, and someone who focuses on the model in question and how you validate it. And you can have someone else focus on how to get models into production, looking at what’s required to go from a research prototype to something that you can deploy at scale. Each of these groups is becoming its own subspecialty. Getting more out of your data scientists might mean narrowing the definition of what a data scientist is, but then augmenting them with people who have other job titles.
Pan: Adding to that thought, data scientists are the most expensive part of the pipeline. We really do want them to be productive. They need easy access to the tools and the vocabulary that they already use. We also need to enable them to understand, explore, and analyze data quickly. One of the big points is that data scientists spend a lot of time sifting through the data to figure out which subset of it to use.
We want data scientists to iterate quickly over large amounts of data. The role that Intel is playing is to bring modern compute power to data science. You have a lot of parallel hardware, you have a lot of large memories, and it’s relatively cheap compared to data scientists in the grand scheme of things. How do you throw machines at their problem and make their workflows much faster? At the end of the day, you get the insights, and you get the models, much more quickly.
Seibert: That’s a great point. One of the things that is maybe underappreciated is that interactive exploration for data scientists is really key to their productivity. They’re often doing something where there isn’t an already known best or good way to do it. And they are going to have to get very familiar with the data. So, the tools that Heidi is talking about are really important to give data scientists the power to rapidly try ideas. Once you’ve enabled data scientists to focus on just the modeling or something like that, they’re still going to need to try out a lot of different ideas to be successful. Giving them the ability to turn that around as quickly as possible, with the hardware they can access, is a big deal.
Python’s Benefits for Empowering Modern Applications
RTInsights: What are the benefits of using Python for automating data tests to empower machine learning and other modern applications?
Seibert: One of the benefits of Python is that, because it is an easier-to-learn language but still very productive, it allows data scientists that rapid iteration that we were just talking about. It lets them quickly code up an idea, do a visualization they want to see, carry out a particular analysis, and filter the data very quickly. It isn’t used just for cleaning up the data, after which you have to switch tools to build your model. Because you’re doing the data prep, the modeling, and potentially even the deployment in Python, it really streamlines things not to have to switch languages partway through.
In the past, some ML [machine learning] pipelines would get you to a point where the data scientist would have done all their work in Python. Then, they’d have to throw the code over a wall to a developer to recode the whole thing in Java, for example, for deployment. Now that it’s becoming easier and more accepted to deploy Python into production, that’s giving people a lot of power to be able to do the experimentation, and also to be able to automate the whole thing with one language to go from beginning to end.
Pan: The ability to use Python end to end, from development to deployment, is huge. The more streamlined and connected you can make that workflow, the better it is for everyone. Also, keep in mind that machine learning is getting more mature. So, we’ll get more and more best practices for software, in general. Infrastructure as code, repeatability, reproducibility, and sensibility will need to be in a flexible scripting language that everyone knows.
Seibert: That makes it a little easier for more stakeholders to see what’s happening. If the data scientists, developers, and data engineers all speak the same language, it’s easier for them to understand what the others are doing. So, when you’re automating things, information is not lost in translation.
Help for Deploying Data-Driven Applications
RTInsights: While data scientists are strong at developing data-driven applications, what help do they need with deploying those applications?
Pan: It’s two things. One is the streamlining idea of scalable Python. Traditionally, Python has been seen as a language for very small data and experimentation, and as slow. But we’ve been growing this vision of scalable Python and trying to bring it to reality. The idea is that if you keep the same workflow the model was developed in, scale it to bigger data, and make it much faster, you can deploy it directly without changing much. That’s going to help a lot because right now the bottleneck is the handoff between the data scientists and DevOps. And the more you can streamline that, the better. Even when people do rewrite it, they end up keeping a lot of custom functions from the development side and rewriting the rest. And that’s just not sustainable for the growth of the ecosystem.
Two, big corporations are going to move towards the model that Stan was talking about, in which you have a separation of concerns for different things. That being said, the more streamlined the handoff is, the easier it is for everyone. There’s going to be more and more of a move to bring DevOps to the data scientist. For example, we’re looking at one-line deployment where a data scientist can say, “I want this data and this code to run remotely here on this cluster, with this many nodes. Here are my credentials. Go.” Any data scientist can do that. It replaces the numerous pages of DevOps code that you otherwise have to write manually. I think the ecosystem will do more and more of that, streamline the process, and bring data science and DevOps together.
Seibert: Another thing that we’ve seen from working with people is that they need help because data scientists tend not to understand how to achieve high performance and scalability. Tools that can do some of that behind the scenes for them, such as those Intel is working on, can be extremely helpful to many data scientists who might not be familiar with how to make the best use of the large multi-core servers that they might have in production. You want them to be able to do that without having to learn an entirely new skill set about optimization, and all sorts of things that are well outside of their normal concerns.
Making Open-Source Tools Enterprise-Class
RTInsights: What’s the benefit of using open-source solutions with enterprise-class features and support?
Seibert: It’s an interesting trade-off that you must balance. Innovation happens in the open-source space very quickly. There are thousands and thousands of independent developers working to solve their piece of the puzzle for machine learning and data science. Progress can be made very quickly, but that progress is sometimes hard to manage. If you’re a business that needs to manage your risk, you need to understand exactly how the software is being used and if there are known security issues or bugs with that software. And you must address reproducibility. I need to be able to say, “The model I run today will be the same one that runs tomorrow, and I can prove that to an auditing agency,” or something like that.
The hard part often is to take that innovation and make it usable by an enterprise that needs to manage things like risk and needs to have governance and control. That’s an interface point that Anaconda certainly tries to sit at. Many of our customers want that open-source innovation, and then want to know how you attach a layer of curation, security, and governance so that IT organizations can ensure they know what their data scientists are using. Sometimes that’s as simple as knowing exactly which versions are being used to do the data science, because that’s really going to be important for reproducing and auditing results in the future.
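As a rough illustration of the version-tracking point above, the following minimal sketch records the Python interpreter version and the version of every installed package using only the standard library. The function name `snapshot_environment` is hypothetical and chosen for this example; it is not part of any Anaconda or Intel product, and the sketch is not a description of how either company implements governance. Teams using conda may prefer to capture the same information with `conda env export`.

```python
import json
import sys
from importlib import metadata


def snapshot_environment():
    """Return the interpreter version and installed package versions.

    Hypothetical helper for illustration: it only reads metadata that
    the standard library already exposes for installed distributions.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Store this output alongside a trained model so the environment
    # can be compared or rebuilt when results are audited later.
    print(json.dumps(snapshot_environment(), indent=2))
```

Saving a snapshot like this next to each model artifact gives an auditor something concrete to compare against when asking whether the model that runs tomorrow matches the one that ran today.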
Pan: We began this conversation talking about data scientists being the bottleneck. The number-one priority is to make them productive. And open-source APIs are really what they know. There are vibrant communities to advance them very quickly and to document how everything works. And it’s their vocabulary. We can’t change the vocabulary. That’s the premise we start with. That being said, I’m very excited to be working with Anaconda, because Anaconda not only advances the open-source communities and the technologies, it also provides enterprise-class support for production.