The recent ODSC conference highlighted changes in data science over the last few years, including the move to the cloud, the need to support increasingly sophisticated workflows, and more attention to security.
RTInsights recently had the opportunity to speak with Sheamus McGovern, Founder and CEO of the Open Data Science Conference (ODSC). ODSC is one of the leading, most comprehensive conference and training organizations dedicated to data science. It brings together experts from the tech industry, academia, and government organizations and a cross-section of data scientists with the various professionals that support AI and big data solutions.
How have you seen the field of data science evolve in the years since you founded ODSC in 2015?
McGovern: I would say definitely the biggest shift since 2015 has been the move off the laptop and onto the cloud. What that really meant was the scale of data science and machine learning could really expand over the last five years. Before, people were working on what we would call small to medium-sized data sets with smaller feature sets. Back then, if you were dealing with 10,000 features, you would think that is a massive data set. Just working with ten features was a challenge.
Data science took a different path to cloud than software development because software development went to the cloud for the purposes of productivity and certainly scalability, but machine learning went there because it was imperative in order to scale within the machine learning workflow.
Now that data science is firmly linked to cloud capabilities and has crossed new scale boundaries, where are you seeing the most change?
McGovern: Over the last three years, we’ve seen a practice evolve that was searching for a label. We were calling it data ops, AI ops, and dev data ops. And then finally, in the last two years, it became known as MLOps.
Once data science moved to the cloud, the workflows got more sophisticated. So, your workflow had to cover feature engineering, feature modeling, feature deprecation, monitoring, etc. In addition, because you’ve gone to the cloud now, machine learning and data science were catching up with software architecture and what the software field was doing in terms of DevOps.
And now, we’re starting to pay attention to continuous integration, continuous monitoring, and real-time monitoring of models and applications. The workflow now encompasses the whole range of real-time event processing, data analytics, data science, and machine learning. These were emerging on different paths, but now you see software engineering, data engineering, and data science and machine learning starting to converge, primarily because they moved to the cloud.
Would you say that data science is becoming a truly interdisciplinary endeavor?
McGovern: Right. When data scientists were working away on their laptops, they could forget about a lot of the dependencies. Now everything is converging because of the cloud.
How does that affect the traditional role of a data scientist, if there ever was such a thing as a traditional role?
McGovern: When we hear that someone is hiring a data scientist, we ask ourselves, what does that really mean? There’s even a big difference between the related roles of a data scientist and a machine learning engineer. It always seemed to be too narrow a focus. Maybe data science will follow the same trend as software engineering.
Ten or fifteen years ago, we were all programmers. Now there are so many more defined roles. Are you a backend engineer, data engineer, or full-stack engineer? The machine learning and data science field is waking up to the disambiguation because, for example, when you go look for an NLP engineer, you don’t see many jobs for that, right?
You do see specialized skills around NLP or computer vision. So, I think the field still has a way to go in terms of specialization, and it will be some time before that level of specialization is needed.
On the other end of the spectrum, there are serious gaps, for example, the QA engineer. I have yet to see a role for the AI test engineer. That responsibility is being put either on the software QA side in the IT department or on the software development team, but you actually need very special skills to test models.
Are there any other gaps that you’re aware of in the AI/ML workflow?
McGovern: The project management side of it is becoming very important. We see a lot of companies struggling with the whole concept of machine learning and data science projects. ODSC has developed a co-located event called AIX that looks at AI from the business and industry perspective. There have been many sessions on the need for project management.
Project management, the PMP designation, came out of the construction industry and then very successfully moved to the software industry. Once that discipline was coupled with agile development, it became hugely successful for the software field.
However, machine learning and data science projects start like research projects that, if they’re going to be successful, have to be managed from the get-go as real projects. Having a vague target of porting AI to the cloud is not enough. Adding storyboarding and time/resource estimates is better but still doesn’t cover all the aspects.
For example, questions even as basic as where am I going to get the data have to be considered. Is the data already available, or do I have to collect it? Once we get the data, how is our model going to work out? There are more sophisticated questions around loss functions and error rates. Or, back to basics again: will it actually work, and how do we prove that out? That’s just the tip of the iceberg.
That’s often what is meant by the phrase “operationalizing” AI. Can you shed more light on what’s involved in bringing projects from the research and sandboxing phase through to production?
McGovern: Companies are coming to their project managers and saying they want a new business solution, for example, a new credit risk scoring mechanism that doesn’t rely on static and inflexible algorithms but instead uses machine learning. Great. Where do we get the data? Where do we get the team? How do you put a framework around the complexities of machine learning and data science?
Presently, because we don’t have that framework, it’s difficult to estimate the size of these projects. That leads to problems because then how do you start to measure failure rates? How do you determine how close the end product is to the feature specs? How do you measure the success of outcomes since you are often not sure of the outcomes until you are past the research stage? Research stages lend themselves well to masking failure rates because their goal is to discover a good approach or method, which can then be built into a solution.
Would you say that that’s one of the next frontiers people are seeking to cross?
McGovern: That was an important focus for our community last year. This year, we are focused on security and cybersecurity. I was especially interested to see that more is being done in the field of machine learning safety, which is not the same thing as responsible AI, which we are also seeing a lot of interest in. Responsible AI centers on the processes of the data scientists and engineers creating AI systems. Machine learning safety or reliable machine learning centers on the hardening of AI systems against a malevolent actor.
There is a connection to the topic of project management in both of these areas. So, when you’re looking at starting new projects, you shouldn’t be waiting until the end and asking whether it was an ethical project. Having a responsible AI checklist is the wrong approach. The questions of responsibility and ethics must be asked at the outset and at every step along the way. Some of these questions impact whether the outcomes are even accurate. You can say that a model is not being trained on a wide enough feature set, or the model is being trained on a certain persona with all the assumptions and biases baked into it. One example of this is looking at gender-blind income levels and credit profiles but then assuming the persona is male.
We have to think about responsible AI at the data generation phase, the data capture phase, the feature engineering phase, and the model deployment phase. What happens if your models deprecate? They were ethical at the start, but how are you measuring that they remain ethical?
I’ve always thought responsible AI was the result of unintended consequences. No one set out to create a biased model. It is more of an educational, leadership, and awareness kind of issue.
If responsible AI seeks to prevent unintended negative consequences, does reliable AI mitigate intended negative consequences?
McGovern: Exactly. AI safety is slightly different because now you’re dealing with malicious actors. Just like with cybersecurity, you are protecting your AI environment and systems from being tampered with. But there might be many more ways to “hack” AI than a network.
For example, a lot of data training is done with data captured from the web, right? You are mining social media for sentiment analysis on companies, their stock, or their products. This data is, at best, poor, but imagine that someone has deployed bots to generate a lot of negative reviews to poison the data. That data is then fed into an AI system and worked into your model with disastrous results. You can sink a company’s reputation and even lower its stock value. Digital images are also especially vulnerable. Changing a few pixels on an image can make it look like something entirely different to computer vision software.
Another problem is that reverse engineering an algorithm is not impossible, and it’s not illegal. There are people with enough time on their hands who will try to reverse engineer your model. You can only guess how they can profit from that. It’s the new proving ground for hackers, data science hackers.
So, I predict that just like bad trading caused the crash of a lot of financial houses, in the future, bad AI algorithms will bring down companies through monetary losses or reputational losses. Companies need to be more aware of the risks and take action. Responsible people think responsibly. But AI safety is about robustness, engineering robust machine learning systems that can deal with adversarial attacks.
Well, we took an unexpected dark turn. Tell us about the bright spots and opportunities you see in the world of AI.
McGovern: Yes, let’s talk about something much cooler. ODSC now includes a startup showcase. Seeing all the new ideas come into the industry is very exciting. I’ve been noticing a really exciting trend with these startups. One of the reasons I started ODSC, an open data science conference, was that from my perspective coming from finance, which is a bit of a closed industry, I loved the whole idea of open source.
People were out there building these unbelievable systems that were better than paid products. I was using both kinds of products as a programmer and found the open-source ones were just better. I couldn’t believe these people were spending all their time and effort creating these programs, and I wanted to basically give them a soapbox.
Back in the day, hedge funds were the place to be if you had a Ph.D. Now it’s AI. I see a lot of the smartest people joining open-source-focused startups. They are actually using their startup funding to build great open-source communities that will build great platforms and tools.