How Open Source is Driving the Future of Data Science

With its reliance on a community of physically dispersed individuals and flexibility of adoption, open-source data science is becoming an even more attractive choice among cash-strapped governments, non-profits, and businesses.

Over the past decade, data science and machine learning have moved from an obscure academic discipline to widespread corporate practice. The academic community has a natural preference for open source. Science is a collaborative effort, and its advancement is best served by enabling as large a community as possible to build upon existing research.

Private companies, on the other hand, have a much stronger incentive to keep their technology proprietary. Developing software systems is an expensive endeavor. Naturally, a business wants to make a return on this investment. Making the results of your work freely available to competitors doesn’t seem like the smartest choice if you are a business owner.

Still, in data science, several powerful incentives pull corporate interests in the direction of favoring open-source implementations.

Access to open source tools and talent

Open-source tools offer a lower barrier to entry than commercially licensed software. Companies can experiment more easily and with fewer constraints. They are also more likely to find talent for programming languages and data science tools that are freely available to everyone.

A case in point is Python, the dominant programming language for data science, which happens to be open source. It offers one of the most versatile and extensive ecosystems for manipulating data and building machine learning models. For many data science applications, Python has even displaced commercial tools like MATLAB.

Most data science and machine learning frameworks, such as TensorFlow, scikit-learn, and PyTorch, offer Python interfaces and are also open source.

Often, their creators are large companies that are already dominant in their respective markets. Evidently, the benefits of making a library like TensorFlow open source outweigh the costs for its creator, Google.

While Google gave potential competitors a powerful deep learning tool, it probably benefits more from what open-sourcing TensorFlow brought about: a massively expanded talent pool, a surge of deep learning innovation, and widespread adoption of the framework by other companies.

Other machine learning libraries, such as XGBoost, originated as research projects in universities. For these institutions, the benefits of open-source software are overwhelming for the reasons discussed above.

Access to data and models

Most machine learning models require large amounts of data to train. Modern models, especially the deep neural networks used in computer vision and natural language processing, also require vast amounts of computational resources. This would present an almost insurmountable challenge for smaller organizations and individuals, who simply do not have this much data internally, nor the budget to run expensive training experiments. If it weren’t for open-source data and pre-trained models, machine learning would be almost exclusively the domain of large corporations. That may be in the interest of those corporations’ shareholders, but certainly not of society at large, which benefits from the innovations produced by startups and individuals.

Even for large corporations, the widespread availability of open-source data and pre-trained machine learning models has benefits.

Many of the cutting-edge models developed by researchers at companies like Google and Facebook have been open-sourced. Anyone can download these models from GitHub and use them in their own data science projects.

But why are these corporations so generous in sharing their models and their data?

From the perspective of an established corporation, it makes sense to avoid risky ventures and instead aim to expand market share through more traditional strategies.

Startups tend to be better suited for engaging in novel high-risk ventures because they are smaller, more agile, and have less to lose.

If a large company wants to enter a novel market, or obtain new technology, acquiring a successful startup in the desired field may be a smarter move than trying to do everything from scratch in-house.

For example, Google acquired DeepMind in 2014 for the potential it saw in DeepMind’s research in reinforcement learning and general-purpose AI.

To maximize the chances that innovative data science and artificial intelligence startups emerge, it makes sense to give ambitious upstarts the tools and data they need.

Furthermore, many of the researchers working on commercial projects come from academic settings. They bring with them a culture of collaboration based on open source.

Researchers and developers are naturally inclined to showcase their work. Therefore, a commitment to open source and the opportunity for employees to participate in open source projects can go a long way to make a company a more attractive employer for highly coveted data science talent.

Open source data science education

The foundational knowledge for data science includes advanced skills in mathematics, statistics, and programming. Until a few years ago, this knowledge was deeply buried in academic textbooks and usually acquired by obtaining a technical university degree.

Today, an ambitious self-starter can learn all of these things via resources that are freely available on the web. An army of YouTube educators and bloggers has emerged that makes previously dry and highly academic topics accessible in a fun and easy-to-digest way.

These new educational resources grow the talent pool by making data science more accessible for a larger group of people, which also benefits companies.

Without open-source software and open-source data, offering this type of education for free would be much more difficult.

Online education platforms offer academic curricula that often match or exceed traditional university courses in terms of quality. In many cases, these courses are accompanied by GitHub repositories full of open-source code.

Reliability, security, and speed

Developing and maintaining a custom data science solution from scratch in-house presents a major challenge to most companies. The larger a software system grows, the more susceptible it is to bugs and the more difficult it is to find problems in the source code and deploy the system into production.

Building on open-source software and models can significantly ease these burdens and shorten time to market. Bugs in widely used open-source libraries are likely to have already been discovered and reported by previous users. If bugs do occur, developers are free to go into the code and fix them without having to worry about violating licensing agreements. If the open-source tool turns out not to be a good fit, no licensing fees have been sunk into a failed trial.

Conclusion

Even for private businesses that have a commercial interest in protecting their software, there are strong incentives for using and building open-source data science solutions.

More recently, the Covid-19 pandemic has put many organizations under enormous pressure to digitize data-heavy processes as quickly as possible while physically scattering technical talent. With its reliance on a community of physically dispersed individuals and flexibility of adoption, open-source data science is becoming an even more attractive choice among cash-strapped governments, non-profits, and businesses.