Expert: Big Data Requires Scaling and Accessibility


Big Data has transformed business and may hold the key to battling complex diseases such as cancer. Taking advantage of Big Data, however, requires high-performance hardware as well as software approaches that can lower the barrier to entry for users.

Big Data problems are widely characterized by the five V’s: volume, velocity, variety, value and veracity. In the scientific world, the amount of data captured by experiments, simulations and new instruments has increased dramatically in recent years. Storing and efficiently processing data of this volume and variety requires new, scalable computing methods. But Big Data has also changed the scientific research process itself: data analysis is no longer just a step for accepting or rejecting a hypothesis, but a way to gain new insights and build new models directly from the data. This shift has sparked debate in experimental sciences such as biology over the merits of traditional “hypothesis-driven” research versus the newer “unbiased, data-driven” research.

Ten years ago, for example, bioinformatics was still considered an emerging field and few biological labs emphasized Big Data analysis. Such analysis has now become a routine requirement, thanks to technology developments such as next-generation sequencing and to data collection projects such as the 1000 Genomes Project and The Cancer Genome Atlas. Jim Gray, a visionary computer scientist, hailed data-intensive computing, or so-called Big Data, as the fourth paradigm in science. And Big Data research may indeed hold the key to conquering complex diseases such as cancer.

Big Data and Business Transformation

A similar transformation has happened in the business world, where the data generated by everyday life is far larger, and arrives far faster, than the data from scientific research. According to a report published by the McKinsey Global Institute in 2011, companies with more than 1,000 employees held an average of 200 terabytes of digital data across all business sectors, and the use of Big Data is crucial for company growth. At the time of the McKinsey report, the Internet was estimated to hold about 500 exabytes of data; the estimated size of the Internet in 2015 is 8,000 exabytes (or 8 zettabytes), a 16-fold increase. Scalable methods and tools are required to process and harness Big Data effectively, and the data itself has become a driving force behind new business opportunities. For example, data analytics is a foundation and core component of many successful enterprises, such as Google and Amazon.

Data analysis is no longer used just to process sales summaries, but also to detect and predict new trends among both individuals and groups. The data to be analyzed is no longer just numerical measurements but diverse information in the form of text, images and videos. Along with the emerging Internet of Things and wearable computing technology, the amount of sensor data that can be gathered will no doubt intensify the Big Data challenge and create more business opportunities.

Machine Learning and Prediction

At the core of Big Data lie computer science and the computing industry. A key word in realizing the power of Big Data is “prediction.” Many machine learning and data-mining methods, such as support vector machines, hidden Markov models and Boltzmann chains, to name a few, have been developed to build predictive models from existing datasets based on mathematical and statistical principles. While they differ in their applicability and efficiency for a particular problem, the solutions always require substantial computational resources. However smart a learning algorithm is, and however well it can be optimized, Big Data analysis will eventually require high-performance computing resources and support.

To support Big Data, computing resources must be able to scale up, by moving to more powerful high-end hardware, and/or scale out, by distributing the workload across multiple nodes. Fueled by these computational needs, new computing and storage technologies are in rapid development to address Big Data problems. For example, computer architectures such as general-purpose graphics processing units (GPGPUs) and Intel Xeon Phi processors can be readily exploited to increase the number of processing cores within one box; new solid-state drives and flash drives can improve input/output operations per second (IOPS) by orders of magnitude over traditional hard drives. Meanwhile, new programming paradigms such as MapReduce are being rapidly developed and adopted. These paradigms are powerful: they simplify code development by letting programmers express an analysis as simple operations that the system can execute in parallel. Furthermore, many new systems for Big Data processing are open-source software and are quickly adopted by domain researchers, businesses and service providers.
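
To make the MapReduce idea concrete, here is a minimal sketch of a word count in plain Python. It is an illustration only, not how Hadoop or any particular framework is implemented: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, and a local thread pool stands in for the cluster nodes that a real system would distribute the map tasks across.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_outputs):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key, here by summing them."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents, workers=4):
    # Assumption for illustration: threads play the role of worker nodes.
    # Each document is mapped independently, which is what makes the
    # map step easy to scale out.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, documents))
    return reduce_phase(shuffle(mapped))


# Hypothetical input: each string represents one document on one node.
docs = ["big data needs scale", "scale out or scale up", "big clusters"]
counts = word_count(docs)
print(counts)
```

The programmer writes only the map and reduce functions; grouping the intermediate pairs and spreading the work across workers is left to the framework, which is the simplification the paradigm offers.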

One example of real-time analysis of Big Data involves Twitter data sets. Real-time data analysis usually requires that the system ingest and analyze a large volume of data within a short time frame. For years, real-time analysis has been coupled with database analysis of structured data but was difficult to apply to unstructured data such as text streams. In the past couple of years, a number of Big Data systems have been developed for real-time analysis, such as Spark, Storm and Kafka. Those systems provide functionality ranging from data preprocessing to machine-learning algorithms, with distributed computing implementations that facilitate real-time analysis of high volumes of data. Many large companies have begun to use systems like Hadoop and Spark together with their existing data warehouses and commercial database tools such as Oracle. Information technology providers are also embracing these new technologies.
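
As a rough illustration of what real-time analysis of an unstructured text stream involves, the sketch below keeps a running count of hashtags seen within a sliding time window. It is a single-process toy, not the API of Spark, Storm or Kafka: the `SlidingWindowCounter` class, the 60-second window and the sample messages are all assumptions made for the example, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Count hashtags seen in the last `window_seconds` of a stream."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, hashtag) in arrival order
        self.counts = Counter()  # hashtag -> occurrences inside the window

    def ingest(self, timestamp, text):
        """Drop expired events, then record hashtags from one message."""
        self._evict(timestamp)
        for token in text.split():
            if token.startswith("#") and len(token) > 1:
                tag = token.lower()
                self.events.append((timestamp, tag))
                self.counts[tag] += 1

    def _evict(self, now):
        # Remove events that fall at or before the window's start.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_tag = self.events.popleft()
            self.counts[old_tag] -= 1
            if self.counts[old_tag] == 0:
                del self.counts[old_tag]

    def top(self, n):
        """Return the n most frequent hashtags currently in the window."""
        return self.counts.most_common(n)


# Hypothetical stream of (seconds, message) pairs.
stream = [
    (0, "Loving #BigData at scale"),
    (5, "#bigdata meets #HPC"),
    (20, "New #Spark release"),
    (70, "#HPC cluster online"),
]
counter = SlidingWindowCounter(window_seconds=60)
for ts, message in stream:
    counter.ingest(ts, message)
print(counter.top(3))
```

Each message is processed once as it arrives and old events are discarded incrementally, so the cost per message stays small regardless of how long the stream runs. Distributed streaming systems apply the same windowing idea across many nodes.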

The Promise of R

Big Data problems have brought academia, business and industry closer than ever. Take R, an open-source analysis environment, as an example. R was initially developed more than 20 years ago as an alternative tool for statistical analysis. Due to its low cost, it has been widely used within academia, and its user communities have developed it into a software environment with comprehensive data analysis support. Recently, members of the R and HPC communities have tried to step up to the Big Data challenge with R, resulting in methods for effectively combining R with a variety of high-performance and high-throughput computing technologies. Although it lacks the efficiency of some other high-performance computing approaches, R has also been adopted in the business world and has been included as part of Big Data offerings by many IT providers such as Oracle and Intel. Such software tools not only provide collections of analytic methods but can also use new hardware transparently, reducing the effort required from end users.

While new hardware technologies bring opportunities to improve performance, it remains a challenge for domain analysts to fully realize the potential of Big Data. Conquering Big Data with the latest software and hardware requires both an understanding of the benefits of the hardware and a lower barrier to entry through well-designed software. It is important now to make the latest technological advances accessible and usable for end users.

We’ll tackle some of these topics at the IEEE International Conference on Big Data, which will show some of the latest high-performance computing technologies, software and algorithmic developments. Hosted three years in a row, the IEEE Big Data Conference has established itself as the top-tier research conference on Big Data, and brings together leading researchers and developers across the globe from academia and industry. The conference will discuss high-quality theory and applied research findings in Big Data, infrastructure, data management, search and mining, privacy/security, and applications.

Weijia Xu and Xiaohua Tony Hu

About Weijia Xu and Xiaohua Tony Hu

Dr. Weijia Xu is a research scientist and group manager for the Data Mining & Statistics group at the Texas Advanced Computing Center at the University of Texas at Austin. Dr. Xu's main research interest is to enable data-driven discoveries by developing new methods and applications that facilitate the data-to-knowledge transfer process. Dr. Xu has over 40 peer-reviewed conference and journal publications in similarity-based data retrieval, data analysis, and information visualization with data from various scientific domains. He has a master’s degree in biological sciences and a doctoral degree in computer science from The University of Texas at Austin. This article was co-authored with Xiaohua Tony Hu, a professor and the founding director of the data mining and bioinformatics lab at the College of Computing and Informatics, Drexel University.
