SAP’s new data catalog capability has changed how the enterprise solutions firm helps clients to tackle big data.
At the heart of an emerging distributed approach to managing data is a new data catalog capability that has been added to SAP Data Hub.
SAP Data Hub combines data virtualization capabilities with an instance of the Apache Spark in-memory computing framework to make it possible to read and write data across a distributed computing environment regardless of whether data is stored in the SAP HANA database or somewhere else. SAP originally made that Apache Spark distribution available in an offering known as Vora.
Now SAP has eliminated Vora as a standalone offering in favor of embedding that capability alongside the data pipelining and virtualization software included in version 2.3 of SAP Data Hub, says Ken Tsai, global vice president and head of product marketing for cloud platform and data management.
To keep track of what data is located where, SAP Data Hub also now includes a Data Catalog that captures all the metadata that exists within a distributed computing environment, says Tsai. That functionality will prove to be a critical element of SAP’s approach to processing data inside and outside of HANA in near real time, he adds.
As an in-memory database, HANA has emerged as a cornerstone of the SAP approach to processing data in real time. But it’s not feasible to move every piece of data in the enterprise into HANA. Including an instance of Apache Spark in SAP Data Hub makes it possible to process data outside of HANA at speeds that keep pace with the rate at which data is processed within HANA.
Coming soon: container services
As part of that effort, both HANA and SAP Data Hub will soon be running as a set of container services hosted on a Kubernetes cluster, adds Tsai. Kubernetes makes it simpler to deploy either HANA or SAP Data Hub anywhere as part of a hybrid cloud computing strategy that will enable IT organizations to more easily process data wherever it’s located across what SAP describes as an emerging intelligent enterprise. That intelligent enterprise will, for example, be able to process in near real time the massive amounts of data required to drive the machine and deep learning algorithms behind artificial intelligence (AI) models, says Tsai.
“Algorithms are meaningless without being able to access data,” says Tsai.
Tsai also notes that the SAP approach to data management will enable IT organizations to process queries anonymously without having to redact data in a way that makes it unreadable. That will prove to be a critical requirement in healthcare applications, where researchers will need to be able to study data involving a larger number of patients without having to sacrifice data sets because the underlying data has been masked in a way that a query can’t process, explains Tsai.
In general, hybrid cloud computing has proven to be an elusive goal because each cloud and on-premises IT environment processes and stores data differently. Via a combination of HANA and the Vora technology now embedded in SAP Data Hub, it’s apparent that SAP is setting out to solve that challenge by putting in place a layer of data processing software that shares access to a common framework for processing metadata. It may take a while for HANA and SAP Data Hub to become federated across all of enterprise IT. But once they are, SAP is betting that the future of IT will once again be driven more by the ability to process data efficiently than by which underlying platform that data happens to reside on.