Organizations are adopting modern data management approaches, such as semantic-based knowledge graphs, to connect data across the enterprise and accelerate the value from their data lake investments.
Data lakes can store a wide variety of data types and handle huge volumes of data rapidly, which has led to their widespread adoption. Gartner defines a data lake as a collection of storage instances of various data assets stored in a near-exact, or even exact, copy of the source format of the originating data stores. Data lakes therefore hold enormous promise for modern enterprise data architectures. Implementations continue to succeed at uniting enterprise data physically; however, they can fall short in delivering returns for business users. This is because the bulk of the data within the data lake is unconnected and stored in its native form, requiring businesses to spend considerable time and money preparing it for analysis.
Data lakehouses, an approach that combines elements of the data warehouse with those of the data lake, help organizations co-locate data from across the organization using cost-effective storage. They also make that data available at the computational layer, allowing organizations to capitalize on the benefits of AI and reducing the need to maintain expensive, brittle ETL pipelines against traditional, costly on-prem data warehouses. However, while lakehouses address the data access problem, they have yet to democratize access so that non-technical users can self-serve and collaborate to generate the rapid insights needed to keep pace with consumer preferences and changing business dynamics.
In the past, organizations linked BI tools directly to their data lakes, but this created other issues, such as higher latency, reduced collaboration and reuse, and an inability to leverage data across domains to provide context. These storage solutions also hindered self-service data exploration, limiting users' ability to enrich analytics and infer new insights.
To resolve those challenges, organizations are adopting modern data management approaches such as enterprise knowledge graphs to connect data across the enterprise and accelerate the value from their data lake investments. By connecting enterprise data with business semantics, knowledge graphs reduce the cost of data integration and help generate powerful insights into complex business challenges, all while enabling more agile data operations.
Semantic layers relate data for real-world use cases
A semantic layer is a data layer that operates between data storage and analytics. It represents a logically enriched view of information as a set of interrelated business concepts and is achieved by implementing a knowledge graph. Powering this semantic data layer, the enterprise knowledge graph enables users to explore and exploit connections across their data universe with business context, so they can achieve a complete and accurate understanding of any given scenario, such as:
- Asking questions based on business concepts and the inter-relationships between them. Mapping concepts to the underlying metadata (i.e., tables, views, and attributes) makes it possible to quickly establish a pipeline for sharing data across applications.
- Quickly running flexible federated queries between the data in the data lake and other structured, semi-structured, or unstructured sources in support of ad hoc analysis. By linking and querying data both in and outside of the data lake, organizations can realize just-in-time, cross-domain analytics for richer and faster insights without creating data sprawl.
- Reducing data wrangling and data movement by easily sharing findings through visualizations, which promotes data storytelling and enables self-service analytics directly within the reusable semantic layer.
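The concept-to-metadata mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's actual implementation; every concept, table, and column name below is invented:

```python
# Minimal sketch of a semantic-layer mapping: business concepts are
# linked to the physical tables/columns that hold their data, plus
# the relationships between concepts. All names are illustrative.

SEMANTIC_LAYER = {
    "Customer": {
        "source": "crm.customers",            # physical table backing the concept
        "attributes": {"name": "cust_name", "region": "sales_region"},
        "related_to": {"Order": "places"},    # concept-level relationship
    },
    "Order": {
        "source": "erp.order_lines",
        "attributes": {"total": "net_amount", "date": "order_dt"},
        "related_to": {"Customer": "placed_by"},
    },
}

def resolve(concept: str, attribute: str) -> str:
    """Map a business-level question ("Customer region") to the
    physical column a query engine would actually touch."""
    entry = SEMANTIC_LAYER[concept]
    return f"{entry['source']}.{entry['attributes'][attribute]}"

def neighbors(concept: str) -> list[str]:
    """Concepts reachable in one hop, enabling cross-domain questions."""
    return sorted(SEMANTIC_LAYER[concept]["related_to"])

print(resolve("Customer", "region"))   # -> crm.customers.sales_region
print(neighbors("Order"))              # -> ['Customer']
```

Because applications reference the concept names rather than the physical columns, the underlying storage can change without breaking the questions users ask.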
How Boehringer Ingelheim used a semantic layer to transform its data lake
As the world’s largest privately held pharmaceutical company, Boehringer Ingelheim had numerous teams of researchers working independently to develop new treatments. However, data was often siloed within these groups, making it difficult to link target, gene, and disease data across different parts of the company. The teams tried several tech stack approaches: some built data lakes, but inadequate virtualization capabilities forced them to build ETL pipelines to move data; others tried to predefine all requirements from scratch in an RDBMS, but that approach couldn’t support the necessary complexity or flexibility.
Ultimately, they realized they needed a technical foundation that would enable data sharing across the entire company and connect data from its disparate parts to increase research and operational efficiency, raise output, and accelerate drug research. To support these goals, Boehringer Ingelheim applied an enterprise knowledge graph platform as a semantic layer over its data lake, making the information easier to navigate, query, and analyze. The semantic layer acted as a consolidated one-stop shop for 90% of the company's R&D data, and the knowledge graph connected metadata from across workflow systems: for example, it integrated data about how samples were generated and stored, identified which studies were underway or completed, and recorded how specific data points were created and stored.
The semantic layer allowed bioinformaticians to access and work with the data without first having to clean it or maintain its relationships to the proper entities. Users could search for a particular disease, study, or gene and then explore the results in a Wikipedia-like experience. Analysts could also see, directly in the data model, how one piece of data relates to the rest of the R&D data, and use an intuitive query builder UI to pull reports from the knowledge graph with no SPARQL knowledge required.
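To make the "no SPARQL required" idea concrete, here is a rough sketch of what such a query builder might do under the hood: turn point-and-click selections into SPARQL text. The prefix, IRIs, and concept names are invented for illustration and are not Boehringer Ingelheim's actual ontology:

```python
# Sketch: assemble a SPARQL SELECT query from UI-level selections so
# analysts never write SPARQL by hand. The ex: namespace is illustrative.

PREFIXES = "PREFIX ex: <http://example.org/rd#>\n"

def build_query(concept: str, filters: dict[str, str], fields: list[str]) -> str:
    """Generate a query for one concept, projecting the chosen
    fields and applying simple attribute filters."""
    proj = " ".join(f"?{f}" for f in fields)
    lines = [f"?s a ex:{concept} ."]            # match instances of the concept
    for attr in fields:
        lines.append(f"?s ex:{attr} ?{attr} .") # bind each projected field
    for attr, value in filters.items():
        lines.append(f'?s ex:{attr} "{value}" .')  # constrain by filter value
    body = "\n  ".join(lines)
    return f"{PREFIXES}SELECT {proj} WHERE {{\n  {body}\n}}"

q = build_query("Study", {"disease": "asthma"}, ["title", "status"])
print(q)
```

The generated string can then be sent to any SPARQL endpoint; the point is that the user only ever chooses concepts, fields, and filters.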
The knowledge graph has allowed bioinformaticians to easily identify useful signals within large sets of noisy data and to answer highly specific questions. This is possible because they can query directly against the linked data dictionary and move straight into analysis without any integration or cleansing.
Analysts also worked more efficiently because R&D data was made accessible through a standardized protocol. They no longer needed to consult data catalogs, hunt for where data was located, or spend time working out how various datasets were organized in order to integrate them. Instead, they simply referenced the knowledge graph and asked questions through a natural language interface.
Finally, using the knowledge graph’s virtualization capabilities, the organization saved money on redundant data storage and costly, time-consuming ETL processes. Virtualization provided a single, centralized access point for data scientists to work from while allowing the data to remain in the relational databases and other environments where it already existed. The data models accompanying this integration also made the organization more efficient by helping it avoid redundant research, reuse past answers, and focus on new opportunities that built on existing knowledge.
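The virtualization pattern can be sketched as a single access point that fans a question out to the systems where the data already lives, rather than copying anything into a new store. The source names and record shapes below are illustrative stand-ins, not a real deployment:

```python
# Sketch of a virtualized access point: one entry point routes a query
# to every registered source; data stays in its home system, so no ETL
# or duplicate storage is needed. All names and records are illustrative.

class VirtualAccessPoint:
    def __init__(self):
        self.sources = {}              # source name -> callable returning rows

    def register(self, name, fetch):
        self.sources[name] = fetch     # register, don't copy, the source

    def query(self, gene):
        """Fan one question out to every source and merge the
        answers, labeled by the system they came from."""
        results = {}
        for name, fetch in self.sources.items():
            results[name] = [row for row in fetch() if row["gene"] == gene]
        return results

vap = VirtualAccessPoint()
vap.register("assay_db", lambda: [{"gene": "TP53", "assay": "A1"}])
vap.register("trial_db", lambda: [{"gene": "TP53", "trial": "T9"},
                                  {"gene": "BRCA1", "trial": "T2"}])
print(vap.query("TP53"))
```

In a real system the `fetch` callables would be live connectors to relational databases or APIs; the design choice is the same: one logical access point, many physical homes for the data.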