How Knowledge Graphs Make LLMs Accurate, Transparent, and Explainable


When LLMs are reinforced with knowledge graphs, they have the potential to significantly benefit knowledge workers.

Large Language Models (LLMs) are an important technology that has already made inroads into the enterprise. ChatGPT is just one manifestation of the broader LLM concept, which, despite its immense potential for enhancing business processes, brings a range of challenges. Reaping the benefits of LLMs reliably requires some work on your part, and knowledge graphs increasingly have a role to play in that work.

Why? Some of the challenges of LLMs are well known, such as hallucinations. OpenAI notes that ChatGPT sometimes writes “plausible-sounding but incorrect or nonsensical answers,” and this remains a significant research problem in both academia and industry. Other issues include the high cost of training models and the resulting complexity of updating them, the difficulty of conducting audits and providing explanations, inconsistency in reciprocal answering, and the predominance of English-language content in their training data. LLMs are also poor at reasoning and need careful prompting to bias them toward correct answers.

In systems where compliance or safety is important, we cannot take LLMs at face value, but there is a way forward. Many of these issues can be minimized by supporting an LLM with a knowledge graph, blending the best of ChatGPT with a graph data model, which is well suited to modeling interconnected relationships.

Supporting LLMs with a knowledge graph

A knowledge graph is an information-rich structure that provides a view of entities and how they interrelate. It enables us to represent these entities and connections as a network of verifiable facts, essentially forming a structured graph of our collective knowledge.
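To make the idea of a "network of verifiable facts" concrete, here is a minimal sketch in plain Python. It assumes facts are stored as (subject, relationship, object) triples; the entity names, relationship types, and helper functions are all hypothetical and are not taken from any particular graph database.

```python
from collections import defaultdict

# Hypothetical facts, invented purely for illustration.
TRIPLES = [
    ("Alice", "WORKS_FOR", "Acme"),
    ("Bob", "WORKS_FOR", "Acme"),
    ("Acme", "LOCATED_IN", "London"),
    ("Alice", "KNOWS", "Bob"),
]


def build_index(triples):
    """Index outgoing relationships by their subject entity."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def neighbours(index, entity, predicate=None):
    """Return entities connected to `entity`, optionally filtered by relationship type."""
    return [
        obj
        for pred, obj in index.get(entity, [])
        if predicate is None or pred == predicate
    ]


index = build_index(TRIPLES)
print(neighbours(index, "Alice"))               # ['Acme', 'Bob']
print(neighbours(index, "Alice", "WORKS_FOR"))  # ['Acme']
```

A production knowledge graph would live in a graph database rather than a Python list, but the shape of the data is the same: named entities joined by typed relationships, each of which can be checked individually.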

Once you’ve constructed a knowledge graph, it becomes a versatile resource. Beyond simple queries for patterns, you can leverage it for computational tasks by employing graph algorithms and harnessing graph data science. This expanded toolkit empowers you to pose sophisticated questions about the entire graph of billions of elements rather than limiting your exploration to specific subgraphs.
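As a rough illustration of the difference between a local pattern query and a whole-graph computation, the sketch below runs two simple graph algorithms over a toy edge list: a degree count (which entity is most connected?) and a breadth-first shortest path. The data and function names are assumptions made for this example; real graph data science libraries offer far more scalable versions of both.

```python
from collections import Counter, deque

# Hypothetical undirected connections between entities.
EDGES = [
    ("Alice", "Acme"),
    ("Bob", "Acme"),
    ("Acme", "London"),
    ("Carol", "London"),
]


def degree_centrality(edges):
    """Count how many connections each entity has across the whole graph."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


def shortest_path_length(edges, start, goal):
    """Breadth-first search; returns the number of hops, or None if unreachable."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None


print(degree_centrality(EDGES).most_common(1))         # [('Acme', 3)]
print(shortest_path_length(EDGES, "Alice", "Carol"))   # 3
```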

Representing data as a graph can unveil previously hidden facts and unlock valuable insights. Furthermore, it allows for the generation of embeddings from the graph, which encompass both its data and structure. These embeddings can be employed in machine learning workflows or serve as a point of integration with large language models (LLMs). However, it’s crucial to recognize that a knowledge graph is only one part of the equation. The other half involves understanding how to integrate LLMs into this framework.

A small LLM

A “small” Large Language Model can radically cut down the kinds of errors that can occur with ChatGPT. While this approach may limit the range of responses the LLM can generate, because it will typically have been trained on far less data than a model that consumes the Internet, it also means that the responses it generates will be more reliable.

It is not possible to achieve these outcomes simply by connecting ChatGPT with a document cache. If the CIO wants to start exploiting the untapped potential in their internal data stores by applying LLMs, then building and refining knowledge graphs using proven graph database technology is the way ahead.

Here, a real breakthrough has been made by a group of researchers through the creation of BioCypher. This FAIR (findable, accessible, interoperable, reusable) framework transparently builds biomedical ‘knowledge graphs’ while preserving all the links back to the source data.

What made the difference was using a knowledge graph to organize data from multiple sources, capture information about entities of interest in a given domain, and create connections between them. The team took a big corpus of medical research papers, built a “small” LLM around it, and derived a knowledge graph from this new model.

This approach allows researchers to more effectively interrogate and work with a mass of previously unstructured data in a well-organized and well-structured way. And having this information resource in a knowledge graph means it is transparent, and the reasons for its answers are clear.
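One way to picture why a graph with links back to its sources is transparent: every fact carries a reference to where it came from, so an answer can be returned together with the chain of facts that supports it. The sketch below is a simplified illustration under that assumption. It is not BioCypher's API, and the gene, drug, and paper identifiers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    source: str  # hypothetical identifier of the originating document


# Invented facts and source identifiers, for illustration only.
FACTS = [
    Fact("GeneA", "ASSOCIATED_WITH", "DiseaseX", "paper-001"),
    Fact("DrugB", "TARGETS", "GeneA", "paper-002"),
]


def explain_link(facts, start, goal):
    """Depth-first search along outgoing facts; returns the supporting chain or None."""
    stack = [(start, [])]
    visited = set()
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for fact in facts:
            if fact.subject == node:
                stack.append((fact.obj, path + [fact]))
    return None


chain = explain_link(FACTS, "DrugB", "DiseaseX")
for fact in chain:
    print(f"{fact.subject} {fact.predicate} {fact.obj} (source: {fact.source})")
```

The point of the design is that the explanation is the path itself: each hop can be audited against the document it cites.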

There is nothing to stop you from collecting a substantial amount of information in text form and running an LLM to do the natural language ingestion, too. From the trained model, you can then extract a knowledge graph to help you make the most sense of vital corporate knowledge in a transparent and reproducible way.

The reverse is also true. You can control the training of a small language model by feeding a knowledge graph into it. Because you control the input to the model, the result is a responsive, easy-to-interrogate natural language interface on top of your graph, without the hallucinations of generative AI trained on the open Web.


Supporting natural language queries

Using Generative AI can help knowledge workers and specialists to execute natural language queries without having to understand and interpret a query language or build multi-layered APIs. This has the potential to increase efficiency and allow employees to focus their time and energy on more value-added tasks.
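A common way to connect a natural language question to a knowledge graph is to look up the facts relevant to the question and hand them to the model as context. The sketch below shows only that grounding step, under simplifying assumptions: entities are matched by naive substring search, the triples are invented, and the actual model call is left out because it depends on whichever LLM service you use.

```python
# Hypothetical facts, invented for illustration.
TRIPLES = [
    ("Alice", "WORKS_FOR", "Acme"),
    ("Bob", "WORKS_FOR", "Acme"),
    ("Acme", "LOCATED_IN", "London"),
]


def facts_about(question, triples):
    """Select triples whose subject or object is mentioned in the question.

    Naive substring matching; a real system would use proper entity linking.
    """
    text = question.lower()
    return [t for t in triples if t[0].lower() in text or t[2].lower() in text]


def build_grounded_prompt(question, triples):
    """Assemble a prompt that restricts the model to facts drawn from the graph."""
    fact_lines = [f"- {s} {p} {o}" for s, p, o in facts_about(question, triples)]
    return (
        "Answer using only the facts below. If they are insufficient, say so.\n"
        "Facts:\n" + "\n".join(fact_lines) + "\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt("Where is Acme located?", TRIPLES)
print(prompt)  # This string would then be sent to the LLM of your choice.
```

Because the facts in the prompt come straight from the graph, each part of the model's answer can be traced back to a specific relationship.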

Basecamp Research is a UK-based biotech firm that is mapping Earth’s biodiversity and trying to ethically support bringing new solutions from nature to market. To do so, it has built the planet’s largest natural biodiversity knowledge graph, BaseGraph, which has over 4 billion relationships.

The dataset is feeding a lot of other innovative projects. One is protein design, where the team is utilizing a large language model fronted by a ChatGPT-style model for enzyme sequence generation called ZymCtrl. With BaseGraph being purpose-built for Generative AI, Basecamp is now wrapping more and more LLMs around its entire knowledge graph. The firm is upgrading BaseGraph to a fully LLM-augmented knowledge graph.

LLM-knowledge graph combination

Basecamp Research is not alone in exploring the LLM-knowledge graph combination. A household-name global energy company is using knowledge graphs with ChatGPT in the cloud for its enterprise knowledge hub. And a global publisher is readying a Generative AI tool trained on knowledge graphs that will make a huge wealth of complex academic content more findable, accessible, and explainable to research customers using pure natural language.

By training an LLM on a knowledge graph’s curated, high-quality, structured data, many of the challenges associated with ChatGPT can be addressed, enabling organizations to get the full benefit of Generative AI.

Finally, it’s worth reflecting on the future of this technology combination. For example, Yejin Choi and her colleagues at the University of Washington have discovered that LLMs like ChatGPT can be improved with a combination of other AIs and Knowledge Graphs to produce models that are smaller and better.

This approach may initially seem counterintuitive since it involves throwing away data from the larger model. However, by using a “critic AI” that asks questions and engages in logical reasoning, we can distill a knowledge graph. This graph can be used to train a secondary LLM only on good-quality information. In this case, the “student” tends to be smaller and smarter than its original teacher.
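The filtering idea can be sketched in a few lines. This is only a toy illustration of the general pattern (a critic scores candidate statements, and only high-scoring ones are kept for training), not the researchers' actual method. The candidate triples, the scores, and the threshold are all invented, and the critic here is a lookup table standing in for what would be another model.

```python
# Hypothetical statements produced by a large "teacher" model. Some are wrong.
CANDIDATES = [
    ("Paris", "CAPITAL_OF", "France"),
    ("Paris", "CAPITAL_OF", "Germany"),
    ("Berlin", "CAPITAL_OF", "Germany"),
]

# Stand-in for the critic: a real critic would be a model, and these
# scores are invented for illustration.
CRITIC_SCORES = {
    ("Paris", "CAPITAL_OF", "France"): 0.97,
    ("Paris", "CAPITAL_OF", "Germany"): 0.05,
    ("Berlin", "CAPITAL_OF", "Germany"): 0.95,
}


def critic_score(triple):
    """Return the critic's confidence that a statement is true (0.0 if unknown)."""
    return CRITIC_SCORES.get(triple, 0.0)


def distill(candidates, threshold=0.8):
    """Keep only the statements the critic rates at or above the threshold."""
    return [t for t in candidates if critic_score(t) >= threshold]


def to_training_text(triples):
    """Turn the retained triples into plain sentences for training a smaller model."""
    return [f"{s} {p.replace('_', ' ').lower()} {o}." for s, p, o in triples]


kept = distill(CANDIDATES)
print(to_training_text(kept))
```

The retained triples form the distilled knowledge graph, and the discarded ones are the data that gets thrown away from the larger model.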

I think we can agree that when LLMs are reinforced with knowledge graphs, they have the potential to significantly benefit knowledge workers. The synergy is proving to be a winning combination, and one in which the notable limitations of ChatGPT are substantially mitigated.

Jim Webber

About Jim Webber

Jim Webber is Chief Scientist at graph database and analytics leader Neo4j, co-author of Graph Databases (1st and 2nd editions, O’Reilly) and Graph Databases for Dummies (Wiley). More detail on this discussion can be found in the just published (August) Building Knowledge Graphs (O’Reilly).
