Graph analytics is gaining favor for business applications in which insights must be derived from massive unstructured, connected datasets. But the computational issues differ from those of running analytics on traditional, structured data. RTInsights recently sat down with Katana Graph co-founders Keshav Pingali, CEO, and Chris Rossbach, CTO. We discussed what makes graph analytics different, how to speed the time to results, and how the company’s partnership with Intel fits into the equation. Here is a summary of our conversation.
RTInsights: How does graph analytics differ from traditional analytics?
Pingali: The way we see it at Katana is that as datasets become larger, they tend to become unstructured, and they also tend to become sparse.

I’ll give you an example of what I mean by that. We are all familiar with social networks. A social network graph has a vertex for each person in an organization, and if two people know each other, you put an edge between them in the graph.

If you consider the social network graph of Katana, we have about 25 employees at this point, and everybody knows everybody else, so it’s a very structured graph. That kind of data can be put into SQL tables, and you can use SQL queries with it.

But if you imagine a bigger company, each person will know fewer and fewer people overall. That’s what we mean by sparse. The people one person knows will be a very different group than the people someone else in the company knows. That’s what we mean by unstructured.

It turns out that once data gets to that size – once data becomes sparse and unstructured – it makes sense to process it using what are called graph algorithms. You could use SQL, but it would be very inefficient because SQL is not intended for these sorts of sparse, unstructured datasets.
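To make that concrete, here is a minimal Python sketch (the people and relationships are invented for illustration) that stores a “who knows whom” network as a sparse adjacency list and answers a two-hop question by following only the edges that actually exist. A SQL engine would express the same traversal as repeated self-joins over a relationship table, which gets expensive as the hop count grows.

```python
# Sparse social graph as an adjacency list: each person maps only to the
# people they actually know, so storage grows with edges, not people^2.
from collections import deque

knows = {
    "ana": {"bo", "cy"},
    "bo":  {"ana", "dee"},
    "cy":  {"ana"},
    "dee": {"bo", "eve"},
    "eve": {"dee"},
}

def within_two_hops(start: str) -> set[str]:
    """People reachable from `start` in at most two 'knows' edges."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        person, dist = frontier.popleft()
        if dist == 2:
            continue  # don't expand past the two-hop limit
        for friend in knows.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, dist + 1))
    return seen - {start}

print(within_two_hops("ana"))  # {'bo', 'cy', 'dee'}
```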
Rossbach: Yes, the key issue is that sparsity and irregularity argue for a different set of algorithms and different systems support. Graphs obviously are the most succinct and natural way to capture the data being represented. Still, once you represent the data in that form, the kinds of algorithms you need are different. The runtime and the infrastructure you need at the lower layers to implement graph algorithms efficiently are also very different from what you would see in a traditional relational database.
RTInsights: Different as in complex to build? Expensive? Time-consuming? All the above?
Rossbach: I’m not sure I would argue that graph analytics necessarily has higher or lower complexity than a relational database. It’s just a very different way of approaching the problem, which means that the kinds of components you would assemble to solve the problem are different.

You can solve graph problems with a relational database. But, just as Keshav said, that’s very inefficient because a relational database is designed to capture data as very dense, structured groups of things that go together in tables. The algorithms and the storage layers that support that are all designed to be very efficient when dealing with data with that kind of structure. If you’re dealing with sparse data, using that kind of infrastructure no longer makes sense. Further, suppose you’re willing to customize the storage layer, compute engine, and all the lower layers to fit the kinds of data you’re computing over. In that case, you can achieve massive gains in efficiency and performance. That’s the key motivation behind Katana.
RTInsights: What types of datasets lend themselves to graph analytics?
Pingali: Okay, that’s my favorite subject, so let me give you my view on that. We tend to see graphs everywhere there are large, unstructured, sparse datasets.

Katana is engaged with a few pharma companies, and in pharma, they have what are called knowledge graphs. PubMed is an example of a very famous knowledge graph. It is a graph that contains vertices for all biologically active entities, like viruses, bacteria, animals, and more. It also has vertices for biologically active compounds like arsenic, for example, which obviously can kill you if you take it. And it has vertices for authors of articles, as well as vertices for the articles they have written. If you write an article about the effect of arsenic on human beings, then many edges get added to the graph to capture all that connected information.

Pharma companies are trying to mine all the knowledge that currently exists in the articles that have been written in the biological area to find promising treatments for, say, COVID. That’s an example of an area where graphs, and in particular knowledge graphs, are very, very important. We’re providing the computer science expertise, machine learning expertise, and systems expertise that will enable these folks to do their work more efficiently.
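As a rough illustration of the structure Pingali describes (the author, article, and query below are hypothetical examples, not PubMed data), a knowledge graph can be sketched as typed vertices plus labeled edges, and one article really does add several edges at once:

```python
# A tiny knowledge-graph sketch: typed vertices, labeled directed edges.
nodes = {}   # id -> {"kind": ...}
edges = []   # (src, label, dst)

def add_node(node_id, kind):
    nodes[node_id] = {"kind": kind}

def add_edge(src, label, dst):
    edges.append((src, label, dst))

add_node("arsenic", "compound")
add_node("homo_sapiens", "organism")
add_node("dr_roe", "author")        # hypothetical author
add_node("article_42", "article")   # hypothetical article

# One article about arsenic's effect on humans adds several edges at once.
add_edge("dr_roe", "wrote", "article_42")
add_edge("article_42", "mentions", "arsenic")
add_edge("article_42", "mentions", "homo_sapiens")

def co_mentioned_compounds(organism):
    """Compounds mentioned by any article that also mentions `organism`."""
    articles = {s for s, l, d in edges if l == "mentions" and d == organism}
    return {d for s, l, d in edges
            if s in articles and l == "mentions"
            and nodes[d]["kind"] == "compound"}

print(co_mentioned_compounds("homo_sapiens"))  # {'arsenic'}
```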
Another area where we are currently engaged is security, in identity management and online intrusion detection in networks. You have a computer network, and bad guys are trying to break in. You can build a graph that captures all the activities going on in the network. Then, if you see certain forbidden patterns, you raise an alarm. We worked with BAE to build an intrusion detection system for them as part of a DARPA [Defense Advanced Research Projects Agency] project, and it was very successful.
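Here is a minimal sketch of the “forbidden pattern” idea (the events and the pattern are invented; real intrusion-detection rules are far richer): treat network activity as labeled edges and raise an alarm if some path matches a suspicious sequence of actions in order.

```python
# Activity graph: edges are (actor, action, target). A forbidden pattern is a
# sequence of action labels; an alarm fires if some path matches it in order.
activity = [
    ("ext_host", "connect", "gateway"),
    ("gateway", "login", "db_server"),
    ("db_server", "escalate", "root_shell"),
]

FORBIDDEN = ("connect", "login", "escalate")

def matches_forbidden(edges, pattern):
    """Depth-first search for a path whose edge labels equal `pattern`."""
    by_src = {}
    for src, label, dst in edges:
        by_src.setdefault(src, []).append((label, dst))

    def walk(node, i):
        if i == len(pattern):
            return True  # matched every label in the pattern
        return any(label == pattern[i] and walk(dst, i + 1)
                   for label, dst in by_src.get(node, ()))

    return any(walk(src, 0) for src in {e[0] for e in edges})

if matches_forbidden(activity, FORBIDDEN):
    print("ALARM: forbidden activity pattern detected")
```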
Still another area where graphs show up a lot is the financial services sector. All major banks want what are known as 360-degree views of their customers. For example, if you have a mortgage, they may come to you and say, “Oh, you know what? Maybe you should refinance your mortgage because we’ve looked at your spending patterns, and we think this might be a better deal for you.”
The final area I want to mention is that you can use graph analytics in workflows for designing electronic circuits on chips. The gates or pins are the vertices, and the wires are the edges of the graph. We’re currently engaged with some chip design companies. We’ve shown them that we can use our graph engine to do many things, like partitioning the circuit, placing the gates, and wiring the gates, faster than they can with their current approach.
So basically, we find graphs everywhere we look, from circuits to pharma to banking to online security.
Rossbach: I’d like to add a little bit from a programmer/developer-facing view on that question. As Keshav stated, there’s a very wide range of areas and use cases where a graph is a natural data structure for thinking about problems. But graphs also make it much easier to reduce your data, which shortens the time to insight when you work with a dataset as a graph rather than in other modalities.
Consider the traditional way I might try to understand a large dataset. I must first put the data into a database, which means I need to introduce a schema. There’s a much lower barrier with graphs because you don’t necessarily need a schema. You just need to define the nodes and the edges. That’s an incredibly attractive property of graph datasets in terms of shortening the time to insight and action.
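A small sketch of that lower barrier (the records below are invented): heterogeneous facts go straight into a node/edge structure with no up-front schema, and attributes can differ from node to node.

```python
# No CREATE TABLE step: nodes and edges accept whatever attributes arrive.
graph = {"nodes": {}, "edges": []}

def upsert_node(node_id, **attrs):
    graph["nodes"].setdefault(node_id, {}).update(attrs)

def add_edge(src, dst, **attrs):
    graph["edges"].append({"src": src, "dst": dst, **attrs})

# Mixed-shape records from different sources, ingested as-is.
upsert_node("cust_17", kind="customer", segment="retail")
upsert_node("acct_9", kind="account", opened="2021-03-04")
upsert_node("txn_301", kind="transaction", amount=129.95)

add_edge("cust_17", "acct_9", rel="owns")
add_edge("acct_9", "txn_301", rel="posted")

# A relational design would need agreed columns per table before any of
# this could be loaded; here the model can evolve record by record.
print(len(graph["nodes"]), "nodes,", len(graph["edges"]), "edges")
```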
RTInsights: Would this translate into shorter times to develop the applications?
Rossbach: Yes, I believe so. If you look at the traditional development lifecycle for applications that consume large amounts of data, a big chunk of time is allocated to data design, data cleaning, data management, and data input. You have to manage your data. You have to get it in and out of your system. The degree of effort required to create a model and enforce it is much, much lower with a graph database. There is value in having a very clean and easy programming model, as well.
RTInsights: At the heart of your offering is the ability to scale up graph analytics. Why is that important?
Rossbach: Nobody is going to have less data tomorrow than they have today. Graph analytics is a great way of dealing with data you don’t understand yet. From Katana’s perspective, scaling out graph algorithms is something Keshav’s group has spent many, many years researching. It is a hard problem, and they are getting good at it. As technology trends move toward bigger and bigger data, it’s an area where Katana has a fundamental advantage.
Pingali: And just to give you a few numbers, one of the companies we’re working with has a graph with more than four trillion edges. That’s how big their graph is, and obviously, they want to process it instantaneously: faster, please! They don’t want to wait for insights from a graph of that size. And they also want to be able to ingest new data faster, which is equally important. We are not talking about a static graph but a graph that updates as transactions are happening. These activities happen in real time, so you need to ingest that new data and then update your graph in real time. That’s another problem that we’re addressing. There might be a billion events every 15 minutes that need to be ingested and applied to the graph. That gives you an idea of the problem’s scale and why having scale-out solutions for graph analytics is so important.
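A toy sketch of the real-time ingestion problem Pingali raises (the event stream and fields are illustrative, not Katana’s implementation): edges arrive as a stream and the in-memory graph is updated incrementally rather than rebuilt.

```python
# Incremental graph updates from an event stream. A production engine would
# shard this across hosts and batch updates; the shape of the work is the same.
from collections import defaultdict

adjacency = defaultdict(set)
edge_count = 0

def ingest(event):
    """Apply one (src, dst) transaction edge to the live graph."""
    global edge_count
    src, dst = event
    if dst not in adjacency[src]:
        adjacency[src].add(dst)
        edge_count += 1

stream = [("a", "b"), ("b", "c"), ("a", "b"), ("c", "a")]  # duplicates arrive too
for event in stream:
    ingest(event)

print(edge_count, "unique edges after ingesting", len(stream), "events")  # 3 ... 4
```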
RTInsights: How do you accomplish the scale-up capability?
Rossbach: A lot of that is through great, well-researched, clever algorithms. Careful partitioning of graphs makes it possible to distribute them over more and more computers without being overwhelmed by the communication overhead. Intuitively, I can see why people think that doing distributed graph computation is difficult. If you partition a large graph, you put it on lots of different hosts to compute over it, and fundamentally those computers are going to have to talk to each other.
We’ve had great success in the past doing parallel distributed processing of big data with legacy engines like MapReduce and Spark because partitioning the data is very straightforward for traditional dense or relational data. With graph algorithms, that is not so much the case: when you’re traversing edges and following paths in the graph, it’s much more difficult to predict when you’re going to need a piece of data that is in some other partition. And how are you going to get a piece of data that’s in some other partition? You can imagine that it is quite bad for performance if done inefficiently.
In addition to the algorithmic innovations that Keshav’s group has developed, there’s also key research into how you partition graphs in a way that maximizes efficiency and minimizes communication.
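To illustrate why partitioning quality matters (the hosts, vertices, and hash placement below are invented for illustration), this sketch places vertices on hosts and counts “cut” edges, i.e., edges whose endpoints land on different hosts. Every cut edge implies cross-host communication during a traversal, which is exactly what a good partitioner minimizes.

```python
# Partition vertices over K hosts and measure the edge cut: the number of
# edges whose endpoints live on different hosts.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "a")]
K = 2  # number of hosts

def host_of(vertex: str) -> int:
    # Deterministic toy hash; real systems use graph-aware partitioners
    # to minimize the cut instead of placing vertices blindly.
    return sum(map(ord, vertex)) % K

naive_cut = sum(1 for u, v in edges if host_of(u) != host_of(v))

# A placement that keeps the tightly connected triangle {a, b, c} together.
better = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
better_cut = sum(1 for u, v in edges if better[u] != better[v])

print(f"naive placement cuts {naive_cut} of {len(edges)} edges")   # 4 of 6
print(f"better placement cuts {better_cut} of {len(edges)} edges")  # 2 of 6
```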
That’s at the top layer. If you start following the technology stack downwards, what you see in Katana is a tiered design that preserves, all the way down through the stack, the performance benefits derived from clever partitioning and minimized communication.
Some of our customers have graphs with trillions of edges. We store those in a way that makes it possible for us to quickly get that data into the graph engine’s memory. That helps us deliver performance. Doing that involves storage layer design, runtime design, and so on, all the way up the stack.
RTInsights: What has your relationship with Intel allowed you to do?
Rossbach: Speed is the answer. By collaborating with Intel, we can take advantage of emerging CPU features and exploit them for performance. Knowing how to optimize Katana’s software runtime every time Intel comes out with a new performance-focused solution, and being able to take advantage of it quickly and effectively, is an advantage. But I think there are also big advantages when it comes to the panoply of emerging hardware that we’re seeing come out of Intel.
Some of that boils down to already-shipping, more mature technologies, like Intel Optane Persistent Memory. Persistent memory has a compelling advantage. I was talking about how we operate over partitioned graphs to deal with scale. Well, if you’re storing things in persistent memory, it’s already in memory. That’s one advantage.
There’s also the fact that persistent memory is often used as a lower tier in the memory hierarchy to give the abstraction of much, much higher DRAM capacity than is physically available. And guess what? The more DRAM you have, the larger your host partitions can be. That translates directly to less communication when you’re doing compute over irregular data. So, I would say graph computing is one of the domains where Optane has very compelling performance.
Intel also is coming out with some very beefy and compelling GPUs. We’ve invested a lot of effort in figuring out how to do graph algorithms well on GPUs. Collaborating with Intel is bringing an advantage there.
And, of course, there are many efforts at Intel with other forms of hardware acceleration. Things like FPGAs are clearly something that most people know about. Then there are graph accelerators. That’s a research area that both Keshav and I have worked in quite a bit. We’ve explored both hardware acceleration algorithms and runtime support for them.
Pingali: Intel, like every other company in the tech space, realizes the importance of AI and machine learning. Everybody’s trying to use AI and machine learning applications to drive the hardware that they build and to rebrand themselves as AI companies.

We’re working with Intel to understand how to redesign some of their hardware for AI and machine learning applications. That complements what Chris said about using their current hardware offerings like Optane. We’re also working with them to see where they might go next.
And then, from our perspective, they have a huge customer base, and most of those customers need graphs. They approach Intel and ask: “Can you optimize this graph application for us?” What this partnership allows us to do is to step in at that point and say, “Oh, by the way, here’s what we can do for graphs on Intel hardware.”
RTInsights: So, it’s almost as though the rising tide will lift all the boats.
Pingali: That’s right. We see this as a win-win for everybody. People are very familiar with relational databases and SQL, but they are not as familiar with graph databases and graph analytics. The entire area is a wonderful opportunity for startups like us because we know what to do. Having Intel as a partner helps us with this missionary activity, so to speak, of proselytizing graphs and making converts of everybody!
Elizabeth Wallace is a Nashville-based freelance writer with a soft spot for data science and AI and a background in linguistics. She spent 13 years teaching language in higher ed and now helps startups and other organizations explain, clearly, what it is they do.