Trillion Edge Knowledge Graph
In our paper, we present the first demonstration of a massive knowledge graph that consists of materialized and virtual graphs spanning multiple hybrid clouds. We show that it is possible to have a one trillion-edge knowledge graph with sub-second query times without storing all the data in a single centralized location. This capability can usher in a new era in which the knowledge graph is a powerful component of company profitability and competitive advantage.
Our 1 trillion-edge graph was distributed over three systems: Stardog, Amazon Redshift in AWS, and SQL Server in Azure. Stardog handles the execution of SPARQL graph queries by reaching out to Redshift and SQL Server as needed and hiding the complexity of data distribution from end users. We used the Berlin SPARQL Benchmark, which is commonly used for measuring the performance of systems that support SPARQL query answering.
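To make the idea of hiding data distribution concrete, here is a minimal Python sketch of a federating layer that fans a triple pattern out to several sources and merges the answers. It is a toy illustration only, not Stardog's API or query planner: the names FederatedGraph, make_source, and match, the source labels, and the sample triples are all hypothetical, and in-memory lists stand in for the materialized graph and the remote Redshift and SQL Server sources.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]
# A pattern is (subject, predicate, object); None acts as a wildcard.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]
Source = Callable[[Pattern], Iterable[Triple]]


def make_source(triples: List[Triple]) -> Source:
    """Wrap an in-memory list as a stand-in for a local or remote store."""
    def match(pattern: Pattern) -> Iterable[Triple]:
        for triple in triples:
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                yield triple
    return match


class FederatedGraph:
    """Hypothetical federating layer: one query interface, many sources."""

    def __init__(self) -> None:
        self._sources: Dict[str, Source] = {}

    def register(self, name: str, source: Source) -> None:
        self._sources[name] = source

    def match(self, pattern: Pattern) -> List[Tuple[str, Triple]]:
        # The caller never names a source; every registered source is asked
        # and the answers are merged, tagged with where they came from.
        results: List[Tuple[str, Triple]] = []
        for name, source in self._sources.items():
            for triple in source(pattern):
                results.append((name, triple))
        return results


# Made-up sample data, one list per (stand-in) system.
graph = FederatedGraph()
graph.register("materialized", make_source([("product1", "label", "Widget")]))
graph.register("redshift_stand_in", make_source([("offer1", "product", "product1")]))
graph.register("sqlserver_stand_in", make_source([("review1", "reviewFor", "product1")]))

# One pattern, answered from two different sources without the caller knowing.
hits = graph.match((None, None, "product1"))
for source_name, triple in hits:
    print(source_name, triple)
```

A real system would translate graph patterns into SQL for each virtual source and push work down to it rather than scanning everything, but the user-facing shape is the same: one query, with the routing handled behind the interface.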
Conventional relational database management systems worked acceptably well when the enterprise data landscape was, itself, predominantly structured. But the world has changed. The enterprise data landscape is increasingly voluminous, varied, and changing. The emergence of IoT, the rise in unstructured data volume, the increasing relevance of external data, and the trend towards hybrid multi-cloud environments are challenges that must be overcome with each new request for data.
Enterprise data fabrics offer a new path forward. A data fabric weaves together data from internal silos and external sources, creates a seamless network of information, and supports the full gamut of the connected enterprise. Rather than tackle the data fabric with another abstraction layer, it makes more sense to leverage a database technology that was engineered for data relationships: a graph database.
A knowledge graph is a type of data integration platform that combines graph databases, data virtualization, query federation, and semantic inference capabilities to connect diverse forms of knowledge. It is also the key ingredient in transforming existing data infrastructure into a data fabric.
Physically consolidating data for a knowledge graph can be prohibitive. Yet consolidation is necessary when you cannot query data where it already lives in the architecture.
In this report, we have shown that Stardog allows users to query trillion-edge graphs distributed over multiple hybrid clouds. The Stardog server used to conduct this benchmark costs $6.60/hour to operate. Even when the cost of Redshift and SQL Server is taken into account (an additional $2/hour), our distributed setup has operational costs an order of magnitude lower than the AnzoGraph setup. That comparison, of course, underestimates the ongoing difference in operational cost between a single server and a 200-node cluster once devops and other support personnel are considered.
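The hourly figures above can be combined with a few lines of arithmetic. The sketch below uses only the two rates quoted in this report; the monthly projection assumes continuous operation over a 30-day month, which is our illustrative assumption and not a figure from the benchmark.

```python
from decimal import Decimal

# Hourly rates quoted in the report (USD).
STARDOG_SERVER_PER_HOUR = Decimal("6.60")
REMOTE_SOURCES_PER_HOUR = Decimal("2.00")  # Redshift and SQL Server combined

# Assumption for illustration only: the setup runs 24 hours a day for 30 days.
HOURS_PER_MONTH = 24 * 30

total_per_hour = STARDOG_SERVER_PER_HOUR + REMOTE_SOURCES_PER_HOUR
total_per_month = total_per_hour * HOURS_PER_MONTH

print(f"Distributed setup: ${total_per_hour}/hour")
print(f"Projected 30-day cost: ${total_per_month}")
```

Decimal is used instead of float so that the currency values add up exactly.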
The combination of materialization and virtualization capabilities gives companies the option to store data in Stardog when needed while leaving other data in the data store of their choice, to be queried on demand. Average query execution times below one second show that performance at this scale is in line with fully materialized enterprise queries, as is the cost.
Stardog has made data location irrelevant to knowledge graphs. It has finally given companies the ability to build high-volume, realistic enterprise knowledge graphs, which can accelerate knowledge discovery, without limitation, across the wide range of assets or processes a company wants to optimize. Get the full details in our paper.